This is a long post. I don’t like long posts, but it’s research, so the length is justified.
In 2 days I’ll fly to Salt Lake City [UT] to join the 2016 ASA meeting and talk about our ongoing work on 2D vowel synthesis. The presentation will discuss the results from a set of experiments on articulatory vocal synthesis that I ran with Arvind Vasudevan, the student I am supervising at the HCT lab. The message I want to send is quite simple and a bit contentious: it’s time to move beyond the use of mere area functions for speech simulations.
Here is how we arrived at this idea.
We digitally re-enacted the famous experiment on vowel synthesis that Brad Story published in 1996, using the same measured area functions to simulate in real time the 3 corner vowels [‘a’, ‘i’, ‘u’]. We used our GPU-based air-propagation model to first calculate the frequency responses; then we coupled the system with a body-cover glottal model to actually synthesize the vowels as voice. This glottal model is again based on work by Story, which Arvind turned into a shader compatible with our propagation model [well done Arvind].
We compared our results with 3 other cases: first, with the real resonances measured by Story on the subject whose MRI scans were used to populate the area function dataset; second, with the results obtained by Story using his off-line 1D propagation model, also presented in his 1996 paper; finally, with the output of JASS, a real-time 1D model developed at UBC some years ago. Unlike the models in the last 2 cases, our propagation model solves pressure and velocity equations in 2D, and this is an example of what area functions look like when transformed into a 2D domain [or contour]:
This is an ‘a’. Beautiful, isn’t it? The 3 models behave quite similarly in terms of frequency response, with our 2D approach in general performing a little worse than the off-line model and a little better than JASS. Furthermore, the acoustic result of the full vocal synthesis using the glottal model is quite good. These are all fair results, but not extremely interesting. The real contribution of this work lies elsewhere.
Area functions have been adopted in articulatory vocal synthesis to run fast! In acoustics, 3D propagation models still take hours to calculate the impulse response of a 3D geometry; by contrast, 1D models can run in real time, but they need a simplified version of the geometry. And here comes the area function, which turns a shape into a series of cylinders with variable radii. However, below layers of flesh and tissue, the human vocal tract forms very weird shapes when we speak, with parallel lobes, forks and asymmetries. Have a look:
In a case like this one, which is the most harmless ‘a’, an area function approach [i.e., cutting the mesh into slices, saving the areas and building a series of cylinders out of them] produces a geometry whose acoustics generally resemble those of the original shape only up to 5 kHz. This is ok to more or less simulate the first 3-4 formants that characterize a vowel, but all the timbral content that defines the naturalness of voice is missing. Yeah, articulatory vocal synthesis is still quite tough!
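To make the "series of cylinders" idea concrete, here is a minimal sketch of the last step of that recipe: turning a list of slice areas into the radii of equivalent circular cylinders, using r = sqrt(A / pi). The function name and the example areas are made up for illustration; they are not Story's measured data, and this is not the code we run in our system.

```python
import math

def areas_to_radii(areas_cm2):
    """Return, for each slice area, the radius of the circular
    cylinder that has the same cross-sectional area."""
    return [math.sqrt(a / math.pi) for a in areas_cm2]

# Hypothetical slice areas in cm^2, ordered from glottis to lips
# (illustrative values only, not measured data).
example_areas = [0.5, 1.0, 2.0, 4.0]
radii = areas_to_radii(example_areas)

for area, radius in zip(example_areas, radii):
    print(f"area {area:.2f} cm^2 -> radius {radius:.3f} cm")
```

Note what gets lost here: every slice, whatever its actual outline, collapses into a single number. Lobes, forks and asymmetries simply disappear.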
Well, if we wanna make voice in real time and 1D models are the only ones that can do it, some may think that this is it! This is the only way to go: let’s cut down 3D shapes into 1D area functions and tweak the simulation to get the best results possible! Many valuable research lines follow this approach, but I always tend to think that a rigorous physical representation of area functions will never sound better than this [perfect real-time 3D propagation simulation!].
But FINALLY something has changed. Indeed, we’re now able to run a 2D simulation in real time, and this changes the scenario quite a bit! So, what are the differences? They are subtle, but they are there. Let’s start with something interesting. As shown in our experiment, we can easily use area functions in a 2D simulation, producing fairly good results. In the first part of our work we visualized area functions as symmetric contours and ran our tests. But nothing prevents us from going for an asymmetric representation of area functions…here is what I mean:
These are 2 different representations of the same area function, a ‘u’. On the left, the classic symmetric representation, as the cross section of a symmetric tube; on the right, the lower contour has been flattened out, while preserving all the distances between the pixels. Same vowel, same area function, but different 2D contours. They look different and they SOUND different. So what’s the point? The point here is that in a 2D simulation area functions are somewhat inconsistent. We can’t say that one representation sounds better than the other, partly because the question is: sounds better compared to what?! Compared to the 3D vocal tract this area function comes from, or compared to a set of cylinders that represent that area function?
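The two contours described above can be sketched in a few lines. This is only an illustration, assuming each section of the area function has already been converted into a 2D width [how to do that conversion is itself a modelling choice that I am not fixing here]. The function names and the widths are hypothetical.

```python
def symmetric_contour(widths):
    """Upper and lower walls mirrored around the tube axis (y = 0)."""
    upper = [w / 2.0 for w in widths]
    lower = [-w / 2.0 for w in widths]
    return upper, lower

def flattened_contour(widths):
    """Lower wall flattened to y = 0; the upper wall carries the
    whole width, so the wall-to-wall distance is unchanged."""
    upper = list(widths)
    lower = [0.0] * len(widths)
    return upper, lower

# Hypothetical widths along the tract, in arbitrary units
# (illustrative values only).
widths = [1.0, 2.0, 0.5]

sym_upper, sym_lower = symmetric_contour(widths)
flat_upper, flat_lower = flattened_contour(widths)

# Both contours keep the same distance between the walls at every section.
print([u - l for u, l in zip(sym_upper, sym_lower)])
print([u - l for u, l in zip(flat_upper, flat_lower)])
```

Both contours encode exactly the same list of widths, yet the walls the 2D wave actually hits are different, and so is the result.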
So, if we wanna pursue this 2D approach [and I wanna do it], we’d better come up with some new and clever ideas to describe a 3D vocal tract in 2 dimensions. The target is to condense as much information as possible, despite the loss of one dimension. A 2D representation allows for the direct inclusion of curvature and of a certain degree of the asymmetries found in the tract. But there are many other ideas to explore, including flattening lobes and parallel tracts found on different layers and creating custom asymmetries to simulate some of the effects of the third dimension. And just to be clear, the crisp contour you can find at the top of this post looks beautiful, but it will never sound good…it is a mid-sagittal slice that surely carries less info than an area function. It is just there as a reminder that we have to move towards something more detailed, more realistic, more creative.
So, who needs area functions?
Well, I do! All these new ideas still rely on area functions and are extensions that build on the new possibilities made available by the innovative 2D system we have been working on over the last year.
There’s lots of space for exploration here; it’s uncharted territory.
Absolutely more to come on this, more to come.