Rebecca Kleinberger MAS862.13 Project
Questions
a. What is the goal?
To show the value of accounting for the physical and biological aspects of vocal production when digitally transmitting vocal information
b. To accomplish this, what question will you answer?
What anatomy and mechanisms are involved in voice production?
What are the difficulties in studying articulators that cannot be accessed or measured easily?
How can they be modeled and what are their (slowly varying) parameters?
What is the theoretical computational cost saving for lossless reconstruction of a voice signal when using a biomechanically informed codec?
c. What technique(s) will you use to answer them?
Matlab coding
- signal processing
- physical modeling
- entropy computation
d. What is the prior art?
e. How will you evaluate the results?
Organisation
I - Intro: The physics of vocal production
1) Anatomy
2) Mechanisms
3) Filter-Source model
II - From signal to physics
1) Air flow : envelope extraction
2) Glottal signal : estimation of F0
3) Filtration by the cavities
4) Examples
III - From physics to signal
1) Vocal tract shape
2) Scattering equation
3) Glottal signal model
4) Results
IV - Information theory point of view
1) Entropy in speech in the audio signal paradigm
2) State of the art of low bit rate coding
3) Encoding of the physical model
I - Introduction: The physics of vocal production
I-1) Anatomy
A word on evolution: laryngeal descent -> origin of speech (in the evolution of the species and also in child development)
Voice recognition <-> face recognition (3)
Several physiological factors distinguish the human voice from other auditory signals
- Loudness in the range of 55 to 80 dB
- Fundamental frequency from 85 to 180 Hz for an adult male and from 165 to 300 Hz for an adult female
- The spectral content depends on the constrictions of the vocal tract and is thus limited by its shape
Vocoder: Homer Dudley (Bell Labs, 1935) divided the voice signal into 12 frequency bands between 400 Hz and 3400 Hz and saved 90% of the bandwidth
I-2) Mechanisms
from (4)
We can consider that the voice production results from three phenomena
The air flow
- comes from the diaphragm contraction
- the energy that enables self sustained vibration of the vocal cords
- envelope of the sound signal
The vocal cords vibrations
- self sustained by air flow
- pitch F0
from (5)
The vocal tract shape
- filters the glottal signal by damping or increasing certain frequencies
vowels A and E
formant pattern
I-3) Filter-Source model
models from (6)
II - From signal to physics :
Biomechanically meaningful, slowly varying parameters
Inverse problem
Learning from the voice about the voice production
Learning from the voice about the voice itself
Matlab Code
II - 1) Air flow : envelope extraction
Using a detection function
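The notes do not say which detection function the Matlab code used. As a minimal Python sketch of one common choice, the example below rectifies the signal and smooths it with a moving average. The function name `amplitude_envelope`, the 20 ms window, and the test tone are assumptions made for illustration.

```python
import numpy as np

def amplitude_envelope(signal, sample_rate, window_s=0.02):
    """Rectify the signal, then smooth it with a moving average (assumed detection function)."""
    window = max(1, int(round(window_s * sample_rate)))
    kernel = np.ones(window) / window
    return np.convolve(np.abs(signal), kernel, mode="same")

# Hypothetical test signal: a 200 Hz tone whose loudness ramps up linearly over 1 s.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = t * np.sin(2 * np.pi * 200 * t)
envelope = amplitude_envelope(tone, sample_rate)
```

The resulting `envelope` follows the linear ramp (scaled by the mean of a rectified sine, about 0.64), which is the slowly varying quantity attributed to the air flow in the outline.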
II - 2) Glottal signal : estimation of F0
Estimation of F0 (glottal source frequency), Method 1: cepstrum analysis
F0=275.453Hz
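A rough Python sketch of cepstrum-based F0 estimation, not the project's Matlab code: the log magnitude spectrum of a voiced frame is periodic in frequency with period F0, so its inverse transform peaks at the quefrency 1/F0. The search range (70 to 400 Hz), the Hann window, and the synthetic impulse-train frame are assumptions.

```python
import numpy as np

def f0_from_cepstrum(frame, sample_rate, fmin=70.0, fmax=400.0):
    """Estimate F0 from the largest real-cepstrum peak between 1/fmax and 1/fmin."""
    windowed = frame * np.hanning(len(frame))
    log_spectrum = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)  # small offset avoids log(0)
    cepstrum = np.fft.irfft(log_spectrum)
    q_min = int(sample_rate / fmax)   # shortest period searched, in samples
    q_max = int(sample_rate / fmin)   # longest period searched, in samples
    peak = q_min + int(np.argmax(cepstrum[q_min:q_max + 1]))
    return sample_rate / peak

# Hypothetical frame: an impulse train with a period of 80 samples (200 Hz at 16 kHz).
sample_rate = 16000
frame = np.zeros(2048)
frame[::80] = 1.0
f0 = f0_from_cepstrum(frame, sample_rate)
```

The frequency resolution is limited to integer sample periods, so real recordings would normally need peak interpolation to report a value with the precision shown above.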
Estimation of F0 (glottal source frequency), Method 2: autocorrelation function
rmax=0.87549 Fx=270.221Hz
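One way to obtain a pair like (rmax, Fx) is sketched below in Python, assuming that rmax is the height of the normalized autocorrelation peak and Fx the frequency of the corresponding lag. That reading, the search range, and the two-harmonic test frame are assumptions, not taken from the original Matlab code.

```python
import numpy as np

def f0_from_autocorrelation(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Return (peak height, frequency) of the normalized autocorrelation in the pitch range."""
    centered = frame - np.mean(frame)
    corr = np.correlate(centered, centered, mode="full")[len(centered) - 1:]
    corr = corr / corr[0]                      # lag 0 is normalized to 1
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max + 1]))
    return corr[lag], sample_rate / lag

# Hypothetical frame: a 200 Hz tone with one overtone, 1024 samples at 16 kHz.
sample_rate = 16000
t = np.arange(1024) / sample_rate
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
r_max, f0 = f0_from_autocorrelation(frame, sample_rate)
```

Because this estimator is not corrected for the shrinking overlap at larger lags, the peak height stays below 1 even for a perfectly periodic frame, which also favors the first period over its multiples.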
F0 against time
II - 3) Filtration by the cavities
Formant pattern
Formant pattern over time
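The notes do not state how the formant pattern was computed. A common approach, sketched here in Python as an assumption, is linear prediction: fit an all-pole filter to a frame and read the formant candidates from the angles of its complex poles. The function names, the LPC order, and the synthetic two-resonance signal are all hypothetical.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC: returns [1, a1, ..., ap] of the prediction-error filter."""
    windowed = frame * np.hamming(len(frame))
    corr = np.correlate(windowed, windowed, mode="full")[len(windowed) - 1:]
    # Toeplitz normal equations built from the first `order` autocorrelation lags.
    lags = np.abs(np.arange(order)[:, None] - np.arange(order)[None, :])
    predictor = np.linalg.solve(corr[lags], corr[1:order + 1])
    return np.concatenate(([1.0], -predictor))

def formant_frequencies(frame, sample_rate, order):
    """Formant candidates in Hz: angles of the complex LPC poles, sorted."""
    poles = np.roots(lpc_coefficients(frame, order))
    poles = poles[np.imag(poles) > 0]           # keep one pole per conjugate pair
    return np.sort(np.angle(poles) * sample_rate / (2 * np.pi))

# Hypothetical check: white noise shaped by resonances at 700 Hz and 1800 Hz.
sample_rate = 8000
true_formants = np.array([700.0, 1800.0])
radius = np.exp(-np.pi * 80.0 / sample_rate)    # 80 Hz bandwidth, assumed
angles = 2 * np.pi * true_formants / sample_rate
pole_pairs = np.concatenate((radius * np.exp(1j * angles), radius * np.exp(-1j * angles)))
denominator = np.real(np.poly(pole_pairs))      # all-pole filter coefficients
noise = np.random.default_rng(0).standard_normal(4096)
shaped = np.zeros_like(noise)
for n in range(len(noise)):                     # direct-form recursive filtering
    shaped[n] = noise[n]
    for k in range(1, len(denominator)):
        if n - k >= 0:
            shaped[n] -= denominator[k] * shaped[n - k]
estimated = formant_frequencies(shaped, sample_rate, order=4)
```

Running the same analysis on successive short frames would give a formant track over time; on real speech the order is usually chosen larger, and poles with wide bandwidths are discarded.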
II - 4) Examples
A_E_E sound
O crescendo sound
Several vowels pronounced quickly
III - From physics to signal
Forward problem
Waveguide modeling
Shape of the guide and scattering equation (reflections)
Model of the input signal
Matlab Code
III - 1) Vocal tract shape
Discretisation of the different cavities from (7)
from (8)
Based on MRI measurements (data from (9))
III - 2) Scattering equation
Scattering
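As a hedged illustration of scattering in a discretized tube, the Python sketch below uses the standard Kelly-Lochbaum junction for pressure waves: at the boundary between two sections of areas A_i and A_(i+1), the reflection coefficient is (A_i - A_(i+1)) / (A_i + A_(i+1)). Whether the project used this exact sign convention is an assumption, and the area values are invented for the example.

```python
import numpy as np

def reflection_coefficients(areas):
    """Pressure-wave reflection coefficient at each junction between adjacent sections."""
    areas = np.asarray(areas, dtype=float)
    return (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])

def scatter(forward_in, backward_in, k):
    """One Kelly-Lochbaum junction: waves arriving from both sides are partly reflected."""
    forward_out = (1 + k) * forward_in - k * backward_in    # continues toward the lips
    backward_out = k * forward_in + (1 - k) * backward_in   # returns toward the glottis
    return forward_out, backward_out

# Hypothetical area function (cm^2), glottis to lips.
areas = [1.0, 1.0, 4.0, 2.0]
k = reflection_coefficients(areas)
forward_out, backward_out = scatter(1.0, 0.0, k[1])
```

Equal neighboring areas give a coefficient of zero, so the wave passes unchanged. A widening tube gives a negative coefficient and a narrowing tube a positive one.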
III - 3) Glottal signal model
Difficulties in studying articulators that cannot be accessed or measured easily
Two mass model
Different inverse filtering glottal flow models from (10)
- The Rosenberg trigonometric source model
- The LF model with 5 parameters
- Model based on High-speed imaging of the vocal folds with synchronous audio recordings (Yen-Liang Shue)
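Of the models listed, the Rosenberg trigonometric pulse is the simplest to write down. The Python sketch below generates one period with a raised-cosine opening phase and a quarter-cosine closing phase, then repeats it at the pitch period. The opening and closing fractions (0.4 and 0.16 of the period) and the function names are illustrative assumptions, not values from the project.

```python
import numpy as np

def rosenberg_pulse(n_samples, open_fraction=0.4, close_fraction=0.16):
    """One period of a Rosenberg-type trigonometric glottal flow pulse, peak value 1."""
    t = np.arange(n_samples) / n_samples
    tp, tn = open_fraction, close_fraction
    pulse = np.zeros(n_samples)
    opening = t <= tp
    closing = (t > tp) & (t <= tp + tn)
    pulse[opening] = 0.5 * (1 - np.cos(np.pi * t[opening] / tp))
    pulse[closing] = np.cos(np.pi * (t[closing] - tp) / (2 * tn))
    return pulse                                 # the glottis stays closed for the rest

def glottal_source(f0, sample_rate, duration_s):
    """Repeat the pulse at the pitch period to build a voiced excitation signal."""
    period = int(round(sample_rate / f0))
    n_total = int(round(duration_s * sample_rate))
    n_periods = int(np.ceil(n_total / period))
    return np.tile(rosenberg_pulse(period), n_periods)[:n_total]

source = glottal_source(f0=200.0, sample_rate=16000, duration_s=0.1)
```

Such a signal could serve as the input of the waveguide, with F0 and the two phase fractions as slowly varying parameters.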
III - 4) Results
A sound
I sound
U sound
E sound
IV - Information theory point of view
IV - 1) Entropy in speech in the audio signal paradigm
Shannon entropy provides an absolute limit on the best possible lossless encoding or compression of any communication, assuming that the communication may be represented as a sequence of independent and identically distributed random variables
This gives the theoretical minimum number of bits per audio sample
An Introduction to Information Theory: Symbols, Signals and Noise By John Robinson Pierce Chapter VII Efficient coding
- Continuous signal fidelity criterion -> 128 values (hyperquantization)
- Efficiency is not everything: a vocoder can transmit only one voice -> waveform decoding requires 15,000 bit/s
- Pulse Code Modulation: 30,000 to 60,000 bit/s
- Vocoder: 2,400 bit/s
- Linear predictive machines give very good speech at 9,600 bit/s, intelligible speech at 2,400 bit/s, barely intelligible speech at 600 bit/s.
Matlab code
1D entropy
- French literature raw file (Proust): entropy = 7.40137 bit/sample
- English talk: entropy = 8.43616 bit/sample
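A minimal Python sketch of a first-order (1D) entropy estimate, assuming it is computed from the histogram of sample values treated as independent symbols. It is not the original Matlab code, and the function name is hypothetical.

```python
import numpy as np

def entropy_bits_per_sample(samples):
    """Shannon entropy of the empirical distribution of sample values, in bit/sample."""
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# Four equally likely values carry 2 bits each; a constant signal carries none.
uniform_entropy = entropy_bits_per_sample(np.tile([0, 1, 2, 3], 100))
constant_entropy = entropy_bits_per_sample(np.zeros(100, dtype=int))
```

Applied to the quantized samples of a recording, this yields a figure in bit/sample of the kind listed above. It ignores any dependence between successive samples.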
2D entropy
Theoretical minimum of 7 × 44100 = 308,700 bit/s
IV - 2) State of the art of low bit rate coding
For comparison, a raw audio file (wav): 16 bits/sample and generally 44100 Hz
Low bit rate coding = compressing according to perceptual acoustic characteristics (lookup tables and FFT) and removing the frequencies that humans cannot hear = 128 kbit/s
We can translate everything into bits per second
- Raw audio = 705,600 bit/s
- mp3 = 128,000 bit/s
- very low bit rate coding (11): down to 64,000 bit/s
- CT-aacPlus = 48,000 bit/s (12)
According to Karlheinz Brandenburg (Ilmenau Technical University & Fraunhofer IIS Arbeitsgruppe Elektronische Medientechnologie Ilmenau, Germany): "Current work on audio compression concentrates more on flexibility as needed for Internet multimedia or new multichannel applications than on improving on coding efficiency."
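The bit rates quoted in this section can be cross-checked with a few lines of Python. The raw figure of 705,600 bit/s corresponds to a single channel at 16 bits/sample and 44100 Hz; the dictionary keys and the helper name are labels chosen for this example.

```python
def pcm_bit_rate(bits_per_sample, sample_rate_hz, channels=1):
    """Bit rate of uncompressed PCM audio in bit/s."""
    return bits_per_sample * sample_rate_hz * channels

raw = pcm_bit_rate(16, 44100)   # one channel, as implied by the 705,600 bit/s figure
coded_rates = {"mp3": 128000, "very_low_bit_rate": 64000, "aac_plus": 48000}
ratios = {name: raw / rate for name, rate in coded_rates.items()}
```

The resulting compression ratios relative to raw mono audio are about 5.5, 11.0 and 14.7.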
IV - 3) Bits per second in physical modeling
Under the hypothesis of a perfect physical model
Number of parameters (mechanical, slowly varying)
- 44 for vocal tract
- Tension of glottis = F0
- Breathiness
- Lip motion and lip area: 2 parameters
Variation rate ~20 Hz
1,100 bit/s