Rebecca Kleinberger MAS862.13 Project

Questions

a. What is the goal?

Show why it is worthwhile to take the physical and biological aspects of vocal production into account when transmitting vocal information digitally

b. To accomplish this, what question will you answer?

What are the anatomy and mechanisms involved in voice production?
What are the difficulties in studying articulators that cannot be accessed or measured easily?
How can they be modeled, and what are their (slowly varying) parameters?
What is, theoretically, the computational cost saving for lossless reconstruction of a voice signal when using a biomechanically informed codec?

c. What technique(s) will you use to answer them?

Matlab coding
         - signal processing
         - physical modeling
         - entropy computation

d. What is the prior art?

literature review

e. How will you evaluate the results?

Organisation

I - Intro: The physics of vocal production

     1) Anatomy
     2) Mechanisms
     3) Filter-Source model

II - From signal to physics

     1) Air flow : envelope extraction
     2) Glottal signal : estimation of F0
     3) Filtration by the cavities
     4) Examples

III - From physics to signal

     1) Vocal tract shape
     2) Scattering equation
     3) Glottal signal model
     4) Results

IV - Information theory point of view

     1) Entropy in speech in the audio signal paradigm
     2) State of the art of low bit rate coding
     3) Encoding of the physical model

I - Introduction: The physics of vocal production

I-1) Anatomy

from (1) and (2)

A word on evolution: laryngeal descent -> origin of speech (in the evolution of the species and also in the development of the human child)

Voice recognition <-> face recognition (3)

Several physiological factors shape the human voice and set it apart from other auditory signals
        - Loudness in the range of 55 to 80 dB
        - Fundamental frequency from 85 to 180 Hz for an adult male and from 165 to 300 Hz for an adult female
        - The frequency content depends on the contraction of the vocal tract and is thus limited by its shape

Vocoder: Homer Dudley (Bell Labs, 1935) divided the voice signal into 12 frequency bands between 400 Hz and 3400 Hz, saving 90% of the bandwidth

I-2) Mechanisms

from (4)

We can consider that the voice production results from three phenomena

The air flow
         - comes from the diaphragm contraction
         - provides the energy that enables the self-sustained vibration of the vocal cords
         - sets the envelope of the sound signal

The vocal cord vibrations
         - self-sustained by the air flow
         - set the pitch F0

from (5)

The vocal tract shape
         - filters the glottal signal by damping or increasing certain frequencies

vowels A and E

Formant pattern

I-3) Filter-Source model

models from (6)
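The three phenomena of I-2 (air flow, vocal cord vibration, vocal tract filtering) can be illustrated with a minimal source-filter sketch in Python. It is not one of the models from (6): the impulse-train source, the two-pole resonators and the formant values used in the example are all assumptions chosen for illustration.

```python
import math

def impulse_train(f0_hz, fs_hz, n_samples):
    """Crude glottal source: one unit impulse per pitch period."""
    period = int(round(fs_hz / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(x, center_hz, bandwidth_hz, fs_hz):
    """Two-pole resonator y[n] = x[n] + a1*y[n-1] + a2*y[n-2] (one formant)."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)   # pole radius, < 1 so stable
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / fs_hz)
    a2 = -r * r
    y, y1, y2 = [], 0.0, 0.0
    for v in x:
        out = v + a1 * y1 + a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y

def synthesize(f0_hz, formants, fs_hz=8000, n_samples=800):
    """Source (impulse train at F0) filtered by a cascade of formant resonators."""
    signal = impulse_train(f0_hz, fs_hz, n_samples)
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, fs_hz)
    return signal

# Hypothetical formant values (center frequency, bandwidth) in Hz, illustration only.
voiced = synthesize(100.0, [(700.0, 100.0), (1200.0, 100.0)])
```

Changing F0 changes only the source, and changing the formant list changes only the filter, which is the separation the model relies on.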

II - From signal to physics :

Biomechanically meaningful, slowly varying parameters

Inverse problem

Learning from the voice about the voice production

Learning from the voice about the voice itself

Matlab Code

II - 1) Air flow : envelope extraction

Using a detection function
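One possible detection function is a rectifier followed by a low-pass filter. The Python sketch below assumes this choice, with a one-pole smoother and a hypothetical 20 Hz cutoff; the project's Matlab code may use a different detector.

```python
import math

def envelope(x, fs_hz, cutoff_hz=20.0):
    """Envelope follower: full-wave rectification, then one-pole low-pass.

    cutoff_hz is an assumed smoothing bandwidth, not a value from the project.
    """
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs_hz)
    env, state = [], 0.0
    for v in x:
        state += alpha * (abs(v) - state)   # move toward the rectified sample
        env.append(state)
    return env

# Synthetic check signal: one second of a 200 Hz tone sampled at 8 kHz.
fs = 8000
tone = [math.sin(2.0 * math.pi * 200.0 * n / fs) for n in range(fs)]
tone_env = envelope(tone, fs)
```

For a steady sine of amplitude 1 the output settles near 2/pi, the mean of the rectified sine, with a small residual ripple.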

II - 2) Glottal signal : estimation of F0

Estimation of the glottal source frequency F0, method 1: cepstral analysis

F0=275.453Hz

Estimation of the glottal source frequency F0, method 2: autocorrelation function

rmax=0.87549 Fx=270.221Hz
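The autocorrelation method can be sketched in Python as follows. This is a minimal sketch, not the project's Matlab code: the function name and the synthetic test tone are assumptions, and the default search band of 85 to 300 Hz is taken from the fundamental frequency range quoted in I-1.

```python
import math

def estimate_f0_autocorr(x, fs_hz, fmin_hz=85.0, fmax_hz=300.0):
    """Return (f0_hz, r_max): the pitch estimate and the normalized
    autocorrelation value at the best lag inside the search band."""
    mean = sum(x) / len(x)
    x = [v - mean for v in x]                 # remove any DC offset
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return None, 0.0                      # silent frame, no pitch
    lag_min = int(fs_hz / fmax_hz)            # shortest period considered
    lag_max = int(fs_hz / fmin_hz)            # longest period considered
    best_lag, best_r = lag_min, -1.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(x[n] * x[n + lag] for n in range(len(x) - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    return fs_hz / best_lag, best_r

# Synthetic frame: 100 ms of a 200 Hz tone sampled at 8 kHz.
fs = 8000
frame = [math.sin(2.0 * math.pi * 200.0 * n / fs) for n in range(800)]
f0, r_max = estimate_f0_autocorr(frame, fs)
```

The frequency resolution is limited by the integer lag, so two methods run on the same recording can disagree by a few Hz.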

F0 against time

II - 3) Filtration by the cavities

Formant pattern

Formant pattern over time

II - 4) Examples

A_E_E sound

O crescendo sound

Several vowels pronounced quickly

III - From physics to signal

Forward problem

Waveguide modeling

Shape of the guide and scattering equation (reflections)

Model of the input signal

Matlab Code

III - 1) Vocal tract shape

Discretisation of the different cavities from (7)

from (8)

Based on MRI measurements (data from (9))

III - 2) Scattering equation

Scattering
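A common way to write the scattering at the junction between two tube sections is the Kelly-Lochbaum form. The Python sketch below assumes that form, for pressure waves, with reflection coefficient k = (A1 - A2) / (A1 + A2); the sign convention and the area values are assumptions and may differ from the project's Matlab code.

```python
def reflection_coefficients(areas):
    """Pressure-wave reflection coefficient at each junction of a tube
    described by its cross-sectional areas, ordered from glottis to lips."""
    if any(a <= 0.0 for a in areas):
        raise ValueError("cross-sectional areas must be positive")
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]

def scatter(k, p_forward_in, p_backward_in):
    """One junction: returns (p_forward_out, p_backward_out).

    p_forward_in arrives from the glottis side, p_backward_in from the lip side.
    """
    p_forward_out = (1.0 + k) * p_forward_in - k * p_backward_in
    p_backward_out = k * p_forward_in + (1.0 - k) * p_backward_in
    return p_forward_out, p_backward_out

# Hypothetical three-section tube (areas in cm^2), illustration only.
ks = reflection_coefficients([2.0, 2.0, 1.0])
```

Equal neighboring areas give k = 0, so the waves pass through unchanged; a narrowing gives a positive k and a partial reflection back toward the glottis.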

III - 3) Glottal signal model

Difficulties in studying articulators that cannot be accessed or measured easily

Two mass model

Different inverse filtering glottal flow models from (10)

         - The Rosenberg trigonometric source model

         - The LF model with 5 parameters

         - Model based on High-speed imaging of the vocal folds with synchronous audio recordings (Yen-Liang Shue)
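As an illustration of the first model in this list, one period of a Rosenberg-type trigonometric pulse can be sketched in Python. The opening and closing fractions below are assumed illustrative defaults, not values from (10).

```python
import math

def rosenberg_pulse(n_period, open_fraction=0.4, close_fraction=0.16):
    """One pitch period of glottal flow, normalized to a peak of 1.

    Opening phase: half a raised cosine rising from 0 to 1.
    Closing phase: quarter cosine falling from 1 to 0.
    Closed phase:  zero flow for the rest of the period.
    """
    n_open = int(round(open_fraction * n_period))
    n_close = int(round(close_fraction * n_period))
    if n_open < 1 or n_close < 1 or n_open + n_close > n_period:
        raise ValueError("phase durations do not fit in one period")
    pulse = []
    for n in range(n_period):
        if n < n_open:
            g = 0.5 * (1.0 - math.cos(math.pi * n / n_open))
        elif n < n_open + n_close:
            g = math.cos(math.pi * (n - n_open) / (2.0 * n_close))
        else:
            g = 0.0
        pulse.append(g)
    return pulse

pulse = rosenberg_pulse(100)   # e.g. one period of an 80 Hz voice at 8 kHz
```

Repeating this pulse at the pitch period gives a smoother excitation than an impulse train, with only three slowly varying parameters (period, opening fraction, closing fraction).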

III - 4) Results

A sound

Play Result

I sound

Play Result

U sound

Play Result

E sound

Play Result

IV - Information theory point of view

IV - 1) Entropy in speech in the audio signal paradigm

Shannon entropy provides an absolute limit on the best possible lossless encoding or compression of any communication, assuming that the communication may be represented as a sequence of independent and identically distributed random variables

Gives us the theoretical minimum number of bits per audio sample
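Under the independence assumption stated above, this limit is H = -sum p(s) log2 p(s), taken over the distinct sample values s. A minimal Python sketch of that computation follows; the function name and the toy inputs are assumptions, and the project's own Matlab code is not reproduced here.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(symbols):
    """First-order Shannon entropy of a sequence, in bits per symbol.

    Probabilities are estimated from the histogram of the sequence itself.
    """
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy inputs: a fair binary sequence and a uniform 8-bit alphabet.
h_binary = entropy_bits_per_symbol([0, 1, 0, 1])
h_uniform = entropy_bits_per_symbol(list(range(256)))
```

Applied to the quantized samples of a recording, the result is a number of bits per sample; multiplying by the sampling rate gives a bit rate.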

An Introduction to Information Theory: Symbols, Signals and Noise, by John Robinson Pierce, Chapter VII, Efficient coding
         - Continuous signal fidelity criterion -> 128 values (hyperquantization)
         - Efficiency is not everything: a vocoder can transmit only one voice -> waveform decoding requires 15,000 bit/s
         - Pulse code modulation: 30,000 to 60,000 bit/s
         - Vocoder: 2,400 bit/s
         - Linear predictive machines give very good speech at 9,600 bit/s, intelligible speech at 2,400 bit/s, and barely intelligible speech at 600 bit/s

Matlab code

1D entropy
         - French literature raw file (Proust): entropy = 7.40137 bit/sample
         - English talk: entropy = 8.43616 bit/sample

2D entropy


Theoretical minimum of 7*44100 = 308,700 bit/s

IV - 2) State of the art of low bit rate coding

For comparison, a raw audio file (wav) uses 16 bits/sample, generally at 44100 Hz

Low bit rate coding = compression based on perceptual acoustic characteristics: lookup tables and FFT are used to remove the frequencies that humans cannot hear = 128 kbit/s

We can translate everything into bit/s
         - Raw audio = 705,600 bit/s
         - mp3 = 128,000 bit/s
         - very low bit rate coding (11) goes down to 64,000 bit/s
         - CT-aacPlus = 48,000 bit/s (12)
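The raw figure is simply bits per sample times sampling rate for one channel. The short Python sketch below redoes that arithmetic and expresses the other rates in this list as compression ratios against it; the variable names are illustrative and the ratios are derived here, not taken from (11) or (12).

```python
def pcm_bit_rate(bits_per_sample, sample_rate_hz, channels=1):
    """Bit rate of uncompressed PCM audio, in bit/s."""
    return bits_per_sample * sample_rate_hz * channels

raw_rate = pcm_bit_rate(16, 44100)            # 16 bits/sample, 44100 Hz, mono

# Rates from the list above, in bit/s.
coded_rates = {
    "mp3": 128000,
    "very_low_bit_rate": 64000,
    "aacplus": 48000,
}

# How many times smaller each coded stream is than raw audio.
compression_ratios = {name: raw_rate / rate for name, rate in coded_rates.items()}
```

With these numbers the ratios come out to roughly 5.5, 11 and 14.7.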

According to Karlheinz Brandenburg (Ilmenau Technical University & Fraunhofer IIS Arbeitsgruppe Elektronische Medientechnologie Ilmenau, Germany) "Current work on audio compression concentrates more on flexibility as needed for Internet multimedia or new multichannel applications than on improving on coding efficiency."

IV - 3) Bits/seconds in physical modeling

Under the hypothesis of a perfect physical model

Number of parameters (mechanical, slowly varying)
         - 44 for vocal tract
         - Tension of glottis = F0
         - Breathiness
         - Lip motion and lip area: 2 parameters

Variation rate ~20Hz

1,100 bit/s
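The arithmetic behind this comparison can be sketched in Python. The parameter counts, the 20 Hz variation rate and the 1,100 bit/s figure come from the list above; the number of bits per parameter is not stated there, so the sketch only reports the average that those figures imply, and it assumes one value each for glottis tension and breathiness. The raw audio rate of 705,600 bit/s is the one given in IV - 2.

```python
# Parameter counts as listed above (one value each for tension and breathiness
# is an assumption).
parameter_counts = {
    "vocal_tract": 44,
    "glottis_tension": 1,
    "breathiness": 1,
    "lips": 2,
}
update_rate_hz = 20          # variation rate of the parameters
stated_bit_rate = 1100       # bit/s, figure given in the text
raw_audio_bit_rate = 705600  # bit/s, 16 bits/sample at 44100 Hz

n_parameters = sum(parameter_counts.values())
updates_per_second = n_parameters * update_rate_hz

# Average number of bits per parameter update implied by the stated rate.
implied_bits_per_update = stated_bit_rate / updates_per_second

# Reduction relative to raw audio if the stated rate could be reached.
reduction_vs_raw = raw_audio_bit_rate / stated_bit_rate
```

On these figures there are 48 parameters and 960 parameter updates per second, so the stated rate corresponds to a little over 1 bit per update and to a reduction of roughly 640 times relative to raw audio.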