Rebecca Kleinberger MAS862.13 Project



Questions

a. What is the goal?

Show the value of taking the physical and biological aspects of vocal production into account when transmitting vocal information digitally

b. To accomplish this, what question will you answer?

What are the anatomy and mechanisms involved in voice production?
What are the difficulties in studying articulators that cannot be accessed or measured easily?
How can they be modeled, and what are their (slowly varying) parameters?
What is the theoretical computational cost saving for lossless reconstruction of a voice signal when using a biomechanically informed codec?

c. What technique(s) will you use to answer them?

Matlab coding
         - signal processing
         - physical modeling
         - entropy computation

d. What is the prior art?

literature review

e. How will you evaluate the results?



Organisation


I - Intro: The physics of vocal production

     1) Anatomy
     2) Mechanisms
     3) Filter-Source model

II - From signal to physics

     1) Air flow : envelope extraction
     2) Glottal signal : estimation of F0
     3) Filtration by the cavities
     4) Examples

III - From physics to signal

     1) Vocal tract shape
     2) Scattering equation
     3) Glottal signal model
     4) Results

IV - Information theory point of view

     1) Entropy in speech in the audio signal paradigm
     2) State of the art of low bit rate coding
     3) Encoding of the physical model







I - Introduction: The physics of vocal production


I-1) Anatomy

[Figures from (1) and (2)]


  • A word on evolution: laryngeal descent -> origin of speech (in the evolution of the species and also in the development of the human child)

  • Voice recognition <-> face recognition (3)

  • Several physiological elements condition the human voice compared to other auditory signals
            - Loudness in the range of 55 to 80 dB
            - Fundamental frequency from 85 to 180 Hz for an adult male and from 165 to 300 Hz for an adult female
            - The frequency decomposition depends on the vocal tract contraction and is thus limited by its shape

  • Vocoder: Homer Dudley (Bell Labs, 1935) divided the voice signal into 12 frequency bands between 400 Hz and 3400 Hz and saved 90% of the bandwidth

I-2) Mechanisms

    [Figure from (4)]



    Voice production can be seen as the result of three phenomena:

  • The air flow
             - comes from the diaphragm contraction
             - provides the energy that enables the self-sustained vibration of the vocal cords
             - sets the envelope of the sound signal

  • The vocal cord vibrations
             - self-sustained by the air flow
             - set the pitch F0
    [Figure from (5)]


  • The vocal tract shape
             - filters the glottal signal by damping or boosting certain frequencies

    [Figure: vowels A and E]


    [Figure: formant pattern]

I-3) Filter-Source model

    [Figure: models from (6)]








II - From signal to physics


  • Biomechanically meaningful, slowly varying parameters

  • Inverse problem

  • Learning from the voice about the voice production

  • Learning from the voice about the voice itself

  • Matlab Code


II - 1) Air flow : envelope extraction

  • Using a detection function
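
One possible detection function, sketched here in Python rather than in the project's Matlab code: full-wave rectification followed by a moving average. The function name `envelope`, the 20 ms window and the synthetic test tone are assumptions made for illustration; the source does not say which detection function was actually used.

```python
import math

def envelope(signal, sample_rate, window_s=0.02):
    """Full-wave rectify the signal, then smooth it with a moving average."""
    win = max(1, int(round(window_s * sample_rate)))
    rectified = [abs(x) for x in signal]
    out = []
    running = 0.0
    for i, value in enumerate(rectified):
        running += value
        if i >= win:
            running -= rectified[i - win]  # drop the sample leaving the window
        out.append(running / min(i + 1, win))
    return out

# Synthetic test tone (assumption): 200 Hz sine whose amplitude ramps from 0 to 1 in one second.
fs = 8000
tone = [(n / fs) * math.sin(2 * math.pi * 200 * n / fs) for n in range(fs)]
env = envelope(tone, fs)
```

The window length trades smoothness against reactivity: a window covering a few pitch periods removes the ripple at F0 while still following the slow variation of the air flow.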


II - 2) Glottal signal : estimation of F0

  • Estimation of the glottal source frequency F0, method 1: cepstrum analysis

    F0 = 275.453 Hz
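
A minimal Python sketch of cepstrum-based F0 estimation, not the project's Matlab code. It takes the real cepstrum of one windowed frame and picks the highest peak in a plausible pitch range. The names `fft` and `cepstrum_f0`, the 80 to 400 Hz search range, the Hamming window and the synthetic frame are all assumptions.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def cepstrum_f0(frame, sample_rate, f_min=80.0, f_max=400.0):
    """Return the F0 (Hz) at the highest cepstral peak between f_min and f_max."""
    n = len(frame)
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    spectrum = fft([s * w for s, w in zip(frame, hamming)])
    log_mag = [math.log(abs(v) + 1e-12) for v in spectrum]
    # log_mag is real and symmetric, so a forward FFT divided by n equals its inverse FFT.
    cep = [v.real / n for v in fft(log_mag)]
    q_lo = int(sample_rate / f_max)
    q_hi = min(int(sample_rate / f_min), n // 2)
    q_peak = max(range(q_lo, q_hi + 1), key=lambda q: cep[q])
    return sample_rate / q_peak

# Synthetic voiced-like frame (assumption): 200 Hz impulse train smoothed by a one-pole filter.
fs = 8000
frame, y = [], 0.0
for i in range(1024):
    y = 0.9 * y + (1.0 if i % 40 == 0 else 0.0)
    frame.append(y)
f0 = cepstrum_f0(frame, fs)
```

The resolution is limited to integer quefrencies, so the estimate moves in steps of roughly F0 squared over the sampling rate unless the peak is interpolated.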


  • Estimation of the glottal source frequency F0, method 2: autocorrelation function

    rmax = 0.87549, Fx = 270.221 Hz
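
The autocorrelation method can be sketched as follows, again in Python and under assumptions: the function name `autocorr_f0`, the 80 to 400 Hz lag range and the synthetic frame are illustrative. The returned pair plays the role of the Fx and rmax values reported above, without claiming to reproduce how they were computed.

```python
import math

def autocorr_f0(frame, sample_rate, f_min=80.0, f_max=400.0):
    """Return (F0 in Hz, normalized autocorrelation peak) for one frame."""
    mean = sum(frame) / len(frame)
    x = [s - mean for s in frame]
    energy = sum(v * v for v in x)
    best_lag, best_r = None, -1.0
    # Only lags corresponding to a plausible pitch period are searched.
    for lag in range(int(sample_rate / f_max), int(sample_rate / f_min) + 1):
        r = sum(x[i] * x[i + lag] for i in range(len(x) - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    return sample_rate / best_lag, best_r

# Synthetic frame (assumption): 200 Hz fundamental plus its second harmonic.
fs = 8000
frame = [math.sin(2 * math.pi * 200 * n / fs) + 0.5 * math.sin(2 * math.pi * 400 * n / fs)
         for n in range(800)]
f0, r_max = autocorr_f0(frame, fs)
```

A peak value close to 1 indicates a strongly periodic (voiced) frame; low values suggest unvoiced speech or silence, where the F0 estimate should not be trusted.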


  • F0 against time





II - 3) Filtration by the cavities

  • Formant pattern




  • Formant pattern over time


II - 4) Examples

  • A_E_E sound




  • O crescendo sound





  • Several vowels pronounced quickly










III - From physics to signal

  • Forward problem

  • Waveguide modeling

  • Shape of the guide and scattering equation (reflections)

  • Model of the input signal

  • Matlab Code

III - 1) Vocal tract shape

  • Discretisation of the different cavities from (7)

    [Figure from (8)]


  • Based on MRI measurements (data from (9))
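
As an illustration of the discretisation, the sketch below computes how many equal-length tube sections a digital waveguide needs. It assumes the common convention that a wave crosses one section in half a sample period, so the section length is c / (2 fs). The sound speed (350 m/s), the tract length (17.5 cm) and the function names are illustrative assumptions, not values taken from the source.

```python
def section_length_m(sample_rate_hz, sound_speed_m_s=350.0):
    """Length of one tube section when a wave crosses it in half a sample period."""
    return sound_speed_m_s / (2.0 * sample_rate_hz)

def section_count(tract_length_m, sample_rate_hz, sound_speed_m_s=350.0):
    """Number of equal-length sections needed to cover the whole tract."""
    return round(tract_length_m / section_length_m(sample_rate_hz, sound_speed_m_s))

# Assumed values: a 17.5 cm vocal tract simulated at 44,100 Hz.
n_sections = section_count(0.175, 44100)
```

With these assumed values each section is about 4 mm long, so the area function measured by MRI has to be resampled on that grid before it can drive the waveguide.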


III - 2) Scattering equation

  • Scattering
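
A minimal sketch of the scattering at the junction between two tube sections, written for pressure waves in the Kelly-Lochbaum style. The sign convention (k computed from the section areas, forward wave travelling from the glottis towards the lips) and the function names are assumptions; other texts use the opposite sign or volume-velocity waves.

```python
def reflection_coefficients(areas):
    """Pressure reflection coefficient at each junction between adjacent sections."""
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]

def scatter(k, forward_in, backward_in):
    """One junction: returns (forward wave leaving, backward wave leaving)."""
    forward_out = (1 + k) * forward_in - k * backward_in
    backward_out = k * forward_in + (1 - k) * backward_in
    return forward_out, backward_out

# A narrowing from 3 cm^2 to 1 cm^2 (illustrative areas) reflects half of the incoming pressure wave.
k = reflection_coefficients([3.0, 1.0])[0]
f_out, b_out = scatter(k, 1.0, 0.0)
```

Both outputs satisfy pressure continuity at the junction: the sum of the two waves is the same on either side.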

III - 3) Glottal signal model

  • Difficulties in studying articulators that cannot be accessed or measured easily

  • Two mass model



  • Different inverse filtering glottal flow models from (10)

             - The Rosenberg trigonometric source model

             - The LF model with 5 parameters

             - Model based on high-speed imaging of the vocal folds with synchronous audio recordings (Yen-Liang Shue)
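
Of the three models, the Rosenberg trigonometric pulse is the simplest to write down. The Python sketch below uses a commonly quoted form of it: a raised-cosine opening phase followed by a quarter-cosine closing phase, then a closed phase at zero. The opening and closing fractions (0.4 and 0.16 of the period) and the function names are illustrative assumptions, not parameters from the source.

```python
import math

def rosenberg_pulse(period_samples, open_frac=0.4, close_frac=0.16):
    """One period of glottal flow: rising phase, falling phase, then closed glottis."""
    tp = open_frac * period_samples   # duration of the opening phase
    tn = close_frac * period_samples  # duration of the closing phase
    pulse = []
    for n in range(period_samples):
        if n <= tp:
            pulse.append(0.5 * (1.0 - math.cos(math.pi * n / tp)))
        elif n <= tp + tn:
            pulse.append(math.cos(math.pi * (n - tp) / (2.0 * tn)))
        else:
            pulse.append(0.0)
    return pulse

def glottal_train(f0_hz, sample_rate, n_periods):
    """Repeat the pulse to obtain a periodic source signal at roughly f0_hz."""
    period = int(round(sample_rate / f0_hz))
    return rosenberg_pulse(period) * n_periods

source = glottal_train(200, 8000, 3)
```

Such a train can serve as the input of a waveguide model of the vocal tract; shortening the closing phase makes the source brighter.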

III - 4) Results



  • A sound





  • I sound





  • U sound





  • E sound









IV - Information theory point of view


IV - 1) Entropy in speech in the audio signal paradigm

  • Shannon entropy provides an absolute limit on the best possible lossless encoding or compression of any communication, assuming that the communication may be represented as a sequence of independent and identically distributed random variables

  • It gives the minimal theoretical number of bits per audio sample

  • An Introduction to Information Theory: Symbols, Signals and Noise, by John Robinson Pierce, Chapter VII, Efficient coding
             - Continuous signal fidelity criterion -> 128 values (hyperquantization)
             - Efficiency is not everything: a vocoder can transmit only one voice -> waveform decoding requires 15,000 bit/s
             - Pulse Code Modulation: 30,000 to 60,000 bit/s
             - Vocoder: 2,400 bit/s
             - Linear predictive machines give very good speech at 9,600 bit/s, intelligible speech at 2,400 bit/s, barely intelligible speech at 600 bit/s.



  • Matlab code

  • 1D entropy
             - French literature raw file (Proust): entropy = 7.40137 bit/sample
             - English talk: entropy = 8.43616 bit/sample
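
A first-order ("1D") entropy of this kind can be computed from the histogram of sample values. The Python sketch below is an illustration, not the project's Matlab code; the function name `entropy_bits` is an assumption, and the input is any sequence of quantized sample values.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Four equally likely sample values carry exactly 2 bits per sample.
uniform_entropy = entropy_bits([0, 1, 2, 3])
```

Because this estimate treats the samples as independent, it ignores the strong correlation between neighbouring audio samples and therefore overestimates the number of bits really needed.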

  • 2D entropy
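
The source does not define "2D entropy". One plausible reading, assumed here, is the joint entropy of pairs of consecutive samples, from which the entropy of a sample given its predecessor follows. The function names in this Python sketch are hypothetical.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def joint_entropy_bits(samples):
    """Entropy, in bits per pair, of consecutive sample pairs (x[n], x[n+1])."""
    return entropy_bits(list(zip(samples, samples[1:])))

def conditional_entropy_bits(samples):
    """H(x[n+1] | x[n]) = H(x[n], x[n+1]) - H(x[n]), in bits per sample."""
    return joint_entropy_bits(samples) - entropy_bits(samples[:-1])

# A perfectly predictable alternating signal (assumed toy input).
alternating = [0, 1] * 50
h_pair = joint_entropy_bits(alternating)
h_cond = conditional_entropy_bits(alternating)
```

For the alternating signal the first-order entropy is about 1 bit per sample, yet each sample is fully determined by the previous one, so the conditional entropy is essentially zero. The gap between the two is the redundancy that a model of the signal can exploit.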
            

  • Theoretical minimum of about 7 bit/sample * 44,100 samples/s = 308,700 bit/s



IV - 2) State of the art of low bit rate coding

  • For comparison, a raw audio file (e.g. WAV) uses 16 bits/sample, generally at 44,100 Hz

  • Low bit rate coding = compression based on perceptual acoustic characteristics (lookup tables, FFT, removal of the frequencies that humans cannot hear) = 128 kbit/s

  • Everything can be expressed in bit/s
             - Raw audio = 705,600 bit/s
             - mp3 = 128,000 bit/s
             - very low bit rate coding (11) goes down to 64,000 bit/s
             - CT-aacPlus = 48,000 bit/s (12)
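
The figures in this list can be cross-checked with a few lines of arithmetic. The Python sketch below recomputes the raw PCM rate and the compression ratio of each codec rate quoted above; the function name and the dictionary labels are illustrative, and a single channel is assumed because that is what reproduces 705,600 bit/s.

```python
def pcm_bit_rate(bits_per_sample, sample_rate_hz, channels=1):
    """Bit rate of uncompressed PCM audio, in bit/s."""
    return bits_per_sample * sample_rate_hz * channels

raw_rate = pcm_bit_rate(16, 44100)  # mono, 16 bits/sample, 44,100 Hz

# Codec rates copied from the list above, in bit/s.
codec_rates = {"mp3": 128000, "very low bit rate (11)": 64000, "codec from (12)": 48000}
ratios = {name: raw_rate / rate for name, rate in codec_rates.items()}
```

The ratios range from about 5.5 for mp3 to about 14.7 for the lowest rate quoted.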

  • According to Karlheinz Brandenburg (Ilmenau Technical University & Fraunhofer IIS Arbeitsgruppe Elektronische Medientechnologie Ilmenau, Germany): "Current work on audio compression concentrates more on flexibility as needed for Internet multimedia or new multichannel applications than on improving on coding efficiency."

IV - 3) Bits/second in physical modeling

  • Assuming a perfect physical model

  • Number of parameters (mechanical, slowly varying)
             - 44 for the vocal tract
             - Tension of the glottis = F0
             - Breathiness
             - Lip motion and area: 2 parameters

  • Variation rate ~20 Hz

  • 1,100 bit/s
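
The source gives the total rate but not the number of bits spent on each parameter update. The Python sketch below only relates the listed quantities to each other. The parameter count of 48 is one reading of the list (44 + 1 + 1 + 2), `bits_per_value` is a free, hypothetical parameter, and the raw audio rate of 705,600 bit/s is the one quoted in IV - 2).

```python
def model_bit_rate(n_params, update_rate_hz, bits_per_value):
    """Bit rate needed to transmit n_params values update_rate_hz times per second."""
    return n_params * update_rate_hz * bits_per_value

n_params = 44 + 1 + 1 + 2            # tract sections, F0, breathiness, lips (assumed reading)
updates_per_s = n_params * 20        # parameter values sent per second at ~20 Hz
implied_bits = 1100 / updates_per_s  # bits per value implied by the quoted 1,100 bit/s
ratio_vs_raw = 705600 / 1100         # raw audio rate divided by the model rate
```

Under this reading, 1,100 bit/s corresponds to a little over one bit per parameter update, and to a rate about 640 times lower than raw audio. A coarser or finer quantization of the parameters can be explored by calling `model_bit_rate` with another `bits_per_value`.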
