Speech production and modeling

Simon Leglaive

CentraleSupélec

1

Today

  • Speech production mechanisms
  • Characteristics of speech signals
2

Speech production

3

Speech signal

Can you guess which speech sound each block corresponds to?

4

Phonemes

Elementary speech sounds are called phonemes.

  • 44 phonemes in English.
  • 10-15 phonemes per second in normal English speech.
  • We will see the key differences in how the different phonemes are produced.

Image credit: https://www.englishclub.com/pronunciation/phonemic-chart.htm

5

Speech production – the global view

  • The energy comes from air expelled from the lungs.
  • At the larynx, this airflow passes between the vocal folds.
  • Then it goes through the vocal tract, which is made of three cavities:
    1. the pharynx
    2. the oral cavity
    3. the nasal cavity
  • Finally, sound goes out of the mouth and nose openings.

6

Articulators

An articulator is any mobile part of the vocal tract that we can control voluntarily and that is used to produce speech sounds.

Real-time MRI scan of a person talking.

What are the three main speech articulators?

7

Tongue

  • Very mobile and flexible
  • Very important for phonation

Jaw

  • Rigid, with few degrees of freedom
  • Less important for phonation

Lips

  • Very mobile and flexible
  • Important movements for phonation:
    • occlusion
    • protrusion
    • raising and lowering
    • stretching, raising and lowering of lip corners
8

Speech sound sources

We distinguish 3 types of sound sources, which can be combined or occur individually:

  • Quasi-periodic source resulting from the vibration of the vocal folds.

    We say that the sound is voiced.

    It can be arbitrarily long (within the limits of an exhalation).

  • Fricative noise source produced by turbulent airflow through a constriction in the vocal tract.

    It can also be arbitrarily long.

  • Plosive noise source produced by a brief occlusion of the vocal tract, whose release generates an acoustic impulse.

    Here the duration is short.
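
The three source types can be imitated with very simple signals. The sketch below is only an illustration, not a physical model: the function names, the 16 kHz sampling rate, and the 5 ms burst length are assumptions chosen for the example.

```python
import numpy as np

def voiced_source(f0, duration, fs=16000):
    """Quasi-periodic source: one unit impulse per pitch period."""
    n = int(round(duration * fs))
    source = np.zeros(n)
    period = int(round(fs / f0))  # pitch period in samples
    source[::period] = 1.0
    return source

def fricative_source(duration, fs=16000, seed=0):
    """Fricative source: sustained white Gaussian noise."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(int(round(duration * fs)))

def plosive_source(duration, fs=16000, burst=0.005, seed=0):
    """Plosive source: a short decaying noise burst followed by silence."""
    rng = np.random.default_rng(seed)
    source = np.zeros(int(round(duration * fs)))
    n_burst = int(round(burst * fs))
    # Linear decay is an arbitrary choice to make the burst impulse-like.
    source[:n_burst] = rng.standard_normal(n_burst) * np.linspace(1.0, 0.0, n_burst)
    return source
```

The voiced and fricative sources can be made as long as desired through `duration`, whereas the plosive source is non-zero only during the short burst.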

9

Vocal folds and pitch

  • The vibration of the vocal folds defines the pitch of the speech signal (i.e. its fundamental frequency).
  • Variations of pitch over time define the melody of the voice.

          Average pitch (Hz)    Pitch range (Hz)
Male      100 - 130             90 - 270
Female    150 - 300             120 - 360
Child     350 - 400             200 - 600
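
As an illustration, the fundamental frequency of a voiced frame can be estimated from the position of the main autocorrelation peak. This is a minimal sketch, assuming a clean and stationary voiced frame; `estimate_pitch` is a hypothetical helper, and its default search range simply covers the pitch ranges of the table above.

```python
import numpy as np

def estimate_pitch(frame, fs, fmin=90.0, fmax=600.0):
    """Estimate the fundamental frequency (Hz) of a voiced frame."""
    frame = frame - np.mean(frame)
    # Autocorrelation for non-negative lags.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Only search lags that correspond to plausible pitch values.
    lag_min = int(fs / fmax)
    lag_max = int(fs / fmin)
    lag = lag_min + np.argmax(acf[lag_min:lag_max + 1])
    return fs / lag

# Usage: a synthetic 50 ms voiced frame with a 200 Hz fundamental.
fs = 16000
t = np.arange(800) / fs
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
pitch = estimate_pitch(frame, fs)
```

Real speech needs more care (voicing detection, octave errors, noise), which this sketch does not address.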
11

Vocal tract and formants

  • The three elementary sound sources are modified by the vocal tract before propagating out of the phonatory system through the mouth and nose openings.
  • The vocal tract acts as an acoustic filter applied to the source signal.
  • The cavities of the vocal tract give rise to resonances, which are called formants.
  • By modifying the shape of the vocal tract, we change the acoustic filter and the associated resonances.
  • We can change the formants independently of the pitch or, in signal processing terms, we can change the filter independently of the source.
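
The independence of source and filter can be sketched by filtering an impulse train (source) with a cascade of two-pole resonators (filter). This is a toy example under stated assumptions: the formant frequencies, the 80 Hz bandwidth, and the function names are illustrative choices, not measured values.

```python
import numpy as np

def resonator(x, freq, bandwidth, fs):
    """Two-pole resonator: y[n] = x[n] - a1 * y[n-1] - a2 * y[n-2]."""
    r = np.exp(-np.pi * bandwidth / fs)            # pole radius
    a1 = -2.0 * r * np.cos(2.0 * np.pi * freq / fs)
    a2 = r ** 2
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y

def synthesize_vowel(f0, formants, duration=0.5, fs=16000, bandwidth=80.0):
    """Impulse train at pitch f0, filtered by one resonator per formant."""
    n = int(round(duration * fs))
    source = np.zeros(n)
    source[::int(round(fs / f0))] = 1.0            # source: sets the pitch
    signal = source
    for freq in formants:                          # filter: sets the formants
        signal = resonator(signal, freq, bandwidth, fs)
    return signal / np.max(np.abs(signal))

# Same filter (formants), two different sources (pitches).
low_voice = synthesize_vowel(100.0, (800.0, 1200.0))
high_voice = synthesize_vowel(200.0, (800.0, 1200.0))
```

Both signals share the same resonances, but their harmonics are spaced 100 Hz and 200 Hz apart respectively.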
13

Distinctive articulatory features of vowels

  • Opening of the mouth

    • Open vowel [a] in “hat”
    • Closed vowel [i] in “meet”
  • “Frontness” of the tongue

    • Front vowel [i] in “meet”
    • Back vowel [u] in “boot”
  • Rounding of the lips

    • Rounded vowel [ɔ] in “not”
    • Unrounded vowel [i] in “meet”
  • Nasalization: sound comes out of the mouth only, or out of the mouth and nose.

    • Nasal vowel [ɑ̃] in “pente” in French
    • Oral vowel [a] in “hat”
15

Vowels and formants

We can distinguish between vowels using the positions of the first three formants (F1, F2, F3):

  • high/low F1 ↔ open/closed
  • high/low F2 ↔ front/back
  • high/low F3 ↔ unrounded/rounded lips

By moving articulators, the shape of the vocal tract varies, formants move in frequency, and vowels change.
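
The F1 and F2 rules above can be turned into a toy vowel labeller. Everything numeric here is an assumption made for illustration: the split frequencies are arbitrary, and the (F1, F2) centroids are rounded values of the order of magnitude usually quoted for adult male speakers, not data from these slides.

```python
import math

# Assumed, rounded (F1, F2) centroids in Hz, for illustration only.
VOWEL_CENTROIDS = {
    "i": (270.0, 2290.0),
    "a": (730.0, 1090.0),
    "u": (300.0, 870.0),
}

def describe_vowel(f1, f2, f1_split=500.0, f2_split=1500.0):
    """Map formants to articulatory features (split values are arbitrary)."""
    opening = "open" if f1 > f1_split else "closed"
    place = "front" if f2 > f2_split else "back"
    return opening, place

def nearest_vowel(f1, f2):
    """Return the vowel whose assumed centroid is closest in the (F1, F2) plane."""
    return min(VOWEL_CENTROIDS,
               key=lambda v: math.dist((f1, f2), VOWEL_CENTROIDS[v]))
```

For example, `describe_vowel(280, 2200)` gives a closed front vowel and `nearest_vowel(280, 2200)` picks "i". Formants vary a lot between speakers, so fixed centroids like these are only a rough guide.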

Image credit: L.R. Rabiner and R.W. Schafer, “Digital Processing of Speech Signals”, Prentice-Hall, 1978

16

Vowel clustering in the formant space

Male speakers

Male and child speakers

Image credit: L.R. Rabiner and R.W. Schafer, “Digital Processing of Speech Signals”, Prentice-Hall, 1978
17

Consonants

Fricatives

  • fricative noise source
  • voiced [v, z, ʒ] or unvoiced [f, s, ʃ]
  • locally stationary

Plosives

  • plosive noise source
  • voiced [b, d, g] or unvoiced [p, t, k]
  • highly non-stationary

Nasal

  • voiced
  • sound comes mostly from the nose
  • examples: [m, n]

Liquids

  • voiced
  • the vocal tract changes rapidly, especially using the tongue
  • examples: [l, r]

Consonant chart with audio: https://en.wikipedia.org/wiki/IPA_pulmonic_consonant_chart_with_audio

18

Prosody

  • Prosody is superimposed on the flow of phonemes.
  • Prosodic variables:

    • pitch (fundamental frequency)
    • speech rate (number of speech units, e.g. phonemes, per second)
    • loudness (intensity)
    • timbre (spectral characteristics such as amplitude of harmonics)
  • Different combinations of these variables are exploited for intonation and accentuation.

  • Prosody may reflect various features of the speaker or the utterance:

    • the identity of the speaker
    • the emotional state of the speaker
    • the form of the utterance (statement, question, or command)
    • the presence of irony or sarcasm
    • emphasis
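
Among the prosodic variables, loudness is the simplest to track. The sketch below computes a frame-wise intensity contour as the RMS level in dB; the 25 ms frame and 10 ms hop are common choices assumed here, and `intensity_contour_db` is a hypothetical helper name.

```python
import numpy as np

def intensity_contour_db(signal, fs, frame_dur=0.025, hop_dur=0.010):
    """Frame-wise RMS level in dB (relative to an amplitude of 1)."""
    frame = int(round(frame_dur * fs))
    hop = int(round(hop_dur * fs))
    n_frames = 1 + (len(signal) - frame) // hop
    rms = np.array([
        np.sqrt(np.mean(signal[i * hop:i * hop + frame] ** 2))
        for i in range(n_frames)
    ])
    # Small offset avoids log(0) on silent frames.
    return 20.0 * np.log10(rms + 1e-12)

# Usage: a quiet tone followed by a tone ten times larger in amplitude.
fs = 16000
t = np.arange(8000) / fs
tone = np.sin(2 * np.pi * 200 * t)
contour = intensity_contour_db(np.concatenate([0.1 * tone, tone]), fs)
```

A tenfold increase in amplitude shows up as a 20 dB rise in the contour. Perceived loudness also depends on frequency content, which this measure ignores.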
20

Spectrum/spectrogram reading

21

The spectral envelope

  • Black curve: power spectrum (in dB) of the recording of a vowel, computed with the DFT.
  • Blue curve: spectral envelope showing the formant resonances, computed with linear predictive coding (will be discussed in the lab session).
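
A spectral envelope of this kind can be sketched with the autocorrelation method of linear prediction. This is a minimal sketch and not necessarily the implementation used in the lab session; the function name, the Hamming window, and the FFT size are assumptions, and the prediction order has to be chosen by the user.

```python
import numpy as np

def lpc_envelope_db(frame, order, n_fft=1024):
    """LPC spectral envelope in dB on n_fft // 2 + 1 frequency bins."""
    windowed = frame * np.hamming(len(frame))
    # Autocorrelation for lags 0..order.
    acf = np.correlate(windowed, windowed, mode="full")
    acf = acf[len(windowed) - 1:][:order + 1]
    # Normal equations R a = -r, with R the Toeplitz autocorrelation matrix.
    lags = np.abs(np.arange(order)[:, None] - np.arange(order)[None, :])
    a = np.linalg.solve(acf[lags], -acf[1:])
    # Prediction error energy, used as the squared gain of the all-pole filter.
    gain2 = acf[0] + np.dot(a, acf[1:])
    # Envelope = gain^2 / |A(e^{jw})|^2, A(z) = 1 + a_1 z^-1 + ... + a_p z^-p.
    A = np.fft.rfft(np.concatenate(([1.0], a)), n_fft)
    return 10.0 * np.log10(gain2 / np.abs(A) ** 2)
```

The result can be plotted on top of `10 * np.log10(np.abs(np.fft.rfft(windowed, n_fft)) ** 2)` to compare the envelope with the DFT power spectrum; the peaks of the envelope indicate the formant frequencies.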
22

Male or female speaker?

23

Go to https://app.wooclap.com/CXIOJL and find the vowel that corresponds to each spectrum, using the French vowel triangle above.

24

Spectrogram reading - "aeiou"

We could have done the same from a spectrogram representation.
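
A spectrogram is the power spectrum computed on short overlapping frames. The sketch below shows one way to compute it with the short-time Fourier transform; the window length (25 ms at 16 kHz), hop size (10 ms), and FFT size are assumed defaults, and `spectrogram_db` is a hypothetical helper name.

```python
import numpy as np

def spectrogram_db(signal, fs, win_length=400, hop=160, n_fft=512):
    """Return (freqs, times, S) with S the power spectrogram in dB."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop
    # One windowed frame per row.
    frames = np.stack([
        signal[i * hop:i * hop + win_length] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + win_length / 2) / fs
    # Transpose to (frequency, time); small offset avoids log(0).
    return freqs, times, 10.0 * np.log10(power.T + 1e-12)

# Usage: one second of a 1 kHz tone sampled at 16 kHz.
fs = 16000
tone = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)
freqs, times, S = spectrogram_db(tone, fs)
```

A shorter window gives better time resolution (useful for plosives), while a longer window resolves individual harmonics better.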

25

Spectrogram reading - "assa - azza"

26

Spectrogram reading - "atta - adda"

27

With a bit of practice, you should be able to decode this mystery spectrogram.

1 bonus point if you decode the message 😉.

28

Further reading

Introduction to voice acoustics by Joe Wolfe, Emeritus Professor at the University of New South Wales (Sydney, Australia):

https://newt.phys.unsw.edu.au/jw/voice.html

29

Lab session

Analysis, transformation and synthesis of speech signals with the source-filter model and linear predictive coding

30

Solution to the wooclap

31
