Fundamentals of Speech Signal Processing: 1.0 Speech Signals




1.0 Speech Signals

Waveform plots of typical vowel sounds (voiced)
[Figure: waveforms for tone 1, tone 2 and tone 4 against time t]

Speech Production and Source Model
[Figure: human vocal mechanism and the speech source model]

Voiced and Unvoiced Speech
[Figure: excitation u(t) and speech x(t) for voiced and unvoiced speech, with the pitch marked]

Waveform plots of typical consonant sounds (unvoiced and voiced)

Waveform plot of a sentence

Frequency domain spectra of speech signals (voiced and unvoiced)

Frequency Domain: Voiced
[Figure: spectrum with formant frequencies marked]

Frequency Domain: Unvoiced
[Figure: spectrum with formant frequencies marked]



Formant Frequencies

Formant frequency contours

Example sentence: "He will allow a rare lie."

Reference: 6.1 of Huang, or 2.2, 2.3 of Rabiner and Juang

2.0 Speech Signal Processing

Major Application Areas
  Speech Coding: Digitization and Compression
    Considerations: 1) bit rate (bps), 2) recovered quality, 3) computational complexity/feasibility
  Voice-based Network Access: User Interface, Content Analysis, User-Content Interaction

[Figure: block diagram with labels x(t), LPF, x[n], Processing, x_k (bits such as 110101), Storage/transmission, Inverse Processing, \hat{x}[n], LPF, output, and Processing Algorithms]

Speech Signals
  Carrying Linguistic Knowledge and Human Information: Characters, Words, Phrases, Sentences, Concepts, etc.
  Double Levels of Information: Acoustic Signal Level / Symbolic or Linguistic Level
  Processing and Interaction of the Double-level Information

Sampling of Signals

[Figure: continuous-time signal X(t) against t and its samples X[n] against n]

Double Levels of Information
[Figure: linguistic units (Character), (Word), (Sentence)]

Speech Signal Processing: Processing of Double-Level Information
[Figure: Speech Signal -> Sampling -> Processing; labels: Linguistic Structure, Linguistic Knowledge, Lexicon, Grammar, Algorithm, Chips or Computers]

Voice-based Network Access

[Figure: User Interface, Content Analysis and User-Content Interaction arranged around the Internet]
  User Interface: when keyboards/mice are inadequate
  Content Analysis: helps in browsing/retrieval of multimedia content
  User-Content Interaction: all text-based interaction can be accomplished by spoken language

User Interface
  Wireless Communications Technologies are Creating a Whole Variety of User Terminals
    usable at Any Time, from Anywhere
    Smart Phones, Hand-held Devices, Notebooks, Vehicular Electronics, Hands-free Interfaces, Home Appliances, Wearable Devices
    Small in Size, Light in Weight, Ubiquitous, Invisible
  Post-PC Era
    Keyboard/Mouse, Most Convenient for PCs, are no longer Convenient: human fingers never shrink, and the application environment has changed
  Service Requirements are Growing Exponentially
  Voice is the Only Interface Convenient for ALL User Terminals at Any Time, from Anywhere, and to the point in one utterance
  Speech Processing is the only part of the Technology Chain that is still less mature

[Figure: Internet/Networks carrying Text Content and Multimedia Content]

Content Analysis
  Multimedia Technologies are Creating a New World of Multimedia Content
  The Most Attractive Form of Network Content will be Multimedia, which usually Includes Speech Information (but Probably not Text)
  Multimedia Content is Difficult to Summarize and Show on the Screen, thus Difficult to Browse
  The Speech Information, if Included, usually Tells the Subjects, Topics and Concepts of the Multimedia Content, and thus Becomes the Key for Browsing and Retrieval
  Multimedia Content Analysis based on Speech Information

Future Integrated Networks
  Real-time Information: weather, traffic, flight schedules, stock prices, sports scores
  Special Services: Google, Facebook, YouTube, Amazon
  Knowledge Archives: digital libraries, virtual museums
  Intelligent Working Environment: email processors, intelligent agents, teleconferencing, distance learning, electronic commerce
  Private Services: personal notebook, business databases, home appliances, network entertainment

User-Content Interaction
  Wireless and Multimedia Technologies are Creating an Era of Network Access by Spoken Language Processing
  Network Access is Primarily Text-based today, but almost all Roles of Text can be Accomplished by Speech
  User-Content Interaction can be Accomplished by Spoken and Multi-modal Dialogues
  Hand-held Devices with Multimedia Functionalities are Commonly used Today
  Using Speech Instructions to Access Multimedia Content whose Key Concepts are Specified by Speech Information

[Figure: voice input/output and text information reaching Text Content and Multimedia Content (voice information) on the Internet; blocks: Multimedia Content Analysis, Text Information Retrieval, Voice-based Information Retrieval, Text-to-Speech Synthesis, Spoken and Multi-modal Dialogue]

3.0 Speech Coding

Waveform-based Approaches
  Pulse-Code Modulation (PCM): binary representation for each sample x[n] by quantization
  Differential PCM (DPCM): encoding the differences
    $d[n] = x[n] - x[n-1]$
    $d[n] = x[n] - \sum_{k=1}^{P} a_k x[n-k]$
  Adaptive DPCM (ADPCM): DPCM with adaptive algorithms

Ref: Haykin, Communication Systems, 4th Ed., 3.7, 3.13, 3.14, 3.15
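The first DPCM relation above, d[n] = x[n] - x[n-1], can be sketched in a few lines. This is only an illustration: the sample values are hypothetical, x[-1] is assumed to be 0, and the quantization of d[n] that a real DPCM coder performs is left out.

```python
def dpcm_encode(samples):
    """Return first-order differences d[n] = x[n] - x[n-1], taking x[-1] = 0."""
    diffs = []
    previous = 0
    for x in samples:
        diffs.append(x - previous)
        previous = x
    return diffs


def dpcm_decode(diffs):
    """Rebuild x[n] by accumulating the differences."""
    samples = []
    previous = 0
    for d in diffs:
        previous = previous + d
        samples.append(previous)
    return samples


x = [100, 102, 105, 104, 101, 101]  # hypothetical integer PCM samples
d = dpcm_encode(x)
print(d)                    # [100, 2, 3, -1, -3, 0]
print(dpcm_decode(d) == x)  # True
```

After the first sample the differences are much smaller than the samples themselves, which is why they can be represented with fewer bits.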

Speech Source Model and Source Coding

Speech Source Model
  digitization and transmission of the parameters will be adequate
  at the receiver the parameters can produce x[n] with the model
  far fewer parameters with much slower variation in time lead to far fewer bits required
  the key for low bit rate speech coding

[Figure: Excitation Generator (parameters) -> u[n], with $U(\omega)$, $U(z)$ -> Vocal Tract Model (parameters; $G(\omega)$, $G(z)$, $g[n]$) -> x[n]]
  $x[n] = u[n] * g[n]$
  $X(\omega) = U(\omega) G(\omega)$
  $X(z) = U(z) G(z)$

[Figure: Speech Source Model; waveform x(t) against t and parameter sequence a[n] against n]

Speech Source Model and Source Coding: Analysis and Synthesis

High computation requirements are the price for low bit rate
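As a rough illustration of the relation x[n] = u[n] * g[n], the sketch below generates a periodic excitation from two parameters (period and gain) and convolves it with an impulse response. The function names, the impulse response values and the parameter values are all hypothetical, chosen only to make the example small.

```python
def convolve(u, g):
    """Direct-form convolution x[n] = sum_m u[m] * g[n - m]."""
    x = [0.0] * (len(u) + len(g) - 1)
    for m, u_m in enumerate(u):
        for k, g_k in enumerate(g):
            x[m + k] += u_m * g_k
    return x


def pulse_train(length, period, gain):
    """Periodic excitation: one pulse of height `gain` every `period` samples."""
    return [gain if n % period == 0 else 0.0 for n in range(length)]


# Hypothetical parameters, for illustration only.
g = [1.0, 0.5, 0.25]                           # short decaying impulse response g[n]
u = pulse_train(length=8, period=4, gain=2.0)  # excitation u[n]
x = convolve(u, g)
print(x)  # [2.0, 1.0, 0.5, 0.0, 2.0, 1.0, 0.5, 0.0, 0.0, 0.0]
```

Only the period, the gain and the three values of g[n] would need to be sent to regenerate all of x[n], which is the idea behind coding the parameters instead of the waveform.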

Simplified Speech Source Model

[Figure: a periodic pulse train generator (voiced, pitch N) or a random sequence generator (unvoiced), selected by v/u and scaled by gain G, forms the excitation u[n]; u[n] drives the Vocal Tract Model ($G(z)$, $G(\omega)$, $g[n]$) to produce x[n]]

  Excitation parameters
    v/u : voiced/unvoiced
    N : pitch for voiced
    G : signal gain
    -> excitation signal u[n]
  Vocal Tract parameters
    {a_k} : LPC coefficients
    -> formant structure of speech signals

  $G(z) = \dfrac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$

  A good approximation, though not precise enough

Reference: 3.3.1-3.3.6 of Rabiner and Juang, or 6.3 of Huang

LPC Vocoder (Voice Coder)

  N by pitch detection
  v/u by voicing detection
  {a_k} can be non-uniformly or vector quantized to reduce bit rate further

Ref: 3.3 (3.3.1 up to 3.3.9) of Rabiner and Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993

Multipulse LPC
  poor modeling of u[n] is the main source of quality degradation in the LPC vocoder
  u[n] replaced by a sequence of pulses
    $u[n] = \sum_k b_k \, \delta[n - n_k]$
    roughly 8 pulses per pitch period
    u[n] close to periodic for voiced
    u[n] close to random for unvoiced
  Estimating $(b_k, n_k)$ is a difficult problem
    analysis by synthesis

large amount of computation is the price paid for better speech quality

Multipulse LPC: Perceptual Weighting
  $W(z) = \dfrac{1 - \sum_{k=1}^{P} a_k z^{-k}}{1 - \sum_{k=1}^{P} a_k c^k z^{-k}}, \quad 0 < c < 1$, for perceptual sensitivity
  $W(z) = 1$, if $c = 1$
  $W(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}$, if $c = 0$
  practically $c \approx 0.8$

  Error Evaluation
  $E = \int |X(\omega) - \hat{X}(\omega)|^2 \, W(\omega) \, d\omega$

Multipulse LPC: Error Minimization and Pulse Search
  $u[n] = \sum_k b_k \, \delta[n - n_k]$
  $\hat{x}[n] = \sum_k b_k \, g[n - n_k]$
  $E = E(b_1, n_1, b_2, n_2, \ldots)$
  sub-optimal solution: finding 1 pulse at a time

Code-Excited Linear Prediction (CELP)
  Use of VQ to Construct a Codebook of Excitation Sequences
    a sequence consists of roughly 40 samples
    a codebook of 512 ~ 1024 patterns is constructed with VQ
    roughly 512 ~ 1024 excitation patterns are perceptually adequate
  Excitation Search: analysis by synthesis

9 ~ 10 bits are needed for the excitation of 40 samples, while the {a_k} parameters in G(z) are also vector quantized
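The 9 ~ 10 bit figure follows from the codebook size: indexing one of 512 or 1024 patterns takes log2(512) = 9 or log2(1024) = 10 bits. The sketch below works this out and also converts it to a bit rate for the excitation alone. The 8 kHz sampling rate is an assumption, not stated in the slides, and the function names are hypothetical.

```python
import math


def index_bits(codebook_size):
    """Bits needed to index one entry of a codebook of the given size."""
    return math.ceil(math.log2(codebook_size))


def excitation_bit_rate(codebook_size, samples_per_sequence, sampling_rate_hz):
    """Bits per second spent on excitation codebook indices alone."""
    sequences_per_second = sampling_rate_hz / samples_per_sequence
    return index_bits(codebook_size) * sequences_per_second


SAMPLING_RATE_HZ = 8000  # assumption: sampling rate is not given in the slides

for size in (512, 1024):
    print(size, index_bits(size), excitation_bit_rate(size, 40, SAMPLING_RATE_HZ))
# 512 9 1800.0
# 1024 10 2000.0
```

Under that assumed sampling rate, the excitation index costs about 2 kbps; the vector-quantized {a_k} and any gain terms would add to this.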

Code-Excited Linear Prediction (CELP): Receiver

  {a_k} codewords can be transmitted less frequently than excitation codewords

Ref: Gold and Morgan, Speech and Audio Signal Processing, John Wiley & Sons, 2000, Chap 33

4.0 Speech Recognition and Voice-based Network Access

Speech Recognition as a pattern recognition problem
[Figure: unknown speech signal x(t) -> Feature Extraction -> feature vector sequence X -> Pattern Matching -> Decision Making -> output word W; training speech y(t) -> Feature Extraction -> Y -> Reference Patterns, used by Pattern Matching]

A Simplified Block Diagram

Example
  Input Sentence: this is speech
  Acoustic Models: (th-ih-s-ih-z-s-p-iy-ch)
  Lexicon: (th-ih-s) -> this, (ih-z) -> is, (s-p-iy-ch) -> speech
  Language Model: (this) - (is) - (speech)
    P(this) P(is | this) P(speech | this is)
    $P(w_i \mid w_{i-1})$ : bi-gram language model
    $P(w_i \mid w_{i-1}, w_{i-2})$ : tri-gram language model, etc.

Basic Approach for Large Vocabulary Speech Recognition
[Figure: Input Speech -> Front-end Signal Processing -> Feature Vectors -> Linguistic Decoding and Search Algorithm -> Output Sentence; the decoder uses Acoustic Models (Speech Corpora -> Acoustic Model Training), the Lexicon, and the Language Model (Text Corpora -> Language Model Construction)]


[Figure: HMM with Observation Sequences, 1-dim Gaussian Mixtures and State Transition Probabilities]

Simplified HMM
[Figure: example observation sequence R G B G G B B G R R R]

Peripheral Processing for Human Perception (P.34 of 7.0)

Mel-scale Filter Bank

N-gram
[Figure: tri-gram over the word sequence W1 W2 W3 W4 W5 W6 ...... WR]
  Example counts: "this" 50000, "this is" 500, "this is a" 5

Text-to-speech Synthesis
[Figure: Input Text -> Text Analysis and Letter-to-sound Conversion (Lexicon and Rules) -> Prosody Generation (Prosodic Model) -> Signal Processing and Concatenation (Voice Unit Database) -> Output Speech Signal]
  Transforming any input text into corresponding speech signals
  E-mail/Web page reading
  Prosodic modeling
  Basic voice units/rule-based, non-uniform units/corpus-based, model-based

Text-to-Speech Synthesis: Three Major Steps
  text analysis
    [Figure: example sentence segmented into units tagged Cbb, Nd, Nb, Na, Dfa, VH, Di, D, D, D, T, with indices 5 1 3 1 2 1 1 4 1 2 1 5]
  prosody generation: Energy, Pause, Intonation, Tone, Duration
  speech synthesis

Text-to-Speech Synthesis: Automatic Prosodic Analysis for an Arbitrary Text Sentence

  Predict B1, B2, B3 (B4, B5 determined by punctuation marks)
  minor phrase patterns: (Cbb+Nd, Nb+Na, Dfa+VH+Di, D+D, D+T, etc.)
[Figure: prosodic structure over the tag sequence Cbb, Nd, Nb, Na, Dfa, VH, Di, D, D, D, T; levels: words, minor phrases, major phrases, breath groups, prosodic groups; break indices 5 1 3 1 2 1 1 4 1 2 1 5]

Speech Understanding
  Understanding Speaker's Intention rather than Transcribing into Word Strings
  Limited Domains/Finite Tasks
[Figure: block diagram with Syllable Recognition, Key Phrase Matching and Semantic Decoding; labels: input utterance, acoustic models, syllable lattice, phrase lexicon, phrase graph, phrase/concept language model, concept set, concept graph, understanding results]
  An Example
    concept: (inquiry) - (target) - (phone number)

Speaker Verification
[Figure: input speech -> Feature Extraction -> Verification (Speaker Models) -> yes/no]
  Verifying the speaker as claimed
  Applications requiring verification
  Text dependent/independent
  Integrated with other verification schemes

Voice-based Information Retrieval
  Speech Instructions
  Speech Documents (or Multi-media Documents including Speech Information)
[Figure: speech instruction or text instruction retrieving text documents d1, d2, d3 and speech documents d1, d2, d3]

Spoken Dialogue Systems
  Almost all human-network interactions can be accomplished by spoken dialogue
  Speech understanding, speech synthesis, dialogue management
  System/user/mixed initiatives
  Reliability/efficiency, dialogue modeling/flow control
  Transaction success rate/average number of dialogue turns
[Figure: Dialogue Server connected to Users and Databases over Internet/Networks: Input Speech -> Speech Recognition and Understanding -> User's Intention -> Dialogue Manager (Discourse Context) -> Response to the user -> Sentence Generation and Speech Synthesis -> Output Speech]

References
  "Speech and Language Processing over the Web", IEEE Signal Processing Magazine, May 2008

2.0 Fundamentals of Speech Recognition

Hidden Markov Models (HMM)


  $o_t = [x_1, x_2, \ldots, x_D]^T$ : feature vector for the frame at time t
  $q_t = 1, 2, 3, \ldots, N$ : state number for feature vector $o_t$
  $A = [a_{ij}]$, $a_{ij} = \mathrm{Prob}[q_t = j \mid q_{t-1} = i]$ : state transition probability
  $B = [b_j(o),\ j = 1, 2, \ldots, N]$ : observation probability
    $b_j(o) = \sum_{k=1}^{M} c_{jk} \, b_{jk}(o)$
    $b_{jk}(o)$ : multi-variate Gaussian distribution for the k-th mixture of the j-th state
    M : total number of mixtures
    $\sum_{k=1}^{M} c_{jk} = 1$
  $\pi = [\pi_1, \pi_2, \ldots, \pi_N]$ : initial probabilities, $\pi_i = \mathrm{Prob}[q_1 = i]$
  HMM : $(A, B, \pi) = \lambda$
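The mixture observation probability b_j(o) = sum_k c_jk b_jk(o) can be sketched for the 1-dimensional case as follows. This is a simplified illustration rather than the multi-variate form used in practice, and the weights, means and variances are hypothetical values for a single state j.

```python
import math


def gaussian_pdf(o, mean, variance):
    """Density of a 1-dimensional Gaussian at observation o."""
    return math.exp(-((o - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def observation_probability(o, weights, means, variances):
    """b_j(o) = sum_k c_jk * b_jk(o) for one state j."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights c_jk must sum to 1")
    return sum(c * gaussian_pdf(o, m, v) for c, m, v in zip(weights, means, variances))


# Hypothetical state with M = 2 mixtures.
c_j = [0.6, 0.4]    # mixture weights c_jk
mu_j = [0.0, 3.0]   # mixture means
var_j = [1.0, 2.0]  # mixture variances
print(observation_probability(0.5, c_j, mu_j, var_j))
```

The check on the weights mirrors the constraint that the c_jk of each state sum to 1.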

[Figure: observation sequence o1 o2 o3 ... aligned with state sequence q1 q2 q3 ...; states with observation distributions b1(o), b2(o), b3(o)]




Hidden Markov Models (HMM)

Double Layers of Stochastic Processes

hidden states with random transitions for time warping

random output given state for random acoustic characteristics

Three Basic Problems

  (1) Evaluation Problem: Given $O = (o_1, o_2, \ldots, o_t, \ldots, o_T)$ and $\lambda = (A, B, \pi)$, find $\mathrm{Prob}[O \mid \lambda]$

  (2) Decoding Problem: Given $O = (o_1, o_2, \ldots, o_t, \ldots, o_T)$ and $\lambda = (A, B, \pi)$, find a best state sequence $q = (q_1, q_2, \ldots, q_t, \ldots, q_T)$

  (3) Learning Problem: Given O, find best values for the parameters in $\lambda$ such that $\mathrm{Prob}[O \mid \lambda] = \max$
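The first two problems have standard solutions: the forward algorithm for evaluation and the Viterbi algorithm for decoding. The sketch below is a simplified illustration that assumes discrete observation symbols (B[j][o] is a table lookup rather than a Gaussian mixture), 0-based state indices, and plain probabilities instead of the log probabilities a practical system would use. The model values are hypothetical.

```python
def forward_probability(obs, A, B, pi):
    """Evaluation problem: Prob[O | lambda] by the forward algorithm."""
    n_states = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][o]
            for j in range(n_states)
        ]
    return sum(alpha)


def viterbi_path(obs, A, B, pi):
    """Decoding problem: most likely state sequence by the Viterbi algorithm."""
    n_states = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    backpointers = []
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        backpointers.append(step)
        delta = new_delta
    state = max(range(n_states), key=lambda i: delta[i])
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical 2-state model with 2 discrete observation symbols.
A = [[0.7, 0.3], [0.4, 0.6]]   # a_ij
B = [[0.9, 0.1], [0.2, 0.8]]   # b_j(o) as a lookup table
pi = [0.5, 0.5]                # initial probabilities
O = [0, 0, 1]
print(forward_probability(O, A, B, pi))
print(viterbi_path(O, A, B, pi))  # [0, 0, 1]
```

The learning problem is not covered here; it is usually handled by iterative re-estimation of A, B and the initial probabilities.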


Feature Extraction (Front-end Signal Processing)

Mel Frequency Cepstral Coefficients (MFCC)

Mel-scale Filter Bank

  triangular shape in frequency/overlapped
  uniformly spaced below 1 kHz
  logarithmic scale above 1 kHz

Delta Coefficients
  1st/2nd order differences

[Figure: windowed speech samples -> Discrete Fourier Transform -> Mel-scale Filter Bank -> log(| |^2) -> Inverse Discrete Fourier Transform -> MFCC]
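To make the filter spacing and the delta coefficients concrete, here is a small sketch. The Hz-to-mel formula used (2595 log10(1 + f/700)) is a commonly used convention and is an assumption here: the slides only say the scale is uniform below 1 kHz and logarithmic above, and this formula is a smooth approximation of that behaviour. The delta shown is a plain first-order difference; all function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Assumed mel mapping: roughly linear at low frequencies, logarithmic at high ones."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_center_frequencies(n_filters, f_low_hz, f_high_hz):
    """Center frequencies (Hz) of filters equally spaced on the mel scale."""
    mel_low, mel_high = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (mel_high - mel_low) / (n_filters + 1)
    return [mel_to_hz(mel_low + step * (i + 1)) for i in range(n_filters)]


def first_order_delta(frames):
    """1st-order difference between consecutive feature vectors."""
    return [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(frames, frames[1:])]


centers = filter_center_frequencies(n_filters=10, f_low_hz=0.0, f_high_hz=4000.0)
print([round(c) for c in centers])           # spacing widens as frequency increases
print(first_order_delta([[1, 2], [3, 5], [6, 9]]))  # [[2, 3], [3, 4]]
```

Applying `first_order_delta` to its own output gives the 2nd-order differences.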



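Language Modeling: N-gram

  $W = (w_1, w_2, w_3, \ldots, w_i, \ldots, w_R)$ : a word sequence

  Evaluation of P(W)
    $P(W) = P(w_1) \prod_{i=2}^{R} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$

  Assumption:
    $P(w_i \mid w_1, w_2, \ldots, w_{i-1}) = P(w_i \mid w_{i-N+1}, w_{i-N+2}, \ldots, w_{i-1})$
    Occurrence of a word depends on the previous N-1 words only

  N-gram language models
    N = 2 : bi-gram, $P(w_i \mid w_{i-1})$
    N = 3 : tri-gram, $P(w_i \mid w_{i-2}, w_{i-1})$
    N = 4 : four-gram, $P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})$
    N = 1 : uni-gram, $P(w_i)$
    probabilities estimated from a training text database

  example : tri-gram model
    $P(W) = P(w_1) \, P(w_2 \mid w_1) \prod_{i=3}^{R} P(w_i \mid w_{i-2}, w_{i-1})$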



Language Modeling

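Evaluation of N-gram model parameters

  uni-gram
    $P(w_i) = \dfrac{N(w_i)}{\sum_{j=1}^{V} N(w_j)}$
    $w_i$ : a word in the vocabulary
    V : total number of different words in the vocabulary
    $N(\cdot)$ : number of counts in the training text database

  bi-gram
    $P(w_j \mid w_k) = \dfrac{N(\langle w_k, w_j \rangle)}{N(w_k)}$
    $\langle w_k, w_j \rangle$ : a word pair

  tri-gram
    $P(w_j \mid w_k, w_m) = \dfrac{N(\langle w_k, w_m, w_j \rangle)}{N(\langle w_k, w_m \rangle)}$

  smoothing: estimation of probabilities of rare events by statistical approaches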
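The count-based estimates above can be sketched directly with counters. The toy corpus and the function names are hypothetical, and no smoothing is applied, so unseen word pairs simply get probability 0.

```python
from collections import Counter


def train_counts(sentences):
    """Count single words and adjacent word pairs in tokenized sentences."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for words in sentences:
        unigram_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))
    return unigram_counts, bigram_counts


def unigram_probability(word, unigram_counts):
    """P(w_i) = N(w_i) / sum_j N(w_j)."""
    return unigram_counts[word] / sum(unigram_counts.values())


def bigram_probability(word, previous, unigram_counts, bigram_counts):
    """P(w_j | w_k) = N(<w_k, w_j>) / N(w_k), without smoothing."""
    if unigram_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / unigram_counts[previous]


# Hypothetical toy training text.
corpus = [
    ["this", "is", "speech"],
    ["this", "is", "a", "test"],
    ["this", "was", "speech"],
]
uni, bi = train_counts(corpus)
print(unigram_probability("this", uni))           # 0.3
print(bigram_probability("is", "this", uni, bi))  # 0.6666666666666666
```

The zero returned for unseen pairs is exactly the rare-event problem that smoothing is meant to address.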



Large Vocabulary Continuous Speech Recognition

  $W = (w_1, w_2, \ldots, w_R)$ : a word sequence

  $X = (x_1, x_2, \ldots, x_T)$


