Fundamentals of Speech Signal Processing


1.0 Speech Signals

[Figure: Waveform plots of typical vowel sounds (voiced); panels: tone 1, tone 2, tone 4; horizontal axis: t]

Speech Production and Source Model

[Figure: Human vocal mechanism and Speech Source Model; the excitation u(t) passes through the vocal tract to give x(t)]

Voiced and Unvoiced Speech

[Figure: u(t) and x(t) for voiced speech (with the pitch period marked) and for unvoiced speech]

[Figure: Waveform plots of typical consonant sounds, unvoiced and voiced]

[Figure: Waveform plot of a sentence]

[Figure: Frequency domain spectra of speech signals, voiced and unvoiced, with formant frequencies marked]

[Figure: Spectrogram]

[Figure: Formant frequencies and formant frequency contours]

Example sentence: "He will allow a rare lie."

Reference: 6.1 of Huang, or 2.2, 2.3 of Rabiner and Juang

2.0 Speech Signal Processing

Major Application Areas
  • Speech Coding: Digitization and Compression
    - Considerations: 1) bit rate (bps), 2) recovered quality, 3) computation complexity/feasibility
  • Voice-based Network Access: User Interface, Content Analysis, User-Content Interaction

[Figure: processing chain with labels x(t), x[n], Processing Algorithms, Processing, x_k, bit stream 110101, Storage/transmission, Inverse Processing, recovered x̂[n], LPF, output]

Speech Signals
  • Carrying Linguistic Knowledge and Human Information: Characters, Words, Phrases, Sentences, Concepts, etc.
  • Double Levels of Information: Acoustic Signal Level / Symbolic or Linguistic Level
  • Processing and Interaction of the Double-level Information

2015/4/27

Sampling of Signals

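[Figure: a continuous-time signal x(t) over t and its sampled sequence x[n] over n]

Double Levels of Information

[Figure: linguistic units (Character, Word, Sentence) shown against the speech signal]

Speech Signal Processing: Processing of Double-Level Information

[Figure: Speech Signal, Sampling, Processing; Linguistic Structure and Linguistic Knowledge (Lexicon, Grammar); Algorithm; Chips or Computers]

Voice-based Network Access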

[Figure: User Interface, Content Analysis, and User-Content Interaction between the user and the Internet]

  • User Interface: when keyboards/mice are inadequate
  • Content Analysis: helps in browsing/retrieval of multimedia content
  • User-Content Interaction: all text-based interaction can be accomplished by spoken language

User Interface
  • Wireless Communications Technologies are Creating a Whole Variety of User Terminals
    - at Any Time, from Anywhere
    - Smart phones, Hand-held Devices, Notebooks, Vehicular Electronics, Hands-free Interfaces, Home Appliances, Wearable Devices
    - Small in Size, Light in Weight, Ubiquitous, Invisible
  • Post-PC Era
    - Keyboard/Mouse, Most Convenient for PCs, are no longer Convenient
    - human fingers never shrink, and the application environment has changed
  • Service Requirements Growing Exponentially
  • Voice is the Only Interface Convenient for ALL User Terminals at Any Time, from Anywhere, and to the point in one utterance
  • Speech Processing is the only less mature part in the Technology Chain

[Figure: Internet Networks carrying Text Content and Multimedia Content]

Content Analysis
  • Multimedia Technologies are Creating a New World of Multimedia Content
  • Most Attractive Form of the Network Content will be in Multimedia, which usually Includes Speech Information (but Probably not Text)
  • Multimedia Content is Difficult to Summarize and Show on the Screen, thus Difficult to Browse
  • The Speech Information, if Included, usually Tells the Subjects, Topics and Concepts of the Multimedia Content, thus Becomes the Key for Browsing and Retrieval
  • Multimedia Content Analysis based on Speech Information

[Figure: Future Integrated Networks]
  • Realtime Information: weather, traffic, flight schedule, stock price, sports scores
  • Special Services: Google, Facebook, YouTube, Amazon
  • Knowledge Archives: digital libraries, virtual museums
  • Intelligent Working Environment: email processors, intelligent agents, teleconferencing, distance learning, electronic commerce
  • Private Services: personal notebook, business databases, home appliances, network entertainment

User-Content Interaction
  • Wireless and Multimedia Technologies are Creating An Era of Network Access by Spoken Language Processing

[Figure labels: voice information, text information, voice input/output, Multimedia Content, Text Content, Internet, Multimedia Content Analysis, Text Information Retrieval, Voice-based Information Retrieval, Text-to-Speech Synthesis, Spoken and multi-modal Dialogue]

  • Network Access is Primarily Text-based today, but almost all Roles of Texts can be Accomplished by Speech
  • User-Content Interaction can be Accomplished by Spoken and Multi-modal Dialogues
  • Hand-held Devices with Multimedia Functionalities are Commonly used Today
  • Using Speech Instructions to Access Multimedia Content whose Key Concepts are Specified by Speech Information

3.0 Speech Coding

Waveform-based Approaches
  • Pulse-Code Modulation (PCM): binary representation for each sample x[n] by quantization
  • Differential PCM (DPCM): encoding the differences
    - $d[n] = x[n] - x[n-1]$
    - $d[n] = x[n] - \sum_{k=1}^{P} a_k x[n-k]$
  • Adaptive DPCM (ADPCM): with adaptive algorithms

Ref: Haykin, Communication Systems, 4th Ed., 3.7, 3.13, 3.14, 3.15
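The difference between PCM and DPCM can be made concrete with a small sketch. Everything below is an illustrative assumption rather than the slides' own method: the uniform mid-rise quantizer, the 200 Hz test tone, the 8 kHz sampling rate, and the bit allocations are hypothetical. One deliberate deviation is labeled in the code: the difference is taken against the previously reconstructed sample rather than the original x[n-1], so that encoder and decoder stay in step.

```python
import math

def quantize(value, n_bits, v_max):
    """Uniform mid-rise quantizer on [-v_max, v_max): returns (code, reconstructed value)."""
    levels = 2 ** n_bits
    step = 2.0 * v_max / levels
    code = math.floor(value / step)
    code = max(-levels // 2, min(levels // 2 - 1, code))  # clip to the available codes
    return code, (code + 0.5) * step

def pcm(x, n_bits, v_max=1.0):
    """PCM: quantize every sample x[n] independently."""
    return [quantize(s, n_bits, v_max) for s in x]

def dpcm_encode(x, n_bits, d_max):
    """First-order DPCM: quantize d[n] = x[n] - (previous reconstructed sample).

    Assumption: the predictor uses the decoder's reconstruction, not the
    original x[n-1], so quantization errors do not accumulate.
    """
    codes, prev = [], 0.0
    for s in x:
        code, d_hat = quantize(s - prev, n_bits, d_max)
        codes.append(code)
        prev += d_hat  # track what the decoder will reconstruct
    return codes

def dpcm_decode(codes, n_bits, d_max):
    """Rebuild x[n] by accumulating the dequantized differences."""
    step = 2.0 * d_max / 2 ** n_bits
    out, prev = [], 0.0
    for code in codes:
        prev += (code + 0.5) * step
        out.append(prev)
    return out

# Hypothetical test signal: a 200 Hz sine sampled at 8 kHz.
x = [0.8 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]

pcm_rec = [r for _, r in pcm(x, n_bits=8)]
dpcm_codes = dpcm_encode(x, n_bits=4, d_max=0.2)
dpcm_rec = dpcm_decode(dpcm_codes, n_bits=4, d_max=0.2)

pcm_err = max(abs(a - b) for a, b in zip(x, pcm_rec))
dpcm_err = max(abs(a - b) for a, b in zip(x, dpcm_rec))
print(f"PCM  8 bits/sample, max error {pcm_err:.4f}")
print(f"DPCM 4 bits/sample, max error {dpcm_err:.4f}")
```

Because neighbouring samples of this slowly varying tone are close, the differences fit in a much smaller range than the samples themselves, which is why fewer bits per sample can suffice here. A signal that changes faster than d_max per sample would overload this quantizer.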

Speech Source Model and Source Coding
  • Speech Source Model

[Figure: parameters drive an Excitation Generator producing u[n], with spectrum U(ω), U(z); u[n] passes through the Vocal Tract Model G(ω), G(z), g[n] to give x[n]]

    - $x[n] = u[n] * g[n]$
    - $X(\omega) = U(\omega) G(\omega)$
    - $X(z) = U(z) G(z)$
  • digitization and transmission of the parameters will be adequate
  • at the receiver the parameters can produce x[n] with the model
  • far fewer parameters with much slower variation in time lead to far fewer bits required
  • the key to low bit rate speech coding

[Figure: Speech Source Model; waveform x(t) over t and parameter sequence a[n] over n]

Speech Source Model and Source Coding
  • Analysis and Synthesis
  • High computation requirements are the price for low bit rate
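The source-model relations x[n] = u[n] * g[n] and X(ω) = U(ω)G(ω) state the same thing in two domains. The following sketch checks this numerically with a direct DFT. The excitation u and the impulse response g are hypothetical stand-ins (Gaussian noise and a decaying exponential), chosen only for illustration.

```python
import cmath
import random

def convolve(u, g):
    """Linear convolution x[n] = sum_m u[m] g[n - m]."""
    x = [0.0] * (len(u) + len(g) - 1)
    for i, ui in enumerate(u):
        for j, gj in enumerate(g):
            x[i + j] += ui * gj
    return x

def dft(signal, n_points):
    """Direct O(N^2) DFT of `signal`, zero-padded to n_points."""
    padded = list(signal) + [0.0] * (n_points - len(signal))
    return [
        sum(padded[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]

random.seed(0)
u = [random.gauss(0.0, 1.0) for _ in range(24)]  # hypothetical excitation u[n]
g = [0.9 ** n for n in range(16)]                # hypothetical impulse response g[n]

x = convolve(u, g)
n_points = len(x)  # zero-pad to the full output length so no wrap-around occurs

X = dft(x, n_points)
U = dft(u, n_points)
G = dft(g, n_points)
max_diff = max(abs(X[k] - U[k] * G[k]) for k in range(n_points))
print(f"max |X - U*G| over all frequency bins: {max_diff:.2e}")
```

The zero-padding to len(u) + len(g) - 1 points matters: with a shorter DFT the product U·G would correspond to circular rather than linear convolution.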

Simplified Speech Source Model

[Figure: a periodic pulse train generator (voiced) or a random sequence generator (unvoiced), selected by v/u with pitch N, is scaled by the gain G to form the excitation u[n]; u[n] drives the Vocal Tract Model G(z), G(ω), g[n] to produce x[n]]

  • Excitation parameters
    - v/u : voiced/unvoiced
    - N : pitch for voiced
    - G : signal gain
    - together they determine the excitation signal u[n]
  • Vocal Tract parameters
    - {a_k} : LPC coefficients
    - they capture the formant structure of speech signals
    - $G(z) = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$
  • A good approximation, though not precise enough

Reference: 3.3.1-3.3.6 of Rabiner and Juang, or 6.3 of Huang

LPC Vocoder (Voice Coder)
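On the synthesis side, a vocoder built on the simplified source model regenerates speech from v/u, N, G and {a_k}. A minimal sketch of that step follows, assuming a second-order all-pole filter with hypothetical coefficients, a pitch period of 80 samples, and unit-variance Gaussian noise for the unvoiced case. None of these values come from the slides.

```python
import random

def make_excitation(voiced, n_samples, pitch_period, gain, seed=0):
    """Excitation u[n]: periodic pulse train if voiced, random sequence if unvoiced."""
    if voiced:
        u = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    else:
        rng = random.Random(seed)
        u = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return [gain * v for v in u]

def all_pole_filter(u, a):
    """Apply G(z) = 1 / (1 - sum_k a_k z^-k), i.e. x[n] = u[n] + sum_k a_k x[n-k]."""
    x = []
    for n, un in enumerate(u):
        acc = un
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * x[n - k]
        x.append(acc)
    return x

a = [1.3, -0.6]  # hypothetical, stable LPC coefficients (P = 2)

voiced = all_pole_filter(
    make_excitation(True, 240, pitch_period=80, gain=1.0), a)
unvoiced = all_pole_filter(
    make_excitation(False, 240, pitch_period=80, gain=0.1), a)

print("first voiced samples:", [round(v, 4) for v in voiced[:4]])
```

Only a handful of numbers per frame (v/u, N, G, and P coefficients) are needed to drive this synthesis, which is the reason the model supports low bit rates.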

  • N by pitch detection
  • v/u by voicing detection
  • {a_k} can be non-uniformly or vector quantized to reduce the bit rate further

Ref: 3.3 (3.3.1 up to 3.3.9) of Rabiner and Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993

Multipulse LPC
  • poor modeling of u[n] is the main source of quality degradation in the LPC vocoder
  • u[n] replaced by a sequence of pulses
    - $u[n] = \sum_{k} b_k \delta[n - n_k]$
    - roughly 8 pulses per pitch period
    - u[n] close to periodic for voiced
    - u[n] close to random for unvoiced
  • Estimating (b_k, n_k) is a difficult problem
    - analysis by synthesis
    - large amount of computation is the price paid for better speech quality
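One simple way to approach the (b_k, n_k) estimation by analysis by synthesis is a greedy search that places one pulse at a time, each time choosing the position and amplitude that most reduce the remaining squared error. The sketch below is a hedged illustration of that idea only: it uses a plain (unweighted) squared error, a hypothetical second-order filter, and a synthetic target frame, and it is not guaranteed to find the jointly optimal pulse set.

```python
def impulse_response(a, length):
    """g[n] of G(z) = 1 / (1 - sum_k a_k z^-k), truncated to `length` samples."""
    g = []
    for n in range(length):
        acc = 1.0 if n == 0 else 0.0
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * g[n - k]
        g.append(acc)
    return g

def greedy_multipulse(x, g, n_pulses):
    """Pick pulses (b_k, n_k) one at a time; returns the pulses and the error after each step."""
    length = len(x)
    residual = list(x)
    pulses = []
    energies = [sum(v * v for v in residual)]
    for _ in range(n_pulses):
        best_drop, best_b, best_pos = -1.0, 0.0, 0
        for pos in range(length):
            shifted = g[: length - pos]  # g[n - pos] for n = pos .. length-1
            num = sum(r * s for r, s in zip(residual[pos:], shifted))
            den = sum(s * s for s in shifted)
            drop = num * num / den       # error reduction with the best amplitude at pos
            if drop > best_drop:
                best_drop, best_b, best_pos = drop, num / den, pos
        pulses.append((best_b, best_pos))
        for n in range(best_pos, length):
            residual[n] -= best_b * g[n - best_pos]
        energies.append(sum(v * v for v in residual))
    return pulses, energies

FRAME = 60                                          # hypothetical frame length
g = impulse_response([1.3, -0.6], FRAME)            # hypothetical LPC coefficients
true_pulses = [(1.0, 5), (-0.7, 23), (0.5, 41)]     # synthetic excitation for the demo
x = [sum(b * g[n - pos] for b, pos in true_pulses if n >= pos) for n in range(FRAME)]

pulses, energies = greedy_multipulse(x, g, n_pulses=3)
for step, energy in enumerate(energies):
    print(f"after {step} pulses: squared error {energy:.6f}")
```

Each pass scans every candidate position and re-synthesizes its contribution, which illustrates why the computation grows quickly with the number of pulses and the frame length.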

Multipulse LPC
  • Perceptual Weighting
    - $W(z) = \frac{1 - \sum_{k=1}^{P} a_k z^{-k}}{1 - \sum_{k=1}^{P} a_k c^k z^{-k}}, \quad 0 < c < 1$
    - for perceptual sensitivity
    - $W(z) = 1$, if $c = 1$
    - $W(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}$, if $c = 0$
    - practically $c \approx 0.8$
  • Error Evaluation
    - $E = \int |X(\omega) - \hat{X}(\omega)|^2 W(\omega) \, d\omega$
  • Error Minimization and Pulse search
    - $u[n] = \sum_{k} b_k \delta[n - n_k]$
    - $\hat{x}[n] = \sum_{k} b_k g[n - n_k]$
    - $E = E(b_1, n_1, b_2, n_2, \ldots)$
    - sub-optimal solution: finding 1 pulse at a time

Code-Excited Linear Prediction (CELP)
  • Use of VQ to Construct a Codebook of Excitation Sequences
    - a sequence consists of roughly 40 samples
    - a codebook of 512 ~ 1024 patterns is constructed with VQ
    - roughly 512 ~ 1024 excitation patterns are perceptually adequate
  • Excitation Search: analysis by synthesis
    - 9 ~ 10 bits are needed for the excitation of 40 samples, while the {a_k} parameters in G(z) are also vector quantized
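The CELP excitation search can be sketched as an exhaustive analysis-by-synthesis loop over the codebook. The sketch below rests on several assumptions that are not in the slides: the codebook is filled with random Gaussian sequences instead of being trained with VQ, a per-frame gain is fitted for each candidate, the error is a plain squared error without perceptual weighting, and the filter starts each frame from zero state.

```python
import math
import random

def all_pole_filter(u, a):
    """x[n] = u[n] + sum_k a_k x[n-k], i.e. G(z) = 1 / (1 - sum_k a_k z^-k)."""
    x = []
    for n, un in enumerate(u):
        acc = un
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * x[n - k]
        x.append(acc)
    return x

def search_codebook(target, codebook, a):
    """Synthesize every codeword and keep the one with the smallest squared error."""
    best_index, best_gain, best_err = -1, 0.0, float("inf")
    for index, codeword in enumerate(codebook):
        y = all_pole_filter(codeword, a)
        gain = sum(t * v for t, v in zip(target, y)) / sum(v * v for v in y)
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if err < best_err:
            best_index, best_gain, best_err = index, gain, err
    return best_index, best_gain, best_err

FRAME = 40            # samples per excitation sequence
CODEBOOK_SIZE = 512   # number of excitation patterns
rng = random.Random(0)
# Random stand-in codebook; a real one would be constructed with VQ.
codebook = [[rng.gauss(0.0, 1.0) for _ in range(FRAME)] for _ in range(CODEBOOK_SIZE)]

a = [1.3, -0.6]  # hypothetical LPC coefficients
# Demo target: a scaled version of what codeword 137 synthesizes.
target = [2.5 * v for v in all_pole_filter(codebook[137], a)]

index, gain, err = search_codebook(target, codebook, a)
bits_per_frame = math.log2(CODEBOOK_SIZE)
print(f"selected codeword {index}, gain {gain:.3f}, error {err:.2e}")
print(f"index costs {bits_per_frame:.0f} bits for {FRAME} samples")
```

The bit count follows directly from the codebook size: an index into 512 patterns takes 9 bits and an index into 1024 patterns takes 10 bits, whatever the 40 samples contain.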

Code-Excited Linear Prediction (CELP)
  • Receiver
  • {a_k} codewords can be transmitted less frequently than the excitation codewords

Ref: Gold and Morgan, Speech and Audio Signal Processing, John Wiley & Sons, 2000, Chap 33

4.0 Speech Recognition and Voice-based Network Access

Speech Recognition as a pattern recognition problem

[Figure: the unknown speech signal x(t) goes through Feature Extraction to give the feature vector sequence X, then Pattern Matching against the Reference Patterns and Decision Making to give the output word W; the training speech y(t) goes through Feature Extraction to give Y, which feeds the Reference Patterns]

A Simplified Block Diagram

  • Example Input Sentence: this is speech
  • Acoustic Models: (th-ih-s-ih-z-s-p-iy-ch)
  • Lexicon: (th-ih-s) → this, (ih-z) → is, (s-p-iy-ch) → speech
  • Language Model: (this) (is) (speech)
    - P(this) P(is | this) P(speech | this is)
    - P(w_i | w_{i-1}): bi-gram language model
    - P(w_i | w_{i-1}, w_{i-2}): tri-gram language model, etc.

Basic Approach for Large Vocabulary Speech Recognition

[Figure: Input Speech → Front-end Signal Processing → Feature Vectors → Linguistic Decoding and Search Algorithm → Output Sentence; the decoder draws on the Acoustic Models (Acoustic Model Training from Speech Corpora), the Lexicon, and the Language Model (Language Model Construction from Text Corpora)]
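The role of the lexicon in the example above can be illustrated with a toy lookup: given a phone sequence from the acoustic models, find the word sequences whose pronunciations concatenate to it. This is only a sketch under simplifying assumptions: the phone string is taken as error-free, the lexicon holds just the three entries from the example, and no acoustic or language model scores are involved.

```python
# Pronunciation lexicon from the example: phone tuple -> word.
LEXICON = {
    ("th", "ih", "s"): "this",
    ("ih", "z"): "is",
    ("s", "p", "iy", "ch"): "speech",
}

def segment(phones, lexicon):
    """Return every way to split `phones` into a sequence of lexicon words."""
    if not phones:
        return [[]]
    results = []
    for pronunciation, word in lexicon.items():
        k = len(pronunciation)
        if tuple(phones[:k]) == pronunciation:
            for rest in segment(phones[k:], lexicon):
                results.append([word] + rest)
    return results

phones = "th-ih-s-ih-z-s-p-iy-ch".split("-")
print(segment(phones, LEXICON))  # → [['this', 'is', 'speech']]
```

In a real decoder many segmentations compete, and the acoustic and language model probabilities decide among them. Here the tiny lexicon leaves only one.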

[Figure: Simplified HMM, showing Observation Sequences, 1-dim Gaussian Mixtures, and State Transition Probabilities]
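How an HMM assigns a probability to an observation sequence can be sketched with the forward algorithm. The model below is entirely hypothetical: three states, made-up initial and transition probabilities, and discrete emissions over the symbols R, G, B (a simplification; the slide's figure uses Gaussian mixtures for the observation densities).

```python
STATES = [0, 1, 2]
INITIAL = [0.6, 0.3, 0.1]            # hypothetical initial state probabilities
TRANSITION = [                       # hypothetical state transition probabilities
    [0.6, 0.3, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.2, 0.6],
]
EMISSION = [                         # hypothetical discrete observation probabilities
    {"R": 0.7, "G": 0.2, "B": 0.1},
    {"R": 0.1, "G": 0.7, "B": 0.2},
    {"R": 0.2, "G": 0.1, "B": 0.7},
]

def forward(observations):
    """P(observations | model), summed over all state paths by the forward recursion."""
    alpha = [INITIAL[s] * EMISSION[s][observations[0]] for s in STATES]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[p] * TRANSITION[p][s] for p in STATES) * EMISSION[s][symbol]
            for s in STATES
        ]
    return sum(alpha)

sequence = "RGBGGBBGRRR"
print(f"P({sequence}) = {forward(sequence):.3e}")
```

The recursion costs on the order of (number of states)² operations per observation, instead of enumerating every possible state path.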

[Figure: example observation sequence R G B G G B B G R R R]

Peripheral Processing for Human Perception (p. 34 of 7.0)

Mel-scale Filter Bank
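The spacing of a mel-scale filter bank can be sketched by placing filter centers uniformly on the mel axis and mapping them back to Hz. The conversion formula used here, mel = 2595 log10(1 + f/700), is one commonly used convention and is an assumption; the slides do not state which mapping they use. The number of filters and the frequency range are likewise hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """One common Hz-to-mel mapping (assumed, not taken from the slides)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_low, f_high):
    """Center frequencies in Hz of filters spaced uniformly on the mel scale."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

centers = mel_center_frequencies(n_filters=10, f_low=0.0, f_high=4000.0)
print([round(c) for c in centers])
```

The centers come out densely packed at low frequencies and increasingly far apart at high frequencies, mirroring the coarser frequency resolution of hearing at higher frequencies.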

N-gram

[Figure: tri-gram windows sliding over the word sequence W1 W2 W3 W4 W5 W6 ...... WR]

  this        50000
  this is     500
  this is a   5
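Reading the three numbers above as occurrence counts in a text corpus (an interpretation, since the slide residue does not label them), n-gram probabilities follow as ratios of counts. A minimal sketch of that maximum-likelihood estimate, without any smoothing:

```python
# Counts as they appear in the example, keyed by word tuple.
COUNTS = {
    ("this",): 50000,
    ("this", "is"): 500,
    ("this", "is", "a"): 5,
}

def conditional_probability(history, word, counts):
    """Estimate P(word | history) = count(history + word) / count(history)."""
    history = tuple(history)
    return counts[history + (word,)] / counts[history]

p_is_given_this = conditional_probability(["this"], "is", COUNTS)        # 500 / 50000
p_a_given_this_is = conditional_probability(["this", "is"], "a", COUNTS)  # 5 / 500

print(f"P(is | this)   = {p_is_given_this}")
print(f"P(a | this is) = {p_a_given_this_is}")
```

Any word sequence never seen in the corpus would get probability zero under this estimate, which is why practical language models add smoothing.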

Text-to-speech Synthesis

[Figure: Input Text → Text Analysis and Letter-to-sound Conversion (Lexicon and Rules) → Prosody Generation (Prosodic Model) → Signal Processing and Concatenation (Voice Unit Database) → Output Speech Signal]

  • Transforming any input text into corresponding speech signals
  • E-mail/Web page reading
  • Prosodic modeling
  • Basic voice units: rule-based, non-uniform units/corpus-based, model-based