SiliconIntelligence

# 4. Speech Recognition

Modern approaches to large vocabulary continuous speech recognition are surprisingly similar in terms of their high-level structure [111]. The work described herein is based on the CMU Sphinx 3.2 system, but the general approach is applicable to other speech recognizers [49,74]. The explanation of large vocabulary continuous speech recognition (LVCSR) in this chapter is based on a simple probabilistic model presented in [80,111]. The human vocal apparatus has mechanical limitations that prevent rapid changes to sound generated by the vocal tract. As a result, speech signals may be considered stationary, i.e., their spectral characteristics remain relatively unchanged for several milliseconds at a time. DSP techniques may be used to summarize the spectral characteristics of a speech signal into a sequence of acoustic observation vectors. Typically, 100 such vectors will be used to represent one second of speech. Speech recognition then becomes a statistical problem of deriving the word sequence that has the highest likelihood of corresponding to the observed sequence of acoustic vectors. This notion is captured by the equation:

$\hat{W} = \mathop{\arg\max}_{W} P(W \mid A)$  (4.1)

Here, $W = w_1, w_2, \ldots, w_n$ is a sequence of words and $A = a_1, a_2, \ldots, a_m$ is a sequence of acoustic observation vectors. Equation 4.1 may be read as: $\hat{W}$ is the particular word sequence that has the maximum a posteriori probability given the observation sequence $A$. Using Bayes' rule, this equation may be rewritten as:

$\hat{W} = \mathop{\arg\max}_{W} \dfrac{P(A \mid W)\, P(W)}{P(A)}$  (4.2)

$P(A \mid W)$ denotes the probability of the acoustic vector sequence $A$ given the word sequence $W$. $P(W)$ denotes the probability with which the word sequence $W$ occurs in the language. $P(A)$ denotes the probability with which the acoustic vector sequence $A$ occurs in the spoken language. $P(A)$ is independent of the word sequence, so the maximizing word sequence can be found without computing it. Thus Equation 4.2 may be rewritten as:

$\hat{W} = \mathop{\arg\max}_{W} P(A \mid W)\, P(W)$  (4.3)
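As a toy illustration of this maximization, the decoder's task can be sketched as scoring each candidate word sequence by the sum of its acoustic and language model log-probabilities and keeping the best one. The candidate sentences and all numbers below are invented for illustration; a real decoder searches an enormous hypothesis space rather than a small fixed list:

```python
# Hypothetical scores for three candidate word sequences W given one
# fixed acoustic observation sequence A. "log_p_a_given_w" stands in
# for an acoustic model score log P(A|W); "log_p_w" stands in for a
# language model score log P(W). All values are invented.
candidates = {
    "recognize speech":   {"log_p_a_given_w": -12.0, "log_p_w": -4.0},
    "wreck a nice beach": {"log_p_a_given_w": -11.5, "log_p_w": -9.0},
    "wreck an ice beach": {"log_p_a_given_w": -13.0, "log_p_w": -11.0},
}

def map_decode(candidates):
    # Equation 4.3 in the log domain:
    # argmax over W of [log P(A|W) + log P(W)].
    return max(candidates,
               key=lambda w: candidates[w]["log_p_a_given_w"]
                           + candidates[w]["log_p_w"])

best = map_decode(candidates)  # "recognize speech": -16.0 beats -20.5 and -24.0
```

Working in the log domain avoids numerical underflow when many small probabilities are multiplied, which is how practical recognizers evaluate this product.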

The set of DSP algorithms that convert the speech signal into the acoustic vector sequence is commonly referred to as the front end. The quantity $P(A \mid W)$ is generated by evaluating an acoustic model. The term $P(W)$ is generated from a language model.
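The first step of such a front end can be sketched as splitting the signal into short overlapping frames, each of which is later summarized into one acoustic observation vector. The 25 ms window and 10 ms hop below are typical values (yielding roughly 100 vectors per second, as noted above), not parameters taken from Sphinx 3.2:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a 1-D speech signal into overlapping fixed-length frames.

    Each frame is short enough that the signal within it can be treated
    as stationary; feature extraction (not shown) would then reduce each
    frame to one acoustic observation vector. Window/hop sizes are
    illustrative defaults.
    """
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    hop_len = sample_rate * hop_ms // 1000       # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# One second of (silent) audio yields 98 frames of 400 samples each,
# i.e. close to the 100 observation vectors per second cited above.
frames = frame_signal(np.zeros(16000))
```

The slight shortfall from 100 frames is an edge effect: the last partial window is dropped rather than zero-padded in this sketch.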


Binu Mathew