4. Speech Recognition

Modern approaches to large vocabulary continuous speech recognition (LVCSR) are surprisingly similar in their high-level structure [111]. The work described herein is based on the CMU Sphinx 3.2 system, but the general approach is applicable to other speech recognizers [49,74]. The explanation of LVCSR in this chapter is based on a simple probabilistic model presented in [80,111].

The human vocal apparatus has mechanical limitations that prevent rapid changes to the sound generated by the vocal tract. As a result, speech signals may be considered stationary over short intervals, i.e., their spectral characteristics remain relatively unchanged for several milliseconds at a time. DSP techniques may be used to summarize the spectral characteristics of a speech signal into a sequence of acoustic observation vectors. Typically, 100 such vectors are used to represent one second of speech. Speech recognition then becomes a statistical problem of deriving the word sequence that has the highest likelihood of corresponding to the observed sequence of acoustic vectors. This notion is captured by the equation:

\begin{displaymath}
\hat{W}=\arg\max_{W}\,P(W\vert Y)
\end{displaymath} (4.1)

Here, $W=w_{1},w_{2},\ldots,w_{n}$ is a sequence of $n$ words and $Y=y_{1},y_{2},\ldots,y_{T}$ is a sequence of $T$ acoustic observation vectors. Equation 4.1 may be read as follows: $\hat{W}$ is the particular word sequence $W$ that has the maximum a posteriori probability given the observation sequence $Y$. Using Bayes' rule, this equation may be rewritten as:

\begin{displaymath}
\hat{W}=\arg\max_{W}\frac{P(Y\vert W)P(W)}{P(Y)}
\end{displaymath} (4.2)

$P(Y\vert W)$ denotes the probability of the acoustic vector sequence $Y$ given the word sequence $W$. $P(W)$ denotes the probability with which the word sequence $W$ occurs in the language. $P(Y)$ denotes the probability with which the acoustic vector sequence $Y$ occurs in the spoken language. Because $P(Y)$ is independent of the word sequence $W$, it does not affect the maximization, and $\hat{W}$ can be computed without knowing $P(Y)$. Thus Equation 4.2 may be rewritten as:
\begin{displaymath}
\hat{W}=\arg\max_{W}\,P(Y\vert W)P(W)
\end{displaymath} (4.3)
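The step from Equation 4.2 to Equation 4.3 can be checked numerically on a toy example. The following is a minimal sketch, not the Sphinx decoder: the candidate word sequences, the probability values, and the helper names `decode` and `posterior` are all hypothetical and chosen only for illustration. It also assumes a closed candidate set, so that $P(Y)$ can be approximated by summing over those candidates.

```python
import math

# Hypothetical toy values, invented for illustration only.
# acoustic_likelihood[w] plays the role of P(Y|W) for one fixed
# observation sequence Y; language_prior[w] plays the role of P(W).
acoustic_likelihood = {
    "recognize speech": 1.0e-5,
    "wreck a nice beach": 4.0e-5,
    "recognise peach": 0.5e-5,
}
language_prior = {
    "recognize speech": 1.0e-3,
    "wreck a nice beach": 1.0e-4,
    "recognise peach": 1.0e-5,
}


def decode(candidates):
    """Equation 4.3: pick the W that maximizes P(Y|W) P(W).

    The product is evaluated as a sum of logarithms, which avoids
    underflow when the probabilities are very small.
    """
    return max(
        candidates,
        key=lambda w: math.log(acoustic_likelihood[w]) + math.log(language_prior[w]),
    )


def posterior(candidates):
    """Equation 4.2: P(W|Y) = P(Y|W) P(W) / P(Y).

    Assumption: P(Y) is approximated by summing the joint probability
    over the closed candidate set.
    """
    joint = {w: acoustic_likelihood[w] * language_prior[w] for w in candidates}
    p_y = sum(joint.values())
    return {w: j / p_y for w, j in joint.items()}


candidates = list(acoustic_likelihood)
best_unnormalized = decode(candidates)
post = posterior(candidates)
best_normalized = max(post, key=post.get)
```

In this sketch the candidate with the highest acoustic likelihood is not the one selected, because its language model probability is much lower. Dividing by $P(Y)$ rescales every candidate by the same constant, so `best_normalized` and `best_unnormalized` name the same word sequence.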

The set of DSP algorithms that convert the speech signal into the acoustic vector sequence $Y$ is commonly referred to as the front end. The quantity $P(Y\vert W)$ is generated by evaluating an acoustic model. The term $P(W)$ is generated from a language model.
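The framing step of a front end can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the Sphinx front end: the 16 kHz sample rate, the 25 ms window length, and the function names `frame_signal` and `log_energy` are assumptions, and only the 10 ms frame shift follows from the figure of 100 vectors per second given above. A single log-energy value stands in for the richer spectral summary that a real front end would compute for each frame.

```python
import math


def frame_signal(samples, sample_rate=16000, frame_ms=25.0, shift_ms=10.0):
    """Split a signal into overlapping short frames.

    A 10 ms shift yields roughly 100 frames per second. The 25 ms window
    and 16 kHz sample rate are assumed values, not taken from the text.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


def log_energy(frame):
    """Placeholder one-dimensional feature: log of the frame energy."""
    energy = sum(x * x for x in frame)
    return math.log(energy + 1e-10)  # small offset guards against log(0)


sample_rate = 16000
# One second of a synthetic 440 Hz tone standing in for recorded speech.
signal = [math.sin(2.0 * math.pi * 440.0 * n / sample_rate) for n in range(sample_rate)]

frames = frame_signal(signal, sample_rate)
# Y = y_1, ..., y_T: one (here one-dimensional) observation vector per frame.
observations = [[log_energy(f)] for f in frames]
```

Each entry of `observations` corresponds to one vector $y_{t}$ of the sequence $Y$. Because consecutive frames overlap, the number of vectors per second is governed by the frame shift rather than by the window length.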


Binu Mathew