SiliconIntelligence

4.2 Acoustic Model

Equation 4.3 requires the quantity $P(Y\vert W)$, the probability of an acoustic vector sequence $Y$ given a word sequence $W$, in order to find the most probable word sequence. A simplistic approach would be to obtain several samples of each possible word sequence, convert each sample to the corresponding acoustic vector sequence, and compute a statistical similarity metric between the given acoustic vector sequence $Y$ and the set of known samples. For large vocabulary speech recognition this is not feasible because the set of possible word sequences is very large. Instead, words may be represented as sequences of basic sounds. Once the statistical correspondence between the basic sounds and acoustic vectors is known, the required probability can be computed.

The basic sounds from which word pronunciations can be composed are known as phones or phonemes. Approximately 50 phones may be used to pronounce any word in the English language. For example, the CMU dictionary lists the pronunciation for dissertation as:

While phones are an excellent means of encoding word pronunciation, they are less than ideal for recognizing speech. The mechanical limits of the human vocal apparatus lead to co-articulation effects, where the beginning and end of a phone are modified by the preceding and succeeding phones. Recognizing multiple phone units in context therefore tends to be more accurate than recognizing individual phones. Current speech recognition systems deal with three-tuples of phones called triphones. It is customary to denote triphones as $left\_context-current\_phone+right\_context$. For example, SH-AH+N is a triphone that represents the context of the AH phone in the word dissertation. The final N phone in ``dissertation'' can be modeled either with a cross-word triphone whose right context is the first phone of the next word, or with the triphone AH-N+SIL, where SIL is a special phone that denotes silence. Although there are approximately $50\times50\times50=125,000$ possible triphones, only about 60,000 actually occur in English.
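The triphone notation can be illustrated with a short Python sketch. The function name `to_triphones` and its boundary parameters are hypothetical, introduced only for illustration. The input is just the three phones SH, AH and N named in the text, treated as a standalone fragment rather than a full dictionary entry, so the SIL left context on the first triphone is an artifact of the example.

```python
def to_triphones(phones, left_boundary="SIL", right_boundary="SIL"):
    """Expand a phone sequence into left-current+right triphone labels.

    The boundary arguments supply the context for the first and last
    phone: SIL for silence, or a neighboring word's phone to form a
    cross-word triphone. (Hypothetical helper, not from the source.)
    """
    padded = [left_boundary] + list(phones) + [right_boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Illustrative fragment only: the phones SH, AH, N mentioned in the text.
fragment = ["SH", "AH", "N"]

# Word followed by silence: the final N becomes AH-N+SIL.
silence_final = to_triphones(fragment)

# Cross-word case: the right context is the first phone of the next word
# (here assumed, for illustration, to be DH).
cross_word = to_triphones(fragment, right_boundary="DH")

print(silence_final)
print(cross_word)
```

Passing the neighboring word's phone as `right_boundary` gives the cross-word variant, while the default gives the silence-terminated variant described above.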

The probability that an acoustic vector sequence corresponds to a particular triphone may be estimated using a Hidden Markov Model (HMM). Current speech recognizers use an HMM with three internal states plus an entry state and an exit state. The topology of the HMM is shown in Figure 4.2. An HMM is a probabilistic finite state machine that generates observation sequences. If the model is in state $S_{i}$ at time step $t$, then it produces the acoustic vector $Y_{t}$ with probability $B_{i}(Y_{t})$ and switches to state $S_{j}$ with probability $A_{ij}$. The problem of computing $P(Y\vert W)$ now becomes what is known as the evaluation problem for HMMs: the problem of estimating the probability with which a given HMM could have generated the observation sequence $Y$. The evaluation problem can be solved using the Forward/Backward algorithm for HMMs, but since the optimal state sequence is needed at a later stage, it is common to do a more expensive Viterbi search, which can compute the probability and uncover the optimal state sequence simultaneously [80].
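A minimal Python sketch of a Viterbi search over a three-state left-to-right HMM follows. It is an illustration under stated assumptions, not the recognizer's actual implementation. The non-emitting entry and exit states are replaced by an initial distribution that starts in the first state and by picking the best final state. All transition and emission values are made up. The emission table `B[t][i]` stands in for precomputed values of $B_{i}(Y_{t})$. Scores are kept in the log domain to avoid underflow on long sequences.

```python
import math


def safe_log(p):
    """Natural log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0.0 else float("-inf")


def viterbi(log_A, log_B, log_pi):
    """Return (best log probability, best state path).

    log_A[i][j]: log probability of switching from state i to state j
    log_B[t][i]: log probability that state i produces frame t
    log_pi[i]:   log probability of starting in state i
    """
    n_states = len(log_pi)
    score = [log_pi[i] + log_B[0][i] for i in range(n_states)]
    back = []  # back[t-1][j] = best predecessor of state j at frame t
    for t in range(1, len(log_B)):
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] + log_A[i][j])
            new_score.append(score[best_i] + log_A[best_i][j] + log_B[t][j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state to recover the state sequence.
    last = max(range(n_states), key=lambda i: score[i])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


# Made-up parameters: each state may loop on itself or advance by one.
A = [[0.6, 0.4, 0.0],
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]
pi = [1.0, 0.0, 0.0]
# Made-up emission probabilities for four frames (rows) and three states.
B = [[0.9, 0.1, 0.1],
     [0.2, 0.8, 0.1],
     [0.1, 0.7, 0.2],
     [0.1, 0.2, 0.9]]

log_A = [[safe_log(p) for p in row] for row in A]
log_B = [[safe_log(p) for p in row] for row in B]
log_pi = [safe_log(p) for p in pi]

best_logp, best_path = viterbi(log_A, log_B, log_pi)
print(best_path, math.exp(best_logp))
```

The search keeps only the best-scoring predecessor for each state at each frame. Replacing the `max` over predecessors with a sum of probabilities would give the forward pass of the Forward/Backward algorithm, which solves the evaluation problem but does not yield a state sequence.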

Figure 4.2: Triphone HMM
\includegraphics{figures/speech_algo/hmm_3state}



Binu Mathew