SiliconIntelligence

4.4 Overall Operation

HMMs are constructed for all known triphones. A pronunciation dictionary is used to convert words into triphone sequences with overlapping contexts. For example, the isolated word dissertation, whose pronunciation is the phone sequence D IH S ER T EY SH AH N, is expanded to SIL-D+IH, D-IH+S, IH-S+ER, S-ER+T, ER-T+EY, T-EY+SH, EY-SH+AH, SH-AH+N, AH-N+SIL. There are many more expansions corresponding to all the words that could possibly precede or succeed this word in a sentence, that is, words that could end in D+IH or start with AH-N.

A data structure known as a lexical tree (in Sphinx terminology) is constructed, and all words in the dictionary are entered into it. The roots of the tree correspond to the set of all triphones that start any word in the dictionary. Each node in the tree points to the next triphone in the expanded pronunciation of a word, and common triphone sequences may be shared within the tree. The overall effect is that of combining all the triphone HMMs by adding null transitions from the final state of one triphone HMM to the initial state of its successor. To model continuous speech, null transitions are also added from the final state of each word to the initial state of every word. Triphones that occur at the end of a word are specially marked so that a language model may be consulted at those points. The lexical tree is thus a multi-rooted tree in which each node points to an HMM and a successor node; word-exit triphones have multiple successors.

Given an acoustic vector sequence $Y$, each vector in the sequence is applied successively to the HMMs, and the probability that each HMM generated that vector is noted. At each step, transitions are made to successor nodes. On reaching a word-exit triphone, the state sequence history is consulted to find the word that has been recognized, and the last n words (usually n = 3) are checked against a language model for further analysis.
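The context-dependent expansion described above can be sketched as follows. This is a minimal illustration, not Sphinx code; the function name and the convention of padding word boundaries with SIL are assumptions.

```python
def expand_triphones(phones, left="SIL", right="SIL"):
    """Expand a phone sequence into overlapping triphones of the form
    LEFT-PHONE+RIGHT, padding the word boundaries with silence."""
    ctx = [left] + list(phones) + [right]
    # One triphone per phone: its neighbors supply the left/right context.
    return [f"{ctx[i - 1]}-{ctx[i]}+{ctx[i + 1]}"
            for i in range(1, len(ctx) - 1)]

print(expand_triphones("D IH S ER T EY SH AH N".split()))
```

For a word in continuous speech, the SIL padding would be replaced by the actual last phone of the preceding word and first phone of the following word, yielding the many extra expansions mentioned above.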
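Entering words into a multi-rooted lexical tree with shared prefixes and marked word exits might be sketched like this. The class and function names are hypothetical; real decoders store considerably more per node (HMM state scores, active-path bookkeeping).

```python
class LexNode:
    """One lexical-tree node: points to a triphone HMM and its successors."""
    def __init__(self, triphone):
        self.triphone = triphone   # which triphone HMM this node evaluates
        self.children = {}         # successor nodes, keyed by triphone
        self.word = None           # set only on word-exit nodes

def add_word(roots, word, triphones):
    """Enter a word's triphone expansion into the tree, sharing any common
    prefix already present; mark the final node as a word exit."""
    level, node = roots, None
    for t in triphones:
        node = level.setdefault(t, LexNode(t))  # reuse shared prefix nodes
        level = node.children
    node.word = word   # the language model is consulted when search reaches here
    return node
```

Here `roots` is a plain dictionary mapping word-initial triphones to root nodes, reflecting the multi-rooted structure described above.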
The search is performed by means of a well-known dynamic programming algorithm, Viterbi beam search [74]. The acoustic and language models are strongly coupled, though language model evaluation may be deferred until the acoustic model has been evaluated. Together they consume almost 99% of the run time of Sphinx.



Binu Mathew