

4.5 Architectural Implications

A basic understanding of the acoustic and language models is necessary to appreciate the architectural implications and scaling characteristics of speech recognition. The lexical tree is a complex data structure that results in considerable pointer chasing at run time. Which nodes are accessed depends heavily on the sentences being spoken, and the size of the tree depends on the vocabulary size. However, there is scope for architectural optimization. The opportunity stems from the fact that acoustic vectors are evaluated successively: when an HMM evaluated for the current vector generates a probability above a certain threshold, its successors will be evaluated in the next time step. Thus there is always a list of currently active HMMs/lextree nodes and a list of nodes that will become active next. Since evaluating each HMM takes a deterministic number of operations, and thus a fixed number of clock cycles, this information can be used to prefetch nodes ahead of when they are evaluated.
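The C fragment below is a minimal sketch of this idea. The lextree_node_t layout, the hmm_eval helper, the active-list representation and the PREFETCH_DISTANCE constant are hypothetical illustrations, not Sphinx data structures; the point is only that a fixed per-node evaluation cost makes it possible to issue software prefetches a known number of nodes ahead.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 4   /* hypothetical: nodes evaluated per prefetch latency */

typedef struct lextree_node {
    float                scores[3];     /* per-state HMM scores (3-state HMM assumed)  */
    struct lextree_node *children;      /* successor nodes in the lexical tree         */
    int                  n_children;
    int                  senone_id[3];  /* indices of the shared probability densities */
} lextree_node_t;

/* Assumed helper: evaluates one HMM for the current acoustic vector. */
extern float hmm_eval(lextree_node_t *node);

/* Evaluate one frame: walk the current active list, and because each HMM
 * evaluation costs a roughly fixed number of cycles, prefetch the node that
 * will be needed PREFETCH_DISTANCE evaluations from now.                    */
void evaluate_active_list(lextree_node_t **active, int n_active,
                          lextree_node_t **next_active, int *n_next,
                          float beam_threshold)
{
    *n_next = 0;
    for (int i = 0; i < n_active; i++) {
        if (i + PREFETCH_DISTANCE < n_active)
            __builtin_prefetch(active[i + PREFETCH_DISTANCE], 0 /* read */, 1);

        float score = hmm_eval(active[i]);        /* deterministic work per node */
        if (score > beam_threshold) {
            /* Successors become active in the next time step. */
            for (int c = 0; c < active[i]->n_children; c++)
                next_active[(*n_next)++] = &active[i]->children[c];
        }
    }
}
```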

Since the number of triphones and words in a language is relatively stable, it might appear that the workload will never grow. In reality this is not the case, because of the probability density function $B_{i}(Y_{t})$. In the past, speech recognizers used subvector quantized models, which are easy to compute. These methods use a code book of reference acoustic vectors. Each acoustic vector obtained from the front end is compared against the code book to find the index $c$ of the closest match, and the probability density function then reduces to a table lookup of the form $B[i][c]$. While this is computationally efficient, discretizing the observation probability introduces excessive quantization error and thereby poor recognition accuracy.
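The sketch below illustrates why the quantized scheme is so cheap: the per-state score collapses to a nearest-neighbor search followed by a table lookup. The code book size of 256 and the function names are assumptions for illustration, not values taken from any particular recognizer.

```c
#include <float.h>

#define CODEBOOK_SIZE 256   /* hypothetical code book size       */
#define VEC_DIM        39   /* acoustic feature vector dimension */

/* Find the code book index c of the reference vector closest (in Euclidean
 * distance) to the incoming acoustic vector y.                              */
static int quantize(const float y[VEC_DIM],
                    const float codebook[CODEBOOK_SIZE][VEC_DIM])
{
    int   best = 0;
    float best_dist = FLT_MAX;
    for (int c = 0; c < CODEBOOK_SIZE; c++) {
        float d = 0.0f;
        for (int n = 0; n < VEC_DIM; n++) {
            float diff = y[n] - codebook[c][n];
            d += diff * diff;
        }
        if (d < best_dist) { best_dist = d; best = c; }
    }
    return best;
}

/* With the vector quantized, the observation probability of state i is just
 * a table lookup B[i][c]; all the expensive Gaussian math is precomputed.   */
static float observation_prob(int state_i, int c,
                              const float B[][CODEBOOK_SIZE])
{
    return B[state_i][c];
}
```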

To obtain better accuracy, modern systems use a continuous probability density function and the common choice is a multivariate mixture Gaussian in which case the computation may be represented as:


\begin{equation}
B_{i}(Y_{t})=\sum_{m=1}^{M}c_{im}\sum_{n=1}^{N}\left(Y_{t}[n]-\mu_{im}[n]\right)^{2}\times V_{im}[n]
\tag{4.4}
\end{equation}

Here, $\mu_{im}$ is the mean and $V_{im}$ the variance of mixture component $m$ of state $i$, and $c_{im}$ is the mixture weight. The Hub-4 speech model used for this research was obtained from CMU; its designers chose $M$ and $N$ to be 8 and 39 respectively. Note that the outer summation $\sum_{m=1}^{M}$ denotes an addition in the logarithmic domain. Normally the inner term would involve exponentiation to compute a weighted Mahalanobis-like distance, but it reduces to simple arithmetic by keeping all the parameters in the logarithmic domain [91,111]. As a consequence, the outer summation must be carried out in the logarithmic domain, which may be implemented using table lookup based extrapolation. This strategy is troublesome if the processor's L1 D-cache is not large enough to hold the lookup table.
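The fragment below is a minimal sketch of Equation 4.4 computed in the logarithmic domain. The table size, the LOG_SCALE constant and the sign conventions are assumptions for illustration and do not reproduce Sphinx 3.2's actual implementation; normalization constants and signs are assumed to be folded into the precomputed parameters.

```c
#include <math.h>

#define M  8                     /* mixtures per density  */
#define N 39                     /* feature dimensions    */

#define LOGADD_TABLE_SIZE 65536  /* hypothetical size; must fit in the cache
                                    for the lookup to be cheap               */
#define LOG_SCALE 1000.0f        /* hypothetical fixed-point log scale       */

static float logadd_table[LOGADD_TABLE_SIZE];

/* Fill the correction table once at startup: table[i] = log(1 + e^(-i/scale)). */
void logadd_init(void)
{
    for (int i = 0; i < LOGADD_TABLE_SIZE; i++)
        logadd_table[i] = logf(1.0f + expf(-(float)i / LOG_SCALE));
}

/* log(e^a + e^b) computed as max(a,b) plus a table lookup of the correction
 * term, avoiding expf()/logf() calls in the inner loop.                     */
static inline float log_add(float a, float b)
{
    float hi = a > b ? a : b;
    float lo = a > b ? b : a;
    int   idx = (int)((hi - lo) * LOG_SCALE);
    if (idx >= LOGADD_TABLE_SIZE) return hi;   /* difference too large to matter */
    return hi + logadd_table[idx];
}

/* Equation 4.4 with all parameters kept in the log domain: the inner loop is
 * plain multiply/add arithmetic, and the outer mixture sum is a log-add.    */
float gmm_log_prob(const float y[N], const float mean[M][N],
                   const float var[M][N], const float logweight[M])
{
    float result = 0.0f;
    for (int m = 0; m < M; m++) {
        float s = logweight[m];
        for (int n = 0; n < N; n++) {
            float d = y[n] - mean[m][n];
            s += d * d * var[m][n];    /* weighted Mahalanobis-like distance */
        }
        result = (m == 0) ? s : log_add(result, s);
    }
    return result;
}
```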

If each HMM state uses a separate probability density function, then the system is said to be fully continuous. Thus the peak workload for an English speech recognizer would correspond to the evaluation of about 60,000 probability density functions and HMMs, as well as an associated lextree traversal that is proportional to the number of words in the vocabulary. Fully continuous models are not popular for two reasons:

  1. Their computational complexity makes them orders of magnitude slower than real time on current processors.
  2. Their parameter estimation problem and sparse training sets lead to low recognition accuracy.
The parameter estimation problem is particularly difficult. For $M=8$ and $N=39$, Equation 4.4 needs $39\times2\times8+8=632$ parameters for the values of $\mu_{im}$, $V_{im}$ and $c_{im}$. For a total of 60,000 triphones with three HMM states each, this adds up to 113.7 million parameters. The training data is often insufficient to estimate that many parameters, so the use of fully continuous models leads to an increased word error rate. The usual solution is to cluster HMM states together and share a probability density function among several states; such clustering methods are an area of active research. A speech recognition system that uses clustered probability density functions is called a semicontinuous or tied-mixture system, and almost all advanced large vocabulary speech recognizers currently fall in this category.

The Hub-4 speech model used to evaluate Sphinx 3.2 contains approximately 6000 probability density functions, so on average 30 HMM states share a single function. This ratio could change when a model is trained on a larger data set, leading to a proportionate increase in compute complexity. Another possibility is an increase in $M$, the number of mixtures per function, which would again proportionately increase the compute cycles. A third possibility is increasing the size of the context from triphones to quinphones (five phones: the current phone plus two left and two right neighbors). The use of quinphones will lead to an increase in the number of probability density functions that need to be evaluated, which will be further multiplied by the ratio of quinphones to triphones in the language.
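The following fragment is a minimal sketch of how such sharing is commonly organized; the names (senone_of_state, frame_score_cache) and sizes are hypothetical rather than Sphinx 3.2's actual data structures. Each HMM state carries an index into a pool of shared probability density functions, so a density shared by roughly 30 states is evaluated at most once per frame.

```c
#define N_SHARED_PDF  6000    /* shared probability density functions (Hub-4 scale) */
#define N_HMM_STATES  180000  /* hypothetical: 60,000 triphones x 3 states          */
#define VEC_DIM       39

/* Mapping from an HMM state to the shared density it uses (built at model
 * load time by the clustering procedure).                                   */
static int senone_of_state[N_HMM_STATES];

/* Per-frame cache: score of each shared density for the current acoustic
 * vector, plus a flag telling whether it has been computed yet this frame.  */
static float frame_score_cache[N_SHARED_PDF];
static unsigned char cache_valid[N_SHARED_PDF];

/* Assumed helper: evaluates the Gaussian mixture of one shared density. */
extern float gmm_log_prob_for_pdf(int pdf_id, const float y[VEC_DIM]);

/* Observation probability of HMM state s for acoustic vector y.  Because
 * many states map to each shared density, most calls hit the cache.         */
float state_log_prob(int s, const float y[VEC_DIM])
{
    int pdf = senone_of_state[s];
    if (!cache_valid[pdf]) {
        frame_score_cache[pdf] = gmm_log_prob_for_pdf(pdf, y);
        cache_valid[pdf] = 1;
    }
    return frame_score_cache[pdf];
}
```

The cache_valid flags are assumed to be cleared at the start of every frame. Computing all shared densities up front for a frame, instead of on demand from the HMM search, is essentially the decoupling discussed next.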

Though traditional speech recognizers couple the evaluation of HMMs and Gaussians tightly, the two can be decoupled in the interest of extracting greater levels of thread parallelism, an approach that is investigated further in Chapter 5.


