4.1 Front End

The signal processing front end summarizes the spectral characteristics of the speech waveform into a sequence of acoustic vectors that are suitable for processing by the acoustic model. Figure 4.1 shows the stages of this transformation.

Figure 4.1: Signal Processing Front End
\includegraphics[scale=0.9]{figures/speech_algo/fe}

Frame Blocking: The digitized speech signal is blocked into overlapping frames. It is common to use 100 frames per second, so a new frame is started every 10 ms. Each frame consists of a fresh 10 ms segment of signal data extended by the final 7.5 ms of the preceding segment and the first 7.5 ms of the following segment. Thus, even though a new frame starts every 10 ms, each frame is 25 ms in duration and consecutive frames overlap by 15 ms. The overlap reduces artifacts that would otherwise arise from discontinuities at frame boundaries.
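
As an illustration, the blocking step can be sketched in a few lines of NumPy. This is a minimal sketch under the parameters above, not Sphinx's actual implementation; the function name and defaults are illustrative.

\begin{verbatim}
import numpy as np

def block_frames(signal, rate=16000, frame_ms=25, shift_ms=10):
    # 25 ms frames starting every 10 ms: at 16 kHz this yields
    # 400-sample frames spaced 160 samples apart.
    frame_len = rate * frame_ms // 1000
    shift = rate * shift_ms // 1000
    n_frames = 1 + (len(signal) - frame_len) // shift
    return np.stack([signal[i * shift : i * shift + frame_len]
                     for i in range(n_frames)])
\end{verbatim}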

Preemphasis: This stage spectrally flattens the frame using a first-order filter. The transformation may be described as:

\begin{displaymath}
Y_{0}[n]=x[n]-\alpha x[n-1],\,\,\,\,\,\,0.9\leq\alpha\leq1,\,\,\,\,\,\,0<n<Samples\,per\,frame
\end{displaymath}

Here, $x[n]$ refers to the $n^{th}$ speech sample in the frame. Sphinx uses $\alpha=0.97$, and the sampling rate is typically 8 kHz or 16 kHz with 16-bit samples.
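
A minimal NumPy sketch of this filter follows. It treats the frame in isolation; an actual front end would carry the last sample of the previous frame across the boundary.

\begin{verbatim}
def preemphasize(frame, alpha=0.97):
    # Y0[n] = x[n] - alpha * x[n-1]; the first sample is left
    # unchanged since it has no predecessor within the frame.
    x = frame.astype(float)
    y = x.copy()
    y[1:] = x[1:] - alpha * x[:-1]
    return y
\end{verbatim}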

Hamming Window: In this stage a Hamming window is applied to the preemphasized frame to reduce the spectral leakage caused by discontinuities at the frame edges when the FFT is applied. The transformation is:


\begin{displaymath}
Y_{1}[n]=Y_{0}[n]\times H[n],\,\,\,\,\,\,0\leq n<Frame\,size
\end{displaymath}

The window coefficients $H[n]$ are computed using the following equation.


\begin{displaymath}
H[n]=0.54-0.46\times\cos\left(\frac{2\pi n}{Frame\,size-1}\right)
\end{displaymath}

The constants used in the $H[n]$ transform were obtained from the Sphinx source code.
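
For illustration, the window and its application can be sketched as follows; NumPy's np.hamming uses exactly the same constants.

\begin{verbatim}
def hamming(frame_size):
    n = np.arange(frame_size)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_size - 1))

# Apply the window to a preemphasized frame.
y1 = preemphasize(frame) * hamming(len(frame))
\end{verbatim}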

FFT: The frame is padded with enough zeroes to make the frame size a power of two (call this $N$) and a Fourier transform is used to convert the frame from the time domain to the frequency domain.

\begin{displaymath}
Y_{2}=DFT(Y_{1})
\end{displaymath}

The squared magnitude is then computed for each frequency component, yielding the power spectrum. The results are real numbers rather than the complex values produced by the discrete Fourier transform. Since the input frame is real-valued, the spectrum is conjugate-symmetric, so only the first $N/2$ components need to be retained.


\begin{displaymath}
Y_{3}[n]=real(Y_{2}[n])^{2}+imag(Y_{2}[n])^{2},\,\,\,\,\,\,0<n\leq N/2
\end{displaymath}
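
The FFT and squared-magnitude steps can be sketched together. np.fft.rfft exploits the real-valued input and returns only the non-redundant half of the spectrum; the padding rule is the power-of-two rule described above.

\begin{verbatim}
def power_spectrum(y1):
    # Pad to the next power of two, then square the magnitudes.
    N = 1 << (len(y1) - 1).bit_length()
    Y2 = np.fft.rfft(y1, n=N)        # N/2 + 1 complex bins
    return Y2.real ** 2 + Y2.imag ** 2
\end{verbatim}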

Mel Filter Bank: A bank of triangular filters is used to approximate the frequency resolution of the human ear. The Mel frequency scale is linear up to 1000 Hz and logarithmic thereafter. A set of overlapping Mel filters is constructed so that their center frequencies are equidistant on the Mel scale. The transformation is:

\begin{displaymath}
Y_{4}[n]=\sum_{i=0}^{N/2}Y_{3}[i]\times MelWeight[n][i],\,\,\,\,\,\,0<n<Number\,of\,filters
\end{displaymath}

For a 16 kHz sampling rate, Sphinx uses a set of 40 Mel filters.
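
One common construction of such a filter bank is sketched below. The Mel-scale formula used here, $2595\log_{10}(1+f/700)$, is the widely used variant and is an assumption; Sphinx's own tables may differ in detail.

\begin{verbatim}
def mel(f):
    # Approximately linear below 1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_weights(n_filters=40, fft_size=512, rate=16000):
    # Triangle corners are equally spaced on the Mel scale.
    edges = mel_inv(np.linspace(0.0, mel(rate / 2.0), n_filters + 2))
    bins = np.floor((fft_size + 1) * edges / rate).astype(int)
    w = np.zeros((n_filters, fft_size // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        w[i, lo:mid] = (np.arange(lo, mid) - lo) / (mid - lo)
        w[i, mid:hi] = (hi - np.arange(mid, hi)) / (hi - mid)
    return w

Y4 = mel_weights() @ power_spectrum(y1)   # 40 filter outputs
\end{verbatim}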

Log Compression: The range of the values generated by the Mel filter bank is reduced by replacing each value with its natural logarithm. This makes the statistical distribution of the spectrum approximately Gaussian, which the subsequent acoustic model assumes. The transformation is:

\begin{displaymath}
Y_{5}[n]=ln(Y_{4}[n]),\,\,\,\,\,\,0<n<Number\,of\,filters
\end{displaymath}
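
In code this step is a single call. Real front ends typically apply a small floor first so that a silent filter output does not produce $\ln(0)$; the floor value below is illustrative.

\begin{verbatim}
Y5 = np.log(np.maximum(Y4, 1e-10))
\end{verbatim}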

DCT: The discrete cosine transform is used to compress the spectral information into a small set of low-order coefficients. This representation is called the Mel-cepstrum. Currently Sphinx compresses the 40-element vector $Y_{5}$ into a 13-element cepstral vector. The transformation is:

\begin{displaymath}
Y_{6}=DCT(Y_{5})
\end{displaymath}
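
The compression can be sketched directly from the type-II DCT definition. Normalization conventions vary between implementations, so the unnormalized form below is only one reasonable choice.

\begin{verbatim}
def cepstrum(Y5, n_ceps=13):
    # Project the 40 log filter outputs onto cosine basis vectors
    # and keep only the first 13 coefficients.
    M = len(Y5)
    k = np.arange(n_ceps)[:, None]
    i = np.arange(M)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * M))
    return basis @ Y5
\end{verbatim}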

Numerical Differentiation: Acoustic modeling assumes that each acoustic vector is uncorrelated with its predecessors and successors. Since speech signals are continuous, this assumption is problematic. The traditional remedy is to augment the cepstral vector with its first and second differentials. Since the Mel cepstral vector is 13 elements long in Sphinx, appending the differentials yields a final acoustic vector 39 elements in length.
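
A minimal sketch of the augmentation over a sequence of cepstral vectors follows; Sphinx's exact difference windows may differ from the simple gradients used here.

\begin{verbatim}
def add_deltas(ceps):
    # ceps: (n_frames, 13). Append first and second temporal
    # differences to obtain an (n_frames, 39) feature matrix.
    d1 = np.gradient(ceps, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.hstack([ceps, d1, d2])
\end{verbatim}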

Summary: The Sphinx front end transforms each 25 ms frame of speech into a 39-element vector of real numbers that represents the spectral characteristics of the waveform in compact form. The speech signal is blocked into overlapping frames spaced 10 ms apart, so the front end converts one second of speech into a series of 100 acoustic vectors. Although the front end accounts for less than 1% of the compute cycles of Sphinx 3.2, it is important for two reasons.

  1. Understanding acoustic vectors is a crucial prerequisite for illustrating the operation of the acoustic model.
  2. The front end is dominated by floating-point computations, which make it difficult to run on embedded processors without floating-point hardware. Fixed-point versions are difficult to create and analyze, but have been studied in the literature. Delaney described a fixed-point speech front end for Sphinx that performed 34 times better on an embedded processor than a floating-point front end using software-emulated floating-point operations [32].


