5. Characterization and Optimization of Sphinx 3

Chapter 4 described the Front end (FE), Gaussian (GAU) and Search (HMM) phases of the Sphinx 3.2 speech recognition system. To fully characterize the complex behavior of Sphinx, it is necessary to study the individual phases separately. In addition to the FE, GAU and HMM phases, Sphinx has a lengthy startup phase and extremely large data structures that could cause high TLB miss rates on embedded platforms with limited TLB reach. To prevent startup cost and TLB misses from distorting the measured performance characteristics, Sphinx was modified to support check-pointing and fast restart. On embedded platforms, the check-pointed data structures may be placed in ROM in a physically mapped segment similar to kseg0 in MIPS processors [71]. Results in this chapter are based on this low-startup-cost version of Sphinx, referred to as original.
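The check-pointing idea can be illustrated with a minimal sketch. The function names and file format below are hypothetical (Sphinx 3 itself serializes its C data structures, not Python objects); the sketch only shows the fast-restart pattern: build the large read-only model tables once, save them, and load the saved image on every subsequent run.

```python
# Minimal sketch of check-pointing for fast restart. All names here are
# hypothetical stand-ins; Sphinx 3's real startup builds large C model
# structures (codebooks, lexicon, HMM topology).
import os
import pickle

CHECKPOINT = "model.ckpt"

def build_model_tables():
    # Stand-in for the lengthy startup phase.
    return {"gaussians": list(range(1000)), "lexicon": {"hello": [1, 2, 3]}}

def load_model():
    # Fast-restart path: reload the checkpointed image if it exists.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    # Slow startup path: build once, checkpoint for later runs.
    model = build_model_tables()
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(model, f)
    return model
```

On an embedded platform the same checkpointed image could live in ROM and be mapped directly, avoiding even the load step.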

Previous studies have not characterized the three phases separately [6,59]. To capture the phase characteristics and to separate optimizations for embedded architectures, a phased version of Sphinx was developed in which each of the FE, GAU and HMM phases can be run independently, with input and output data redirected to intermediate files. In the rest of this chapter, FE, GAU and HMM refer to the corresponding phase run in isolation, while phased refers to all three chained sequentially with no feedback. In phased, FE and HMM are identical to original, while the workload of GAU increases because it no longer receives dynamic feedback from HMM. Breaking this feedback path exposes parallelism in each phase and allows the phases to be pipelined. GAU OPT refers to a cache-optimized version of the GAU phase alone. PAR runs each of the FE, GAU OPT and HMM phases on a separate processor, using the same cache optimizations as GAU OPT.
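The phased decomposition can be sketched as follows. The phase bodies are trivial stand-ins and the file names are hypothetical; the point is the structure: each phase reads its input from a file and writes its output to an intermediate file, so the phases can run independently or be chained with no feedback path between HMM and GAU.

```python
# Sketch of the phased version: FE -> GAU -> HMM chained through
# intermediate files. Phase bodies are placeholder arithmetic, not the
# real signal processing; file names are hypothetical.
import json

def run_fe(audio, out_path):
    # Front end: waveform -> feature vectors (placeholder computation).
    feats = [x * 2 for x in audio]
    with open(out_path, "w") as f:
        json.dump(feats, f)

def run_gau(in_path, out_path):
    # Gaussian phase: with no feedback from HMM, scores are computed
    # for every frame rather than only for frames the search requests.
    with open(in_path) as f:
        feats = json.load(f)
    scores = [x + 1 for x in feats]
    with open(out_path, "w") as f:
        json.dump(scores, f)

def run_hmm(in_path):
    # Search phase: scores -> hypothesis (placeholder: sum the scores).
    with open(in_path) as f:
        return sum(json.load(f))

def phased(audio):
    # No data flows backward, so each stage could run on its own
    # processor with the files replaced by queues (the PAR arrangement).
    run_fe(audio, "feats.json")
    run_gau("feats.json", "scores.json")
    return run_hmm("scores.json")
```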

Both simulation and native profiling tools were used to analyze Sphinx 3. Simulations provide flexibility and a high degree of observability, while profiled execution on a real platform provides realistic performance measures and serves as a way to validate the accuracy of the simulator. The configurations used to analyze Sphinx 3 are shown in Table 5.1.

Table 5.1: Experiment Parameters

Native execution:
  Hardware:  SGI Onyx3, 32 R12K processors at 400 MHz
             32 KB 2-way IL1, 32 KB 2-way DL1, 8 MB L2
  Software:  IRIX64, MIPSpro compiler, Perfex, SpeedShop

Simulator (default configuration):
  SimpleScalar 3.0, out-of-order CPU model, PISA ISA
  8 KB 2-way IL1 (2-cycle latency), 32 KB 2-way DL1 (4-cycle latency)
  2 MB 2-way L2 (20-cycle latency), 228-cycle DRAM latency
  L1 line size 64 bytes, L2 line size 128 bytes
  Software:  gcc 2.6.3

ILP experiment configurations:
  Reasonable:  32 KB DL1 (4-cycle latency), 2 MB L2 (20-cycle latency),
               2 memory ports
  Aggressive:  32 KB DL1 (2-cycle latency), 8 MB L2 (20-cycle latency),
               4 memory ports

A multi-GHz processor is required to operate Sphinx in real time. Parameters such as L1 cache hit time, memory access time and floating-point latency were measured on a 1.7 GHz AMD Athlon processor using the lmbench hardware performance analysis suite [68]. Numbers that could not be directly measured were obtained from vendor microarchitecture references [51,5]. The SimpleScalar simulator was then configured to reflect these parameters [19]. Unless mentioned otherwise, the remainder of this chapter uses the default configuration from Table 5.1.
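As a worked example of translating Table 5.1 into a simulator configuration, SimpleScalar's sim-outorder describes each cache as name:nsets:bsize:assoc:repl, so the set counts must be derived from the size, line size and associativity. The arithmetic below uses the Table 5.1 parameters; the flag spellings follow sim-outorder's conventions but should be verified against the simulator's own help output.

```python
# Derive SimpleScalar cache arguments (name:nsets:bsize:assoc:repl)
# from the Table 5.1 default configuration. Flag names follow
# sim-outorder conventions; check them against the simulator's -h output.
def nsets(size_bytes, line_bytes, assoc):
    # sets = total size / (line size * associativity)
    return size_bytes // (line_bytes * assoc)

KB, MB = 1024, 1024 * 1024
il1 = nsets(8 * KB, 64, 2)      # 8 KB 2-way IL1, 64 B lines
dl1 = nsets(32 * KB, 64, 2)     # 32 KB 2-way DL1, 64 B lines
l2  = nsets(2 * MB, 128, 2)     # 2 MB 2-way unified L2, 128 B lines

args = (f"-cache:il1 il1:{il1}:64:2:l -cache:il1lat 2 "
        f"-cache:dl1 dl1:{dl1}:64:2:l -cache:dl1lat 4 "
        f"-cache:dl2 ul2:{l2}:128:2:l -cache:dl2lat 20")
```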

Native profiling indicates that original Sphinx spends approximately 0.89%, 49.8% and 49.3% of its compute cycles in the FE, GAU and HMM phases, respectively. Another recent study found that up to 70% of another speech recognizer's execution time was spent in Gaussian probability computation [59]. In the phased version, approximately 0.74%, 55.5% and 41.3% of the time was spent in FE, GAU and HMM, respectively. Since FE is such a small fraction of the execution time, the rest of this work excludes it and concentrates on the analysis of the GAU and HMM phases.


Binu Mathew