# 6. A Custom Gaussian Accelerator

Chapter 4 introduced the use of multivariate mixture Gaussians in the acoustic model evaluation of Sphinx 3.2 and indicated that this computation is common to other speech recognition systems such as HTK and the ICRC recognizer [59,111]. Chapter 5 showed that 55.5% of the execution time of Sphinx 3.2 was spent in Gaussian computation when using the Hub-4 speech model. The high percentage of execution time spent in this computation, together with its applicability to a variety of speech recognizers, argues for special acceleration hardware for mixture Gaussians. Accelerators may be implemented as custom nonprogrammable circuits or as domain-specific programmable processors. The custom circuit option represents a practical upper bound on achievable performance and energy efficiency. The programmable option, which sacrifices some performance and energy to gain generality, will be explored in Chapter 9. This chapter describes how a high-throughput custom datapath is able to achieve area, power, and bandwidth efficiency as well as scalability by means of:

- Reducing floating point precision.
- Restructuring the computation.
- Sharing memory bandwidth.

Earlier work by Pihl explored the use of special-purpose floating point formats in Gaussian estimation to save memory bandwidth [77]. Special floating point formats should be almost invisible to the application so that speech models may be developed without access to any special hardware. A custom software floating point emulation library was developed to conduct an empirical search for the precision requirements of the GAU phase. The library supported multiplication, addition, multiply-accumulate (MAC), and subtract-square operations on IEEE 754 format floating point numbers. The approach was to experimentally reduce mantissa and exponent sizes without changing the output results of the Sphinx 3 recognizer. The result was a reduced precision floating point format similar to the IEEE 754 format: it has a sign bit, an 8-bit excess-127 exponent, and a hidden one bit in its normalized mantissa. Unlike IEEE 754, which has 23 explicit bits in the mantissa, the new format uses only 12 bits. Conversion from IEEE 754 to the new format truncates the extra mantissa bits, and conversion from the new format to IEEE 754 appends zero bits. Such a transformation can be done within a floating point unit without any changes being visible to the application. Though this work was done independently, it is worthwhile to note that a previous study arrived at similar conclusions based on an earlier version of Sphinx [97]. However, that research used digit-serial multipliers, which cannot provide the kind of throughput required for GAU computation. Hence the accelerator discussed here uses fully pipelined reduced precision multipliers instead.
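The conversion between the two formats can be illustrated in software. The following is a minimal sketch, not the emulation library used in this work: it assumes the reduced format is packed as a 21-bit pattern (sign, 8-bit exponent, 12-bit mantissa), and the function names are hypothetical. Special values such as NaN are not treated separately.

```python
import struct

IEEE_MANTISSA_BITS = 23      # explicit mantissa bits in IEEE 754 single precision
REDUCED_MANTISSA_BITS = 12   # explicit mantissa bits in the reduced format
DROPPED_BITS = IEEE_MANTISSA_BITS - REDUCED_MANTISSA_BITS  # 11 low-order bits


def float_to_bits(x):
    """Return the 32-bit IEEE 754 single precision pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits):
    """Interpret a 32-bit pattern as an IEEE 754 single precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_reduced(x):
    """Truncate the low mantissa bits (assumed 21-bit packing, hypothetical)."""
    return float_to_bits(x) >> DROPPED_BITS


def from_reduced(r):
    """Concatenate zero bits to recover an IEEE 754 single precision value."""
    return bits_to_float(r << DROPPED_BITS)


def round_trip(x):
    """Value the application sees after passing through the reduced format."""
    return from_reduced(to_reduced(x))
```

Because the sign and exponent fields are unchanged, truncation of a normalized number introduces a relative error below 2^-12, and values whose low 11 mantissa bits are already zero pass through unchanged.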

Another key insight is that current high performance microprocessors provide a fused multiply-add operation that would benefit GAU. However, GAU also needs an add-multiply (subtract-square) operation. There is scope for floating point circuit improvements that rely on the fact that a subtract-square always returns a non-negative number. Further gains can be obtained in area, latency, power, and the magnitude of the numerical error by fusing these operations. This is the approach used in this research.
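The shape of the operation can be sketched in software. The example below assumes that the Gaussian inner loop accumulates terms of the form (x - m)^2 * w over the feature dimensions, which is the usual diagonal-covariance form and is an assumption here rather than a statement of the accelerator's datapath. The function names are hypothetical, and Python rounds after every step in double precision, whereas a fused hardware unit would round once.

```python
def subtract_square_multiply(a, b, c):
    """Compute (a - b)^2 * c as one logical operation (assumed form)."""
    d = a - b
    # d * d is never negative, so a hardware unit would not need
    # sign handling for the squared term.
    return d * d * c


def weighted_squared_distance(x, mean, weight):
    """Accumulate the subtract-square-multiply term over all dimensions."""
    acc = 0.0
    for xi, mi, wi in zip(x, mean, weight):
        acc += subtract_square_multiply(xi, mi, wi)
    return acc
```

For example, `weighted_squared_distance([1.0, 2.0], [0.0, 0.0], [1.0, 0.5])` accumulates 1.0 and 2.0. With non-negative weights every term is non-negative, which is the property a specialized circuit could exploit.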

**Subsections**

- 6.1 Top Level Organization
- 6.2 Coprocessor Datapath
- 6.3 Implementation
- 6.4 Applications
- 6.5 Accelerator Evaluation

Binu Mathew