SiliconIntelligence

5.2 ILP in Sphinx

Before exploring special-purpose architecture extensions for speech, it is worthwhile to investigate the limits of modern architectures. GAU is dominated by floating point computation, while HMM is dominated by integer computation. GAU also appears to be easily vectorizable. Two simulation studies were undertaken to explore the possibilities for extracting ILP. For GAU, a surplus of integer ALUs was provided and the number of floating point units was varied. Since this algorithm uses an equal number of multiplies and adds, the floating point adders and multipliers were each increased from one to four, which corresponds to the X axis of Figure 5.4 varying from two to eight FPUs. Two memory hierarchies were considered: a reasonable one for a multi-GHz processor, and an aggressive one with lower latencies. Both configurations are summarized in Table 5.1.
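The equal mix of multiplies and adds can be illustrated with a small sketch. It assumes that GAU evaluates diagonal-covariance Gaussians of the form score = log_norm - sum((x_i - mean_i)^2 * inv_var_i), which this section does not state explicitly. The function name `gaussian_score` and its parameters are hypothetical and are not taken from the Sphinx source.

```python
def gaussian_score(x, mean, inv_var, log_norm):
    """Score one diagonal-covariance Gaussian (assumed form, see lead-in).

    Returns (score, n_adds, n_muls), where the counters tally the
    floating point operations executed inside the loop body.
    """
    acc = 0.0
    n_adds = 0
    n_muls = 0
    for xi, mi, vi in zip(x, mean, inv_var):
        d = xi - mi      # subtract: executes on a floating point adder
        sq = d * d       # multiply
        w = sq * vi      # multiply
        acc = acc + w    # add
        n_adds += 2
        n_muls += 2
    # The final subtraction happens once per Gaussian, outside the loop.
    return log_norm - acc, n_adds, n_muls


if __name__ == "__main__":
    score, adds, muls = gaussian_score([1.0, 2.0], [0.0, 0.0], [1.0, 0.5], 0.0)
    print(score, adds, muls)
```

Under this assumption each feature dimension costs two adder operations and two multiplier operations, which is consistent with scaling the two kinds of unit together in the experiment.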

Figure 5.4: GAU and GAU OPT IPC
\includegraphics[width=0.95\columnwidth]{graphs/sphinx_opt/gau_ipc_barchart}

The SGI-2+2f entry describes the measured total IPC on the R12000, which has two integer and two floating point units. The SGI-2 entry is the measured floating point IPC alone. In the case of GAU, IPC remains low because memory bandwidth is insufficient to keep the FPUs active. On the R12000, which can issue two floating point operations per cycle, the IPC for this loop is an underwhelming 0.37. GAU OPT uncovers opportunities for ILP by virtue of its cache optimizations, thereby improving IPC greatly. However, its IPC saturates at 1.2 in spite of the available function units. A recently published study also reported IPC in the range of 0.4 to 1.2 for another speech recognizer [59]. Clearly, the architecture and compiler are unable to automatically extract the available ILP, which again argues for custom acceleration strategies.
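One way to see why extra function units do not help a bandwidth-starved loop is a simple bound: sustained IPC cannot exceed either the issue width or the rate at which memory delivers operands. The sketch below is an illustrative model only, not the simulator used for these experiments. The function name and all parameter values are hypothetical and are not measurements from this study.

```python
def bandwidth_bound_ipc(issue_width, instrs_per_byte, bytes_per_cycle):
    """Upper bound on IPC for a streaming loop (illustrative model).

    issue_width      -- maximum instructions issued per cycle
    instrs_per_byte  -- instructions executed per byte fetched from memory
    bytes_per_cycle  -- sustained memory bandwidth in bytes per CPU cycle
    """
    memory_limit = instrs_per_byte * bytes_per_cycle
    return min(issue_width, memory_limit)


if __name__ == "__main__":
    # Hypothetical numbers: widening the machine from 2 to 8 units
    # leaves the bound unchanged while memory is the bottleneck.
    for units in (2, 4, 6, 8):
        print(units, bandwidth_bound_ipc(units, 0.5, 1.0))
```

In this model, adding function units only raises `issue_width`, so the bound stays at the memory limit until bandwidth improves or the loop fetches fewer bytes per instruction, which is the effect a cache optimization aims for.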

Figure 5.5: HMM IPC
\includegraphics{graphs/sphinx_opt/hmm_ipc_barchart}

Figure 5.5 shows the corresponding experiment for the HMM phase. In this experiment, the integer adders and multipliers are each varied from one to four. In spite of the available execution resources, IPC remains low. It should be noted that in both experiments, the SGI results are indicative of cases where the CPU-to-memory clock ratio is low. This ratio will undoubtedly increase in the future.

The observations from Sections 5.1 and 5.2 have several implications:

  1. If speech is an ``always on'' background application, it could cause significant L2 cache pollution and degrade the memory bandwidth available to the foreground application. To guarantee real-time processing, it might be better to stream data around the L2 rather than pollute it.
  2. Since the L2 cache is one of the largest sources of capacitance on the chip, accessing it for stream data incurs a large power overhead. Low-power embedded platforms may not need any L2 cache at all, since dramatic increases in L2 size are not accompanied by corresponding improvements in DRAM bandwidth requirements or performance.
  3. Bandwidth reduction is important for its own sake as well as for reducing power consumption. Partitioning bandwidth so that each phase has independent access to its data set is also important.



Binu Mathew