6.3 Implementation

The datapath shown in Figure 6.2 was implemented in a datapath description language (Synopsys Module Compiler Language) and synthesized for a $0.25\,\mu$m CMOS process. The control sections were written in Verilog and synthesized using the Synopsys Design Compiler. The gate-level netlist was then annotated with worst-case wire loads calculated using the same wire load model used for synthesis. The annotated netlist was simulated at the Spice level using Synopsys Nanosim and transistor parameters extracted for the same $0.25\,\mu$m MOSIS process. Energy consumption was estimated from the RMS supply current computed by Spice. The unoptimized, fully pipelined design can operate above 300 MHz at the nominal voltage of 2.5 volts with unit initiation interval. At this frequency the performance exceeds the real-time requirements for GAU, so a lower frequency and voltage can be used to reduce power further.
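The effect of lowering frequency and voltage can be illustrated with the first-order CMOS dynamic power model, $P \propto C V^2 f$. The sketch below is only an illustration under that assumed model: the function name and the 150 MHz and 2.0 V operating points are hypothetical, not values from the Spice simulation, and the model ignores leakage and the dependence of the achievable frequency on voltage.

```python
# First-order estimate of dynamic power relative to a reference operating point.
# Assumption: P is proportional to C * V^2 * f with the switched capacitance C fixed.

NOMINAL_MHZ = 300.0    # frequency reported for the unoptimized design
NOMINAL_VOLTS = 2.5    # nominal supply voltage of the process


def relative_dynamic_power(f_mhz, v_volts,
                           f_ref_mhz=NOMINAL_MHZ, v_ref_volts=NOMINAL_VOLTS):
    """Return dynamic power as a fraction of the power at the reference point."""
    return (f_mhz / f_ref_mhz) * (v_volts / v_ref_volts) ** 2


# Hypothetical operating points, chosen only to show the trend.
freq_only = relative_dynamic_power(150.0, 2.5)      # halve the clock, keep 2.5 V
freq_and_volt = relative_dynamic_power(150.0, 2.0)  # also lower the supply

print(f"frequency scaling only:        {freq_only:.2f} of nominal power")
print(f"frequency and voltage scaling: {freq_and_volt:.2f} of nominal power")
```

Under this model, frequency scaling alone reduces power linearly, while lowering the supply voltage as well gives a further quadratic reduction.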

A low-power processor similar to a MIPS R4600 was designed for use as a control processor. The MIPS architecture was chosen because it is commonly used in embedded systems and because high-performance implementations of the MIPS ISA, like the R12K, were readily available for experiments. The processor was designed so that it could be easily modified for tight integration with ASIC coprocessors. The Gaussian accelerator was designed and attached to the control processor as a custom coprocessor, and the combination was then simulated. The control processor is a simple in-order design that uses a blocking L1 Dcache and has no L2 cache. To support the equivalent of multiple outstanding loads, it uses the MIPS coprocessor interface to submit DMA requests directly to a low-priority queue in the on-chip memory controller. The queue supports 16 outstanding low-priority block read requests with block sizes that are multiples of 128 bytes. A load request specifies a ROM address and a destination: one of the Feat, Mean or Var SRAMs. The memory controller initiates a queued memory read and transfers the data directly to the requested SRAM index. A more capable out-of-order processor could initiate the loads directly.

Software running on the processor core does the equivalent of the GAU OPT phase. It accumulates 100 ms (10 frames) of speech feature vectors (1560 bytes) into the Feat SRAM whenever the accelerator has finished processing the previous block of input. Currently, the accelerator runs faster than its real-time requirement. It is possible to slow down the accelerator so that it completes each block just as the next block of input becomes ready, but this has not been attempted. The feature data transfer uses the memory controller queue interface. Next, the software loads two interleaved Mean/Var vectors from ROM into the corresponding SRAM using the same queue interface. A single transfer in this case is 640 bytes.

The Mean/Var SRAM is double buffered to hide the memory latency. Initially, the software fills both buffers. It then queues up a series of vector execute commands to the control logic of the Gaussian accelerator. A single command corresponds to executing the interchanged loops 2, 3, 1. The processor then proceeds to read results from the output queue of the Gaussian accelerator. When 10 results have been read, it is time to switch to the next Mean/Var vector and refill the used-up half of the Mean/Var SRAM. This process continues until the end of the Gaussian ROM is reached. When one cache line of results has been accumulated, the results are written to the output queue, where another phase or an I/O interface can read them.
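The double-buffered control flow described above can be sketched as a small software model. This is a minimal illustration and not the actual 250-line C program: the class and function names are hypothetical, each half of the Mean/Var SRAM is assumed to hold one 640-byte transfer, and DMA requests complete immediately, so memory latency is not modeled.

```python
from collections import deque

BLOCK_BYTES = 128        # request sizes must be multiples of this (from the text)
QUEUE_DEPTH = 16         # outstanding low-priority read requests (from the text)
MEANVAR_BYTES = 640      # size of one Mean/Var transfer (from the text)
RESULTS_PER_VECTOR = 10  # results read before switching Mean/Var vectors


class DmaQueue:
    """Toy model of the low-priority read queue in the memory controller."""

    def __init__(self, depth=QUEUE_DEPTH):
        self.depth = depth
        self.pending = deque()

    def submit(self, rom_addr, dest, nbytes):
        """Queue a block read; return False if the queue is full."""
        if nbytes % BLOCK_BYTES != 0:
            raise ValueError("request size must be a multiple of 128 bytes")
        if len(self.pending) >= self.depth:
            return False
        self.pending.append((rom_addr, dest, nbytes))
        return True

    def complete_one(self):
        """Retire the oldest request (the copy into SRAM happens here)."""
        return self.pending.popleft() if self.pending else None


def process_block(num_vectors):
    """Walk the Gaussian ROM once, alternating between the two SRAM halves.

    Returns a list of (vector_index, sram_half, results_read) tuples.
    """
    queue = DmaQueue()
    loaded = [None, None]  # vector index currently held by each SRAM half
    next_vec = 0

    def refill(half):
        nonlocal next_vec
        if next_vec < num_vectors:
            queue.submit(next_vec * MEANVAR_BYTES, ("MeanVar", half), MEANVAR_BYTES)
            queue.complete_one()  # assumption: the read finishes immediately
            loaded[half] = next_vec
            next_vec += 1
        else:
            loaded[half] = None   # end of the Gaussian ROM reached

    refill(0)  # initially fill both halves
    refill(1)

    trace = []
    half = 0
    while loaded[half] is not None:
        # One vector execute command yields RESULTS_PER_VECTOR results.
        trace.append((loaded[half], half, RESULTS_PER_VECTOR))
        refill(half)  # refill the half that was just consumed
        half ^= 1     # switch to the other half
    return trace


for vec, half, count in process_block(5):
    print(f"vector {vec}: SRAM half {half}, {count} results read")
```

In the real system the refill of one half overlaps with computation on the other half, which is what hides the memory latency; the sketch only shows the ordering of refills and buffer switches.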

Calculations based on the throughput of the accelerator showed that it needed to operate at 202 MHz to achieve real-time speech processing. To simplify the electrical interface between the processor and the coprocessor, both circuits need to operate at the same clock frequency. Because the processor runs a general-purpose operating system, events such as clock ticks and background tasks sometimes interrupt the main program that transfers data between main memory and the input and output queues. Additional headroom is required so that these interruptions do not prevent real-time processing of the speech data. The extra performance required from the processor depends on the mix of control tasks running on it, and when the accelerator is scaled to process multiple channels the processor needs commensurate processing ability. The operating frequency of the system was therefore chosen to be as high as possible subject to the limitations of the $0.25\,\mu$m process. The maximum frequency at which the circuits were stable was 300 MHz.

A cycle-accurate simulator was developed and validated by running it in lock step with the processor's HDL model. The simulator was detailed enough to boot the SGI Linux 2.5 operating system and run user applications in multitasking mode. The resulting system accurately models the architecture depicted in Figures 6.1 and 6.2. The GAU OPT application for this system is a simple 250-line C program with fewer than 10 lines of assembly language for the coprocessor interface. Loop unrolling and double buffering were done by hand in C. The application was compiled using MIPS GCC 3.1 and run as a user application under Linux inside the simulator. It processed 100 ms samples of a single channel in 67.3 ms and scaled up to 10 channels in real time. The detailed data are presented in Section 6.5.2.
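The relationship between the 202 MHz requirement, the 300 MHz clock, and the measured processing time can be checked with simple arithmetic. The sketch below assumes that processing time scales inversely with clock frequency; the function name is hypothetical, and the check is a consistency estimate rather than a description of how the reported figure was obtained.

```python
REQUIRED_MHZ = 202.0  # frequency needed for real-time processing (from the text)
CLOCK_MHZ = 300.0     # chosen operating frequency (from the text)
BLOCK_MS = 100.0      # duration of speech covered by one input block


def busy_time_ms(required_mhz, clock_mhz, block_ms):
    """Estimated time to process one block, assuming time scales as 1/frequency."""
    return block_ms * required_mhz / clock_mhz


busy = busy_time_ms(REQUIRED_MHZ, CLOCK_MHZ, BLOCK_MS)
idle = BLOCK_MS - busy

print(f"estimated busy time per block: {busy:.1f} ms")  # prints 67.3 ms
print(f"estimated headroom per block:  {idle:.1f} ms")  # prints 32.7 ms
```

The estimate of about 67.3 ms per 100 ms block agrees with the single-channel time measured in the simulator, leaving roughly a third of each block period as headroom for operating system activity.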

Binu Mathew