SiliconIntelligence

6.2 Coprocessor Datapath

Figure 6.2 shows the architecture of the accelerator. The datapath consists of an $(a-b)^{2}\times c$ floating point unit, followed by an adder that accumulates the sum as well as a fused multiply add $(a\times b+c)$ unit that performs the final scaling. Given that X, Mean, and Var are 39-element vectors, a vector style architecture is suggested. The problem comes in the accumulation step, since this operation depends on the sum from the previous cycle, and floating point adders have multicycle latencies. For a vector length of N and an addition latency of M, a straightforward implementation takes $(N-1)\times M$ cycles. Binary tree reduction (similar to an optimal merge algorithm) is possible, but even then the whole loop cannot be pipelined with unit initiation interval.
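To make the accumulation bottleneck concrete, the following minimal sketch compares the serial cycle count $(N-1)\times M$ with the latency of a binary tree reduction. The adder latency `M = 4` is an illustrative assumption, not a figure from this design, and the tree estimate assumes every addition at a given tree level can proceed in parallel, so it is a lower bound rather than a model of the actual hardware.

```python
def serial_accumulate_cycles(n, m):
    """Cycles to sum n terms when each add must wait for the previous sum."""
    return (n - 1) * m


def tree_reduce_cycles(n, m):
    """Latency of a binary tree reduction of n terms.

    Assumes all additions on one tree level run in parallel, so the
    latency is (number of levels) * (adder latency).
    """
    levels = (n - 1).bit_length()  # ceil(log2(n)) for n >= 1
    return levels * m


N = 39  # vector length given in the text
M = 4   # adder latency in cycles: an assumed, illustrative value

if __name__ == "__main__":
    print("serial:", serial_accumulate_cycles(N, M))
    print("tree:  ", tree_reduce_cycles(N, M))
```

Even under these optimistic assumptions the tree version still needs several adder latencies per vector, which is consistent with the observation that the loop cannot reach a unit initiation interval this way.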

Figure 6.2: Gaussian Coprocessor
\includegraphics[width=0.95\columnwidth]{figures/sphinx_custom/gauss_coproc}

This problem is solved by reordering Loops 1,2,3 to a 2,3,1 order. This calculates an $(X-M)^{2}\times V$ term for each input block while reading out the mean and variance values just once from the SRAM. Effectively, this is an interleaved execution of 10 separate vectors on a single function unit, which leaves enough time to do a floating point addition of a partial sum term before the next term arrives for that vector. The cost is 10 internal registers to maintain partial sums. Loops 2,3,1 can now be pipelined with unit initiation interval. In the original algorithm, the Mean/Var SRAM is accessed every cycle, whereas with the loop-interchanged version this 64-bit wide SRAM is accessed only once every 10 cycles. Since SRAM read current is comparable to function unit current in the CMOS technology used for this design, the loop interchange also contributes significant savings in power consumption.
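The effect of the interchange can be sketched in software. The loop definitions are not restated in this section, so the sketch below assumes that Loop 1 iterates over the 10 input blocks, Loop 2 over 8 mixture components, and Loop 3 over the 39 vector elements; these roles are inferred from the counts quoted in this section and are an assumption. All function names and the test data are hypothetical. The `reads` counter stands in for Mean/Var SRAM accesses.

```python
def make_data(n_blocks=10, n_mix=8, n_elem=39):
    """Deterministic, made-up test data (not from the original design)."""
    X = [[0.01 * (b + 1) * (i + 1) for i in range(n_elem)]
         for b in range(n_blocks)]
    mean = [[0.02 * (m + i) for i in range(n_elem)] for m in range(n_mix)]
    var = [[1.0 / (1 + m + i) for i in range(n_elem)] for m in range(n_mix)]
    return X, mean, var


def original_order(X, mean, var):
    """Assumed 1,2,3 order: block, mixture, element."""
    n_blocks, n_mix, n_elem = len(X), len(mean), len(mean[0])
    reads = 0
    out = [[0.0] * n_mix for _ in range(n_blocks)]
    for b in range(n_blocks):
        for m in range(n_mix):
            acc = 0.0  # each add depends on the previous cycle's sum
            for i in range(n_elem):
                mu, v = mean[m][i], var[m][i]
                reads += 1  # one Mean/Var read per term
                acc += (X[b][i] - mu) ** 2 * v
            out[b][m] = acc
    return out, reads


def interchanged_order(X, mean, var):
    """Assumed 2,3,1 order: mixture, element, block (innermost)."""
    n_blocks, n_mix, n_elem = len(X), len(mean), len(mean[0])
    reads = 0
    out = [[0.0] * n_mix for _ in range(n_blocks)]
    for m in range(n_mix):
        partial = [0.0] * n_blocks  # one partial-sum register per block
        for i in range(n_elem):
            mu, v = mean[m][i], var[m][i]
            reads += 1  # read once, reused for every block
            for b in range(n_blocks):
                partial[b] += (X[b][i] - mu) ** 2 * v
        for b in range(n_blocks):
            out[b][m] = partial[b]
    return out, reads


if __name__ == "__main__":
    X, mean, var = make_data()
    out_a, reads_a = original_order(X, mean, var)
    out_b, reads_b = interchanged_order(X, mean, var)
    print(out_a == out_b, reads_a, reads_b)
```

Each partial sum in the interchanged version is updated only once every 10 inner iterations, which is the slack that hides the adder latency, and each sum still accumulates its terms in the same element order as before.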

The Final Sigma unit in Figure 6.2 works in a similar manner, except that instead of a floating point adder, it uses a fused multiply add unit. It scales the sum and adds the final weight. This unit has a fairly low utilization since it receives only $8\times10$ inputs every $39\times10\times8$ cycles, that is, one input every 39 cycles on average. To save power, this unit is disabled when it is idle. In a multichannel configuration it is possible to share this unit between multiple channels. To reduce the number of reads the processor needs to perform to fetch results from the accelerator, this unit may be made to accumulate the final score. This also serves to reduce the outgoing bandwidth from the processor by a factor of eight. In that case, due to the interleaved execution, this unit also requires 10 intermediate sum registers.

Log domain addition can be implemented using an integer subtract, a table lookup and an integer add operation. The state machine needs to be adapted to recirculate the results through the integer add/subtract unit within the floating point adder. The lookup table used for extrapolation is constant and can therefore be implemented as optimized logic within the state machine. In this design, log domain addition is implemented in software.
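The subtract, lookup, add sequence for log domain addition can be illustrated with a minimal software sketch. It assumes that probabilities are stored as integer logarithms to a base slightly greater than one; the base `1.0003`, the table construction, and all names are assumptions made for illustration and are not taken from this design. The identity used is $\log(a+b)=\log a+\log\left(1+b/a\right)$ for $a\ge b$, where the second term depends only on the difference of the two logs.

```python
import math

LOG_BASE = 1.0003  # assumed logarithm base, chosen only for illustration


def to_log(p, base=LOG_BASE):
    """Convert a probability to its rounded integer log value."""
    return int(round(math.log(p) / math.log(base)))


def build_table(base=LOG_BASE):
    """table[d] = round(log_base(1 + base**(-d))), built until it rounds to 0."""
    table = []
    d = 0
    while True:
        value = int(round(math.log(1.0 + base ** (-d)) / math.log(base)))
        if value == 0:
            break
        table.append(value)
        d += 1
    return table


def log_add(x, y, table):
    """Integer log of (a + b), given integer logs x and y of a and b."""
    hi, lo = (x, y) if x >= y else (y, x)
    d = hi - lo                # integer subtract
    if d >= len(table):
        return hi              # smaller operand is negligible
    return hi + table[d]       # table lookup, then integer add


if __name__ == "__main__":
    table = build_table()
    approx = log_add(to_log(0.3), to_log(0.2), table)
    print(approx, to_log(0.5))
```

Because the table is a constant function of the difference `d`, it is the kind of structure the text describes as implementable in fixed logic; the result agrees with a direct computation up to the rounding of the integer log values.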



Binu Mathew