
9.9 Programming Example

This section illustrates the operation of the perception processor using a simple kernel that is mapped into microcode. The algorithm to multiply two $16\times16$ floating point matrices is shown in Figure 9.9. The control pattern consists of three nested for loops. Assuming that the matrices are stored in row-major order, the inner product computation accesses array $A$ along a row while $B$ is accessed along a column, causing a base-stride access pattern. The compute pattern consists of multiply-accumulate operations, which form the core of the inner product function.

Figure 9.10 outlines a simple custom hardware accelerator for this algorithm. Address generator $A$ fetches the rows of matrix $A$. Address generator $B$ generates the base-stride pattern for the columns of matrix $B$. Corresponding rows and columns are fetched and applied to the floating point multiplier. The output of the multiplier is accumulated in a scratch register by the floating point adder. When an inner product sum is ready, it is written to a result SRAM, which is not shown in the figure.
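To make the two access patterns concrete, the sketch below lists the address streams the two generators must produce for row-major $16\times16$ matrices. This is an illustration only; the element size and base addresses are assumed values, not parameters taken from the actual design.

\begin{verbatim}
# Illustration only: address streams for row-major 16x16 matrices.
# The element size (4 bytes) and the base addresses are assumed values.
N, ELT = 16, 4

def row_walk(base_a, r):
    # Addresses of A[r][0..N-1]: consecutive elements, unit stride.
    return [base_a + (r * N + i) * ELT for i in range(N)]

def column_walk(base_b, c):
    # Addresses of B[0..N-1][c]: a base-stride pattern with stride N*ELT.
    return [base_b + (i * N + c) * ELT for i in range(N)]
\end{verbatim}

Each cycle, the accelerator consumes one element from each stream and applies the pair to the multiplier.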

Figure 9.9: Matrix Multiply Algorithm
\begin{figure}\begin{centering}
\begin{verbatim}def inner_product(A, B, row, c...
...= inner_product(A, B, i, j)\end{verbatim}
\end{centering}
\end{figure}
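The listing in Figure 9.9 is truncated in this copy. A minimal reconstruction of the algorithm it describes, in the same Python-like style, is sketched below; the loop structure follows the text above, but the exact identifiers in the original figure may differ.

\begin{verbatim}
def inner_product(A, B, row, col):
    # Multiply-accumulate along one row of A and one column of B.
    acc = 0.0
    for i in range(16):
        acc += A[row][i] * B[i][col]
    return acc

def matrix_multiply(A, B, C):
    # The control pattern: three nested for loops.
    for i in range(16):
        for j in range(16):
            C[i][j] = inner_product(A, B, i, j)
\end{verbatim}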

Figure 9.10: Inner Product Accelerator
\includegraphics[width=0.95\columnwidth]{figures/cluster/inner_prod_accel}

In theory, this simple pipeline could compute one inner product every 16 cycles. However, the final accumulation of the inner product value creates a pipeline problem. The floating point add takes 7 cycles, and since the output is accumulated, a new product value can only be handled every 7 cycles. Hence each inner product takes $16\times7$ cycles. Interleaving the computation of 7 or more inner products relieves this bottleneck. However, the interleaving complicates address generation. The additional functionality required to fix this problem includes: a) address generator $B$ needs to be able to generate multiple interleaved base-stride patterns, b) address generator $A$ needs to hold each row element long enough for all the interleaved inner products, and c) several scratch registers are required to hold the intermediate sums. If the interleave factor is the same as the latency of the floating point adder, no scratch registers are required. The output of the adder may be fed back as an input, and the intermediate sums will circulate through the pipeline registers of the adder.
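The following behavioral sketch illustrates the interleaved schedule, assuming an interleave factor equal to the 7-cycle adder latency. The list that holds the partial sums stands in for the adder's pipeline registers, and the function name is purely illustrative.

\begin{verbatim}
ADD_LATENCY = 7
N = 16

def interleaved_inner_products(A, B, r, c0):
    # Computes the 7 inner products for row r and columns c0..c0+6.
    # The 7 partial sums circulate, one per adder pipeline stage.
    sums = [0.0] * ADD_LATENCY
    for i in range(N):                 # one row element A[r][i] per group
        for k in range(ADD_LATENCY):   # 7 interleaved columns
            product = A[r][i] * B[i][c0 + k]
            # The operand fed back into the adder is the partial sum that
            # emerged from it 7 cycles earlier, i.e., the same column's sum.
            sums[k] += product
    return sums                        # 7 finished inner products
\end{verbatim}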

Compilers for high performance architectures attempt to approximate the dataflow in the custom accelerator. In vector processors, vector chaining creates a similar dataflow, and reduction operators help alleviate some of the performance penalty caused by the floating point accumulate operation. By selecting independent adds and multiplies that are ready for issue from its instruction window, an out-of-order processor works somewhat like a vector processor time-sliced across several interleaved vectors. In addition, a combination of software pipelining and branch prediction ensures that the pipeline has as few wasted cycles as possible. Address generation is handled by generic ALUs, which send computed addresses to available load/store ports. Some form of register renaming is also required for software pipelining to work well in nontrivial kernels.

Figure 9.11: Assembly Code for Interleaved Inner Product
\begin{figure}\begin{centering}
\begin{verbatim}i_loop = LoopContext(start_cou...
...pu1.add( fpu1, fpu1.b_reg )\end{verbatim}
\end{centering}
\end{figure}

Figure 9.11 shows the cleaned-up perception processor assembly code for the interleaved inner product. For brevity, the outer loops that invoke the interleaved inner product are not shown. This code is capable of sustaining the same throughput as the refined custom hardware accelerator: 7 inner products every $16\times7$ cycles, i.e., an average of one inner product every 16 cycles. Performance and energy efficiency are achieved by a combination of techniques.

The inner product loop i_loop is marked for hardware modulo loop acceleration, and its parameters are configured into a free context in the loop unit. Two address contexts, A_ri and B_ci, are allocated, and the address generators attached to the input SRAM ports are reconfigured. Both contexts are tied to the loop i_loop. B_ci is set to generate a column walk indexed by i_loop, with the starting offset specified in a constant field of the load opcode. A_ri is set to access the matrix row by row in conjunction with an outer loop. The address contexts effectively implement array variable renaming functions, a fact that is not evident in the code.
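As a rough illustration of what an address context computes, the sketch below models a context as a base address plus strides driven by the loop counters. This is a behavioral model only; the class, its fields, the base addresses, and the element size are assumptions and do not reflect the processor's actual configuration interface.

\begin{verbatim}
ELT = 4                      # assumed element size in bytes

class AddressContext:
    # Behavioral model: base address plus strides indexed by loop counters.
    def __init__(self, base, row_stride, col_stride):
        self.base = base
        self.row_stride = row_stride
        self.col_stride = col_stride
    def address(self, row, col):
        return self.base + row * self.row_stride + col * self.col_stride

# A_ri walks row r of A; the i_loop counter advances along the row.
A_ri = AddressContext(base=0x0000, row_stride=16 * ELT, col_stride=ELT)
# B_ci walks a column of B; the i_loop counter advances down the column,
# so consecutive accesses are 16*ELT bytes apart (the base-stride pattern).
B_ci = AddressContext(base=0x1000, row_stride=16 * ELT, col_stride=ELT)

# The generic reference A[r][i] maps to A_ri.address(r, i), and B[i][c+k]
# maps to B_ci.address(i, c + k), using whatever counter values are current
# for the loop body that issued the access.
\end{verbatim}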

On entering i_loop, the previous loop is pushed onto a stack, though its counter value is still available for use by the address contexts, particularly A_ri. The new loop updates its counter every 7 cycles and admits new loop bodies into the pipeline. This is not a branch in the traditional sense, and there is no branch penalty.
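A toy model of this loop unit behavior is sketched below. The structure and names are assumptions based only on the description above, not on the actual microarchitecture.

\begin{verbatim}
class HWLoop:
    # Assumed fields: trip count and initiation interval in cycles.
    def __init__(self, trips, ii):
        self.trips, self.ii, self.counter = trips, ii, 0

class LoopUnit:
    def __init__(self):
        self.active = None
        self.stack = []          # suspended outer loops; their counters
                                 # remain readable by the address contexts
    def enter(self, loop):
        if self.active is not None:
            self.stack.append(self.active)
        self.active = loop
    def tick(self, cycle):
        # Every initiation interval the active counter advances and a new
        # loop body enters the pipeline; no branch is executed.
        if cycle % self.active.ii == 0 and self.active.counter < self.active.trips:
            self.active.counter += 1
\end{verbatim}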

Communication is explicit and happens via load/store instructions or via inter-function-unit data transfers, both of which explicitly address pipeline registers. In the example, $A[r][i]$ and $B[i][c]$ are allocated to pipeline registers $fpu0.a\_reg$ and $fpu0.b\_reg$, respectively. In fact, it is more appropriate to say that $B[i][c+k]$, where $k$ refers to the $k^{th}$ interleaved inner product, resides in $fpu0.b\_reg$ at time $t_0+k$. No scratch registers are required for the sum. The intermediate sums are merely circulated through the long-latency FPU adder. This notion of allocating variables both in time and space is central to programming the perception processor.

The return value of each opcode mnemonic is the relative time at which its result is available. The $AT$ pseudo op is a compile-time directive that controls the relative time step in which the following instructions are executed. Dataflow is arranged by referring to the producer of a value and the time step in which it is produced. Such a reference is translated by the compiler into commands for the forwarding logic. More complex programs are written as several independent execution streams. The streams are then made to rendezvous at a particular cycle by adjusting the starting time of each stream. The example shows that compile-time pseudo ops can perform arithmetic on relative times to ensure correct dataflow without the programmer needing to be aware of the latencies of the actual hardware implementation.
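The sketch below captures this relative-time idea in miniature. It is not the real assembler: the class, its methods, and the 4-cycle multiplier latency are assumptions made for illustration; only the 7-cycle adder latency comes from the text.

\begin{verbatim}
class Schedule:
    # Toy model: emitting an operation returns the time step at which its
    # result is available; AT() positions the instructions that follow.
    def __init__(self):
        self.now, self.ops = 0, []
    def AT(self, t):
        self.now = t
    def emit(self, op, latency, sources=()):
        self.ops.append((self.now, op, sources))
        return self.now + latency

s = Schedule()
t_mul = s.emit("fmul", latency=4)    # assumed multiplier latency
s.AT(t_mul)                          # consumer starts when the product is ready
t_sum = s.emit("fadd", latency=7, sources=[("fmul", t_mul)])
# The (producer, time step) pairs in 'sources' are what the compiler would
# translate into commands for the forwarding logic.
\end{verbatim}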

The loop body of $i\_loop$ consists of 7 inner loop bodies created by loop unrolling. Each inner loop body before unrolling takes 18 cycles to execute. Since $i\_loop$ has been specified to have an initiation interval of 7 cycles, a total of 3 $i\_loop$ bodies, corresponding to 21 of the original loop bodies, will be in flight within the cluster at a time. It is the modulo-aware nature of the address generators that permits each of these loop bodies to refer to array variables in a generic manner like $A[r][i]$ and get the reference that is appropriate for the values of $r$ and $i$ that were current at the time that loop body was started. Without special-purpose address generation, such high levels of ILP would not be possible. A previous version of the architecture without modulo address generators had limited ILP because generic function units and registers were used for address generation [67].
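The number of overlapping $i\_loop$ bodies follows from the body latency and the initiation interval:

\begin{displaymath}
\left\lceil \frac{18\ \mathrm{cycles}}{7\ \mathrm{cycles}} \right\rceil = 3
\qquad\mbox{and}\qquad 3 \times 7 = 21\ \mbox{original loop bodies in flight.}
\end{displaymath}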

For this example, interleaving 7 inner products at a time results in two leftover columns, since $16 = 2\times7+2$. They are handled by a loop similar to the one shown in Figure 9.11, except that it has more idle slots. The adder needs to be active all the time, but the multiplier needs to work only 2 out of every 7 cycles. Since the multiplier pipeline does not shift during 5 out of every 7 cycles, the dynamic energy consumption resembles that of an ideal circuit in which the adder runs at full frequency and the multiplier runs at $2/7$ of the frequency, thereby consuming less energy.

The overall effect is that the dataflow and throughput of the perception processor match those of the custom hardware, but in a more programmable manner. The address generators transfer data between the SRAMs and the execution units in a distributed and autonomous manner similar to the custom accelerator in Figure 9.10. The output of the multiplier is directly forwarded to the input of the adder. As in the case of the accelerator, no scratch registers are used. The intermediate sums are circulated through the pipeline registers in the adder. Altogether, the microcode and the interconnect provide a level of programmability while retaining a level of hardware economy close to that of the ASIC.


