9.6.2 Stream Address Generators

Most perception algorithms have a high ratio of array variable accesses to operators, so multiple SRAM ports are essential for high throughput. The three dual-ported SRAMs in Figure 9.1 together consume approximately as much read/write power as all of the function units combined. Since each additional SRAM port introduces area and energy overhead, utilizing the ports effectively is essential for performance. A previous version of the architecture, which used generic integer ALUs for address generation, was unable to maximize SRAM port utilization [67]. This is because generating an address for a 2D array access involves multiply/shift and add operations, which incur multiple cycles of latency in a traditional processor. When a tight loop body contains several array accesses, a significant fraction of the function units and registers must be allocated to address calculation rather than to computing results. Since address calculation for arrays is a stylized operation, it is possible to design distributed, semiautonomous address generation hardware that frees up function unit resources for the actual result computation and improves data delivery and throughput. In the perception processor, a dedicated address generator is attached to each SRAM port. The generators handle commonly occurring address sequences such as vector and strided accesses as well as 2D array accesses, including row and column walks. They support address generation under regular, modulo, and unrolled loops, and they deal with the special situations that occur when multiple loop bodies are in flight simultaneously.

Before entering a loop intensive section of code, the compiler uses write_context instructions to write descriptions of array access patterns into the address context register files of the address generators. For increased throughput, the same access pattern may be written into multiple address generators. Each address context includes the row and element sizes, the base address, and the loop counter indices that correspond to the array's loop variables. The loop counter indices may be used to retrieve the value of loop count variables generated by the loop unit in Figure 9.6. In the current implementation there are four context entries in each address generator, so a total of 24 access patterns can be resident simultaneously across the six SRAM ports. Since write_context is a single cycle operation, dynamic reconfiguration has very low overhead. The parameters for an array access pattern are packed into a single 32-bit word with the base address in the least significant bits, so arithmetic on the packed word can update the base address dynamically.

Address computation for array and vector references issued to an SRAM port is handled by its attached stream address generator. The operation of the address generator depends on loop counters from the loop unit and on array parameters, such as base address and row size, that are stored in its address context register file. Figure 9.7 shows the internal structure of an address generator. To understand how this simple structure can accomplish a variety of address calculations, it is essential to understand how a compiler generates addresses for array references. Consider the 2D arrays declared and used in C as shown in Figure 9.8.

Figure 9.7: Stream Address Generator

Figure 9.8: Loop Acceleration Example
\begin{verbatim}
struct Complex A[N][M];
...
for (...; ...; k++) { ...
    t1 = A[i][j].imag;
    t2 = B[i][k].real;
    ... }
\end{verbatim}

To simplify the discussion, assume word-oriented addressing. Let the size of the Complex struct be denoted elem\_size. Then the size of one row of A is $row\_size=elem\_size\times M$. If the offset of imag within the struct is 1 and the base address of $A$ is $Base_{A}$, then the base address of the imag field is $Base_{imag}=Base_{A}+1$. Since C stores arrays in row-major order, the address expression corresponding to the load into $t1$ is $Base_{imag}+i\times row\_size+j\times elem\_size$. A vector is a single-dimensional array, so its address expression is just a special case where $row\_size=0$. For more complex index expressions of the form $P\times i+Q$, the factors $P$ and $Q$ may be absorbed into the row size and base address respectively. A column walk of the form $A[j][i]$ can be evaluated similarly. By constraining the row and element sizes to be powers of two, the address expression reduces to the form $address=Base+((i<<x)\vert(j<<y))$.

When the row size cannot be a power of two, for example to pack more data into the scratch memory, it may be chosen as the sum of two powers of two, and separate expressions may be used to access the bulk of the array and the residue. For arrays with $n>2$ dimensions, the base address is repeatedly recalculated to account for $n-2$ dimensions, and the last two levels of the loop nest are left to the hardware. Not all array accesses need to use the same loop variables: in the example, the access of $B$ depends on $i$ and $k$, unlike that of $A$, which depends on $i$ and $j$. The address generator is capable of picking the correct loop variables and plugging them into the address expression.

Each address generator has a designated partner ALU in the cluster, with several address generators possibly sharing the same partner. In cases where the address generator is not equipped to compute the array index function, an address computed by the partner ALU may be issued directly. The partner ALU can also compute address contexts on the fly and reconfigure an address generator. The combination of an address generator and its partner ALU can effectively deal with indirect access streams of the type $A[B[i]]$. Address generation adds one cycle of latency to load/store operations.

When the compiler emits code for an array access, the index of an address generator and the index of an address context register within that generator are encoded into the context index field of the load/store instruction. The selected address generator then uses the context index field to retrieve the array parameters from the context register file, as shown in Figure 9.7. The retrieved context entry specifies the loop variables to be used for calculating the address. The muxes at the top right of the figure use this information to select the appropriate loop counters; the mux inputs are connected to the loop count registers of the loop unit shown in Figure 9.6. The shifters then shift the selected loop variables, the shifted values are ORed together, and the result is added to the base address to form the final address. To improve the processor's cycle time, pipeline registers have been inserted just before the final add operation.

Several special cases are handled in the address generator. It is common to unroll loops by a small factor and software pipeline them for performance. In that case, instead of using two loop variables, it is possible to use one loop variable and one unroll offset to compute the address. The unroll offset is packed into the immediate field of the instruction and selected in lieu of the loop variable using the upper 2-to-1 mux in Figure 9.7. When the access pattern is too complex to be handled by the address generator, the lower 2-to-1 mux selects an address that is computed by an ALU. To handle vectors and ALU generated addresses, which use one and zero loop variables respectively, the loop unit provides a special loop counter that always reads zero.

Binu Mathew