SiliconIntelligence

9.6.1 Loop Unit

The index expressions of array accesses in a multilevel nested loop will depend on some subset of the loop variables. The purpose of the loop unit is to compute and maintain the loop variables required for address generation in the memory system while the loop body itself is executed in the function units. Figure 9.6 shows a simplified organization of the loop unit. The loop unit offers hardware support for modulo scheduling, a software pipelining technique that offers high levels of loop performance in VLIW architectures [81].

Figure 9.6: Loop Unit
\includegraphics[width=0.95\columnwidth]{figures/cluster/loopu}

A brief introduction to some modulo scheduling terminology is necessary to understand the functioning of the loop unit. Assume a loop body that takes $N$ cycles to execute. Modulo scheduling allows starting the execution of a new instance of this loop body every $II$ (Initiation Interval) cycles, where $II$ is less than $N$. A normal loop that is not modulo scheduled may be considered a modulo scheduled loop with $II=N$. How $II$ is determined and the conditions that must be satisfied by the loop body are described in [81]. The original loop body may be converted to a modulo scheduled loop body by replicating instructions: every instruction originally scheduled in cycle $n$ is replicated so that it also appears in all possible cycles $(n+i\times II)\bmod N$, where $i$ is an integer. This has the effect of pasting a new copy of the loop body at intervals of $II$ cycles over the original loop body and wrapping around all instructions that fall after cycle $N$. If a particular instruction is scheduled for cycle $n$, then $\lfloor n/II\rfloor$ is called its modulo period.
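The cycle-to-slot mapping described above can be sketched in a few lines of Python. This is an illustration only; the function name, the example schedule, and the instruction names are invented and do not come from the thesis.

```python
# Hedged sketch: mapping an N-cycle loop body onto a modulo-scheduled
# kernel of II cycles. An instruction at original cycle n lands in kernel
# slot n mod II and belongs to modulo period floor(n / II).

def modulo_kernel(schedule, N, II):
    """schedule maps instruction name -> original cycle n (0 <= n < N)."""
    kernel = {}
    for instr, n in schedule.items():
        assert 0 <= n < N, "instruction scheduled outside the loop body"
        kernel[instr] = {"slot": n % II, "period": n // II}
    return kernel

# Illustrative 6-cycle loop body initiated every 2 cycles (N = 6, II = 2):
# "load" (cycle 0) and "mul" (cycle 2) share kernel slot 0 but belong to
# modulo periods 0 and 1, i.e., to different overlapped loop instances.
kernel = modulo_kernel({"load": 0, "mul": 2, "add": 4, "store": 5}, N=6, II=2)
```

Instructions sharing a kernel slot execute in the same cycle of the steady-state kernel, which is why the hardware must distinguish them by modulo period when draining the pipeline.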

The compiler configures static parameters, including $II$ and loop count limits, into loop context registers. The corresponding dynamic values of the loop variables are held in the loop counter register file. The only other piece of information required is which loop body the program counter currently points to. A four-entry loop stack captures this information. In this implementation, the loop unit can keep track of four levels of loop nest at a time, which is sufficient for the benchmarks used in this research. For larger loop nests, the address expressions that depend on additional outer loops may be computed in software as in a traditional processor. A four-entry loop context register file holds the encoded start and end counts and the increment of up to four innermost for loops. Loops are a resource that can be allocated and managed just as one would allocate memory on a traditional architecture. The loop unit maintains a counter for each loop nest and updates it periodically. It also modifies the program counter and, in the case of modulo loops, admits new loop bodies into the pipeline and performs additional manipulation of the loop counter to drain the pipeline correctly on loop termination. On entering a new loop, any previous loop is pushed onto a stack, though its counter value remains available for use by address generators. Loop parameters may also be loaded from memory. This permits modulo scheduling of loops whose loop counts are not known at compile time: appropriate loop parameters may be loaded from SRAM at run time depending on the size of the input data.
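The state just described (static context registers, dynamic loop counters, and the loop stack) can be summarized as plain data. This is a minimal sketch assuming the four-entry sizes and the Figure 9.6 widths given in the text; the field and variable names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class LoopContext:
    """Static per-loop parameters written by the compiler (or loaded
    from SRAM at run time when loop counts are not known statically)."""
    start: int      # encoded start count
    end: int        # encoded end count
    increment: int  # step added to the loop variable every II cycles

NUM_CONTEXTS = 4             # four-entry loop context register file
STACK_DEPTH = 4              # four levels of loop nest tracked at a time
COUNTER_MASK = (1 << 9) - 1  # 9-bit counter width from Figure 9.6 (illustrative)

contexts = [LoopContext(0, 0, 1) for _ in range(NUM_CONTEXTS)]
counters = [0] * NUM_CONTEXTS  # loop counter register file (dynamic values)
loop_stack = []                # context indices; top = innermost active loop
```

The small fixed sizes are the point: because the register file and stack are only four entries deep and a few bits wide, the common-case loop bookkeeping costs very little energy.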

Just before starting a loop-intensive section of code, loop parameters (perhaps dynamically computed) are written into the context registers using write_context instructions. On entry into each loop body, a push_loop instruction pushes the index of the context register for that loop onto the stack. At any given moment, the top of the stack represents the innermost loop being executed. An II counter repeatedly counts up to the initiation interval and then resets itself. Every $II$ cycles, the loop increment is added to the loop variable held in the loop counter register file. This happens automatically; no loop increment instructions are required. When the end count of the loop is reached, the innermost loop has completed. The top entry is automatically popped off the stack, and the process is repeated for the enclosing loop. Note from Figure 9.6 that the registers and datapaths have small widths of 4 and 9 bits, which cover most common loops. These widths are parameters specified in the perception processor configuration. The netlist generator tool can generate perception processors that use any user-specified widths. The choices in Figure 9.6 were sufficient to cover the benchmarks used in this research. Loops that are incompatible with a particular perception processor configuration can always be handled in software, so the reduced bit widths save energy in the common case.
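Putting the pieces together, the run-time behavior described in this section can be modeled as a small state machine. This is a behavioral sketch only: the method names mirror the write_context and push_loop instructions from the text, but the tick() interface is an invention of this sketch, and the extra counter manipulation needed to drain a modulo-scheduled pipeline on termination is omitted.

```python
# Behavioral sketch of the loop unit (not the actual hardware).

class LoopUnit:
    def __init__(self, ii):
        self.ii = ii            # initiation interval
        self.contexts = {}      # index -> (start, end, increment)
        self.counters = {}      # loop counter register file
        self.stack = []         # loop stack; top = innermost loop
        self.ii_counter = 0     # counts up to II, then resets itself

    def write_context(self, idx, start, end, increment):
        """Models a write_context instruction."""
        self.contexts[idx] = (start, end, increment)

    def push_loop(self, idx):
        """Models a push_loop instruction on loop entry."""
        assert len(self.stack) < 4, "four-entry loop stack"
        self.counters[idx] = self.contexts[idx][0]  # load start count
        self.stack.append(idx)

    def tick(self):
        """Advance one cycle; the innermost loop variable is updated
        automatically every II cycles -- no increment instruction."""
        if not self.stack:
            return
        self.ii_counter += 1
        if self.ii_counter < self.ii:
            return
        self.ii_counter = 0
        idx = self.stack[-1]
        _, end, inc = self.contexts[idx]
        self.counters[idx] += inc
        if self.counters[idx] >= end:   # end count reached
            self.stack.pop()            # resume the enclosing loop
```

For instance, with $II=2$ and a loop counting from 0 to 4 in steps of 1, the unit pops the loop off the stack after four initiation intervals (eight cycles), with no explicit increment or branch instructions in the loop body itself.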



Binu Mathew