
9.1 Pipeline Structure

Figure 9.2: Pipeline Structure

The perception processor architecture was designed to emulate the dataflows that typically occur within custom ASIC accelerators. To this end, its pipeline structure is simple and rather different from that of a traditional processor. In sharp contrast to the typical five-stage Instruction Fetch/Instruction Decode/Execute/Memory/Write Back (IF/ID/EX/MEM/WB) pipeline of a MIPS-like RISC processor, the perception processor pipeline consists of just three stages: Fetch/Decode/Execute [46]. The number of actual stages in the final execute phase depends on the function unit. The pipeline structure is shown in Figure 9.2. Conspicuous departures from the RISC model include the absence of register lookups in the decode stage and the lack of memory and write-back stages.
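To make the timing concrete, the following C sketch models the three-stage flow at the cycle level; the function-unit names and execute latencies below are assumptions chosen purely for illustration and are not taken from the actual design.

/*
 * Minimal sketch (hypothetical, not the actual hardware): a three-stage
 * Fetch/Decode/Execute pipeline in which the depth of the execute phase
 * depends on the function unit an operation is dispatched to.
 */
#include <stdio.h>

enum fu { FU_ALU = 0, FU_MULT = 1, FU_SRAM = 2 };

/* Assumed execute latencies, in cycles, per function unit. */
static const int exec_latency[] = { 1, 3, 2 };

struct op {
    enum fu unit;   /* which peer unit in the EX stage executes this op */
    int     issue;  /* cycle at which the microinstruction is fetched   */
};

int main(void) {
    struct op ops[] = { {FU_ALU, 0}, {FU_MULT, 1}, {FU_SRAM, 2} };
    for (int i = 0; i < 3; i++) {
        int fetch  = ops[i].issue;
        int decode = fetch + 1;                       /* minimal decode stage */
        int done   = decode + exec_latency[ops[i].unit];
        printf("op %d: fetch@%d decode@%d finishes@%d\n", i, fetch, decode, done);
    }
    return 0;
}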

In the perception processor, microinstructions are fetched from an instruction memory that is more than 200 bits wide. The decode stage is minimal: it is limited to performing sign or zero extension of constants, generating NOPs for function units while the memory system is being reconfigured, and generating clock-enable signals for active function units. The wide instruction is then dispatched to a set of function units, a loop unit, and a set of address generators. All resources, including the actual function units and SRAM ports, appear as peers in the EX stage. The outputs of all these peer units can be transferred back to the inputs of the units by an interconnect network. Transfer latency depends on proximity: nearest neighbors can be reached in the same cycle, while reaching a non-neighboring unit incurs an additional cycle of latency.
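One way to visualize such a wide control word is as a collection of per-unit bitfields, as in the hypothetical C sketch below; the field names, widths, and unit counts are invented for illustration and do not reflect the real encoding.

/*
 * Hedged sketch of one slice of a wide (>200-bit) horizontal
 * microinstruction word.  All names and widths are assumptions.
 */
#include <stdint.h>

struct fu_slice {                /* per-function-unit control bits      */
    uint32_t opcode      : 5;    /* operation selected for this unit    */
    uint32_t src_a_mux   : 4;    /* interconnect mux select, operand A  */
    uint32_t src_b_mux   : 4;    /* interconnect mux select, operand B  */
    uint32_t clk_enable  : 1;    /* clock-gating enable for this unit   */
};

struct addr_gen_slice {          /* per-address-generator control bits  */
    uint32_t stride      : 8;
    uint32_t base_sel    : 3;
    uint32_t enable      : 1;
};

struct microinstruction {        /* one very wide control word          */
    struct fu_slice       fu[4];         /* function units              */
    struct addr_gen_slice agen[2];       /* address generators          */
    uint32_t              loop_ctl : 8;  /* loop unit control           */
    uint32_t              imm      : 16; /* constant, sign/zero extended
                                            by the minimal decode stage */
};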

In the MIPS RISC execution model, every instruction implicitly encodes a path through the pipeline. An integer instruction takes the IF/ID/EX/MEM/WB path, while a floating-point instruction takes a detour through the FPU in the EX stage. There is also an implicit, hardware-controlled timing regime that dictates the cycle at which an instruction reaches each stage, subject to dependences checked by interlocks.
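The small C sketch below illustrates this implicit timing regime under the usual textbook assumptions (forwarding plus a one-cycle load-use interlock stall); the instruction sequence and cycle numbers are illustrative only and are not taken from the source material.

/*
 * Illustrative sketch: the hardware fixes the cycle at which each
 * instruction occupies IF/ID/EX/MEM/WB, and the interlock inserts a
 * stall when a dependence cannot be satisfied by forwarding.
 */
#include <stdio.h>

int main(void) {
    const char *stages[] = { "IF", "ID", "EX", "MEM", "WB" };
    /* lw  r2, 0(r1)   -- enters IF at cycle 0                          */
    /* add r3, r2, r4  -- enters IF at cycle 1, held in ID for one extra
                          cycle waiting for the loaded value (interlock) */
    int lw_stage_cycle[5]  = { 0, 1, 2, 3, 4 };
    int add_stage_cycle[5] = { 1, 2, 4, 5, 6 };  /* ID occupied cycles 2-3 */

    for (int s = 0; s < 5; s++)
        printf("%-3s  lw@%d  add@%d\n", stages[s],
               lw_stage_cycle[s], add_stage_cycle[s]);
    return 0;
}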

In the perception processor, instructions do not encode any such implicit paths. The instructions are called microcode because they serve the traditional horizontal-microcode role, in which individual bits directly control hardware resources such as mux selects and register write enables. To obtain the functionality implied by a MIPS instruction, the stage-by-stage behavior of the MIPS instruction must be identified and the equivalent microinstruction bits set in several successive microinstruction words. The advantage of this lower-level approach is that the hardware can be controlled in a fine-grained fashion, which is not possible in the MIPS case. For example, interconnect muxes may be set to route data between selected function units and memory in a manner that directly represents the dataflow graph of an algorithm, and data may be streamed through the dynamically configured structure.

The ability to reconfigure the structure through microcode on a cycle-by-cycle basis means that the function units may be virtualized to map flow graphs that are too large to fit on the processor. Compared to a processor with enough physical resources to map the entire flow graph, this manifests itself as a higher initiation interval and a larger number of temporary results that must be saved or rerouted. Performance degrades gracefully under virtualization. The perception processor supplants the instruction-centric RISC execution model with a data-centric execution model, which lends it the flexibility to efficiently mimic the styles of computation found in VLIW and vector processors as well as in custom ASIC datapaths.
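As a hedged illustration of this data-centric style, the C sketch below shows a single steady-state control word whose mux-select fields stream data from two SRAM ports through an adder into a multiplier, directly mirroring the dataflow graph of y[i] = (a[i] + b[i]) * c[i]; all unit and field names are invented for the example. A flow graph larger than the available function units would instead be spread across several such words, raising the initiation interval as described above.

/*
 * Hypothetical microinstruction format: each field is an interconnect
 * mux select that picks which peer unit's output feeds a given input.
 */
struct uop {
    int adder_src_a;   /* which peer output feeds adder input A          */
    int adder_src_b;
    int mult_src_a;    /* e.g. ADDER_OUT routes the adder result to the
                          multiplier, a neighboring unit in this sketch   */
    int mult_src_b;
    int sram_wr_sel;   /* which peer output is written back to memory     */
};

enum { SRAM_A, SRAM_B, SRAM_C, ADDER_OUT, MULT_OUT, NONE };

/* Steady-state loop body for y[i] = (a[i] + b[i]) * c[i]: one control
 * word per iteration once the path through the two function units has
 * filled.  A loop unit (not shown) would repeat this word for the
 * remaining iterations.                                                 */
static const struct uop steady_state = {
    .adder_src_a = SRAM_A,     /* a[i]                                   */
    .adder_src_b = SRAM_B,     /* b[i]                                   */
    .mult_src_a  = ADDER_OUT,  /* sum from an earlier iteration in flight */
    .mult_src_b  = SRAM_C,     /* matching c[] element                    */
    .sram_wr_sel = MULT_OUT,   /* completed y[] element written back      */
};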


