SiliconIntelligence

1.2 The Solution

Figure 1.2: High Level Architecture
\includegraphics[width=0.95\columnwidth]{figures/intro/system_arch}

This dissertation addresses the design of programmable processors that can handle sophisticated perception workloads in real time at power budgets suitable for embedded devices. Programmable processors optimized for the perception domain are intended to be used as coprocessors for general purpose host processors. A high level view of the architecture is shown in Figure 1.2. A number of function units are organized as a cluster and embedded in a rich interconnection network that connects the function units in the cluster to four memories. The host processor moves data into or out of the coprocessor via double buffered input and output SRAMs. Local storage for the cluster is provided by the scratch SRAM, and the microcode program that controls the operation of the cluster is held in the u-Code SRAM. The execution cluster can be customized for a particular application by the selection of function units. In fact, the type and number of function units, SRAMs, address generators, bit widths and interconnect topology are specified using a configuration file. The hardware design (a Verilog netlist) and a customized simulator are automatically generated by a cluster generator. Henceforth the term perception processor refers to the generic architecture behind any domain-specific processor created using the cluster generator tool.
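To make the configuration-driven generation step concrete, a specification of the kind described above might resemble the following sketch. The field names, values and helper function here are purely illustrative assumptions; they do not reproduce the cluster generator's actual input format.

```python
# Hypothetical sketch of a cluster-generator configuration. All field
# names and values are invented for illustration; the real tool's
# format is not shown in this chapter.
cluster_config = {
    "function_units": [
        {"type": "alu", "count": 2, "width": 32},
        {"type": "multiplier", "count": 1, "width": 32},
    ],
    "srams": {
        "input":   {"kb": 4, "double_buffered": True},
        "output":  {"kb": 4, "double_buffered": True},
        "scratch": {"kb": 8},
        "ucode":   {"kb": 2},
    },
    "address_generators": 4,
    "interconnect": "crossbar",
}

def total_sram_kb(cfg):
    """Sum the capacities of all SRAMs in a configuration."""
    return sum(s["kb"] for s in cfg["srams"].values())
```

A generator tool would walk such a structure to emit both the Verilog netlist and a matching cycle-level simulator, so that hardware and simulation model never drift apart.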

Perception algorithms tend to be stream oriented, i.e., they process a sequence of similar data records, where the data records may be packets or blocks of speech signals, video frames, or the output of other stream processing routines. Each input packet is processed by a relatively simple and regular algorithm that often refers to some limited local state tables or history to generate an output packet. The packets have fixed or variable but bounded sizes. The algorithms are typically loop oriented, with the dominant components being nested for loops with flow-dependent bodies. Flow dependence implies that loop-carried dependences have constant distances in the iteration space of the nested loop structure. Processors that are optimized for this style of computation are called stream processors. While there are subtle differences, the notion of streams and algorithm kernels described here is essentially the same as that developed by Dally et al. for the Imagine Stream Processor [82]. The perception processor developed in this research is a specialized stream processor optimized for speech recognition and vision. However, attempts will be made to show its generality to other stream oriented algorithms in Chapters 10 and 12.
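As a concrete illustration of flow dependence, consider a first-order recursive filter: the loop-carried dependence on the output has a constant distance of one in the iteration space. The example is generic, not one of the dissertation's benchmark kernels.

```python
def recursive_filter(x, a):
    """First-order recursive filter: y[i] = x[i] + a * y[i-1].

    The dependence of y[i] on y[i-1] has a constant distance of 1,
    so the loop is flow dependent in the sense used above, and a
    compiler can schedule it with that fixed distance known statically.
    """
    y = [0.0] * len(x)
    prev = 0.0  # y[-1], the initial filter state
    for i in range(len(x)):
        prev = x[i] + a * prev
        y[i] = prev
    return y
```

Because the dependence distance is a compile-time constant, such loops are amenable to modulo scheduling, which the compiler described below exploits.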

Fine-grained control of physical resources is provided by a horizontal microcode program. The architecture and the fine-grained control mechanism support data flows that resemble the custom computational pipelines found in ASICs. Software based control provides a significant level of generality: any algorithm can be mapped onto the cluster, albeit with varying levels of efficiency. The result is a cluster that can be tailored to a particular domain and can support multiple applications or application phases. The approach includes a specialized microcode compiler that maps applications onto the perception processor. Currently, the input to the compiler is a tiny specialized language implemented on top of the Python scripting language. It supports constructs for various types of for loops, array access patterns, opcode mnemonics, loop unrolling and processor reconfiguration requests. Compilers for more general languages like C or C++ are definitely possible, but have not been implemented. The compiler uses hardware support for modulo-scheduled loops in conjunction with array address generators to deliver high throughput for flow dependent loops [81]. The microcode provides fine-grained control over data steering, clock gating and function unit utilization, and it permits single cycle reconfiguration of address generators.
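The Python-embedded input language itself is not reproduced in this chapter. As a purely hypothetical sketch of what a loop construct in such an embedded DSL can look like, the class and method names below (`Program`, `for_loop`, `emit`) are invented for illustration and do not match the actual tool.

```python
# Hypothetical sketch of a Python-embedded microcode DSL in the spirit
# of the compiler input described above. Names are illustrative only.
class Program:
    def __init__(self):
        self.ops = []  # the flat list of emitted microcode operations

    def for_loop(self, n, body, unroll=1):
        """Emit a counted loop, optionally unrolled by a constant factor."""
        for i in range(0, n, unroll):
            for u in range(unroll):
                body(i + u)

    def emit(self, opcode, *operands):
        """Record one operation with its opcode mnemonic and operands."""
        self.ops.append((opcode, operands))

p = Program()
# Vector add over 8 elements, unrolled by 2: c[i] = a[i] + b[i]
p.for_loop(8,
           lambda i: p.emit("add", f"a[{i}]", f"b[{i}]", f"c[{i}]"),
           unroll=2)
```

Embedding the DSL in a scripting language gives the programmer the host language's full power for metaprogramming (unroll factors, access-pattern generation) while keeping the compiler itself small.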

Energy efficiency is primarily the result of minimized communication and activity. The compiler uses fine-grained clock gating to ensure that each function unit is active only when required. Compiler-controlled dataflow permits software to explicitly address the output and input stage pipeline registers of function units and orchestrate data transfer between them over software-controlled bypass paths. Data values are transported only if necessary, and the compiler takes care to ensure that value changes are visible on heavily loaded wires and forwarding paths only if a unit connected to that path needs the data value. By explicitly enabling pipeline registers, the compiler is able to control the lifetime of function unit outputs and directly route data to other function units, avoiding unnecessary accesses to a register file. The resulting dataflows or active datapaths resemble the custom computational pipelines found in ASICs, but have the advantage of flexibility offered by software control. This may be thought of as a means of exploiting the natural register renaming that occurs when a multistage pipeline shifts and each individual pipeline register gets a new value. However, the active datapath in the cluster will utilize multiplexer circuits that provide generality at the cost of power, area and performance. These muxes and the associated penalties will not be present in a custom ASIC design.
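The interplay of clock gating, explicitly addressed pipeline registers and software-controlled bypass paths can be modeled abstractly as follows. This is a toy behavioral model built on invented names, not the processor's actual control scheme.

```python
# Toy model of compiler-controlled dataflow: each function unit exposes
# an output pipeline register, is clock-gated except when scheduled,
# and results move only over explicitly enabled bypass paths.
# Illustrative abstraction only; names are invented.
class FunctionUnit:
    def __init__(self, op):
        self.op = op
        self.out_reg = None   # explicitly addressed output pipeline register
        self.active_cycles = 0  # counts cycles the unit was actually clocked

    def fire(self, a, b):
        """Ungate the unit for one cycle, compute, and latch the result."""
        self.active_cycles += 1
        self.out_reg = self.op(a, b)

def bypass(src):
    """Read a latched result directly from a unit's output register,
    bypassing any register file."""
    return src.out_reg

adder = FunctionUnit(lambda a, b: a + b)
multiplier = FunctionUnit(lambda a, b: a * b)

adder.fire(3, 4)                      # adder latches 3 + 4
multiplier.fire(bypass(adder), 2)     # forwarded over a bypass path
```

In the real machine the schedule, not runtime logic, decides when each unit is ungated and which bypass path carries a value, so idle units and unused wires see no switching activity.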

The resulting architecture is powerful enough to support complex perception algorithms at energy consumption levels commensurate with mobile device requirements. The approach represents a middle ground between general purpose embedded processors and ASICs. It possesses a level of generality that cannot be achieved by a highly specialized ASIC, while delivering performance and energy efficiency that cannot be matched by general purpose processor architectures.



Binu Mathew