12. Future Research

The architecture of the perception processor presented in this dissertation evolved gradually from observing the characteristics of speech recognition and vision algorithms while designing ASICs and traditional processors to accelerate these tasks. The design process led to the realization that it may be possible to systematically derive power-efficient, high-performance processors for a wider class of algorithms. This chapter outlines possible directions for future extensions to the perception processor architecture. Throughout this chapter, the term stream processor refers to the extended version of the architecture, so as to clearly distinguish it from the perception processor presented in Chapter 9.

The term stream processing refers to real-time computation on high-bandwidth data streams. Examples include link-level encryption in networks, video transcoding and compression of video streams. Perceptual algorithms tend to be stream oriented. An important direction for future research is the architecture of generic, high-performance, low-power stream processors that can accelerate both perception algorithms and streaming algorithms from other domains.

Figure 12.1: Generic Stream Function

Figure 12.1 shows an abstract representation of a stream function. It is a generalization of the map(), reduce() and filter() list-processing functions and the list comprehensions found in the Python and Haskell languages [101,54]; analogues exist in Lisp and similar languages. It applies a side-effect-free function lambda_func() to arguments gathered from a set of input variables and stores the result to a set of output variables. The input and output variables may be scalars, vectors, multidimensional arrays or more complex aggregates. The procedure input_iterator() is history sensitive: each time it is invoked, it returns a tuple consisting of input data gathered from the various input variables. The input_predicate() function examines the tuple gathered by the iterator and decides whether further processing is required. If so, lambda_func() transforms the input tuple into an output tuple. The function output_predicate() examines an output tuple and decides whether it needs to be saved. If the result must be saved, the history-sensitive output_iterator() procedure scatters the output tuple over the output variables. Complex streaming algorithms may be expressed as the composition of several StreamFunc() instantiations, with the outputs of earlier instances used as the inputs of later instances. Restrictions such as constant dependence distance or flow dependence may need to be imposed to map such compositions onto stream processors with limited on-chip memory.
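The structure described above can be sketched in Python. The names follow the figure and the text; the sketch is an illustrative software model of the abstraction, not the dissertation's hardware implementation:

```python
def stream_func(input_iterator, input_predicate, lambda_func,
                output_predicate, output_iterator):
    """Apply a side-effect-free lambda_func to tuples gathered by a
    (possibly history-sensitive) iterator, filtering on both the input
    and the output side."""
    for input_tuple in input_iterator:          # gather phase
        if not input_predicate(input_tuple):    # drop uninteresting inputs
            continue
        output_tuple = lambda_func(input_tuple) # transform phase
        if output_predicate(output_tuple):      # keep interesting outputs
            output_iterator(output_tuple)       # scatter phase

# Example: squares of the even numbers below 10, expressed as a
# stream function whose output iterator appends to a list.
result = []
stream_func(iter(range(10)),
            lambda t: t % 2 == 0,   # input predicate
            lambda t: t * t,        # lambda function
            lambda t: True,         # output predicate: keep everything
            result.append)          # output iterator
# result == [0, 4, 16, 36, 64]
```

Note that map() corresponds to predicates that always return True, while filter() corresponds to an identity lambda function.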

Figure 12.2: Stream Processor

The structure of StreamFunc() lends itself to a highly parallel hardware implementation. Figure 12.2 shows the logical organization of a generic stream processor. Its architecture is reminiscent of a hydraulic system, and fluid-flow analogies apply to the throughput of the system. The input iterator unit pumps, or gathers, data from a set of SRAMs. The input predicate examines the data and either passes it to the execution cluster or drops it. The execution cluster continuously transforms the data pumped into it. The output predicate then examines each transformed result and either drops it or passes it on to the output iterator, which saves it to output memory. The structure is highly parallel and capable of sustaining high throughput. The gathering, transformation and scattering of data are staged under the control of microcode.
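The staged, flow-through character of the design, and the composition of several StreamFunc() instances with earlier outputs feeding later inputs, can be modeled with Python generators. The stage boundaries below are hypothetical and chosen only for illustration:

```python
def stage(source, predicate, transform):
    """One stream-processor stage: gather an item from the upstream
    source, filter it with the predicate, transform it, pass it on."""
    for item in source:
        if predicate(item):
            yield transform(item)

# Two chained stages, analogous to composing StreamFunc() instances:
# stage 1 keeps even values and squares them; stage 2 keeps squares
# greater than 10 and adds an offset. Items flow through one at a
# time, like fluid through a pipe, without buffering the whole stream.
pipeline = stage(stage(range(10), lambda x: x % 2 == 0, lambda x: x * x),
                 lambda x: x > 10, lambda x: x + 1)
out = list(pipeline)
print(out)  # [17, 37, 65]
```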

The perception processor described in Chapter 9 is less generic than this stream processor. The input and output iterator functionality is provided by the Loop Unit and the Address Generators, but they are limited to accelerating simple nested for loops and array and vector accesses. A stream processor needs high-performance but generic mechanisms for implementing more complex loop nests and data access patterns. The perception processor does not implement input or output predicates, though conditional moves in the execution cluster permit selection of alternative results. In the perception processor, hardware acceleration is limited to lambda functions that correspond to the loop bodies of modulo-schedulable loops. Other types of code may be run, but with no significant advantage over what a normal VLIW processor provides. A generic stream processor may need to support complex lambda functions that involve conditional execution, as well as hardware acceleration for scheduling regimes other than modulo scheduling. Like the perception processor, the stream processor will also need to behave like a normal processor when operating outside the stream function, so as to efficiently implement loop prologues, epilogues and assorted processing that does not fit the stream function model.
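The distinction between conditional execution and conditional moves can be made concrete with a standard if-conversion example. The saturation kernel below is hypothetical; it merely illustrates how a data-dependent branch in a loop body can be rewritten as a select, the form the conditional moves mentioned above can accelerate:

```python
def saturate_branchy(values, limit):
    """Loop body contains a data-dependent branch, which complicates
    modulo scheduling of the loop."""
    out = []
    for v in values:
        if v > limit:
            out.append(limit)
        else:
            out.append(v)
    return out

def saturate_if_converted(values, limit):
    """If-converted form: every iteration executes the same operations;
    the predicate is computed as data and a select (conditional move)
    picks one of two already-computed results."""
    out = []
    for v in values:
        p = v > limit                   # predicate as a data value
        out.append(limit if p else v)   # maps to a hardware select/cmov
    return out

assert saturate_branchy([3, 9, 1], 5) == saturate_if_converted([3, 9, 1], 5)
```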

Research into scheduling algorithms that can produce good mappings of stream functions onto stream processors with a specified configuration will be important from a code generation perspective as well as for automated architecture exploration. Such algorithms will need to optimize for both power and performance while ensuring that parameters like supply-current variation meet design constraints. Algorithms for splitting and composing complex stream functions, expressed as combinations of basic stream functions, so as to make the best use of the limited function units and storage resources available in a particular stream processor configuration will also be important.
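To illustrate one resource constraint such a scheduler must respect, the standard resource-constrained lower bound on the initiation interval (ResMII) from modulo scheduling can be computed as follows. The loop body and cluster configuration in the example are hypothetical:

```python
from math import ceil

def res_mii(op_counts, unit_counts):
    """Resource-constrained lower bound on the initiation interval (II)
    for modulo scheduling: for each function-unit class, the ops of
    that class must fit into II slots on the available units."""
    return max(ceil(op_counts[k] / unit_counts[k]) for k in op_counts)

# Hypothetical lambda-function loop body with 6 adds and 2 multiplies,
# mapped onto a cluster with 2 adders and 1 multiplier:
# II >= max(ceil(6/2), ceil(2/1)) = max(3, 2) = 3 cycles.
ii = res_mii({'add': 6, 'mul': 2}, {'add': 2, 'mul': 1})
print(ii)  # 3
```

A scheduler exploring stream processor configurations would search over unit_counts, trading the silicon and power cost of extra units against the throughput gained by a lower II.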

The structure of perception applications is well suited to a pipeline of perception processors, and stream processors should support more complex communication and synchronization modes. Chapters 5 and 6 indicated that DRAM bandwidth reservation or independent DRAM buses for individual algorithmic phases may be required to ensure adequate bandwidth for perception applications. Chapter 3 explained that the IPC improvement provided by thread-level parallelism can be an important source of power savings. Together, these factors indicate that research into chip multiprocessors consisting of clusters of stream and RISC processors, a stream-optimized interconnect and multiple DRAM buses could be extremely beneficial. Finally, tools to characterize the global dataflow within complex applications, tools to refactor applications to ease mapping onto heterogeneous chip multiprocessors, and programming-language support for streams could be important directions for future research.

Binu Mathew