11. Conclusions

Natural human interfaces built on technologies like speech recognition, gesture recognition, and object detection and tracking are central to the widespread acceptance of future embedded systems. The chances for today's isolated embedded devices to develop into tomorrow's ubiquitous computing environment also depend on services like secure wireless networking, media processing and integration with visual and audio interfaces. The levels of performance and power efficiency required to achieve these goals are orders of magnitude beyond the ability of current embedded processors. Application-specific processor architectures can effectively solve some of these challenges.

The performance characteristics of a face recognition system based on well-known algorithms and of a leading research speech recognition system were analyzed. By recasting these perception algorithms, as well as DSP and encryption algorithms, onto an architecture optimized for stream processing, high levels of ILP and energy efficiency were demonstrated. The perception processor uses a combination of VLIW execution clusters, compiler-directed dataflow and clock gating, hardware support for modulo scheduling and special-purpose address generators to achieve high performance at low power for perception algorithms. Operationally, the combination of stream address generators and scratch-pad memories represents a unification of VLIW and vector styles of execution. The perception processor is a fairly minimal, yet programmable hardware substrate that can mimic the dataflow found in ASICs. Its throughput exceeds that of a Pentium 4 by a factor of 1.75, and its energy delay product is 159 times better than that of an XScale embedded processor. Its energy delay product is just 12 times worse than that of an ASIC implementation. This approach has a number of advantages:

  1. Its energy-delay efficiency is close to what can be achieved by a custom ASIC.
  2. The design cycle is extremely short when compared to an ASIC since it replaces circuit design with interconnect topology selection and microcode programming.
  3. The perception processor architecture is simple and regular. Hardware netlists for perception processor configurations are automatically generated. Once the netlist generator and the basic architectural components are proven to be correct, perception processor configurations should be easier to implement correctly than ASICs. The architecture also provides very fine-grained control over hardware resources, which makes it easy to work around hardware problems and to fix software bugs.
  4. Since applications are implemented in microcode, post-deployment bug fixes are trivial.
  5. It retains a large amount of generality compared to an ASIC.
  6. It is well suited for rapid automated generation of domain-specific processors.

A larger set of applications needs to be analyzed in the future to ensure that the architectural primitives of the perception processor have sufficient generality to cover the perception domain comprehensively. Automated architecture exploration and application analysis, programming language support for perceptual primitives and streaming, and formal methods to ensure real-time response will be important directions for future research.
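
The energy delay comparisons reported above can be placed on a single relative scale. The short sketch below does so using only the two ratios quoted in this section (159 times better than the XScale, 12 times worse than an ASIC). It assumes that both ratios refer to comparable workloads and process normalization, which this section does not state, so the derived ASIC-to-XScale gap is illustrative only. The function and constant names are hypothetical.

```python
# Relative energy delay product (EDP = energy x delay; lower is better).
# Only the two ratios below come from the text. Chaining them assumes they
# were measured under comparable conditions, which is an assumption.

XSCALE_OVER_PERCEPTION = 159.0  # XScale EDP / perception processor EDP
PERCEPTION_OVER_ASIC = 12.0     # perception processor EDP / ASIC EDP


def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """Energy delay product of one run, in joule-seconds."""
    return energy_joules * delay_seconds


def relative_edp(asic: float = 1.0) -> dict:
    """EDP of each platform, normalized so that the ASIC equals `asic`."""
    perception = asic * PERCEPTION_OVER_ASIC
    xscale = perception * XSCALE_OVER_PERCEPTION
    return {"asic": asic, "perception": perception, "xscale": xscale}


if __name__ == "__main__":
    for name, value in relative_edp().items():
        print(f"{name:>10}: {value:8.1f} x ASIC EDP")
```

Under that assumption the perception processor sits much closer to the ASIC than to the embedded processor, which is the sense in which its energy delay efficiency is described as close to that of a custom ASIC.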

It has been shown that fine-grained management of communication and storage resources can improve performance and reduce energy consumption at the same time, whereas improving on both of these axes simultaneously has been problematic with a traditional microprocessor approach. The perception processor is an attractive choice when performance, power efficiency, programmability and rapid design cycles are important. For the first time, sophisticated real-time perception applications appear to be possible within an energy budget that is commensurate with the embedded space.

Binu Mathew