9.8 Design Flow

The hardware netlist for a perception processor is automatically generated from a configuration description by a specially developed netlist compiler tool. The configuration description is created manually based on an analysis of benchmarks. A key input to this analysis is the relative importance of the various types of operators within an algorithm; this determines the mix of function units incorporated into a perception processor. Also important is the dataflow within loop bodies, which determines the interconnect topology as well as the size and number of SRAMs. It may be possible to perform this analysis in a semiautomated manner in the future.

Based on the benchmark analysis, an architect creates a configuration description expressed as a Python script. The configuration script selects a set of function units from a library of components such as ALUs, multipliers and floating-point units implemented in Verilog HDL and the Synopsys Module Compiler language. Each function unit in the library is annotated with attributes such as latency, opcode width and the names of its input and output ports. Each function unit is assigned a name and a position in one of the eight slots available for function units. The architect also selects the number of input and output muxes used to create the interconnect. Depending on the type and number of function units and SRAMs, the actual HDL code for the muxes is generated by the netlist compiler. The architect then specifies the topology of the interconnect by listing the names of the function units connected to each of the input and output muxes.
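The flavor of such a configuration script might look as follows. This is a minimal sketch: the class and method names (`Config`, `add_fu`, `add_mux`, `connect`) and all parameter values are invented for illustration, since the text does not show the actual netlist compiler API.

```python
# Hypothetical sketch of a perception processor configuration script.
# All names and numbers below are illustrative, not the real tool's API.

class Config:
    def __init__(self, num_slots=8):
        self.slots = [None] * num_slots   # eight function unit slots
        self.muxes = {}                   # mux name -> 'input'/'output'
        self.topology = {}                # mux name -> connected FU names

    def add_fu(self, name, kind, slot, latency, opcode_width):
        # Each function unit gets a name and one of the eight slots;
        # latency and opcode width mirror the library annotations.
        self.slots[slot] = dict(name=name, kind=kind,
                                latency=latency, opcode_width=opcode_width)

    def add_mux(self, name, kind):
        self.muxes[name] = kind

    def connect(self, mux, fu_names):
        # Interconnect topology: which function units feed each mux.
        self.topology[mux] = list(fu_names)

cfg = Config()
cfg.add_fu('alu0', 'ALU', slot=0, latency=1, opcode_width=5)
cfg.add_fu('mul0', 'multiplier', slot=1, latency=3, opcode_width=4)
cfg.add_fu('fpu0', 'floating_point', slot=2, latency=4, opcode_width=6)
cfg.add_mux('in0', 'input')
cfg.connect('in0', ['alu0', 'mul0'])
```

The point of the script form is that the topology is ordinary data the netlist compiler can walk to emit the mux HDL.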

The architect also describes an instruction format in symbolic form. This is a tree structure that defines the relative position of the opcode bits for each function unit and interconnect mux within a wide instruction word. Each field is recursively split into subfields, and alternate interpretations can be defined for bitfields. For example, the opcode slots of several function units may also be used to hold reconfiguration information for the loop unit; a shared instruction type field in each instruction word determines which interpretation should be used. The netlist compiler tool converts the configuration description into the top-level HDL description of a perception processor. It generates a small instruction decoder based on the instruction format specified by the architect, creates the interconnect and its constituent muxes, and connects the ports of the various hardware modules together to form a complete perception processor implementation.
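The tree structure described above could be modeled roughly as below. The field names, widths and type values are invented; the text only establishes that fields split recursively and that a shared type field selects between alternate interpretations of a bitfield.

```python
# Hypothetical model of the symbolic instruction format tree.
# Names and widths are placeholders, not the real format.

class Field:
    def __init__(self, name, width=0, subfields=None, alternates=None):
        self.name = name
        self.width = width
        self.subfields = subfields or []     # recursive split into subfields
        self.alternates = alternates or {}   # type value -> interpretation

    def total_width(self):
        # A split field's width is the sum of its subfields' widths.
        if self.subfields:
            return sum(f.total_width() for f in self.subfields)
        return self.width

# An opcode slot that either holds an ALU opcode or loop unit
# reconfiguration bits, selected by the shared instruction type field.
slot0 = Field('slot0', 6, alternates={
    'compute':  Field('alu_opcode', 6),
    'reconfig': Field('loop_unit_config', 6),
})
insn = Field('instruction', subfields=[
    Field('type', 2),        # shared instruction type field
    slot0,
    Field('mux_select', 4),  # interconnect mux control
])
```

The instruction decoder the netlist compiler emits would be a direct translation of such a tree into bit-slicing logic.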

The generated processor netlist, along with the HDL descriptions of the various components, is processed by a series of commercial ASIC design tools. Synopsys Design Compiler maps the HDL description into a gate-level netlist. A suite of specially developed gate-level netlist processing scripts analyzes the input and output connectivity of each gate in the netlist to derive heuristic estimates for wire lengths. These scripts also modify the netlist to insert an RC component on each wire. Each RC component is named uniquely, and the wire length associated with each component is saved to a text database. The modified netlist, together with a wrapper HDL design that instantiates the processor, SRAMs, clock generator, self-checking routines, etc., is simulated using Synopsys Nanosim, a transistor-level Spice simulator. Spice transistor models for a $0.13\mu m$ CMOS process are also provided to Nanosim. Based on the saved wire lengths and the resistance and capacitance of the lowest metal layer, the resistance and capacitance of each wire in the design are computed. A script then instructs the Nanosim simulator at run time to annotate these computed values onto the previously inserted RC elements. A test bench then loads a microprogram binary into the instruction SRAM, and Nanosim performs a low-level simulation of the entire circuit, periodically sampling and recording the supply current to a text database. The simulation repeatedly executes the same microprogram. At the end of each execution, self-checking routines in the test bench verify that the results in the output SRAM match results precomputed by running a C or Python implementation of the algorithm. Simultaneously, a specially developed numerical integration program uses the supply current database to compute power and energy consumption. When the average power consumption converges, the Nanosim simulation is terminated.
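The numerical integration step reduces to integrating the sampled supply current over time and multiplying by the supply voltage. A minimal sketch using the trapezoidal rule is shown below; the sample values and the 1.2 V supply voltage are placeholders, since the text does not give the actual figures.

```python
# Sketch of the power/energy computation from sampled supply current.
# energy = Vdd * integral of I(t) dt (trapezoidal rule);
# average power = energy / elapsed time.

def average_power(times, currents, vdd):
    energy = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        energy += 0.5 * (currents[i] + currents[i - 1]) * dt * vdd
    return energy / (times[-1] - times[0]), energy

# Fictitious data: 10 mA constant supply current sampled every 1 us
# for 10 us, at an assumed 1.2 V supply.
times = [i * 1e-6 for i in range(11)]
currents = [0.010] * 11
power, energy = average_power(times, currents, 1.2)
# Constant 10 mA at 1.2 V -> 12 mW average power, 120 nJ over 10 us.
```

In the real flow this runs continuously over the Nanosim current database, and the simulation stops once the running average power stabilizes.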

The configuration description written by the architect is also used as an input to the microcode compiler, so that the compiler knows the actual configuration of the processor it is generating code for. The compiler translates a microprogram expressed in a limited subset of Python into a microcode binary. It then configures a generic perception processor simulator to reflect the parameters specified in the configuration description. Each microprogram file also includes a pure Python reference implementation of the algorithm and some test data. The microcode binary is simulated using the test data, and the resulting output vectors are saved. The simulator then runs the reference implementation of the algorithm and verifies that the simulation results match it. It then saves the output vectors in a form suitable for use with the Verilog self-checking routines described previously. Another result of the simulation is a log of the read, write and idle cycles of each SRAM. The simulator combines this log with SRAM power consumption information provided by the CAD tool that generated the SRAM macrocell to compute the energy consumption of each SRAM. The SRAM power consumption is then added to the processor power consumption computed by numerical integration of the Nanosim output database to arrive at the overall power consumption.
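The SRAM energy bookkeeping amounts to weighting the logged cycle counts by per-cycle energy figures from the macrocell generator. The sketch below illustrates the idea; the function name, cycle counts and energy-per-cycle values are all placeholders, not data from the actual tools.

```python
# Sketch of per-SRAM energy computation from the simulator's cycle log.
# cycle_log: counts of read, write and idle cycles for one SRAM.
# energy_per_cycle: joules per cycle for each operation type, as would
# be reported by the SRAM macrocell CAD tool (values below are made up).

def sram_energy(cycle_log, energy_per_cycle):
    return sum(cycle_log[op] * energy_per_cycle[op] for op in cycle_log)

log = {'read': 5000, 'write': 1000, 'idle': 4000}
e = {'read': 2e-12, 'write': 3e-12, 'idle': 0.1e-12}  # placeholder pJ/cycle
total_sram_energy = sram_energy(log, e)
```

The per-SRAM totals computed this way are what gets added to the core energy obtained by integrating the Nanosim supply current database.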

Binu Mathew