9.7 Compiler Controlled Clock Gating

In a traditional architecture, a function unit pipeline always shifts unless a stall occurs. Operands enter the pipeline, and results exit it, under hardware control. A distinguishing feature of the perception processor architecture is that a compiler can manage pipeline activity on a cycle-by-cycle basis. Microinstructions contain an opcode field for each function unit in the cluster. The fetch logic enables the pipeline shift and clock signals of a function unit only if the corresponding field is not a NOP. The fetch logic can also generate a NOP when the opcode field is used for another purpose. The net result is that a function unit pipeline stalls by default and makes progress only during cycles when operations are issued to it. The scheme provides fine-grained software control over clock gating without requiring additional instruction bits to enable or disable a function unit. When the result of an N-cycle operation is required, but the function unit is not used after that operation, the compiler inserts dummy instructions into the following instruction slots to flush out the required value. To avoid excessive power-line noise, a compiler may keep a function unit active even when it has nothing to compute. The regular nature of modulo scheduled loops makes them good candidates for analytical modeling and reduction of power-line noise [112].
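The gating rule described above can be illustrated with a small behavioral model. The following Python sketch is not the perception processor's actual logic: the opcode names `nop`, `flush`, and `add`, the class `GatedPipeline`, and the convention that a result becomes visible in the last pipeline register after `depth` shifts (counting the issuing cycle) are all assumptions made for illustration.

```python
from typing import List, Optional

NOP = "nop"      # hypothetical encoding: the unit stays clock gated this cycle
FLUSH = "flush"  # hypothetical dummy op: shifts the pipeline, computes nothing


class GatedPipeline:
    """Behavioral model of a function unit that shifts only when issued an op."""

    def __init__(self, depth: int) -> None:
        self.stages: List[Optional[int]] = [None] * depth
        self.clocked_cycles = 0  # cycles in which the unit's clock was enabled

    def step(self, opcode: str, a: int = 0, b: int = 0) -> Optional[int]:
        """Advance one machine cycle; return the value in the last stage, if any."""
        if opcode == NOP:
            # Clock gated: no shift, every in-flight value stays where it is.
            return None
        self.clocked_cycles += 1
        # Only "add" is modeled; a dummy op injects an empty slot.
        entering = a + b if opcode == "add" else None
        self.stages = [entering] + self.stages[:-1]
        return self.stages[-1]


if __name__ == "__main__":
    unit = GatedPipeline(depth=3)
    program = [("add", 2, 3), (NOP, 0, 0), (NOP, 0, 0), (FLUSH, 0, 0), (FLUSH, 0, 0)]
    for opcode, a, b in program:
        print(opcode, unit.step(opcode, a, b))
    print("clocked cycles:", unit.clocked_cycles)
```

In this model the sum issued in the first cycle sits unchanged in the first pipeline register during the two `nop` cycles, and it reaches the output only after the two dummy operations shift it forward. The unit is clocked in three of the five cycles.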

Fine-grained, compiler-directed pipeline control has two main purposes. First, the compiler has explicit control over the lifetimes of values held in a pipeline, unlike a traditional architecture, where values enter and exit the pipeline under hardware control and only quantities held in architected registers may be explicitly managed. In the perception processor, pipeline registers and the associated bypass paths may be managed as if they were a small register file, and dataflows found in custom hardware can be easily mimicked. Second, it lets the compiler control the amount of activity within a cluster. Software control of dynamic energy consumption makes energy versus ILP trade-offs possible. The resulting activity pattern can approximate the ideal condition in which each function unit has its own clock domain and runs at just the right frequency.
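The last point can be made concrete with a simple calculation. The sketch below assumes a schedule represented as a list of microinstruction words, each mapping a function unit name to its opcode field, and treats the fraction of non-NOP fields as a unit's duty cycle. The function name `effective_clock_mhz`, the unit names, and the 1000 MHz cluster clock are hypothetical and serve only as illustration.

```python
from typing import Dict, List

NOP = "nop"  # hypothetical encoding of an idle opcode field


def effective_clock_mhz(
    schedule: List[Dict[str, str]], cluster_clock_mhz: float
) -> Dict[str, float]:
    """Per-unit effective clock rate implied by a schedule's issue pattern.

    A unit is clocked only in cycles where its opcode field is not a NOP,
    so its average clock rate is the cluster clock scaled by its duty cycle.
    """
    if not schedule:
        return {}
    units = sorted({unit for word in schedule for unit in word})
    total_cycles = len(schedule)
    rates: Dict[str, float] = {}
    for unit in units:
        # A unit missing from a word is treated as idle in that cycle.
        active = sum(1 for word in schedule if word.get(unit, NOP) != NOP)
        rates[unit] = cluster_clock_mhz * active / total_cycles
    return rates


if __name__ == "__main__":
    loop_body = [
        {"alu": "add", "mul": "mul"},
        {"alu": "add", "mul": NOP},
        {"alu": "sub", "mul": NOP},
        {"alu": "add", "mul": NOP},
    ]
    print(effective_clock_mhz(loop_body, cluster_clock_mhz=1000.0))
```

Under these assumptions the `alu` unit is clocked every cycle, while the `mul` unit, issued once every four cycles, behaves on average like a unit running at a quarter of the cluster clock. The model ignores leakage and any energy cost of the gating logic itself.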

Binu Mathew