9.6.3 Array Variable Renaming

Setting the modulo period field in load.context/store.context instructions to a nonzero value enables a performance-enhancing feature called array variable renaming. Modulo scheduling makes it possible to overlap the execution of multiple instances of the inner loop body. Assume that the k loop from Figure 9.8 has a latency of 30 cycles and that, after resource conflicts are resolved and data dependences are satisfied, a new copy of the loop body can start every 5 cycles. Then up to 6 copies of the loop body (30/5) could be in flight in the execution pipeline at once. For each new copy of the loop body to see correct data dependences, the loop variable should be incremented every 5 cycles. However, once it is incremented, older instances of the loop body that are still in flight will read the wrong value, violating dependences for load/store instructions that occur close to the end of the loop body.
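The hazard described above can be made concrete with a small model. The sketch below is illustrative only: it assumes a single shared loop counter that starts at 0 and is incremented once every initiation interval, and the names `copies_in_flight`, `shared_counter` and `observed_index` are hypothetical helpers, not part of the perception processor's instruction set.

```python
# Illustrative model of the stale-index problem (assumed counter behavior).
LATENCY = 30  # cycles needed by one copy of the loop body (from the text)
II = 5        # initiation interval: a new copy starts every 5 cycles


def copies_in_flight(latency, ii):
    """Maximum number of overlapped loop bodies (ceiling of latency / ii)."""
    return -(-latency // ii)


def shared_counter(cycle, ii):
    """Value of the single loop counter at a global cycle.

    Assumption: the counter starts at 0 and increments every `ii` cycles.
    """
    return cycle // ii


def observed_index(iteration, offset, ii):
    """Index seen by an instruction issued `offset` cycles into `iteration`.

    Iteration n is assumed to start at global cycle n * ii.
    """
    return shared_counter(iteration * ii + offset, ii)


if __name__ == "__main__":
    print(copies_in_flight(LATENCY, II))   # 6
    print(observed_index(3, 0, II))        # 3: correct at the start of the body
    print(observed_index(3, 28, II))       # 8: a late store sees a stale index
```

In this model, a store issued 28 cycles into iteration 3 observes the counter after five further increments, so without correction it would address the element that belongs to iteration 8.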

The traditional solution is to use multiple copies of the loop variable in conjunction with the VLIW equivalent of register renaming: a rotating register file. Multiple address calculations are performed, the appropriate values are loaded into the register file, and the register file is rotated. For long-latency loop bodies with short initiation intervals, this leads to increased register pressure. The solution to this problem is to increment a single copy of the loop variable every initiation interval and to compensate for the increment in older copies of the loop body that are in flight. The compensation factor, which is really the modulo period, is encoded into the immediate field of load/store instructions. It is subtracted from the loop variable's value so that dependences resolve correctly. In effect, this rotates the array variable and lets a generic expression like $A[i][j]$ be rebound to separate addresses. Array variable renaming effectively converts the entire scratch pad memory into a rotating register file with separate virtual rotating registers for each array accessed in a loop. Array variable renaming is much more powerful than register rotation, but it can also be used in conjunction with a rotating register file. This could be useful when it is possible to custom design rotating register files that have lower latency than the SRAM and address generator combination used to implement array renaming. Such a combination can capitalize on both the flexibility provided by array renaming and the low latency provided by a custom designed rotating register file. The perception processor does not have an architected register file at all; it simply uses array variable renaming in place of register renaming to achieve very high throughput at low power.
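The compensation step can be sketched in a few lines. This is a simplified model under stated assumptions, not the actual address generator: the shared counter is assumed to increment once per initiation interval, and the immediate attached to each load/store is assumed to equal the number of increments that have occurred since that instruction's iteration began. The function names are hypothetical.

```python
# Simplified model of array variable renaming (assumed semantics).
II = 5        # initiation interval in cycles
LATENCY = 30  # length of one loop body in cycles


def shared_counter(cycle, ii=II):
    """Single loop counter, assumed to increment once every `ii` cycles."""
    return cycle // ii


def compensation(offset, ii=II):
    """Immediate for a load/store scheduled `offset` cycles into the body.

    Assumption: it equals the number of counter increments that occur
    between the start of the iteration and this instruction.
    """
    return offset // ii


def effective_index(iteration, offset, ii=II):
    """Loop index used for address generation after compensation."""
    cycle = iteration * ii + offset  # iteration n starts at cycle n * ii
    return shared_counter(cycle, ii) - compensation(offset, ii)


if __name__ == "__main__":
    # Every instruction of every in-flight copy recovers its own index.
    for iteration in range(10):
        for offset in range(LATENCY):
            assert effective_index(iteration, offset) == iteration
    print(effective_index(3, 28))  # 3
```

Because the immediate is fixed per instruction at schedule time, only one counter needs to be kept live, which is how this scheme avoids the register pressure of holding a separate copy of the loop variable for each in-flight iteration.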

Binu Mathew