9.6 Memory System Architecture

Perception applications are stream oriented, with a large number of 2D array and vector accesses per elementary operation. These accesses typically occur within tight loops with known bounds. Traditional processors have a limited number of load/store ports, which limits overall performance given the high number of array accesses; this is the reason DSPs traditionally partition their memory resources. A large number of SRAM ports is required to feed data to the function units efficiently. Increasing the number of ports on a single SRAM or cache increases access time and power consumption, which motivates the choice of multiple small software-managed scratch SRAMs. It is also possible to power down SRAMs that are not required. For low-leakage processes, a large fraction of the energy consumption is in the sense amplifiers of the SRAM ports, which consume approximately 50% of the processor energy in the 0.25$\mu$ implementation. Mechanisms to use these expensive resources efficiently are important for both performance and energy conservation.

Measurements based on hardware performance counters on a MIPS R14K processor showed that 32.5% (geometric mean) of the executed instructions were loads/stores for the set of perception benchmarks described later in Section 10.1. The high rate of load/store operations, combined with the regular array access patterns, makes it possible to overlap computation and SRAM access using hardware accelerators. A large fraction of the remaining 67.5% of executed instructions is array address calculation that supports the load/store operations. Significant optimizations are possible by associating each SRAM port with an address generator that handles the common access patterns of streaming applications. These patterns include 2D array and vector accesses in modulo scheduled or software pipelined loops. Details may be found in Section 9.6.4.
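The effect of moving address arithmetic out of the loop body can be illustrated with a small software model. The following is a minimal sketch, not the hardware design of Section 9.6.4: the `AccessPattern` fields, the function names, and the row-major stride formula are all assumptions chosen for illustration. It contrasts a loop that computes each 2D array address itself with one that only consumes operands from an address stream.

```python
from dataclasses import dataclass


@dataclass
class AccessPattern:
    """Hypothetical description of a 2D array access pattern (illustrative only)."""
    base: int        # address of element [0][0]
    row_stride: int  # address distance between consecutive rows
    col_stride: int  # address distance between consecutive columns
    rows: int        # outer-loop trip count
    cols: int        # inner-loop trip count


def addresses(p):
    """Model of an address generator: yield the address stream for the pattern."""
    for i in range(p.rows):
        for j in range(p.cols):
            yield p.base + i * p.row_stride + j * p.col_stride


def sum_explicit(mem, p):
    """Baseline: the loop body performs the address arithmetic and the load."""
    total = 0
    for i in range(p.rows):
        for j in range(p.cols):
            addr = p.base + i * p.row_stride + j * p.col_stride  # address calculation
            total += mem[addr]                                   # load
    return total


def sum_streamed(mem, p):
    """With an address generator: the loop body only consumes loaded operands."""
    return sum(mem[a] for a in addresses(p))
```

In this model the two functions return the same result, but in `sum_streamed` the multiply-add work per access belongs to the generator rather than to the loop body, which is the part of the instruction stream the address generators are meant to absorb.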

Four new instructions are required to take advantage of the optimizations:

Reconfigure an address generator by transferring a description of an access pattern into a context register within the memory system. When applied to the loop unit, this instruction similarly transfers the parameters of a loop into a loop context register.
$load.context\,\,dest,\,context\_index$ and

These are loads/stores that use the address generation mechanism. The $context\_index$ encoded into the immediate constant field of the instruction specifies the address generator to be used and the index of a context register within it.

Let the memory system know that a new loop is starting.
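The interaction of these operations can be sketched with a toy behavioral model. Everything specific in it is an assumption made for illustration: the method names `write_context`, `store_context`, and `loop_start`, the 2-bit split of $context\_index$ into generator and context register fields, the simple base-plus-stride pattern, and the rule that a loop start resets each pattern's position. Only the overall roles (configure a context register, load/store through an address generator, announce a new loop) come from the text.

```python
CTX_BITS = 2  # assumption: low bits of context_index select the context register


def split_context_index(context_index):
    """Split the immediate field into (address generator, context register)."""
    return context_index >> CTX_BITS, context_index & ((1 << CTX_BITS) - 1)


class MemorySystem:
    """Toy model: one scratch SRAM and one bank of context registers per generator."""

    def __init__(self, num_generators, sram_words):
        self.sram = [[0] * sram_words for _ in range(num_generators)]
        self.context = [[None] * (1 << CTX_BITS) for _ in range(num_generators)]

    def write_context(self, context_index, base, stride):
        """Transfer an access pattern description into a context register."""
        gen, ctx = split_context_index(context_index)
        self.context[gen][ctx] = {"base": base, "stride": stride, "pos": 0}

    def loop_start(self):
        """Tell the memory system a new loop is starting (here: rewind all patterns)."""
        for bank in self.context:
            for pattern in bank:
                if pattern is not None:
                    pattern["pos"] = 0

    def _next_address(self, context_index):
        """Generate the next address of the selected pattern and advance it."""
        gen, ctx = split_context_index(context_index)
        pattern = self.context[gen][ctx]
        addr = pattern["base"] + pattern["pos"] * pattern["stride"]
        pattern["pos"] += 1
        return gen, addr

    def load_context(self, context_index):
        """Load through the address generator; no address operand is needed."""
        gen, addr = self._next_address(context_index)
        return self.sram[gen][addr]

    def store_context(self, context_index, value):
        """Store through the address generator."""
        gen, addr = self._next_address(context_index)
        self.sram[gen][addr] = value


# Usage: gather every third word of SRAM 0, starting at word 2, into SRAM 1.
ms = MemorySystem(num_generators=2, sram_words=16)
ms.sram[0][:] = range(16)
src = (0 << CTX_BITS) | 1  # generator 0, context register 1
dst = (1 << CTX_BITS) | 0  # generator 1, context register 0
ms.write_context(src, base=2, stride=3)
ms.write_context(dst, base=0, stride=1)
ms.loop_start()
for _ in range(4):
    ms.store_context(dst, ms.load_context(src))
```

Note that the loop body in the usage example names only context indices, never addresses, which mirrors the point that the immediate field of the instruction is enough to select both the address generator and the access pattern.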


Binu Mathew