
9.5 Interconnect

The local bypass muxes in each function unit are intended for fast, frequent communication with the immediate function unit neighbors. The interconnect supports communication with non-neighbor function units and with the SRAMs; such transfers have a latency of one cycle. In a multicluster configuration, intercluster communication incurs even larger delays. Values transferred via the interconnect to the input registers of a function unit may be held there indefinitely, which is useful for caching common constants.
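To make the hold behavior concrete, the following sketch models an input register that latches a value once and then retains it on later cycles. This is a behavioral illustration only, not the actual hardware description; the register class, the constant value, and the cycle structure are assumptions made for the example.

\begin{verbatim}
# Minimal sketch of the "held indefinitely" behavior: a constant is routed
# once through the interconnect and then kept in the function unit's input
# register for every later use, freeing the interconnect slot afterwards.
# The names and the register model here are illustrative assumptions.

class InputRegister:
    def __init__(self):
        self.value = None

    def latch(self, data):
        self.value = data          # a one-cycle interconnect transfer lands here

    def hold(self):
        pass                       # no new transfer: the old value is retained

coeff_reg = InputRegister()
coeff_reg.latch(0x3F80)            # cycle 0: constant field -> interconnect -> register
for _ in range(8):                 # later iterations reuse the cached constant
    coeff_reg.hold()
    assert coeff_reg.value == 0x3F80
\end{verbatim}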

Figure 9.5: Interconnect Architecture
\includegraphics[width=0.75\columnwidth,keepaspectratio]{figures/cluster/interconnect}

In modulo scheduled loops, each resource may be used at most once within each modulo period; a second use that falls into an already occupied slot renders the loop body unschedulable. It is common to find many data reads early in the loop body and a few stores toward the end, corresponding to computed values graduating. Conflicts in the interconnect often make modulo scheduling difficult. Partitioning the interconnect muxes by direction has the potential to reduce such conflicts: incoming muxes transfer data between function units and from SRAM ports to function units, while outgoing muxes are dedicated to transferring function unit outputs to SRAM write ports.
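The resource constraint can be illustrated with a small modulo reservation table. The sketch below is not the thesis scheduler; the initiation interval, resource names, and operation labels are assumptions chosen for the example. It simply shows why two uses of the same mux whose cycles coincide modulo the initiation interval cannot coexist in one schedule.

\begin{verbatim}
# Illustrative modulo reservation table: a second use of a resource is
# rejected when its cycle folds onto an already occupied slot within the
# initiation interval (II). Names and values are assumptions.

class ModuloReservationTable:
    def __init__(self, ii, resources):
        self.ii = ii                                   # initiation interval
        self.table = {r: [None] * ii for r in resources}

    def try_reserve(self, resource, cycle, op):
        slot = cycle % self.ii                         # fold into one II window
        if self.table[resource][slot] is not None:
            return False                               # conflict: slot taken
        self.table[resource][slot] = op
        return True

# Example: with II = 4, an incoming mux used at cycle 1 cannot also be
# used at cycle 5, because 1 mod 4 == 5 mod 4.
mrt = ModuloReservationTable(ii=4, resources=["in_mux0", "out_mux0"])
assert mrt.try_reserve("in_mux0", 1, "load->fu2.A")
assert not mrt.try_reserve("in_mux0", 5, "load->fu3.B")
\end{verbatim}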

The high level architecture of the interconnect is remarkably simple. Assume an organization with N incoming muxes and M outgoing muxes, as shown in Figure 9.5. Each incoming mux is logically a 16-to-1 mux that selects the output of one of the eight function units, one of the six SRAM ports, or a constant field within the microinstruction; some hierarchy in the actual circuit optimizes size and delay. One port of the 16-to-1 mux is currently unused and is reserved for intercluster communication in future multicluster configurations. As seen in Figure 9.4, there are two interconnect pipeline registers at the input of each function unit. Half of the N muxes feed the A input registers of the function units, and the other N/2 muxes serve the B input registers; within each half, the muxes are connected to the input registers in a round-robin manner. Because the muxes are partitioned by input register, both operands of a function unit may be delivered from elsewhere in the cluster without conflict. The M outgoing muxes are 8-to-1 muxes that connect the function unit outputs to the SRAM write ports, again wired round-robin to the SRAM data inputs.

Given values for N and M, a netlist generator tool developed as part of this research produces Verilog HDL for the processor and the interconnect. While the simple round-robin connections have worked well for the benchmarks used in this research, any custom topology may be specified manually. The choice of interconnect parameters depends on the dataflow within the algorithm kernels and on the number of computed results that must be retired per cycle. Compiler-based instruction scheduling algorithms can be made topology neutral by describing communication paths as a weighted graph structure, an approach used in an earlier version of this architecture [67]. The processor configurations evaluated in Chapter 10 use four incoming muxes and one outgoing mux.
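The round-robin wiring can be summarized by a short topology sketch. The real generator emits Verilog HDL; the Python below only reproduces the connection pattern described above, and the port names (fu0..fu7, sram0..sram5) as well as the assumption that N is even are illustrative.

\begin{verbatim}
# Illustrative sketch of the round-robin interconnect topology; the actual
# netlist generator produces Verilog. Port names are assumptions.

N_FUS, N_SRAM_PORTS = 8, 6

def incoming_connections(n_in):
    """Map each incoming mux to the FU input registers it feeds.
    The first half serves the A registers, the second half the B registers,
    each assigned round-robin across the eight function units (even n_in)."""
    conn = {}
    half = n_in // 2
    for i in range(n_in):
        side = "A" if i < half else "B"
        targets = [f"fu{f}.{side}" for f in range(N_FUS) if f % half == i % half]
        conn[f"in_mux{i}"] = targets
    return conn

def outgoing_connections(m_out):
    """Map each outgoing mux to SRAM write ports, again round-robin."""
    return {f"out_mux{j}": [f"sram{s}.wdata" for s in range(N_SRAM_PORTS)
                            if s % m_out == j]
            for j in range(m_out)}

# The configuration evaluated in Chapter 10: four incoming, one outgoing mux.
print(incoming_connections(4))   # in_mux0 -> fu0.A, fu2.A, fu4.A, fu6.A, ...
print(outgoing_connections(1))   # out_mux0 -> all six SRAM write ports
\end{verbatim}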

It is possible that two operands need to be made available at a function unit as part of a dataflow, yet interconnect conflicts prevent both transfers from occurring in the same cycle. In such cases one operand can be transferred in an earlier cycle and its destination interconnect register frozen using clock gate control until both operands have arrived and can be consumed. The conflict is thus resolved and a feasible schedule attained, though latency and the loop initiation interval increase somewhat as congestion increases. This method of staging logically simultaneous transfers across separate cycles is called interconnect borrowing.
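The following sketch shows interconnect borrowing as a scheduling step, under stated assumptions: the helper function, cycle numbers, and mux availability below are hypothetical and only illustrate how one transfer is staged early while its destination register is frozen until the partner operand arrives.

\begin{verbatim}
# Hedged sketch of interconnect borrowing: when both operands cannot be
# routed in the same cycle, one transfer moves to an earlier free cycle and
# its destination register is clock-gated (frozen) until the partner arrives.
# All names and cycle counts are illustrative assumptions.

def borrow(schedule, op_a, op_b, want_cycle, mux_free):
    """Place two logically simultaneous transfers, splitting them if needed.
    `mux_free(cycle)` reports how many incoming muxes remain free."""
    if mux_free(want_cycle) >= 2:
        schedule.setdefault(want_cycle, []).extend([op_a, op_b])   # no conflict
        return schedule
    # Conflict: stage op_a in an earlier cycle and freeze its register.
    early = next(c for c in range(want_cycle - 1, -1, -1) if mux_free(c) >= 1)
    schedule.setdefault(early, []).append((op_a, f"freeze until cycle {want_cycle}"))
    schedule.setdefault(want_cycle, []).append(op_b)
    return schedule

# Example: only one incoming mux is free at cycle 5, two are free at cycle 3,
# so operand A is delivered at cycle 3 and held until B arrives at cycle 5.
free = {3: 2, 4: 0, 5: 1}
print(borrow({}, "x->fu1.A", "y->fu1.B", 5, lambda c: free.get(c, 0)))
\end{verbatim}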


