# Scalable Interconnects for Reconfigurable Spatial Architectures

**Yaqi Zhang**, Alexander Rucker, Matthew Vilim, Raghu Prabhakar, William Hwang, Kunle Olukotun

> Electrical Engineering, Stanford University



ISCA '19: The 46th International Symposium on Computer Architecture, Phoenix, AZ

# **Spatial Accelerators**

- Energy efficient
- High-throughput
- Low-latency

Examples:

- Plasticine (ISCA '17)
- Compressed-sparse CNN accelerator (ISCA '17)
- Stream-dataflow accelerator (ISCA '17)

# **Accelerator Characteristics**

- High compute density
- High on-chip memory bandwidth
- Distributed compute and memory resources
- Streaming interface between compute and memory
- Statically mapped and scheduled compute graph



On-chip networks play a critical role in:

- Energy efficiency (↓ data movement)
- Flexibility
- Scalability
- Compute utilization



| Architecture        | Communication frequency | Communication granularity | Limited by |
|---------------------|-------------------------|---------------------------|------------|
| Processor           | Infrequent              | Packet                    | Latency    |
| Spatial Accelerator | Frequent                | Fine-grained              | Throughput |





Outline:

- Motivation
- Network Design Space
- Compilation Flow
- Evaluation




# **Static Network**



| Pros                 | Cons                                 |
|----------------------|--------------------------------------|
| Guaranteed bandwidth | Low link utilization<br>P&R failures |

# **Dynamic Network**



| Pros         | Cons                          |
|--------------|-------------------------------|
| Link sharing | Limited bandwidth<br>Deadlock |

# **Hybrid Network: Static and Dynamic**



| Pros                                             | Cons                           |
|--------------------------------------------------|--------------------------------|
| Link sharing<br>More bandwidth<br>Guaranteed P&R | More area<br>More static power |





# **A DSL for Reconfigurable Accelerators**



- Annotate data size N
- Calculate loop iterations



# **Accelerator Compiler**

- Allocate compute and memory Virtual Blocks (VBs)
- Infer activation counts for logical links
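As a rough illustration of how activation counts might follow from the annotated data size, here is a minimal sketch. It assumes a logical link is activated once per iteration of every loop that encloses it. The `Loop` class, its `par` (parallelization) field, and the link names are hypothetical and not taken from the compiler.

```python
from dataclasses import dataclass
from math import ceil, prod


@dataclass
class Loop:
    """One loop level (hypothetical): `extent` elements, `par` processed per step."""
    extent: int
    par: int = 1

    @property
    def trips(self) -> int:
        # Number of iterations after parallelizing by `par`.
        return ceil(self.extent / self.par)


def activation_count(enclosing_loops):
    """Assumption: a link fires once per iteration of each enclosing loop."""
    return prod(loop.trips for loop in enclosing_loops)


N = 1024                         # annotated data size
outer = Loop(extent=N)           # N iterations
inner = Loop(extent=N, par=16)   # N / 16 iterations

counts = {
    "load_tile": activation_count([outer]),
    "mac_stream": activation_count([outer, inner]),
}
```

Links with larger counts, such as `mac_stream` here, are the ones a router would want to consider first.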






# Mapping

- Partition VB graph to meet hardware constraints
- Place and route the VB graph onto the network
- Allocate VCs for the dynamic network



- Start with random placement
- Route all links, in order of activation count
  - Build most efficient broadcast tree
  - Guarantee static network placement, if possible
  - Else, map the link to the dynamic network
- Re-place VBs with the highest routing cost
  - Dynamic network congestion
  - Average route length
  - Maximum route length
- Repeat routing
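The loop above can be sketched in a few lines. This is a toy version under several assumptions, not the compiler's implementation: routes are point-to-point X-then-Y paths on a square grid (no broadcast trees), each directed grid edge offers `static_tracks` static tracks, and a link that falls back to the dynamic network is charged an invented penalty equal to its activation count. All function names, parameters, and the example graph are hypothetical.

```python
import random
from collections import Counter


def xy_path(a, b):
    """Directed grid edges used by X-then-Y routing from slot a to slot b."""
    (x, y), (bx, by) = a, b
    edges = []
    while x != bx:
        nx = x + (1 if bx > x else -1)
        edges.append(((x, y), (nx, y)))
        x = nx
    while y != by:
        ny = y + (1 if by > y else -1)
        edges.append(((x, y), (x, ny)))
        y = ny
    return edges


def route(place, links, static_tracks):
    """Route links, busiest first. Returns (total cost, per-VB cost, dynamic links)."""
    used = Counter()      # static tracks consumed per directed edge
    vb_cost = Counter()   # routing cost charged to each VB
    dynamic, total = [], 0
    for src, dst, count in sorted(links, key=lambda link: -link[2]):
        path = xy_path(place[src], place[dst])
        if all(used[e] < static_tracks for e in path):
            for e in path:
                used[e] += 1            # reserve a static track on every hop
            cost = len(path)
        else:
            dynamic.append((src, dst))  # fall back to the dynamic network
            cost = len(path) + count    # assumed congestion penalty
        total += cost
        vb_cost[src] += cost
        vb_cost[dst] += cost
    return total, vb_cost, dynamic


def place_and_route(vbs, links, grid, static_tracks=1, iters=200, seed=0):
    rng = random.Random(seed)
    slots = [(x, y) for x in range(grid) for y in range(grid)]
    place = dict(zip(vbs, rng.sample(slots, len(vbs))))  # random initial placement
    best, vb_cost, dynamic = route(place, links, static_tracks)
    for _ in range(iters):
        worst = max(vbs, key=lambda v: vb_cost[v])  # VB with the highest routing cost
        target = rng.choice(slots)
        trial = dict(place)
        occupant = next((v for v, s in place.items() if s == target), None)
        if occupant is not None:
            trial[occupant] = place[worst]  # swap with whoever sits there
        trial[worst] = target
        cost, trial_vb_cost, trial_dynamic = route(trial, links, static_tracks)
        if cost < best:  # keep the move only if it lowers the total cost
            place, best = trial, cost
            vb_cost, dynamic = trial_vb_cost, trial_dynamic
    return place, best, dynamic


vbs = ["load", "mul", "add", "store"]
links = [("load", "mul", 4096), ("mul", "add", 4096), ("add", "store", 64)]
place, cost, dynamic = place_and_route(vbs, links, grid=3)
```

Sorting by activation count means the busiest links claim static tracks first, so only low-traffic links tend to end up in `dynamic`.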

#### Summary

- Iteratively reduce routing cost
- Map bandwidth-critical links onto the static network

# **Area and Energy Characterization**

- Synthesize switch and router RTL at 28 nm, 1 GHz
- Power simulation with PrimeTime
- Decompose power into:
  - Inactive (per-cycle)
  - Active (per-bit)



# Simulation

- Integrate simulator with DRAMSim and BookSim
- Track transmitted data in switches and routers
- Estimate per-app power with activity traces:

$$E_{\text{net}} = \sum_{\text{allocated}} P_{\text{inactive}} \, T_{\text{sim}} + E_{\text{flit}} \cdot \#\text{flit}$$
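The energy estimate above can be evaluated directly from an activity trace. The sketch below reads the sum as covering both terms for every allocated switch or router, which is an assumption. The `NetElement` record and the numbers in the example are placeholders, not characterized values.

```python
from dataclasses import dataclass


@dataclass
class NetElement:
    """An allocated switch or router (hypothetical record)."""
    p_inactive_w: float  # inactive power, in watts
    e_flit_j: float      # energy per transmitted flit, in joules
    flits: int           # flits counted in the activity trace


def network_energy(elements, t_sim_s):
    """E_net: inactive power times simulated time, plus per-flit energy times flit count."""
    return sum(e.p_inactive_w * t_sim_s + e.e_flit_j * e.flits for e in elements)


# Placeholder values for illustration only.
allocated = [
    NetElement(p_inactive_w=1e-3, e_flit_j=1e-12, flits=1000),
    NetElement(p_inactive_w=2e-3, e_flit_j=2e-12, flits=0),
]
e_net = network_energy(allocated, t_sim_s=1e-3)
```

Unallocated elements are simply left out of the list, so they contribute nothing to the total.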






#### **Benchmarks**

| Category       | Application                    |
|----------------|--------------------------------|
| Linear Algebra | Dot Product                    |
|                | Outer Product                  |
|                | Black-Scholes                  |
|                | GEMM                           |
| Database       | TPC-H Query 6                  |
| Clustering     | k-Means Clustering             |
| Inference      | Lattice Regression             |
|                | LSTM (RNN)                     |
|                | GRU (RNN)                      |
|                | LeNet (CNN)                    |
| Training       | Gaussian Discriminant Analysis |
|                | Logistic Regression            |
|                | Stochastic Gradient Descent    |

### **Benchmark Resource Usage**




# **Evaluated Design Space**

- Different network configurations
  - Static: flow control, bandwidth
  - Dynamic: VC count, flit width
  - Hybrid
- Different applications
- Different architectures
  - Pipelined (high throughput)
  - Scheduled (low throughput)

# **Evaluated Metrics**

- Performance (Perf)
- Area efficiency (1/Area)
- Performance per area (Perf/Area)
- Power efficiency (1/Power)
- Energy efficiency (Perf/Watt)

Reported values are the geomean across all applications, normalized to the worst network configuration.
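One way to compute such a normalized geomean is sketched below. It assumes every metric is expressed so that higher is better, and that the "worst" configuration is the one with the lowest geomean. Dividing geomeans is equivalent to taking the geomean of per-application ratios against that one reference configuration. The function names and sample data are hypothetical.

```python
from math import prod


def geomean(values):
    values = list(values)
    return prod(values) ** (1.0 / len(values))


def normalized_scores(metric):
    """metric[config][app] -> value (higher is better).

    Returns the per-config geomean across apps, scaled so the worst config is 1.0.
    """
    scores = {cfg: geomean(apps.values()) for cfg, apps in metric.items()}
    worst = min(scores.values())
    return {cfg: score / worst for cfg, score in scores.items()}


# Made-up numbers for two configurations and two applications.
perf = {
    "config_a": {"app_x": 1.0, "app_y": 4.0},
    "config_b": {"app_x": 4.0, "app_y": 4.0},
}
scores = normalized_scores(perf)
```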



# **Hybrid Network VCs and Flit Width**



Dynamic network flit width and VC count can be decreased with no performance loss.

# Static vs. Dynamic vs. Hybrid



The dynamic network performs poorly on compute-bound applications due to insufficient bandwidth.

# **Most Efficient Network Configurations**



The hybrid network reduces data movement by using a dynamic network as an escape path.

# **Most Efficient Network Configurations**

#### **Pipelined Architecture**



A hybrid network improves energy efficiency by **1.8x** with performance similar to a static network. Performance varies by up to **7x** between the best and worst network configurations.

# Conclusion

- Network performance correlates strongly with *bandwidth* for spatial accelerators
- Bandwidth scales more efficiently on a static network
- A hybrid (large static, small dynamic) network:
  - Eliminates place and route failure
  - Improves perf/watt

# Thank You!

#### **Static Network: Flow Control**



**End-to-end Flow Control** 

Back Pressure



**Per-hop Flow Control** 

#### **Static Network: Bandwidth**



We vary the number of links between switches.

#### **Dynamic Network**



We vary the number of Virtual Channels (VCs) and flit width.

#### **Static Network Bandwidth**





3x static network bandwidth

Bandwidth strongly impacts accelerator performance.

# **Static Network Flow Control**

#### **Credit-Based vs. Per-Hop**





Credit-based flow control has **3x** lower performance.

### **Accelerator Model**

- Pool of compute and memory resources
- Compute:
  - SIMD pipeline, or
  - Vector processor with a small instruction window

![](_page_57_Figure_5.jpeg)

# **Statically Routed Dynamic Network**

- Streaming protocol requires in-order transmission
  - Can't use adaptive or oblivious routing
  - Can't drop packets
- Routes are looked up in a table at runtime
  - Route to multiple outputs for efficient broadcast links
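A minimal sketch of table-driven routing with multicast, assuming each packet carries a flow identifier and each router holds a table filled in at compile time. The `Router` class, the port names, and the flow id are hypothetical. Because a flow always takes the same table entry and each output port is a FIFO, packets of that flow leave every port in the order they arrived.

```python
from collections import defaultdict


class Router:
    """Toy router: a flow id selects a fixed list of output ports."""

    def __init__(self):
        self.table = {}               # flow id -> output ports, set by the compiler
        self.out = defaultdict(list)  # per-port FIFO of forwarded packets

    def forward(self, flow_id, payload):
        # One table lookup per packet; several ports means a broadcast link.
        for port in self.table[flow_id]:
            self.out[port].append((flow_id, payload))


router = Router()
router.table[7] = ["east", "local"]  # flow 7 is replicated to two outputs
for word in range(3):
    router.forward(7, word)
```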

# **Performance Scaling**

![](_page_59_Figure_1.jpeg)

# **Key Design Challenges**

![](_page_60_Figure_1.jpeg)