# Serving Recurrent Neural Networks Efficiently with a Spatial Accelerator

Tian Zhao, Yaqi Zhang, Kunle Olukotun

Stanford University



#### **RNNs are Popular Data Center Workloads**

Machine Learning Workload at Google



How to design an efficient hardware accelerator for all the RNN kernels?



## **RNNs Are Hard to Serve Efficiently**

- RNN kernels contain complex dataflow.
- RNN sizes can vary a lot over different problems.



LSTM Example

#### **RNN Kernels Contain Complex Dataflow**

[Figure: LSTM dataflow. Input weight matrices $W_f$, $W_o$, $W_z$, $W_i$ ($D \times H$); recurrent weight matrices $U_f$, ... ($H \times H$); bias vectors $b_o$, $b_f$, $b_i$, $b_z$ (length $H$). $H$: number of hidden units; $D$: number of input features.]
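For reference, the dataflow in the LSTM example corresponds to the standard LSTM cell equations, written here as a sketch in the slide's notation. Several things are assumptions rather than content from the slide: the row-vector convention (inferred from the $D \times H$ weight shape), the recurrent matrices $U_i$, $U_o$, $U_z$ (only $U_f$ is legible), the cell state $c_t$, and the reading of $z$ as the candidate (block input) gate.

$$
\begin{aligned}
i_t &= \sigma(x_t W_i + h_{t-1} U_i + b_i) \\
f_t &= \sigma(x_t W_f + h_{t-1} U_f + b_f) \\
o_t &= \sigma(x_t W_o + h_{t-1} U_o + b_o) \\
z_t &= \tanh(x_t W_z + h_{t-1} U_z + b_z) \\
c_t &= f_t \odot c_{t-1} + i_t \odot z_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The first four lines are matrix-vector multiplications followed by a nonlinearity; the last two are purely element-wise, which is where the two kinds of kernels meet.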







#### **RNN Sizes Can Vary over Different Problems**

| Tasks                      | RNN Type                      | RNN Size |
|----------------------------|-------------------------------|----------|
| Sequence<br>Classification | Long Short-term Memory (LSTM) | 128      |
| Speech<br>Recognition      | Gated Recurrent Unit (GRU)    | 2816     |

#### Accelerators with BLAS Abstraction

| BLAS Level        | Example Operation                  | Accelerator Example                          |
|-------------------|------------------------------------|----------------------------------------------|
| 2                 | Matrix Vector Multiplication (MVM) | Brainwave Neural<br>Processing Unit (BW NPU) |
| 3                 | Matrix Matrix Multiplication (MMM) | Tensor Processing Unit<br>(TPU)              |

# Is BLAS the right ISA for accelerators?

- Programmability (+)
- Efficiency on
  - individual kernel (+)
  - end-to-end task (-)







Matrix Vector Multiplication

**Element-wise Operation** 










# Intermediate Results Buffered in On-chip Scratchpad





#### Frequent Access to the On-chip Scratchpad



**BLAS** abstraction leads to hardware underutilization caused by misalignment.
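One way to see the misalignment effect is to count how many multiply-accumulate slots do useful work when a matrix is padded up to whole hardware tiles. The sketch below is a simplified model, not a description of any particular accelerator: the function name `mvm_utilization` and the 512 x 512 tile are hypothetical, and the matrix sizes are the RNN sizes from the table above, treated as square $H \times H$ matrices.

```python
import math

def mvm_utilization(rows, cols, tile_rows, tile_cols):
    """Fraction of MAC slots doing useful work when a rows x cols
    matrix is padded up to a whole number of hardware tiles."""
    padded_rows = math.ceil(rows / tile_rows) * tile_rows
    padded_cols = math.ceil(cols / tile_cols) * tile_cols
    return (rows * cols) / (padded_rows * padded_cols)

# Hypothetical 512 x 512 MVM tile; sizes taken from the RNN size table.
print(mvm_utilization(128, 128, 512, 512))    # 0.0625
print(mvm_utilization(2816, 2816, 512, 512))  # below 1.0: 2816 is not a multiple of 512
```

Under this model a small RNN wastes most of a coarse tile, and even a large one loses capacity whenever its dimensions are not a multiple of the tile size.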

# Alternative: Loop-level abstraction











#### Intermediate Results Buffered in Register



# Fine Grain Tiling along the Hidden Unit Dimension





#### Fine Grain Tiling Converts MVM to DP
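The contrast between the two abstractions can be sketched in plain Python. This is an illustrative model, not the Spatial implementation from the talk: the function names, the gate order `i, f, o, z`, and the tile size of one hidden unit are all assumptions. `lstm_step_mvm` finishes every matrix-vector product before any element-wise work, so four length-$H$ vectors must be buffered. `lstm_step_loop` tiles along the hidden dimension, so each gate becomes a dot product (DP) and only four scalars are live at a time.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def col(M, j):
    """Column j of a matrix stored as a list of rows."""
    return [row[j] for row in M]

def lstm_step_mvm(x, h, c, W, U, b):
    """BLAS-style step: finish every MVM, then run the element-wise ops.

    W[g] is D x H, U[g] is H x H, b[g] has length H; gate order g = i, f, o, z.
    """
    H = len(c)
    # Four H-long pre-activation vectors are fully materialized first.
    pre = [[dot(x, col(W[g], j)) + dot(h, col(U[g], j)) + b[g][j]
            for j in range(H)] for g in range(4)]
    c_new = [sigmoid(pre[1][j]) * c[j] + sigmoid(pre[0][j]) * math.tanh(pre[3][j])
             for j in range(H)]
    h_new = [sigmoid(pre[2][j]) * math.tanh(c_new[j]) for j in range(H)]
    return h_new, c_new

def lstm_step_loop(x, h, c, W, U, b):
    """Loop-level step: tile along the hidden dimension, one unit per tile."""
    h_new, c_new = [], []
    for j in range(len(c)):
        # Each gate of hidden unit j is one dot product; only 4 scalars are live.
        i, f, o, z = (dot(x, col(W[g], j)) + dot(h, col(U[g], j)) + b[g][j]
                      for g in range(4))
        c_j = sigmoid(f) * c[j] + sigmoid(i) * math.tanh(z)
        c_new.append(c_j)
        h_new.append(sigmoid(o) * math.tanh(c_j))
    return h_new, c_new
```

Both functions compute the same step. They differ only in loop order and in how much intermediate state exists between the multiply and element-wise phases.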



# Fine Grain Tiling Uses Cheaper Memory Elements



# Loop Abstraction Enables Fine Grain Tiling to:

- reduce hardware underutilization due to misalignment.
- reduce the size of the intermediate buffers.

**BLAS** abstraction leads to a heterogeneous accelerator design that contains an unbalanced pipeline.

# Pipelining the RNN Serving Task



#### Heterogeneous vs. Homogeneous Accelerators









A heterogeneous accelerator will have an unbalanced pipeline with respect to different problems.







A homogeneous accelerator can achieve a balanced pipeline regardless of problem size.
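A toy steady-state model can illustrate the balance argument. Everything in it is an assumption made for illustration: the two-stage split (MVM and element-wise), the per-step operation counts in `lstm_work`, the budget of 1000 units, and the use of fractional units. A heterogeneous design is modeled as a unit split fixed at design time; a homogeneous design is modeled as one that can re-split the same units for each problem.

```python
def pipeline_utilization(work, units):
    """Fraction of total compute capacity doing useful work in steady state.

    work[s]  : operations stage s performs per RNN step
    units[s] : compute units assigned to stage s (one op per unit per cycle)
    """
    cycles = max(w / u for w, u in zip(work, units))  # slowest stage sets the pace
    return sum(work) / (sum(units) * cycles)

def proportional_allocation(work, total_units):
    """Split units across stages in proportion to each stage's work."""
    total_work = sum(work)
    return [total_units * w / total_work for w in work]

def lstm_work(hidden, features):
    """Hypothetical per-step op counts: [MVM multiply-adds, element-wise ops]."""
    mvm = 4 * hidden * (features + hidden)
    elementwise = 10 * hidden  # assumed constant per hidden unit, illustration only
    return [mvm, elementwise]

small = lstm_work(128, 128)    # sizes borrowed from the RNN size table,
large = lstm_work(2816, 2816)  # with D = H assumed

fixed_units = proportional_allocation(large, 1000)  # sized once, for the large problem
print(pipeline_utilization(large, fixed_units))      # balanced for the problem it was sized for
print(pipeline_utilization(small, fixed_units))      # element-wise stage becomes the bottleneck
print(pipeline_utilization(small, proportional_allocation(small, 1000)))  # re-balanced
```

In this model, the fixed split is balanced only for the problem it was sized for, while re-splitting the same units keeps the pipeline balanced at either size.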

# **Evaluation Configurations**

| Specification           | Tesla V100 GPU        | Stratix 10 FPGA | Plasticine CGRA |
|-------------------------|-----------------------|-----------------|-----------------|
| Programming<br>Language | TensorFlow +<br>cuDNN | Brainwave ISA   | Spatial Lang.   |
| Accelerator Type        | Temporal              | Spatial         | Spatial         |
| ISA Type                | MMM                   | MVM             | Loop            |
| Implementation Type     | Heterogeneous         | Heterogeneous   | Homogeneous     |

# **Evaluation Configurations**

| Specification            | Tesla V100 GPU | Stratix 10 FPGA | Plasticine CGRA |
|--------------------------|----------------|-----------------|-----------------|
| Peak 32-bit TFLOPS       | 15.7           | 10              | 12.5            |
| Technology (nm)          | 12             | 14              | 28              |
| Die Area (mm²)           | 815            | 1200            | 494             |
| TDP (W)                  | 300            | 148             | 160             |
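The table can be reduced to peak-throughput densities with simple arithmetic. The sketch below uses only the figures in the table; the dictionary layout and the function name `peak_density` are illustrative. These are peak ratios, not measured results, and they are not normalized for the different technology nodes.

```python
specs = {
    "Tesla V100 GPU":  {"tflops": 15.7, "area_mm2": 815,  "tdp_w": 300},
    "Stratix 10 FPGA": {"tflops": 10.0, "area_mm2": 1200, "tdp_w": 148},
    "Plasticine CGRA": {"tflops": 12.5, "area_mm2": 494,  "tdp_w": 160},
}

def peak_density(spec):
    """Peak 32-bit GFLOPS per watt of TDP and per mm^2 of die area."""
    return {
        "gflops_per_w": 1000 * spec["tflops"] / spec["tdp_w"],
        "gflops_per_mm2": 1000 * spec["tflops"] / spec["area_mm2"],
    }

for name, spec in specs.items():
    d = peak_density(spec)
    print(f"{name}: {d['gflops_per_w']:.1f} GFLOPS/W, {d['gflops_per_mm2']:.1f} GFLOPS/mm^2")
```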

### **Evaluation on DeepBench**

**FLOPS** Utilization



#### Improvement over CPU Baseline



# Homogeneous accelerators with loop-level abstraction achieve better HW utilization