

# Specialization in Hardware Architectures for Deep Learning

Michaela Blott, Distinguished Engineer, June 2021



#### Background

#### Xilinx

- Fabless semiconductor company, founded in Silicon Valley in 1984
- Invented the FPGA

#### Xilinx Research Dublin

- ~10 researchers plus university program
- Plus typically 4-6 interns

#### Focus: FPGAs in Machine Learning

Building systems, architectural exploration, algorithmic optimizations, benchmarking

#### In collaboration with partners, customers and universities



Lucian Petrica, Giulio Gambardella, Alessandro Pappalardo, Ken O'Brien, Nick Fraser, Yaman Umuroglu, Michaela Blott + Kees Vissers



## What are FPGAs?

#### Customizable, Programmable Hardware Architectures

- The chameleon amongst the semiconductors



- Customizes IO interfaces, compute architectures, memory subsystems to meet the application
- Use case: nothing else meets the requirements and you want to avoid an ASIC implementation; or ASIC emulation








# Why do we need specialization in hardware architectures for Deep Learning?

- DNNs bring huge potential and are penetrating many applications
- Associated compute and memory requirements are huge
- Compute requirements are outpacing Moore's Law
- Hitting the physical limits of silicon-based computing
- Architectural innovation needed



#### **Explosion of Innovative Approaches**







## **Deep Learning Processor Architectures**



#### **Specialization, Performance & Flexibility**

Increasing specialization, performance and efficiency: generic, slower -> co-designed, specialized, faster, more efficient

| Matrix of Processing Engines (MPE) | Spatial Processors (SP)   |
|------------------------------------|---------------------------|
| Customized for DNNs in general     | Customized for topologies |




## Matrix of Processing Engines (MPE): Customizing for DNNs in General

- Popular layer-by-layer compute
- Batching to achieve high compute efficiency
- Customized for ML in general
- Specialized processing engines
  - Operators
  - ALU types
    - tensor-, matrix- or vector-based
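The layer-by-layer, batched execution style listed above can be sketched in a few lines. This is a minimal illustration, not a model of any particular accelerator: the tiny two-layer network, the function names and the "weight fetch" counter are all assumptions made for the example. It shows one common reason batching helps: each layer's weights are loaded once and reused for every sample in the batch.

```python
def relu_matvec(weights, x):
    """One dense layer applied to one input vector: y = relu(W x)."""
    return [max(0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def run_layer_by_layer(layers, batch):
    """Process a whole batch one layer at a time on a shared engine.

    Returns the outputs and the number of weight values fetched.
    """
    weight_fetches = 0
    activations = batch
    for weights in layers:
        # The layer's weights are loaded once and reused for every sample.
        weight_fetches += sum(len(row) for row in weights)
        activations = [relu_matvec(weights, x) for x in activations]
    return activations, weight_fetches


# Hypothetical 2-layer MLP and a batch of two inputs.
layers = [[[1, -1], [2, 0]], [[1, 1]]]
batch = [[1, 2], [3, 1]]

outputs, fetches = run_layer_by_layer(layers, batch)
print(outputs, "weight fetches per inference:", fetches / len(batch))
```

With a batch of one, the same network needs all six weight values fetched for a single inference; with a batch of two, the cost per inference halves.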








## Spatial Processors (SP): Customizing for Specific Topologies

- Hardware instantiates the topology as a dataflow architecture
  - Customize everything to the **specifics of the given DNN**, any operation, any connectivity
- Benefits:
  - Improved efficiency
  - Low fixed latency
- Scale performance & resources to meet the application requirements
  - If resources allow, the design can be completely unfolded into a circuit that produces one inference per clock cycle and thereby meets these new throughput requirements
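The dataflow idea above, one dedicated stage per layer with samples streaming through, can be mimicked in software with chained generators. This is only a structural sketch under stated assumptions: the network, `layer_stage` and `build_dataflow` are hypothetical names for illustration, and Python generators run sequentially, so they model the streaming structure but not the parallelism of real hardware.

```python
def layer_stage(stream, weights):
    """A dedicated pipeline stage for one layer: consumes and produces a stream."""
    for x in stream:
        yield [max(0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def build_dataflow(source, layers):
    """Chain one stage per layer, mirroring the topology of the network."""
    stream = source
    for weights in layers:
        stream = layer_stage(stream, weights)
    return stream


# Hypothetical 2-layer MLP; each sample flows through all stages in turn.
layers = [[[1, -1], [2, 0]], [[1, 1]]]
samples = iter([[1, 2], [3, 1]])

for result in build_dataflow(samples, layers):
    print(result)
```

No batching is needed here: each sample enters the pipeline as soon as it arrives, which is what gives the low, fixed latency noted above.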



## SPs can scale performance, reduce latency and provide improved efficiency



Allocated resources ~ compute requirement per layer of the DNN
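A small sketch of why resources are allocated in proportion to each layer's compute requirement: in a pipelined dataflow design, throughput is set by the slowest stage, so a balanced allocation avoids both idle and bottleneck layers. The per-layer MAC counts, the processing-element (PE) budget and both helper functions are hypothetical, and the rounding used here does not guarantee the budget is met exactly in general.

```python
import math


def allocate_pes(macs_per_layer, total_pes):
    """Split a PE budget across layers in proportion to their MAC counts."""
    total_macs = sum(macs_per_layer)
    return [max(1, round(total_pes * m / total_macs)) for m in macs_per_layer]


def cycles_per_inference(macs_per_layer, pes_per_layer):
    """Steady-state pipeline interval: limited by the slowest layer."""
    return max(math.ceil(m / p) for m, p in zip(macs_per_layer, pes_per_layer))


# Hypothetical network: MACs per layer, and a budget of 80 PEs.
macs = [1000, 4000, 3000]

proportional = allocate_pes(macs, 80)
uniform = [26, 27, 27]  # the same budget spread evenly

print("proportional:", proportional, cycles_per_inference(macs, proportional))
print("uniform:     ", uniform, cycles_per_inference(macs, uniform))
```

In this example the proportional split gives every stage the same interval, while the even split is held back by the largest layer.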



# **Customizing Arithmetic**



## **Customizing Arithmetic to Minimum Precision Required**

Shrinks hardware cost & scales performance

- Instantiate n-times more compute within the same fabric, thereby scale performance n-times
- 8b/8b -> 1b/1b, RTL => 70x



Hardware cost: C = size of accumulator \* size of weight \* size of activation
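The cost model above can be evaluated directly to see how quickly cost shrinks with precision. The accumulator widths below are assumptions chosen for illustration only (the slide does not state them), so the resulting ratio is not meant to reproduce the measured RTL figure.

```python
def mac_cost(acc_bits, weight_bits, act_bits):
    """Relative hardware cost: C = accumulator size * weight size * activation size."""
    return acc_bits * weight_bits * act_bits


# Assumed accumulator widths (hypothetical): 32 bits for 8b/8b, 16 bits for 1b/1b.
cost_8b = mac_cost(acc_bits=32, weight_bits=8, act_bits=8)
cost_1b = mac_cost(acc_bits=16, weight_bits=1, act_bits=1)

print("8b/8b cost:", cost_8b)
print("1b/1b cost:", cost_1b)
print("reduction factor:", cost_8b / cost_1b)
```

Under this model, a reduction factor of n means roughly n times more compute units fit in the same fabric.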


#### Potential to reduce memory footprint and avoid memory bottleneck

- DNN inference is typically memory bound
- DNN model can stay on-chip

#### Inherently saves power

| Operation           | Energy (pJ) |
|---------------------|-------------|
| 8b Add              | 0.03        |
| 16b Add             | 0.05        |
| 32b Add             | 0.1         |
| 16b FP Add          | 0.4         |
| 32b FP Add          | 0.9         |
| 8b Mult             | 0.2         |
| 32b Mult            | 3.1         |
| 16b FP Mult         | 1.1         |
| 32b FP Mult         | 3.7         |
| 32b SRAM Read (8KB) | 5           |
| 32b DRAM Read       | 640         |

Relative energy cost spans a logarithmic scale from 1 to 10000.

[Adapted from Horowitz. Computing's Energy Problem (and what we can do about it), ISSCC'14]
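To make the relative costs concrete, the energy figures from the table can be compared directly. The values are taken from the table; the dictionary and helper name are just for illustration.

```python
# Energy per operation in picojoules, as listed in the table above.
ENERGY_PJ = {
    "8b Add": 0.03,
    "16b Add": 0.05,
    "32b Add": 0.1,
    "16b FP Add": 0.4,
    "32b FP Add": 0.9,
    "8b Mult": 0.2,
    "32b Mult": 3.1,
    "16b FP Mult": 1.1,
    "32b FP Mult": 3.7,
    "32b SRAM Read (8KB)": 5,
    "32b DRAM Read": 640,
}


def relative_cost(op, baseline="8b Add"):
    """Energy of `op` expressed as a multiple of the baseline operation."""
    return ENERGY_PJ[op] / ENERGY_PJ[baseline]


print("32b FP Mult vs 8b Mult:", relative_cost("32b FP Mult", "8b Mult"))
print("DRAM read vs SRAM read:", relative_cost("32b DRAM Read", "32b SRAM Read (8KB)"))
print("DRAM read vs 8b Add:   ", relative_cost("32b DRAM Read"))
```

The comparison shows both effects discussed here: lower precision makes each operation cheaper, and keeping data on-chip avoids DRAM reads, which dominate everything else in the table.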

Customized arithmetic brings performance, resource, memory and energy benefits.

Requires co-design (retraining of CNNs).

| Precision | Model size in MB (ResNet50)   |
|-----------|-------------------------------|
| 1b        | 3.2                           |
| 8b        | 25.5                          |
| 32b       | 102.5                         |
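The model sizes in the table follow from the parameter count times the bits per weight. The sketch below assumes roughly 25.6 million parameters for ResNet50 and 1 MB = 10^6 bytes; both are assumptions for illustration, so the results match the table only to within rounding.

```python
def model_size_mb(num_params, bits_per_weight):
    """Weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


# Assumed parameter count for ResNet50 (approximate, not stated in the source).
RESNET50_PARAMS = 25.6e6

for bits in (1, 8, 32):
    print(f"{bits}b: {model_size_mb(RESNET50_PARAMS, bits):.1f} MB")
```

At 1b the weights take a few megabytes, which is what makes it feasible for the model to stay on-chip.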



## **Granularity of Customizing Arithmetic**









# Challenge

How can we enable a broader spectrum of end-users to be able to specialize hardware architectures and co-design solutions?





- Providing tools and platforms for exploration of CNN compute architectures
- End-to-end flow
  - ML engineers can create specialized hardware architectures on an FPGA
    - with spatial architectures and custom precision
- Open source https://xilinx.github.io/finn
  - Transparency and flexibility for the fast changing landscape of algorithms
    - if not supported, you can add your own



## **From CNN to FPGA Deployment**









#### Many Use Cases, Platforms, Datasets and Topologies

- Many embedded and server-class platforms
- Multi-FPGA and single-node



#### Many more applications

- Radio modulation classification
- Speech recognition
- Facemask detection
- Object recognition with prosthetic hands
- Optical character recognition
- Playing card recognition for a solitaire-playing robot arm
- Many topologies
  - MLPs, CNV, Yolo variants
  - MobileNetv1 & ResNet50
  - LSTM
  - QuartzNet in progress





#### **Status & Results**



Looking to grow the community and build up industrial applications. If you would like to collaborate, we'd love to hear from you.



## Results


#### **Deep Network Intrusion Detection System**



**Goal:** Implement an **NN-based traffic classifier** delivering 100G **line-rate** throughput (= 150 Mips). Latency sensitive (buffer 10s of MB/msec).

[1] Moustafa, Nour, and Jill Slay. "UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set)." Military Communications and Information Systems Conference (MilCIS), 2015. IEEE, 2015.

[2] Murovič, Tadej, and Andrej Trost. "Massively parallel combinational binary neural networks for edge processing." Elektrotehniski Vestnik 86.1/2 (2019): 47-53.





- >1000x performance improvement over Vitis AI, with fewer resources
- 100 Gbps line rate (150 Mips)
- Exploits: dataflow processing, reduced precision, fine-grained sparsity
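A back-of-the-envelope check of the 100G figures, as a sketch under stated assumptions: one inference per packet, minimum-size Ethernet frames (64 bytes plus 20 bytes of preamble and inter-packet gap on the wire), and buffering sized as line rate times latency. These assumptions are a plausible reading of the numbers, not something the slides spell out.

```python
def packets_per_second(line_rate_bps, frame_bytes=64, overhead_bytes=20):
    """Packet rate at line rate for fixed-size frames plus per-frame wire overhead."""
    return line_rate_bps / ((frame_bytes + overhead_bytes) * 8)


def buffer_bytes(line_rate_bps, latency_s):
    """Data arriving at line rate while a result is still pending."""
    return line_rate_bps / 8 * latency_s


LINE_RATE = 100e9  # 100 Gbps

print(f"{packets_per_second(LINE_RATE) / 1e6:.1f} million packets/s")
print(f"{buffer_bytes(LINE_RATE, 1e-3) / 1e6:.1f} MB buffered per ms of latency")
```

Under these assumptions the packet rate comes out close to the 150 million inferences per second quoted above, and every millisecond of classifier latency costs on the order of 10 MB of buffering.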





# Summary





A spectrum of innovative architectures is emerging to address upcoming compute and memory requirements in DNNs

Specialization of hardware architectures is critical to scaling

- In particular for extreme-throughput applications, as we see for example in communications

The NIDS example showed the tremendous benefits we get from quantization and spatial implementations

## **Infrastructure for Experimentation & Collaboration**

- Xilinx academic compute clusters (XACC)
  - 4 centres world-wide
  - Free to use


- Enabling research community
- Explore innovative compute architectures
- Flexibility, networked FPGAs



Many examples emerging: https://xilinx.github.io/xacc/




# Thank You





