# Wafer-Scale Hardware for ML and Beyond

Rob Schreiber, Cerebras Systems, Inc.

2nd International Workshop On Machine Learning Hardware (IWMLH), July 2021



### Moore's Law

- transistors per mm<sup>2</sup>
- mm<sup>2</sup> per chip



# Cerebras Wafer Scale Engine



**Cerebras WSE-2** 2.6 Trillion Transistors 46,225 mm<sup>2</sup> Silicon

Cerebras Systems © 2021



**Largest GPU** 54.2 Billion Transistors 826 mm<sup>2</sup> Silicon

|                  | Cerebras WSE-2   | A100                      | Cerebras Advantage |
|------------------|------------------|---------------------------|--------------------|
| Chip size        | 46,225 mm²       | 826 mm²                   | 56 X               |
| Cores            | 850,000          | 6912 CUDA + 432 Tensor    | 123 X              |
| On-chip memory   | 40 Gigabytes     | 40 Megabytes              | 1,000 X            |
| Memory bandwidth | 20 Petabytes/sec | 1555 Gigabytes/sec        | 12,733 X           |
| Fabric bandwidth | 220 Petabits/sec | 600 Gigabytes/sec         | 45,833 X           |
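The advantage column can be recomputed from the two value columns. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes) and converts the A100 fabric figure from bytes to bits; the dictionary keys are made up for illustration. Under these assumptions the core ratio comes out near 116 when both A100 core counts are summed (123 X corresponds to 6912 cores alone), and the memory bandwidth ratio comes out near 12,862, so the slide presumably used slightly different inputs for those two rows.

```python
# Values copied from the comparison table, normalized to common units.
wse2 = {
    "chip_mm2": 46_225,
    "cores": 850_000,
    "mem_bytes": 40e9,           # 40 GB (decimal units assumed)
    "mem_bw_bytes_s": 20e15,     # 20 PB/s
    "fabric_bits_s": 220e15,     # 220 Pb/s
}
a100 = {
    "chip_mm2": 826,
    "cores": 6912 + 432,         # both core counts from the table, summed
    "mem_bytes": 40e6,           # 40 MB
    "mem_bw_bytes_s": 1555e9,    # 1555 GB/s
    "fabric_bits_s": 600e9 * 8,  # 600 GB/s converted to bits/s
}

# Ratio of each WSE-2 figure to the matching A100 figure.
ratios = {key: wse2[key] / a100[key] for key in wse2}

for key, ratio in ratios.items():
    print(f"{key:>16}: {ratio:,.0f} X")
```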



# CS-1 Chassis





- Cross-wafer connectivity
- Yield
- Power and cooling
- Thermal expansion



#### Cross-Die Wires

- Developed with TSMC
- Uniform bandwidth across wafer



## Redundancy

- Extra rows
- Logical 2D mesh



## Yes, we can build wafer-scale systems

## What did we put on the wafer?

- All the memory
- Fine-grained parallelism
- Shared nothing
- Power-efficient, general-purpose core



On the Wafer:

- Huge compute
- Huge memory and communication bandwidth
- Great flops/watt
- 40 GB of SRAM memory

[Figure: Cerebras placement visualizer output for `vgg_final`, generated Oct 15, 2017]







#### **Detailed Routing**



#### Scope of CANDLE Deep Learning

- RAS Pathway: unsupervised learning coupled with multi-scale molecular simulations
- Drug Response: supervised learning augmented by stochastic pathway modeling and experimental design
- Treatment Strategy: semi-supervised learning, scalable data analysis, and agent-based simulations on population-scale data

# @LLNL: Inertial confinement fusion model



#### CPU and GPU performance





#### On the CS-1: 3D mesh --> 2D machine





#### BiCGStab: Building Blocks
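As a reminder of what those building blocks are, here is a minimal, unpreconditioned BiCGStab sketch in plain Python. It is the textbook algorithm, not the wafer implementation: each iteration needs two matrix-vector products, a handful of dot products (each one a global reduction on a parallel machine), and several AXPY-style vector updates. The helper names `dot`, `axpy`, and `tridiag_matvec` and the small test matrix are assumptions chosen for illustration.

```python
import math

def dot(u, v):
    """Inner product; on a parallel machine this is a global reduction."""
    return sum(a * b for a, b in zip(u, v))

def axpy(a, x, y):
    """Return a*x + y elementwise."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def bicgstab(matvec, b, tol=1e-10, max_iter=200):
    """Solve A x = b given only matvec(v) = A v. Returns (x, iterations)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                 # residual b - A*x for the initial guess x = 0
    r0 = list(r)                # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    for it in range(1, max_iter + 1):
        rho_new = dot(r0, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = axpy(beta, axpy(-omega, v, p), r)   # p = r + beta*(p - omega*v)
        v = matvec(p)                           # first matrix-vector product
        alpha = rho_new / dot(r0, v)
        s = axpy(-alpha, v, r)                  # s = r - alpha*v
        if math.sqrt(dot(s, s)) < tol:          # converged at the half step
            return axpy(alpha, p, x), it
        t = matvec(s)                           # second matrix-vector product
        omega = dot(t, s) / dot(t, t)
        x = axpy(omega, s, axpy(alpha, p, x))   # x = x + alpha*p + omega*s
        r = axpy(-omega, t, s)                  # r = s - omega*t
        rho = rho_new
        if math.sqrt(dot(r, r)) < tol:
            return x, it
    return x, max_iter

def tridiag_matvec(u):
    """Illustrative operator: tridiagonal matrix with 4 on the diagonal, -1 off it."""
    n = len(u)
    return [4.0 * u[i]
            - (u[i - 1] if i > 0 else 0.0)
            - (u[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]

# Build a right-hand side whose exact solution is the all-ones vector.
b = tridiag_matvec([1.0] * 8)
x, iterations = bicgstab(tridiag_matvec, b)
```

Passing the operator as a function keeps the solver independent of how the matrix is stored, which is why the sparse matrix-vector product and the reduction are the pieces worth optimizing separately.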



# Interprocessor Communication

- The wafer is a dataflow computer:
  - Pre-routed virtual channels ("colors")
  - Single-word packets
  - Single-clock latency
  - Arrival triggers a task
  - Data arrives in registers
  - 24 colors
  - Link-level flow control
  - Communication in the ISA
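The "arrival triggers a task" idea can be pictured with a toy software model. This is not the Cerebras SDK or its API: `ToyCore`, `bind`, `route`, `send`, and `deliver` are hypothetical names invented for this sketch, and it ignores timing and flow control. Each core binds a task to a color, routes are fixed before any data moves, and delivering one word on a color runs the bound task.

```python
NUM_COLORS = 24  # number of virtual channels, as stated on the slide

class ToyCore:
    """Hypothetical model of one core: tasks are bound to colors."""

    def __init__(self, name):
        self.name = name
        self.tasks = {}    # color -> task to run when a word arrives
        self.routes = {}   # color -> downstream core (fixed before running)
        self.acc = 0.0     # local state updated by tasks

    def _check(self, color):
        if not 0 <= color < NUM_COLORS:
            raise ValueError(f"color {color} out of range")

    def bind(self, color, task):
        self._check(color)
        self.tasks[color] = task

    def route(self, color, dest):
        self._check(color)
        self.routes[color] = dest

    def send(self, color, word):
        # Forward a single word along the pre-set route for this color.
        self.routes[color].deliver(color, word)

    def deliver(self, color, word):
        # Arrival of a word on a color triggers the task bound to it.
        self.tasks[color](self, word)

def scale_and_forward(core, word):
    core.send(1, 2.0 * word)   # compute on the arriving word, pass it on

def accumulate(core, word):
    core.acc += word           # consume the arriving word

a, b = ToyCore("a"), ToyCore("b")
a.bind(0, scale_and_forward)
a.route(1, b)
b.bind(1, accumulate)

for w in (1.0, 2.0, 3.0):
    a.deliver(0, w)

print(b.acc)  # 12.0
```

No core ever polls for data in this model; computation happens only as a consequence of a word arriving, which is the dataflow property the list above describes.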



## All-reduce in 1.3 $\mu$s
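The slide does not say how the all-reduce is scheduled. One common pattern on a 2D mesh, shown here purely as an assumption, is to reduce along each row, reduce the row results along one column, and then broadcast the total back the same way. The function name `mesh_allreduce` and the hop count, which assumes nearest-neighbor links and a corner root, are illustrative.

```python
def mesh_allreduce(grid):
    """Sum one value per core over a rows x cols mesh; every core gets the total.

    Returns (result_grid, hops), where hops is the longest nearest-neighbor
    path for reducing to a corner core and broadcasting back from it.
    """
    rows, cols = len(grid), len(grid[0])
    row_sums = [sum(row) for row in grid]              # phase 1: reduce each row
    total = sum(row_sums)                              # phase 2: reduce one column
    result = [[total] * cols for _ in range(rows)]     # phase 3: broadcast back
    hops = 2 * ((rows - 1) + (cols - 1))
    return result, hops

grid = [[1, 2, 3],
        [4, 5, 6]]
result, hops = mesh_allreduce(grid)
print(result, hops)  # [[21, 21, 21], [21, 21, 21]] 6
```

With this schedule the hop count grows with the mesh perimeter rather than the number of cores, so low per-hop latency translates directly into a fast reduction.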



#### Sparse matrix vector product via vector operations and dataflow
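A sketch of how such a product can look for a 7-point stencil on a 3D mesh mapped to a 2D machine. It assumes, and the slides do not state this explicitly, that the core at mesh position (x, y) holds the entire z-column for that position, so the z-direction terms are local shifted-vector operations while the x and y terms use whole columns arriving from the four mesh neighbors. The coefficients (6 on the diagonal, -1 off it) and the name `stencil_spmv` are illustrative.

```python
def stencil_spmv(u, diag=6.0, off=-1.0):
    """Apply a 7-point stencil to u[x][y][z], with zero values outside the mesh.

    Each (x, y) pair plays the role of one core owning a full z-column.
    """
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    zero = [0.0] * nz
    out = [[None] * ny for _ in range(nx)]
    for x in range(nx):
        for y in range(ny):
            col = u[x][y]
            # Local part: z-neighbors are shifted copies of the core's own column.
            up = col[1:] + [0.0]
            down = [0.0] + col[:-1]
            acc = [diag * c + off * (a + b) for c, a, b in zip(col, up, down)]
            # Remote part: one whole column from each of the four mesh neighbors,
            # accumulated as a vector operation as it "arrives".
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                xn, yn = x + dx, y + dy
                inside = 0 <= xn < nx and 0 <= yn < ny
                neighbor = u[xn][yn] if inside else zero
                acc = [a + off * b for a, b in zip(acc, neighbor)]
            out[x][y] = acc
    return out

# All-ones input on a 3 x 3 x 3 mesh.
ones = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
v = stencil_spmv(ones)
print(v[1][1][1], v[0][0][0])  # 0.0 3.0
```

The interior point sees all six neighbors and the corner point sees three, which is a quick check that the boundary handling is consistent.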



#### Writing your own code: The SDK

- The SDK: Low-level programming for creating custom kernels
  - DSL with abstractions for the *lower-level* constructs of the WSE architecture
  - Libraries for common primitives such as communication, BLAS, and random number generation
  - Debugger and performance profiling tools
  - Hardware simulator
  - Examples and documentation: language specification, sample code, and programming guides

#### Beta: September 2021

#### Results

#### On NETL Xeon Cluster

- 0.86 PF/s (600 × 600 × 1536 mesh)
- Over 30% of peak
- 28 µs per iteration
- ~200 X faster than the cluster
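These figures can be cross-checked with a few lines of arithmetic. The sketch assumes that 0.86 PF/s is the sustained rate at 28 µs per iteration on the full 600 × 600 × 1536 mesh, and it reads "over 30% of peak" as an upper bound on the implied peak; the variable names are illustrative.

```python
rate_flops = 0.86e15        # sustained rate, flop/s
time_per_iter = 28e-6       # seconds per iteration
nx, ny, nz = 600, 600, 1536

points = nx * ny * nz                         # mesh points
flops_per_iter = rate_flops * time_per_iter   # work done in one iteration
flops_per_point = flops_per_iter / points     # work per mesh point per iteration

# "Over 30% of peak" means the peak is below rate / 0.30.
peak_upper_bound = rate_flops / 0.30

print(f"mesh points:          {points:,}")
print(f"flops per iteration:  {flops_per_iter:.3e}")
print(f"flops per mesh point: {flops_per_point:.1f}")
print(f"implied peak:         below {peak_upper_bound / 1e15:.2f} PF/s")
```

The result, roughly 44 floating-point operations per mesh point per iteration, is the per-point cost implied by the slide's own numbers.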



# National Energy Technology Lab









\*Plots are from customer's cost models.

# Implication

Strong scaling is attainable for problems that fit on the wafer.

#### Making an Impact: Real Time CFD



#### **Real Time CFD**

- Online Equipment Monitoring
- Cyber-Physical Security
- Failure Prediction
- Renewable Integration
- Dynamic Baseload Power
- Higher Efficiencies
- Safer Operation
- Better Command and Control

"Steam turbine rotor produced by Siemens, Germany" by Christian Kuhna is licensed under CC BY 3.0

# Conclusion

**The Top**

| Software                                                           | Algorithms                                | Hardware architecture                             |
|--------------------------------------------------------------------|-------------------------------------------|---------------------------------------------------|
| Software performance engineering                                   | New algorithms                            | Hardware streamlining                             |
| Removing software bloat<br>Tailoring software to hardware features | New problem domains<br>New machine models | Processor simplification<br>Domain specialization |

**The Bottom**