#### Node-Aware Stencil Communication for Heterogeneous Supercomputers

Feb 28 2020

<u>Carl Pearson</u><sup>1</sup>, Mert Hidayetoglu<sup>1</sup>, Mohammad Almasri<sup>1</sup>, Omer Anjum<sup>1</sup>, I-Hsin Chung<sup>2</sup>, Jinjun Xiong<sup>2</sup>, Wen-Mei Hwu<sup>1</sup>

<sup>1</sup> Electrical and Computer Engineering, University of Illinois Urbana-Champaign

<sup>2</sup> IBM T.J. Watson Research, Yorktown Heights, NY



**ILLINOIS** Electrical & Computer Engineering GRAINGER COLLEGE OF ENGINEERING

#### **Carl Pearson**



Ph.D. student, Electrical and Computer Engineering, University of Illinois Urbana-Champaign

- (Multi-)GPU communication
- Accelerating irregular applications
- cwpearson



- cwpearson
- 💌 pearson at illinois.edu
- <u>https://cwpearson.github.io</u>



# Outline

- Motivation
- Stencils
- Decomposition
- Placement
- Specialization
- Results
- Future Directions

In submission: 2020 International Workshop on Automatic Performance Tuning (iWAPT)



#### **Motivation**

- Regular computation, access, and structure reuse → stencil on GPU
- High-resolution modeling → Large stencils
- Limited GPU memory → distributed stencils with communication
- Fast stencil codes → larger impact of communication
- Heterogeneous nodes ("fat nodes") → how to do communication

- Contributions:
  - A three-phase solution for optimized stencil communication on heterogeneous clusters
  - Capability-based communication specialization
  - Runtime node-aware data placement

#### Glossary







#### **Stencil Overview**



Required halo exchange depends on stencil complexity



#### Glossary





- Typically more than one quantity
  - problem-dependent
  - physical properties (pressure, temperature)
  - directional derivatives
- Each quantity's halo is exchanged with corresponding quantity in other subdomains



## Intuition

- Off-node communication is expensive → minimize required, maximize injection bandwidth
  - "hierarchical decomposition"
  - multiple ranks per node
- On-node communication hardware → assign subdomains to GPUs to maximize use of bandwidth
  - "node-aware placement"
- On-node bandwidth depends on communication method 
  → use best method to
  achieve hardware bandwidth
  - "capability-based specialization"
  - parallel, asynchronous exchanges



#### **Decomposition - Minimize Required Comm.**



Intuition: less halo-to-interior ratio means less communication



#### **Decomposition - Approach**



- Divide into *n* subdomains
- Generate sorted prime factors, largest to smallest.
  - Evenly-sized subdomain require dividing by integers.
  - Prime factors is the largest number of integers that multiply to N

- Divide the longest dimension by prime factors
  - $\circ$  ~ use smaller prime factors later to clean up

#### **Hierarchical Decomposition**





### **Communication In Fat Nodes**

Different Bandwidths between GPUs Not the same as theoretical [1]

X-bus: achieved <u>30 GB/s</u> unidirectional NVLink: achieved <u>42 GB/s</u> unidirectional NIC: achieved <u>12.5 GB/s</u> unidirectional

Neighbor GPUs have higher bandwidth

[1] Pearson et al. Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects. ACM/SPEC International Conference on Performance Engineering. 2019.



#### Placement

#### How to place subdomains on GPUs to maximize bandwidth utilization?





#### **Quadratic Assignment Problem**

n facilities and n locations
w: weight matrix: w<sub>i,j</sub> amount of "flow" between i and j.
d: distance matrix: distance between i and j
f: bijection n -> n "assignment" of facilities to location

 $\sum_{i,j < n} w_{i,j} d_{f(i),f(j)}$ 

Minimize cost function: sum of products of weights and distances under *f*.

GPUs: locations subdomains: facilities w: required communication d: GPU bandwidth f: assignment of subdomains to GPUs





Allocating Facilities with CRAFT. Buffa, Armour, Vollman. 1962.

Start with some initial placement

while true:

- Check all possible location swaps
- Choose swap that lowers cost the most
- if no better swap:
- break

n<sup>3</sup> for n facilities (n swaps for n locations, roughly n iterations) key to not recompute cost each time - each swap only changes a bit of the cost matches exact solution for n < 6 in our case



#### **Example Placement**



Node-Aware Placement

**Trivial Placement** 



## **Capability Specialization**

Achieve best use of bandwidth, regardless of ranks/node and GPUs/rank

- "Staged": works for any 2 GPUs anywhere
  - pack from device 3D region into device 1D buffer
  - $\circ$  copy from device 1D buffer to host 1D buffer
  - $\circ$  MPI\_Send to other host 1D buffer
  - $\circ$  copy from host 1D buffer to device 1D buffer
  - $\circ$  unpack from device 1D buffer to device 3D buffer

Optimizations are node-aware shortcuts on top of this







#### Pack and Unpack





#### **CUDA-Aware MPI**



Same as the staged, but MPI responsible for getting data between GPUs



#### Colocated



Exchange between different ranks on the same node Different ranks are different processes with different address spaces Use cudaIpc\* to move a pointer between ranks, then cudaMemcpy\*



#### **Peer- and Self-exchange**





Peer: Two GPUs in the same rank

Self: Same GPU is on both sides of the domain Only if decomposition has extent=1 in any direction



#### Overlap



All operations are parallel and asynchronous



# 1 Node (Summit)



Specialization has a big impact in intra-node performance



#### Weak Scaling (Summit)



Exchange time stabilizes once most nodes have 26 neighbors Specialization has a smaller impact on off-node performance (1.16x at 256 nodes) CUDA-aware causes poor scaling

#### ECE ILLINOIS 24



Spectrum MPI 10.3.0.1 puts many device-device copies in default stream, and also calls cudaDeviceSynchronize(), which synchronizes other asynchronous operations

### Strong Scaling: 1363<sup>3</sup>



#### Future Work: Adjust Partition by Bandwidth

n

n

Minimal surface area for subdomain is not optimal



Square Subdomains max(n / 10GB/s, n / 1GB/s) = n / 1 GB/s



Stretched Subdomains max(10n / 10GB/s, (n/10) / 1GB/s) = n / 10 GB/s

# Future Work: All Pack Directions not Equal

Pack / Unpack performance depends on strides





#### Future Work: Topology-Aware Placement

Extent QAP to n ~ 1k: need a better placement algorithm, SCOTCH or something? No measurable locality on summit





#### **Future Work: Store Halos Separately**

Pros: no more packing and unpacking

Const: smart-pointer in cuda kernel to redirect accesses to the right buffer

Requires evaluation on real kernels





css-host-yz-20, 4 ranks, 1 GPU / rank, 71ff24, driver 440.33.01, CUDA 10.2, Ubuntu 18.04, kernel 4.14.0-74-generic, timeline\_28038.nvvp



#### Takeaways so Far

- Use (at least) one rank per GPU to maximize MPI injection bandwidth
- Data placement was good for 20% performance for one node
- Communication specialization was good for 6x on one node
  - still 1.16x at 256 nodes allows MPI to just do off-node
- CUDA-Aware MPI seems like a proof-of-concept right now
- Some opportunities to improve partitioning and placement according to node topology
- May be able to trade off kernel time with communication time by storing halos in a packed configuration



#### Implementation - CUDA/C++ Header-only Library

https://github.com/cwpearson/stencil - not quite public yet

Fast stencil exchange for any configuration of CUDA + MPI

Tested on Summit and Hal

Support for any combination of quantity types (float, double)

- Still has a few loose ends:
  - Multi-radius stencils (improve communication performance)
  - Export to standard visualization formats
  - Checkpointing
  - Convenience functions for overlapping communication and computation



#### Thank you - Carl Pearson



Ph.D. student, Electrical and Computer Engineering, University of Illinois Urbana-Champaign

- (Multi-)GPU communication
- Accelerating irregular applications
- cwpearson



- cwpearson
- 🛛 pearson at illinois.edu
- <u>https://cwpearson.github.io</u>



#### **Extra Slides**





#### Weak Scaling (Summit) - Detail



#### Weak Scaling (Summit) - CUDA-Aware Detail



#### **Future Work: Placement Performance**

- Naive implementation right now
- Same placement on all nodes -> only do it once, no need to broadcast full placement information





#### **Communication Architecture**

| Node N   |                                     |           | Holds placement information for all                                                                                                                                                                              |
|----------|-------------------------------------|-----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Rank P-1 | Rank P                              | Placement | subdomains:<br>convert subdomain index to node, rank,<br>GPU, and vis-versa                                                                                                                                      |
|          | Subdomain                           |           | One subdomain per GPU                                                                                                                                                                                            |
|          | self<br>peer<br>colocated<br>remote |           | Communicators:<br>One group per subdomain<br>One self-communicator<br>Peer communicator per SD on same rank<br>Colocated communicator for each SD on<br>same node<br>Remote communicator per SD on other<br>node |



#### **Future Work: Library Performance**

Measure inter-node and intra-node tiny messages Represents overhead





#### Future Work: Bandwidth Measurements

- CUDA-Aware MPI Performance
- MPI Performance
  - On-node vs off-node
- Can't rely on specs to get actual bandwidth
- Use these instead distance for placement?

#### **Future Work: Further Reduce MPI messages**

Consolidate all messages to a remote node into a single buffer

Pros: fewer, larger MPI messages

Cons: Incurs intra-node messaging and synchronization overhead



#### **Future Work: System-level heterogeneity**

Whether in compute performance and communication contention

Could apply a similar placement scheme, but use ^ as inputs

Overlap with dynamic load balancing techniques?

