enzyme-gpu-tests

This repo contains the benchmarks for Enzyme on GPUs

https://github.com/wsmoses/enzyme-gpu-tests

Science Score: 64.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    2 of 5 committers (40.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

This repo contains the benchmarks for Enzyme on GPUs

Basic Info
  • Host: GitHub
  • Owner: wsmoses
  • Language: LLVM
  • Default Branch: main
  • Homepage:
  • Size: 41.7 MB
Statistics
  • Stars: 11
  • Watchers: 3
  • Forks: 3
  • Open Issues: 2
  • Releases: 1
Created almost 5 years ago · Last pushed 7 months ago
Metadata Files
Readme Citation

README.md

Enzyme-GPU-Tests

This repo contains the benchmarks for Enzyme on GPUs.

If Enzyme, or part of this repository, is useful to you, please cite:

```
@inproceedings{enzymeGPU,
  title     = {Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme},
  author    = {Moses, William S and Churavy, Valentin and Paehler, Ludger and H{\"u}ckelheim, Jan and Hari Krishna Narayanan, Sri and Schanen, Michel and Doerfert, Johannes},
  booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  year      = {2021},
  location  = {St. Louis, Missouri},
  series    = {SC '21}
}
```

Below we describe how to run the 6 benchmarks presented here: XSBench (CUDA), RSBench (CUDA), Parboil LBM (CUDA), LULESH (CUDA), DG (CUDA), and DG (AMD). Of these benchmarks, the AMD and CUDA DG codes are Julia-based, whereas the remainder are C++/CUDA. Evaluating these benchmarks allowed the paper to demonstrate the efficiency of the generated gradients in comparison to the original code, the impact of the novel optimizations, the scalability of the generated gradients, and the correctness of the tool.

To run the benchmarks used in the paper, we first need to build the LLVM compiler toolchain so that we can subsequently link Enzyme's compiler plugin against our built LLVM version. To install LLVM, follow these steps:

```
$ cd ~
$ git clone https://github.com/llvm/llvm-project
$ cd llvm-project

# The following git hash was used in our paper.
# You may use this, or a newer version.
$ git checkout 8dab25954b0acb53731c4aa73e9a7f4f98263030
$ mkdir build && cd build
$ cmake ../llvm -DLLVM_ENABLE_PROJECTS="clang" -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" -DCMAKE_BUILD_TYPE=Release -G Ninja
$ ninja

# This may take a while.
# clang will now be available at ~/llvm-project/build/bin/clang.
```

We must now build Enzyme against our chosen LLVM version.

```
$ cd ~
$ git clone https://github.com/wsmoses/Enzyme

# The following git hash was used in our paper.
# You may use this, or a newer version.
$ git checkout ec75831a8cb0170090c366f8da6e3b2b87a20f6e
$ cd Enzyme/enzyme
$ mkdir build && cd build
$ cmake ../enzyme -DLLVM_DIR=/path/to/llvm/build -DCMAKE_BUILD_TYPE=Release -G Ninja
$ ninja

# ClangEnzyme-13.so will now be available at ~/Enzyme/enzyme/build/Enzyme/ClangEnzyme-13.so.
```

Some of the C++ benchmarks require a custom CUDA libdevice (the implementation of various CUDA intrinsics). This remedies an issue within LLVM that prevents common math functions from being identified as LLVM intrinsics (this is being worked on in upstream LLVM). For the default CUDA installation, libdevice can be found at /usr/local/cuda/nvvm/libdevice/libdevice.10.bc. The following snippet shows how to replace the libdevice file, assuming you are using CUDA 11.2; the instructions are similar for other CUDA installations, with the path changed accordingly. Note that you may need to be root to perform the change, and that you should always make a backup of your previous libdevice.

```
# Save a copy of your current libdevice.
$ sudo cp /usr/local/cuda-11.2/nvvm/libdevice/libdevice.10.bc /usr/local/cuda-11.2/nvvm/libdevice/libdevice.10.bc.old
$ sudo cp /path/to/new/libdevice.10.bc /usr/local/cuda-11.2/nvvm/libdevice/libdevice.10.bc
```

We have created Python3 scripts to ease setting up and running our experiments. They attempt to deduce appropriate paths from the following environment variables; if something goes wrong, you may need to adjust bench.py or the Makefile of the affected test. All of the bench.py benchmarking scripts follow the same structure.

```
# The index of the desired CUDA GPU
$ export DEVICE=1

# The path to the CUDA installation
$ export CUDA_PATH=/usr/local/cuda-11.2

# The path to the clang++ binary we built above
$ export CLANG_PATH=/path/to/llvm/build/bin/clang++

# The path to the Enzyme plugin we built above
$ export ENZYME_PATH=/path/to/Enzyme/enzyme/build/Enzyme/ClangEnzyme-13.so
```
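
How each bench.py consumes these variables may differ slightly between benchmarks; as a rough illustration only (the fallback defaults here are assumptions, not what bench.py itself uses), a script can pick them up like this:

```
import os

# Read the environment variables set above; the fallback values are
# illustrative defaults, not necessarily the ones bench.py uses.
device = int(os.environ.get("DEVICE", "0"))
cuda_path = os.environ.get("CUDA_PATH", "/usr/local/cuda")
clang_path = os.environ.get("CLANG_PATH", "clang++")
enzyme_path = os.environ.get("ENZYME_PATH")

if enzyme_path is None or not os.path.exists(enzyme_path):
    raise SystemExit("ENZYME_PATH must point to ClangEnzyme-13.so")

print(f"GPU index {device}, CUDA at {cuda_path}, compiler {clang_path}")
```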

We can now work with the benchmark suite (this repository).

```
$ git clone https://github.com/wsmoses/Enzyme-GPU-Tests
```

The benchmark suite folder breaks down into the following structure:

* DG, a Discontinuous Galerkin benchmark (CUDA & ROCm in Julia)
* LBM, a Lattice Boltzmann benchmark (CUDA in C++)
* LULESH, a Lagrangian Hydrodynamics benchmark (CUDA in C++)
* RSBench, a Monte Carlo Particle Transport benchmark (CUDA in C++)
* XSBench, a Monte Carlo Particle Transport benchmark (CUDA in C++)

We can now enter one of the 4 C++ test directories (XSBench, RSBench, LBM, LULESH) and run the corresponding benchmark.

```
$ cd Enzyme-GPU-Tests/LBM
$ python3 bench.py

# output of benchmark times printed out here
```

The bench.py script will first run an ablation analysis that enables or disables differentiation, along with several optimizations; the result of these tests is the execution time of the gradient and/or original kernel. The script will then run scaling tests for both the gradient and original kernel, evaluating on increasing problem sizes. Some benchmarks (XSBench, LBM, LULESH, DG (CUDA)) end by printing the derivative as computed by both numeric differentiation and Enzyme; these runs include VERIFY=yes as part of the run line (see the sketch below), and all other run lines contain the execution time of that benchmark. Be aware that LULESH's ablation analysis includes benchmark configurations that do not perform compiler optimizations and are hence significantly slower than the others.
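
As a minimal sketch (assuming only that VERIFY=yes appears verbatim in the verification run lines, as described above; the capture filename is hypothetical), saved bench.py output can be split into timing and verification lines:

```
# Split a saved capture of `python3 bench.py` output into verification
# lines (containing VERIFY=yes) and timing lines.
with open("bench_output.txt") as f:
    lines = [line.rstrip() for line in f if line.strip()]

verify_lines = [l for l in lines if "VERIFY=yes" in l]
timing_lines = [l for l in lines if "VERIFY=yes" not in l]

print(f"{len(timing_lines)} timing lines, {len(verify_lines)} verification lines")
```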

XSBench and RSBench require the libdevice found in Enzyme-GPU-Tests/libdevice1.

LBM uses the packaged libdevice from NVIDIA.

LULESH uses the libdevice found in Enzyme-GPU-Tests/libdevice2. The LULESH benchmark furthermore relies on NVIDIA's Nsight Compute utility (NCU). NCU has known issues with access to the GPU performance counters, which you will need in order to benchmark the gradient kernel of LULESH. If you run into this issue, please consult NVIDIA's documentation on enabling performance counter access to remedy it.

Odd performance results or a compiler error are potential indicators of an incorrect libdevice; a quick hash comparison, sketched below, can confirm which libdevice is actually installed.
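
A minimal sketch of that check (both paths are illustrative: adjust the CUDA version, and the filename inside the repo's libdevice directory is an assumption):

```
import hashlib

# Compare the installed libdevice against the one shipped in this repo to
# check whether the swap described above actually took effect.
def sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

installed = sha256("/usr/local/cuda-11.2/nvvm/libdevice/libdevice.10.bc")
expected = sha256("Enzyme-GPU-Tests/libdevice1/libdevice.10.bc")
print("match" if installed == expected else "mismatch")
```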

For example, when compiling one of the ablation tests of RSBench without the correct libdevice, one may see the following error when running bench.py:

```
cannot handle (augmented) unknown intrinsic
  %5 = tail call i32 @llvm.nvvm.d2i.hi(double %0)
fatal error: error in backend: (augmented) unknown intrinsic
clang-13: error
```

The two DG tests were run using Julia 1.6. Julia of this version must be on your path before you can run the Julia tests. To obtain a working Julia installation, follow the installation instructions on the Julia website.

DG (CUDA) was run with the libdevice found in Enzyme-GPU-Tests/libdevice1.

We have provided a similar bench.py script for DG. While printed in a different (CSV-style) format, it contains the same information about runtimes for both ablation and scaling as the C++ CUDA tests (DG (AMD) has no ablation analysis, as it does not run without all optimizations applied).

```
$ cd Enzyme-GPU-Tests/DG/cuda
$ python3 bench.py
```

Note that the numeric verification may come earlier in the script's output, and should look something like this:

```
# Enzyme derivative as the first element of the tuple,
# followed by the numeric approximation on the right
(dQ.dval[1], (o2 - o1) / 0.0001) = (-1.105959f0, [-1.10626220703125])
```
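
The right-hand element is a standard forward-difference approximation with step h = 0.0001. A minimal sketch of the same style of check (f, df, and x are made up for illustration):

```
# Forward-difference check mirroring the (o2 - o1) / 0.0001 comparison in
# the DG output. The function f and the point x are illustrative only.
def f(x):
    return x ** 3

def df(x):
    # analytic derivative of f, standing in for the Enzyme-computed value
    return 3 * x ** 2

x, h = 2.0, 0.0001
o1, o2 = f(x), f(x + h)
print(df(x), (o2 - o1) / h)  # the two values should agree to several digits
```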

Rows of the CSV-style output that time the forward pass alone are denoted "primal", whereas the derivative runtimes are marked "all_dub".
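
A small sketch of collecting those rows (the column layout, with the label in the first field, and the capture filename are assumptions; adjust to the actual output of the DG bench.py):

```
import csv

# Tally "primal" (forward-pass) and "all_dub" (derivative) rows from a
# hypothetical capture of the DG bench.py CSV-style output.
primal, all_dub = [], []
with open("dg_bench.csv") as f:
    for row in csv.reader(f):
        if not row:
            continue
        if row[0] == "primal":
            primal.append(row)
        elif row[0] == "all_dub":
            all_dub.append(row)

print(f"{len(primal)} primal rows, {len(all_dub)} all_dub rows")
```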

The DG tests may require additional setup. For example, you may see output like the following (note that this may also occur if you try to run DG (AMD) on a system without the relevant AMD libraries available):

```
Warning: HSA runtime has not been built, runtime functionality will be unavailable.
Please run Pkg.build("AMDGPU") and reload AMDGPU.
```

You may then need to explicitly run various setup routines within Julia's package manager. To fix the Julia setup for the test, perform the following in an interactive shell:

```
$ cd Enzyme-GPU-Tests/DG/rocm
$ julia --project=.
julia> using Pkg; Pkg.build("AMDGPU")
```

Owner

  • Name: William Moses
  • Login: wsmoses
  • Kind: user
  • Location: Cambridge, MA
  • Company: MIT

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite it using the following metadata.
title: Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme
authors: 
- family-names: Moses
  given-names: William S.
  affiliation: MIT
  orcid: https://orcid.org/0000-0003-2627-0642
- family-names: Churavy
  given-names: Valentin
  affiliation: MIT
  orcid: https://orcid.org/0000-0002-9033-165X
- family-names: Paehler
  given-names: Ludger
- family-names: Hueckelheim
  given-names: Jan
- family-names: "Hari Krishna Narayanan"
  given-names: Sri
- family-names: Schanen
  given-names: Michel
- family-names: Doerfert
  given-names: Johannes
keywords:
- Enzyme
- GPU
- CUDA
- ROCm
- C++
- Julia
- Automatic Differentiation
version: 0.1.0

GitHub Events

Total
  • Watch event: 3
  • Issue comment event: 1
  • Push event: 3
  • Pull request review comment event: 1
  • Pull request event: 2
  • Fork event: 1
  • Create event: 2
Last Year
  • Watch event: 3
  • Issue comment event: 1
  • Push event: 3
  • Pull request review comment event: 1
  • Pull request event: 2
  • Fork event: 1
  • Create event: 2

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 36
  • Total Committers: 5
  • Avg Commits per committer: 7.2
  • Development Distribution Score (DDS): 0.611
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
William S. Moses gh@w****m 14
Valentin Churavy v****y@g****m 9
Ludger Paehler l****r@t****e 7
William Moses w****s@c****o 4
Valentin Churavy v****y@m****u 2
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • wsmoses (2)
Top Labels
Issue Labels
Pull Request Labels