Recent Releases of superbenchmark

superbenchmark - Release SuperBench v0.12.0

SuperBench 0.12.0 Release Notes

SuperBench Improvements

  • Optimize cutlass build process for faster builds and smaller binaries.
  • Improve image build pipeline.
  • Add support for arm64 builds.
  • Upgrade pipeline dependencies.
  • Fix SuperBench installation and code lint issues.
  • Update Flake8 repository.
  • Add support for the latest Python versions.
  • Enhance error handling for pkg_resources imports.
  • Update ROCm image build labels.
  • Add CUDA 12.8 and CUDA 12.9 support.
  • Consolidate multi-architecture Docker images.
  • Upgrade runner OS to latest version.
  • Fix typos in documentation and code.

Micro-benchmark Improvements

  • Add general CPU bandwidth and latency benchmarks.
  • Add nvbandwidth build process and benchmarks.
  • Add compute capability 10.0 architecture support in gemm-flops.
  • Add GPU Stream micro benchmark.
  • Add FP4 GEMM FLOPS support in cublaslt_gemm benchmark.
  • Add Grace CPU support for CPU Stream benchmark.
  • Revise CPU Stream benchmark.
  • Fix NUMA error on Grace CPU in gpu-copy benchmark.
  • Bump onnxruntime-gpu dependency from 1.10.0 to 1.12.0.
  • Fix stderr message in gpu-copy benchmark.
  • Fix TensorRT inference parsing.
  • Handle N/A values in nvbandwidth benchmark.
  • Avoid unintended nvbandwidth function calls in all benchmarks.
  • Support CUDA arch flag and autotuning in cublaslt GEMM.
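
The N/A handling mentioned above matters because nvbandwidth prints "N/A" for unmeasured device pairs, which breaks naive numeric parsing. A minimal sketch of tolerant row parsing (illustrative only; `parse_bandwidth_row` is a hypothetical helper, not SuperBench's actual parser):

```python
def parse_bandwidth_row(row):
    """Convert one row of nvbandwidth-style matrix output to floats.

    Cells reported as "N/A" (e.g. a device paired with itself) become None
    instead of raising ValueError in float(), so the row keeps its shape.
    """
    values = []
    for cell in row.split():
        if cell.upper() in ("N/A", "NA"):
            values.append(None)  # missing measurement, preserve position
        else:
            values.append(float(cell))
    return values
```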

Model-benchmark Improvements

  • Add LLaMA-2 model benchmarks.
  • Add Mixture of Experts model benchmarks.
  • Add DeepSeek inference benchmark (AMD GPU).

Result Analysis

  • Enhance logging for diagnosis rule baseline errors.

Documentation Updates

  • Update CODEOWNERS file.

- Python
Published by polarG 7 months ago

superbenchmark - Release SuperBench v0.11.0

SuperBench 0.11.0 Release Notes

SuperBench Improvements

  • Add CUDA 12.4 dockerfile.
  • Upgrade nccl version to v2.23.4 and install ucx v1.16.0 in cuda 12.4 dockerfile.
  • Fix MSCCL build error in CUDA12.4 docker build pipeline.
  • Add ROCm6.2 dockerfile.
  • Update hpcx link in cuda11.1 dockerfile to fix docker build failure.
  • Improve documentation (fix metric names and typos in user tutorial; add BibTeX to README and repo).
  • Limit protobuf version to be 3.20.x to fix onnxruntime dependency error.
  • Update omegaconf version to 2.3.0 and fix issues caused by omegaconf version update.
  • Update Docker Exec Command for Persistent HPCX Environment.
  • Fix cuda 12.2 dockerfile LD_LIBRARY_PATH issue.
  • Use types-setuptools to replace types-pkg_resources.
  • Add configuration for NDv5 H100 and AMD MI300x.

Micro-benchmark Improvements

  • Add hipblasLt tuning to dist-inference cpp implementation.
  • Add support for NVIDIA L4/L40/L40s GPUs in gemm-flops.
  • Upgrade mlc to v3.11.

Model-benchmark Improvements

  • Support FP8 transformer model training in ROCm6.2 dockerfile.

Result Analysis

  • Fix bug of failure test and warning of pandas in data diagnosis.

Published by yukirora over 1 year ago

superbenchmark - Release SuperBench v0.10.0

SuperBench 0.10.0 Release Notes

SuperBench Improvements

  • Support monitoring for AMD GPUs.
  • Support ROCm 5.7 and ROCm 6.0 dockerfile.
  • Add MSCCL support for Nvidia GPU.
  • Fix NUMA domains swap issue in NDv4 topology file.
  • Add NDv5 topo file.
  • Pin NCCL and NCCL-tests to 2.18.3 to fix hang issue in CUDA 12.2.

Micro-benchmark Improvements

  • Add HPL random generator to gemm-flops with ROCm.
  • Add DirectXGPURenderFPS benchmark to measure the FPS of rendering simple frames.
  • Add HWDecoderFPS benchmark to measure the FPS of hardware decoder performance.
  • Update Docker image for H100 support.
  • Update MLC version to 3.10 for CUDA/ROCm dockerfile.
  • Bug fix for GPU Burn test.
  • Support INT8 in cublaslt function.
  • Add hipBLASLt function benchmark.
  • Support cpu-gpu and gpu-cpu in ib-validation.
  • Support graph mode in NCCL/RCCL benchmarks for latency metrics.
  • Support cpp implementation in distributed inference benchmark.
  • Add O2 option for gpu copy ROCm build.
  • Support different hipblasLt data types in dist inference.
  • Support in-place in NCCL/RCCL benchmark.
  • Support data type option in NCCL/RCCL benchmark.
  • Improve P2P performance with fine-grained GPU memory in GPU-copy test for AMD GPUs.
  • Update hipblaslt GEMM metric unit to tflops.
  • Support FP8 for hipblaslt benchmark.

Model Benchmark Improvements

  • Change torch.distributed.launch to torchrun.
  • Support Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark.

Result Analysis

  • Support baseline generation from multiple nodes.
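
Baseline generation from multiple nodes amounts to aggregating the same metric across per-node results. The sketch below uses a simple mean, with a hypothetical result shape (SuperBench's actual result schema and aggregation strategy may differ):

```python
from statistics import mean

def generate_baseline(node_results, metric):
    """Aggregate one metric across nodes into a single baseline value.

    node_results maps node name -> {metric name: value}; this shape is an
    assumption for illustration, not SuperBench's result schema.
    """
    values = [r[metric] for r in node_results.values() if metric in r]
    return mean(values) if values else None
```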

Published by abuccts about 2 years ago

superbenchmark - Release SuperBench v0.9.0

SuperBench 0.9.0 Release Notes

SuperBench Improvements

  • Support Ctrl+C and interrupt to stop all SuperBench testing.
  • Support Windows Docker for VDI/Gaming GPU.
  • Support DirectX platform for Nvidia and AMD GPU.
  • Add System Config Info feature in SB runner to support distributed collection.
  • Support DirectX test pipeline.

Micro-benchmark Improvements

  • Add DirectXGPUCopyBw Benchmark to measure HtoD/DtoH bandwidth by DirectX.
  • Add DirectXGPUCoreFlops Benchmark to measure peak FLOPS by DirectX.
  • Add DirectXGPUMemBw Benchmark to measure GPU memory bandwidth by DirectX.
  • Add DirectXVCNEncodingLatency Benchmark to measure the VCN hardware encoding latency on AMD graphics GPUs.
  • Support best algorithm selection in cudnn-function microbenchmark.
  • Revise step time collection in distributed inference benchmark.

Model Benchmark Improvements

  • Fix early-stop logic triggered by num_steps in model benchmarks.
  • Support TensorRT models on Nvidia H100.

Documentation Improvements

  • Improve documentation for System Config Info.
  • Update outdated references.

Published by yukirora over 2 years ago

superbenchmark - Release SuperBench v0.8.0

SuperBench 0.8.0 Release Notes

SuperBench Improvements

  • Support SuperBench Executor running on Windows.
  • Remove fixed rccl version in rocm5.1.x dockerfile.
  • Upgrade networkx version to fix installation compatibility issue.
  • Pin setuptools version to v65.7.0.
  • Limit ansible_runner version for Python 3.6.
  • Support cgroup V2 when reading system metrics in monitor.
  • Fix analyzer bug in Python 3.8 due to pandas api change.
  • Collect real-time GPU power in monitor.
  • Remove unreachable condition when writing host list in mpi mode.
  • Upgrade Docker image with cuda12.1, nccl 2.17.1-1, hpcx v2.14, and mlc 3.10.
  • Fix wrong unit of cpu-memory-bw-latency in document.

Micro-benchmark Improvements

  • Add STREAM benchmark for sustainable memory bandwidth and the corresponding computation rate.
  • Add HPL Benchmark for HPC Linpack Benchmark.
  • Support flexible warmup and non-random data initialization in cublas-benchmark.
  • Support error tolerance in micro-benchmark for CuDNN function.
  • Add distributed inference benchmark.
  • Support tensor core precisions (e.g., FP8) and batch/shape range in cublaslt gemm.

Model Benchmark Improvements

  • Fix torch.dist init issue with multiple models.
  • Support TE FP8 in BERT/GPT2 model.
  • Add num_workers configurations in model benchmark.

Published by abuccts almost 3 years ago

superbenchmark - Release SuperBench v0.7.0

SuperBench v0.7.0 Release Notes

SuperBench Improvements

  • Support non-zero return code when "sb deploy" or "sb run" fails in Ansible.
  • Support log flushing to the result file during runtime.
  • Update version to include revision hash and date.
  • Support "pattern" in mpi mode to run tasks in parallel.
  • Support topo-aware, all-pair, and K-batch patterns in mpi mode.
  • Pin Transformers version to avoid TensorRT failure.
  • Add CUDA11.8 Docker image for NVIDIA arch90 GPUs.
  • Support "sb deploy" without pulling image.

Micro-benchmark Improvements

  • Support list of custom config string in cudnn-functions and cublas-functions.
  • Support correctness check in cublas-functions.
  • Support GEMM-FLOPS for NVIDIA arch90 GPUs.
  • Support cuBLASLt FP16 and FP8 GEMM.
  • Add wait time option to resolve mem-bw unstable issue.
  • Fix incorrect data type judgment in cublas-function source code.

Model Benchmark Improvements

  • Support FP8 in BERT model training.

Distributed Benchmark Improvements

  • Support pair-wise pattern in IB validation benchmark.
  • Support topo-aware, pair-wise, and K-batch patterns in nccl-bw benchmark.

Published by abuccts about 3 years ago

superbenchmark - Release SuperBench v0.6.0

SuperBench v0.6.0 Release Notes

SuperBench Improvements

  • Support running on host directly without Docker.
  • Support running sb command inside docker image.
  • Support ROCm 5.1.1.
  • Support ROCm 5.1.3.
  • Fix bugs in data diagnosis.
  • Fix cmake and build issues.
  • Support automatic configuration yaml selection on Azure VM.
  • Refine error message when GPU is not detected.
  • Add return code for Timeout.
  • Update Dockerfile for NCCL/RCCL version, tag name, and verbose output.
  • Support node_num=1 in mpi mode.
  • Update Python setup for required packages.
  • Enhance parameter parsing to allow spaces in value.
  • Support NO_COLOR for SuperBench output.
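
Allowing spaces in parameter values typically means splitting on shell-style quoting rather than raw whitespace. A minimal sketch using the standard library (`parse_parameters` is a hypothetical helper, not SuperBench's actual parsing code):

```python
import shlex

def parse_parameters(arg_string):
    """Split a benchmark parameter string, preserving quoted values with spaces."""
    return shlex.split(arg_string)
```

For example, `parse_parameters('--tag "my run" --num_steps 100')` keeps `my run` as a single value.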

Micro-benchmark Improvements

  • Fix issues in ib loopback benchmark.
  • Fix stability issue in ib loopback benchmark.

Distributed Benchmark Improvements

  • Enhance pair-wise IB benchmark.
  • Bug Fix in IB benchmark.
  • Support topology-aware IB benchmark.

Data Diagnosis and Analysis

  • Add failure check function in data_diagnosis.py.
  • Support JSON and JSONL in Diagnosis.
  • Add support to store values of metrics in data diagnosis.
  • Support exit code of sb result diagnosis.
  • Format int type and unify empty value to N/A in diagnosis output files.
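
The diagnosis flow above (JSON/JSONL input, baseline comparison, exit code) can be sketched roughly as follows. The field names, rule shape, and fixed tolerance are assumptions for illustration, not SuperBench's actual diagnosis rule format:

```python
import json

def diagnose_jsonl(jsonl_text, baselines, tolerance=0.05):
    """Flag (node, metric, value) triples falling below baseline * (1 - tolerance)."""
    defective = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines in JSONL input
        record = json.loads(line)
        for metric, baseline in baselines.items():
            value = record.get(metric)
            if value is not None and value < baseline * (1 - tolerance):
                defective.append((record["node"], metric, value))
    return defective
```

A non-empty result can then drive a non-zero exit code, matching the "exit code of sb result diagnosis" behavior noted above.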

Published by abuccts over 3 years ago

superbenchmark - Pre-release v0.6.0-rc1

Pre-release v0.6.0-rc1.

Published by abuccts over 3 years ago

superbenchmark - Release SuperBench v0.5.0

SuperBench 0.5.0 Release Notes

Micro-benchmark Improvements

  • Support NIC-only NCCL bandwidth benchmark on a single node in NCCL/RCCL bandwidth test.
  • Support bi-directional bandwidth benchmark in GPU copy bandwidth test.
  • Support data checking in GPU copy bandwidth test.
  • Update rccl-tests submodule to fix divide by zero error.
  • Add GPU-Burn micro-benchmark.

Model-benchmark Improvements

  • Sync results on root rank for e2e model benchmarks in distributed mode.
  • Support customized env in local and torch.distributed mode.
  • Add support for pytorch>=1.9.0.
  • Keep BatchNorm in FP32 for PyTorch CNN models cast to FP16.
  • Exclude FP16 sample type-conversion time.
  • Support FAMBench.

Inference Benchmark Improvements

  • Revise the default setting for inference benchmark.
  • Add percentile metrics for inference benchmarks.
  • Support T4 and A10 in GEMM benchmark.
  • Add configuration for inference benchmark.
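
Percentile latency metrics are usually computed with a nearest-rank rule over the collected step latencies (e.g. p50/p90/p99). A minimal sketch, not SuperBench's implementation:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample covering p percent of the data."""
    ordered = sorted(samples)
    idx = max(math.ceil(p / 100 * len(ordered)) - 1, 0)
    return ordered[idx]
```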

Other Improvements

  • Add command to support listing all optional parameters for benchmarks.
  • Unify benchmark naming convention and support multiple tests with same benchmark and different parameters/options in one configuration file.
  • Support timeout to detect the benchmark failure and stop the process automatically.
  • Add rocm5.0 dockerfile.
  • Improve output interface.
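
Timeout-based failure detection generally wraps the benchmark process and kills it past a deadline. A generic sketch using the standard library (illustrative; not SuperBench's runner code):

```python
import subprocess

def run_with_timeout(cmd, timeout_sec):
    """Run a command, returning its exit code, or 124 on timeout.

    124 mirrors the coreutils `timeout` convention; a killed process
    counts as a benchmark failure.
    """
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout_sec).returncode
    except subprocess.TimeoutExpired:
        return 124
```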

Data Diagnosis and Analysis

  • Support multi-benchmark check.
  • Support result summary in md, html and excel formats.
  • Support data diagnosis in md and html formats.
  • Support result output for all nodes in data diagnosis.

Published by abuccts almost 4 years ago

superbenchmark - Pre-release v0.5.0-rc1

Published by abuccts almost 4 years ago

superbenchmark - Release SuperBench v0.4.0

SuperBench 0.4.0 Release Notes

SuperBench Framework

Monitor

  • Add monitor framework for NVIDIA GPU, CPU, memory and disk.

Data Diagnosis and Analysis

  • Support baseline-based data diagnosis.
  • Support basic analysis feature (boxplot figure, outlier detection, etc.).

Single-node Validation

Micro Benchmarks

  • CPU Memory Validation (tool: Intel Memory Latency Checker).
  • GPU Copy Bandwidth (tool: built by MSRA).
  • Add ORT Model on AMD GPU platform.
  • Add inference backend TensorRT.
  • Add inference backend ORT.

Multi-node Validation

Micro Benchmarks

  • IB Networking validation.
  • TCP validation (tool: TCPing).
  • GPCNet Validation (tool: GPCNet).

Other Improvements

  1. Enhancement

    • Add pipeline for AMD docker.
    • Integrate system config info script with SuperBench.
    • Support FP32 mode without TF32.
    • Refine unit test for microbenchmark.
    • Unify metric names for all benchmarks.
  2. Document

    • Add benchmark list.
    • Add monitor document.
    • Add data diagnosis document.

Published by abuccts about 4 years ago

superbenchmark - Release SuperBench v0.3.0

SuperBench v0.3.0 Release Notes

SuperBench Framework

Runner

  • Implement MPI mode.

Benchmarks

  • Support Docker benchmark.

Single-node Validation

Micro Benchmarks

  1. Memory (Tool: NVIDIA/AMD Bandwidth Test Tool)

| Metrics | Unit | Description |
|---------|------|-------------|
| H2DMemBWGPU | GB/s | host-to-GPU bandwidth for each GPU |
| D2HMemBWGPU | GB/s | GPU-to-host bandwidth for each GPU |

  2. IBLoopback (Tool: PerfTest – Standard RDMA Test Tool)

| Metrics | Unit | Description |
|---------|------|-------------|
| IBWrite | MB/s | The IB write loopback throughput with different message sizes |
| IBRead | MB/s | The IB read loopback throughput with different message sizes |
| IB_Send | MB/s | The IB send loopback throughput with different message sizes |

  3. NCCL/RCCL (Tool: NCCL/RCCL Tests)

| Metrics | Unit | Description |
|---------|------|-------------|
| NCCLAllReduce | GB/s | The NCCL AllReduce performance with different message sizes |
| NCCLAllGather | GB/s | The NCCL AllGather performance with different message sizes |
| NCCLbroadcast | GB/s | The NCCL Broadcast performance with different message sizes |
| NCCLreduce | GB/s | The NCCL Reduce performance with different message sizes |
| NCCLreducescatter | GB/s | The NCCL ReduceScatter performance with different message sizes |

  4. Disk (Tool: FIO – Standard Disk Performance Tool)

| Metrics | Unit | Description |
|---------|------|-------------|
| SeqRead | MB/s | Sequential read performance |
| SeqWrite | MB/s | Sequential write performance |
| RandRead | MB/s | Random read performance |
| RandWrite | MB/s | Random write performance |
| SeqR/WRead | MB/s | Read performance in sequential read/write, fixed measurement (read:write = 4:1) |
| SeqR/WWrite | MB/s | Write performance in sequential read/write (read:write = 4:1) |
| RandR/WRead | MB/s | Read performance in random read/write (read:write = 4:1) |
| RandR/WWrite | MB/s | Write performance in random read/write (read:write = 4:1) |

  5. H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build)

| Metrics | Unit | Description |
|---------|------|-------------|
| H2DSMBWGPU | GB/s | host-to-GPU bandwidth using GPU kernel for each GPU |
| D2HSMBWGPU | GB/s | GPU-to-host bandwidth using GPU kernel for each GPU |

AMD GPU Support

Docker Image Support

  • ROCm 4.2 PyTorch 1.7.0
  • ROCm 4.0 PyTorch 1.7.0

Micro Benchmarks

  1. Kernel Launch (Tool: MSR-A build)

| Metrics | Unit | Description |
|---------|------|-------------|
| KernelLaunchEventTime | Time (ms) | Dispatch latency measured in GPU time using hipEventRecord() |
| KernelLaunchWallTime | Time (ms) | Dispatch latency measured in CPU time |

  2. GEMM FLOPS (Tool: AMD rocblas-bench Tool)

| Metrics | Unit | Description |
|---------|------|-------------|
| FP64 | GFLOPS | FP64 FLOPS without MatrixCore |
| FP32(MC) | GFLOPS | FP32 FLOPS with MatrixCore |
| FP16(MC) | GFLOPS | FP16 FLOPS with MatrixCore |
| BF16(MC) | GFLOPS | BF16 FLOPS with MatrixCore |
| INT8(MC) | GOPS | INT8 OPS with MatrixCore |

E2E Benchmarks

  1. CNN models -- Use PyTorch torchvision models

    • ResNet: ResNet-50, ResNet-101, ResNet-152
    • DenseNet: DenseNet-169, DenseNet-201
    • VGG: VGG-11, VGG-13, VGG-16, VGG-19
  2. BERT -- Use huggingface Transformers

    • BERT
    • BERT Large
  3. LSTM -- Use PyTorch

  4. GPT-2 -- Use huggingface Transformers

Bug Fix

  • Fix VGG models failure on A100 GPU with batch_size=128.

Other Improvements

  1. Contribution related

    • Contribution rules
    • System information collection
  2. Document

    • Add release process doc
    • Add design documents
    • Add developer guide doc for coding style
    • Add contribution rules
    • Add docker image list
    • Add initial validation results

Published by abuccts over 4 years ago

superbenchmark - Release SuperBench v0.2.1

SuperBench v0.2.1 Release Notes

Bug Fixes

  • Fix Ansible connection issue when running in localhost.
  • Fix crashes in VGG models distributed training.
  • Fix bug when converting bool config to store_true argument.

Published by abuccts over 4 years ago

superbenchmark - Release SuperBench v0.2.0

SuperBench v0.2.0 Release Notes

SuperBench Framework

  • Implemented a command line interface (CLI).
  • Implemented Runner for node control and management.
  • Implemented Executor.
  • Implemented Benchmark framework.

Supported Benchmarks

  • Supported Micro-benchmarks
    • GEMM FLOPS (GFLOPS, TensorCore, cuBLAS, cuDNN)
    • Kernel Launch Time (KernelLaunchEventTime, KernelLaunchWallTime)
    • Operator Performance (MatMul, Sharding_MatMul)
  • Supported Model-benchmarks
    • CNN models (Reference: torchvision models)
    • ResNet (ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152)
    • DenseNet (DenseNet-161, DenseNet-169, DenseNet-201)
    • VGG (VGG-11, VGG-13, VGG-16, VGG-19, VGG-11 BN, VGG-13 BN, VGG-16 BN, VGG-19 BN)
    • MNASNet (mnasnet0_5, mnasnet0_75, mnasnet1_0, mnasnet1_3)
    • AlexNet
    • GoogLeNet
    • Inception_v3
    • mobilenet_v2
    • ResNeXt (resnext50_32x4d, resnext101_32x8d)
    • Wide ResNet (wide_resnet50_2, wide_resnet101_2)
    • ShuffleNet (shufflenet_v2_x0_5, shufflenet_v2_x1_0, shufflenet_v2_x1_5, shufflenet_v2_x2_0)
    • SqueezeNet (squeezenet1_0, squeezenet1_1)
    • LSTM model
    • BERT models (BERT-Base, BERT-Large)
    • GPT-2 model

Examples and Documents

  • Added examples to run benchmarks individually.
  • Tutorial Documents (introduction, getting-started, developer-guides, APIs, benchmarks).
  • Built SuperBench website.

Published by TobeyQin over 4 years ago