Recent Releases of superbenchmark
superbenchmark - Release SuperBench v0.12.0
SuperBench 0.12.0 Release Notes
SuperBench Improvements
- Optimize the cutlass build process for faster builds and smaller binaries.
- Improve image build pipeline.
- Add support for arm64 builds.
- Upgrade pipeline dependencies.
- Fix SuperBench installation and code lint issues.
- Update Flake8 repository.
- Add support for the latest Python versions.
- Enhance error handling for pkg_resources imports.
- Update ROCm image build labels.
- Add CUDA 12.8 and CUDA 12.9 support.
- Consolidate multi-architecture Docker images.
- Upgrade runner OS to latest version.
- Fix typos in documentation and code.
Micro-benchmark Improvements
- Add general CPU bandwidth and latency benchmarks.
- Add nvbandwidth build process and benchmarks.
- Add architecture support for 10.0 in gemm-flops.
- Add GPU Stream micro benchmark.
- Add FP4 GEMM FLOPS support in cublaslt_gemm benchmark.
- Add Grace CPU support for CPU Stream benchmark.
- Revise CPU Stream benchmark.
- Fix NUMA error on Grace CPU in gpu-copy benchmark.
- Bump onnxruntime-gpu dependency from 1.10.0 to 1.12.0.
- Fix stderr message in gpu-copy benchmark.
- Fix TensorRT inference parsing.
- Handle N/A values in nvbandwidth benchmark.
- Avoid unintended nvbandwidth function calls in all benchmarks.
- Support CUDA arch flag and autotuning in cublaslt GEMM.
Model-benchmark Improvements
- Add LLaMA-2 model benchmarks.
- Add Mixture of Experts model benchmarks.
- Add DeepSeek inference benchmark (AMD GPU).
Result Analysis
- Enhance logging for diagnosis rule baseline errors.
Documentation Updates
- Update CODEOWNERS file.
Published by polarG 7 months ago
superbenchmark - Release SuperBench v0.11.0
SuperBench 0.11.0 Release Notes
SuperBench Improvements
- Add CUDA 12.4 dockerfile.
- Upgrade nccl version to v2.23.4 and install ucx v1.16.0 in cuda 12.4 dockerfile.
- Fix MSCCL build error in CUDA12.4 docker build pipeline.
- Add ROCm6.2 dockerfile.
- Update hpcx link in cuda11.1 dockerfile to fix docker build failure.
- Improve documentation (fix metric names and typos in the user tutorial, add BibTeX to README and repo).
- Limit protobuf version to be 3.20.x to fix onnxruntime dependency error.
- Update omegaconf version to 2.3.0 and fix issues caused by omegaconf version update.
- Update Docker Exec Command for Persistent HPCX Environment.
- Fix cuda 12.2 dockerfile LD_LIBRARY_PATH issue.
- Use types-setuptools to replace types-pkg_resources.
- Add configuration for NDv5 H100 and AMD MI300x.
Micro-benchmark Improvements
- Add hipblasLt tuning to dist-inference cpp implementation.
- Add support for NVIDIA L4/L40/L40s GPUs in gemm-flops.
- Upgrade mlc to v3.11.
Model-benchmark Improvements
- Support FP8 transformer model training in ROCm6.2 dockerfile.
Result Analysis
- Fix failure test bug and pandas warning in data diagnosis.
Published by yukirora over 1 year ago
superbenchmark - Release SuperBench v0.10.0
SuperBench 0.10.0 Release Notes
SuperBench Improvements
- Support monitoring for AMD GPUs.
- Support ROCm 5.7 and ROCm 6.0 dockerfile.
- Add MSCCL support for Nvidia GPU.
- Fix NUMA domains swap issue in NDv4 topology file.
- Add NDv5 topo file.
- Fix NCCL and NCCL-test to 2.18.3 for hang issue in CUDA 12.2.
Micro-benchmark Improvements
- Add HPL random generator to gemm-flops with ROCm.
- Add DirectXGPURenderFPS benchmark to measure the FPS of rendering simple frames.
- Add HWDecoderFPS benchmark to measure the FPS of hardware decoder performance.
- Update Docker image for H100 support.
- Update MLC version to 3.10 for CUDA/ROCm dockerfile.
- Bug fix for GPU Burn test.
- Support INT8 in cublaslt function.
- Add hipBLASLt function benchmark.
- Support cpu-gpu and gpu-cpu in ib-validation.
- Support graph mode in NCCL/RCCL benchmarks for latency metrics.
- Support cpp implementation in distributed inference benchmark.
- Add O2 option for gpu copy ROCm build.
- Support different hipblasLt data types in dist inference.
- Support in-place in NCCL/RCCL benchmark.
- Support data type option in NCCL/RCCL benchmark.
- Improve P2P performance with fine-grained GPU memory in GPU-copy test for AMD GPUs.
- Update hipblaslt GEMM metric unit to tflops.
- Support FP8 for hipblaslt benchmark.
Model Benchmark Improvements
- Change torch.distributed.launch to torchrun.
- Support Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark.
Result Analysis
- Support baseline generation from multiple nodes.
Published by abuccts about 2 years ago
superbenchmark - Release SuperBench v0.9.0
SuperBench 0.9.0 Release Notes
SuperBench Improvements
- Support Ctrl+C and interrupt to stop all SuperBench testing.
- Support Windows Docker for VDI/Gaming GPU.
- Support DirectX platform for Nvidia and AMD GPU.
- Add System Config Info feature in SB runner to support distributed collection.
- Support DirectX test pipeline.
Micro-benchmark Improvements
- Add DirectXGPUCopyBw Benchmark to measure HtoD/DtoH bandwidth by DirectX.
- Add DirectXGPUCoreFLops Benchmark to measure peak FLOPS by DirectX.
- Add DirectXGPUMemBw Benchmark to measure GPU memory bandwidth by DirectX.
- Add DirectXVCNEncodingLatency Benchmark to measure the VCN hardware encoding latency on AMD graphic GPUs.
- Support best algorithm selection in cudnn-function microbenchmark.
- Revise step time collection in distributed inference benchmark.
Model Benchmark Improvements
- Fix early stop logic due to num_steps in model benchmarks.
- Support TensorRT models on Nvidia H100.
Documentation Improvements
- Improve documentation for System Config Info.
- Update outdated references.
Published by yukirora over 2 years ago
superbenchmark - Release SuperBench v0.8.0
SuperBench 0.8.0 Release Notes
SuperBench Improvements
- Support SuperBench Executor running on Windows.
- Remove fixed rccl version in rocm5.1.x docker file.
- Upgrade networkx version to fix installation compatibility issue.
- Pin setuptools version to v65.7.0.
- Limit ansible_runner version for Python 3.6.
- Support cgroup V2 when reading system metrics in monitor.
- Fix analyzer bug in Python 3.8 due to pandas api change.
- Collect real-time GPU power in monitor.
- Remove unreachable condition when writing host list in mpi mode.
- Upgrade Docker image with cuda12.1, nccl 2.17.1-1, hpcx v2.14, and mlc 3.10.
- Fix wrong unit of cpu-memory-bw-latency in document.
Micro-benchmark Improvements
- Add STREAM benchmark for sustainable memory bandwidth and the corresponding computation rate.
- Add HPL Benchmark for HPC Linpack Benchmark.
- Support flexible warmup and non-random data initialization in cublas-benchmark.
- Support error tolerance in micro-benchmark for CuDNN function.
- Add distributed inference benchmark.
- Support tensor core precisions (e.g., FP8) and batch/shape range in cublaslt gemm.
Model Benchmark Improvements
- Fix torch.dist init issue with multiple models.
- Support TE FP8 in BERT/GPT2 model.
- Add num_workers configurations in model benchmark.
Published by abuccts almost 3 years ago
superbenchmark - Release SuperBench v0.7.0
SuperBench v0.7.0 Release Notes
SuperBench Improvements
- Support non-zero return code when "sb deploy" or "sb run" fails in Ansible.
- Support log flushing to the result file during runtime.
- Update version to include revision hash and date.
- Support "pattern" in mpi mode to run tasks in parallel.
- Support topo-aware, all-pair, and K-batch pattern in mpi mode.
- Fix Transformers version to avoid TensorRT failure.
- Add CUDA11.8 Docker image for NVIDIA arch90 GPUs.
- Support "sb deploy" without pulling image.
Micro-benchmark Improvements
- Support list of custom config string in cudnn-functions and cublas-functions.
- Support correctness check in cublas-functions.
- Support GEMM-FLOPS for NVIDIA arch90 GPUs.
- Support cuBLASLt FP16 and FP8 GEMM.
- Add wait time option to resolve mem-bw unstable issue.
- Fix bug for incorrect datatype judgement in cublas-function source code.
Model Benchmark Improvements
- Support FP8 in BERT model training.
Distributed Benchmark Improvements
- Support pair-wise pattern in IB validation benchmark.
- Support topo-aware, pair-wise, and K-batch pattern in nccl-bw benchmark.
Published by abuccts about 3 years ago
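As an illustration of the pair-wise pattern listed in the v0.7.0 notes above, the sketch below splits all node pairs into rounds so that every node talks to at most one peer per round. It uses the classic round-robin "circle" method. This is an assumption made for illustration, not SuperBench's actual scheduling code, and the function name `pairwise_rounds` is hypothetical.

```python
def pairwise_rounds(nodes):
    """Split all node pairs into rounds where each node appears at most once."""
    ring = list(nodes)
    if len(ring) % 2:
        # Odd node count: add a placeholder, the node paired with it idles.
        ring.append(None)
    n = len(ring)
    rounds = []
    for _ in range(n - 1):
        # Pair the i-th entry from the front with the i-th entry from the back.
        pairs = [(ring[i], ring[n - 1 - i]) for i in range(n // 2)]
        rounds.append([p for p in pairs if None not in p])
        # Keep the first entry fixed and rotate the rest by one position.
        ring = [ring[0], ring[-1]] + ring[1:-1]
    return rounds


if __name__ == "__main__":
    for idx, pairs in enumerate(pairwise_rounds(["node0", "node1", "node2", "node3"])):
        print(idx, pairs)
```

Because the pairs within a round are disjoint, they could in principle be benchmarked in parallel, which is the point of running a pattern in mpi mode.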
superbenchmark - Release SuperBench v0.6.0
SuperBench v0.6.0 Release Notes
SuperBench Improvement
- Support running on host directly without Docker.
- Support running sb command inside docker image.
- Support ROCm 5.1.1.
- Support ROCm 5.1.3.
- Fix bugs in data diagnosis.
- Fix cmake and build issues.
- Support automatic configuration yaml selection on Azure VM.
- Refine error message when GPU is not detected.
- Add return code for Timeout.
- Update Dockerfile for NCCL/RCCL version, tag name, and verbose output.
- Support node_num=1 in mpi mode.
- Update Python setup for required packages.
- Enhance parameter parsing to allow spaces in value.
- Support NO_COLOR for SuperBench output.
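For readers unfamiliar with the NO_COLOR convention mentioned above: when the `NO_COLOR` environment variable is present and not empty, a program should not add ANSI color to its output. The sketch below shows one way a CLI might honor it. The function names and the extra TTY check are assumptions for illustration and do not reflect SuperBench's actual code.

```python
import os
import sys


def color_enabled(environ=None, stream=None):
    """Return True if ANSI colors should be emitted on the given stream."""
    environ = os.environ if environ is None else environ
    stream = sys.stdout if stream is None else stream
    # NO_COLOR convention: any non-empty value disables color.
    if environ.get("NO_COLOR"):
        return False
    # Assumption: also skip color when output is not a terminal.
    return hasattr(stream, "isatty") and stream.isatty()


def colorize(text, ansi_code, enabled):
    """Wrap text in an ANSI SGR sequence only when color is enabled."""
    return f"\033[{ansi_code}m{text}\033[0m" if enabled else text


if __name__ == "__main__":
    print(colorize("benchmark finished", 32, color_enabled()))
```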
Micro-benchmark Improvements
- Fix issues in ib loopback benchmark.
- Fix stability issue in ib loopback benchmark.
Distributed Benchmark Improvements
- Enhance pair-wise IB benchmark.
- Bug Fix in IB benchmark.
- Support topology-aware IB benchmark.
Data Diagnosis and Analysis
- Add failure check function in data_diagnosis.py.
- Support JSON and JSONL in Diagnosis.
- Add support to store values of metrics in data diagnosis.
- Support exit code of sb result diagnosis.
- Format int type and unify empty value to N/A in diagnosis output files.
Published by abuccts over 3 years ago
superbenchmark - Pre-release v0.6.0-rc1
Pre-release v0.6.0-rc1.
Published by abuccts over 3 years ago
superbenchmark - Release SuperBench v0.5.0
SuperBench 0.5.0 Release Notes
Micro-benchmark Improvements
- Support NIC only NCCL bandwidth benchmark on single node in NCCL/RCCL bandwidth test.
- Support bi-directional bandwidth benchmark in GPU copy bandwidth test.
- Support data checking in GPU copy bandwidth test.
- Update rccl-tests submodule to fix divide by zero error.
- Add GPU-Burn micro-benchmark.
Model-benchmark Improvements
- Sync results on root rank for e2e model benchmarks in distributed mode.
- Support customized env in local and torch.distributed mode.
- Add support for pytorch>=1.9.0.
- Keep BatchNorm as fp32 for pytorch cnn models cast to fp16.
- Remove FP16 samples type converting time.
- Support FAMBench.
Inference Benchmark Improvements
- Revise the default setting for inference benchmark.
- Add percentile metrics for inference benchmarks.
- Support T4 and A10 in GEMM benchmark.
- Add configuration with inference benchmark.
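To make the percentile metrics above concrete, here is a minimal sketch of how latency percentiles could be computed from per-step timings. It uses the nearest-rank definition. Both the percentile method and the metric names such as `latency_p99` are assumptions for illustration, not necessarily what SuperBench reports.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(latencies_ms, percentiles=(50, 90, 95, 99)):
    """Map hypothetical metric names to percentile latencies."""
    return {f"latency_p{p}": percentile(latencies_ms, p) for p in percentiles}


if __name__ == "__main__":
    print(latency_summary([12.1, 11.8, 12.4, 30.2, 12.0, 11.9, 12.3, 12.2]))
```

Tail percentiles such as p99 expose occasional slow steps that a mean would hide, which is why they are commonly reported for inference workloads.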
Other Improvements
- Add command to support listing all optional parameters for benchmarks.
- Unify benchmark naming convention and support multiple tests with same benchmark and different parameters/options in one configuration file.
- Support timeout to detect the benchmark failure and stop the process automatically.
- Add rocm5.0 dockerfile.
- Improve output interface.
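The timeout support listed above can be pictured with a small sketch: run a benchmark command as a child process, stop it if it exceeds a time limit, and report a distinct return code. The helper name and the return code 124 (borrowed from the GNU `timeout` convention) are assumptions for illustration. SuperBench's actual mechanism and codes may differ.

```python
import subprocess
import sys

# Hypothetical value, following the GNU timeout convention.
TIMEOUT_RETURN_CODE = 124


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (return_code, stdout); the child is killed on timeout."""
    try:
        completed = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child here.
        return TIMEOUT_RETURN_CODE, ""
    return completed.returncode, completed.stdout


if __name__ == "__main__":
    code, out = run_with_timeout([sys.executable, "-c", "print('done')"], 30)
    print(code, out.strip())
```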
Data Diagnosis and Analysis
- Support multi-benchmark check.
- Support result summary in md, html and excel formats.
- Support data diagnosis in md and html formats.
- Support result output for all nodes in data diagnosis.
Published by abuccts almost 4 years ago
superbenchmark - Release SuperBench v0.4.0
SuperBench 0.4.0 Release Notes
SuperBench Framework
Monitor
- Add monitor framework for NVIDIA GPU, CPU, memory and disk.
Data Diagnosis and Analysis
- Support baseline-based data diagnosis.
- Support basic analysis feature (boxplot figure, outlier detection, etc.).
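As a rough illustration of the boxplot-style outlier detection mentioned above, the sketch below flags values outside the usual 1.5 x IQR whiskers. The function name and the choice of the IQR rule are assumptions. The release notes do not say which method the analysis feature uses. It requires Python 3.8 or later for `statistics.quantiles`.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    # Hypothetical per-node bandwidth readings; one node is far off.
    print(iqr_outliers([10, 11, 12, 13, 14, 15, 100]))
```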
Single-node Validation
Micro Benchmarks
- CPU Memory Validation (tool: Intel Memory Latency Checker).
- GPU Copy Bandwidth (tool: built by MSRA).
- Add ORT Model on AMD GPU platform.
- Add inference backend TensorRT.
- Add inference backend ORT.
Multi-node Validation
Micro Benchmarks
- IB Networking validation.
- TCP validation (tool: TCPing).
- GPCNet Validation (tool: GPCNet).
Other Improvement
Enhancement
- Add pipeline for AMD docker.
- Integrate system config info script with SuperBench.
- Support FP32 mode without TF32.
- Refine unit test for microbenchmark.
- Unify metric names for all benchmarks.
Document
- Add benchmark list.
- Add monitor document.
- Add data diagnosis document.
Published by abuccts about 4 years ago
superbenchmark - Release SuperBench v0.3.0
SuperBench v0.3.0 Release Notes
SuperBench Framework
Runner
- Implement MPI mode.
Benchmarks
- Support Docker benchmark.
Single-node Validation
Micro Benchmarks
- Memory (Tool: NVIDIA/AMD Bandwidth Test Tool)
| Metrics        | Unit | Description                        |
|----------------|------|------------------------------------|
| H2D_Mem_BW_GPU | GB/s | host-to-GPU bandwidth for each GPU |
| D2H_Mem_BW_GPU | GB/s | GPU-to-host bandwidth for each GPU |
- IBLoopback (Tool: PerfTest – Standard RDMA Test Tool)
| Metrics  | Unit | Description                                                   |
|----------|------|---------------------------------------------------------------|
| IB_Write | MB/s | The IB write loopback throughput with different message sizes |
| IB_Read  | MB/s | The IB read loopback throughput with different message sizes  |
| IB_Send  | MB/s | The IB send loopback throughput with different message sizes  |
- NCCL/RCCL (Tool: NCCL/RCCL Tests)
| Metrics             | Unit | Description                                                     |
|---------------------|------|-----------------------------------------------------------------|
| NCCL_AllReduce      | GB/s | The NCCL AllReduce performance with different message sizes     |
| NCCL_AllGather      | GB/s | The NCCL AllGather performance with different message sizes     |
| NCCL_broadcast      | GB/s | The NCCL Broadcast performance with different message sizes     |
| NCCL_reduce         | GB/s | The NCCL Reduce performance with different message sizes        |
| NCCL_reduce_scatter | GB/s | The NCCL ReduceScatter performance with different message sizes |
- Disk (Tool: FIO – Standard Disk Performance Tool)
| Metrics        | Unit | Description                                                                     |
|----------------|------|---------------------------------------------------------------------------------|
| Seq_Read       | MB/s | Sequential read performance                                                     |
| Seq_Write      | MB/s | Sequential write performance                                                    |
| Rand_Read      | MB/s | Random read performance                                                         |
| Rand_Write     | MB/s | Random write performance                                                        |
| Seq_R/W_Read   | MB/s | Read performance in sequential read/write, fixed measurement (read:write = 4:1) |
| Seq_R/W_Write  | MB/s | Write performance in sequential read/write (read:write = 4:1)                   |
| Rand_R/W_Read  | MB/s | Read performance in random read/write (read:write = 4:1)                        |
| Rand_R/W_Write | MB/s | Write performance in random read/write (read:write = 4:1)                       |
- H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build)
| Metrics       | Unit | Description                                         |
|---------------|------|-----------------------------------------------------|
| H2D_SM_BW_GPU | GB/s | host-to-GPU bandwidth using GPU kernel for each GPU |
| D2H_SM_BW_GPU | GB/s | GPU-to-host bandwidth using GPU kernel for each GPU |
AMD GPU Support
Docker Image Support
- ROCm 4.2 PyTorch 1.7.0
- ROCm 4.0 PyTorch 1.7.0
Micro Benchmarks
- Kernel Launch (Tool: MSR-A build)
| Metrics                  | Unit      | Description                                                  |
|--------------------------|-----------|--------------------------------------------------------------|
| Kernel_Launch_Event_Time | Time (ms) | Dispatch latency measured in GPU time using hipEventRecord() |
| Kernel_Launch_Wall_Time  | Time (ms) | Dispatch latency measured in CPU time                        |
- GEMM FLOPS (Tool: AMD rocblas-bench Tool)
| Metrics  | Unit   | Description                   |
|----------|--------|-------------------------------|
| FP64     | GFLOPS | FP64 FLOPS without MatrixCore |
| FP32(MC) | GFLOPS | TF32 FLOPS with MatrixCore    |
| FP16(MC) | GFLOPS | FP16 FLOPS with MatrixCore    |
| BF16(MC) | GFLOPS | BF16 FLOPS with MatrixCore    |
| INT8(MC) | GOPS   | INT8 FLOPS with MatrixCore    |
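For context on the GFLOPS unit in the GEMM table, a matrix multiply of an m x k matrix by a k x n matrix is conventionally counted as 2·m·n·k floating-point operations (one multiply and one add per inner-product term). The sketch below converts a measured runtime into GFLOPS under that convention. The function name is hypothetical, and the benchmark tool itself may count or report differently.

```python
def gemm_gflops(m, n, k, seconds):
    """GFLOPS for C = A @ B with A of shape (m, k) and B of shape (k, n)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # One multiply and one add per inner-product term.
    flop = 2 * m * n * k
    return flop / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical timing: an 8192^3 GEMM that took 50 ms.
    print(f"{gemm_gflops(8192, 8192, 8192, 0.05):.1f} GFLOPS")
```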
E2E Benchmarks
CNN models -- Use PyTorch torchvision models
- ResNet: ResNet-50, ResNet-101, ResNet-152
- DenseNet: DenseNet-169, DenseNet-201
- VGG: VGG-11, VGG-13, VGG-16, VGG-19
BERT -- Use huggingface Transformers
- BERT
- BERT Large
LSTM -- Use PyTorch
GPT-2 -- Use huggingface Transformers
Bug Fix
- VGG models failed on A100 GPU with batch_size=128
Other Improvement
Contribution related
- Contribute rule
- System information collection
Document
- Add release process doc
- Add design documents
- Add developer guide doc for coding style
- Add contribution rules
- Add docker image list
- Add initial validation results
Published by abuccts over 4 years ago
superbenchmark - Release SuperBench v0.2.1
SuperBench v0.2.1 Release Notes
Bug Fixes
- Fix Ansible connection issue when running on localhost.
- Fix crashes in distributed training of VGG models.
- Fix bug when converting bool config to store_true argument.
Published by abuccts over 4 years ago
superbenchmark - Release SuperBench v0.2.0
SuperBench v0.2.0 Release Notes
SuperBench Framework
- Implemented a CLI to provide a command line interface.
- Implemented Runner for nodes control and management.
- Implemented Executor.
- Implemented Benchmark framework.
Supported Benchmarks
- Supported Micro-benchmarks
  - GEMM FLOPS (GFLOPS, TensorCore, cuBLAS, cuDNN)
  - Kernel Launch Time (KernelLaunchEventTime, KernelLaunchWallTime)
  - Operator Performance (MatMul, Sharding_MatMul)
- Supported Model-benchmarks
  - CNN models (Reference: torchvision models)
    - ResNet (ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152)
    - DenseNet (DenseNet-161, DenseNet-169, DenseNet-201)
    - VGG (VGG-11, VGG-13, VGG-16, VGG-19, VGG11_bn, VGG13_bn, VGG16_bn, VGG19_bn)
    - MNASNet (mnasnet0_5, mnasnet0_75, mnasnet1_0, mnasnet1_3)
    - AlexNet
    - GoogLeNet
    - Inception_v3
    - mobilenet_v2
    - ResNeXt (resnext50_32x4d, resnext101_32x8d)
    - Wide ResNet (wide_resnet50_2, wide_resnet101_2)
    - ShuffleNet (shufflenet_v2_x0_5, shufflenet_v2_x1_0, shufflenet_v2_x1_5, shufflenet_v2_x2_0)
    - SqueezeNet (squeezenet1_0, squeezenet1_1)
  - LSTM model
  - BERT models (BERT-Base, BERT-Large)
  - GPT-2 model (specify which config)
Examples and Documents
- Added examples to run benchmarks respectively.
- Tutorial Documents (introduction, getting-started, developer-guides, APIs, benchmarks).
- Built SuperBench website.
Published by TobeyQin over 4 years ago