Recent Releases of superbenchmark

superbenchmark - Release SuperBench v0.12.0

SuperBench 0.12.0 Release Notes

SuperBench Improvements

  • Optimize cutlass build process for faster builds and smaller binaries.
  • Improve image build pipeline.
  • Add support for arm64 builds.
  • Upgrade pipeline dependencies.
  • Fix SuperBench installation and code lint issues.
  • Update Flake8 repository.
  • Add support for the latest Python versions.
  • Enhance error handling for pkg_resources imports.
  • Update ROCm image build labels.
  • Add CUDA 12.8 and CUDA 12.9 support.
  • Consolidate multi-architecture Docker images.
  • Upgrade runner OS to latest version.
  • Fix typos in documentation and code.

Micro-benchmark Improvements

  • Add general CPU bandwidth and latency benchmarks.
  • Add nvbandwidth build process and benchmarks.
  • Add compute capability 10.0 architecture support in gemm-flops.
  • Add GPU Stream micro benchmark.
  • Add FP4 GEMM FLOPS support in cublaslt_gemm benchmark.
  • Add Grace CPU support for CPU Stream benchmark.
  • Revise CPU Stream benchmark.
  • Fix NUMA error on Grace CPU in gpu-copy benchmark.
  • Bump onnxruntime-gpu dependency from 1.10.0 to 1.12.0.
  • Fix stderr message in gpu-copy benchmark.
  • Fix TensorRT inference parsing.
  • Handle N/A values in nvbandwidth benchmark.
  • Avoid unintended nvbandwidth function calls in all benchmarks.
  • Support CUDA arch flag and autotuning in cublaslt GEMM.
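
The N/A handling mentioned above matters because nvbandwidth prints "N/A" for unmeasured device pairs, which breaks naive numeric parsing. A minimal sketch of tolerant row parsing (illustrative only; `parse_bandwidth_row` is a hypothetical helper, not SuperBench's actual parser):

```python
def parse_bandwidth_row(row):
    """Convert one row of nvbandwidth-style matrix output to floats.

    Cells reported as "N/A" (e.g. a device paired with itself) become None
    instead of raising ValueError in float(), so the row keeps its shape.
    """
    values = []
    for cell in row.split():
        if cell.upper() in ("N/A", "NA"):
            values.append(None)  # missing measurement, preserve position
        else:
            values.append(float(cell))
    return values
```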

Model-benchmark Improvements

  • Add LLaMA-2 model benchmarks.
  • Add Mixture of Experts model benchmarks.
  • Add DeepSeek inference benchmark (AMD GPU).

Result Analysis

  • Enhance logging for diagnosis rule baseline errors.

Documentation Updates

  • Update CODEOWNERS file.

- Python
Published by polarG 7 months ago

superbenchmark - Release SuperBench v0.11.0

SuperBench 0.11.0 Release Notes

SuperBench Improvements

  • Add CUDA 12.4 dockerfile.
  • Upgrade nccl version to v2.23.4 and install ucx v1.16.0 in cuda 12.4 dockerfile.
  • Fix MSCCL build error in CUDA12.4 docker build pipeline.
  • Add ROCm6.2 dockerfile.
  • Update hpcx link in cuda11.1 dockerfile to fix docker build failure.
  • Improve documentation (fix metric names and typos in user tutorial; add BibTeX to README and repo).
  • Limit protobuf version to be 3.20.x to fix onnxruntime dependency error.
  • Update omegaconf version to 2.3.0 and fix issues caused by omegaconf version update.
  • Update Docker Exec Command for Persistent HPCX Environment.
  • Fix cuda 12.2 dockerfile LD_LIBRARY_PATH issue.
  • Use types-setuptools to replace types-pkg_resources.
  • Add configuration for NDv5 H100 and AMD MI300x.

Micro-benchmark Improvements

  • Add hipblasLt tuning to dist-inference cpp implementation.
  • Add support for NVIDIA L4/L40/L40s GPUs in gemm-flops.
  • Upgrade mlc to v3.11.

Model-benchmark Improvements

  • Support FP8 transformer model training in ROCm6.2 dockerfile.

Result Analysis

  • Fix bug of failure test and warning of pandas in data diagnosis.

Published by yukirora over 1 year ago

superbenchmark - Release SuperBench v0.10.0

SuperBench 0.10.0 Release Notes

SuperBench Improvements

  • Support monitoring for AMD GPUs.
  • Support ROCm 5.7 and ROCm 6.0 dockerfile.
  • Add MSCCL support for Nvidia GPU.
  • Fix NUMA domains swap issue in NDv4 topology file.
  • Add NDv5 topo file.
  • Pin NCCL and NCCL-tests to 2.18.3 to fix hang issue in CUDA 12.2.

Micro-benchmark Improvements

  • Add HPL random generator to gemm-flops with ROCm.
  • Add DirectXGPURenderFPS benchmark to measure the FPS of rendering simple frames.
  • Add HWDecoderFPS benchmark to measure the FPS of hardware decoder performance.
  • Update Docker image for H100 support.
  • Update MLC version to 3.10 for CUDA/ROCm dockerfile.
  • Bug fix for GPU Burn test.
  • Support INT8 in cublaslt function.
  • Add hipBLASLt function benchmark.
  • Support cpu-gpu and gpu-cpu in ib-validation.
  • Support graph mode in NCCL/RCCL benchmarks for latency metrics.
  • Support cpp implementation in distributed inference benchmark.
  • Add O2 option for gpu copy ROCm build.
  • Support different hipblasLt data types in dist inference.
  • Support in-place in NCCL/RCCL benchmark.
  • Support data type option in NCCL/RCCL benchmark.
  • Improve P2P performance with fine-grained GPU memory in GPU-copy test for AMD GPUs.
  • Update hipblaslt GEMM metric unit to tflops.
  • Support FP8 for hipblaslt benchmark.

Model Benchmark Improvements

  • Change torch.distributed.launch to torchrun.
  • Support Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark.

Result Analysis

  • Support baseline generation from multiple nodes.
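
Baseline generation from multiple nodes amounts to aggregating the same metric across per-node results. The sketch below uses a simple mean, with a hypothetical result shape (SuperBench's actual result schema and aggregation strategy may differ):

```python
from statistics import mean

def generate_baseline(node_results, metric):
    """Aggregate one metric across nodes into a single baseline value.

    node_results maps node name -> {metric name: value}; this shape is an
    assumption for illustration, not SuperBench's result schema.
    """
    values = [r[metric] for r in node_results.values() if metric in r]
    return mean(values) if values else None
```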

Published by abuccts about 2 years ago

superbenchmark - Release SuperBench v0.9.0

SuperBench 0.9.0 Release Notes

SuperBench Improvements

  • Support Ctrl+C and interrupt to stop all SuperBench testing.
  • Support Windows Docker for VDI/Gaming GPU.
  • Support DirectX platform for Nvidia and AMD GPU.
  • Add System Config Info feature in SB runner to support distributed collection.
  • Support DirectX test pipeline.

Micro-benchmark Improvements

  • Add DirectXGPUCopyBw Benchmark to measure HtoD/DtoH bandwidth by DirectX.
  • Add DirectXGPUCoreFlops Benchmark to measure peak FLOPS by DirectX.
  • Add DirectXGPUMemBw Benchmark to measure GPU memory bandwidth by DirectX.
  • Add DirectXVCNEncodingLatency Benchmark to measure the VCN hardware encoding latency on AMD graphics GPUs.
  • Support best algorithm selection in cudnn-function microbenchmark.
  • Revise step time collection in distributed inference benchmark.

Model Benchmark Improvements

  • Fix early-stop logic triggered by num_steps in model benchmarks.
  • Support TensorRT models on Nvidia H100.

Documentation Improvements

  • Improve documentation for System Config Info.
  • Update outdated references.

Published by yukirora over 2 years ago

superbenchmark - Release SuperBench v0.8.0

SuperBench 0.8.0 Release Notes

SuperBench Improvements

  • Support SuperBench Executor running on Windows.
  • Remove fixed rccl version in rocm5.1.x dockerfile.
  • Upgrade networkx version to fix installation compatibility issue.
  • Pin setuptools version to v65.7.0.
  • Limit ansible_runner version for Python 3.6.
  • Support cgroup V2 when reading system metrics in monitor.
  • Fix analyzer bug in Python 3.8 due to pandas api change.
  • Collect real-time GPU power in monitor.
  • Remove unreachable condition when writing host list in mpi mode.
  • Upgrade Docker image with cuda12.1, nccl 2.17.1-1, hpcx v2.14, and mlc 3.10.
  • Fix wrong unit of cpu-memory-bw-latency in document.

Micro-benchmark Improvements

  • Add STREAM benchmark for sustainable memory bandwidth and the corresponding computation rate.
  • Add HPL Benchmark for HPC Linpack Benchmark.
  • Support flexible warmup and non-random data initialization in cublas-benchmark.
  • Support error tolerance in micro-benchmark for CuDNN function.
  • Add distributed inference benchmark.
  • Support tensor core precisions (e.g., FP8) and batch/shape range in cublaslt gemm.

Model Benchmark Improvements

  • Fix torch.dist init issue with multiple models.
  • Support TE FP8 in BERT/GPT2 model.
  • Add num_workers configurations in model benchmark.

Published by abuccts almost 3 years ago

superbenchmark - Release SuperBench v0.7.0

SuperBench v0.7.0 Release Notes

SuperBench Improvements

  • Support non-zero return code when "sb deploy" or "sb run" fails in Ansible.
  • Support log flushing to the result file during runtime.
  • Update version to include revision hash and date.
  • Support "pattern" in mpi mode to run tasks in parallel.
  • Support topo-aware, all-pair, and K-batch patterns in mpi mode.
  • Pin Transformers version to avoid TensorRT failure.
  • Add CUDA11.8 Docker image for NVIDIA arch90 GPUs.
  • Support "sb deploy" without pulling image.

Micro-benchmark Improvements

  • Support list of custom config string in cudnn-functions and cublas-functions.
  • Support correctness check in cublas-functions.
  • Support GEMM-FLOPS for NVIDIA arch90 GPUs.
  • Support cuBLASLt FP16 and FP8 GEMM.
  • Add wait time option to resolve mem-bw unstable issue.
  • Fix incorrect data type judgment in cublas-function source code.

Model Benchmark Improvements

  • Support FP8 in BERT model training.

Distributed Benchmark Improvements

  • Support pair-wise pattern in IB validation benchmark.
  • Support topo-aware, pair-wise, and K-batch patterns in nccl-bw benchmark.

Published by abuccts about 3 years ago

superbenchmark - Release SuperBench v0.6.0

SuperBench v0.6.0 Release Notes

SuperBench Improvements

  • Support running on host directly without Docker.
  • Support running sb command inside docker image.
  • Support ROCm 5.1.1.
  • Support ROCm 5.1.3.
  • Fix bugs in data diagnosis.
  • Fix cmake and build issues.
  • Support automatic configuration yaml selection on Azure VM.
  • Refine error message when GPU is not detected.
  • Add return code for Timeout.
  • Update Dockerfile for NCCL/RCCL version, tag name, and verbose output.
  • Support node_num=1 in mpi mode.
  • Update Python setup for required packages.
  • Enhance parameter parsing to allow spaces in value.
  • Support NO_COLOR for SuperBench output.
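
Allowing spaces in parameter values typically means splitting on shell-style quoting rather than raw whitespace. A minimal sketch using the standard library (`parse_parameters` is a hypothetical helper, not SuperBench's actual parsing code):

```python
import shlex

def parse_parameters(arg_string):
    """Split a benchmark parameter string, preserving quoted values with spaces."""
    return shlex.split(arg_string)
```

For example, `parse_parameters('--tag "my run" --num_steps 100')` keeps `my run` as a single value.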

Micro-benchmark Improvements

  • Fix issues in ib loopback benchmark.
  • Fix stability issue in ib loopback benchmark.

Distributed Benchmark Improvements

  • Enhance pair-wise IB benchmark.
  • Bug Fix in IB benchmark.
  • Support topology-aware IB benchmark.

Data Diagnosis and Analysis

  • Add failure check function in data_diagnosis.py.
  • Support JSON and JSONL in Diagnosis.
  • Add support to store values of metrics in data diagnosis.
  • Support exit code of sb result diagnosis.
  • Format int type and unify empty value to N/A in diagnosis output files.
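
The diagnosis flow above (JSON/JSONL input, baseline comparison, exit code) can be sketched roughly as follows. The field names, rule shape, and fixed tolerance are assumptions for illustration, not SuperBench's actual diagnosis rule format:

```python
import json

def diagnose_jsonl(jsonl_text, baselines, tolerance=0.05):
    """Flag (node, metric, value) triples falling below baseline * (1 - tolerance)."""
    defective = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines in JSONL input
        record = json.loads(line)
        for metric, baseline in baselines.items():
            value = record.get(metric)
            if value is not None and value < baseline * (1 - tolerance):
                defective.append((record["node"], metric, value))
    return defective
```

A non-empty result can then drive a non-zero exit code, matching the "exit code of sb result diagnosis" behavior noted above.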

Published by abuccts over 3 years ago

superbenchmark - Pre-release v0.6.0-rc1

Pre-release v0.6.0-rc1.

Published by abuccts over 3 years ago

superbenchmark - Release SuperBench v0.5.0

SuperBench 0.5.0 Release Notes

Micro-benchmark Improvements

  • Support NIC-only NCCL bandwidth benchmark on a single node in NCCL/RCCL bandwidth test.
  • Support bi-directional bandwidth benchmark in GPU copy bandwidth test.
  • Support data checking in GPU copy bandwidth test.
  • Update rccl-tests submodule to fix divide by zero error.
  • Add GPU-Burn micro-benchmark.

Model-benchmark Improvements

  • Sync results on root rank for e2e model benchmarks in distributed mode.
  • Support customized env in local and torch.distributed mode.
  • Add support for pytorch>=1.9.0.
  • Keep BatchNorm in FP32 for PyTorch CNN models cast to FP16.
  • Exclude FP16 sample type-conversion time.
  • Support FAMBench.

Inference Benchmark Improvements

  • Revise the default setting for inference benchmark.
  • Add percentile metrics for inference benchmarks.
  • Support T4 and A10 in GEMM benchmark.
  • Add configuration for inference benchmark.
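
Percentile latency metrics are usually computed with a nearest-rank rule over the collected step latencies (e.g. p50/p90/p99). A minimal sketch, not SuperBench's implementation:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample covering p percent of the data."""
    ordered = sorted(samples)
    idx = max(math.ceil(p / 100 * len(ordered)) - 1, 0)
    return ordered[idx]
```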

Other Improvements

  • Add command to support listing all optional parameters for benchmarks.
  • Unify benchmark naming convention and support multiple tests with same benchmark and different parameters/options in one configuration file.
  • Support timeout to detect the benchmark failure and stop the process automatically.
  • Add rocm5.0 dockerfile.
  • Improve output interface.
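
Timeout-based failure detection generally wraps the benchmark process and kills it past a deadline. A generic sketch using the standard library (illustrative; not SuperBench's runner code):

```python
import subprocess

def run_with_timeout(cmd, timeout_sec):
    """Run a command, returning its exit code, or 124 on timeout.

    124 mirrors the coreutils `timeout` convention; a killed process
    counts as a benchmark failure.
    """
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout_sec).returncode
    except subprocess.TimeoutExpired:
        return 124
```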

Data Diagnosis and Analysis

  • Support multi-benchmark check.
  • Support result summary in md, html and excel formats.
  • Support data diagnosis in md and html formats.
  • Support result output for all nodes in data diagnosis.

Published by abuccts almost 4 years ago

superbenchmark - Pre-release v0.5.0-rc1

Published by abuccts almost 4 years ago

superbenchmark - Release SuperBench v0.4.0

SuperBench 0.4.0 Release Notes

SuperBench Framework

Monitor

  • Add monitor framework for NVIDIA GPU, CPU, memory and disk.

Data Diagnosis and Analysis

  • Support baseline-based data diagnosis.
  • Support basic analysis feature (boxplot figure, outlier detection, etc.).

Single-node Validation

Micro Benchmarks

  • CPU Memory Validation (tool: Intel Memory Latency Checker).
  • GPU Copy Bandwidth (tool: built by MSRA).
  • Add ORT Model on AMD GPU platform.
  • Add inference backend TensorRT.
  • Add inference backend ORT.

Multi-node Validation

Micro Benchmarks

  • IB Networking validation.
  • TCP validation (tool: TCPing).
  • GPCNet Validation (tool: GPCNet).

Other Improvements

  1. Enhancement

    • Add pipeline for AMD docker.
    • Integrate system config info script with SuperBench.
    • Support FP32 mode without TF32.
    • Refine unit test for microbenchmark.
    • Unify metric names for all benchmarks.
  2. Document

    • Add benchmark list.
    • Add monitor document.
    • Add data diagnosis document.

Published by abuccts about 4 years ago

superbenchmark - Release SuperBench v0.3.0

SuperBench v0.3.0 Release Notes

SuperBench Framework

Runner

  • Implement MPI mode.

Benchmarks

  • Support Docker benchmark.

Single-node Validation

Micro Benchmarks

  1. Memory (Tool: NVIDIA/AMD Bandwidth Test Tool)

| Metrics | Unit | Description |
|---------|------|-------------|
| H2DMemBWGPU | GB/s | host-to-GPU bandwidth for each GPU |
| D2HMemBWGPU | GB/s | GPU-to-host bandwidth for each GPU |

  2. IBLoopback (Tool: PerfTest – Standard RDMA Test Tool)

| Metrics | Unit | Description |
|---------|------|-------------|
| IBWrite | MB/s | The IB write loopback throughput with different message sizes |
| IBRead | MB/s | The IB read loopback throughput with different message sizes |
| IB_Send | MB/s | The IB send loopback throughput with different message sizes |

  3. NCCL/RCCL (Tool: NCCL/RCCL Tests)

| Metrics | Unit | Description |
|---------|------|-------------|
| NCCLAllReduce | GB/s | The NCCL AllReduce performance with different message sizes |
| NCCLAllGather | GB/s | The NCCL AllGather performance with different message sizes |
| NCCLbroadcast | GB/s | The NCCL Broadcast performance with different message sizes |
| NCCLreduce | GB/s | The NCCL Reduce performance with different message sizes |
| NCCLreducescatter | GB/s | The NCCL ReduceScatter performance with different message sizes |

  4. Disk (Tool: FIO – Standard Disk Performance Tool)

| Metrics | Unit | Description |
|---------|------|-------------|
| SeqRead | MB/s | Sequential read performance |
| SeqWrite | MB/s | Sequential write performance |
| RandRead | MB/s | Random read performance |
| RandWrite | MB/s | Random write performance |
| SeqR/WRead | MB/s | Read performance in sequential read/write, fixed measurement (read:write = 4:1) |
| SeqR/WWrite | MB/s | Write performance in sequential read/write (read:write = 4:1) |
| RandR/WRead | MB/s | Read performance in random read/write (read:write = 4:1) |
| RandR/WWrite | MB/s | Write performance in random read/write (read:write = 4:1) |

  5. H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build)

| Metrics | Unit | Description |
|---------|------|-------------|
| H2DSMBWGPU | GB/s | host-to-GPU bandwidth using GPU kernel for each GPU |
| D2HSMBWGPU | GB/s | GPU-to-host bandwidth using GPU kernel for each GPU |

AMD GPU Support

Docker Image Support

  • ROCm 4.2 PyTorch 1.7.0
  • ROCm 4.0 PyTorch 1.7.0

Micro Benchmarks

  1. Kernel Launch (Tool: MSR-A build)

| Metrics | Unit | Description |
|---------|------|-------------|
| KernelLaunchEventTime | Time (ms) | Dispatch latency measured in GPU time using hipEventRecord() |
| KernelLaunchWallTime | Time (ms) | Dispatch latency measured in CPU time |

  2. GEMM FLOPS (Tool: AMD rocblas-bench Tool)

| Metrics | Unit | Description |
|---------|------|-------------|
| FP64 | GFLOPS | FP64 FLOPS without MatrixCore |
| FP32(MC) | GFLOPS | FP32 FLOPS with MatrixCore |
| FP16(MC) | GFLOPS | FP16 FLOPS with MatrixCore |
| BF16(MC) | GFLOPS | BF16 FLOPS with MatrixCore |
| INT8(MC) | GOPS | INT8 OPS with MatrixCore |

E2E Benchmarks

  1. CNN models -- Use PyTorch torchvision models

    • ResNet: ResNet-50, ResNet-101, ResNet-152
    • DenseNet: DenseNet-169, DenseNet-201
    • VGG: VGG-11, VGG-13, VGG-16, VGG-19
  2. BERT -- Use huggingface Transformers

    • BERT
    • BERT Large
  3. LSTM -- Use PyTorch

  4. GPT-2 -- Use huggingface Transformers

Bug Fix

  • Fix VGG models failure on A100 GPU with batch_size=128.

Other Improvements

  1. Contribution related

    • Contribution rules
    • System information collection
  2. Document

    • Add release process doc
    • Add design documents
    • Add developer guide doc for coding style
    • Add contribution rules
    • Add docker image list
    • Add initial validation results

Published by abuccts over 4 years ago

superbenchmark - Release SuperBench v0.2.1

SuperBench v0.2.1 Release Notes

Bug Fixes

  • Fix Ansible connection issue when running in localhost.
  • Fix crashes in VGG models distributed training.
  • Fix bug when converting bool config to store_true argument.

Published by abuccts over 4 years ago

superbenchmark - Release SuperBench v0.2.0

SuperBench v0.2.0 Release Notes

SuperBench Framework

  • Implemented a command line interface (CLI).
  • Implemented Runner for node control and management.
  • Implemented Executor.
  • Implemented Benchmark framework.

Supported Benchmarks

  • Supported Micro-benchmarks
    • GEMM FLOPS (GFLOPS, TensorCore, cuBLAS, cuDNN)
    • Kernel Launch Time (KernelLaunchEventTime, KernelLaunchWallTime)
    • Operator Performance (MatMul, Sharding_MatMul)
  • Supported Model-benchmarks
    • CNN models (Reference: torchvision models)
    • ResNet (ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152)
    • DenseNet (DenseNet-161, DenseNet-169, DenseNet-201)
    • VGG (VGG-11, VGG-13, VGG-16, VGG-19, VGG-11 BN, VGG-13 BN, VGG-16 BN, VGG-19 BN)
    • MNASNet (mnasnet0_5, mnasnet0_75, mnasnet1_0, mnasnet1_3)
    • AlexNet
    • GoogLeNet
    • Inception_v3
    • mobilenet_v2
    • ResNeXt (resnext50_32x4d, resnext101_32x8d)
    • Wide ResNet (wide_resnet50_2, wide_resnet101_2)
    • ShuffleNet (shufflenet_v2_x0_5, shufflenet_v2_x1_0, shufflenet_v2_x1_5, shufflenet_v2_x2_0)
    • SqueezeNet (squeezenet1_0, squeezenet1_1)
    • LSTM model
    • BERT models (BERT-Base, BERT-Large)
    • GPT-2 model

Examples and Documents

  • Added examples to run benchmarks individually.
  • Tutorial Documents (introduction, getting-started, developer-guides, APIs, benchmarks).
  • Built SuperBench website.

Published by TobeyQin over 4 years ago