Recent Releases of opencl-benchmark

opencl-benchmark - OpenCL-Benchmark v1.8

  • INT8 benchmark will now measure dp4a throughput on all supported AMD/Intel/Nvidia GPUs
  • fixed compiling on macOS with new OpenCL headers
  • updated OpenCL-Wrapper

 |----------------.------------------------------------------------------------|
 | Device ID      | 0                                                          |
 | Device Name    | NVIDIA H100 80GB HBM3                                      |
 | Device Vendor  | NVIDIA Corporation                                         |
 | Device Driver  | 565.57.01 (Linux)                                          |
 | OpenCL Version | OpenCL C 1.2                                               |
 | Compute Units  | 132 at 1980 MHz (16896 cores, 66.908 TFLOPs/s)             |
 | Memory, Cache  | 81105 MB VRAM, 4224 KB global / 48 KB local                |
 | Buffer Limits  | 20276 MB global, 64 KB constant                            |
 |----------------'------------------------------------------------------------|
 | Info: OpenCL C code successfully compiled.                                  |
 | FP64  compute                                        31.184 TFLOPs/s (1/2 ) |
 | FP32  compute                                        62.908 TFLOPs/s ( 1x ) |
 | FP16  compute                                       123.749 TFLOPs/s ( 2x ) |
 | INT64 compute                                         3.227  TIOPs/s (1/24) |
 | INT32 compute                                        32.946  TIOPs/s (1/2 ) |
 | INT16 compute                                        30.901  TIOPs/s (1/2 ) |
-| INT8  compute                                        30.582  TIOPs/s (1/2 ) |
+| INT8  compute                                       103.204  TIOPs/s ( 2x ) |
 | Memory Bandwidth ( coalesced read )                            3025.53 GB/s |
 | Memory Bandwidth ( coalesced write)                            3055.98 GB/s |
 | Memory Bandwidth (misaligned read )                            2102.44 GB/s |
 | Memory Bandwidth (misaligned write)                             314.25 GB/s |
 | PCIe   Bandwidth (send            )                              10.53 GB/s |
 | PCIe   Bandwidth (    receive     )                              11.47 GB/s |
 | PCIe   Bandwidth (   bidirectional)              (Gen4 x16)      10.91 GB/s |
 |-----------------------------------------------------------------------------|

 |----------------.------------------------------------------------------------|
 | Device ID      | 0                                                          |
 | Device Name    | AMD Instinct MI300X                                        |
 | Device Vendor  | Advanced Micro Devices, Inc.                               |
 | Device Driver  | 3635.0 (HSA1.1,LC) (Linux)                                 |
 | OpenCL Version | OpenCL C 2.0                                               |
 | Compute Units  | 304 at 2100 MHz (19456 cores, 81.715 TFLOPs/s)             |
 | Memory, Cache  | 196592 MB VRAM, 32 KB global / 64 KB local                 |
 | Buffer Limits  | 196592 MB global, 201310208 KB constant                    |
 |----------------'------------------------------------------------------------|
 | Info: OpenCL C code successfully compiled.                                  |
 | FP64  compute                                        54.944 TFLOPs/s (2/3 ) |
 | FP32  compute                                       130.000 TFLOPs/s ( 2x ) |
 | FP16  compute                                       141.320 TFLOPs/s ( 2x ) |
 | INT64 compute                                         3.666  TIOPs/s (1/24) |
 | INT32 compute                                        47.736  TIOPs/s (2/3 ) |
 | INT16 compute                                        69.022  TIOPs/s ( 1x ) |
-| INT8  compute                                        43.582  TIOPs/s (1/2 ) |
+| INT8  compute                                       106.178  TIOPs/s ( 1x ) |
 | Memory Bandwidth ( coalesced read )                            3756.64 GB/s |
 | Memory Bandwidth ( coalesced write)                            4686.31 GB/s |
 | Memory Bandwidth (misaligned read )                            3881.24 GB/s |
 | Memory Bandwidth (misaligned write)                            2491.25 GB/s |
 | PCIe   Bandwidth (send            )                              54.57 GB/s |
 | PCIe   Bandwidth (    receive     )                              55.79 GB/s |
 | PCIe   Bandwidth (   bidirectional)              (Gen4 x16)      55.21 GB/s |
 |-----------------------------------------------------------------------------|

 |----------------.------------------------------------------------------------|
 | Device ID      | 0                                                          |
 | Device Name    | Intel(R) Arc(TM) B580 Graphics                             |
 | Device Vendor  | Intel(R) Corporation                                       |
 | Device Driver  | 32.0.101.6559 (Windows)                                    |
 | OpenCL Version | OpenCL C 3.0                                               |
 | Compute Units  | 160 at 2850 MHz (2560 cores, 14.592 TFLOPs/s)              |
 | Memory, Cache  | 12187 MB VRAM, 18432 KB global / 128 KB local              |
 | Buffer Limits  | 11944 MB global, 12230900 KB constant                      |
 |----------------'------------------------------------------------------------|
 | Info: OpenCL C code successfully compiled.                                  |
 | FP64  compute                                         0.896 TFLOPs/s (1/16) |
 | FP32  compute                                        14.249 TFLOPs/s ( 1x ) |
 | FP16  compute                                        26.547 TFLOPs/s ( 2x ) |
 | INT64 compute                                         0.636  TIOPs/s (1/24) |
 | INT32 compute                                         4.556  TIOPs/s (1/3 ) |
 | INT16 compute                                        37.082  TIOPs/s ( 2x ) |
-| INT8  compute                                        24.424  TIOPs/s ( 2x ) |
+| INT8  compute                                        48.668  TIOPs/s ( 4x ) |
 | Memory Bandwidth ( coalesced read )                             574.09 GB/s |
 | Memory Bandwidth ( coalesced write)                             468.07 GB/s |
 | Memory Bandwidth (misaligned read )                             796.23 GB/s |
 | Memory Bandwidth (misaligned write)                             383.15 GB/s |
 | PCIe   Bandwidth (send            )                               4.99 GB/s |
 | PCIe   Bandwidth (    receive     )                               4.87 GB/s |
 | PCIe   Bandwidth (   bidirectional)              (Gen3 x16)       5.11 GB/s |
 |-----------------------------------------------------------------------------|
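For readers unfamiliar with dp4a: it is a packed dot-product instruction that multiplies four 8-bit integers from each of two 32-bit words and adds the four products to a 32-bit accumulator. The following is a minimal scalar sketch of that arithmetic in C++, for illustration only. It is not the benchmark's kernel code, and the function names `pack_int8`, `unpack_int8` and `dp4a_reference` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: extract the i-th signed byte (i = 0..3) of a 32-bit word.
inline int32_t unpack_int8(uint32_t packed, int i) {
    assert(i >= 0 && i < 4);
    return static_cast<int8_t>((packed >> (8 * i)) & 0xFFu);
}

// Hypothetical helper: pack four signed bytes into one word, x0 in the lowest byte.
inline uint32_t pack_int8(int8_t x0, int8_t x1, int8_t x2, int8_t x3) {
    return  static_cast<uint32_t>(static_cast<uint8_t>(x0))
         | (static_cast<uint32_t>(static_cast<uint8_t>(x1)) << 8)
         | (static_cast<uint32_t>(static_cast<uint8_t>(x2)) << 16)
         | (static_cast<uint32_t>(static_cast<uint8_t>(x3)) << 24);
}

// Scalar reference for a signed dp4a: acc + a0*b0 + a1*b1 + a2*b2 + a3*b3.
// One call performs four multiplies and four adds.
inline int32_t dp4a_reference(uint32_t a, uint32_t b, int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        acc += unpack_int8(a, i) * unpack_int8(b, i);
    }
    return acc;
}
```

A GPU with hardware support executes the whole loop as a single instruction, which is why the INT8 rows above rise so sharply once the instruction is used.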

- C++
Published by ProjectPhysX over 1 year ago

opencl-benchmark - OpenCL-Benchmark v1.8

  • INT8 benchmark will now use dp4a instruction if supported
  • fixed compiling on macOS with new OpenCL headers

 |----------------.------------------------------------------------------------|
 | Device ID      | 0                                                          |
 | Device Name    | NVIDIA H100 80GB HBM3                                      |
 | Device Vendor  | NVIDIA Corporation                                         |
 | Device Driver  | 565.57.01 (Linux)                                          |
 | OpenCL Version | OpenCL C 1.2                                               |
 | Compute Units  | 132 at 1980 MHz (16896 cores, 66.908 TFLOPs/s)             |
 | Memory, Cache  | 81105 MB VRAM, 4224 KB global / 48 KB local                |
 | Buffer Limits  | 20276 MB global, 64 KB constant                            |
 |----------------'------------------------------------------------------------|
 | Info: OpenCL C code successfully compiled.                                  |
 | FP64  compute                                        31.184 TFLOPs/s (1/2 ) |
 | FP32  compute                                        62.908 TFLOPs/s ( 1x ) |
 | FP16  compute                                       123.749 TFLOPs/s ( 2x ) |
 | INT64 compute                                         3.227  TIOPs/s (1/24) |
 | INT32 compute                                        32.946  TIOPs/s (1/2 ) |
 | INT16 compute                                        30.901  TIOPs/s (1/2 ) |
-| INT8  compute                                        30.582  TIOPs/s (1/2 ) |
+| INT8  compute                                       103.204  TIOPs/s ( 2x ) |
 | Memory Bandwidth ( coalesced read )                            3025.53 GB/s |
 | Memory Bandwidth ( coalesced write)                            3055.98 GB/s |
 | Memory Bandwidth (misaligned read )                            2102.44 GB/s |
 | Memory Bandwidth (misaligned write)                             314.25 GB/s |
 | PCIe   Bandwidth (send            )                              10.53 GB/s |
 | PCIe   Bandwidth (    receive     )                              11.47 GB/s |
 | PCIe   Bandwidth (   bidirectional)              (Gen4 x16)      10.91 GB/s |
 |-----------------------------------------------------------------------------|

- C++
Published by ProjectPhysX over 1 year ago

opencl-benchmark - OpenCL-Benchmark v1.7

  • faster enqueueReadBuffer() on modern CPUs with 64-Byte-aligned host_buffer
  • updated OpenCL headers
  • better OpenCL device specs detection using vendor ID and Nvidia compute capability
  • better VRAM capacity reporting correction for Intel dGPUs
  • fixed wrong device name reporting for AMD GPUs (unlike every sane GPU vendor they don't report device name as CL_DEVICE_NAME but need CL_DEVICE_BOARD_NAME_AMD extension instead)
  • fixed TFlops estimate for Intel Battlemage GPUs

 |----------------.------------------------------------------------------------|
 | Device ID      | 1                                                          |
-| Device Name    | gfx90a:sramecc+:xnack-                                     |
+| Device Name    | AMD Instinct MI210                                         |
 | Device Vendor  | Advanced Micro Devices, Inc.                               |
 | Device Driver  | 3625.0 (HSA1.1,LC)                                         |
 | OpenCL Version | OpenCL C 2.0                                               |
 | Compute Units  | 104 at 1700 MHz (6656 cores, 22.630 TFLOPs/s)              |
 | Memory, Cache  | 65520 MB, 16 KB global / 64 KB local                       |
 | Buffer Limits  | 65520 MB global, 67092480 KB constant                      |
 |----------------'------------------------------------------------------------|
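The 64-Byte-aligned host_buffer item in this release can be illustrated with a small sketch of how a host allocation might be aligned to 64 bytes, a common CPU cache-line size. This is an assumption about the general technique, not the OpenCL-Wrapper implementation; the class name `AlignedHostBuffer` and the helper `is_aligned` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <type_traits>
#include <vector>

// Hypothetical helper: owns n elements of T whose start address is a multiple
// of 64 bytes. It over-allocates raw bytes and lets std::align pick the first
// suitably aligned address inside that storage.
template<typename T>
class AlignedHostBuffer {
    static_assert(std::is_trivial<T>::value, "sketch only handles trivial element types");
public:
    static constexpr std::size_t alignment = 64;

    explicit AlignedHostBuffer(std::size_t n)
        : storage_(n * sizeof(T) + alignment), size_(n) {
        assert(n > 0);
        void* p = storage_.data();
        std::size_t space = storage_.size();
        // std::align returns nullptr if the aligned block does not fit;
        // the extra `alignment` bytes of slack guarantee that it does.
        data_ = static_cast<T*>(std::align(alignment, n * sizeof(T), p, space));
        assert(data_ != nullptr);
    }
    // Copying would leave data_ pointing into the other object's storage.
    AlignedHostBuffer(const AlignedHostBuffer&) = delete;
    AlignedHostBuffer& operator=(const AlignedHostBuffer&) = delete;

    T* data() { return data_; }
    std::size_t size() const { return size_; }

private:
    std::vector<unsigned char> storage_;
    std::size_t size_;
    T* data_ = nullptr;
};

// True if p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A pointer obtained from `data()` could then be handed to a host-to-device or device-to-host copy; whether the driver actually takes a faster path for such pointers depends on the platform.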

- C++
Published by ProjectPhysX over 1 year ago

opencl-benchmark - OpenCL-Benchmark v1.6

  • automatically use zero-copy buffers on CPUs/iGPUs to reduce memory footprint
  • bandwidth kernels now write non-zero data, to avoid hardware optimizations for zero-initialized buffers

- C++
Published by ProjectPhysX over 1 year ago

opencl-benchmark - OpenCL-Benchmark v1.5

  • enabled benchmarking FP16 vector arithmetic on Nvidia Pascal and newer GPUs with Nvidia driver 520 or newer
  • removed wait() call at the end of the benchmark on Linux

 |----------------.------------------------------------------------------------|
 | Device ID      | 9                                                          |
 | Device Name    | NVIDIA GeForce RTX 2080 Ti                                 |
 | Device Vendor  | NVIDIA Corporation                                         |
 | Device Driver  | 525.89.02 (Linux)                                          |
 | OpenCL Version | OpenCL C 1.2                                               |
 | Compute Units  | 68 at 1545 MHz (4352 cores, 13.448 TFLOPs/s)               |
 | Memory, Cache  | 11011 MB, 2176 KB global / 48 KB local                     |
 | Buffer Limits  | 2752 MB global, 64 KB constant                             |
 |----------------'------------------------------------------------------------|
 | Info: OpenCL C code successfully compiled.                                  |
 | FP64  compute                                         0.517 TFLOPs/s (1/24) |
 | FP32  compute                                        16.597 TFLOPs/s ( 1x ) |
-| FP16  compute                                       not supported           |
+| FP16  compute                                        33.054 TFLOPs/s ( 2x ) |
 | INT64 compute                                         3.563  TIOPs/s (1/4 ) |
 | INT32 compute                                        16.385  TIOPs/s ( 1x ) |
 | INT16 compute                                        13.286  TIOPs/s ( 1x ) |
 | INT8  compute                                        10.502  TIOPs/s (2/3 ) |
 | Memory Bandwidth ( coalesced read )                             532.76 GB/s |
 | Memory Bandwidth ( coalesced write)                             548.88 GB/s |
 | Memory Bandwidth (misaligned read )                             534.43 GB/s |
 | Memory Bandwidth (misaligned write)                             157.78 GB/s |
 | PCIe   Bandwidth (send            )                              12.86 GB/s |
 | PCIe   Bandwidth (    receive     )                              12.99 GB/s |
 | PCIe   Bandwidth (   bidirectional)              (Gen4 x16)       6.30 GB/s |
 |-----------------------------------------------------------------------------|
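The peak figure in the Compute Units row is consistent with cores × clock × 2, where the factor 2 assumes each core retires one fused multiply-add (two floating-point operations) per clock. For the RTX 2080 Ti above: 4352 × 1545 MHz × 2 ≈ 13.448 TFLOPs/s. A minimal sketch of that arithmetic follows; the function name `peak_tflops` is hypothetical, and the factor is an assumption that happens to match the listed values.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: theoretical peak in TFLOPs/s from core count and clock.
// Assumes `flops_per_core_per_clock` operations per core per cycle
// (2 for one fused multiply-add).
inline double peak_tflops(uint32_t cores, uint32_t clock_mhz,
                          uint32_t flops_per_core_per_clock = 2) {
    assert(cores > 0 && clock_mhz > 0);
    // cores * MHz * ops gives MFLOPs/s; scale by 1e-6 to get TFLOPs/s.
    return static_cast<double>(cores) * clock_mhz * flops_per_core_per_clock * 1e-6;
}
```

The measured FP32 result can exceed this estimate, as it does in the listing above, because the estimate uses the clock frequency reported by the driver rather than the clock the GPU actually sustains.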

- C++
Published by ProjectPhysX almost 2 years ago

opencl-benchmark - OpenCL-Benchmark v1.4

- C++
Published by ProjectPhysX almost 2 years ago

opencl-benchmark - OpenCL-Benchmark v1.3

  • workaround for Nvidia driver bug: enqueueFillBuffer is broken for large buffers on Nvidia GPUs
  • fixed slow numeric drift issues
  • fixed terrible performance on ARM GPUs by macro-replacing fused-multiply-add (fma) with a*b+c
  • added automatic OS detection in make.sh
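The fma workaround in this release can be pictured as a preprocessor switch that swaps the fused call for a plain multiply followed by an add. The sketch below shows the idea in host-side C++ rather than in OpenCL C kernel source; the macro names `BENCH_FMA` and `USE_PLAIN_MULTIPLY_ADD` are hypothetical and do not come from the project.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical switch: define USE_PLAIN_MULTIPLY_ADD to avoid the fused call
// on targets where fma is emulated slowly.
#ifdef USE_PLAIN_MULTIPLY_ADD
#define BENCH_FMA(a, b, c) ((a) * (b) + (c))
#else
#define BENCH_FMA(a, b, c) std::fma((a), (b), (c))
#endif

// Dependent chain of multiply-adds: x = y * x + y, repeated `iterations` times.
inline float fma_chain(float x, const float y, const int iterations) {
    assert(iterations >= 0);
    for (int i = 0; i < iterations; ++i) {
        x = BENCH_FMA(y, x, y);
    }
    return x;
}
```

The two variants can differ in the last bit of the result, since a fused multiply-add rounds once and a*b+c rounds twice; for a throughput benchmark that difference does not matter.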

- C++
Published by ProjectPhysX about 2 years ago

opencl-benchmark - OpenCL-Benchmark v1.2

Updates from OpenCL-Wrapper:

  • corrected TFlops/s estimate for Intel Data Center GPU Max series
  • made correction of wrong memory reporting on Intel Arc more robust
  • made CPU/GPU buffer initialization significantly faster with std::fill and enqueueFillBuffer
  • added operating system info to OpenCL device driver version printout
  • bug fix in print_message() function in utilities.hpp

- C++
Published by ProjectPhysX over 2 years ago

opencl-benchmark - OpenCL-Benchmark v1.1

Fixed several issues with macOS

- C++
Published by ProjectPhysX about 3 years ago

opencl-benchmark - OpenCL-Benchmark v1.0

Initial Release. Have fun!

- C++
Published by ProjectPhysX about 3 years ago