Releases | Open Source Science

opencl-benchmark - OpenCL-Benchmark v1.8

INT8 benchmark will now measure dp4a throughput on all supported AMD/Intel/Nvidia GPUs
fixed compiling on macOS with new OpenCL headers
updated OpenCL-Wrapper
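
For readers unfamiliar with dp4a: it is a packed dot product that multiplies four pairs of 8-bit integers and adds the four products to a 32-bit accumulator in one instruction. The release note does not show the benchmark kernel, so the following is only a scalar C++ reference for the signed variant. The names `dp4a_reference` and `pack4` are hypothetical and chosen for illustration.

```cpp
#include <cstdint>

// Hypothetical scalar reference for a signed dp4a operation (not the benchmark kernel):
// a and b each hold four packed signed 8-bit lanes; the lane-wise products
// are summed and added to the 32-bit accumulator c.
int32_t dp4a_reference(uint32_t a, uint32_t b, int32_t c) {
    int32_t sum = c;
    for (int lane = 0; lane < 4; lane++) {
        const int8_t ai = static_cast<int8_t>((a >> (8 * lane)) & 0xFFu);
        const int8_t bi = static_cast<int8_t>((b >> (8 * lane)) & 0xFFu);
        sum += static_cast<int32_t>(ai) * static_cast<int32_t>(bi);
    }
    return sum;
}

// Packs four signed 8-bit values into one 32-bit word, lane 0 in the lowest byte.
uint32_t pack4(int8_t x0, int8_t x1, int8_t x2, int8_t x3) {
    return  static_cast<uint32_t>(static_cast<uint8_t>(x0))
         | (static_cast<uint32_t>(static_cast<uint8_t>(x1)) << 8)
         | (static_cast<uint32_t>(static_cast<uint8_t>(x2)) << 16)
         | (static_cast<uint32_t>(static_cast<uint8_t>(x3)) << 24);
}
```

With lanes (1, -2, 3, -4) and (5, 6, -7, 8) and an accumulator of 10, `dp4a_reference` returns 5 - 12 - 21 - 32 + 10 = -50.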

|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA H100 80GB HBM3                                      |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 565.57.01 (Linux)                                          |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 132 at 1980 MHz (16896 cores, 66.908 TFLOPs/s)             |
| Memory, Cache  | 81105 MB VRAM, 4224 KB global / 48 KB local                |
| Buffer Limits  | 20276 MB global, 64 KB constant                            |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                        31.184 TFLOPs/s (1/2 ) |
| FP32  compute                                        62.908 TFLOPs/s ( 1x ) |
| FP16  compute                                       123.749 TFLOPs/s ( 2x ) |
| INT64 compute                                         3.227  TIOPs/s (1/24) |
| INT32 compute                                        32.946  TIOPs/s (1/2 ) |
| INT16 compute                                        30.901  TIOPs/s (1/2 ) |
-| INT8  compute                                        30.582  TIOPs/s (1/2 ) |
+| INT8  compute                                       103.204  TIOPs/s ( 2x ) |
| Memory Bandwidth ( coalesced read      )                       3025.53 GB/s |
| Memory Bandwidth ( coalesced      write)                       3055.98 GB/s |
| Memory Bandwidth (misaligned read      )                       2102.44 GB/s |
| Memory Bandwidth (misaligned      write)                        314.25 GB/s |
| PCIe   Bandwidth (send                 )                         10.53 GB/s |
| PCIe   Bandwidth (   receive           )                         11.47 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   10.91 GB/s |
|-----------------------------------------------------------------------------|

|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | AMD Instinct MI300X                                        |
| Device Vendor  | Advanced Micro Devices, Inc.                               |
| Device Driver  | 3635.0 (HSA1.1,LC) (Linux)                                 |
| OpenCL Version | OpenCL C 2.0                                               |
| Compute Units  | 304 at 2100 MHz (19456 cores, 81.715 TFLOPs/s)             |
| Memory, Cache  | 196592 MB VRAM, 32 KB global / 64 KB local                 |
| Buffer Limits  | 196592 MB global, 201310208 KB constant                    |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                        54.944 TFLOPs/s (2/3 ) |
| FP32  compute                                       130.000 TFLOPs/s ( 2x ) |
| FP16  compute                                       141.320 TFLOPs/s ( 2x ) |
| INT64 compute                                         3.666  TIOPs/s (1/24) |
| INT32 compute                                        47.736  TIOPs/s (2/3 ) |
| INT16 compute                                        69.022  TIOPs/s ( 1x ) |
-| INT8  compute                                        43.582  TIOPs/s (1/2 ) |
+| INT8  compute                                       106.178  TIOPs/s ( 1x ) |
| Memory Bandwidth ( coalesced read      )                       3756.64 GB/s |
| Memory Bandwidth ( coalesced      write)                       4686.31 GB/s |
| Memory Bandwidth (misaligned read      )                       3881.24 GB/s |
| Memory Bandwidth (misaligned      write)                       2491.25 GB/s |
| PCIe   Bandwidth (send                 )                         54.57 GB/s |
| PCIe   Bandwidth (   receive           )                         55.79 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   55.21 GB/s |
|-----------------------------------------------------------------------------|

|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Intel(R) Arc(TM) B580 Graphics                             |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 32.0.101.6559 (Windows)                                    |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 160 at 2850 MHz (2560 cores, 14.592 TFLOPs/s)              |
| Memory, Cache  | 12187 MB VRAM, 18432 KB global / 128 KB local              |
| Buffer Limits  | 11944 MB global, 12230900 KB constant                      |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         0.896 TFLOPs/s (1/16) |
| FP32  compute                                        14.249 TFLOPs/s ( 1x ) |
| FP16  compute                                        26.547 TFLOPs/s ( 2x ) |
| INT64 compute                                         0.636  TIOPs/s (1/24) |
| INT32 compute                                         4.556  TIOPs/s (1/3 ) |
| INT16 compute                                        37.082  TIOPs/s ( 2x ) |
-| INT8  compute                                        24.424  TIOPs/s ( 2x ) |
+| INT8  compute                                        48.668  TIOPs/s ( 4x ) |
| Memory Bandwidth ( coalesced read      )                        574.09 GB/s |
| Memory Bandwidth ( coalesced      write)                        468.07 GB/s |
| Memory Bandwidth (misaligned read      )                        796.23 GB/s |
| Memory Bandwidth (misaligned      write)                        383.15 GB/s |
| PCIe   Bandwidth (send                 )                          4.99 GB/s |
| PCIe   Bandwidth (   receive           )                          4.87 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen3 x16)    5.11 GB/s |
|-----------------------------------------------------------------------------|
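
The TFLOPs/s figure in each "Compute Units" row above matches cores × clock × 2, for example 16896 × 1980 MHz × 2 = 66.908 TFLOPs/s for the H100. This relation is inferred from the printed numbers, not taken from the benchmark source. The factor 2 assumes one fused multiply-add per core per clock, counted as two operations. A minimal sketch with a hypothetical helper name:

```cpp
#include <cstdint>

// Hypothetical helper: theoretical FP32 peak in TFLOPs/s.
// Assumption: every core retires one fused multiply-add per clock,
// and one FMA is counted as 2 floating-point operations.
double peak_tflops(uint32_t cores, uint32_t clock_mhz) {
    const double flops_per_core_per_clock = 2.0;
    // cores * MHz yields mega-operations per second; divide by 1e6 to get tera.
    return static_cast<double>(cores) * static_cast<double>(clock_mhz)
         * flops_per_core_per_clock / 1.0e6;
}
```

The ratio printed in parentheses after each measured result appears to be the measured throughput relative to this estimate, rounded to a simple fraction or multiple.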

- C++
Published by ProjectPhysX over 1 year ago

opencl-benchmark - OpenCL-Benchmark v1.8

INT8 benchmark will now use dp4a instruction if supported
fixed compiling on macOS with new OpenCL headers

- C++
Published by ProjectPhysX over 1 year ago

opencl-benchmark - OpenCL-Benchmark v1.7

faster enqueueReadBuffer() on modern CPUs with 64-Byte-aligned host_buffer
updated OpenCL headers
better OpenCL device specs detection using vendor ID and Nvidia compute capability
improved correction of VRAM capacity reporting on Intel dGPUs
fixed wrong device name reporting for AMD GPUs (unlike every sane GPU vendor, they don't report the device name as CL_DEVICE_NAME but need the CL_DEVICE_BOARD_NAME_AMD extension instead)
fixed TFlops estimate for Intel Battlemage GPUs
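
The notes above do not say how host_buffer is aligned. One way to get a 64-byte-aligned host allocation in C++17 is the aligned form of `operator new`, sketched below. The names `allocate_host_buffer`, `AlignedDeleter`, and `is_aligned` are hypothetical, and this is not the OpenCL-Wrapper's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Assumption: 64 bytes, the alignment named in the release note.
constexpr std::size_t kAlignment = 64;

// Releases memory obtained from the aligned operator new used below.
struct AlignedDeleter {
    void operator()(void* p) const noexcept {
        ::operator delete(p, std::align_val_t(kAlignment));
    }
};

using AlignedFloats = std::unique_ptr<float[], AlignedDeleter>;

// Hypothetical helper: allocates `count` floats whose start address is 64-byte aligned.
AlignedFloats allocate_host_buffer(std::size_t count) {
    void* raw = ::operator new(count * sizeof(float), std::align_val_t(kAlignment));
    return AlignedFloats(static_cast<float*>(raw));
}

// Returns true if the address is a multiple of kAlignment.
bool is_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlignment == 0;
}
```

A pointer obtained this way could then be passed as the host pointer of a read or write call. The `unique_ptr` with a custom deleter ensures the matching aligned `operator delete` is used.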

|----------------.------------------------------------------------------------|
| Device ID      | 1                                                          |
-| Device Name    | gfx90a:sramecc+:xnack-                                     |
+| Device Name    | AMD Instinct MI210                                         |
| Device Vendor  | Advanced Micro Devices, Inc.                               |
| Device Driver  | 3625.0 (HSA1.1,LC)                                         |
| OpenCL Version | OpenCL C 2.0                                               |
| Compute Units  | 104 at 1700 MHz (6656 cores, 22.630 TFLOPs/s)              |
| Memory, Cache  | 65520 MB, 16 KB global / 64 KB local                       |
| Buffer Limits  | 65520 MB global, 67092480 KB constant                      |
|----------------'------------------------------------------------------------|

- C++
Published by ProjectPhysX over 1 year ago

opencl-benchmark - OpenCL-Benchmark v1.6

automatically use zero-copy buffers on CPUs/iGPUs to reduce memory footprint
bandwidth kernels now write non-zero data, to avoid hardware optimizations for zero-initialized buffers

- C++
Published by ProjectPhysX over 1 year ago

opencl-benchmark - OpenCL-Benchmark v1.5

enabled benchmarking FP16 vector arithmetic on Nvidia Pascal and newer GPUs with Nvidia driver 520 or newer
removed wait() call at the end of the benchmark on Linux

|----------------.------------------------------------------------------------|
| Device ID      | 9                                                          |
| Device Name    | NVIDIA GeForce RTX 2080 Ti                                 |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 525.89.02 (Linux)                                          |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 68 at 1545 MHz (4352 cores, 13.448 TFLOPs/s)               |
| Memory, Cache  | 11011 MB, 2176 KB global / 48 KB local                     |
| Buffer Limits  | 2752 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         0.517 TFLOPs/s (1/24) |
| FP32  compute                                        16.597 TFLOPs/s ( 1x ) |
-| FP16  compute                                                 not supported |
+| FP16  compute                                        33.054 TFLOPs/s ( 2x ) |
| INT64 compute                                         3.563  TIOPs/s (1/4 ) |
| INT32 compute                                        16.385  TIOPs/s ( 1x ) |
| INT16 compute                                        13.286  TIOPs/s ( 1x ) |
| INT8  compute                                        10.502  TIOPs/s (2/3 ) |
| Memory Bandwidth ( coalesced read      )                        532.76 GB/s |
| Memory Bandwidth ( coalesced      write)                        548.88 GB/s |
| Memory Bandwidth (misaligned read      )                        534.43 GB/s |
| Memory Bandwidth (misaligned      write)                        157.78 GB/s |
| PCIe   Bandwidth (send                 )                         12.86 GB/s |
| PCIe   Bandwidth (   receive           )                         12.99 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)    6.30 GB/s |
|-----------------------------------------------------------------------------|

- C++
Published by ProjectPhysX almost 2 years ago

opencl-benchmark - OpenCL-Benchmark v1.4

updated OpenCL-Wrapper
GPU Driver and OpenCL Runtime installation instructions will be printed to console if no OpenCL devices are available

- C++
Published by ProjectPhysX almost 2 years ago

opencl-benchmark - OpenCL-Benchmark v1.3

workaround for Nvidia driver bug: enqueueFillBuffer is broken for large buffers on Nvidia GPUs
fixed slow numeric drift issues
fixed terrible performance on ARM GPUs by macro-replacing fused-multiply-add (fma) with a*b+c
added automatic OS detection in make.sh
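
The fma workaround above can be pictured as a preprocessor define prepended to the OpenCL C source before it is compiled. The sketch below shows that idea from the C++ host side. The function name `apply_fma_workaround` and the `replace_fma` flag are hypothetical, and how the benchmark actually detects ARM GPUs or injects the macro is not stated in the notes.

```cpp
#include <string>

// Hypothetical helper: returns the OpenCL C source with fma() redirected to a
// plain multiply-add expression when the workaround is requested.
std::string apply_fma_workaround(const std::string& kernel_source, bool replace_fma) {
    if (!replace_fma) return kernel_source; // leave the source untouched
    // The define must precede any use of fma() in the kernel code.
    return "#define fma(a, b, c) ((a)*(b)+(c))\n" + kernel_source;
}
```

Note that a*b+c may round twice where a true fused multiply-add rounds once, so results can differ in the last bit.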

- C++
Published by ProjectPhysX about 2 years ago

opencl-benchmark - OpenCL-Benchmark v1.2

Updates from OpenCL-Wrapper:
- corrected TFlops/s estimate for Intel Data Center GPU Max series
- made correction of wrong memory reporting on Intel Arc more robust
- made CPU/GPU buffer initialization significantly faster with std::fill and enqueueFillBuffer
- added operating system info to OpenCL device driver version printout
- bug fix in print_message() function in utilities.hpp

- C++
Published by ProjectPhysX over 2 years ago

opencl-benchmark - OpenCL-Benchmark v1.1

Fixed several issues with macOS

- C++
Published by ProjectPhysX about 3 years ago

opencl-benchmark - OpenCL-Benchmark v1.0

Initial Release. Have fun!

- C++
Published by ProjectPhysX about 3 years ago
