onednn - v3.9.1

This is a patch release containing the following changes to v3.9: * Reduced sizes in Graph API SDPA examples (257d689ade952ba61fa926e8ba8127685133ccd2) * Fixed correctness issue in bf16 depthwise convolution with bf16 bias on AArch64 CPUs (218b41ddb3e9e63cc6f317c02cd79a4a1e4b06a0) * Changed Intel GPU data alignment check from error to warning (5c5008a8cca72cc89650e9530cf2838d28f26277) * Improved bf16 matmul performance on processors with Intel AMX instruction set support (54b63549e97599a58a3ae6ab3e9a381f4ff03c46, 30c4d8d9d967bbcf54ba753305ca392282c692f3) * Fixed PowerPC64 build by adding -mcpu=power10 and -mmma flags (02ca915a3f79ed558edecad390c6adc166ac5d35) * Introduced support for f16 destination in int8 matmul and int8 inner product on x64 CPUs (a62ed6b88db80bbfa8574a54522fcf11fed534f6, 53c0a667a218ba3dbe9fa7ec1492bf2893f90e78, 07500433f58ec0a25c2a100b7a951f2ade44d162, 4f0f068e02af35ec1e70f7a7113227716f76730f) * Introduced support per_tensor zero-points in int8 matmul on Intel GPUs (db8e8ff737016d16fdebb2a931ecbd6aa7a16e3a, f78316439500896a5922d7495867fdc246e2d4ac, 4d458df41ca5e498e8e858ab6b4590e5679db34d, 80453a01bbee0c641c22e72279fca204c036f6c9, 7f90d50536a2dc769348a222cb39f77445d907ac, a2200e2c372d51d493b627e4a99de02991a3279d) * Fixed correctness issue in int8 reorder for cases with compensation on x64 CPUs (771ca54f64aa8a43fdf50db33d8739dc407502e8)

- C++
Published by vpirogov 9 months ago

onednn - v3.9

Performance Optimizations

Intel Architecture Processors

Introduced initial support for future Intel Xeon processors with Intel AVX 10.2 and Intel AMX instruction sets support. This functionality is not dispatched by default and requires opt-in with environment variable ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2.
Introduced initial support for future Intel Core processors with Intel AVX 10.2 instruction set support. This functionality is not dispatched by default and requires opt-in with environment variable ONEDNN_MAX_CPU_ISA=AVX10_2_512.
Improved initialization time for convolution primitive when a large number of threads is used by introducing a new thread partition estimation and adjusting several blocking parameters.
Improved performance of fp8 convolution primitive with scales and bf16 output
Improved performance of matmul primitive with post-ops on processors with Intel AMX support
Improved performance of RNN primitive for LBRGRU and VANILLALSTM cell types on processors with Intel AVX2 instruction set support
Improved performance of the following subgraphs with Graph API:
- Scaled Dot Product Attention (SDPA) with implicit causal mask.
- Grouped Query Attention (GQA) flavor specific for GEMMA models.

Intel Graphics Products

Improved performance on Intel GPUs based on Xe3 architecture.
Improved matmul performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
Improved RNN primitive performance with LBR_GRU cell type.
Improved int8 convolution performance with plain weights and trivial filter.
Improved convolution performance with NCHW activations with 1x1 filter and unit strides.
Improved fp32 softmax performance.
Improved performance of reorder when used with USM host memory.
Improved performance of the following subgraphs with Graph API:
- fp32 SDPA with implicit causal mask.
- fp16 SDPA on Intel GPUs without Intel XMX cores.

AArch64-based Processors

Improved int8 convolution performance.
Improved bf16 depthwise convolution performance.
Improved f16 matmul performance with Arm Compute Library (ACL).

Functionality

Functional API

Introduced Root Mean Square Normalization (RMSNorm) mode for layer normalization primitive. This functionality is optimized for Intel CPUs and Intel GPUs.
Sparse memory objects and sparse matmul are promoted to production status.

Graph API

Introduced support for tanh approximation in GELU operation.
Extended Graph API Softmax operation to support optional stats output.
Introduced fusion support for SDPA training forward and backward propagation.
Introduced fusion support for SDPA with bottom-right implicit causal mask.
Introduced make_scalar_tensor() API for engine-agnostic scalar tensor creation.

Microkernel API

Introduced support for fp8 data type.

Intel Architecture Processors

Introduced support for select algorithm in binary post-op.
Introduced source, destination, and weight scales support in fp8 convolution and deconvolution primitives.

Intel Graphics Products

Introduced support for select algorithm in binary primitive.

Generic GPU Vendor

Introduced support for RNN Vanilla backward propagation.

Usability

Enabled build with -Wundef compiler flag.
[Experimental] Introduced support for kernel compilation with SYCL kernel compiler extension.

Validation

Improved benchdnn performance by optimizing input data filling and testing results comparison steps.
Improved benchdnn graph driver performance mode via adding CPU memory pool for allocator.

Known Limitations

The group normalization with normalization_flags::use_scale specified produces incorrect results for backward propagation kind in oneDNN v3.9 and earlier.
Binary primitive with certain shapes and Graph API SDPA with bottom right causal mask may hang with SYCL debug runtime on Windows.
fp8 matmul primitive may sporadically produce incorrect results on Intel Arc B-series graphics.
int8 inner product primitive with tensors exceeding 4 Gb in size may produce incorrect results on Intel Datacenter GPU Max series.
bf16 pooling with tensors exceeding 4 Gb in size may produce incorrect results on Intel Datacenter GPU Max series.
bf16/fp16 matmul with large inner dimension has a performance regression on Intel Datacenter GPU Max Series.
bf16/fp16 convolution with NCHW activations has a performance regression on Intel Datacenter GPU Max Series.
Softmax with non-trivial strides and blocked format may produce incorrect results.
bf16 layer normalization backpropagation may produce incorrect results on Intel Datacenter GPU Max Series.

Deprecated Functionality

BLAS-like API including dnnl::sgemm, dnnl::gemm_u8s8s32, and dnnl::gemm_s8s8s32 functions is deprecated and will be removed in future releases. If you are using this API consider switching to matmul primitive.

Thanks to our Contributors

This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, @Anallear, Anna Sztukowska @asztukow, Avanish Tiwari @Tiwari-Avanish, Dmitriy Ovchinnikov @inteldimitrius, Kasture Deeksha, Krishna Sai @krishnasai-mcw, Manaal @manaalmj, Marek Michalowski @michalowski-arm, Orel Yehuda @yehudaorel, Ruqiu Cao @rcao8, Tsao Zhong @CaoZhongZ, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, Ye Tao @taoye9, Yuanyuan Chen @cyyever, @gausah-arm, @karmeh01, @pmanczak, and @zhangfeiv0. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by vgvozdeva 10 months ago

onednn - v3.9-rc

Performance Optimizations

Intel Architecture Processors

Introduced initial support for future Intel Xeon processors with Intel AVX 10.2 and Intel AMX instruction sets support. This functionality is not dispatched by default and requires opt-in with environment variable ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2.
Introduced initial support for future Intel Core processors with Intel AVX 10.2 instruction set support. This functionality is not dispatched by default and requires opt-in with environment variable ONEDNN_MAX_CPU_ISA=AVX10_2_512.
Improved initialization time for convolution primitive when a large number of threads is used by introducing a new thread partition estimation and adjusting several blocking parameters.
Improved performance of fp8 convolution primitive with scales and bf16 output
Improved performance of matmul primitive with post-ops on processors with Intel AMX support
Improved performance of RNN primitive for LBRGRU and VANILLALSTM cell types on processors with Intel AVX2 instruction set support
Improved performance of the following subgraphs with Graph API:
- Scaled Dot Product Attention (SDPA) with implicit causal mask.
- Grouped Query Attention (GQA) flavor specific for GEMMA models.

Intel Graphics Products

Improved performance on Intel GPUs based on Xe3 architecture.
Improved matmul performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
Improved RNN primitive performance with LBR_GRU cell type.
Improved int8 convolution performance with plain weights and trivial filter.
Improved convolution performance with NCHW activations with 1x1 filter and unit strides.
Improved fp32 softmax performance.
Improved performance of reorder when used with USM host memory.
Improved performance of the following subgraphs with Graph API:
- SDPA with implicit causal mask.
- SDPA with bottom-right implicit causal mask.
- fp32 SDPA.
- fp16 SDPA on Intel GPUs without Intel XMX cores.

AArch64-based Processors

Improved int8 convolution performance.
Improved bf16 depthwise convolution performance.
Improved f16 matmul performance with Arm Compute Library (ACL).

Functionality

Functional API

Introduced Root Mean Square Normalization (RMSNorm) mode for layer normalization primitive. This functionality is optimized for Intel CPUs and Intel GPUs.
Sparse memory objects and sparse matmul are promoted to production status.

Graph API

Introduced support for tanh approximation in GELU operation.
Extended Graph API Softmax operation to support optional stats output.
Introduced support for SDPA training forward propagation and backpropagation.

Microkernel API

Introduced support for fp8 data type.

Intel Architecture Processors

Introduced support for select algorithm in binary post-op.
Introduced source, destination, and weight scales support in fp8 convolution and deconvolution primitives.

Intel Graphics Products

Introduced support for select algorithm in binary primitive.

Generic GPU Vendor

Introduced support for RNN Vanilla backward propagation.

Usability

Enabled build with -Wundef compiler flag.
[Experimental] Introduced support for kernel compilation with SYCL kernel compiler extension.

Validation

Improved benchdnn performance by optimizing input data filling and testing results comparison steps.

Known Limitations

Deprecated Functionality

BLAS-like API including dnnl::sgemm, dnnl::gemm_u8s8s32, and dnnl::gemm_s8s8s32 functions is deprecated and will be removed in future releases. If you are using this API consider switching to matmul primitive.

Thanks to our Contributors

This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, @Anallear, Anna Sztukowska @asztukow, Avanish Tiwari @Tiwari-Avanish, Dmitriy Ovchinnikov @inteldimitrius, Kasture Deeksha, Krishna Sai @krishnasai-mcw, Manaal @manaalmj, Marek Michalowski @michalowski-arm, Orel Yehuda @yehudaorel, Ruqiu Cao @rcao8, Tsao Zhong @CaoZhongZ, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, Ye Tao @taoye9, Yuanyuan Chen @cyyever, @gausah-arm, @karmeh01, @pmanczak, and @zhangfeiv0. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by tprimak 10 months ago

onednn - v3.8.1

This is a patch release containing the following changes to v3.8: * Fixed correctness issue in reorder primitive with non-trivial strides on Intel CPUs (a762d3248ee5e04b2348f3a5aeecfa64da4634d8) * Fixed runtime error in convolution weight gradient on Xe2 architecture-based Intel GPUs (a8fac73036f67657f51c10b385f967c64607e802, c409ef949ea112e8fc1caf480d55a07247b4a702) * Fixed performance regression in bf16 convolution on Intel Datacenter GPU Max Series (98170d0f138458f4b3fcefca773be2ef7e73959f, c6bae4aa45dbe9ff9fe4e51173dc301550832e08, c5edd53195f6b1465f4ab4857d64a704bb38e8e1, bb1a5919fbedd4ce078f2fcf368a3e099f6c3942) * Improved performance of fp16 matmul with fp8 compressed weights on Intel GPUs (58f3ec1510a4b10e51e57227229d2b2cfe23f55a, abff1764af8a93dda5c9c8be11c5a1a5da31daa7, ffd7dd34d837f6ddb50d2b88515c5f45bb18ed4f, 3b1e855f440a13124d33c05e1ab671eba1401bba, 2e140de469d28b3f49d3284dc0e215b9b43b718a, 3429f79274957e4bd9b9c6ec12bcf2a4e8362a5b) * Fixed runtime error in fp16 pooling primitive on Xe2 architecture based Intel GPUs (c0f6b6ded756c35d50b383c8078fdec1b3ad2f09) * Improved performance of fp16 matmul with int4 weights and 32 < m <= 64 on Intel GPUs (2fa7072a4d632e341a10d883243c0b54359da2fc) * Fixed correctness issues in bf16 matmul with 3 or more dimensional tensors on processors with Intel AMX support (dd20965518965ff0f63093c1f90c957cbe9ad3e6, ea1b4a169d3fe59a8c8a5d60e5da30a5167e0b52) * Fixed performance regression in fp16 or bf16 matmul with transposed source and weight tensors on Intel Datacenter GPU Max Series (e45e1aa4fe44e0ba0cfb74d58272fea59c47f683) * Improved performance of bf16 matmul with int4 weights on Intel GPUs (7a15c231c569432ca74f7dd1db260f1f8877980c) * Fixed runtime error in fp16 SDPA subgraph with head size 512 on Intel Core Ultra (Series 2) processor integrated GPU (bde698584cbc6ca3f02649c8ff743f9b5d3d527e)

- C++
Published by vpirogov about 1 year ago

onednn - v3.8

Performance Optimizations

Intel Architecture Processors

Improved matmul and inner product primitives performance on processors with Intel AMX instruction set support.
Improved performance of convolution and inner product primitives on processors with Intel AVX2 instruction set support.
Improved performance of int8 convolution support with zero points.
Improved fp32 convolution performance with fp16 and bf16 compressed weights on processors with Intel AVX2 or Intel AVX-512 instruction set support.
Improved fp16/bf16 depthwise convolution performance with fp32 bias or sum post-ops or dilation.
Improved bf16 pooling backpropagation performance.
Improved binary post-ops performance with per_w broadcast.

Intel Graphics Products

Improved performance on Intel Arc graphics for future Intel Core Ultra processors (code name Panther Lake).
Improved convolution performance on:
- Intel Arc Graphics for Intel Core Ultra processor series 2 (formerly Lunar Lake).
- Intel Arc B-series discrete graphics (formerly Battlemage).
Improved int8 matmul performance with zero-points support for source and weight tensors.
Improved f4_e2m1 and f4_e3m0 matmul and reorder performance.
Improved performance of the following subgraphs with Graph API:
- Scaled Dot Product Attention (SDPA) with int4 and int8 compressed key and value.
- fp16/bf16 SDPA with fp32 intermediate data types. Using fp32 intermediate data types is recommended.
- SDPA with head size 512 and 576.
- Grouped Query Attention (GQA) with 5D input tensors.

AArch64-based Processors

Improved fp16 reorder performance.
Improved int8 matmul performance.
Improved bf16 inner product forward propagation performance with Arm Compute Library (ACL).
Improved bf16 eltwise performance.
Improved convolution performance on processors with SVE support with ACL.

Functionality

Common

Extended Graph API Softmax operation to support inf_as_zero mode. This functionality enables SDPA subgraph compliant with Pytorch Safe Softmax semantics.

Intel Architecture Processors

Introduced support for f32 convolution with fp16 compressed weights.
Enabled int8/int4 compressed weights support in matmul primitive.

Intel Graphics Products

Introduced select algorithm support in binary primitive.
Introduced support for f4_e2m1 and f4_e3m0 data types in convolution primitive.
Introduced support for the GenIndex operation in Graph API.

Generic GPU Vendor

Introduced support for:
- Vanilla RNN forward propagation.
- Inner product backpropagation.
- Group normalization.
Improved accuracy of inner product primitive with sum post-ops for large shapes.

NVIDIA GPUs

Introduced Graph API support.

Usability

Added support for group normalization primitive with ONEDNN_ENABLE_PRIMITIVE build option.
Enabled support for ROCm 6 on AMD GPUs.
Improved CMake integration for oneDNN installation with Nvidia backend enabled.
Reduced memory footprint for matmul primitive when using ACL.

Validation

Added benchdnn option --execution-mode to test oneDNN functionality with SYCL Graph record/execute mode.
Extended benchdnn option --cold-cache with support for cold TLB mode.
Added benchdnn option --bia-dt to control bias data type for matmul, inner product, convolution, and deconvolution primitives.
Extended syntax of benchdnn --dt option in Graph API driver to manage data types of individual tensors in a pattern.

Deprecated Functionality

BLAS-like API including dnnl::sgemm, dnnl::gemm_u8s8s32, and dnnl::gemm_s8s8s32 functions is deprecated and will be removed in future releases. If you are using this API consider switching to matmul primitive.

Breaking Changes

Removed the experimental Graph Compiler backend for Graph API.

Thanks to our Contributors

This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, Denis @redradist, Dmitriy Ovchinnikov @inteldimitrius, Eliezer Weissmann @eliezerweissmann, Hubert Maciak @hmaciak, Ilya Lavrenov @ilya-lavrenov, James McGregor @Jmc18134, @jstachowintel, Marek Michalowski @michalowski-arm, Maria Zhukova @mzhukova, Orel Yehuda @yehudaorel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, @Shreyas-fuj, Shu Chen @shu1chen, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, and @zhangfeiv0.

- C++
Published by vpirogov about 1 year ago

onednn - v3.7.3

This is a patch release containing the following changes to v3.7.2: * Fixed correctness issue in matmul with non-trivial strides for the first tensor on processors with Intel AMX instruction set support (e18c622a3cbf766854b649de695271cfd4c19e89) * Removed spurious warning messages for SDPA subgraph on Intel GPUs (05541bb8004587eff64a69840a3d75b8245ed099, 9e9a3a6f0a611f5f5a023f2d10e8dac94a75e195) * Fixed segfault in fp32 matmul with bf16 math mode on processors with Intel AVX2 instruction set support (7d495ae13d6ce8f9a21a7db29d157e3f34bffb81) * Fixed performance regression in bf16 3D convolution backpropagation on processors with Intel AVX-512 and Intel DL Boost instruction set support (c38e02c577e92080da92c29fb2bf6f9a0b00dbb6, 67afc74d6d681ab6556a12f897a8269e7dbd22f6) * Worked around GCC 12.3 bug causing accuracy issues in fp8 functionality on Intel GPUs (69b38d74a8eccbe8a65a4ac6b4c2584c884d4d2e) * Removed -fcf-protection build option for GCC 7 and earlier versions (813725d15ab1f3ed937e55bec404e64c16d53041)

- C++
Published by vpirogov about 1 year ago

onednn - v3.8-rc

Performance Optimizations

Intel Architecture Processors

Improved matmul and inner product primitives performance on processors with Intel AMX instruction set support.
Improved performance of convolution and inner product primitives on processors with Intel AVX2 instruction set support.
Improved performance of int8 convolution support with zero points.
Improved fp32 convolution performance with fp16 and bf16 compressed weights on processors with Intel AVX2 or Intel AVX-512 instruction set support.
Improved fp16/bf16 depthwise convolution performance with fp32 bias or sum post-ops or dilation.
Improved bf16 pooling backpropagation performance.
Improved binary post-ops performance with per_w broadcast.

Intel Graphics Products

Improved performance on Intel GPUs based on Xe3 architecture.
Improved convolution performance on:
- Intel Arc Graphics for Intel Core Ultra (Series 2, formerly Lunar Lake).
- Intel Arc B-series discrete graphics (formerly Battlemage).
Improved int8 matmul performance with zero-points support for source and weight tensors.
Improved f4_e2m1 and f4_e3m0 matmul and reorder performance.
Improved performance of the following subgraphs with Graph API:
- Scaled Dot Product Attention (SDPA) with int4 and int8 compressed key and value.
- fp16/bf16 SDPA with fp32 intermediate data types. Using fp32 intermediate data types is recommended.
- SDPA with head size 512 and 576.
- Grouped Query Attention (GQA) with 5D input tensors.

AArch64-based Processors

Improved fp16 reorder performance.
Improved int8 matmul performance.
Improved bf16 inner product forward propagation performance with Arm Compute Library (ACL).
Improved convolution performance on processors with SVE support with ACL.

Functionality

Common

Extended Graph API Softmax operation to support inf_as_zero mode. This functionality enables SDPA subgraph compliant with Pytorch Safe Softmax semantics.

Intel Architecture Processors

Introduced support for f32 convolution with fp16 compressed weights.
Enabled int8/int4 compressed weights support in matmul primitive.

Intel Graphics Products

Introduced select algorithm support in binary primitive.
Introduced support for f4_e2m1 and f4_e3m0 data types in convolution.
Introduced support for the GenIndex operation in Graph API.

Generic GPU Vendor

Introduced support for:
- Vanilla RNN forward propagation
- Inner product backpropagation
- Group normalization
Improved accuracy of inner product primitive with sum post-ops for large shapes.

NVIDIA GPUs

Introduced Graph API support.

Usability

Added support for Group Normalization primitive with ONEDNN_ENABLE_PRIMITIVE build option.
Enabled support for ROCm 6 on AMD GPUs.
Improved CMake integration for oneDNN installation with Nvidia backend enabled.
Reduced memory footprint for matmul primitive when using ACL.

Validation

Added benchdnn option --execution-mode to test oneDNN functionality with SYCL Graph record/execute mode.
Extended benchdnn option --cold-cache with support for cold TLB mode.
Added benchdnn option --bia-dt to control bias data type for matmul, inner product, convolution, and deconvolution.
Extended syntax of benchdnn --dt option in Graph API driver to manage data types of individual tensors in a pattern.

Breaking Changes

Removed the experimental Graph Compiler backend for Graph API.

Thanks to our Contributors

This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, Denis @redradist, Dmitriy Ovchinnikov @inteldimitrius, Eliezer Weissmann @eliezerweissmann, Hubert Maciak @hmaciak, Ilya Lavrenov @ilya-lavrenov, James McGregor @Jmc18134, @jstachowintel, Marek Michalowski @michalowski-arm, Maria Zhukova @mzhukova, Orel Yehuda @yehudaorel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, @Shreyas-fuj, Shu Chen @shu1chen, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, and @zhangfeiv0.

- C++
Published by vpirogov about 1 year ago

onednn - v3.7.2

This is a patch release containing the following changes to v3.7.1: * Fixed hang in matmul with odd shapes on Intel Arc GPUs (46e7499d8576ac036346f23c9f6d746823dd56b0) * Fixed out-of-registers error in matmul on Intel Arc GPUs (599c8390610bbd9e8c8afdfeba582726ad3af0e1) * Fixed incorrect results in SDPA pattern on Intel GPUs (6343c73143a6185d440bbd61e4089ed07196e9ec) * Fixed integer overflow in convolution with large shapes on x64 CPUs (c541100d5b7678cf042c698aca5cddcbac1426a2) * Fixed access violation issue in experimental Graph Compiler (8b0e6265648a95f3fdb8f3e734b8eb3075538de2) * Fixed access violation in pooling on Intel GPUs (cd2cd5d654078608b9f0de3a1a847708dc3d8e15) * Improved performance of int8 matmul with int4 weights on Intel GPUs (d6c98ec835f449c71fe359f9da68cf48ed68fad1)

- C++
Published by vpirogov about 1 year ago

onednn - v3.7.1

This is a patch release containing the following changes to v3.7: * Fixed correctness issue in int8 matmul primitive with int4 weights on on Intel Arc graphics (b16184d155b578036c94e44fbc960e25b3c522f7) * Fixed matmul performance regression on Intel Arc graphics (41e406bfb448a0600a50ef213c5237d4a3ce3155) * Fixed potential integer overflow in bf16 convolution for processors with Intel AVX-512 instruction set support (f882861fe0d0f2ed61b42377408ad62e5a665bc2) * Fixed functional issue in matmul with dropout attribute on generic GPUs (83033303c072c3b18e070d1abfedfd1f50248eac) * Fixed functional issues in matmul with scales on NVIDIA GPUs (e8d8594956a626be1debf359618024d1cb7c702a) * Fixed integer overflows for large shapes in convolution for x64 processors (fc3f17ad469b8a6da7192ae12d32625faa509f1e, 31b079f482a80836cd4192f85e703d424876febc) * Worked around an MSVC 19.29.30158.0 bug that results in a crash at binary primitive creation on x64 processors (50dd6cc832732ff0ea07e993bb07f61343ca1375)

- C++
Published by vpirogov about 1 year ago

onednn - v3.7

Performance Optimizations

Intel Architecture Processors

Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
Improved performance of int8 RNN primitive on processors with Intel AVX2 and Intel AVX-512 instruction set support.
Improved performance of int8 depthwise separable convolution primitive with per-channel zero points on processors with Intel AVX2 and Intel AVX-512 instruction set support.
Improved fp16 and bf16 softmax performance with relaxed accumulation mode.
Improved performance of int8 matmul primitive with fp16 output data type.
Improved performance of the following subgraphs with Graph API:
- Gated Multi-Layer Perceptron (Gated MLP).

Intel Graphics Products

Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
Improved performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake) and Intel Arc B-series discrete graphics (formerly Battlemage).
Improved performance of convolution with source zero points by pre-packing compenstation.
Improved performance of backward by data convolution with strides for large filter.
Improved performance of the following subgraphs with Graph API:
- Scaled Dot-Product Attention (SDPA) with implicit causal mask.
- SDPA with int8 or int4 compressed key and value.
- Gated MLP.

AArch64-based Processors

Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
Improved bf16 to fp32 reorder performance.
Improved bf16 reorder performance.
Improved bf16 convolution with ACL.

NVIDIA GPUs

Improved matmul performance using cuBLASLt-based implementation.

Functionality

Common

Introduced support for select algorithm in binary primitive. The functionality is optimized for Intel CPUs.
Extended quantization support in matmul and reorder with grouped scales and zero-points for weights. This functionality is optimized for Intel CPUs and GPUs.
Introduced initial support for 4-bit floating-point data types f4_e2m1 and f4_e3m0 in matmul and reorder, as well as e8m0 scales data type in matmul and reorder. This functionality is available on Intel CPUs and GPUs.
Introduced GenIndex, and GreaterEqual operations in Graph API.

Intel Architecture Processors

Introduced support for fp32 matmul with fp16 and bf16 weights.

Intel Graphics Products

Introduced stochastic rounding support for convolution, matmul and reorder based on Philox counter-based random number generator.
Introduced support for strided memory formats in convolution.

Generic GPU vendor

Introduced support for reduction primitive.
Introduced support for inner product primitive forward propagation.

Usability

Common

With the SYCL runtime, memory objects on the CPU engine are now reference-counted and no longer need to be explicitly kept alive for the duration of the primitive execution. This aligns memory object lifetime behavior on CPU and GPU engines.
Added Graph API examples for Gated MLP and int4 Gated MLP patterns.

Intel Architecture Processors

Improved verbose diagnostics to better identify issues during dispatching, primitive and kernel creation for Intel CPU and Intel GPU implementations.
Enabled frame pointers support on Intel64 platforms to improve integration with profilers.

Intel Processor Graphics

Improved verbose diagnostics for Intel GPU driver compatibility issues.
Improved support of large size tensors in convolution, matmul and reduction primitives on Intel GPUs.
Reduced scratchpad usage for NCHW convolution on Intel GPUs.

AArch64-based Processors

Added support for the Arm Compute Library (ACL) thread_local scheduler via ThreadpoolScheduler.
Improved memory efficiency in ACL matmuls by fixing a bug where scratchpad memory was not being used.
Made the ACL matmul primitive thread-safe which allows concurrent execution.

Validation

Extended benchdnn with support and validation for fp8 matmul patterns for tensor tags in RNN primitive validation.
Extended benchdnn with support for rewriting data types in the test JSON files in the graph driver.
Extended benchdnn with support and validation for the number of partitions returned from the test JSON files.

Deprecated Functionality

Experimental Graph Compiler is deprecated and will be removed in future releases.

Breaking Changes

Updated minimal supported CMake version to 3.13 (was 2.8.12).
Updated minimal supported GCC version to 8.0 (was 4.8).
Updated minimal supported Clang version to 11.0 (was 3.0).
Updated minimal supported ACL version to 24.11.1 (was 24.09).
Removed support for SYCL standards preceding SYCL 2020.
Enforced fp32 accumulation mode in fp16 matmul and inner product primitives on Intel Graphics products without Intel XMX cores. Previous behavir can be enabled with relaxed accumulation mode.

Thanks to our Contributors

This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexandra Sidorova @a-sidorova, Atharva Dubey @AD2605, Deb Taylor @deb-intel, Dmitriy Ovchinnikov @inteldimitrius, Fadi Arafeh @fadara01, Hengyu Meng @airMeng, @hmaciak, John Karasev @karasjoh000, John Osorio @kala855, Keola Wierschem @kwiersch, Marek Michalowski @michalowski-arm, Michael Froelich @MichaelFroelich, Michał Górny @mgorny, Nicolò Scipione @s-Nick, Nikhil Sharma @nikhilfujitsu, Permanence AI Coder @Permanence-AI-Coder, @raistefintel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, Romain Biessy @Rbiessy, Ryo Suzuki @Ryo-not-rio, @Shreyas-fuj, Tadej Ciglarič @t4c1, Varad Ahirwadkar @varad-ahirwadkar, Viktoriia Gvozdeva @vgvozdeva, @vishwascm, @yair-obodovsky, Ye Tao @taoye9. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by vpirogov over 1 year ago

onednn - v3.7-rc

Performance Optimizations

Intel Architecture Processors

Improved fp16/bf16 softmax performance with relaxed accumulation mode.
Improved performance for int8 RNN primitive on processors with Intel AVX2 and Intel AVX512 instruction set support.
Improved performance of convolution and matmul primitives on processors with Intel AMX support.
Improved performance of fp8 matmul primitives with bf16 and fp16 bias datatype on processors with Intel AMX instruction set support.
Improved performance of int8 matmul primitive with fp16 output datatype.
Improved performance of int8 depthwise separable convolution primitive with pre-channel zero points on processors with Intel AVX2 and Intel AVX512 instruction set support.

Intel Graphics Products

Introduced initial optimizations for GPUs based on Xe3 architecture.
Improved performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake) and Intel Arc B-series discrete graphics (formerly Battlemage).
Improved performance of the following subgraphs with Graph API
- Scaled dot-product Attention (SDPA) with implicit causal mask
- Scaled dot-product Attention (SDPA) with int8/int4 compressed key and value

Functionality

Introduced support for select algorithm in binary primitive. The functionality is optimized for Intel CPUs.
Enabled support for matmul primitive with grouped quantization on weight along N dimension.
Graph API: new Select, GenIndex and GreaterEqual operations.
Introduced support for fp16/bf16 compressed weights in fp32 matmul on Intel CPUs.
Introduced support for grouped scales and zero points in reorder primitive.
Enabled support for 4d weight scale in matmul primitive.
Graph API: added support for Quantized and non-quantized Gated MLP pattern.
Introduced preliminary support for 4-bit floating-point data types f4_e2m1 and f4_e3m0 in matmul and reorder, as well as e8m0 scales data type in matmul and reorder.

Usability

With SYCL runtime, memory objects on CPU engine are now reference-counted and no more need to be explicitly kept alive by user for the duration of the primitive execution. This align memory object lifetime behavior on CPU and GPU engines.
Improve verbose diagnostic to better identify issues during dispatching, primitive and kernel creation for CPU primitive and GPU (in case of OpenCL implementation) primitive implementations.
Improve verbose diagnostic to simplify debugging of nGEN fallbacks.
Enabled frame pointers support on Intel64 platforms to improve integration with profilers.
Added examples for Gated MLP and int4 Gated MLP.

Validation

Extended benchdnn with support and validation for fp8 matmul patterns for tensor tags in RNN primitive validation.
Extended benchdnn with support for rewriting data types in the test JSON files in graph driver.
Extended benchdnn with support and validation for the number of partition returned from the test JSON files.

Breaking Changes

Updated minimal supported CMake version to 3.13 (was 2.8.12).
Updated minimal supported GCC version to 8.0 (was 4.8).
Updated minimal supported Clang version to 11.0 (was 3.0).
Removed support for SYCL older than 2020.

Thanks to these Contributors

This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexandra Sidorova @a-sidorova, Atharva Dubey @AD2605, Deb Taylor @deb-intel, Dmitriy Ovchinnikov @inteldimitrius, Fadi Arafeh @fadara01, Hengyu Meng @airMeng, @hmaciak, John Osorio @kala855, Marek Michalowski @michalowski-arm, Michael Froelich @MichaelFroelich, Michał Górny @mgorny, Nikhil Sharma @nikhilfujitsu, Permanence AI Coder @Permanence-AI-Coder, @raistefintel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, Romain Biessy @Rbiessy, Ryo Suzuki @Ryo-not-rio, @Shreyas-fuj, Varad Ahirwadkar @varad-ahirwadkar, @vishwascm, and Ye Tao @taoye9. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by vgvozdeva over 1 year ago

onednn - v3.6.2

This is a patch release containing the following changes to v3.6.1: * Fixed segmentation fault issue in convolution primitive on processors with Intel AVX2 instruction set support (2eb3dd1082db767fab171e934c551c609008289a) * Added a workaround for build issue with GCC 8.2 and GNU binutils 2.27 (19ef223ec2095e3293fc672ae598588b7a85304b, 262fb02aff10ce0220dc1d116c5d0ef5e027d573, e3782e8c1355c176efae877af0f05831158f28f8) * Fixed a thread safety issue in matmul primitive for builds relying on Arm Compute Library (ACL) and bumped minimal supported ACL version to 24.11.1 (4d962e7442e25792920e37a0b1b31d6618581c2a) * Suppressed spurious warnings for GCC (7d3164d14ce57a1e2e3e8e9ef021fe3965c09f32, c805a5033c49382151a50d87d78e38248eafe363, e526172f57725230d41d76e83ec9a622a58829bc, dc780cbb0f6e124b78ff57a939757876cd2d6b60) * Fixed segfaults in BRGEMM-based matmul, convolution, and deconvolution implementations on AArch64-based processors (a873a1c0e29e3e9e2dc482b3ce344b5c1dee2d42, 9a1dc92e8085f630b3d6f1764b5b0dcdeebd68ae) * Fixed performance regression in bf16 convolution with ACL on AArch64-based processors (4793296821c61c75b4af2f24fb542921ee931aaf) * Fixed an issue with convolution primitive creation with PREFER_YMM CPU ISA hint on AArch64-based processors (e34d9921c3e8420f924cd6c9837e63aa9945aff3) * Improved bf16 matmul performance with fp32 destination with ACL on AArch64-based processors (548d5d67cec2589bf04aed804ab8c7c37ac8d13c) * Improved bf16 to fp32 reorder performance on AArch64-based Processors (917dd13f1da78c643604bec142e274be5631b421) * Fixed issue in matmul primitive with 4D tensors on AArch64-based processors (d13c966b7bca8d8bad5efe95d369cc3f448a3c59) * Suppressed spurious GCC warnings in deconvolution primitive on AArch64-based processors (f90f60ea066d40d48b57f77d2421bf92c14e8c88) * Fixed warnings in BRGEMM implementation on AArch64-based processors (866b196ab86558743c4ac09ecfcd0bd26f02af98) * Fixed correctness issue in reorder primitive with zero points for 4D shapes on AArch64-based Processors (836ea10e3afc642a6de0018186c023efc09559cc) * Improved bf16 reorder performance on AArch64-based Processors (12bafbe1346f2c8f82c9c14a9b4e12d259f76135) * Fixed performance regression for backward convolution primitive descriptor creation time on Intel processors (2b3389fe52e0557be0c16df6dfa433f59d485118) * Improved performance of fp16 matmul with int4 weights on Intel GPUs based on Xe2 architecture (4c8fb2c2e0d4e54f799fc76329642bc97e60635f, 3dd4f43c27fc5e1552b064ab6a7cc3dca003c51c, 280bd28fd8ba33aa99df1abb1de5ec782dab2159) * Fixed performance regression for int8 convolution with large spatial sizes on processors with Intel AMX support (05d68df233bb67046697758962cd32bd6d23a956) * Restricted check for microkernel fusion support to cases when fusion functionality is actually used on Intel GPUs (48f6bd93fba1bf01a678c29d9be28b111b595a57)

- C++
Published by tprimak over 1 year ago

onednn - v3.6.1

This is a patch release containing the following changes to v3.6: * Fixed convolution correctness issue in some scenarios involving persistent cache on Intel GPUs (e595e595a7aeecc74f1b34e194f787d7639519c8) * Fixed potential page faults in reduction primitive implementation for Intel GPUs (7740c75ad347bfc4491d7dfc2ffb269a24c56490, a4fcef9ed1ebd5190cb7c5d5f998ce33c8d120e3, 32d86600146a2970724f4d6a4dbaa8c937afef6e) * Implemented a workaround for GCC 13 bug that resulted in matmul hangs on some Intel Arc graphics SKUs (a30d5267c2baaa916a92761fb3e187eb2bfd1ecd) * Updated execution units (EU) number detection logic for Intel GPUs based on Xe2 architecture to accommodate for behavior changes in Linux driver (04e7eaccf5039db5369ccf87957873989f59f01f, 97b04bdd8536ee3f13ff5038691d4ef5f6f00d6e) * Fixed build issue for static library with ONEDNNVERBOSE=OFF (7f476cbbdbb171e4a394f66fd83eb7a86755ab04) * Fixed correctness issue in SYCL deconvolution implementation with post-ops (8f600a3314374306e3370ed37b8f8aef539cf79a) * Fixed memory formats checks in SYCL softmax implementation (6ae73e4f1039d2eb1530bf863cefdf7145b856aa) * Fixed correctness issue in SYCL resampling implementation with post-ops (984505720edddde0ae7f9e17ed60853721e17091) * Aligned accessor types in SYCL kernels with SYCL specification (0d9b3bd68405b9fb3606c498c46df74c4170cead) * Improved scales argument checks in generic SYCL kernels (9f73bf19ca594dbe52086294fad5492dbd0fbdd7, 7d85c7546b98589435677d624e52387f51ef9421) * Fixed correctness issue in int8 convolution with sum post-op on NVIDIA GPUs (7486ed83f72fa60ad84f5cbf549c4128ea8de8be) * Relaxed accuracy test threshold for bf16 softmax on NVIDIA GPUs (e9d0fdbfa757c9e3738a5721a737505e75857598) * Added support for bf16 and fp16 bias for fp8 matmul on Intel CPUs (188ae7f3e3410a76a81d10b2b2bed3be7afdc307) * Fixed a bug that prevented dispatching Intel AVX-512 with Intel DL Boost implementation in int8 RNN primitive (bf58e72e5776a831bc65882e97e38f64626d5dec) * Fixed a runtime fail with `CLOUTOFRESOURCES` error in fp16 convolution on Intel Arc graphics (39a5f6753ae376cebb2ed4e16379ca1d78d1459b, 7e1663fea5f9e0db643cce3d993d7f34145dcda6)

- C++
Published by vpirogov over 1 year ago

onednn - v3.6

Performance Optimizations

Intel Architecture Processors

Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
Improved performance for Intel Xeon 6 processors (formerly Granite Rapids).
Improved performance of group normalization primitive.
Improved bf16 matmul performance with int4 compressed weights on processors with Intel AMX instruction set support.
Improved performance of fp8 matmul, pooling, and eltwise primitives on processors with Intel AMX instruction set support.
Improved fp32 RNN primitive performance on processors with Intel AVX2 instruction set support.
Improved performance of the following subgraphs with Graph API:
- convolution and binary operation fusions with better layout selection in Graph API.
- fp8 convolution and unary or binary on processors with Intel AMX instruction set support.
- Scaled Dot Product Attention (SDPA) without scale, Multi-Query Attention (MQA), and Grouped Query Attention (GQA) patterns.
- LayerNorm, GroupNorm, and SoftMax with int8 quantized output and zero-points.

Intel Graphics Products

Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
Introduced broad production quality optimizations for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
Introduced broad production quality optimizations for future discrete GPU based on Xe2 architecture (code name Battlemage).
Introduced support for Intel Arc Graphics for future Intel Core Ultra processor (code name Arrow Lake-H).
Improved performance of fp8_e5m2 primitives on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
Improved matmul and inner product primitives performance for shapes relevant to large language models (LLMs) on GPUs with Intel XMX support.
Improved int8 convolution performance with weight zero-points.
Reduced primitive creation time for softmax, layer normalization, and concat primitives via kernel reuse.
Improved performance of the following subgraphs with Graph API:
- SDPA without scale, MQA, and GQA patterns. f16 variants of these patterns significantly benefit from Intel(R) Xe Matrix Extensions (Intel(R) XMX) support.
- fp8, convolution, and unary or binary on the Intel Data Center GPU Max Series.
- LayerNorm, GroupNorm, and SoftMax with int8 quantized output and zero-points.

AArch64-based Processors

Improved fp32 convolution backpropagation performance on processors with SVE support.
Improved reorder performance for blocked format on processors with SVE support.
Improved bf16 softmax performance on processors with SVE support.
Improved batch normalization performance on processors with SVE support.
Improved matmul performance on processors with SVE support.
Improved fp16 convolution with Arm Compute Library (ACL).
Improved matmul performance with ACL.
Switched matmul and convolution implementation with ACL to stateless API significantly improving primitive creation time and increasing caching efficiency and performance for these operators.

Functionality

Introduced generic GPU support. This implementation relies on portable SYCL kernels and can be used as a starting point to enable new devices in oneDNN.
Extended functionality supported on NVIDIA GPUs and AMD GPUs with SYCL-based implementations.
Enabled support for int8 activations with grouped scales and int8 or int4 compressed weights in matmul primitive. This functionality is implemented on Intel GPUs.
Introduces support for stochastic rounding for fp8 data type functionality.
[experimental] Extended microkernel API:
- Introduced int8 quantization support.
- Extended transform microkernel with transposition support and support for arbitrary strides.
- Introduced verbose diagnostics support.
[experimental] Extended sparse API:
- Introduced support for sparse memory with coordinate (COO) storage format.
- Extended matmul primitive to work with sparse memory in COO format. This functionality is implemented on CPUs and Intel GPUs.
Introduced int8 support in eltwise primitive with 'clip' algorithm. This functionality is implemented on CPUs.
Graph API:
- Introduced GroupNorm operation and fusions in Graph API.
- Introduced support for standalone StaticReshape and StaticTranspose operations.

Usability

Added examples for SDPA, MQA, and GQA patterns implementation with Graph API.
Added an example for deconvolution primitive.
Added examples for Vanilla RNN and LBR GRU RNN cells.
Introduced support for Intel oneAPI DPC++/C++ Compiler 2025.0.
Introduced interoperability with SYCL Graph record/replay mode.
Removed dependency on OpenCL runtime for NVIDIA and AMD GPUs.
[experimental] Introduced logging mechanism based on spdlog library.
Introduced support for ONEDNN_ENABLE_WORKLOAD build knob for Graph API.
Improved performance of get_partitions() function in Graph API.

Validation

Introduced protection from out-of-memory scenarios in benchdnn Graph API driver.

Deprecated Functionality

Experimental Graph Compiler is deprecated and will be removed in future releases.

Breaking Changes

Experimental microkernel API in this release is not compatible with the version available in oneDNN v3.5.
Updated minimal supported ACL version to 24.08.1 (was 24.04).

Thanks to these Contributors

This release contains contributions from the project core team as well as Abdel @quickwritereader, Adam Jackson @nwnk, Aleksandr Voron @alvoron, Alexey Makarevich @amakarev, Annop Wongwathanarat @annop-w, Daniel Kuts @apach301, @deepeshfujitsu, Fadi Arafeh @fadara01, Fritz Heckel @fwph, Gorokhov Dmitriy @dmitry-gorokhov, Deeksha Kasture @kasturedeeksha, Kentaro Kawakami @kawakami-k, Marek Michalowski @michalowski-arm, @matthias-bonne, @Menooker, Michael Froelich @MichaelFroelich, Nicolas Miller @npmiller, Nikhil Sharma @nikhilfujitsu, @nishith-fujitsu, Permanence AI Coder @Permanence-AI-Coder, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Robert Cohn @rscohn2, Robert Hardwick @robert-hardwick, Ryo Suzuki @Ryo-not-rio, Shreyas-fuj @Shreyas-fuj, Shu Chen @shu1chen, Siddhartha Menon @Sqvid, Song Jiaming @Litchilitchy, Vladimir Paramuzov @vladimir-paramuzov, Yifei Zhang @yifeizh2. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by tprimak over 1 year ago

onednn - v3.6-rc

Performance Optimizations

Intel Architecture Processors

Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
Improved performance for Intel Xeon 6 processors (formerly Granite Rapids).
Improved performance of group normalization primitive.
Improved bf16 matmul performance with int4 compressed weights on processors with Intel AMX instruction set support.
Improved performance of fp8 matmul, pooling, and eltwise primitives on processors with Intel AMX instruction set support.
Improved fp32 RNN primitive performance on processors with Intel AVX2 instruction set support.
Improved performance of the following subgraphs with Graph API:
- convolution and binary operation fusions with better layout selection in Graph API.
- fp8 convolution and unary or binary on processors with Intel AMX instruction set.
- Scaled Dot Product Attention (SDPA) without scale, Multi-Query Attention (MQA), and Grouped Query Attention (GQA) patterns.
- LayerNorm, GroupNorm, and SoftMax with int8 quantized output and zero-points.

Intel Graphics Products

Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
Introduced broad production quality optimizations for Intel Arc Graphics for Intel Core Ultra Processors (Series 2) (formerly Lunar Lake).
Introduced broad production quality optimizations for future discrete GPU based on Xe2 architecture (code name Battlemage).
Introduced support for Intel Arc Graphics for future Intel Core Ultra Processor (code name Arrow Lake-H).
Improved performance of fp8_e5m2 primitives on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
Improved matmul and inner product primitives performance for shapes relevant to large language models (LLMs) on GPUs with Intel XMX support.
Improved int8 convolution performance with weight zero points.
Reduced primitive creation time for softmax, layer normalization, and concat primitives via kernel reuse.
Improved performance of the following subgraphs with Graph API:
- SDPA without scale, MQA, and GQA patterns. f16 variants of these patterns significantly benefit from Intel(R) Xe Matrix Extensions (Intel(R) XMX) support.
- fp8 convolution and unary or binary on Intel Data Center GPU Max Series.
- LayerNorm, GroupNorm, and SoftMax with int8 quantized output and zero-points.

AArch64-based Processors

Improved fp32 convolution backpropagation performance on processors with SVE support.
Improved reorder performance for blocked format on processors with SVE support.
Improved bf16 softmax performance on processors with SVE support.
Improved batch normalization performance on processors with SVE support.
Improved matmul performance on processors with SVE support.
Improved fp16 convolution with Arm Compute Library (ACL).
Improved matmul performance with ACL.
Switched matmul and convolution implementation with ACL to stateless API significantly improving primitive creation time and increasing caching efficiency and performance for these operators.

Functionality

Introduced generic GPU support. This implementation relies on portable SYCL kernels and can be used as a starting point to enable new devices in oneDNN.
Extended functionality supported on NVIDIA GPUs and AMD GPUs with SYCL based implementations.
Enabled support for int8 activations with grouped scales and int8 or int4 compressed weights in matmul primitive. This functionality is implemented on Intel GPUs.
Introduces support for stochastic rounding for fp8 data type functionality.
[experimental] Extended microkernel API:
- Introduced int8 quantization support.
- Extended transform microkernel with transposition support and support for arbitrary strides.
- Introduced verbose diagnostics support.
[experimental] Extended sparse API:
- Introduced support for sparse memory with coordinate (COO) storage format.
- Extended matmul primitive to work with sparse memory in COO format. This functionality is implemented on CPUs and Intel GPUs.
Introduced int8 support in eltwise primitive with 'clip' algorithm. This functionality is implemented on CPUs.
Graph API:
- Introduced GroupNorm operation and fusions in Graph API.
- Introduced support for standalone StaticReshape and StaticTranspose operations.

Usability

Added examples for SDPA, MQA, and GQA patterns implementation with Graph API.
Added an example for deconvolution primitive.
Added examples for Vanilla RNN and LBR GRU RNN cells.
Introduced support for Intel DPC++/C++ Compiler 2025.0.
Introduced interoperability with SYCL Graph record/replay mode.
Removed dependency on OpenCL runtime for NVIDIA and AMD GPUs.
[experimental] Introduced logging mechanism based on spdlog library.
Introduced support for ONEDNN_ENABLE_WORKLOAD build knob for Graph API.
Improved performance of get_partitions() function in Graph API.

Validation

Introduced protection from out of memory scenarios in benchdnn Graph API driver.

Breaking Changes

Experimental microkernel API in this release is not compatible with the version available in oneDNN v3.5.
Updated minimal supported ACL version to 24.08.1 (was 24.04).

Thanks to these Contributors

This release contains contributions from the project core team as well as Abdel @quickwritereader, Adam Jackson @nwnk, Aleksandr Voron @alvoron, Alexey Makarevich @amakarev, Annop Wongwathanarat @annop-w, Daniel Kuts @apach301, @deepeshfujitsu, Fadi Arafeh @fadara01, Fritz Heckel @fwph, Gorokhov Dmitriy @dmitry-gorokhov, Deeksha Kasture @kasturedeeksha, Kentaro Kawakami @kawakami-k, Marek Michalowski @michalowski-arm, @matthias-bonne, @Menooker, Michael Froelich @MichaelFroelich, Nicolas Miller @npmiller, Nikhil Sharma @nikhilfujitsu, @nishith-fujitsu, Permanence AI Coder @Permanence-AI-Coder, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Robert Cohn @rscohn2, Robert Hardwick @robert-hardwick, Ryo Suzuki @Ryo-not-rio, Shreyas-fuj @Shreyas-fuj, Shu Chen @shu1chen, Siddhartha Menon @Sqvid, Song Jiaming @Litchilitchy, Vladimir Paramuzov @vladimir-paramuzov, Yifei Zhang @yifeizh2. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by vpirogov over 1 year ago

onednn - v3.5.3

This is a patch release containing the following changes to v3.5.2: * Fixed correctness issue in convolution weight gradient for small shapes on Intel GPUs (49eee6a145467d133af80bb3429a1153fcae2545, 281dd3bd38049f70da38d8b9d485c39ae80be78a) * Extended MLP patterns supported by experimental Graph Compiler to cover cases relevant to ChatGLM model (ff680fc68bb633290531ec2f6c13abd39c072d50) * Fixed performance regression in bf16 depthwise convolution on Intel CPUs (d6c216a7b59359790b9e572b46ec992adb873f95)

- C++
Published by vpirogov almost 2 years ago

onednn - v3.5.2

This is a patch release containing the following changes to v3.5.1: * Fixed performance regression for some Graph API subgraphs with LayerNorm operation (82f629c1afa4ae2d50396c4e0e25cd26631daf2a) * Fixed runtime error for Graph API subgraphs including 6D LayerNorm operation (f704f0910fcbf618a7c2ca41f8239c1c02057ec7) * Fixed an issue with host compiler version detection in SYCL configurations (730b9766cf9a304dddf40a84575f2d93fdec76be) * Fixed an issue with missing DNNL_TARGET_ARCH define for builds not relying on CMake (87848b9c953c9c57b5fd9bb78b505ab486e684b1) * Fixed a test issue for matmul with low-precision scales and/or zero-points (91c35d8f5bdd7b58a8f30f1f11cb91dcb78a1dd9) * Fixed segfault issue in bfloat16 shuffle on AArch64 processors (91166816ce10dd241cacffccc971e6e6f3b546f6) * Fixed runtime issue in quantized layer normalization pattern with Graph API (0013e8ce633a8cac5edd01034d4d24c12dcb2ff8)

- C++
Published by vpirogov almost 2 years ago

onednn - v3.4.4

This is a patch release containing the following changes to v3.4.3: * Fixed an issue with host compiler version detection in SYCL configurations (fcaa1b44110280919674c801f4c063f5651b5760)

- C++
Published by vpirogov almost 2 years ago

onednn - v3.5.1

This is a patch release containing the following changes to v3.5: * Fixed potential page fault in matmul on Intel Datacenter Max Series GPUs (a9c525d5af0919f26f62eeba8973ab5bc3468e21) * Fixed potential stack overflow issue in convolution implementation for Intel GPUs (0fb7e6ed4f32e5d89832b2bd742bbf834cd296ed) * Added test cases for matmul with compressed weights (015ccb1067eb1fd470025c08517a23a6971db9b9) * Extended Graph API LayerNorm operation with zero points support (dc2701ae41345e4939eb328f1c1182d40eafd035) * Fixed primitive creation error for depthwise convolution backpropagation on Intel GPUs (4a045e43509987517bfdf1e9e778f9b429510858, b529d2241001ba77e8a2eff78cba71121da09627)

- C++
Published by vpirogov almost 2 years ago

onednn - v3.5

Performance Optimizations

Intel Architecture Processors

Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids).
Improved performance of group normalization primitive.
Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
Improved performance of the following subgraphs with Graph API:
- Multi-Query Attention (MQA).
- Scaled Dot Product Attention (SDPA), including the variant with select operation.
- LayerNorm + Multiply + Quantize produced by SmoothQuant algorithm.
- Convolution + Sigmoid + Multiply with mixed precisions.

Intel Graphics Products

Improved performance for Processor Graphics based on Xe2 architecture.
Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
Improved RNN primitive performance for LSTM cell case.
Improved performance of f8_e4m3 data type emulation on Intel Data Center GPU Max Series (formerly Ponte Vecchio).

AArch64-based Processors

Improved convolution forward propagation, matmul, and softmax performance for processors with SVE support.
Improved bf16 matmul, convolution, and reorder primitives performance with Arm Compute Library (ACL).
Improved eltwise primitive performance with gelu_erf algorithm with ACL.

Functionality

Introduced sum and binary post-ops support for layer normalization primitive. This functionality is currently implemented on CPUs only.
Introduced support for int4 data type and extended quantization model with support for grouped scales and zero points.
Introduced fp64 matmul support. This functionality is currently implemented on Intel GPUs with hardware acceleration for fp64 math only.
Extended floating point math mode API to support weight decompression scenarios. See matmul weights decompression example to get started. New floating mode is supported in the following configurations:
- bfloat16 matmul with int8 weights on Intel CPUs.
- float16 and bfloat16 matmul with int8 or int4 weights on Intel GPUs.
[experimental] Introduced microkernel API for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementation to expert users.

Usability

Extended error messages for engine and memory objects creation errors.
Extended verbose mode diagnostics with information on dispatching decisions for all primitives.
Introduced support for clang++ host compiler in SYCL builds.
Introduced API for tensor serialization and deserialization.
Extended verbose mode diagnostics for Graph API with information on pattern matcher decisions.
Introduced OpenCL runtime support for Graph API.
Added support for building oneDNN with installed Arm Compute Library (ACL).

Validation

Extended benchdnn with support for tensor tags in RNN primitive validation.

Breaking Changes

Updated minimal supported ACL version to 24.04 (was 23.11).

Thanks to these Contributors

This release contains contributions from the project core team as well as Abdel @quickwritereader, @AngryLoki, Crefeda Rodrigues @cfRod, Daniel Richard G. @iskunk, David Svantesson @davsva01, @deepeshfujitsu, Dylan Angus @dylan-angus-codeplay, Emanuele Rocca @ema, Fadi Arafeh @fadara01, Hernan Martinez @hmartinez82, John Osorio @kala855, Jonathan Deakin @jondea, @kasturedeeksha, Kentaro Kawakami @kawakami-k, Nikita Shulga @malfet, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Roman Zhukov @rozhukov, Ryo Suzuki @Ryo-not-rio, @Shreyas-fuj, Sunita Nadampalli @snadampal, Tadej Ciglarič @t4c1, Vineel Abhinav @vineelabhinav, @vishwascm. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by vpirogov almost 2 years ago

onednn - v3.4.3

This is a patch release containing the following changes to v3.4.2: * Fixed GPU detection issues on systems with several different Intel GPUs (0fb7e6ed4f32e5d89832b2bd742bbf834cd296ed)

- C++
Published by vpirogov about 2 years ago

onednn -

This is a release candidate for oneDNN v3.5. Please provide feedback and submit defect reports via Github issues.

Performance Optimizations

Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids).
- Improved performance of group normalization primitive.
- Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
- Improved performance of the following subgraphs with Graph API:
- Multi-Query Attention (MQA).
- Scaled Dot Product Attention (SDPA), including the variant with select operation.
- LayerNorm + Multiply + Quantize produced by SmoothQuant algorithm.
- Convolution + Sigmoid + Multiply with mixed precisions.
Intel Graphics Products:
- Improved performance for Processor Graphics based on Xe2 architecture.
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
- Improved RNN primitive performance for LSTM cell case.
- Improved performance of f8_e4m3 data type emulation on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
AArch64-based Processors:
- Improved convolution forward propagation, matmul, and softmax performance for processors with SVE support.
- Improved bf16 matmul performance with Arm Compute Library (ACL).
- Improved eltwise primitive performance with gelu_erf algorithm with ACL.

Functionality

Introduced sum and binary post-ops support for layer normalization primitive. This functionality is currently implemented on CPUs only.
Introduced support for int4 data type and extended quantization model with support for grouped scales and zero points.
Introduced fp64 matmul support. This functionality is currently implemented on Intel GPUs only.
Extended floating point math mode API to support weight decompression scenarios. See matmul weights decompression example to get started. New floating mode is supported in the following configurations:
- bfloat16 matmul with int8 weights on Intel CPUs.
- float16 and bfloat16 matmul with int8 or int4 weights on Intel GPUs.
[experimental] Introduced microkernel API for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementation to expert users.

Usability

Extended error messages for engine and memory objects creation errors.
Extended verbose mode diagnostics with information on dispatching decisions for all primitives.
Introduced support for clang++ host compiler in SYCL builds.
Introduced API for tensor serialization and deserialization.
Extended verbose mode diagnostics for Graph API with information on pattern matcher decisions.
Introduced OpenCL runtime support for Graph API.
Added support for building oneDNN with installed Arm Compute Library (ACL).

Validation

Extended benchdnn with support for tensor tags in RNN primitive validation.

Thanks to these Contributors

This release contains contributions from the project core team as well as @AngryLoki, Crefeda Rodrigues @cfRod, Daniel Richard G. @iskunk, @deepeshfujitsu, Dylan Angus @dylan-angus-codeplay, Emanuele Rocca @ema, Hernan Martinez @hmartinez82, John Osorio @kala855, Jonathan Deakin @jondea, @kasturedeeksha, Kentaro Kawakami @kawakami-k, Nikita Shulga @malfet, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Roman Zhukov @rozhukov, Shreyas-fuj @Shreyas-fuj, Sunita Nadampalli @snadampal, Tadej Ciglarič @t4c1, Vineel Abhinav @vineelabhinav, @vishwascm. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by vpirogov about 2 years ago

onednn - v3.4.2

This is a patch release containing the following changes to v3.4.1: * Fixed performance regression in deconvolution on processors with Intel AVX-512 instruction set (307b35bf7fa03ef7f030481329e23dcc287bbd7f, f46fffbb35f1213e6e98c1bc1e10353232ac08ee) * Improved performance of batched matmul with binary post-op on processors with Intel AVX-512 instruction set (d39e1b7447979340ae7a882fff376fa14c12ddaa) * Fixed performance regression in softmax with destination memory format set to any on processors with Intel AVX-512 instruction set (756d3cf5a2e28c0dc052a18caf8358f3c9dc22e0) * Fixed incorrect results in int8 deconvolution with source zero points on processors with Intel AMX instruction set (d5ddbc851aa75e36ed9f651e01185443bfa903ff) * Fixed performance regression in convolution on processors with Intel AVX2 instruction set (2968c8948225d18c9df19c94534ff7dc4343700c) * Improved f8_e4m3 matmul performance on Intel Data Center GPU Max Series (068f8504de78b93fea0ed71fe87bf5bc86c79724, 668abae7f109709ed7b0ac107cb46e8983a625a6, c3972ef0c49db9724d143bd486c6c7fbdc52f8a3, ad943825dcd5b50e52d37543015814421444600a) * Fixed sporadic accuracy issues in bf16 depthwise convolution backpropagation on processors with Intel AVX-512 instruction set (01840442c5e045c7311a4144625bf01322f9e942) * Fixed primitive creation issue for fp16 pooling backpropagation on Intel GPUs (e4737d908b56e2ca5a56d580266e7f65505c4b0d) * Fixed failure for subgraphs with int8 matmul operation with experimental Graph Compiler on processors with Intel AMX instruction set (5ebde2e2ad2788fd373da15a2a9073526849ed0d) * Fixed assert in experimental Graph Compiler on Windows (f53fbd164dab47c73e4c56c97f4bdd0546e47ed3, fd903aebe6917535b49c8e16598d1912ad42b09b) * Fixed incorrect results for subgraphs with shuffle operation with experimental Graph Compiler (aef502394d43f7aae388487eee05549d28470ae4) * Improved performance of subgraphs involving int8 matmul with experimental Graph Compiler on processors with Intel AMX support (0ca5bc557e4d1e090aeca659cfbe68a8c57ef168) * Fixed page fault in fp16 matmul primitive on Intel Data Center GPU Max Series (5587f0820c2cc5b1eca159a7b78e8ae38ce7d7d6) * Fixed incorrect results in dp32 deconvolution with Arm Compute Library on AArch64 processors (b7694a00a26cfe3f0d9d9b36d16edac91bfdd65b) * Fixed performance regression in deconvolution on processors with Intel AVX2 instruction set (6f452e2ff782255ae57f91ddfaa142752de21a42)

- C++
Published by vpirogov about 2 years ago

onednn - v3.4.1

This is a patch release containing the following changes to v3.4: * Fixed an issue with caching and serialization of primitives in deterministic mode (7ed604a1e5688022a59444059e53a6a7967f679a) * Introduced memory descriptor serialization API (4cad420e673f4cd49568ea7c4dd6a55e6f55794e, 929a27ae0412a0851629da70916eee360a39baac, 9b848c859a6b1d046dd63cf20f817aa9428fb483) * Fixed incorrect results in fp64 convolution and deconvolution on Intel GPUs based on Xe-LPG architecture (ebe77b566bb1cd273e9bda99cc62063b7c2a7e45, 0b399ac42740a9c6ed458aacafdb31ce16205cbd, d748d642d7871608e09f5cee5d964ddcfc8a42ef, 9f4f3d510ddc9d639db052302be579621d46bb1f, 21a8caebb34a85074f3f8a5cef35ed85532a5bbe) * Fixed incorrect results in reorder with large sizes on Intel CPUs and GPUs (69a111e6d835f8632ea571f3ea0e273b22488d37, 4b7236134bde1c1a71859a844eae860a71670b97, 74a343bf66a1c8f113fa8e025391aba5015c6e48) * Reduced creation time for deconvolution primitive on Intel CPUs (bec487e4ae16b3e88382adf9574e9c62cc76d1bd, 1eab00586881f4fb6966a16f71216528ec549c11) * Fixed performance regression in deconvolution on Intel CPUs (fbe5b97c966696a3f5be2240c0eb4592ed548036, 1dd3c6af03addefcf92ac45eddeb8becf63d6a6e) * Removed dangling symblols from static builds (e92c4041b12e55837452327c3ebd9411dbc2e861, 6f5621aed75226b93f07879fafa6fb799a36f042) * Fixed crash during platform detection on some AArch64-based systems (406a0798c1c5b939726a892ad5a96e20298396ca) * Fixed performance regression in int8 deconvolution on Intel CPUs (7e50e152f21a79978b8910260e042b43941b601c) * Fixed handling of zero points for matmul in verbose logs converter (15c791686f94291eddda7a2e24835ba1113c530a)

- C++
Published by vpirogov about 2 years ago

onednn - v3.3.6

This is a patch release containing the following changes to v3.3.5: * Fixed crash during platform detection on some AArch64-based systems (3e0e69b21ba0694db95bd2af0877f936dcc86dd2) * Improved inner product performance with Arm Compute Library (ACL) (e7abee2d883d41613cf243c135037fc68d2dacd0, 214fb9e14227880097729ffffac3b666a0febcd7, 8aacc8ff0dfefddfae30681d056757dba1fb0815) * Fixed incorrect results in int8 depthwise convolution with post-ops on processors with Intel AVX2 instruction set support (0c922e04df62cf3042ebdc578a72883bde35079a) * Fixed performance regression in fp32 convolution on processors with Intel AVX2 instruction set support (4efc0ad7234741459bab6afc21f571ddb645bcae)

- C++
Published by vpirogov about 2 years ago

onednn - v3.4

Performance Optimizations

Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). These optimizations are now included by default on compatible processors.
- Improved RNN primitive performance with LBR_GRU cell.
- Improved softmax performance on processors with Intel AVX2 or Intel AVX-512 instruction set support.
- Improved fp32 inner product performance on processors with Intel AVX2 instruction set support.
- Improved fp32, fp16, bf16 matmul primitive performance on processors with Intel AVX-512 and Intel AMX instruction set support.
- Improved int8 matmul performance with transposed A tensor.
- Improved performance of resampling primitive on processors with Intel AVX2 instruction set support.
- Improved performance of int8 convolution with post-ops.
- Optimized batch matmul with binary post-op and broadcast mask 1 and 14.
- Improved the Scaled Dot Product Attention (SDPA) subgraph performance with Graph API.
- Improved performance of subgraphs including matmul and add operations and mixed int8 and bfloat16 data types with Graph API.
- [experimental] Improved performance of reduction, softmax and layernorm operations with experimental Graph Compiler backend.
- [experimental] Improved performance for llama2 MLP subgraph with experimental Graph Compiler backend.
Intel Graphics Products:
- Introduced initial optimizations for Processor Graphics based on Xe2 architecture.
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
- Improved matmul performance for cases relevant to Large Language Models (LLMs) and Transformer-like models.
- Improved convolution performance for cases relevant to the Stable Diffusion model.
- Improved RNN primitive performance.
- Improved pooling forward propagation performance.
- Improved batched matmul performance for cases with 5 dimensions or more.
AArch64-based Processors:
- Added an option to build oneDNN with macOS Accelerate library to improve performance on Apple silicon.
- Improved reorder primitive performance with Compute Library for the Arm architecture (ACL).
- Improved bf16 inner product product primitive performance with ACL.

Functionality

Introduced GPT-Q support to improve Large Language Models (LLMs) performance with compressed weights. Optimized implementation is available for Intel Graphics Products and support matmul with int8 wight compression.
Introduced fp8 data type support in primitives and Graph API. Optimized implementation is available for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
Introduced support for fp16 and bf16 scale and shift arguments for layer normalization. Optimized implementation is available for Intel Graphics Products.
[experimental] Introduced unstructured sparsity support for processors with Intel AMX support relying on VCOMPRESS/VPEXPAND instructions.
Intel Graphics Products
- Introduced support for Intel Data Center GPU Max 1550VG
- Introduced PReLU post-op support for inner product and matmul primitives.

Usability

Added opt-in deterministic mode support. Deterministic mode guarantees that results are bitwise identical between runs in a fixed environment.
Introduced accumulation mode control.
Extended oneDNN verbose diagnostics with information on dispatching decisions in convolution and matmul implementations.
Extended verbose diagnostics for Graph API with information for operation schema check results and pattern matching results.
Reduced RNN primitive memory consumption on GPUs.
Added examples demonstrating use of oneDNN Graph API in eager mode use cases.
Extended tensor constructor in Graph API to support memory allocation and management by the library.
Introduced new API and environment variable to manage Graph API constant tensor cache capacity.
Improved the efficiency of pattern matching in Graph API by optimizing pattern registration, reducing pattern numbers, and skipping patterns more wisely.
Changed default optimization flags for AArch64 builds to -mcpu=generic to improve portability.

Validation

Improved benchdnn performance by optimizing bottlenecks in validation code.
Introduced --num-streams knob in benchdnn to support benchmarking in multi-stream scenarios.

Known Limitations

Intel Datacenter GPU Flex Series driver for Windows has an issue resulting in program hangs or crashes when oneDNN primitives are created concurrently.
int8 concat primitive may produce incorrect results on integrated GPUs with current GPU driver.
fp32 pooling primitive may produce incorrect results in rare conditions on Intel Datacenter GPU Max Series with current GPU driver.
reorder primitive causes segmentation fault for prime sizes exceeding 2^31 on Intel CPUs.
fp64 convolution and deconvolution produces incorrect results on integrated graphics in future Intel Core processors (code name Arrow Lake)
int8 matmul primitive creation with fp32 bias fails on Intel GPU Flex Series and Intel Arc Graphics.

Breaking Changes

Updated minimal supported ACL version to 23.11 (was 23.02.1).

Thanks to these Contributors

This release contains contributions from the project core team as well as Alexander Grund @Flamefire, David Svantesson @davsva01, Fadi Arafeh @fadara01, Hugh Delaney @hdelan, Ilya Lavrenov @ilya-lavrenov, Jacob Kahn @jacobkahn, Nathan John Sircombe @nSircombe, Renato Barros Arantes @renato-arantes, Sergey Shalnov @shssf, Sunita Nadampalli @snadampal, and Svetlozar Georgiev @sgeor255. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by vpirogov about 2 years ago

onednn - v3.3.5

This is a patch release containing the following changes to v3.3.4: * Fixed undefined behavior in 3D depthwise convolution on Intel CPUs (bbaec145f8c64818fd5c3ed2cb9e2ae69daef887) * Added warning for ACL versions newer than maximum supported (7473012743ae3227dbfa208cad260d29d86d5080) * Added citation file (fea9f88fa7f8056a5addedfdebdb2dda35ee7a9d) * Fixed SEGFAULT in int8 convolution on processors with Intel AMX support (2a8e122b63b55f897c470d23f21003bb70f0e839)

- C++
Published by vpirogov about 2 years ago

onednn - v3.4-rc

Performance Optimizations

Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). These optimizations are now included by default on compatible processors.
- Improved RNN primitive performance with LBR_GRU cell.
- Improved softmax performance on processors with Intel AVX2 or Intel AVX-512 instruction set support.
- Improved fp32 inner product performance on processors with Intel AVX2 instruction set support.
- Improved fp32, fp16, bf16 matmul primitive performance on processors with Intel AVX-512 and Intel AMX instruction set support.
- Improved int8 matmul performance with transposed A tensor.
- Improved performance of resampling primitive on processors with Intel AVX2 instruction set support.
- Improved performance of int8 convolution with post-ops.
- Optimized batch matmul with binary post-op and broadcast mask 1 and 14.
- Improved the Scaled Dot Product Attention (SDPA) subgraph performance with Graph API.
- Improved performance of subgraphs including matmul and add operations and mixed int8 and bfloat16 data types with Graph API.
- [experimental] Improved performance of reduction, softmax and layernorm operations with experimental Graph Compiler backend.
- [experimental] Improved performance for llama2 MLP subgraph with experimental Graph Compiler backend.
Intel Graphics Products:
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
- Improved matmul performance for cases relevant to Large Language Models (LLMs) and Transformer-like models.
- Improved convolution performance for cases relevant to the Stable Diffusion model.
- Improved RNN primitive performance.
- Improved pooling forward propagation performance.
- Improved batched matmul performance for cases with 5 dimensions or more.
AArch64-based Processors:
- Added an option to build oneDNN with macOS Accelerate library to improve performance on Apple silicon.
- Improved reorder primitive performance with Compute Library for the Arm architecture (ACL).
- Improved bf16 inner product product primitive performance with ACL.

Functionality

Introduced GPT-Q support to improve Large Language Models (LLMs) performance with compressed weights. Optimized implementation is available for Intel Graphics Products and support matmul with int8 wight compression.
Introduced fp8 data type support in primitives and Graph API. Optimized implementation is available for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
Introduced support for fp16 and bf16 scale and shift arguments for layer normalization. Optimized implementation is available for Intel Graphics Products.
[experimental] Introduced unstructured sparsity support for processors with Intel AMX support relying on VCOMPRESS/VPEXPAND instructions.
Intel Graphics Products
- Introduced PReLU post-op support for inner product and matmul primitives.

Usability

Added opt-in deterministic mode support. Deterministic mode guarantees that results are bitwise identical between runs in a fixed environment.
Introduced accumulation mode control.
Extended oneDNN verbose diagnostics with information on dispatching decisions in convolution and matmul implementations.
Extended verbose diagnostics for Graph API with information for operation schema check results and pattern matching results.
Reduced RNN primitive memory consumption on GPUs.
Added examples demonstrating use of oneDNN Graph API in eager mode use cases.
Extended tensor constructor in Graph API to support memory allocation and management by the library.
Introduced new API and environment variable to manage Graph API constant tensor cache capacity.
Improved the efficiency of pattern matching in Graph API by optimizing pattern registration, reducing pattern numbers, and skipping patterns more wisely.
Changed default optimization flags for AArch64 builds to -mcpu=generic to improve portability.

Validation

Improved benchdnn performance by optimizing bottlenecks in validation code.
Introduced --num-streams knob in benchdnn to support benchmarking in multi-stream scenarios.

Breaking Changes

Updated minimal supported ACL version to 23.11 (was 23.02.1).

Thanks to these Contributors

This release contains contributions from the project core team as well as Alexander Grund @Flamefire, David Svantesson @davsva01, Fadi Arafeh @fadara01, Hugh Delaney @hdelan, Ilya Lavrenov @ilya-lavrenov, Jacob Kahn @jacobkahn, Nathan John Sircombe @nSircombe, Renato Barros Arantes @renato-arantes, Sergey Shalnov @shssf, Sunita Nadampalli @snadampal, and Svetlozar Georgiev @sgeor255. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by harrymao2022 over 2 years ago

onednn - v3.3.4

This is a patch release containing the following changes to v3.3.3: * Fixed performance regression in convolution, matmul and inner product primitives with post-ops on Intel CPUs (2e3c94c5aeb6be1ce992d799943fdc4f3123905f) * Fixed performance regression in bfloat16 matmul on processors with Intel AMX instruction set support (c0ae38cdf1201caf8ffd2906077defdfe7f4aaa3, fa4364057891fdec528d9442c88d0715306bff2d) * Fixed segfault in 3D convolutions with different h and w parameters on Intel CPUs (b5f916ec068f783dbba2cd4f04a673e996f9efba) * Fixed performance regression in fp32 convolution backpropagation on Intel CPUs (ee3b12d5388d7d749a120cf8522efd6f5aeecc09) * Reduced benchdnn memory consumption on Intel GPUs (84a8f57d45f215cf89d0f80a57a66b78eaf9b440)

- C++
Published by vpirogov over 2 years ago

onednn - v3.3.3

This is a patch release containing the following changes to v3.3.2: * Fixed performance regression in int8 convolutions on processors with Intel AVX-512 and Intel DL Boost support (a00661ff735e5448ef3a80e4e2df7a1556f8a84f) * Fixed race condition during library initialization on Intel Data Center GPU Max Series (7dfcd116e245e4a167a64bd39a24e957d2b939de) * Fixed accuracy issue in experimental Graph Compiler with LLVM code generator (8892e7efadeaf42d75f75e64d095635458836cd7) * Disabled int8 RNN implementation for cases with non-trivial strides (2195e4b23d57c38a439c50232783f654b96f575c) * Fixed incorrect results in bfloat16 convolution implementation on processors with Intel AMX support (9f00af9312a9b76a1880e1aaac513188793ecaa7) * Fixed incorrect results in fp16 and int8 convolution on Intel Core Ultra integrated GPUs (69cef84c4f09398858393035eafa2bd4a29ec0b0, 79bc6cc0477db1ce7e732f20d005ff2b9e88390e, c9c0b09c5e64114eada1b6beb7f6db36331e0fac)

- C++
Published by vpirogov over 2 years ago

onednn - v3.3.2

This is a patch release containing the following changes to v3.3.1: * Fixed incorrect results in bfloat16 reorder on Intel Core Ultra integrates GPUs (9025980286c506908f98819e068a047a1d268842, ed9de2afd1fede32a317cbc5df953dfe997e78ea, 0c6bda10b3ea760205d4707a554b76045ef6f964) * Fixed incorrect results in matmul, inner product, and RNN primitives on Intel Core Ultra integrated GPUs (6edab9f01ec5cf8b30ee0b474aa25417f0493897) * Updated compiler optimization flags for AArch64 processors to make build portable (8829c249b713dddc87c2669120a9798e202ac633) * Fixed segmentation fault during library initialization on AArch64 processors (3e15c6113ffeff3545775cbcca9bd84911856cb9)

- C++
Published by vpirogov over 2 years ago

onednn - v3.3.1

This is a patch release containing the following changes to v3.3: * Fixed int8 convolution accuracy issue on Intel GPUs (09c87c79bccbad8fa451b224a0f07f87095e3907) * Switched internal stream to in-order mode for NVIDIA and AMD GPUs to avoid synchronization issues (db01d62b3fc80897d88dc42f4dcdfcb0d90c131a) * Fixed runtime error for avgpool_bwd operation in Graph API (d025ef6620b131f3487bb748866ddd9d7225c09f, 9e0602ad37afa18d46f407cb52577f1afead238b, e0dc1b3d070313052f5fd6ac739778d45b57859c) * Fixed benchdnn error reporting for some Graph API cases (98dc9dbecb3f36234474c9d6e96ab6571497633b) * Fixed accuracy issue in experimental Graph Compiler for int8 MHA variant from StarCoder model (5476ef7c165d943fbce94ca0f44a13d6868e65f3) * Fixed incorrect results for layer normalization with trivial dimensions on Intel GPUs (a2ec0a0c5805314220db925e1323e4675e3ca379) * Removed redundant synchronization for out-of-order SYCL queues (a96e9b1a6769171e74b0b8e031489303438906e5) * Fixed runtime error in experimental Graph Compiler for int8 MLP subgraph from LLAMA model (595543dd093df3e92621c253d6da3f9092ec7ff8) * Fixed SEGFAULT in experimental Graph Compiler for fp32 MLP subgraph (42071057abb2fcbbca6ed67117bdb1a5ee3dc0cd) * Fixed incorrect results in experimental Graph Compiler for MLP subgraph (57e14b56d4e6fab2ab49dbd47fd579482d79535a) * Fixed the issue with f16 inner product primitive with s8 output returning unimplemented on Intel GPUs (bf12207b0312c0174f0c47ae0d3abd70edc31957, 800b5e9613bd0994af82706ef024ad2b453be2b6, ec7054a2c79ae33d3db4ff04ce11360c2c896d56) * Fixed incorrect results for int8 deconvolution with zero-points on processors with Intel AMX instructions support (55d2cecd698f865efac2e1dbf2f701b4b8095df1)

- C++
Published by vpirogov over 2 years ago

onednn - v3.3

Performance Optimizations

Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved int8 convolution performance with zero points on processors with Intel AMX instruction set support.
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). This functionality is disabled by default and can be enabled via CPU dispatcher control.
- Improved fp32 and int8 convolution performance for cases with small numbers of input channels for processors with Intel AVX-512 and/or Intel AMX instruction set support.
- Improved s32 binary primitive performance.
- Improved fp16, fp32, and int8 convolution performance for processors with Intel AVX2 instructions support.
- Improved performance of subgraphs with convolution, matmul, avgpool, maxpool, and softmax operations followed by unary or binary operations with Graph API.
- Improved performance of convolution for depthwise cases with Graph API.
- [experimental] Improved performance of LLAMA2 MLP block with Graph Compiler.
Intel Graphics Products:
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
- Reduced RNN primitive initialization time on Intel GPUs.
AArch64-based Processors:
- Improved fp32 to bf16 reorder performance.
- Improved max pooling performance with Arm Compute Library (ACL).
- Improved dilated convolution performance for depthwise cases with ACL.

Functionality

Introduced group normalization primitive support. The functionality is currently available on CPUs.
- Intel CPUs:
Introduced support for zero points in int8 convolution with groups and 3D spatial. # Usability
Extended verbose mode output:
- Improved diagnostics on engine creation errors.
- Added information on Graph API calls.
- Added information on strides for non-dense memory objects.
- Added values of runtime dimension.
- Added indication that primitive descriptor was created with any memory format tag.
Introduced examples for Graph API.
Graph API constant tensor cache is now disabled by default and requires opt-in with dnnl::graph::set_constant_tensor_cache() call.
Reduced oneDNN Graph API memory consumption in certain scenarios.

Validation

Extended benchdnn performance reporting with primitive creation time.
Introduced cold cache mode in benchdnn.

Known Limitations

Current GPU OpenCL runtime for Linux has an issue resulting in convolution producing incorrect results on integrated GPUs based on Xe architecture. SYCL configuration is not affected.
Pooling, resampling, prelu, batch normalization, layer normalization, and eltwise primitives may sporadically produce incorrect results on Intel Arc GPUs on Windows.
Current GPU driver for Linux has an issue resulting in program hangs or crashes when oneDNN primitives are executed concurrently on Intel Datacenter GPU Max Series.
Extensive use of RNN primitive on Intel GPUs with default primitive cache setting may lead to a device reboot. Workaround: consider reducing primitive cache size to 100.
Int8 deconvolution with signed weights and activations may produce incorrect results of processors with Intel AMX support.
Int8 softmax may fail crash on Windows in SYCL debug configuration.

Thanks to these Contributors

This release contains contributions from the project core team as well as Amy Wignall @AmyWignall-arm, @baibeta, Benjamin Taylor @bentaylorhk-arm, Ilya Lavrenov @ilya-lavrenov, Kentaro Kawakami @kawakami-k, Milos Puzovic @milpuz01, Renato Barros Arantes @renato-arantes, @snadampal, @sparkyrider, and Thomas Köppe @tkoeppe. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by harrymao2022 over 2 years ago

onednn - v3.3-rc

Performance Optimizations

Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved int8 convolution performance with zero points on processors with Intel AMX instruction set support.
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). This functionality is disabled by default and can be enabled via CPU dispatcher control.
- Improved fp32 and int8 convolution performance for cases with small numbers of input channels for processors with Intel AVX-512 and/or Intel AMX instruction set support.
- Improved s32 binary primitive performance.
- Improved fp16, fp32, and int8 convolution performance for processors with Intel AVX2 instructions support.
- Improved performance of subgraphs with convolution, matmul, avgpool, maxpool, and softmax operations followed by unary or binary operations with Graph API.
- Improved performance of convolution for depthwise cases with Graph API.
- [experimental] Improved performance of LLAMA2 MLP block with Graph Compiler.
Intel Graphics Products:
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
- Reduced RNN primitive initialization time on Intel GPUs.
AArch64-based Processors:
- Improved fp32 to bf16 reorder performance.
- Improved max pooling performance with Arm Compute Library (ACL).
- Improved dilated convolution performance for depthwise cases with ACL.

Functionality

Introduced group normalization primitive support. The functionality is currently available on CPUs.
- Intel CPUs:
Introduced support for zero points in int8 convolution with groups and 3D spatial. # Usability
Extended verbose mode output:
- Improved diagnostics on engine creation errors.
- Added information on Graph API calls.
- Added information on strides for non-dense memory objects.
- Added values of runtime dimension.
- Added indication that primitive descriptor was created with any memory format tag.
Introduced examples for Graph API.
Graph API constant tensor cache is now disabled by default and requires opt-in with dnnl::graph::set_constant_tensor_cache() call.
Reduced oneDNN Graph API memory consumption in certain scenarios.

Validation

Extended benchdnn performance reporting with primitive creation time.
Introduced cold cache mode in benchdnn.

Thanks to these Contributors

This release contains contributions from the project core team as well as Amy Wignall @AmyWignall-arm, @baibeta, Benjamin Taylor @bentaylorhk-arm, Kentaro Kawakami @kawakami-k, Milos Puzovic @milpuz01, @snadampal, @sparkyrider, Thomas Köppe @tkoeppe. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by harrymao2022 over 2 years ago

onednn - v2.7.5

This is a patch release containing the following changes to v2.7.4: * Fixed a correctness issue in fp32 batched matmul with transposed source tensor on processors with Intel AVX-512 instruction set support (1a9b80d7ad6856437c3e0b504bb53dca772eb0fe) * Improved batched matmul performance on processors with Intel AMX instructions support (8c20f62dbcd4d622c8a279b7c81dacb629f1de41, acb8e12b0f3e70d2e80543e31a91362c8852bbaf) * Fixed a correctness issue in int8 convolution primitive with zero points on processors with Intel AVX2 and Intel DL Boost support (0abbf225ce906987bc3728252b5842fb0239daab, d3a9f02e50334a0ebe102dd8bdb7887deeb12ec5) * Improved convolution performance with small number of input channels on processors with Intel AVX-512 instruction set support (fc7fced9988124027220fb53dfb16022c9be35c0)

- C++
Published by vpirogov almost 3 years ago

onednn - v3.2.1

This is a patch release containing the following changes to v3.2: * Fixed a potential issue SEGFAULT when oneDNN primitives created in parallel (0a6202f5000cf347995ab744c25aa26cabf2482d) * Replaced deprecated SYCL API get_pointer with get_multi_ptr (fdbff4591f952d02a0c934f854a9b225a7097a21, 51ed43bb5cb08f38b0b652255a13bb4072b2ee57) * Fixed an error in device indices detection for persistent cache (25575c2d20a9885640c89771c99a0d27b5444b4d) * Improved benchdnn performance results accuracy for Graph API (9dfe343992209ecc6eb1265a140b6f0db228d90a) * Fixed an issue with profiling API not respecting ONEDNN_EXPERIMENTAL_PROFILING build knob. This behavior manifests in apparent memory leak when oneDNN primitives are executed on a queue with enabled profiling (8d796efb609c33ecdd31e3e7b26d94d959dd83b9, 51a8f7ad892b1174d32cba8358804fad09b58f76, 2ca29381eeb5dde64d90468e440f87b6f9ad01d9) * Fixed a correctness issue in resampling primitive with binary and/or sum post-op on Intel CPUs (65ccd2506eeafb44822c682acfef97ef18bea09f, 4a0e087b405f4ebc682cf82c4a5bb96e9b9976d4, f333bb8c191fbfab368645aeac1c3a0d1892eda4) * Fixed a correctness issue in int8 matmul with zero-points for processors with Intel AVX2 and Intel DL Boost instructions support (ec0b2ee85fc2a2dbdeec10035c5ef5813d8fb5ea, 6d2e567c9361992adf235545c9fc2047184ed6e6) * Fixed a correctness issue in fp32 batched matmul with transposed source tensor on processors with Intel AVX-512 instruction set support (36f355e0773f79cca5a639a5a3558f45da57c35d) * Fixed a correctness issue in matmul and inner product with post-ops on processors with Intel AVX2 and Intel DL Boost with fp16 and bfloat16 instruction set support (b76d4cae333fc4e015d47eb737e10551daf30334) * Fixed a potential out of bounds issue during GPU kernel creation (190a9b28170f5326241c9c4ab6bc7964877e953d) * Updated build system to use TBB-provided CMake config file when available (40112196287e8866a7259df35f817229454d0c96)

- C++
Published by vpirogov almost 3 years ago

onednn - v3.2

Performance Optimizations

Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable Processor (formerly Sapphire Rapids).
- Improved performance for future Intel Xeon Scalable Processor (code-named Sierra Forest). The functionality is disabled by default and can be enabled via CPU dispatcher control.
- Improved fp32 inner product performance for processors with Intel AVX-512 instructions support.
- Improved bf16 and int8 matmul performance with runtime dimensions for processors with Intel AMX instructions support.
Intel Graphics Products:
- Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc Graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
- Reduced creation time for matmul, inner product, and RNN primitives.
AArch64-based Processors:
- Improved convolution performance with post-ops on processors with SVE support.
- Improved fp32 and fp16 depth-wise convolution performance with Arm Compute Library (ACL).
- Improved fp32 deconvolution performance for math mode bf16 or any with ACL.
IBM Z Platform:
- Improved int8 matmul, inner product, and RNN performance for s390 z15 systems.

Functionality

[experimental] Introduced Graph Compiler backend for Graph API. Graph Compiler improves performance of composite operations like multi-head attention (MHA), multi-level perceptron (MLP), and convolution residual blocks for processors with Intel AVX-512 and Intel AMX instructions support.
Extended Graph API with boolean data type, select, and pow operations.
Introduced support for binary and eltwise post-ops in softmax primitives.
Introduced reference SYCL implementations of batch normalization, layer normalization, linear response normalization (LRN), binary, softmax, eltwise, pooling, PReLU, shuffle, and resampling primitives. These implementations address functional gaps on NVIDIA and AMD GPUs where support is missing in native libraries.
Intel Graphics Products:
- Introduced mixed precision support for binary primitives.
NVIDIA GPUs:
- Introduced bfloat16 support for deconvolution and softmax primitives.
AMD GPUs:
- Introduced support for inner product, convolution, deconvolution, batch normalization, and reorder primitives support.

Usability

Extended verbose mode with additional capabilities, including information about implementation dispatching decisions and reasons for primitive creation errors.
Reduced stack consumption to less than 20 KB across implementations.
[experimental] Introduced profiling API for SYCL and OpenCL applications.

Validation

Introduced fast performance validation mode (--mode=F) in benchdnn. Testing speed is improved by initializing oneDNN objects in parallel and avoiding use of host memory when benchmarking GPU primitives.
Reduced benchdnn memory consumption in performance validation mode.
Introduced smoke test set for benchdnn. This test set provides basic validation for all primitives.

Known Limitations

On future Sierra Forest platforms, fp32 matmul with bfloat16 binary post-op may produce incorrect results on processors with Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Deep Learning Boost (Intel® DL Boost) support.
On SKX+ platforms, fp32 convolution forward propagation with strides has performance regression on processors with Intel® AVX-512 instructions support.
On all platforms, resampling primitive with binary post-op may produce incorrect results on CPUs.
On all GPU platforms, extensive use of the RNN primitive on Intel GPUs with default primitive cache settings may lead to a device reboot. Workaround: consider reducing primitive cache size to 100.
On DG2 and ATS-M platforms
- Convolution and deconvolution primitives on Intel® Arc™ GPUs on Windows may lead to memory corruption under heavy repeated use.
- The bfloat16 matmul primitive may crash on Intel® Arc™ GPUs on Windows.
- Pooling, resampling, prelu, batch normalization, and layer normalization may sporadically produce incorrect results on Intel® Arc™ GPUs on Windows.
- oneDNN Graph partitions containing ConvTransposeBackwardWeights or int8 matmul operations may produce incorrect results on Intel® Processor Graphics on Windows.
On PVC platforms
- The bfloat16 matmul primitive has performance regression with shapes 14x128:128x200:14x200 and 200x128:128x200:200x200 on the Intel® Data Center GPU MAX Series.
- oneDNN primitives may crash or produce incorrect results with tensors exceeding 4 Gb in size.
- The softmax primitive with a NHWC memory format may produce incorrect results on the Intel® Data Center GPU Max Series.
On GEN12 platforms, the inner product weight gradient may produce incorrect results on Intel® Processor Graphics on Windows.

Thanks to the Contributors

This release contains contributions from the project core team as well as Abdelrauf @quickwritereader, Alexey Vishnyakov @SweetVishnya, Annop Wongwathanarat @annop-w, Anthony Roberts @anthony-linaro, Crefeda Rodrigues @cfRod, David Svantesson @davsva01, Fadi Arafeh @fadara01, Ilya Lavrenov @ilya-lavrenov, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, Milos Puzovic @milpuz01, RambabuSwargam @RambabuSwargam, Sai Teja @saiteja13427, Taiju Tsuiki @tzik. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by harrymao2022 almost 3 years ago

onednn - v3.1.1

This is a patch release containing the following changes to v3.1: * Fixed correctness issue in pooling primitive with post-ops on Intel GPUs (4b7bc1a7bf16909003f63bf66d3d730cee00e5db) * Fixed segfault in bfloat16 convolution on processors with Intel AMX support (461d55e65f2bc0f45fcdfc3405493226218d22ee) * Fixed correctness issue in deconvolution primitive with post-ops on Intel GPUs based on Xe-LP architecture (c8943f588e99f6251a443ee4eb5c274e9c942947, ad3c62f104b07d30cc0f5cf34ca7bf127041e4dc) * Fixed performance regression in int8 convolution primitive with scales (7fa3b6f335893270cdd079f4f8aadd36cf8f490b, bb3ecc460605eae3ca8a8ee79a8d9122f195730b) * Fixed correctness issue in int8 convolution primitive with zero points on processors with Intel AVX2 and Intel DL Boost support (d721767a554f9a4da70bd6bc1c27c00b1ea80cc2, f6365b1b2c6e6d79e59207dad090b9643224f147) * Fixed performance regression in int8 inner product on processors with Intel AVX-512 and Intel DL Boost or Intel AMX support (2ede31e834a25ca14c648e8617b972148c94554c) * Fixed segfault in pooling primitive with post-ops on processors with Intel SSE4.1 support (d712173a5b9df2bdefd12cc94be2e83e64cfb433, e4085a706dd0b41c3d8171193b816a3c4e52c01d) * Fixed integer overflow in eltwise primitive on Intel GPUs (1932b3d04e574745d54802ee19e18bcbe8887e2d, be05c3392eaf86f2d897c5ec42a8860361c290b8, 148006b86f66e4af8f3ebd7db94980de487b9287, 2e643692480be21019b2b71db69e07729bfbf26c, b4423fbc11e574697d97eda18d4b0d8d7b1f60f3, 87fd48f48847463cbd1c42a39c9aa092158dbf2f, 9a66ac6f394071b05285b063a393acd297e3c662, 6ce52eb340486373670a9975c54449cf14a73d4f, 36bf079e7e99e0408ec11fe94cd64439f30b4014, 161d2b6416f4e9c17eabd1d45b8a3aeb2d4e9dd0, a5ef0788afcb719d22a311f91b31f3afca392a7c, d058bd8898b92330546d3f8d52335631fda5051a) * Fixed primitive creation error in large 3D convolutions on Intel GPUs (7c23d9e85ef328081f7d9836ebfffda755f4b496) * Fixed performance regression in fp32 convolution primitive weight gradient on Intel GPUs (ff209f967c2bdfa1139779cf59dced374e2064c5, 87108392da71b06594356a18232ac1378e28adfc) * Fixed primitive creation error in int8 convolution with zero points on Intel GPUs (cb9169397ceee206fece71f73b5d627ee9eea33f, 85e58af6b5cb1a9cd42cd602832c035a3b3a660f) * Fixed correctness issue in fp32 convolution with Winograd algorithm on Intel GPUs (97ac88509bf8799fd03eb768faec302d44ce38dc) * Fixed primitive creation error in depthwise convolution on Intel GPUs based on Xe-LP architecture (51d608d24f09d6b0ad2d60008f09646dbf79ee60) * Fixed segfault during Graph partition compilation (a5d35682307ec81107f603b66c5f4ca95f421fbb) * Fixed crashes in inner product with unsupported weight formats on Intel64 CPUs (c0f4e93903f1c32bef8378d58177ef971c400e90) * Fixed an issue with compilation of Graph partitions containing matmul and using destination tensor layout any on Intel GPUs (ab2041d39862de747535037eb5a73c675d93d323, f2c457d72896d6c86245a6c6e80539b842aec369) * Improved accuracy of eltwise primitive with gelu_erf algorithm on Intel64 CPUs (e67abefadbb4fd73ea6a4d3981965bc56eb77b97) * Fixed correctness issue in int8 matmul and inner product primitives on Intel GPUs based on Xe-HPG and Xe-HPC architecture (36aa6224ebae1413a6badd43ffc96d3412c8f8ec) * Fixed potential correctness issue in bfloat16 convolution weight gradient on processors with Intel AMX support (c93e673bba299fdc62733f22d65d91f4dbc300dd, 8da108375bc02b08a385b167a49aa8d1189b66d6, f7acf9877b368a5f704dcc9efcb913345b477bbc) * Fixed memory corruption in inner product weight gradient on processors with Intel AMX support (b56a89e1b977d793f2de89dc95bb7f07f2449cd8) * Fixed integer overflow issue in convolution primitive on Intel GPUs (774deabcbb9dc3e452bdafcde5e92a55c3701309, 663c2e44272c57a97e5f20e3a7a28cb9ac91ae01, 12d57430c66eb4d83532a2338443faae7be8ea5c, 31ac0e045981b03434c7592fe84af97a79a3d4a8, e3cb07d60473c23829db987384e5366b924e22c4) * Fixed correctness issue in matmul primitive with broadcasted bias on Intel GPUs (3ba7e8b9c14948da35c86d4d74725f0d23511fc8) * Fixed correctness issue in inner product primitive with post-ops on processors with Intel AVX2 support (69260f661030f66b34fefeab97044c81769462a9) * Fixed out of bounds prefetching in matmul and inner product primitives on Intel GPUs (2b8f6b16dd894f7c13c33a9fd5c497cff10d66c2) * Fixed dispatching issues for fp32 inner product implementation on processors with Intel AVX2 and Intel DL Boost supprt (f27dedbfc093f51032a4580198bb80579440dc15, f8d7c2e40a965fc52521d4ba9c793d8adc2be4e1) * Fixed division by zero issue in eltwise and eltwise post-op on Intel GPUs (f5654f55582f003c22aee23e5a91acfead8d1e1b, a18c19e654483b547bbe791d0640eceef4ef2e79, a7c8cbc428ad361e2f290605be1280268eb8ea56, 44355a60e31fd20bf6fa029af5bf3eebc533ec2c) * Fixed correctness issue for 3D convolution primitive with post-ops (e6b93af5bdb32691ad90d3f537158649b61a6fc4)

- C++
Published by vpirogov almost 3 years ago

onednn - v3.2-rc

Performance Optimizations

Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable Processor (formerly Sapphire Rapids).
- Improved performance for future Intel Xeon Scalable Processor (code-named Sierra Forest). The functionality is disabled by default and can be enabled via CPU dispatcher control.
- Improved fp32 inner product performance for processors with Intel AVX-512 instructions support.
- Improved bf16 and int8 matmul performance with runtime dimensions for processors with Intel AMX instructions support.
Intel Graphics Products:
- Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc Graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
- Reduced creation time for matmul, inner product, and RNN primitives.
AArch64-based Processors:
- Improved convolution performance with post-ops on processors with SVE support.
- Improved fp32 and fp16 depth-wise convolution performance with Arm Compute Library (ACL).
- Improved fp32 deconvolution performance for math mode bf16 or any with ACL.
IBM Z Platform:
- Improved int8 matmul, inner product, and RNN performance for s390 z15 systems.

Functionality

[experimental] Introduced Graph Compiler backend for Graph API. Graph Compiler improves performance of composite operations like multi-head attention (MHA), multi-level perceptron (MLP), and convolution residual blocks for processors with Intel AVX-512 and Intel AMX instructions support.
Extended Graph API with boolean data type, select, and pow operations.
Introduced support for binary and eltwise post-ops in softmax primitives.
Introduced reference SYCL implementations of batch normalization, layer normalization, linear response normalization (LRN), binary, softmax, eltwise, pooling, PReLU, shuffle, and resampling primitives. These implementations address functional gaps on NVIDIA and AMD GPUs where support is missing in native libraries.
Intel Graphics Products:
- Introduced mixed precision support for binary primitives.
NVIDIA GPUs:
- Introduced bfloat16 support for deconvolution and softmax primitives.
AMD GPUs:
- Introduced support for inner product, convolution, deconvolution, batch normalization, and reorder primitives support.

Usability

Extended verbose mode with additional capabilities, including information about implementation dispatching decisions and reasons for primitive creation errors.
Reduced stack consumption to less than 20 KB across implementations.
[experimental] Introduced profiling API for SYCL and OpenCL applications.

Validation

Introduced fast performance validation mode (--mode=F) in benchdnn. Testing speed is improved by initializing oneDNN objects in parallel and avoiding use of host memory when benchmarking GPU primitives.
Reduced benchdnn memory consumption in performance validation mode.
Introduced smoke test set for benchdnn. This test set provides basic validation for all primitives.

Thanks to the Contributors

This release contains contributions from the project core team as well as Abdelrauf @quickwritereader, Alexey Vishnyakov @SweetVishnya, Annop Wongwathanarat @annop-w, Anthony Roberts @anthony-linaro, Crefeda Rodrigues @cfRod, David Svantesson @davsva01, Fadi Arafeh @fadara01, Ilya Lavrenov @ilya-lavrenov, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, Milos Puzovic @milpuz01, RambabuSwargam @RambabuSwargam, Sai Teja @saiteja13427, Taiju Tsuiki @tzik. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by harrymao2022 almost 3 years ago

onednn - v2.7.4

This is a patch release containing the following changes to v2.7.3: * Fixed potential NaN issue in convolution weight gradient on Intel CPUs (6d80bb48714f8f8d030f055435f5bfde3a382f15, 4c34f89653259b2e15e277ff0663d6705f093e1b, 017950a16168640764d17558e41010d0ae038377, 796a600c3de2993b5d5819995ad13eb70d097496) * Improved bfloat16 convolution weight gradient performance for processors with Intel AMX support (21bdc21f37ff835b9ce54d4b713d7bfd65060e30, 82cb7d37f861a471215b242e8df0330523cdf223, b2e948f931367c81a6887d4e0e544a9f50dcd673, 0a33f70c1b283d18631d299d3c907743d215e80d, ff05d0e8c2db056b0857bcbed22c5097f76529da) * Fixed out of bounds writes in bfloat16 inner product weight gradient for processors with Intel AMX support (caead724fc6d309c7706760a520908e28b8f8b0b) * Fixed illegal instruction in matmul for processors with Intel AMX support (be942a240e775dfda47bfff5622106851df218e5, 28ddb5bc91b01e266575047a676569c4af35a5eb, d264ba494a9f6b15d3eb21ec26e4606dd8d458c8) * Fixed segfault in convolution with depthwise post-op for processors with Intel SSE4.1 support (f7081009737b836f23ef8adce70994815acfa842) * Worked around segfaults for builds with Intel C/C++ Compiler 2021 for macOS (1382605c20bcdac9aa17c62cc38924138bc57db1) * Fixed segfault in bfloat16 convolution with strides for processors with Intel AMX support (c3b1dcd2605cae5609d7175fcf5223da16e03fb9) * Fixed correctness issue in int8 convolution with zero points for processors with Intel AMX support (5e76d8b07a431051b7d6a612c4933e36621fbc39) * Fixed assertion fail in int8 convolution for processors with Intel AMX support (05629a5ccfae9250e6495ffc7d51152025fcfee1) * Fixed incorrect results in vanilla GRU for Intel CPUs (2089770c4818be8933c5e9d1dd3cbaeba1457667) * Improved bfloat16 convolution performance for cases with large number of channels and spatial dimensions (c67f46b0df29c3a7c6cbd0a9f1ebbc9adf4457e8, c9cb51d6bfb68aee8377e7781a5c4512f6aa4bea, 4e2c5730426422fc362c02a963b66072c083acaf, 474527f47acb1aeff2bf52efd64e09ac95d8ef5b, 87e8ea9d8e0499b19149c69748ef8503ad2fb75b) * Fixed an issue with incorrect header files location when using oneDNN as subproject (be6abca883303e0cb4d2edac28da929a21d5d2a2)

- C++
Published by vpirogov about 3 years ago

onednn - v3.1

Performance Optimizations

Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processor (formerly Sapphire Rapids).
- Introduced initial optimizations for future Intel Xeon Scalable processor (code name Sierra Forest). The functionality is disabled by default and should be enabled via CPU dispatcher control.
Intel Graphics Products:
- Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
- Improved concat primitive performance with per-argument scales on Intel GPUs.
AArch64-based Processors:
- Improved layer normalization primitive performance with Compute Library for the Arm Architecture (ACL).
AMD GPUs:
- Introduced optimized matmul implementation.
RISC-V-based Processors:
- Improved pooling primitive performance for processors with RISC-V vector extension (RVV) support.

Functionality

Enabled Graph API as a production feature. Graph API is intended to simplify oneDNN integration into frameworks.
Added an option to zero-out weight gradient in RNN primitive. See details in corresponding RFC.
[experimental] Added support for sparse memory and dense by sparse matrix-matrix multiplication support in the matmul primitive. The functionality is supported on processors with Intel AVX2 and Intel AVX-512 instruction support.
Introduced out-of-order queues support for OpenCL runtime. See the OpenCL Interoperability section in the Developer Guide for more details.
Added support for the non-zero alpha parameter in the batch normalization ReLU post-op on Intel GPUs.
Enabled the layer normalization primitive with f64 datatype support on Intel GPUs.
Added support of per-argument scales in matmul, convolution, inner product, and reorder primitives on NVIDIA GPUs.

Validation

Extended benchdnn with functional and performance validation for Graph API.

Breaking Changes

Builds with OpenCL runtime will fail unless Graph API is disabled with ONEDNN_BUILD_GRAPH=OFF.

Known Issues and Limitations

Graph API constant cache feature is disabled with SYCL CPU runtime due to an issue with the oneAPI DPC++ Compiler runtime. This will result in lower performance for some scenarios.

Thanks to the Contributors

This release contains contributions from the project core team as well as Amy Wignall @AmyWignall-arm, Annop Wongwathanarat @annop-w, @arlesniak, @bdmoore1, Crefeda Rodrigues @cfRod, David Svantesson @davsva01, Fadi Arafeh @fadara01, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, Pavel Zamelin @pazamelin, Pawel Piotrowicz @pawelpiotrowicz, Peter Caday @petercad, @ranzhejiang, and Sanchit Grover @sanchit-grover-intel. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by harrymao2022 about 3 years ago

onednn - graph-v0.9

This is the Beta Update 3 release of oneDNN Graph API based on oneDNN v3.0.1.

Performance Optimizations

Improved multi-level perceptron (MLP) and residual block subgraphs performance with oneDNN Graph Compiler backend on 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
Improved dynamic shape performance for MLP and multi-head attention (MHA) patterns with oneDNN Graph Compiler backend.
Improved performance of oneDNN Graph Compiler built-in code generator.

Functionality

Extended the set of multi-head attention (MHA) variants supported by oneDNN Graph Compiler.

Known Issues and Limitations

The weight’s opaque layout can be queried only from a compiled partition.

Thanks to the Contributors

This release contains contributions from the project core teams as well as Jiong Gong, Chunyuan Wu, Sanchit Jain, Yiqiang Li, Yunfei Mao, Kiefer Kuah and others.

- C++
Published by vpirogov about 3 years ago

onednn - v3.1-rc

This is a release candidate for oneDNN v3.1. Please provide feedback and submit defect reports via Github issues.

Performance Optimizations

Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processor (formerly Sapphire Rapids).
- Introduced initial optimizations for future Intel Xeon Scalable processor (code name Sierra Forest). The functionality is disabled by default and should be enabled via CPU dispatcher control.
Intel Graphics Products:
- Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
- Improved concat primitive performance with per-argument scales on Intel GPUs.
AArch64-based Processors:
- Improved layer normalization primitive performance with Compute Library for the Arm Architecture (ACL).
AMD GPUs:
- Introduced optimized matmul implementation.
RISC-V-based Processors:
- Improved pooling primitive performance for processors with RISC-V vector extension (RVV) support.

Functionality

Enabled Graph API as a production feature. Graph API is intended to simplify oneDNN integration into frameworks.
Added an option to zero-out weight gradient in RNN primitive. See details in corresponding RFC.
[experimental] Added support for sparse memory and dense by sparse matrix-matrix multiplication support in the matmul primitive. The functionality is supported on processors with Intel AVX2 and Intel AVX-512 instruction support.
Introduced out-of-order queues support for OpenCL runtime. See the OpenCL Interoperability section in the Developer Guide for more details.
Added support for the non-zero alpha parameter in the batch normalization ReLU post-op on Intel GPUs.
Enabled the layer normalization primitive with f64 datatype support on Intel GPUs.
Added support of per-argument scales in matmul, convolution, inner product, and reorder primitives on NVIDIA GPUs.

Validation

Extended benchdnn with functional and performance validation for Graph API.

Breaking Changes

Builds with OpenCL runtime will fail unless Graph API is disabled with ONEDNN_BUILD_GRAPH=OFF.

Known Issues and Limitations

Graph API constant cache feature is disabled with SYCL CPU runtime due to an issue with the oneAPI DPC++ Compiler runtime. This will result in lower performance for some scenarios.

Thanks to the Contributors

This release contains contributions from the project core team as well as Amy Wignall @AmyWignall-arm, Annop Wongwathanarat @annop-w, @arlesniak, @bdmoore1, Crefeda Rodrigues @cfRod, David Svantesson @davsva01, Fadi Arafeh @fadara01, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, Pavel Zamelin @pazamelin, Pawel Piotrowicz @pawelpiotrowicz, Peter Caday @petercad, @ranzhejiang, and Sanchit Grover @sanchit-grover-intel. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by harrymao2022 about 3 years ago

onednn - v3.0.1

This is a patch release containing the following changes to v3.0: * Fixed potential correctness issue in convolution weight gradient with 1x1 filter and strides (e58996692802f4a94651f6baa6e3f0debf93b537) * Improved convolution, deconvolution, inner product, and matmul primitives performance with scales on Intel CPUs (38319f1f822387bd755183bcac2ec3d0745a88b4, 18de927dc205543701942f0f26d61f72c51f5f0b, b6170d1b79332d8ba0f72227cb5edd2aced837c0, 85171b0cc057d5ba682dee582cd72c48543389db) * Reverted MEMFD allocator in Xbyak to avoid fails in high load scenarios (eaaa41b8a30101640094e46af7f27969ed105ee2) * Fixed array out of bounds issue in bfloat16 convolution weight gradient on Intel CPUs (a17a64c330d1153fdea3d81f1420fb38c50248bd) * Improved compatibility with future versions of Intel GPU driver (eb7a0a07df12874a40c0f135d8bf16116594e0e8) * Fixed segfault in fp16 and bfloat16 convolution backward propagation on systems with Intel AMX support (293561b6a2644ef05d8d664cd81c1bcde876b481) * Fixed build issue with GCC 13 (1d7971ce488da657e23f08488cdb6ef8e484c5e8) * Fixed correctness issue in int8 RNN primitive Vanilla GRU flavor on Intel CPUs (f4a149c16faff0fb51fb292d12a7b51f6fac53bf, fbf8dca1ba9b565ddedd1cb291d3b466d0a5a45b) * Added check for unsupported arguments in binary primitive implementation for AArch64-based processors (5bb907077cd7b4c3983f7215d5509b17f3da67e2) * Fixed correctness issue in int8 convolution with zero-points on Intel Data Center GPU Max Series (96e868c473bb0e2a9b1a42b51e8f91997b52b471) * Fixed runtime error in convolution primitive with small number of channels on Xe-based graphics (068893e1c792c8e9ad5b17bc6e494359b32f910f) * Removed use of OpenCL C variable length arrays in reduction primitive implementation for Intel GPUs (41e8612f212d939643932ef309cd78bd4194f42d) * Fixed correctness issue in matmul and inner product primitives on Intel Data Center GPU Max Series (a1e6bc57b233d85a6f382db611879614236d9b05, dbb7c284e0834cd0fe84c8311484880802fa9af0) * Fixed segfault in fp16 and bfloat16 convolution backward propagation on future Intel Xeon processors (code name Sierra Forest) (399b7c5af4c5238f9956d71270adbd44f3cb25a3) * Fixed runtime error in Graph API for partitions with quantized matmul and add operations (f881da5be31abc71f90a1a750c50ec2ea5dbc516, 699ba755fde86aea3714bbce75d5b0b274302545, b8d21a58d8247097ed26816b730e3cd4c19f61c, 9421fb2a453aee957a0c1dc10be5675e5f916c2e) * Fixed convolution performance regression on Xe-based graphics (1869bf26a92f8d8f36853e537f9727412a4d1f94) * Improved convolution performance with OHWI and OIHW weight formats on Intel Data Center GPU Max Series (2d0b31ee82dc681b829f67100c05ae4e689633e6, 5bd5d52e7ee832fb0d5ece6d42a6b230023c9dd0) * Fixed include files handling in build system affecting CMake projects relying on oneDNN (c61645392fde55ac361c95a752df0cfa7ef24345) * Added tbb::finalize to tests and examples to address intermittent test crashes with TBB runtime (891a41560382cc0f991c428392078d13ccb76129, c79e54322f251aa70783ca1b837ce0d558bf3396, 8312c3addc597e6565cf1233801234c2ffafd092, 1a32b95a2c61d094206ed49d69843fdcdeb2ffcd, bd0389d81509baf6696d3927d0da4cce4c06d2d4, f05013d0e419df22ec2755dc5d74f5974871cf9e, ab7938f1b889aa43f155216f774297e8c765cd97, 31c9e7b3c1a7e262cecafe98bed128843f1c2969, f3261e4556935424946697be4b336020653b41a5, d58ac41a12179f8cca48962c4b5a44940bea97d7, f8c67b9026dc2945ed66a8f1c276611c063dae4d, 258849b71c24a89b08ac12972ec1fcaa72a9da39, b20a8c786c5a2cb676a2a8b599edf5cfd7ee0c3a) * Fixed segfault in fp16 convolution primitive on future Intel Xeon processors (code name Granite Rapids) (a574ffff870318cc104d8af4a2368d47b433b27f) * Fixed correctness issue in fp16 convolution primitive on future Intel Xeon processors (code name Sierra Forest) (f165ed8a8872e72a7d9651c3dd38bd6c2909fdce) * Fixed correctness issue in int8 convolution primitive on Intel CPUs (ca1592237b87cae5e4a55fb464ad90fb9f91957d, 27845b8e66d354549ac6c6fceeb92c267a9e910f) * Fixed correctness issue in int8 convolution primitive on Intel Data Center GPU Max Series (8bb651cb99e2875aea44b907bdc54418b2d4932a) * Fixed correctness issue in resampling primitive with post-ops on Intel CPUs (aa52a5128d44c6d745b89beabcd47f428665843e) * Addressed excessive memory consumption in 3D convolution on Intel CPUs (3d6412af5cb99863ede8753238533dcabcd3c5d9, 097acb5e108eb57b38a8a2409b083a1819b9f962, fd696639c70c4cd92e2aaf871bc4165c269d29f7) * Fixed segfault in convolution with sum and relu post-ops on Intel CPUs (63ad769939dd8307935caac67c0fc7c9bc9206de, 1b1303748b80360e5f93740d6ea03063132fd8f8, 0a8116b3de98243a234680d8cda869d2f20dd178, 9972cb80a29da9f14efbe8518bc10a21f7ae6e36) * Addressed convolution performance regression with small number of channels on Intel GPUs (d3af87710fcae9561ae22017d45bd670f8858272) * Worked around MSVS 2019 bug resulting in build fails on Windows (40247753290e3e886b9235c5f80a2997eb85372a) * Updated code base formatting to clang-format 11 (23576f935fcef245b26cc78ef74935ea6bb7e6b7, 0b1bf845e05da75e4d994e01a0d7996b64787ece)

- C++
Published by vpirogov over 3 years ago

onednn - graph-v0.8.1

This is a patch release containing the following changes to graph-v0.8:

Upgraded oneDNN dependency from v2.7.2 to v2.7.3 (93237aa, 260bdb5)
Fixed a correctness issue of quantized Convolution + Add fusion (26a9a5b, beba352)
Fixed query_dynamic_outputs() interface implementation in graph compiler backend (8dbca04)

- C++
Published by vpirogov over 3 years ago

onednn - v2.7.3

This is a patch release containing the following changes to v2.7.2: * Fixed segfault in int8 convolution with binary post-ops on Intel CPUs (c8d40c0719f9d9cffa1c5eb04f3f40fa1f9546b8) * Applied workaround for tanh post-op on some Xe architecture based GPUs (3eb3267dc3bcfe64a081731ac9d08c84bc6827f7) * Disabled fp16 post-ops with Compute Library for Arm Architecture (ACL) (f7b7dc0a8b3125602295047cdd7feb3cbb8d9a06) * Fixed incorrect results for sequence of eltwise post-op with same algorithm but different parameters (02c26781171f6350634b41d80cbff7ae5092c1a1, 1c36e27520617e23b74ed32e675804ac7806576e, 81ba0fe626c93e51935d5e8776dd7e8bf4105487) * Fixed issue in convolution with groups and plain activation layout on Intel GPUs (df6f2e34bfb1e3d6bcd5498a4febb149b2be8b2b, d0c14c204782945b3732bd83b7329c314c3339c1) * Fixed reorder failures on Xe HPC architecture based GPUs (c3cb1d5fa7e2e41c7059fa7e5ebcee34aa3e5242) * Fixed thread safety issue in convolution primitive (2955c9d5d5f97f03c4068af37f6783f0be256695) * Fixed scratchpad allocation issue in matmul (989acd3b0dbd304fe47ac7837bb33e73a4ca7cd6) * Disabled concat batching with scales on Intel GPUs since implementation doesn't support it yet (8aab73fe1897542c5ec740ac718b00e7d72edd92, 1eac450ca742cd9905addf36ee038a8e17e03474, 82838de623057ffd1dfc0f879afcd02e72f9538f) * Fixed segfault and correctness issues in convolution primitive with sum and relu post-ops on Intel CPUs (fc335be0d1376f1dca527bd543f929739dffd55f, 0f4697a87c0f550339598c1918d5479801337426, 60f1727fcaf06416c5464b44c177ec16829bd2c1, d28f2c1757e2cc6b792e4fd5de40987e811d086d, 4761ee91b3729d124135273a7450d3d2cf0dce53, f674fbf917e92b2623184ad8c603f20ae4fe0ad7)

- C++
Published by tprimak over 3 years ago

onednn - graph-v0.8

This is the Beta Update 2 release of oneDNN Graph API based on oneDNN v2.7.2.

Functionality

Added HardSigmoid operation.
Added block tensor layout support to improve performance on Xe architecture-based GPUs.
Added support of IOX and XOI weight formats for ConvTranspose operation.
Added query_dynamic_outputs API to support dynamic shapes in the graph. This functionality allows Graph API to infer output tensors shapes based on input tensors.
Experimental: Introduced dynamic shapes support for MHA via oneDNN Graph Compiler.

Known Issues and Limitations

The weight’s opaque layout can be queried only from a compiled partition, which requires that input tensor shapes must be known at compilation time.
MHA and MLP fusion are not activated on machines without Intel AVX-512 support.

Thanks to the Contributors

This release contains contributions from the project core teams as well as Jiong Gong, Chunyuan Wu, Sanchit Jain, Yiqiang Li, Yunfei Mao, Kiefer Kuah and others.

- C++
Published by vpirogov over 3 years ago

onednn - v3.0

Performance Optimizations

Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processor (formerly Sapphire Rapids).
- Introduced FP16 support and initial optimizations for future Intel Xeon Scalable processor (code name Granite Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
Intel Graphics Products:
- Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
AArch64-based Processors:
- Improved reorder performance for processors with Scalable Vector Extensions (SVE) support.
- Improved pooling performance with post-ops for processors with SVE 512 support.
- Improved batch normalization performance with non-default flags for processors with SVE 512 support.
- Improved performance of FP16 functionality with Compute Library for Arm Architecture (ACL).
- Improved deconvolution performance with ACL.
PowerPC64-based Processors:
- Improved int8 GEMM performance. # Functionality
- Introduced new quantization scheme. Major changes include support for per-argument runtime scales in all primitives and unquantized bias.
- [experimental] Introduced Graph API support that simplifies oneDNN integration into applications. The functionality is disabled by default and can be enabled at build time with ONEDNN_BUILD_GRAPH=ON flag.
- Introduced support for Intel DPC++/C++ Compiler 2023.0, including new features from the SYCL 2020 standard.
- Extended persistent cache to cover GPU engine object. This improvement allows applications to further reduce oneDNN initialization time.
- Extended threadpool API with a function to indicate maximum available concurrency.
- Extended binary primitive implementation on GPU with bfloat16 source and int8 destination support.
- Introduced pooling and reduction primitives support on AMD GPUs.
- Introduced reduction primitive support on NVIDIA GPUs.

Usability

Extended the set of supported format tags to cover formats used in applications.

Validation

Extended the GoogleTest (gtest) suite with support for Parametric Rectified Linear Unit (PReLU) primitive.

Breaking Changes

Removed deprecated APIs.
Removed operation descriptor object and made memory descriptor object opaque. See details in operation and memory descriptors RFC.
Removed creation time primitive scales support and primitive output scales support. See details in quantization scaling RFC.
Removed support for Intel DPC++/C++ Compiler 2022 and SYCL 1.2.1 (aka SYCL 2017) standard support. Use Intel DPC++/C++ Compiler and SYCL 2020 standard instead.
Removed Winograd convolution implementation for int8 data type.
Updated minimal supported ACL version to 22.08 (was 22.05).

Thanks to the Contributors

This release contains contributions from the project core team as well as @akshatasangelkar, Aryan Karumuri @AryanKarumuri, Crefeda Rodrigues @cfRod, Divakar Mariyanna @bmdivakar, Gordon Fossum @austinpagan, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, lilianhuang @lilh9598, Milos Puzovic @milpuz01, Mona Minakshi @monaminakshi, Nathan John Sircombe @nSircombe, Peter Caday @petercad, and Sreekanth Yalachigere @sreekanth-yalachigere. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by harrymao2022 over 3 years ago

onednn - v3.0-rc

This is a release candidate for oneDNN v3.0. Please provide feedback and submit defect reports via Github issues.

Performance Optimizations

Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processor (formerly Sapphire Rapids).
- Introduced FP16 support and initial optimizations for future Intel Xeon Scalable processor (code name Granite Rapids).
Intel Graphics Products:
- Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
AArch64-based Processors:
- Improved reorder performance for processors with Scalable Vector Extensions (SVE) support.
- Improved pooling performance with post-ops for processors with SVE 512 support.
- Improved batch normalization performance with non-default flags for processors with SVE 512 support.
- Improved performance of FP16 functionality with Compute Library for Arm Architecture (ACL).
- Improved deconvolution performance with ACL.
PowerPC64-based Processors:
- Improved int8 GEMM performance. # Functionality
- Introduced new quantization scheme. Major changes include support for per-argument runtime scales in all primitives and unquantized bias.
- [experimental] Introduced Graph API support that simplifies oneDNN integration into applications. The functionality is disabled by default and can be enabled at build time with ONEDNN_BUILD_GRAPH=ON flag.
- Introduced support for Intel DPC++/C++ Compiler 2023.0, including new features from the SYCL 2020 standard.
- Extended persistent cache to cover GPU engine object. This improvement allows applications to further reduce oneDNN initialization time.
- Extended threadpool API with a function to indicate maximum available concurrency.
- Extended binary primitive implementation on GPU with bfloat16 source and int8 destination support.
- Introduced pooling and reduction primitives support on AMD GPUs.
- Introduced reduction primitive support on NVIDIA GPUs.

Usability

Extended the set of supported format tags to cover formats used in applications.

Validation

Extended the GoogleTest (gtest) suite with support for Parametric Rectified Linear Unit (PReLU) primitive.

Breaking Changes

Removed deprecated APIs.
Removed operation descriptor object and made memory descriptor object opaque. See details in operation and memory descriptors RFC.
Removed creation time primitive scales support and primitive output scales support. See details in quantization scaling RFC.
Removed support for Intel DPC++/C++ Compiler with SYCL 1.2.1 (aka SYCL 2017) standard.
Removed Winograd convolution implementation for int8 data type.
Updated minimal supported ACL version to 22.08 (was 22.05).

Thanks to the Contributors

This release contains contributions from the project core team as well as @akshatasangelkar, Aryan Karumuri @AryanKarumuri, Crefeda Rodrigues @cfRod, Divakar Mariyanna @bmdivakar, Gordon Fossum @austinpagan, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, lilianhuang @lilh9598, Milos Puzovic @milpuz01, Mona Minakshi @monaminakshi, Nathan John Sircombe @nSircombe, Peter Caday @petercad, and Sreekanth Yalachigere @sreekanth-yalachigere. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by harrymao2022 over 3 years ago

onednn - graph-v0.7.2

This is a patch release containing the following changes to graph-v0.7.1: * Upgraded oneDNN dependency to v2.7.2 (dec9f8cc6)

- C++
Published by vpirogov over 3 years ago

onednn - v2.7.2

This is a patch release containing the following changes to v2.7.1: * Fixed segfaults in deconvolution backpropagation with ACL on AArch64-based processors (f02e6f3f262813b8d0b6cb1f7b55fcc08b4b5bac) * Fixed code generation issues in Intel AVX2 convolution implementation (2ba25236bc417c4d5fe1729ddf9e01f1d1d25fb3, b60633f79947199a1f0cfce7aa42b0ae14690401, 844326b853ba9ca9b7a34ec08ca6e2e28d7332e8, 2009164c2ae90e1e938ab8823c817a6c95fccc11) * Fixed correcteness issues and runtime errors in deconvolution with binary post-ops on Intel GPUs (dd54d3906c9613a967b709907306b946cfe32cac) * Improved performance of convolutions with small number of channels and large spatial sizes on systems with Intel AMX (26f97dc7a47aa2c0f0e13e6ff61dd3fc28fa077b, 4cb648d9e3620876fa7d7dca38a902643cd97dbc) * Fixed runtime error in int8 convolutions with groups on Xe architecture based GPUs (e5a70f43639ba968869a99931d77116791ace355) * Improved inner product weight gradient performance on Xe architecture based GPUs (9e9b859fddc6f813f9b9cac093d7d131c84054ab, 12ec4e3a51ddc105e86e9d29661690750560cd1c) * Improved batch normalization performance with threadpool threading (4fd5ab2dd312b2b79e8f2f1b18b39a94fee39e84) * Improved inner product performance with binary post-ops in broadcast mode on Intel CPUs (d43c70d4aafd58c241d456453994f4c7fe6aefff, 49ca4e17e7fd889c6c153f52dffa6f4d4a10e7c9) * Fixed segfaults and correctness issues in sum primitive with threadpool threading (ee7a3219db8bcdb7870b65b6ee0aadfba2275513) * Extended persistent cache API to cover engine objects (58481d606c19f4e46c1cd7dbfd4aba819ae024d3, 5f69dade29e317eab37455d477892996e80aea75, 16c0a95180a362c079fb2d3f01a4cea084b99628, 068071b326f253791ae767cae25258e6d47426ad) * Added support for newer versions of Intel GPU drivers (71443935355ef4fc52b510be761c487de8677386) * Updated ITT API version to 3.23.0 (d23cc9503f94ea9267bc8b6e654a912caa70e333) * Fixed convolution correctness issue on Intel Data Center GPU Flex Series (365ac202ca2f58078549116a0650a91566a256b6) * Fixed fp64 convolution correctness issue on Intel Data Center GPU MAX Series (9d4bf94d89b945cb703a7b4d04d539daf7aab8b5, 67054032e4b1b4eae11f006e3857fe20a0d7b16a) * Fixed correctness issues in reduction primitive with binary post-op on Intel GPUs (ae9d075dbba068287b6cb280f0f22d3cdcbfcb36, e3b80c58f493e7972eb4d0317518534c1d8412e9) * Improved convolution performance on on Intel Data Center GPU MAX Series (90be8d501f3b35e88f997bf9e0fd139a740f72f7, caf4863f40dd06b807d2bb1abb487aad21d586a6) * Fixed build errors with ONEDNNENABLEPRIMITIVEGPUISA build option (de2db042bbb733de7c925224934ded766de74d68) * Fixed correctness issues in convolution with per-tensor binary post-ops on Intel CPUs (9cf9c189f6f674bba38ea11217f4b06acab87194) * Improved convolution performance on Intel Data Center GPU Flex Series (8b08a07574888bc265818a751eab82aa28115d72)

- C++
Published by vpirogov over 3 years ago

onednn - graph-v0.7.1

This is a patch release containing the following changes to graph-v0.7:

Fixed a build issue in compiler backend (70258d306)
Optimized for zero points folding (d6f12b50c)
Fixed a primitive descriptor cache issue in reorder fusion (08876524d)

- C++
Published by vpirogov over 3 years ago

onednn - v2.7.1

This is a patch release containing the following changes to v2.7: * Fixed performance regression for batch normalization primitive in TBB and threadpool configurations (cd953e4ca7390387b53fba7105f81a6fc1fc0382) * Improved grouped convolution performance on Xe Architecture GPUs (d7a781e166ef3206d9b0ab79a69d76034d663c20, cb1f3fe27f466a26b484ed063546bd0b6c4cd306, 4e844740d6b26709c0aa3c2604ed52130560208a, 7ba3c40f65425c4bc2b922ae7b2cdd8cb8e5181c) * Fixed runtime error in int8 reorder on Intel GPUs (53532a9944b2e4694d4c0135f0a1a5102ca97613) * Reverted MEMFD allocator in Xbyak to avoid segfaults in high load scenarios (3e29ae26dba137a6232669bd1c5d42ad4449b794) * Fixed a defect with incorrect caching of BRGEMM-based matmul primitive implementations with trivial dimensions (87cd9796a98497ab9a3ff5250ad3a396199590fb) * Improved depthwise convolution performance with per-tensor binary post-ops for Intel CPUs (f430a5a4c883ef846f938f571020565d41719e9c) * Extended threadpool API to manage maximum concurrency (8a1e9595f131e1303887fba407a03dbd64ac301e, 64e559454787651186ed6a32e4eef2a17132b9b6) * Fixed potential integer overflow in BRGEMM-based convolution implementation (25ccee38b97e935e6c3c729b9134804c6a2ea6a7) * Fixed performance regression in concat primitive with any format on Intel CPUs (2a60adec0e73895caefb3dc7d1de74b5eac8c6da, feb614d5fef07fb2a188ceef15ebeaf9f9f45acf) * Fixed compile-time warnings in matmul_perf example (b5faa77a4a651f1e44fa77348eded54ea3ec3eef) * Fixed 'insufficient registers in requested bundle' runtime error in convolution primitive on Xe Architecture GPUs (4c9d46acc35126fec2b59125403566a90b6bed36) * Addressed performance regression for certain convolution cases on Xe Architecture GPUs (f28b58aec55c5087127702f7c0a38d21b3006d35, 18764fbef1f1f90bc696fe35d059685b2b37f149) * Added support for Intel DPC++/C++ Compiler 2023 (c3781c671dcc23c0fa16eb648c98ef33b79c737b, a1a8952656b2e84a4124cc0d2f8c7aae10e62a46, 9bc87e635dbeffd77808c70fbd51ac5dc834b582, e3b19871cab6c9b5c317cddb18f4264575868ed7) * Fixed int8 matmul and inner product performance regression on Xe Architecture GPUs (3693fbf0e8b0cd3bcc2308a4504772c0af2eaf88, c8adc179133f7212523f4ecb1cdab648b0cec796) * Fixed accuracy issue for convolution, inner product and matmul primitives with tanh post-op on Xe Architecture GPUs (88b4e57718014bd50f78461a5c80dc680074f9b6, 83ce6d27a8699d7ab0d1ee450e2e7e9ec87a6e13, 6224dc6b3e2073c98f4b8278bf7e87769dd85a55, 10f0d0ade797a90c93b7450c1e0b151dc415dab3) * Suppressed spurious build warnings with GCC 11 (44255a8a57dc40ccc8f7b464e5638d6715216756)

- C++
Published by vpirogov over 3 years ago

onednn - v2.6.3

This is a patch release containing the following changes to v2.6.2: * Fixed potential integer overflow in BRGEMM-based convolution implementation (deb5595a0f96b54f9106cb846e6fc4e0af49aadf) * Fixed a defect with incorrect caching of BRGEMM-based matmul primitive implementations with trivial dimensions (305bed526492f2400a1a7fdfcb54b0ee41adc67e) * Extended benchdnn performance benchmarking capabilities on GPU with device-side performance measurement mode (ba8632592018070a46e4d349bbe3628756022c15) * Fixed segfault in pooling primitive on CPUs (689d874bbf0a3e1bdc75e99ad2453e6aac9cfe84)

- C++
Published by tprimak over 3 years ago

onednn - graph-v0.7

This is the Beta Update release for oneDNN Graph API based on oneDNN v2.7 release.

Functionality

Added operations Select, LogicalAnd, LogicalOr, LogicalXor, LogicalNot, Greater, GreaterEqual, Equal, NoeEqual, Less, and LessEqual.
Added boolean data type to support logical operations.
Added support for passing compilation context to the compile API. This feature allows passing additional information, like tensor shape context, for the backend to generate better kernel code.
Introduced convolution block fusion via oneDNN Graph Compiler.
Experimental: Introduced dynamic shapes support for multi-level perceptron (MLP) block via oneDNN Graph Compiler.

Known Issues and Limitations

The weight’s opaque layout can be queried only from a compiled partition, which requires that input tensor shapes must be known at compilation time.
MHA and MLP fusion are not activated on machines without Intel AVX-512 support.

Thanks to the Contributors

This release contains contributions from the project core teams as well as Jiong Gong, Chunyuan Wu, Sanchit Jain, Yiqiang Li, Yunfei Mao, Kiefer Kuah and others.

- C++
Published by vpirogov over 3 years ago

onednn - graph-v0.6

This is the Beta release for oneDNN Graph based on oneDNN v2.7 release.

Functionality

Introduced FP32, BF16, FP16, and INT8 inference support on GPU.
Introduced FP32 and BF16 training support on GPU.
Introduced support for floating point math mode at graph construction phase. The mode allows the implementation to use low precision datatype for computations when possible.
Added graph::finalize() function to indicate that the user has finished adding operations into the graph and the graph is ready for partitioning.
Added operations AbsBackprop, Mish, MishBackprop, and LeakyReLU.
Updated API and operation definitions to comply with oneDNN Graph Specification 1.0-beta.

Usability

Integrated Graph component headers, source and build system into oneDNN:
- Headers moved to include/oneapi/dnnl.
- Source moved to src/graph.
- Graph functionality is included into single shared object or dynamic library produced by the build system.
Aligned API with oneDNN:
- Shared common dnnl::engine and dnnl::stream. The original dnnl::graph::engine and dnnl::graph::stream API were removed.
- Added a new make_engine_with_allocator() API to create dnnl::engine with dnnl::graph::allocator.
- A few common basic types were shared between oneDNN and oneDNN Graph, including dnnl_status_t, dnnl_data_type_t, and dnnl_dims_t, etc.
Introduced ONEDNN_BUILD_GRAPH build option to manage Graph component build.

Validation

Introduced ONEDNN_GRAPH_DUMP environment variable that serialized library graph and subgraph into JSON files.
Added the initial version of benchdnn graph driver which can be used to benchmark the performance with a dumped graph JSON file.

Breaking changes

Removed operations HardTanh, Index, Pow, etc. Please check the operation kind list for details.

Known Issues and Limitations

Graph Compiler component is not included with this release. It will be reinstated in oneDNN Graph Beta Update release.
The weight’s opaque layout can be queried only from a compiled partition, which requires that input tensor shapes must be known at compilation time.
Build option ONEDNN_BUILD_GRAPH is not compatible with some of the build options supported by the build system including ONEDNN_GPU_RUNTIME=OCL, ONEDNN_ENABLE_WORKLOAD=INFERENCE, ONEDNN_ENABLE_PRIMITIVE, and others.

Thanks to the Contributors

This release contains contributions from the project core teams as well as Jiong Gong, Chunyuan Wu, Sanchit Jain, Yiqiang Li, Yunfei Mao, Kiefer Kuah and others.

- C++
Published by vpirogov over 3 years ago

onednn - v2.7

Performance Optimizations

Intel Architecture Processors
- Improved performance for future Intel Xeon Scalable processors (code name Sapphire Rapids).
- Introduced performance optimizations for bf16 floating point math mode on Intel Xeon Scalable processors (code name Sapphire Rapids). The bf16 math mode allows oneDNN to use bf16 arithmetic and Intel AMX instructions in computations on fp32 data.
Intel Graphics Products
- Improved performance for future Xe Architecture graphics (code name Ponte Vecchio).
- Introduced performance optimizations for tf32 floating point math mode on future Xe Architecture graphics (code name Ponte Vecchio). The tf32 math mode allows oneDNN to use tf32 arithmetic in computations on fp32 data.
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M)
AArch64-based Processors
- Improved convolution and binary primitive performance for processors with SVE 512 support.
- Improved shuffle and eltwise primitives performance for processors with SVE 256 and SVE 128 support.
- Improved PReLU, batch normalization, and pooling primitives performance via Compute Library for the Arm Architecture (ACL).
- Improved performance of inner product, matmul, convolution, and batch norm primitives with post-ops via ACL.
PowerPC64-based Processors
- Introduced performance optimizations for int8 and bfloat16 GEMM. # Functionality
- Introduced runtime output scales support in all primitives.
- Introduced scales support in concat primitive.
- Extended floating point math mode API with tf32 data type option.
- Extended eltwise primitive with support for hardsigmoid algorithm.
- Extended layer normalization primitive with support for mixed source and destination data types.
- Extended depthwise post-op with support for arbitrary padding size. The implementation is available only on Intel processors.
- Added limited fp64 data type support in convolution primitive. Optimized implementation is available for future Xe Architecture graphics (code name Ponte Vecchio).
- Extended int8 convolution and deconvolution implementations on GPUs with arbitrary destination data type support.
- Extended batch normalization primitive with dnnl_fuse_norm_add_relu flag that allows to fuse sum and relu operations. The implementation is available for Intel GPUs.
- Extended GPU deconvolution primitive implementation with support for output scales and zero points.
- Introduced threadpool threading support for AArch64-based processors.
- Introduced Unified Shared Memory (USM) support for SYCL backend on NVIDIA GPUs.
- Introduced initial support for AMD GPUs via MIOpen library. Supported primitives include Local Response Normalization (LRN), softmax, and eltwise.
  # Usability
- Added matmul_perf example that benchmarks matmul primitive for all supported data types.
- Introduced annotations for JIT kernels to allow profilers like Linux perf to correctly label JIT code.
- Extended verbose logs converter with RNN primitive support.
- Added verbose output for dnnl_*gemm* calls.
- Removed Level Zero headers from the list of build time dependencies.
- Adjusted NVIDIA GPU implementation to comply with oneDNN numerical behavior. Implicit downconvert to fp16 and tf32 are now managed via math mode API.

Validation

Added benchdnn driver for validation of internal BRGEMM implementation.
Improved benchdnn reference implementation performance with threadpool threading model.
Extended benchdnn performance benchmarking capabilities on GPU with device-side performance measurement mode (mode=po).

Deprecated Functionality

Support for SYCL 1.2.1 (aka SYCL 2017 standard) is deprecated and will be removed in the future releases.
Static output scales are deprecated and will be removed in the next release.
Convolution Winograd algorithm implementation for int8 data type is deprecated and will be removed in the next release.

Breaking Changes

Changed formula for AUGRU RNN cell to align with Tensorflow. See proposal for details.

Thanks to the Contributors

This release contains contributions from the project core team as well as Aidan Belton @AidanBeltonS, @akshatasangelkar, Alex Bojan @lb991, Crefeda Rodrigues @cfRod, Damian Szwichtenberg @dszwicht, Diana Bite @diaena, Divakar Mariyanna @bmdivakar, Emilio Cota @cota, Gordon Fossum @austinpagan, Hugh Delaney @hdelan, Jacek Czaja @jczaja, @jakpiase, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, Kotha Sowmya @Sowmyakotha1999, Louie Tsai @louie-tsai, Mark Ryan @markdryan, MITSUNARI Shigeo @herumi, Mona Minakshi @monaminakshi, @NaNAGISaSA, Nathan John Sircombe @nSircombe, Peter Caday @petercad, @pgorlani, Sreekanth Yalachigere @sreekanth-yalachigere, Tadej Ciglarič @t4c1, and Thiago Macieira @thiagomacieira. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by harrymao2022 over 3 years ago

onednn - v2.7-rc

This is a release candidate for oneDNN v2.7. Please provide feedback and submit defect reports via Github issues.

Performance Optimizations

Intel Architecture Processors
- Improved performance for future Intel Xeon Scalable processors (code name Sapphire Rapids).
- Introduced performance optimizations for bf16 floating point math mode on Intel Xeon Scalable processors (code name Sapphire Rapids). The bf16 math mode allows oneDNN to use bf16 arithmetic and Intel AMX instructions in computations on fp32 data.
Intel Graphics Products
- Improved performance for future Xe Architecture graphics (code name Ponte Vecchio).
- Introduced performance optimizations for tf32 floating point math mode on future Xe Architecture graphics (code name Ponte Vecchio). The tf32 math mode allows oneDNN to use tf32 arithmetic in computations on fp32 data.
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M)
AArch64-based Processors
- Improved convolution and binary primitive performance for processors with SVE 512 support.
- Improved eltwise and shuffle primitives performance for processors with SVE 256 and SVE 128 support.
- Improved PReLU, batch normalization, and pooling primitives performance via Compute Library for the Arm Architecture (ACL).
- Improved performance of inner product, matmul, convolution, and batch norm primitives with post-ops via ACL.
PowerPC64-based Processors
- Introduced performance optimizations for int8 and bfloat16 GEMM. # Functionality
- Introduced runtime output scales support in all primitives.
- Introduced scales support in concat primitive.
- Extended floating point math mode API with tf32 data type option.
- Extended eltwise primitive with support for hardsigmoid algorithm.
- Extended layer normalization primitive with support for mixed source and destination data types.
- Extended depthwise post-op with support for arbitrary padding size. The implementation is available only on Intel processors.
- Added limited fp64 data type support in convolution primitive. Optimized implementation is available for future Xe Architecture graphics (code name Ponte Vecchio).
- Extended int8 convolution and deconvolution implementations on GPUs with arbitrary destination data type support.
- Extended batch normalization primitive with dnnl_fuse_norm_add_relu flag that allows to fuse sum and relu operations. The implementation is available for Intel GPUs.
- Extended GPU deconvolution primitive implementation with support for output scales and zero points.
- Introduced threadpool threading support for AArch64-based processors.
- Introduced Unified Shared Memory (USM) support for SYCL backend on NVIDIA GPUs.
- Introduced initial support for AMD GPUs via MIOpen library. Supported primitives include Local Response Normalization (LRN), softmax, and eltwise.
  # Usability
- Introduced annotations for JIT kernels to allow profilers like Linux perf to correctly label JIT code.
- Extended verbose logs converter with RNN primitive support.
- Added verbose output for dnnl_*gemm* calls.
- Removed Level Zero headers from the list of build time dependencies.
- Adjusted NVIDIA GPU implementation to comply with oneDNN numerical behavior. Implicit downconvert to fp16 and tf32 are now managed via math mode API.

Validation

Added benchdnn driver for validation of internal BRGEMM implementation.
Improved benchdnn reference implementation performance with threadpool threading model.
Extended benchdnn performance benchmarking capabilities on GPU with device-side performance measurement mode (mode=po).

Deprecated Functionality

Support for SYCL 1.2.1 (aka SYCL 2017 standard) is deprecated and will be removed in the future releases.
Static output scales are deprecated and will be removed in the next release.
Convolution Winograd algorithm implementation for int8 data type is deprecated and will be removed in the next release.

Breaking Changes

Changed formula for AUGRU RNN cell to align with Tensorflow. See proposal for details.

Thanks to the Contributors

This release contains contributions from the project core team as well as Aidan Belton @AidanBeltonS, @akshatasangelkar, Alex Bojan @lb991, Crefeda Rodrigues @cfRod, Damian Szwichtenberg @dszwicht, Diana Bite @diaena, Divakar Mariyanna @bmdivakar, Emilio Cota @cota, Gordon Fossum @austinpagan, Hugh Delaney @hdelan, Jacek Czaja @jczaja, @jakpiase, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, Kotha Sowmya @Sowmyakotha1999, Louie Tsai @louie-tsai, Mark Ryan @markdryan, MITSUNARI Shigeo @herumi, Mona Minakshi @monaminakshi, @NaNAGISaSA, Nathan John Sircombe @nSircombe, Peter Caday @petercad, @pgorlani, Sreekanth Yalachigere @sreekanth-yalachigere, Tadej Ciglarič @t4c1, and Thiago Macieira @thiagomacieira. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by harrymao2022 over 3 years ago

onednn - v2.6.2

This is a patch release containing the following changes to v2.6.1: * Removed unused variables (2500b0f6c1931f4b0b22b5fc92fcc87c6b875a3f, b4e00322c93984082b987408af8a2e341c7fd6c2) * Fixed correctness issue in fp32 convolution implementation for cases with large spatial size (207af06637ccf36fb08c5fd93b55d52a578cfa5a) * Fixed correctness issue in bfloat16 matmul implementation for processors with Intel AMX support (404b762f27350d5ad59225d966310b481951451e) * Fixed correctness issue in int8 reorder implementation with zero points (b340cba1cadc8fc6424945b5b2a09960bd8d47ec) * Improved int8 matmul and inner product primitives performance with small matrices for processors with Intel AMX support (73b75723921e9881b88b027a8f1b2d42251f6403, 58b386a21cfc9dbb7c331626e9e4752751cdf415) * Improved int8 convolution performance for processors with Intel DL Boost support (f35a62f9b3c1db5ce8a2704e530e050b2f4b1807) * Aligned AUGRU formula with Tensorflow definition (e47c6c570d97545b56f3afef77ce9fbd63ea320b, 4ba0a577947733690cdd0f9ecf269121148a28e1, b311e24ac3b669d6200b595201107601b6ce1f58) * Suppressed 'unvectorized loop' warning for Intel C/C++ Compiler (3932d0493586963df3cefb3c8f35cb6503cd444e)

- C++
Published by vpirogov over 3 years ago

onednn - graph-v0.5.2

This is a patch release containing the following changes to graph-v0.5.1:

Deprecated quantized ReLU fusion patterns (85405a94)

- C++
Published by tprimak almost 4 years ago

onednn - v2.6.1

This is a patch release containing the following changes to v2.6: * Extended depthwise convolution post-op with support for arbitrary filter size, stride, and padding (79b019b102c5d68843d52473f7d26a80597d84d2) * Improved GEMM performance with threadpool threading on system with Intel AVX2 instruction set (2be0060dbf0291687bb8243068121d6cdda30ec2) * Fixed runtime error in GPU reduction primitive for specific tensor sizes (efbf9b5e8c12666314f3484ce279cee0a1a91a44) * Improved convolution performance on GPUs with Xe-HPG IP (f8de0c93e9ff53a7d0a41b97aabc85e828020881, c1fb8acd0f74f63db021d41dedcd54546aab5289) * Updated ITT API to 3.22.5 (9b186765dded79066e0cd9c17eb70b680b76fb8e) * Fixed correctness issues in reorder implementation for non-x64 systems (9961b8698b603842c79b492d82a05ba8dccb15da, 102063159c37b63c80fe6310e4d0481370a8ff02, 8b960dfaf43c417ed86b7da25451c12151c1a87b, ef1d9fa441f2e4e5c06a34042934cc272171a2e1, 8edd85907f42b72f9ace5dbc2bfcf43a63ce3d1b, 39edcf61e162d7f3a7449e05bfedccd1301fe34e, 3e0a0d9dbff6dd1c5e5d94f3c29727d489af7917, 1dff6251dd262c3bf1c5ec36a24ad9c2c46f2624, 8661958a4f4fce5c3f1dd65f30b03d9742579179) * Fixed handling on inf and -inf values in eltwise log algorithm (732cbdd2651bc8ea4c7ae125c29e542fecd79b8e, 3fd0f2e44c84869181aa2506e8924c37e9267b64) * Improved depthwise convolution performance on GPUs with Xe-HPG IP (7a6fe1d964d423a22d9e3525f7851a7d221460ad) * Addressed fails in test_isa_hints gtest on GPUs (78c1c68305f81cb087f3e4dc2cebb07cace1ef4d) * Fixed issues with bfloat16 GEMM producing NaNs in certain cases on GPUs with Xe-HPC IP (5d659707f0cd9bc432e5f74d6e9d8b3bbc4776ad) * Changed default layout to blocked for depthwise convolutions to avoid spurious reorders (78f231b03f4a1126991f4e725b75c090925fd870) * Addressed issue with incorrect values in padded areas for convolution with post-ops on GPUs (2e4ad3ab7182cbc666af3a5c32d59bbd7cf710b7) * Fixed build issues with -Werror=odr option (27668dd728a3a3460315e44275490daab317fa8d) * Addressed issues detected by clang USAN in BRGEMM kernel (2bbaa3092b27dc0bf08dc2c534e3ee761d6fb6e0, 9b3826f762de28b2c35aa8f9249b916973b7b140, b59b02716367e64e35264093828da1c0b3edc646)

- C++
Published by vpirogov almost 4 years ago

onednn - graph-v0.5.1

This is a patch release containing the following changes to graph-v0.5:

Fixed the layout propagation of Reshape and Transpose operators in oneDNN backend (3b681d4, 09863f9)
Enabled scalar Divide + MatMul fusion in oneDNN backend (d4c7dc6)
Enabled Convolution + LeakyReLU fusion in oneDNN backend (b0f4dbb, c8fb4c13, e15979e)
Improved the document of fusion patterns (b9a52384)
Fixed operands swapping for binary operators (a07bfdac, d2567d7)
Worked around a false positive build issue in GCC11 for compiler backend (17a40d0)

- C++
Published by vpirogov almost 4 years ago

onednn - graph-v0.4.3

This is a patch release containing the following changes to graph-v0.4.2:

Upgraded to oneDNN v2.5.4 patch release (3418ec1)
Fixed compiler backend to build with downstream projects when LLVM is used (c73dd858)
Fixed the layout propagation of Reshape and Transpose operators in oneDNN backend (cbdb736f)

- C++
Published by vpirogov about 4 years ago

onednn - graph-v0.5

This is the Alpha release for oneDNN Graph API based on oneDNN v2.6 release.

Functionality

Introduced FP32 and BF16 training support on CPU.
Introduced multiple layer perceptron (MLP) fusion supported by oneDNN Graph compiler with optimized code generation (experimental).
Updated API to comply with oneDNN Graph API specification v1.0-alpha.

Known Issues and Limitations

The weight’s opaque layout can be queried only from a compiled partition, which requires that input tensor shapes must be known at compilation time.
MHA and MLP fusion are not activated on machines without AVX-512 support, as oneDNN Graph compiler generates AVX-512 and newer instructions.

Thanks to the Contributors

This release contains contributions from the project core teams as well as Jiong Gong, Chunyuan Wu, Sanchit Jain, Yiqiang Li, Yunfei Mao, Kiefer Kuah and others.

- C++
Published by vpirogov about 4 years ago

onednn - v2.6

Performance Optimizations

Intel Architecture Processors
- Improved performance for future Intel Xeon® Scalable processors (code name Sapphire Rapids). The functionality requires Linux kernel 5.16 or later.
- Improved performance of matmul primitive for processors with Intel AVX-512 support.
Intel Graphics Products
- Improved performance for future Xe Architecture graphics (code name Ponte Vecchio).
- Improved performance for future Intel Arc graphics (code name Alchemist and DG2).
AArch64-based Processors
- Improved binary primitive performance with Arm Compute Library (ACL).
- Improved shuffle primitive performance for processors with SVE 512 support.

Functionality

Introduced bfloat16 destination support for int8 convolution, matmul and inner product primitives for processors with Intel AVX-512 support and or future Intel Xeon® Scalable processors (code name Sapphire Rapids)
Extended RNN primitive with support for AUGRU cell.
Added support for non-zero negative slope in ReLU post-op for batch normalization primitive.
Introduced support for mixed source and destination data types in softmax primitive.
Introduced persistent cache API. This functionality allows to serialize and reuse JIT kernels.

Usability

Added build time options to manage the set of supported instruction set architectures on Intel Graphics Products. See ONEDNN_ENABLE_PRIMITIVE_GPU_ISA for more details. This feature further reduces the binary footprint.
Extended built time options ONEDNN_ENABLE_PRIMITIVE and ONEDNN_ENABLE_WORKLOAD to GPU implementations. This feature further reduces the binary footprint.
Reduced stack consumption in GEMM implementation.
Added command line help to benchdnn.

Deprecated Functionality

Support for SYCL 1.2.1 (aka SYCL 2017 standard) is deprecated and will be removed in future releases.

Breaking Changes

Removed performance optimizations for Intel Xeon Phi processors. oneDNN will continue to be functional on these processors using Intel AVX2 codepath.

Thanks to the Contributors

This release contains contributions from the project core team as well as Arthur Mitrano @aaraujom, Aslan @aslanxie, Attila T. Áfra @atafra, Damian Szwichtenberg @dszwicht, Diana Bite @diaena, Joel Dippold @jedippold, Jonathan Deakin @jondea, Jonathan Louis Kaplan @JLouisKaplan-Arm, Kentaro Kawakami @kawakami-k, Luke Ireland @LukeIreland1, Mesut Meterelliyoz @mmeterel, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Tengfei Han @Tengfei09, and Thiago Macieira @thiagomacieira. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by harrymao2022 about 4 years ago

onednn - v2.5.4

This is a patch release containing the following changes to v2.5.3: * Improved performance for batch normalization for tbb/threadpool (421a2cef07e2fe730a8ee6bbd0c55ad7e154eb3c, 7b7b763e8e264ec46bba1772d28aad07abf20d50) * Fixed implicit conversion from double to float in examples (866b9ac4429d2a3e9751546ba101d0df11cfb519) * Fixed issue in int8 matmul primitive for specific shapes (035c2d42e99e79956e4fe833f01b7b6e5509913c, 9a1bf19b40ed5493d8bbcc3ef5cb4d276a85e78e) * Fixed performance regression for matmul primitive with binary post op and broadcast (dcd61efe83cb5b64e60fa6d294320a34fb8734c3, 31dec32e71624e9c7f6b4da39a0d2da757b06906) * Fixed performance regression in binary primitive when using NHWC layout (228493c38711bf62d4dd4b534af890c2ae6b2ad1)

- C++
Published by tprimak about 4 years ago

onednn - v2.6-rc

This is a release candidate for oneDNN v2.6. Please provide feedback and submit defect reports via Github issues.

Performance Optimizations

Intel Architecture Processors
- Improved performance for future Intel Xeon® Scalable processors (code name Sapphire Rapids). The functionality requires Linux kernel 5.16 or later.
- Improved performance of matmul primitive for processors with Intel AVX-512 support.
Intel Graphics Products
- Improved performance for future Xe Architecture graphics (code name Ponte Vecchio).
- Improved performance for future Intel Arc graphics (code name Alchemist and DG2).
AArch64-based Processors
- Improved binary primitive performance with Arm Compute Library (ACL).
- Improved shuffle primitive performance for processors with SVE 512 support.

Functionality

Extended RNN primitive with support for AUGRU cell.
Introduced support for mixed source and destination data types in softmax primitive.
Introduced persistent cache API. This functionality allows to serialize and reuse JIT kernels.

Usability

Added build time options to manage the set of supported instruction set architectures on Intel Graphics Products. See ONEDNN_ENABLE_PRIMITIVE_GPU_ISA for more details. This feature further reduces the binary footprint.
Extended built time options ONEDNN_ENABLE_PRIMITIVE and ONEDNN_ENABLE_WORKLOAD to GPU implementations. This feature further reduces the binary footprint.
Reduced stack consumption in GEMM implementation.
Added command line help to benchdnn.

Deprecated Functionality

Support for SYCL 1.2.1 (aka SYCL 2017 standard) is deprecated and will be removed in future releases.

Breaking Changes

Removed performance optimizations for Intel Xeon Phi processors. oneDNN will continue to be functional on these processors using Intel AVX2 codepath.

Thanks to the Contributors

This release contains contributions from the project core team as well as Arthur Mitrano @aaraujom, Aslan @aslanxie, Attila T. Áfra @atafra, Damian Szwichtenberg @dszwicht, Diana Bite @diaena, Joel Dippold @jedippold, Jonathan Deakin @jondea, Jonathan Louis Kaplan @JLouisKaplan-Arm, Kentaro Kawakami @kawakami-k, Luke Ireland @LukeIreland1, Mesut Meterelliyoz @mmeterel, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Tengfei Han @Tengfei09, and Thiago Macieira @thiagomacieira. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by harrymao2022 about 4 years ago

onednn - graph-v0.4.2

This is a patch release containing the following changes to graph-v0.4.1: * Fixed compiled partition cache by checking CPU threading number (68f262a, 343246e) * Enabled binary add and multiply patterns (71a0cfef) * Fixed the MHA (multi-head attention) patterns in compiler backend and benchdnn graph (45bbcb3, caaf841) * Fixed the build issues for semi-compiler backend (62dd2ca, 738276a, 347f1a9, 2123326)

- C++
Published by vpirogov about 4 years ago

onednn - v2.5.3

This is a patch release containing the following changes to v2.5.2: * Fixed accuracy issue in GELU post-op (3ff2c3d6cfe1bd0b2eee015f97c6fe0515f14bff) * Added ability to enable code only on non-x64 systems (ff7ae00c074ea09048bcc27f905385d1f6a6b830) * Fixed issue in reorder primitive on non-x64 systems (5917860513ec94274a157d9642695393858cd205) * Fixed build issue on OSX11 and older cmake (d9c8bbeb884ab916d47aa57b0bec9abbd7f89542) * Fixed assert in reorder primitive (79090bc2907c7b201aa5ecd9a7ffe3a347fe2444) * Documentation fixes (d29075852b8713c487253e7c664b2c78e3663327, ee7eacb012b95a677c3c1dae0d05e730d99f180e, 543b8f86f4659d36dddcff1d2bddb810d3740889) * Fixed potential division by zero in example for binary primitive (2fffd96b7461ec1088b271e2df9d1fe077d45ff0) * Fixed SIGFPE issue in reorder primitive (8c291fca11b08786948a1e198c7bca051e525347) * Fixed potential size overflow in inner product primitive (c10f74a0e71d79ad06d813e403e63e2ee0e2b260) * Added logic to reduce the number of threads (tasks spawned for threadpool) for small shapes (8f885e76a2221565ababd3718a2c1441b1300780, 405398994009fb97ea137a7e300a494489c29bc7, 49ec406751d2ba03e9166d36aec94e4d6dd236bd, 2977360e146148f17d60861829a41c674856e8f6) * Fixed SEGFAULT issue in matmul primitive (62c1170d7741c261722167d06f28d7f5e18d14ee, a993d522ff68f310186b73c5e1ec473c221c7869) * Added bf16 support for sum post-op (3d2c37e4b069d4188741c1a8c40e6eb7404e68a2) * Added fp:precise compiler flag for Intel Compiler identified as IntelLLVM (1558a4bfd894d73f55030d06df73584af71525d6) * Fixed issue in bf16 convolution primitive when fused with binary (b379fd9c3715af38fb2067f39aea19fa90191024) * Fixed issue in backward depthwise convolution (d5e4122f6429cb73312620c9100a65a0ad66a0a7, f5cac2346d6198a958f50c8be7cbf968191018aa, eeaa19c4e87c2ee96a50a0df0dadd8d045e94774) * Fixed SEGFAULT in int8 convolution with eltwise post_op (32a629fef18b087554dabcaa983d0158654b2fe3) * Fixed NaN issue in bf16 backward inner product (0c5e49205d63f05b72779c2e2f9419bb42144e64) * Fixed performance regression for binary with broadcast (f79b03072dbdc373ce3a5435c41d899ddee9eddb, 58ce3c1de0e8e8c387275f0d649a3a26b726c640)

- C++
Published by tprimak about 4 years ago

onednn - graph-v0.4.1

This is a patch release containing the following changes to graph-v0.4: * Upgraded oneDNN to v2.5.2 (b557b497, 6aae6f7a) * Enabled MatMul + Div + Add fusions (effa3350, 3f5a8f7a, f9ffcc5c)

- C++
Published by vpirogov over 4 years ago

onednn - v2.5.2

This is a patch release containing the following changes to v2.5.1: * Fixed performance regression in binary primitive with broadcast (b9721743614f9dcb477a86d82fc19a96dc7e5736, ff751229eeb7ff546491f54b1060c03ec241c673) * Fixed issue with SYCL device properties initialization (cabc5ca62e1b109161bc7cfccaa0ca5ba1f7b639, 095f13e77b9440307b152590f61c6e21e0c026a5) * Fixed issue in matmul primitive with zero points (3157354dd5498fa83f2d8da17b25138d78a5c13b) * Fixed segmentation fault in depthwise convolution primitive for shapes with huge spatial size for processors with Intel AVX-512 support (68347644ace88ef9dc7bcbba928674e3f9ac1b08, 1d2addcf5a11a6bf034001e809cbea1c89942f0f) * Fixed issue in forward convolution primitive for processors with Intel AVX2 support (d691137c245efab99651c95387e1713d3cf91fb7) * Fixed performance regression on GPUs with SYCL runtime (d8364e5b4c88f27143894bb7835c65eb22770e16)

- C++
Published by tprimak over 4 years ago

onednn - graph-v0.4

This is a technical preview for oneDNN Graph API based on oneDNN v2.5.

Functionality

Introduced bf16 inference support.
Introduced multi-head attention (MHA) fusion supported by oneDNN Graph compiler with optimized code generation (experimental).
Updated API to comply with oneDNN Graph API specification v0.9.

Known Issues and Limitations

Some subgraphs might not be recognized as a partition even if it matches the general pattern description due to internal implementation.
The weight’s opaque layout can be queried only from a compiled partition, which requires that tensor shapes must be known at compilation time.
MHA fusion is not activated on machines without AVX-512 support, as oneDNN Graph compiler generates AVX-512 and newer instructions.

Thanks to the Contributors

This release contains contributions from the project core teams as well as Jiong Gong, Chunyuan Wu, Sanchit Jain, Yiqiang Li, Yunfei Mao, Kiefer Kuah and others.

- C++
Published by tprimak over 4 years ago

onednn - v2.5.1

This is a patch release containing the following changes to v2.5: * Improved performance of binary primitive and binary post-op with broadcast over batch and spatial dimension (6d4b092ac3aaf81cf71e85e5a639c46f942c1e5c, c4dc38a70de60a92e541e581994f6a53e90c8110, be261ab0e3dae81fbf2b41b2a4038ffb940c5c75, 3ec15b6976eab124db4a5d22a02d8d1e8a2c2001, f1c2f9f3400446addd636194305138b4e6ce8a0b) * Fixed undefined behavior for cases when different number of threads used at primitive creation and execution (0af92ec8ad0575883c04bc14436f13a5cc02d8fa, ba2e5a95d5585d28630971eba0edf59caa3673b0, 8863e34de693072cf5d299503a2601ab4cfacabe, 57b1e7ad3d80d61b2b3f820fd08090e548acd9b7, 72b54def7d421ee57acaa60642935c8348610632, 9b394dd5e8f661a7c0582daae9dc8fc562bf8220, 2d4d88a7c7e701aacdcde164bfbc166a73e9feef, 4c3e771c109bdc09f909baf339721dea26ceb6c6, 2458105c93b5451370efee2117a137e6223a63df, 67990405dc63d01f11cc5b464e1ce0a5106e0232, edc40fd6e65ee2a7cadd33129d4873fe4e6f6ccf) * Replaced deprecated SYCL APIs with SYCL 2020 alternatives (2c2f4a4707484e15c59adb0aac3563a2ca4f202c, a090db862cc7ae8e77365f61cb8d716b3af3af99) * Fixed documentation formatting issues (812085dd49ffe432b49ed6b86f28c37734fc2eeb, 591a0f449295c0d971348bbcfd3a3acf454158fd, 7eadf81d3bba83f4e38044144e165213dce09234, 75a2f06b7ad30b03b0f3eee20f786b94d744a5fb, b73c8a7034b3a7b5c3e95d181c267aea0d411092, ca1eb7710121ef2bacaca79d536471abf31daf25) * Updated Microsoft Visual Studio build instructions (add953a66fdda58237aad5c57b93e106886b3b45, 42b9904847ae8d206a87fdb0222708a4334b676a)

- C++
Published by vpirogov over 4 years ago

onednn - v2.5

Performance Optimizations

Intel Architecture Processors
- Improved performance for future Intel Xeon Scalable processors (code name Sapphire Rapids). The functionality is now enabled by default and requires Linux kernel 5.16.
- Improved performance of matmul primitive for processors with Intel AVX-512 support.
Intel Graphics Products
- Introduced initial optimizations for future Xe Architecture graphics (code name Ponte Vecchio).
- Improved pooling and layer normalization primitives performance.
AArch64-based Processors
- Improved softmax and logsoftmax primitives performance with Arm Compute Library (ACL)

Functionality

Introduced support for compiler with SYCL 2020 standard support.
Introduced support for the ICX/ICPX and DPCPP compiler drivers distributed with Intel oneAPI DPC++ Compiler on Windows.

Usability

Added compile time option to manage the set of supported instruction set architectures on Intel64/AMD64 processors. See 'DNNLENABLEPRIMITIVECPUISA' for more details. This feature further reduces the binary footprint.
Added environment variables and build options with ONEDNN prefix.
Introduced support for QNX operating system.
Introduced support for RISC-V architecture.

Breaking Changes

The Intel MKL-DNN compatibility API is removed. See Transition from Intel MKL-DNN to oneDNN page for instructions on moving to the new API.
Updated minimal supported ACL version to 21.11 (was 21.08).

Deprecated Functionality

Support for Intel Xeon Phi processors is deprecated and will be removed in the next release.
Support for SYCL 1.2.1 (aka SYCL 2017 standard) is deprecated and will be removed in future releases.

Thanks to the Contributors

This release contains contributions from the project core team as well as Aaron Franke @aaronfranke, Arthur Mitrano @aaraujom, Crefeda Rodrigues @cfRod, Diana Bite @diaena, Joel Dippold @jedippold, Joe Konno @thac0, Jonathan Deakin @jondea, Luke Ireland @LukeIreland1, Mark Ryan @markdryan, Mesut Meterelliyoz @mmeterel, Michel Migdal @Michoumichmich, Nathan John Sircombe @nSircombe, Pablo Romero @pablorcum, Peter Caday @petercad, Sergey Razumovskiy @srazumov, and Tsao Zhong @CaoZhongZ. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by vpirogov over 4 years ago

onednn - v2.5-rc

This is a release candidate for oneDNN v2.5. Please provide feedback and submit defect reports via Github issues.

Performance Optimizations

Intel Architecture Processors
- Improved performance for future Intel Xeon Scalable processors (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
- Improved performance of matmul primitive for processors with Intel AVX-512 support.
Intel Graphics Products
- Introduced initial optimizations for future Xe Architecture graphics (code name Ponte Vecchio).
- Improved pooling and layer normalization primitives performance.
AArch64-based Processors
- Improved softmax primitive performance with Arm Compute Library (ACL)

Functionality

Introduced support for compiler with SYCL 2020 standard support.
Introduced support for the ICX/ICPX and DPCPP compiler drivers available in the Intel oneAPI DPC++ Compiler.

Usability

Added compile time option to manage the set of supported instruction set architectures on Intel64/AMD64 processors. See 'DNNLENABLEPRIMITIVECPUISA' for more details. This feature further reduces the binary footprint.
Added environment variables and build options with 'ONEDNN' prefix.
Introduced support for QNX operating system.
Introduced support for RISC-V architecture.

Breaking Changes

The Intel MKL-DNN compatibility API is removed. See Transition from Intel MKL-DNN to oneDNN page for instructions on moving to the new API.

Deprecated Functionality

Support for Intel Xeon Phi processors is deprecated and will be removed in the next release.
Support for SYCL 1.2.1 (aka SYCL 2017 standard) is deprecated and will be removed in future releases.

Thanks to the Contributors

This release contains contributions from the project core team as well as Aaron Franke @aaronfranke, Arthur Mitrano @aaraujom, Crefeda Rodrigues @cfRod, Diana Bite @diaena, Joel Dippold @jedippold, Joe Konno @thac0, Jonathan Deakin @jondea, Luke Ireland @LukeIreland1, Mark Ryan @markdryan, Mesut Meterelliyoz @mmeterel, Michel Migdal @Michoumichmich, Nathan John Sircombe @nSircombe, Pablo Romero @pablorcum, Peter Caday @petercad, Sergey Razumovskiy @srazumov, and Tsao Zhong @CaoZhongZ. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by vpirogov over 4 years ago

onednn - v2.4.4

This is a patch release containing the following changes to v2.4.3: * Fixed incorrect results for reorder with zero-points on CPUs (ee63629a52dfb355d9f93ed5ac0c72440ad24b65) * Fixed an issue with reorder with zero-points not respecting rounding mode on processors without Intel DL Boost support (a165c4a5284820b341d7e5e114f0de689b87f38f) * Fixed correctness issue in bfloat16 inner product weight gradient on processors with Intel DL Boost support (b782f190bd40f5b3bbae2d66c5922cc8793f236c) * Improved bfloat16 inner product weights gradient performance on processors with Intel AMX support (ebf9f817f12c2db4447eb4000aadab411879fe36) * Fixed potential undefined access in convolution, inner product, matmul, and RNNs primitives on processors with Intel AMX support (dcd98ad372108f6fa33510b6e9acd7bae491df83)

- C++
Published by vpirogov over 4 years ago

onednn - graph-v0.3

This is a technical preview for oneDNN Graph API based on oneDNN v2.4.

Functionality

Introduced int8 inference support.
Updated API to comply with oneDNN Graph API specification v0.8.

Known Issues and Limitations

Some subgraphs might not be recognized as a partition even if it matches the general pattern description due to internal implementation.
The weight’s opaque layout can be queried only from a compiled partition, which requires that tensor shapes must be known at compilation time.

Thanks to the Contributors

This release contains contributions from the project core teams as well as Tian Feng, Zhang Guoming, Jiong Gong, Chunyuan Wu, Nishant Patel, Yiqiang Li, Yang Sheng, Yunfei Mao, Kiefer Kuah and others.

- C++
Published by vpirogov over 4 years ago

onednn - v2.4.3

This is a patch release containing the following changes to v2.4.2: * Fixed and issue with reorder primitive producing NaN results for some cases on future Intel Xeon Scalable processor (code name Sapphire Rapids) (ac20af36ea3aced6a1dafe83447ffaee424fefa3) * Fixed performance regression for inner product primitive for future Intel Xeon Scalable processor (code name Sapphire Rapids) (ac6a24d4d2db987665bb0832ea31b3f11ca42844, 2cf3526fdbb246b51db597ac730268a6391f87e2, d02dddf7b35ca4ea3c63a9d4aaa9a7c4be2d1cd8, bcdc17531179cc414c3d6e858f6708938ebcb7bb) * Fixed segmentation fault in int8 deconvolution primitive with asymmetric quantization for processors with Intel AVX-512 support (6ba086ae6be066222d46693803aefeacecf3ed0b)

- C++
Published by vpirogov over 4 years ago

onednn - v2.4.2

This is a patch release containing the following changes to v2.4.1: * Fixed performance regression for convolution primitive for the shapes with 3D spatial for future Intel Xeon Scalable processor (code name Sapphire Rapids) (aca0af1d192ef40ec93b45d24edd509e2905156d) * Fixed segmentation fault in bfloat16 forward and backward inner product primitive or future Intel Xeon Scalable processor (code name Sapphire Rapids) (ae8cf18e35a8ed069a0898d77f5548781ecf5e4b, 3de9549d5817ea50638ad208293bb2e2fbb15048) * Fixed reorder primitive with compensation (6ba086ae6be066222d46693803aefeacecf3ed0b) * Fixed issue in scratch pad size calculation for BRGEMM-based convolutions (dd9eceb88ed683806fb853a33bfab71982414fa3)

- C++
Published by tprimak over 4 years ago

onednn - v2.3.3

This is a patch release containing the following changes to v2.3.2: * Reverted check for memory descriptor stride validity for unit dimensions (861c6252a5957bd908a5183fcaa4cd7e29b61192) * Fixed build errors on Fucshia OS (753b5310317938d37379a34c7d817fb94f61efec) * Fixed implicit conversion in GPU GEMM implementation (30dee23e2d60e255ecc0b0dc199cdccb284c66b1) * Addressed issues detected by clang TSan (888ab523863e59433da70e72a14927a590673f89, 7555fd839fe04320fd0a470019f798b0b958c45e, 4ffdb3cd09f46df7b7d544c76f0b63d8cb4801ca, 57b8ffd8ec2bea4c15687e52cf906c7c5cb1202a, b52b2c09a54968b5bc8739034f2bcb925af0b9ce, 84b200f6493a094f46b48503e4519d457ad7d0b5, 67deb8eae31cecb42f8339c2baf9cb7b9ff73962) * Fixed undefined access issues detected by clang UBSan (5bab17c967e4550f7168943605cdd8b93833d564, 3494b1e973f2db02dac09c70ad6292ac07fa881a, 688536052e9395f8c8e8ecddc9cc28d42ff19272, 8cbe861995c27ee1be24b22c13067f011e919c7b, b13a2156857a16bb0677469b80eba30b56f6f91f, 859622df60226b3de38169ec9cd40a7b3715d6b9, 5813c99f69d8033289cbfff47a97df4cb0420abd) * Fixed memory leak in CPU GEMM implementation (45e3039fbed8407cc3bc09db5228f6e620ba3286, fd6d14caa1085ead840b6f5470d6bc69f66336f8) * Fixed int8 convolution correctnes issues on Intel Integrated Graphics (b7d40a0ec245bb33960e8dc5571d1e0c462c3c3b, 72e48568014a22cfc0886794bcae0d36527c339b) * Fixed access violation issue in GEMM implementation on Windows (aac6b2325379c3954ff773c8eca9a903958ba90a)

- C++
Published by vpirogov over 4 years ago

onednn - v2.4.1

This is a patch release containing the following changes to v2.4: * Reduced scratch pad size requirements for BRGEMM-based convolutions (a81ce3cce3a82ad074d1cdc50b73d4104910c1e0) * Worked around an issue with the number of threads detection on AMD processors (ad901e5489564d0035be0b4ec41f1cff4be96610)

- C++
Published by vpirogov over 4 years ago

onednn - v2.4

Performance Optimizations

Improved primitive cache performance for Intel Graphics products.
Intel Architecture Processors
- Improved performance for future Intel Xeon Scalable processor (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
- Improved binary primitive performance for cases when one of the tensors is broadcasted.
- Improved performance of reduction primitive, reorder, shuffle primitives.
- Improved performance of depthwise convolution forward propagation for processors with Intel AVX5-12 support
- Improved performance of forward inner product primitive for the shapes with minibatch equal to 1 for processors with Intel AVX-512 support
- Improved performance of int8 matmul and inner product primitives for processors with Intel AVX2 and Intel DL Boost support
Intel Graphics Products
- Introduced initial optimizations for future Intel Arc graphics (code name Alchemist and DG2).
- Improved performance of convolution and deconvolution primitives with new JIT convolution kernel generator implementation. These optimizations are identified by jit:ir marker in oneDNN verbose log.
AArch64-based Processors
- Added support for bfloat16 acceleration with Arm Compute Library (ACL). The behavior is controlled by floating point math mode API.
- Improved inner product, matmul, and eltwise primitives performance with ACL.
- Introduced support for sum and for indirect and Winograd convolution implementations with ACL.
NVIDIA Graphics
- Improved convolution performance with eltwise post-op.

Functionality

Introduced PReLU post-op support in convolution and matmul.
Extended maximum allowed post-ops chain for compute primitives (convolution, deconvolution, inner product, and matmul) to 32.
Introduced support for zero points in sum post-op for convolution and matmul. The functionality is implemented only for CPUs.
Extended binary primitive with support for mixed data types for input tensors. The functionality is implemented only for CPUs.
Extended sum post-op for convolution and matmul primitives with support for mixed data types. The functionality is implemented only for CPUs.
Added Unified Shared Memory (USM) support for OpenCL GPU runtime.

Usability

Added compile time options to manage the set of supported primitives and workload types. See DNNL_ENABLE_WORKLOAD and DNNL_ENABLE_PRIMITIVE in build options for more details. This feature allows to reduce binary footprint of the library for specialized applications.
Reduced overall library size by trimming down use of templates, OpenCL headers, and TBB headers. The configurations that benefitted the most are CPU only configuration with TBB threading and GPU only configuration. Note, that binary footprint depends on the compiler used to build the library and build options.
Introduced floating point math mode API. The API allows the library to use bfloat16 or float16 hardware acceleration in fp32 operations. Currently this mode is supported only on AArch64 processors when oneDNN is built with ACL.
Added a build option DNNLLIBRARYNAME to change the library name and CMake target. This feature helps projects that use multiple oneDNN configurations.

Breaking Changes

Updated minimal supported ACL version to 21.08 (was 21.05).

Deprecated functionality

Intel MKL-DNN compatibility API is deprecated and will be removed in the next update. See Transition from Intel MKL-DNN to oneDNN page for instructions on moving to new API.
Support for Intel Xeon Phi processors is deprecated and will be removed in the next release.

Thanks to the Contributors

This release contains contributions from the project core team as well as Aleksandr Nikolaev @alenik01, Arthur Mitrano @aaraujom, Crefeda Rodrigues @cfRod, Diana Bite @diaena, Jing Xu @jingxu10, Kentaro Kawakami @kawakami-k, Kevin Putnam @intelkevinputnam, Mesut Meterelliyoz @mmeterel, MITSUNARI Shigeo @herumi, Nathan John Sircombe @nSircombe, Nicolas Chauvet @kwizart, Peter Caday @petercad. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by vpirogov over 4 years ago

onednn - v2.4-rc

This is a release candidate for oneDNN v2.4. Please provide feedback and submit defect reports via Github issues.

Performance Optimizations

Improved primitive cache performance for Intel Graphics products.
Intel Architecture Processors
- Improved performance for future Intel Xeon Scalable processor (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
- Improved binary primitive performance for cases when one of the tensors is broadcasted.
- Improved reorder primitive performance for memory formats with padding and/or zero points.
Intel Graphics Products
- Introduced initial optimizations for future Intel Arc graphics (code name Alchemist and DG2).
AArch64-based Processors
- Improved inner product and eltwise primitives performance with ACL.
- Introduced support for sum and for indirect and Winograd convolution implementations with ACL.
NVIDIA Graphics
- Improved convolution performance with eltwise post-op.

Functionality

Introduced PReLU post-op support in convolution and matmul.
Extended maximum allowed post-ops chain for compute primitives (convolution, deconvolution, inner product, and matmul) to 32.
Introduced support for zero points in sum post-op for convolution and matmul. The functionality is implemented only for CPUs.
Extended binary primitive with support for mixed data types for input tensors. The functionality is implemented only for CPUs.
Extended sum post-op for convolution and matmul primitives with support for mixed data types. The functionality is implemented only for CPUs.
Added USM support for OpenCL GPU runtime.

Usability

Added compile time options to manage the set of supported primitives and workload types. See DNNLENABLEWORKLOAD and DNNLENABLEPRIMITIVE in build options for more details. This feature allows to reduce binary footprint of the library for specialized applications.
Reduced overall library size by trimming down use of templates, OpenCL headers, and TBB headers. The configurations that benefitted the most are CPU only configuration with TBB threading and GPU only configuration. Note, that binary footprint depends on the compiler used to build the library and build options.
Introduced floating point math mode API. The API allows the library to use bfloat16 or float16 hardware acceleration in fp32 operations. Currently this mode is not supported in the implementation.
Added a build option DNNLLIBRARYNAME to change the library name and CMake target. This feature helps projects that use multiple oneDNN configurations.

Breaking Changes

Updated minimal supported ACL version from 21.08 (was 21.05).

Deprecated functionality

Intel MKL-DNN compatibility API is deprecated and will be removed in the next update. See Transition from Intel MKL-DNN to oneDNN page for instructions on moving to new API.

Thanks to the Contributors

This release contains contributions from the project core team as well as Aleksandr Nikolaev @alenik01, Arthur Mitrano @aaraujom, Diana Bite @diaena, Jing Xu @jingxu10, Kentaro Kawakami @kawakami-k, Kevin Putnam @intelkevinputnam, MITSUNARI Shigeo @herumi, Nathan John Sircombe @nSircombe, Nicolas Chauvet (kwizart) @kwizart, Peter Caday @petercad. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by vpirogov over 4 years ago

onednn - graph-v0.2

This is a technical preview for oneDNN Graph API based on oneDNN v2.3.2.

oneDNN Graph API extends oneDNN with a unified, high-level graph API for multiple AI hardware classes (CPU, GPU, accelerators). The graph interface integrates with the deep learning frameworks and inference engines to maximize opportunities for performance optimizations across a variety of hardware targets. This preview has full support for the oneAPI Graph programming model and partial support of the operations in oneDNN Graph API specification v0.7.

Learn more about oneDNN Graph API: * Introduction to oneDNN Graph API * Getting started with C++ API * Getting started with DPC++ API

Supported Functionality

C++ and DPC++ API.
Graph partition and compilation API.
Operations and fusions targeting fp32 inference for CNNs, MLPs, and transformer neural networks.

Performance Optimizations

Backend implementation relies on oneDNN and includes performance optimizations for Intel Architecture processors with Intel SSE4.1, Intel AVX, Intel AVX2, or Intel AVX512 instruction set.

Validation

Gtest suite is available for basic functional testing.
Comprehensive functional and performance validation is covered by the extended version of benchdnn.

Known Issues and Limitations

Some subgraphs might not be recognized as a partition even if it matches the general pattern description due to internal implementation.
The weight’s opaque layout can be queried only from a compiled partition, which requires that tensor shapes must be known at compilation time.
Binary operation with scalar and tensor inputs is not optimized.

Thanks to the Contributors

This release contains contributions from the project core teams as well as Jiong Gong, Pinzhen Xu, Chunyuan Wu, Jianping Chen, Scott Cyphers, Nishant Patel, Yiqiang Li, Yang Sheng, Kiefer Kuah, Adam Straw, Tim Zerrell, Namrata Choudhury and others.

- C++
Published by vpirogov over 4 years ago

onednn - v2.3.2

This is a patch release containing the following changes to v2.3.1: * Fixed performance regression in fp32 inner product primitive for processors with Intel AVX512 support (3e379b8c51a2fc2e72be6c49c9e6855f003af9e6) * Removed assert related to Winograd convolution algorithm dispatching on GEN9 GPUs (2b4f73adf89a3804dd5018014596ad2354309d40)

- C++
Published by tprimak almost 5 years ago

onednn - v2.3.1

This is a patch release containing the following changes to v2.3: * Improved int8 GEMM performance for processors with Intel AVX2 and Intel DL Boost support (f5c071bc371c26cac30bb68cda3ab1224ed697c1) * Fixed integer overflow for inner product implementation on CPUs (66971b57889d1246c643d736e50195c1bcd46a60) * Fixed out of bounds access in GEMM implementation for Intel SSE 4.1 (4e81df0a26e520c161527d52ce63d55734e9dabb) * Fixed correctness issue for depthwise convolution post-op with non-default scales on CPUs (783e1d6f035d20915cc1c8722d1b512888111beb, 066c832f7a2f6892a79c3f1b5a04b1a5f236e874) * Fixed crash for s8 binary primitive on Windows (d9fd397e2f130dddffbd2ced37edb300a2ba7649) * Fixed performance regression in fp32 to u8 reorder for Intel AMX specific memory formats (97f40cf0efef17361e948423a0b4fc2db04a903c, 532648adff4fe8590838f1f90409463b9237e358) * Fixed correctness issue for bfloat16 convolution weight gradient on processors with Intel AMX support (053406d0fd5a91f3e64adb81828be1632b74f9a5, 6649b759a5e801ad095c3c44d74c1dc27ab82617) * Fixed correctness issue for bfloat16 inner product backpropagation on processors with Intel AMX support (a2e6c55261bb3c353a295b7e2e57d403e5d73696) * Fixed correctness issue for bfloat16 convolution with padded memory formats on GEN9 GPUs (c0aea07a7e5b21829e4d484e232b9eccf49128d4) * Fixed correctness issue for int8 matmul primitive with zero points on processors with Intel AMX support (55cb716084cc625bc97e5f90b4f82bb2fcd72962) * Fixed segfault in depthwise convolution post-op on CPUs (ad466354b3108c4cacb1b85a6f93f8bdfe9d4e59)

- C++
Published by vpirogov almost 5 years ago

onednn - v2.3

Performance Optimizations

Extended primitive cache to improve primitive descriptor creation performance.
Improved primitive cache performance in multithreaded configurations.
Intel Architecture Processors
- Introduced initial optimizations for bfloat16 compute functionality for future Intel Xeon Scalable processor (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
- Improved performance of binary primitive and binary post-op for cases with broadcast and mixed source and destination formats.
- Improved performance of reduction primitive.
- Improved performance of depthwise convolution primitive with NHWC activations for training cases
Intel Graphics Products
- Improved fp32 and fp16 Winograd convolution performance.
- Introduced support for automatic selection between direct and Winograd convolution algorithms.
- Improved int8 depthwise convolution performance.
- Improved performance of reorder, shuffle, concat, binary, and batch normalization primitives
- Improved layer normalization performance for blocked formats.
AArch64-based Processors
- Improved reorder primitive performance for systems with SVE 128 and SVE 256 support.
- Improved eltwise primitive performance for systems with SVE 512 support.

Functionality

Extended batch normalization and layer normalization primitives API to take separate scale and shift arguments.
Extended resampling primitive with post-ops support and mixed source and destination data types.

Usability

Introduced binary distribution in conda-forge. Supported configurations cover Linux, Windows, and macOS operating systems and Intel64/AMD64, Aarch64, and PPC64 architectures.
Introduced support for GPU-only build. This configuration helps to reduce binary footprint for applications targeting GPU.
Introduced an option to use GNU OpenMP as CPU runtime for DPC++ configuration.
Introduced verbose log converter. This tool processes oneDNN verbose logs and generates test cases for benchdnn.

Breaking Changes

Updated minimal supported CMake version from to 2.8.12 (was 2.8.11).
Updated minimal supported ACL version from 21.05 (was 21.02).

Thanks to the Contributors

This release contains contributions from the project core team as well as Alexandre Truong @aletru01, Arthur Mitrano @aaraujom, fitchbe @fitchbe, Isuru Fernando @isuruf, Joe Ramsay @joeramsay, Kentaro Kawakami @kawakami-k, leizheng1 @leizheng1, Nomoto Kazuhiro @NomotoKazuhiro, Peter Caday @petercad, Pablo Romero @pablocum, Takumi-H @Takumi-Honda, Uwe L. Korn @xhochy, Vasily Rubtsov @vasilyru. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by vpirogov almost 5 years ago

onednn - v2.3-rc2

This is a release candidate for oneDNN v2.3. Please provide feedback and submit defect reports via Github issues.

- C++
Published by vpirogov almost 5 years ago

onednn - v2.2.4

This is a patch release containing the following changes to v2.2.3: * Fixed build error with GCC 11 (eda1add9567b2491a5e4892a0f8ba7aa1c0016cd) * Fixed an issue with reorder reporting unimplemented when quantizing f32 weights to s8 (4f05b76bb765ed8a892be3325730992763025f0b, 5d3d1e18747f210a121cf00d909024ff7b5d8b16, cc77eef809d0331b245eb21a7956d507505700aa) * Updated name for GPU gen12 architecture to xe (3d202c205473daec426a6de3a32e074db372c09d)

- C++
Published by vpirogov almost 5 years ago

onednn - v2.3-rc

This is a release candidate for oneDNN v2.3. Please provide feedback and submit defect reports via Github issues.

- C++
Published by vpirogov almost 5 years ago

onednn - v2.2.3

This is a patch release containing the following changes to v2.2.2: * Fixed a bug in int8 depthwise convolution ptimitive with groups and 1d spatial size for processors with Intel AVX-512 and Intel AVX2 support (8a784c60fa3d074bd719ff7a8aecfe8ff7ff8966, f0e4af96163e5fa41320d24cc6952980b843ca7b) * Fixed correctness issue for PReLU primitive on Intel Processor Graphics (f3c3daf8a67477fcf3dceb826ea9e84c641ed67d) * Fixed corretness issue in reorder for blocked layouts with zero padding (68f05d00ae7743f16b41decd9da27599fdb191ec, d51616bc7ebee49f501086ace373d20833cea6fa, fd2c6421f1eff12822ba8808e0f979c60e21b2cd) * Improved performance of weights reorders used by BRGEMM-based convolution primitive for processors with Intel AVX-512 support (23b2ec0d6f73aba06c722c54eeb6d6ac0082242b, 10f81875774d0cdf8b293146bc0277daa330a48a, 4c0819c432cfad488c897cf1deefe0e89cb11749) * Added -fp-model=precise build flag for DPC++ code (3e40e5e92ebcf40a9115827ce568d32c5049f74a) * Fixed potential memory leak in matmul primitive (36dba73d0f584d30ce714415a59f42db735f4494) * Fixed performance of matmul primitive when fused with bias update and sum (f993b25dbe71010fc63ef0a5591ce6d85c9e47c3) * Fixed a bug in matmul primitive when writing to non-contiguous destination buffer (36d25d4308a0bc5906df44f6ef6afc2074699500)

- C++
Published by vpirogov about 5 years ago

onednn - v2.2.2

This is a patch release containing the following changes to v2.2.1: * Fixed performance regression in fp32 forward inner product for shapes with number of output channels equal to 1 for processors with Intel AVX-512 support (714b1fd7f9ee51cc4b8f8a09ac9a0fc9be8403c9) * Fixed performance regression in forward convolutions with groups for processors with Intel AVX-512 support(3555d4a76e63f07fd36fdeea3947e0267bfcb814) * Removed -std=c++11 build flag for DPC++ headers (1fcb867e37ef48c82ee2c720a0405ad4e6299300) * Fixed buffer access in initializing workspace in RNN implementation on GPU (9b0309142937001f7140f80c451a294d31464626) * Fixed fix a bug in convolution with 1x1 kernel and mixed strides on processors with Intel AVX-512 support (d0b3e3fe0b15d9d8c05d21b97df303cdfb101076) * Used getauxval for Linux to get CPU features on for AArch64 systems (25c4ceaca3472dbd340dc942718a4e4b22c8a77c) * Added -fp-model=precise build flag for DPC++ code (3e40e5e92ebcf40a9115827ce568d32c5049f74a) * Fixed out-of-bounds writes in elementwise primitive on Intel Processor Graphics (bcf823c48574e163f34abbd4226d7a7af52bf374)

- C++
Published by tprimak about 5 years ago

onednn - v2.2.1

This is a patch release containing the following changes to v2.2: * Fixed segfault for cases when primitive descriptor or attributed contain NaN (e6d05ecf20a110f83bf037be99c6c5110bf4d981, dbca1e9370c49fa4fe0fa0b4a42a4fa86b6e64a6, 0326b096eff60a2813265dce1bcb31c12177023d, 0326b096eff60a2813265dce1bcb31c12177023d) * Fixed engine creation failure for GPU subdevices (4c3a11438405ca191b1efc24b057286fc236c2d2) * Fixed long lines clipping in verbose output (70d70a8d064ad802344d90f6395760ef9bd720e2) * Fixed segfault in bfloat16 convolution weight gradient implementation on processors with Intel AMX support (a3a73a370797bc4b28a6868d533a6fbed0dad0df) * Fixed performance regression in binary primitive with per_oc broadcast strategy (9ac85d8508658adf0b141844f2355448aa5a3a2a) * Worked around a bug with Microsoft Visual C++ compiler version detection in CMake 3.19 (2f39155b256367e2b37ce782a222144a0b294cdc) * Removed -std=c++11 build flag for DPC++ code to align with SYCL standard (1b026f5e303649d9c0f98168a922e6f085001d3c)

- C++
Published by vpirogov about 5 years ago

onednn - v2.1.3

This is a patch release containing the following changes to v2.1.2: * Updated xbyak_aarch64 to support Apple silicon (dd1a02ab2a962bbeadfc0d2e53fedf39ed2b7b7e, 913010b253eccd4654c29f78c81227f7342e3262, 2d155dd22c59f4a059e9a7903c503d2221542811) * Fixed segfault in fp32 depthwise convolution with padded memory (2d8283f575d0a0a43a8a967f659f95e2fd8dd866) * Fixed potential issues in BRGEMM-based convolution implementation (b183dffa0fefa2c342070daae95c00ff274e8310, d2b1653f28f35ea3dc93c10ba6b9b538e80ba08e) * Fixed memory leak on NVIDIA GPUs (06803f2c2834b67a357fdb24d03ea906b9ffdd3a)

- C++
Published by vpirogov about 5 years ago

onednn - v2.2

Performance Optimizations

Intel Architecture processors
- Improved performance of int8 compute functionality for future Intel Xeon Scalable processor (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
- Improved performance of compute functionality for future Intel Core processor with Intel AVX2 and Intel DL Boost instructions support (code name Alder Lake).
- Improved fp32 inner product forward propagation performance for processors with Intel AVX-512 support.
- Improved dnnl_gemm performance for cases with n=1 on all supported processors.
Intel Graphics products
- Introduced NHWC format support for activations for int8 primitives.
AArch64-based processors
- Improved performance of fp32 and int8 convolution, and softmax primitives for processors with SVE 512 support.
- Improved performance of fp32 convolution via Arm Compute Library (ACL).
- Improved performance of convolution with a combination of sum and relu post-ops via ACL.

Functionality

Extended eltwise primitive with support for mish and hardswish algorithms.
Extended binary primitive with support for comparison operators.
Introduced support for post-ops in GPU resampling implementation.
Introduced asymmetric quantization support for int8 deconvolution.
Introduced binary post-ops support for matmul primitive.

Usability

Improved presentation of oneDNN primitives in VTune Amplifier.
Introduced Linux perf support for AArch64.
Introduced support for Fujitsu C++ compiler.
Introduced a build time check for minimal supported ACL version. Currently oneDNN requires ACL 21.02 or later.
Added support for cuDNN 8.x

Thanks to the contributors

This release contains contributions from the project core team as well as Aleksandr Nikolaev @alenik01, araki.kenichi @qnet-araki, Arthur Mitrano @aaraujom, Dr-Noob @Dr-Noob, Gmc2 @GHGmc2, higuchi.motoko @higuchi-motoko, Joe Ramsay @joeramsay, Kentaro Kawakami @kawakami-k, Louie Tsai @louie-tsai, masafumi yamazaki @m-ymzk, Nathan John Sircombe @nSircombe, Takumi-H @Takumi-Honda. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by vpirogov about 5 years ago

onednn - v2.1.2

This is a patch release containing the following changes to v2.1.1: * Improved performance of forward convolution with plain activations for processors with Intel AVX-512 support (2147a58a6b075edcbb8b03fb158a73b7e706c324) * Enabled I-cache refreshing before executing JIT-ed code for AArch64 systems (9f3bc1c9279dde44383ef476ae49e813142b3cdc) * Returned blocked layouts as default for forward training (7af2898e65136ad2dd8cfc280027428e3ef2ec72, bd4826d8f098d196a9502d0c6d347f0956a243ad)

- C++
Published by tprimak about 5 years ago

onednn - v2.2-rc

This is a release candidate for oneDNN v2.2. Please provide feedback and submit defect reports via Github issues.

- C++
Published by vpirogov about 5 years ago

onednn - v2.1.1

This is a patch release containing the following changes to v2.1: * Improved performance of fp32 depthwise convolution with plain activations on CPU (762a9c75a01476457d705c1e98f4d28f74b80e4d) * Worked around internal compiler error in GCC 7.3.1 when building with --std=c++14 (f637501d41e0d9a1515430a5530fca53fe656903) * Fixed memory leaks in batchnorm and gemm implementations (2ea5385402c2b3d6995b9e6bb8cb773339d9b7c2, 4f3a7cf1bc3009415a2cd065ffe2ed4ed45fda6c) * Addressed several issues in benchdnn and gtests (bb7bdb41e13ff47d7993e29827b3e60697c4809a, 0e04cc29a09eacc81d9e0dd705b55381b19166ea, d7df8d2240ea0c4d5ce74a209ccf652dd7094570, a59354fad484c46dd98956c406534d371d3fd08e)

- C++
Published by vpirogov over 5 years ago

onednn - v2.1

Performance optimizations

Reduced overheads associated with primitive cache.
Intel Processor Graphics and Xe architecture-based Graphics:
- Improved performance of Winograd convolution.
- Improved functionality performance for padded memory formats.
- Improved performance of reorder and shuffle primitives for multiple formats and all dimensions.
- Improved performance of pooling primitive for float16 data type.
- Improved performance of lnorm primitive for plain formats.
- Improved performance of resampling primitive for blocked formats.
Intel Architecture processors
- Introduced initial optimizations for bfloat16 functionality for future Intel Xeon Scalable processor with Intel AMX support (code name Sapphire Rapids).
- Improved performance of int8 and bfloat16 RNN and inner product primitives.
- Improved performance of shuffle primitive for bfloat16 data type.
- Introduced CPU ISA hints environment variable and API. New API is intended to dispatch function implementations using YMM registers to improve performance on processors with a single Intel AVX512 compute unit.
- Improved forward convolution performance for Intel AVX-512 systems.
- Introduced initial performance optimizations for future Intel Core processor with Intel AVX2 and Intel DL Boost instructions support (code name Alder Lake).
- Improved performance of int8 primitive for processors with Intel SSE4.1 instruction set support.
- Improved convolution and batch normalization performance with threadpool.
AArch64-based processors
- Improved performance of Winograd convolution with ArmCL.
- Improved performance of int8 convolution with ArmCL.
- Added JIT support for Aarch64 and JIT implementations for reorder, eltwise, pooling, and batch normalization primitives.
NVIDIA GPUs
- (preview) Introduced support for NVIDIA GPU. The implementation relies on DPC++ Compiler, cuDNN, and cuBLAS libraries.

New Functionality

Introduced int8 support for LSTM primitive with projection for CPU.
Introduced binary post-op for (de)-convolution, pooling, eltwise, binary, inner product, matmul and reduction (GPU only) along with performance optimizations for CPUs and GPUs.
Extended the number of supported post-ops for primitives to 20.
Extended eltwise primitive with support for logsigmoid and clip_v2 algorithms.
Introduced support for PRelu primitive.
Extended matmul implementation with support for per-output channel zero-points for quantization.
Extended support for broadcasting in binary primitive to both inputs for CPU.
Introduced float16 support in reduction primitive for GPU.
Introduced support for mixed input and output types in binary primitive for GPU.

Usability

Added API to enable displaying timestamps in oneDNN verbose mode. Timestamps allow to use oneDNN verbose output in profiling tools.

Validation

Extended benchdnn to report operation bandwidth.
Added ability to choose target GPU in benchdnn.

Thanks to the contributors

This release contains contributions from the project core team as well as Alejandro Alvarez, Aleksandr Nikolaev @alenik01, araki.kenichi @qnet-araki, Arthur Mitrano @aaraujom, Benjamin Fitch, Ben Tracy @CodeplayBen, Daniel Soutar @danielsoutar, @dylan-angus-codeplay, Diana Bite @diaena, higuchi.motoko @higuchi-motoko, Jacob Kahn @jacobkahn, Kentaro Kawakami @kawakami-k, Kumudha KN @KumudhaN, kurihara @Koji-Kurihara, Mehdi Goli @mehdi-goli, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Rafik Saliev @rfsaliev, Xinyu Chen @xinyu-intel, yuri@FreeBSD @yurivict. We would also like to thank everyone who asked questions and reported issues.

- C++
Published by anita-intel over 5 years ago

onednn - v2.1-rc

This is a release candidate for oneDNN v2.1. Please provide feedback and report bugs in Github issues.

- C++
Published by anita-intel over 5 years ago

Recent Releases of onednn

onednn - v3.9.1

onednn - v3.9

Performance Optimizations

Intel Architecture Processors

Intel Graphics Products

AArch64-based Processors

Functionality

Functional API

Graph API

Microkernel API

Intel Architecture Processors

Intel Graphics Products

Generic GPU Vendor

Usability

Validation

Known Limitations

Deprecated Functionality

Thanks to our Contributors

onednn - v3.9-rc

Performance Optimizations

Intel Architecture Processors

Intel Graphics Products

AArch64-based Processors

Functionality

Functional API

Graph API

Microkernel API

Intel Architecture Processors

Intel Graphics Products

Generic GPU Vendor

Usability

Validation

Known Limitations

Deprecated Functionality

Thanks to our Contributors

onednn - v3.8.1

onednn - v3.8

Performance Optimizations

Intel Architecture Processors

Intel Graphics Products

AArch64-based Processors

Functionality

Common

Intel Architecture Processors

Intel Graphics Products

Generic GPU Vendor

NVIDIA GPUs

Usability

Validation

Deprecated Functionality

Breaking Changes

Thanks to our Contributors

onednn - v3.7.3

onednn - v3.8-rc

Performance Optimizations

Intel Architecture Processors

Intel Graphics Products

AArch64-based Processors

Functionality

Common

Intel Architecture Processors

Intel Graphics Products

Generic GPU Vendor

NVIDIA GPUs

Usability

Validation

Breaking Changes

Thanks to our Contributors

onednn - v3.7.2

onednn - v3.7.1

onednn - v3.7

Performance Optimizations

Intel Architecture Processors

Intel Graphics Products

AArch64-based Processors

NVIDIA GPUs

Functionality

Common

Intel Architecture Processors