Recent Releases of https://github.com/ashvardanian/simsimd

https://github.com/ashvardanian/simsimd - Release v6.5.3

Release: v6.5.3 [skip ci]

Patch

  • Make: Co-package PyTests (#278) (4491f09)

Published by ashvardanian 6 months ago

https://github.com/ashvardanian/simsimd - v6.5.2: Improved Compatibility & Docs Consistency

Patch

  • Make: Rust 1.64 compatibility (889bf25)
  • Docs: Inconsistencies & typos (301d59c)
  • Make: Avoid Cargo.lock for the library (5b9c207)
  • Make: MSVC-friendlier Rust builds (e940014)
  • Improve: Naming Rust tests (e1fab5c)

Published by ashvardanian 6 months ago

https://github.com/ashvardanian/simsimd - Release v6.5.1

Release: v6.5.1 [skip ci]

Patch

  • Make: Avoid --lib-sdir . on Linux (39623cc)
  • Docs: Probability Distributions in Rust (af7c145)

Published by ashvardanian 6 months ago

https://github.com/ashvardanian/simsimd - v6.5: Half-Precision `f16` & `bf16` Numbers in Rust 🦀

SimSIMD has historically been one of the largest collections of mixed-precision kernels, but f32 to/from f16 and bf16 conversion operators have never been exposed to bindings. This release is the first step in that direction. I look forward to everyone's suggestions on how to further improve the Rust API. Thanks 🤗


Here's an example:

```rs
use simsimd::{SpatialSimilarity, f16, bf16};

// Process embeddings at different precisions for speed vs accuracy trade-offs
let embeddings_f32 = vec![1.0, 2.0, 3.0, 4.0];
let query_f32 = vec![0.5, 1.5, 2.5, 3.5];

// Convert to f16 for memory efficiency (2x compression)
let embeddings_f16: Vec<f16> = embeddings_f32.iter().map(|&x| f16::from_f32(x)).collect();
let query_f16: Vec<f16> = query_f32.iter().map(|&x| f16::from_f32(x)).collect();

// Convert to bf16 for ML workloads (better range than f16)
let embeddings_bf16: Vec<bf16> = embeddings_f32.iter().map(|&x| bf16::from_f32(x)).collect();
let query_bf16: Vec<bf16> = query_f32.iter().map(|&x| bf16::from_f32(x)).collect();

// Hardware-accelerated similarity at different precisions
let similarity_f32 = f32::cosine(&embeddings_f32, &query_f32).unwrap();
let similarity_f16 = f16::cosine(&embeddings_f16, &query_f16).unwrap();
let similarity_bf16 = bf16::cosine(&embeddings_bf16, &query_bf16).unwrap();

println!("Cosine similarities:");
println!("f32:  {:.6}", similarity_f32); // Full precision
println!("f16:  {:.6}", similarity_f16); // 2x memory savings
println!("bf16: {:.6}", similarity_bf16); // ML-optimized range

// Natural arithmetic operations work seamlessly
let scaled_f16 = embeddings_f16.iter().map(|&x| x * f16::from_f32(2.0) + f16::ONE).collect::<Vec<_>>();

// Direct bit manipulation when needed
let raw_bits = embeddings_f16[0].0; // Access underlying u16
let reconstructed = f16(raw_bits);
```
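
For intuition, the bf16 conversion above is just the top half of an f32's bit pattern. A minimal NumPy sketch of that round-trip (assuming round-toward-zero truncation; SimSIMD's converters may round differently):

```python
import numpy as np

def f32_to_bf16_bits(x: float) -> int:
    """Truncate an f32 to bf16 by keeping only its top 16 bits."""
    return int(np.float32(x).view(np.uint32)) >> 16

def bf16_bits_to_f32(bits: int) -> float:
    """Re-expand bf16 bits to f32 by zero-filling the dropped mantissa bits."""
    return float(np.uint32(bits << 16).view(np.float32))

# bf16 keeps f32's full 8-bit exponent, so even huge values survive the trip
assert bf16_bits_to_f32(f32_to_bf16_bits(1.0)) == 1.0
assert bf16_bits_to_f32(f32_to_bf16_bits(3.0e38)) != float("inf")
```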

Minor

  • Add: Half-precision converters for C/Rust (bf5a7d2)

Patch

  • Improve: Support more numeric operators in Rust (1d7b2b8)
  • Improve: Rust nostd builds (c6c0698)
  • Make: Bump dependencies (3667313)
  • Fix: Expose Distance in Rust (e0c69c7)

Published by ashvardanian 8 months ago

https://github.com/ashvardanian/simsimd - v6.4.10: Multi-threading in Python 3.13t

Other minor tweaks:

  • [x] bf16 L2 calculation in Rust
  • [x] flushing denormals in Rust
  • [x] nonnull build warnings in GCC & Clang
  • [x] upgrading JS dependencies

Patch

  • Fix: Require NumPy for GIL tests (529b0dd)
  • Improve: Free threading examples & checks (83e522a)
  • Make: Enable free-threading CIBW builds (0093c3f)
  • Docs: Setting up uv env (8dc7012)
  • Improve: GIL-free batch-processing in Py (eb234d5)
  • Make: Drop Python 3.7 for 3.13t (fc62de4)
  • Improve: Flushing denormals in Rust (1c1e608)
  • Docs: Top-level script instructions (d17c9a9)
  • Fix: Typos & auto-formatting (7bca0e3)
  • Make: Bump JS deps (440d3e5)
  • Fix: Avoid nonull attribute warning on GCC (f14db4a)
  • Fix: Wrong SpatialSimilarity::<bf16>::l2 dispatch (00ad454)

Published by ashvardanian 8 months ago

https://github.com/ashvardanian/simsimd - Release v6.4.9

Release: v6.4.9 [skip ci]

Patch

  • Fix: add dot i8 (Rust) (eaeb3b7)

Published by ashvardanian 9 months ago

https://github.com/ashvardanian/simsimd - v6.4.8: Supporting R-profile Arm CPUs 💪

  • Fix: GCC can't handle v8.0-a decimal (4116f8a)
  • Fix: f16, i8, bf16 compile-time dispatch (29c0f46)
  • Docs: Globally unset DEVELOPER_DIR (22bb40b)
  • Fix: Check for NEON for R-profile CPUs (a6bbf9e)
  • Make: Lower armv8.2 to armv8.0 requirement (31fbdcd)
  • Improve: Set nonnull attributes (fc61d19)
  • Docs: Unset DEVELOPER_DIR on macOS (69c6614)
  • Make: Bump Google Benchmark (cca25a0)
  • Docs: Refresh C example (73e6ccb)

Published by ashvardanian 9 months ago

https://github.com/ashvardanian/simsimd - Release v6.4.7

Release: v6.4.7 [skip ci]

Patch

  • Make: Differentiate cibw uploads (9116b2a)

Published by ashvardanian 9 months ago

https://github.com/ashvardanian/simsimd - Release v6.4.6

Release: v6.4.6 [skip ci]

Patch

  • Fix: Deno testing CLI commands (be8acfb)
  • Make: Bump vulnerable JS deps (506b816)
  • Make: Checking env. variables on macOS (6cb256f)
  • Make: Try compiling wheels with different flags (77870cf)
  • Make: Enable Deno to run pre-builds (71b3412)
  • Make: Set f16c flag for _cvtss_sh (1215418)
  • Fix: Pedantic _Float16 cast warnings (3230095)
  • Make: Overwrite JS bundles (c252f84)
  • Fix: Missing avx512dq flags (ace4f7e)

Published by ashvardanian 9 months ago

https://github.com/ashvardanian/simsimd - Release v6.4.5

Release: v6.4.5 [skip ci]

Patch

  • Fix: Aliasing of half-precision types (abb2d88)

Published by ashvardanian 9 months ago

https://github.com/ashvardanian/simsimd - Release v6.4.4

Release: v6.4.4 [skip ci]

Patch

  • Make: Return Rust build errors (#264) (7e3b493)

Published by ashvardanian 10 months ago

https://github.com/ashvardanian/simsimd - Release v6.4.3

Release: v6.4.3 [skip ci]

Patch

  • Fix: Use correct type in sparse dot-product macro (354a6b8)

Published by ashvardanian 10 months ago

https://github.com/ashvardanian/simsimd - Release v6.4.2

Release: v6.4.2 [skip ci]

Patch

  • Fix: i4 cosine on Ice Lake (#262) (ffdbbf8)

Published by ashvardanian 10 months ago

https://github.com/ashvardanian/simsimd - Release v6.4.1

Release: v6.4.1 [skip ci]

Patch

  • Docs: Dual-licensing with 3-clause BSD (7520fcf)

Published by ashvardanian 11 months ago

https://github.com/ashvardanian/simsimd - Release v6.4.0

Release: v6.4.0 [skip ci]

Minor

  • Add: Expose L2 distance in Swift (#255) (b106afc)

Published by ashvardanian about 1 year ago

https://github.com/ashvardanian/simsimd - Release v6.3.4

Release: v6.3.4 [skip ci]

Patch

  • Fix: Turin kernels for spdot (#252) (5044fef)

Published by ashvardanian about 1 year ago

https://github.com/ashvardanian/simsimd - Release v6.3.3

Release: v6.3.3 [skip ci]

Patch

  • Improve: Sparse intersection dependency chain (#251) (b8ee93f)

Published by ashvardanian about 1 year ago

https://github.com/ashvardanian/simsimd - Release v6.3.2

Release: v6.3.2 [skip ci]

Patch

  • Make: Upgrade deprecated CI tools (d9bc3d2)

Published by ashvardanian about 1 year ago

https://github.com/ashvardanian/simsimd - Release v6.3.1

Release: v6.3.1 [skip ci]

Patch

  • Make: Update release.yml for Arm (d8c6f40)
  • Make: Use official Docker repo (48d39e3)
  • Make: Remove conflicting containerd on Arm (41e02f9)
  • Make: Install Docker on Aarch64 (636a22d)
  • Make: Avoid extras repo in yum on Aarch64 (aa5aced)
  • Fix: Wrong variable used in l2sqbf16sve (c26008b)
  • Make: Resolve Windows build conflicts (8e50840)

Published by ashvardanian about 1 year ago

https://github.com/ashvardanian/simsimd - Release v6.3.0

Release: v6.3.0 [skip ci]

Minor

  • Add: simsimd_flush_denormals (63af257)

Patch

  • Make: Faster cibuildwheel releases (e54e939)
  • Make: Fix CI instance label (60048b4)
  • Make: Use newer Python for cibuildwheel (e922019)
  • Make: Skip 32-bit Windows Python images (372480a)
  • Make: Use newer image for Arm CI (a63f55f)
  • Make: Patch pyproject.toml (b3e35a9)
  • Make: Skip PyPy builds (1fe7faa)
  • Make: test-command Windows compatibility (dea5b71)
  • Make: Skip armv7l PyPi builds (9227c86)
  • Make: Skip missing ARM builds (e1de8ee)
  • Make: Use default images (4d9e6e7)
  • Make: Separate PyPi CI jobs (b690f7b)
  • Improve: Naming CI steps (d883b9d)
  • Make: Try newer cibuildwheel==2.22 (77898ac)
  • Improve: Test flushing denormalized values (9685346)
  • Improve: Code styling (adbbd9c)
  • Improve: Prefer Asm in _simsimd_flush_denormals_x86 (7c54935)
  • Docs: Spelling (114ff7d)

Published by ashvardanian about 1 year ago

https://github.com/ashvardanian/simsimd - Release v6.2.3

Release: v6.2.3 [skip ci]

Patch

  • Docs: Navigating the codebase (c375e3b)
  • Docs: FMA ports details (7c87779)

Published by ashvardanian about 1 year ago

https://github.com/ashvardanian/simsimd - Release v6.2.2

Release: v6.2.2 [skip ci]

Patch

  • Docs: Swift page typos (#241) (180be84)

Published by ashvardanian about 1 year ago

https://github.com/ashvardanian/simsimd - Release v6.2.1

Release: v6.2.1 [skip ci]

Patch

  • Fix: -Wvla warnings (671be9f)
  • Docs: Harley-Seal plans for binary kernels (45dbe6e)
  • Docs: Cleaner benchmarks table (a39419c)

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - v6.2: Complex Bilinear Forms for Physics ⚛️

  • Technical highlight of the release: AVX-512 masking is extremely handy in implementing unrolled BLAS Level 2 operations for small inputs, resulting in up to 5x faster kernels than OpenBLAS.
  • Semantic highlight of the release: Bilinear forms now support complex numbers as inputs, extending the kernels' applicability to Computational Physics.

Bilinear Forms are essential in Scientific Computing. Some of the most computationally intensive cases arise in Quantum systems and their simulations, as discussed on r/Quantum. This PR adds support for complex inputs to make it more broadly applicable.

```math
\text{BilinearForm}(a, b, M) = a^T M b
```

In Python, you can compute this by calling two NumPy functions back to back, ideally reusing a buffer for the intermediate result:

```py
import numpy as np

ndim = 128
dtype = np.float32
temporary_vector = np.empty((ndim,), dtype=dtype)

first_quantum_state = np.random.randn(ndim).astype(dtype)
second_quantum_state = np.random.randn(ndim).astype(dtype)
interaction_matrix = np.random.randn(ndim, ndim).astype(dtype)

np.matmul(first_quantum_state, interaction_matrix, out=temporary_vector)
result: float = np.inner(temporary_vector, second_quantum_state)
```

With SimSIMD, the last 2 lines are fused:

```py
import simsimd as simd

simd.bilinear(first_quantum_state, second_quantum_state, interaction_matrix)
```

For 128-dimensional np.float32 inputs, the latency dropped from 2.11 μs with NumPy to 1.31 μs with SimSIMD. For smaller 16-dimensional np.float32 inputs, it dropped from 1.31 μs to 202 ns. As always, the gap is wider for low-precision np.float16 representations: 2.68 μs with NumPy vs 313 ns with SimSIMD.

Small Matrices and AVX-512

In the past, developers dealing with small matrices would ship separate precompiled kernels for every reasonable matrix size. That inflates the binary size and makes the CPU's L1i instruction cache ineffective. With AVX-512, however, we can reuse the same vectorized loop for different matrix sizes, with just a single additional BZHI instruction precomputing the load masks.
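
The BZHI trick can be sketched in scalar terms: the instruction zeroes all source bits at and above a given index, which is exactly the tail-load mask for a 16-lane f32 register. A hypothetical helper, for illustration only:

```python
def tail_mask(total_elements: int, lanes: int = 16) -> int:
    """Load mask for the final partial vector of a 16-lane f32 loop.

    BZHI(src, k) zeroes all bits of src at index k and above, so for a
    full 16-bit mask BZHI(0xFFFF, k) == (1 << k) - 1.
    """
    k = total_elements % lanes
    return (1 << k) - 1

# A 21-element row needs one full 16-lane load plus a 5-lane masked load
assert tail_mask(21) == 0b11111
assert tail_mask(32) == 0  # exact multiple of the lane count: no tail load
```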

Avoiding Data Dependency

A common approach in dot products is to use a single register to accumulate the partial sums. The underlying VFMADD132PS instruction:

  • AMD Zen 4 has a latency of 4 cycles and can execute on ports 0 and 1.
  • Intel Skylake-X has a latency of 4 cycles and can execute on ports 0 and 5.

Since it can execute on 2 ports simultaneously even on current hardware, introducing a data dependency between consecutive iterations leaves throughput on the table. Future generations may execute it on even more ports, so to "future-proof" the solution, I use 4 independent accumulators.
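
The effect of breaking the dependency chain can be sketched in scalar Python, with four independent accumulators standing in for four vector registers (a didactic sketch, not the actual kernel):

```python
def dot_single_acc(a, b):
    """One accumulator: every FMA waits ~4 cycles on the previous one."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def dot_four_acc(a, b):
    """Four independent accumulators hide the FMA latency across ports."""
    accs = [0.0, 0.0, 0.0, 0.0]
    for i, (x, y) in enumerate(zip(a, b)):
        accs[i % 4] += x * y
    return sum(accs)

a = [0.5] * 64
b = [2.0] * 64
assert dot_single_acc(a, b) == dot_four_acc(a, b) == 64.0
```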

Avoiding Horizontal Reductions

When computing $a \cdot X \cdot b$, we may prefer to evaluate $X \cdot b$ first due to the associativity of matrix multiplication. On tiny inputs, the operation may be bottlenecked by computing horizontal reductions for every row of $X$. Instead, we use more serial loads and broadcasts, but perform only one horizontal accumulation at the end, assuming all the needed intermediaries fit into a single register (or a few, if we also minimize the data dependency).
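
The reordering can be checked with a tiny NumPy sketch: one dot product per row of $X$ versus folding $X b$ first and reducing once at the end (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.standard_normal(8)
X = rng.standard_normal((8, 8))
b = rng.standard_normal(8)

# One horizontal reduction per row of X, plus a final one
naive = sum(a[i] * (X[i] @ b) for i in range(8))
# Fold X @ b first: a single horizontal reduction at the very end
fused = a @ (X @ b)

assert abs(naive - fused) < 1e-9  # algebraically identical, up to rounding
```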

Intel Sapphire Rapids Benchmarks

Running on recent Intel Sapphire Rapids CPUs, one can expect the following performance metrics for 128-dimensional Bilinear Forms for SimSIMD and OpenBLAS:

```sh
------------------------------------------------------------------------------------------------------
Benchmark                                               Time       CPU  Iterations  UserCounters...
------------------------------------------------------------------------------------------------------
bilinear_f64_blas<128d>/min_time:10.000/threads:1      3584 ns   3584 ns   3906234  abs_delta=3.8576a bytes=571.503M/s pairs=279.054k/s relative_error=1.45341f
bilinear_f64c_blas<128d>/min_time:10.000/threads:1     7605 ns   7604 ns   1856665  abs_delta=3.90906a bytes=538.656M/s pairs=131.508k/s relative_error=3.10503f
bilinear_f32_blas<128d>/min_time:10.000/threads:1      1818 ns   1818 ns   7621072  abs_delta=743.294p bytes=563.325M/s pairs=550.122k/s relative_error=301.396n
bilinear_f32c_blas<128d>/min_time:10.000/threads:1     3607 ns   3606 ns   3886483  abs_delta=958.531p bytes=567.864M/s pairs=277.278k/s relative_error=1.4445u
bilinear_f16_haswell<128d>/min_time:10.000/threads:1   1324 ns   1324 ns  10597225  abs_delta=1.31674n bytes=386.742M/s pairs=755.355k/s relative_error=851.968n
bilinear_bf16_haswell<128d>/min_time:10.000/threads:1  1305 ns   1305 ns  10752131  abs_delta=1.33001n bytes=392.464M/s pairs=766.532k/s relative_error=561.046n
bilinear_bf16_genoa<128d>/min_time:10.000/threads:1     862 ns    862 ns  16241596  abs_delta=1.40284n bytes=593.885M/s pairs=1.15993M/s relative_error=849.533n
bilinear_bf16c_genoa<128d>/min_time:10.000/threads:1   2610 ns   2610 ns   5355435  abs_delta=351.596p bytes=392.313M/s pairs=383.118k/s relative_error=243.698n
bilinear_f16_sapphire<128d>/min_time:10.000/threads:1   875 ns    875 ns  16038203  abs_delta=10.5652u bytes=584.951M/s pairs=1.14248M/s relative_error=9.42998m
bilinear_f16c_sapphire<128d>/min_time:10.000/threads:1 2159 ns   2159 ns   6449575  abs_delta=4.43296u bytes=474.398M/s pairs=463.28k/s relative_error=3.98057m
bilinear_f64_skylake<128d>/min_time:10.000/threads:1   3483 ns   3483 ns   4019657  abs_delta=4.3853a bytes=587.96M/s pairs=287.09k/s relative_error=3.02046f
bilinear_f64c_skylake<128d>/min_time:10.000/threads:1  7178 ns   7178 ns   1949803  abs_delta=3.45547a bytes=570.624M/s pairs=139.313k/s relative_error=4.07708f
bilinear_f32_skylake<128d>/min_time:10.000/threads:1   1783 ns   1783 ns   7848896  abs_delta=2.45041n bytes=574.255M/s pairs=560.796k/s relative_error=811.561n
bilinear_f32c_skylake<128d>/min_time:10.000/threads:1  3504 ns   3504 ns   3976879  abs_delta=1.94251n bytes=584.494M/s pairs=285.397k/s relative_error=2.99757u
bilinear_f64_serial<128d>/min_time:10.000/threads:1    5528 ns   5528 ns   2529904  abs_delta=0 bytes=370.459M/s pairs=180.888k/s relative_error=0
bilinear_f64c_serial<128d>/min_time:10.000/threads:1  12324 ns  12324 ns   1140788  abs_delta=0 bytes=332.371M/s pairs=81.1453k/s relative_error=0
bilinear_f32_serial<128d>/min_time:10.000/threads:1    5299 ns   5298 ns   2649614  abs_delta=1.69242n bytes=193.264M/s pairs=188.734k/s relative_error=776.834n
bilinear_f32c_serial<128d>/min_time:10.000/threads:1  10217 ns  10216 ns   1370535  abs_delta=1.89398n bytes=200.461M/s pairs=97.8816k/s relative_error=3.25219u
bilinear_f16_serial<128d>/min_time:10.000/threads:1   42372 ns  42371 ns    330369  abs_delta=1.93284n bytes=12.0838M/s pairs=23.6011k/s relative_error=1.51289u
bilinear_f16c_serial<128d>/min_time:10.000/threads:1  46101 ns  46100 ns    303997  abs_delta=1.77214n bytes=22.2124M/s pairs=21.6918k/s relative_error=1.5494u
bilinear_bf16_serial<128d>/min_time:10.000/threads:1  85325 ns  85324 ns    163256  abs_delta=1.34067n bytes=6.00066M/s pairs=11.72k/s relative_error=527.801n
bilinear_bf16c_serial<128d>/min_time:10.000/threads:1 178970 ns 178967 ns    78235  abs_delta=1.46323n bytes=5.72174M/s pairs=5.58764k/s relative_error=1004.88n
```

Highlights:

  • Single- and double-precision kernels are only about 5% faster than BLAS due to removed temporary buffer stores.
  • Both bf16 and f16 kernels provide linear speedups proportional to the number of bits in the data type.

On low-dimensional inputs, the performance gap is larger:

```
------------------------------------------------------------------------------------------------------
Benchmark                                             Time      CPU  Iterations  UserCounters...
------------------------------------------------------------------------------------------------------
bilinear_f64_blas<8d>/min_time:10.000/threads:1      42.7 ns  42.7 ns   328247670  abs_delta=15.9107a bytes=3.00004G/s pairs=23.4378M/s relative_error=550.946a
bilinear_f64c_blas<8d>/min_time:10.000/threads:1     57.4 ns  57.4 ns   243896993  abs_delta=21.3452a bytes=4.46378G/s pairs=17.4366M/s relative_error=514.643a
bilinear_f32_blas<8d>/min_time:10.000/threads:1      32.2 ns  32.2 ns   434784869  abs_delta=6.73645n bytes=3.97757G/s pairs=31.0747M/s relative_error=235.395n
bilinear_f32c_blas<8d>/min_time:10.000/threads:1     50.6 ns  50.6 ns   276504577  abs_delta=7.97379n bytes=2.52823G/s pairs=19.7518M/s relative_error=251.204n
bilinear_f16_haswell<8d>/min_time:10.000/threads:1   13.7 ns  13.7 ns  1000000000  abs_delta=6.06053n bytes=9.35133G/s pairs=73.0573M/s relative_error=139.096n
bilinear_bf16_haswell<8d>/min_time:10.000/threads:1  13.0 ns  13.0 ns  1000000000  abs_delta=5.03892n bytes=9.84787G/s pairs=76.9365M/s relative_error=114.101n
bilinear_bf16_genoa<8d>/min_time:10.000/threads:1    12.6 ns  12.6 ns  1000000000  abs_delta=5.63947n bytes=10.1297G/s pairs=79.1384M/s relative_error=166.305n
bilinear_bf16c_genoa<8d>/min_time:10.000/threads:1   69.0 ns  69.0 ns   203022389  abs_delta=1.61581n bytes=1.85573G/s pairs=14.4979M/s relative_error=60.9203n
bilinear_f16_sapphire<8d>/min_time:10.000/threads:1  8.52 ns  8.52 ns  1000000000  abs_delta=51.4863u bytes=15.0256G/s pairs=117.387M/s relative_error=1.92771m
bilinear_f16c_sapphire<8d>/min_time:10.000/threads:1 64.6 ns  64.6 ns   216692584  abs_delta=43.8492u bytes=1.98133G/s pairs=15.4791M/s relative_error=1.48218m
bilinear_f32_skylake<8d>/min_time:10.000/threads:1   7.28 ns  7.28 ns  1000000000  abs_delta=8.92396n bytes=17.5799G/s pairs=137.343M/s relative_error=266.557n
bilinear_f32c_skylake<8d>/min_time:10.000/threads:1  42.8 ns  42.8 ns   326789735  abs_delta=10.4774n bytes=2.98821G/s pairs=23.3454M/s relative_error=267.67n
bilinear_f64_skylake<8d>/min_time:10.000/threads:1   7.16 ns  7.16 ns  1000000000  abs_delta=16.8322a bytes=17.8732G/s pairs=139.634M/s relative_error=776.898a
bilinear_f64c_skylake<8d>/min_time:10.000/threads:1  31.2 ns  31.2 ns   449958679  abs_delta=17.4692a bytes=8.20188G/s pairs=32.0386M/s relative_error=477.326a
bilinear_f64_serial<8d>/min_time:10.000/threads:1    19.3 ns  19.3 ns   724453573  abs_delta=0 bytes=6.63046G/s pairs=51.8005M/s relative_error=0
bilinear_f64c_serial<8d>/min_time:10.000/threads:1   47.7 ns  47.7 ns   293638808  abs_delta=0 bytes=5.36703G/s pairs=20.965M/s relative_error=0
bilinear_f32_serial<8d>/min_time:10.000/threads:1    18.4 ns  18.4 ns   759547931  abs_delta=7.93122n bytes=6.94336G/s pairs=54.245M/s relative_error=213.04n
bilinear_f32c_serial<8d>/min_time:10.000/threads:1   45.6 ns  45.6 ns   307012654  abs_delta=9.52236n bytes=2.80829G/s pairs=21.9398M/s relative_error=282.08n
bilinear_f16_serial<8d>/min_time:10.000/threads:1     171 ns   171 ns    81713243  abs_delta=7.46151n bytes=747.117M/s pairs=5.83685M/s relative_error=187.409n
bilinear_f16c_serial<8d>/min_time:10.000/threads:1    208 ns   208 ns    67195854  abs_delta=8.79194n bytes=614.281M/s pairs=4.79907M/s relative_error=265.818n
bilinear_bf16_serial<8d>/min_time:10.000/threads:1    359 ns   359 ns    38947709  abs_delta=5.77119n bytes=356.094M/s pairs=2.78198M/s relative_error=122.725n
bilinear_bf16c_serial<8d>/min_time:10.000/threads:1   744 ns   744 ns    18821435  abs_delta=6.72388n bytes=172.071M/s pairs=1.34431M/s relative_error=145.277n
```

Highlights:

  • For f32, the performance grew from 31.07 to 137.34 Million operations per second.
  • For f64, the performance grew from 23.44 to 139.63 Million operations per second.

Commits

  • Add: Bilinear complex kernels for NEON (cd15779)
  • Add: Half-precision bilinear forms on x86 (f804694)
  • Add: f64 bilinear forms in AVX-512 (7f24d59)
  • Add: Complex bilinear forms (3a56174)
  • Add: Complex structs (ee2de83)
  • Docs: Improved benchmarks table (acc61b5)
  • Make: Revert s390x and ppc64le support (82666ff)
  • Make: Pin cibuildwheel==2.21.3 (d865a46)
  • Make: Drop s390x and ppc64le (5d9a219)
  • Improve: Fewer PyTest runs (6effbea)
  • Make: Override s390x container (08c7ac0)
  • Make: Bump Py & TinySemVer (03eba14)
  • Make: Upgrade Py CI (d82c2bb)
  • Improve: Computing masks in AVX-512 (f259492)
  • Improve: Unroll AVX-512 bilinear forms (05e29b3)
  • Docs: Multi-threading in BLAS (4f351ad)
  • Fix: Write beyond buffer bounds (2bef182)
  • Improve: cBLAS Bilinear Form benchmarks (1e5b9d7)
  • Improve: Shorter macros (15eedb5)
  • Improve: Dispatch complex kernels (3ba20d0)
  • Fix: Missing idx_scalars in SVE (19a56bc)
  • Docs: Using complex & int types (67e39e6)
  • Improve: Complex numbers handling in bench.cxx (b923c1a)
  • Improve: Use complex types for dense.h (8941462)

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - v6.1.2: Avoid AVX-512BW on Skylake-X

The previous AVX-512 implementation of complex products used an extra ZMM register for swap_adjacent_vec. Moreover, it used the vpshufb instruction, available only with the Ice Lake capability level and newer. The replacement uses _mm512_permute_ps and its double-precision variant.
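
The lane-swapping idea can be sketched with NumPy on interleaved real/imaginary arrays: the real part pairs lanes directly, while the imaginary part multiplies against a copy of `b` with adjacent lanes swapped, which is what the shuffle computes (a sketch of the math, not the intrinsic code):

```python
import numpy as np

def dot_complex_interleaved(a: np.ndarray, b: np.ndarray) -> complex:
    """Complex dot product over interleaved [re, im, re, im, ...] arrays."""
    b_swapped = b.reshape(-1, 2)[:, ::-1].ravel()        # (re, im) -> (im, re)
    re = float(np.sum(a[0::2] * b[0::2] - a[1::2] * b[1::2]))  # ac - bd
    im = float(np.sum(a * b_swapped))                    # ad + bc
    return complex(re, im)

a = np.array([1.0, 2.0, 3.0, 4.0])  # 1+2j, 3+4j
b = np.array([5.0, 6.0, 7.0, 8.0])  # 5+6j, 7+8j
# (1+2j)*(5+6j) + (3+4j)*(7+8j) = -18+68j
assert dot_complex_interleaved(a, b) == complex(-18, 68)
```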

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - v6.1.0: Add canonical L2 to Rust SDK

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v6.0.7

Release: v6.0.7 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v6.0.6

Release: v6.0.6 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v6.0.5

Release: v6.0.5 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v6.0.4

Release: v6.0.4 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v6.0.3

Release: v6.0.3 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v6.0.2

Release: v6.0.2 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v6.0.1

Release: v6.0.1 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - v6.0: Towards BLAS

SimSIMD is becoming bigger! Our course is to become the go-to cross-platform mixed-precision BLAS library for dense and sparse representations on modern hardware 🥳

It, however, required some almost unnoticeable wording changes to support a broader type system. In Python, for example, the same `l` type specifier can map to both int32 and int64 across Windows and Linux/macOS. Moreover, once we expand to 8-byte integers, a `u8` specifier could mean both an 8-bit and an 8-byte integer, which is confusing. That's why we are dropping the short dtype descriptors. Before, you could write:

```py
simsimd.dot(a, b, dtype='u8')
```

Now you have to write:

```py
simsimd.dot(a, b, dtype='uint8')
```

Safety comes at a cost, and this time the price is 2 extra letters 😉
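
The ambiguity is easy to reproduce with NumPy's own single-letter codes: `'l'` is the platform's C `long`, so its width differs between Windows and Linux/macOS, while the long names are fixed-width everywhere (a sketch):

```python
import numpy as np

# The single-letter 'l' code is the C "long", whose width is
# platform-dependent: 8 bytes on Linux/macOS, 4 bytes on Windows.
print(np.dtype('l').itemsize)

# Explicit names are unambiguous on every platform:
assert np.dtype('int64').itemsize == 8
assert np.dtype('uint8').itemsize == 1
```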

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.9.11

Release: v5.9.11 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.9.10

Release: v5.9.10 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - v5.9.9: Weighted Sums Special Cases on Arm

Release: v5.9.9 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - v5.9.8: Weighted Sums Special Cases on x86

Release: v5.9.8 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.9.7

Release: v5.9.7 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.9.6

Release: v5.9.6 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.9.5

Release: v5.9.5 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.9.4

Release: v5.9.4 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.9.3

Release: v5.9.3 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - v5.9.2: Faster FMA on Haswell

Handling loads and stores with SIMD is tricky. Not because of the up-casting, but because of the down-casting at the end of the loop. In AVX2 it's a drag! We keep that for another day and use AVX2 for the actual math and value clipping. The current variant operates at 15-19 GB/s, as opposed to under 500 MB/s for serial code.
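
In scalar terms, the kernel's contract can be sketched as: u8 inputs, f32 math, and a saturating down-cast on the way out. A NumPy sketch with a hypothetical `fma_u8` helper (the round-to-nearest step is an assumption):

```python
import numpy as np

def fma_u8(a: np.ndarray, b: np.ndarray, c: np.ndarray,
           alpha: float, beta: float) -> np.ndarray:
    """u8 inputs, f32 accumulation, clipped down-cast back to u8."""
    result = alpha * a.astype(np.float32) * b.astype(np.float32) \
        + beta * c.astype(np.float32)
    return np.clip(np.rint(result), 0, 255).astype(np.uint8)

a = np.array([10, 200], dtype=np.uint8)
b = np.array([3, 2], dtype=np.uint8)
c = np.array([5, 100], dtype=np.uint8)
out = fma_u8(a, b, c, alpha=1.0, beta=1.0)
# 10*3+5 = 35; 200*2+100 = 500 saturates to 255
assert out.tolist() == [35, 255]
```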

```sh
------------------------------------------------------------------------------------------------
Benchmark                                         Time     CPU  Iterations  UserCounters...
------------------------------------------------------------------------------------------------
fma_u8_haswell<1536d>/min_time:10.000/threads:1    248 ns   248 ns  56523758  abs_delta=8.20566 bytes=18.6111G/s pairs=4.03886M/s relative_error=2.16737m
wsum_u8_haswell<1536d>/min_time:10.000/threads:1   197 ns   197 ns  71164289  abs_delta=7.76442 bytes=15.5983G/s pairs=5.07757M/s relative_error=2.86599m
fma_u8_sapphire<1536d>/min_time:10.000/threads:1  70.9 ns  70.9 ns 197581878  abs_delta=9.2812 bytes=64.9908G/s pairs=14.1039M/s relative_error=2.45142m
wsum_u8_sapphire<1536d>/min_time:10.000/threads:1 51.2 ns  51.2 ns 275604255  abs_delta=8.89144 bytes=60.0323M/s pairs=19.5418M/s relative_error=3.28203m
fma_u8_serial<1536d>/min_time:10.000/threads:1    9749 ns  9748 ns   1428411  abs_delta=1.66854 bytes=472.69M/s pairs=102.58k/s relative_error=440.882u
wsum_u8_serial<1536d>/min_time:10.000/threads:1   9455 ns  9455 ns   1488320  abs_delta=2.32787 bytes=324.901M/s pairs=105.762k/s relative_error=859.403u
```

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.9.1

Release: v5.9.1 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - v5.9: Fused-Multiply-Add in Mixed Precision

SimSIMD is expanding and becoming closer to a fully-fledged BLAS library. BLAS level 1 for now, but it's a start! SimSIMD will prioritize mixed and low-precision vector math, favoring modern AI workloads. For image & media processing workloads, the new fma and wsum kernels approach 65 GB/s per core on Intel Sapphire Rapids. That's 100x faster than the serial code for u8 inputs with f32 scaling and accumulation.

Contains the following element-wise operations:

```math
\text{FMA}_i(A, B, C, \alpha, \beta) = \alpha \cdot A_i \cdot B_i + \beta \cdot C_i
```

```math
\text{WSum}_i(A, B, \alpha, \beta) = \alpha \cdot A_i + \beta \cdot B_i
```

In NumPy terms:

```py
import numpy as np

def wsum(A: np.ndarray, B: np.ndarray, Alpha: float, Beta: float) -> np.ndarray:
    assert A.dtype == B.dtype, "Input types must match and affect the output style"
    return (Alpha * A + Beta * B).astype(A.dtype)

def fma(A: np.ndarray, B: np.ndarray, C: np.ndarray, Alpha: float, Beta: float) -> np.ndarray:
    assert A.dtype == B.dtype and A.dtype == C.dtype, "Input types must match and affect the output style"
    return (Alpha * A * B + Beta * C).astype(A.dtype)
```

This tiny set of operations is enough to implement a wide range of algorithms:

  • To scale a vector by a scalar, just call WSum with $\beta = 0$.
  • To sum two vectors, just call WSum with $\alpha = \beta = 1$.
  • To average two vectors, just call WSum with $\alpha = \beta = 0.5$.
  • To multiply vectors element-wise, just call FMA with $\beta = 0$.
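
The special cases above can be spelled out with the same NumPy definitions of `wsum` and `fma` (a self-contained sketch):

```python
import numpy as np

def wsum(A, B, alpha, beta):
    return (alpha * A + beta * B).astype(A.dtype)

def fma(A, B, C, alpha, beta):
    return (alpha * A * B + beta * C).astype(A.dtype)

a = np.array([1.0, 2.0], dtype=np.float32)
b = np.array([3.0, 4.0], dtype=np.float32)

assert np.array_equal(wsum(a, b, 2.0, 0.0), 2.0 * a)      # scale by a scalar
assert np.array_equal(wsum(a, b, 1.0, 1.0), a + b)        # sum two vectors
assert np.array_equal(wsum(a, b, 0.5, 0.5), (a + b) / 2)  # average two vectors
assert np.array_equal(fma(a, b, b, 1.0, 0.0), a * b)      # element-wise product
```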

Benchmarks

On Intel Sapphire Rapids:

```sh
Run on (16 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 2048 KiB (x8)
  L3 Unified 61440 KiB (x1)
Load Average: 0.79, 0.75, 0.56

--------------------------------------------------------------------------------------------------
Benchmark                                           Time      CPU  Iterations  UserCounters...
--------------------------------------------------------------------------------------------------
fma_f64_haswell<1536d>/min_time:10.000/threads:1   1344 ns  1344 ns   10391897  abs_delta=0 bytes=27.4208G/s pairs=743.836k/s relative_error=0
wsum_f64_haswell<1536d>/min_time:10.000/threads:1  1040 ns  1040 ns   13465261  abs_delta=0 bytes=23.6376G/s pairs=961.815k/s relative_error=0
fma_f32_haswell<1536d>/min_time:10.000/threads:1    651 ns   651 ns   21534450  abs_delta=23.597n bytes=28.3033G/s pairs=1.53555M/s relative_error=47.0002n
wsum_f32_haswell<1536d>/min_time:10.000/threads:1   392 ns   392 ns   36225731  abs_delta=19.6436n bytes=31.3326G/s pairs=2.54985M/s relative_error=54.2672n
fma_f16_haswell<1536d>/min_time:10.000/threads:1    188 ns   188 ns   74334715  abs_delta=9.24044u bytes=49.1302G/s pairs=5.33097M/s relative_error=18.3975u
wsum_f16_haswell<1536d>/min_time:10.000/threads:1   130 ns   129 ns  106997523  abs_delta=12.015u bytes=47.4441G/s pairs=7.72203M/s relative_error=33.1896u
fma_bf16_haswell<1536d>/min_time:10.000/threads:1   225 ns   225 ns   62443286  abs_delta=1.91338m bytes=41.0221G/s pairs=4.45118M/s relative_error=3.81108m
wsum_bf16_haswell<1536d>/min_time:10.000/threads:1  161 ns   161 ns   86471812  abs_delta=1.36093m bytes=38.1318G/s pairs=6.20635M/s relative_error=3.75961m
fma_u8_sapphire<1536d>/min_time:10.000/threads:1   70.9 ns  70.9 ns  197232316  abs_delta=9.2812 bytes=64.9867G/s pairs=14.103M/s relative_error=2.45142m
wsum_u8_sapphire<1536d>/min_time:10.000/threads:1  50.6 ns  50.6 ns  276672248  abs_delta=8.89144 bytes=60.6775G/s pairs=19.7518M/s relative_error=3.28203m
fma_i8_sapphire<1536d>/min_time:10.000/threads:1   94.0 ns  94.0 ns  149003863  abs_delta=10.1192 bytes=49.0403G/s pairs=10.6424M/s relative_error=6.98359m
wsum_i8_sapphire<1536d>/min_time:10.000/threads:1  70.4 ns  70.4 ns  198873173  abs_delta=9.76862 bytes=43.613G/s pairs=14.197M/s relative_error=9.3472m
fma_f64_skylake<1536d>/min_time:10.000/threads:1   1340 ns  1340 ns   10460553  abs_delta=39.3003a bytes=27.5182G/s pairs=746.479k/s relative_error=78.2836a
wsum_f64_skylake<1536d>/min_time:10.000/threads:1  1036 ns  1036 ns   13484768  abs_delta=28.4608a bytes=23.717G/s pairs=965.047k/s relative_error=78.6298a
fma_f32_skylake<1536d>/min_time:10.000/threads:1    626 ns   626 ns   22261554  abs_delta=25.3818n bytes=29.4286G/s pairs=1.5966M/s relative_error=50.5553n
wsum_f32_skylake<1536d>/min_time:10.000/threads:1   386 ns   386 ns   35032887  abs_delta=19.7444n bytes=31.8146G/s pairs=2.58908M/s relative_error=54.5454n
fma_bf16_skylake<1536d>/min_time:10.000/threads:1   188 ns   188 ns   74667249  abs_delta=415.805u bytes=48.9511G/s pairs=5.31154M/s relative_error=827.962u
wsum_bf16_skylake<1536d>/min_time:10.000/threads:1  147 ns   147 ns   95128759  abs_delta=269.793u bytes=41.8834G/s pairs=6.81696M/s relative_error=745.331u
fma_f16_serial<1536d>/min_time:10.000/threads:1     900 ns   900 ns   15592180  abs_delta=2.97965u bytes=10.2444G/s pairs=1.11159M/s relative_error=5.93995u
wsum_f16_serial<1536d>/min_time:10.000/threads:1    821 ns   821 ns   17058449  abs_delta=1.11521u bytes=7.48594G/s pairs=1.21841M/s relative_error=3.07961u
fma_u8_serial<1536d>/min_time:10.000/threads:1     6692 ns  6692 ns    2089290  abs_delta=1.66854 bytes=688.583M/s pairs=149.432k/s relative_error=440.882u
wsum_u8_serial<1536d>/min_time:10.000/threads:1    5577 ns  5577 ns    2508971  abs_delta=2.32787 bytes=550.797M/s pairs=179.296k/s relative_error=859.403u
fma_i8_serial<1536d>/min_time:10.000/threads:1     6874 ns  6874 ns    2039761  abs_delta=5.14013 bytes=670.367M/s pairs=145.479k/s relative_error=3.54862m
wsum_i8_serial<1536d>/min_time:10.000/threads:1    5851 ns  5851 ns    2394538  abs_delta=6.36953 bytes=525.018M/s pairs=170.904k/s relative_error=6.09231m
```

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.8.0

Release: v5.8.0 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.7.3

Release: v5.7.3 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.7.2

Release: v5.7.2 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.7.1

Release: v5.7.1 [skip ci]

Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.7.0

There are several ongoing efforts to extend the functionality of SimSIMD and this PR prepares some of the groundwork for:

  • 🆕 AMD Turin capability level
  • 🆕 Intel Sierra Forest capability level

Those are some amazing CPUs, featuring up to 244 cores per socket, with reduced latencies for some very powerful instructions. Moreover, SimSIMD now provides:

  • 🆕 Spatial kernels for sub-byte i4 vectors
  • 🆕 Sparse Dot Products

This PR also:

  • [x] Fixes cdist for complex inputs
  • [x] Enables dynamic dispatch in Swift
  • [x] Enables dynamic dispatch in JavaScript
  • [x] Ships a new benchmarking suite
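The new sparse dot products operate on vectors stored as sorted index/value pairs. For intuition, the kernel's logic can be sketched in scalar Python (illustrative names only, not the SimSIMD API):

```python
def sparse_dot(idx_a, val_a, idx_b, val_b):
    # Dot product of two sparse vectors, each given as a sorted array of
    # indices plus a parallel array of values. A two-pointer merge finds
    # the common indices and accumulates the products of their values.
    i = j = 0
    total = 0.0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] < idx_b[j]:
            i += 1
        elif idx_a[i] > idx_b[j]:
            j += 1
        else:  # common index: multiply the matching values
            total += val_a[i] * val_b[j]
            i += 1
            j += 1
    return total

# Indices 4 and 7 are shared: 2.0 * 10.0 + 3.0 * 20.0 = 80.0
print(sparse_dot([1, 4, 7], [1.0, 2.0, 3.0], [4, 7, 9], [10.0, 20.0, 30.0]))
```

The SIMD versions vectorize the index-matching step, but the accumulation semantics are the same.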

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.6.4

Release: v5.6.4 [skip ci]

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.6.3

Release: v5.6.3 [skip ci]

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.6.2

Release: v5.6.2 [skip ci]

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.6.1

Release: v5.6.1 [skip ci]

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.6.0

Release: v5.6.0 [skip ci]

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.5.1

Release: v5.5.1 [skip ci]

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.5.0

Release: v5.5.0 [skip ci]

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.4.4

Release: v5.4.4 [skip ci]

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.4.3

Release: v5.4.3 [skip ci]

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.4.2

Release: v5.4.2 [skip ci]

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.4.1

Release: v5.4.1 [skip ci]

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - v5.4: 100x – 10'000x More Accurate Cosine Distance

The cosine similarity is the most common and straightforward metric used in machine learning and information retrieval. Interestingly, there are multiple ways to shoot yourself in the foot when computing it. The cosine distance is the complement of the cosine similarity, which is the cosine of the angle between two vectors.

```math
\text{CosineSimilarity}(a, b) = \frac{a \cdot b}{\|a\| \cdot \|b\|}
```

```math
\text{CosineDistance}(a, b) = 1 - \frac{a \cdot b}{\|a\| \cdot \|b\|}
```

In NumPy terms, SimSIMD implementation is similar to:

```python
import numpy as np

def cos_numpy(a: np.ndarray, b: np.ndarray) -> float:
    ab, a2, b2 = np.dot(a, b), np.dot(a, a), np.dot(b, b)  # Fused in SimSIMD
    if a2 == 0 and b2 == 0:
        result = 0  # Same in SciPy
    elif ab == 0:
        result = 1  # Division by zero error in SciPy
    else:
        result = 1 - ab / (np.sqrt(a2) * np.sqrt(b2))  # Bigger rounding error in SciPy
    return result
```

In SciPy, however, the cosine distance is computed as `1 - ab / np.sqrt(a2 * b2)`. It handles the edge case of a zero and a non-zero argument pair differently, raising a division-by-zero error. That formulation is not only less efficient, but also less accurate, given how the reciprocal square roots are computed. The C standard library provides the `sqrt` function, which is generally very accurate, but slow. The in-hardware `rsqrt` implementations are faster, but have different accuracy characteristics:

  • SSE rsqrtps and AVX vrsqrtps: $1.5 \times 2^{-12}$ maximal error.
  • AVX-512 vrsqrt14pd instruction: $2^{-14}$ maximal error.
  • NEON frsqrte instruction has no clear error bounds.
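Besides `rsqrt` accuracy, the fused `np.sqrt(a2 * b2)` form can also overflow in low precision, whereas taking the square roots separately keeps every intermediate representable. A minimal NumPy sketch (an illustration, not SimSIMD code):

```python
import numpy as np

a2 = np.float32(1e30)  # squared norm of one vector
b2 = np.float32(1e30)  # squared norm of the other

# Fused form: the product a2 * b2 exceeds float32's max (~3.4e38) -> inf
fused = np.sqrt(a2 * b2)

# Separate roots: 1e15 * 1e15 = 1e30 stays finite at every step
separate = np.sqrt(a2) * np.sqrt(b2)
```

With `float64` inputs this overflow is far less likely, but the fused form still performs the rounding in a single wider product, which is one reason its error characteristics differ.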

To overcome the limitations of the `rsqrt` instructions, SimSIMD uses a Newton-Raphson iteration to refine the initial estimate for high-precision floating-point numbers. For an estimate $x_n \approx 1/\sqrt{a}$, one step can be defined as:

```math
x_{n+1} = x_n \cdot (3 - a \cdot x_n^2) / 2
```
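A quick numeric sanity check of that refinement step, as a toy Python sketch rather than the SIMD kernel itself:

```python
import math

def rsqrt_step(a: float, x: float) -> float:
    # One Newton-Raphson refinement of x ~ 1/sqrt(a):
    # x' = x * (3 - a * x^2) / 2
    return x * (3.0 - a * x * x) / 2.0

a = 2.0
exact = 1.0 / math.sqrt(a)   # 0.7071067811...
x0 = 0.7                     # crude hardware-style estimate, error ~7e-3
x1 = rsqrt_step(a, x0)       # error shrinks to ~1e-4
x2 = rsqrt_step(a, x1)       # error shrinks to ~2e-8
```

The error roughly squares with each step, which is why a single iteration already buys several orders of magnitude, as the tables below show.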

On 1536-dimensional inputs on an Intel Sapphire Rapids CPU, a single such iteration can reduce the relative error by 2-3 orders of magnitude:

| Datatype | NumPy Error | SimSIMD w/out Iteration | SimSIMD |
| :--------- | ------------------: | ----------------------: | ------------------: |
| bfloat16 | 1.89e-08 ± 1.59e-08 | 3.07e-07 ± 3.09e-07 | 3.53e-09 ± 2.70e-09 |
| float16 | 1.67e-02 ± 1.44e-02 | 2.68e-05 ± 1.95e-05 | 2.02e-05 ± 1.39e-05 |
| float32 | 2.21e-08 ± 1.65e-08 | 3.47e-07 ± 3.49e-07 | 3.77e-09 ± 2.84e-09 |
| float64 | 0.00e+00 ± 0.00e+00 | 3.80e-07 ± 4.50e-07 | 1.35e-11 ± 1.85e-11 |

On Arm:

| Datatype | NumPy Error | SimSIMD w/out Iteration | SimSIMD |
| :--------- | ------------------: | ----------------------: | ------------------: |
| bfloat16 | 1.55e-09 ± 1.27e-09 | 2.79e-05 ± 3.60e-05 | 2.09e-08 ± 1.50e-08 |
| float16 | 1.05e-05 ± 9.99e-06 | 4.97e-05 ± 4.33e-05 | 4.81e-05 ± 3.38e-05 |
| float32 | 2.37e-09 ± 1.88e-09 | 1.79e-05 ± 1.69e-05 | 9.02e-09 ± 7.16e-09 |
| float64 | 0.00e+00 ± 0.00e+00 | 2.54e-05 ± 2.32e-05 | 2.23e-13 ± 4.67e-13 |

Benchmarks

x86: Intel Sapphire Rapids

Baseline

```sh
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
|    | Metric | NDim | DType    | Baseline Error      | SimSIMD Error       | Accurate Duration   | Baseline Duration   | SimSIMD Duration    | SimSIMD Speedup |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
| 0 | cosine | 11 | bfloat16 | 9.86e-09 ± 1.58e-08 | 3.35e-04 ± 4.42e-04 | 2.16e+04 ± 1.18e+03 | 2.42e+04 ± 2.79e+03 | 2.51e+03 ± 4.17e+02 | 9.90x ± 2.24x |
| 1 | cosine | 11 | float16 | 1.46e-04 ± 1.83e-04 | 5.09e-04 ± 7.05e-04 | 2.16e+04 ± 1.27e+03 | 2.53e+04 ± 2.54e+03 | 1.15e+03 ± 9.31e+01 | 22.17x ± 1.76x |
| 2 | cosine | 11 | float32 | 2.13e-08 ± 2.20e-08 | 2.69e-04 ± 4.08e-04 | 2.14e+04 ± 1.51e+03 | 2.37e+04 ± 3.52e+03 | 1.96e+03 ± 6.73e+03 | 23.21x ± 3.90x |
| 3 | cosine | 11 | float64 | 0.00e+00 ± 0.00e+00 | 4.51e-04 ± 5.78e-04 | 2.57e+04 ± 1.16e+04 | 1.57e+04 ± 1.55e+03 | 1.51e+03 ± 9.03e+02 | 11.57x ± 2.21x |
| 4 | cosine | 11 | int8 | 0.00e+00 ± 0.00e+00 | 4.56e-04 ± 5.30e-04 | 1.59e+04 ± 6.32e+02 | 1.60e+04 ± 5.11e+02 | 1.72e+03 ± 6.12e+02 | 9.89x ± 1.86x |
| 5 | cosine | 97 | bfloat16 | 6.71e-09 ± 7.90e-09 | 1.31e-04 ± 1.47e-04 | 2.14e+04 ± 9.71e+02 | 2.36e+04 ± 4.33e+02 | 2.47e+03 ± 3.95e+02 | 9.82x ± 1.71x |
| 6 | cosine | 97 | float16 | 3.00e-05 ± 2.42e-05 | 1.00e-04 ± 7.79e-05 | 2.15e+04 ± 1.70e+03 | 2.70e+04 ± 2.02e+03 | 1.18e+03 ± 8.51e+01 | 22.89x ± 2.06x |
| 7 | cosine | 97 | float32 | 6.84e-09 ± 5.72e-09 | 1.13e-04 ± 1.19e-04 | 2.19e+04 ± 1.84e+03 | 2.33e+04 ± 1.91e+03 | 1.04e+03 ± 9.38e+01 | 22.44x ± 2.38x |
| 8 | cosine | 97 | float64 | 0.00e+00 ± 0.00e+00 | 9.69e-05 ± 1.54e-04 | 2.13e+04 ± 2.00e+03 | 1.54e+04 ± 1.39e+03 | 1.30e+03 ± 1.20e+02 | 11.92x ± 1.47x |
| 9 | cosine | 97 | int8 | 0.00e+00 ± 0.00e+00 | 1.14e-04 ± 1.33e-04 | 1.56e+04 ± 4.34e+02 | 1.60e+04 ± 3.64e+02 | 1.57e+03 ± 2.48e+02 | 10.43x ± 1.55x |
| 10 | cosine | 1536 | bfloat16 | 1.55e-09 ± 1.27e-09 | 2.79e-05 ± 3.60e-05 | 2.78e+04 ± 1.54e+03 | 2.73e+04 ± 4.66e+02 | 2.83e+03 ± 3.41e+02 | 9.82x ± 1.25x |
| 11 | cosine | 1536 | float16 | 1.05e-05 ± 9.99e-06 | 4.97e-05 ± 4.33e-05 | 2.56e+04 ± 2.02e+03 | 5.44e+04 ± 1.77e+03 | 1.48e+03 ± 1.78e+02 | 37.23x ± 4.42x |
| 12 | cosine | 1536 | float32 | 2.37e-09 ± 1.88e-09 | 1.79e-05 ± 1.69e-05 | 2.49e+04 ± 1.29e+03 | 2.63e+04 ± 5.41e+03 | 1.56e+03 ± 3.41e+02 | 17.46x ± 3.77x |
| 13 | cosine | 1536 | float64 | 0.00e+00 ± 0.00e+00 | 2.54e-05 ± 2.32e-05 | 2.51e+04 ± 2.21e+03 | 1.87e+04 ± 2.87e+02 | 2.39e+03 ± 6.24e+02 | 8.25x ± 1.68x |
| 14 | cosine | 1536 | int8 | 0.00e+00 ± 0.00e+00 | 3.06e-05 ± 3.12e-05 | 1.91e+04 ± 1.14e+03 | 2.18e+04 ± 1.17e+03 | 1.72e+03 ± 2.66e+02 | 13.00x ± 2.13x |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
```

With 1 Iteration

```sh
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
|    | Metric | NDim | DType    | Baseline Error      | SimSIMD Error       | Accurate Duration   | Baseline Duration   | SimSIMD Duration    | SimSIMD Speedup |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
| 0 | cosine | 11 | bfloat16 | 3.04e-08 ± 2.53e-08 | 3.63e-09 ± 6.75e-09 | 1.24e+04 ± 8.90e+02 | 7.19e+03 ± 4.48e+02 | 2.75e+03 ± 7.66e+02 | 2.71x ± 0.37x |
| 1 | cosine | 11 | float16 | 2.61e-04 ± 2.45e-04 | 2.12e-04 ± 3.90e-04 | 1.24e+04 ± 9.59e+02 | 9.28e+03 ± 1.72e+03 | 1.27e+03 ± 5.28e+02 | 7.65x ± 1.19x |
| 2 | cosine | 11 | float32 | 2.91e-08 ± 1.81e-08 | 1.20e-08 ± 1.26e-08 | 1.36e+04 ± 4.00e+03 | 8.32e+03 ± 2.87e+03 | 1.09e+03 ± 1.63e+02 | 7.55x ± 1.54x |
| 3 | cosine | 11 | float64 | 0.00e+00 ± 0.00e+00 | 3.35e-10 ± 4.33e-10 | 1.35e+04 ± 7.24e+03 | 6.02e+03 ± 8.72e+02 | 1.58e+03 ± 1.18e+03 | 4.45x ± 1.08x |
| 4 | cosine | 11 | int8 | 0.00e+00 ± 0.00e+00 | 2.81e-03 ± 1.80e-02 | 9.17e+03 ± 4.17e+03 | 8.02e+03 ± 1.51e+03 | 1.76e+03 ± 2.00e+02 | 4.56x ± 0.81x |
| 5 | cosine | 97 | bfloat16 | 2.02e-08 ± 1.25e-08 | 3.44e-09 ± 4.63e-09 | 1.34e+04 ± 3.38e+03 | 7.79e+03 ± 2.84e+03 | 2.55e+03 ± 1.09e+02 | 3.05x ± 1.07x |
| 6 | cosine | 97 | float16 | 1.97e-04 ± 1.18e-04 | 5.37e-05 ± 4.52e-05 | 1.26e+04 ± 1.11e+03 | 1.06e+04 ± 2.95e+03 | 1.19e+03 ± 1.46e+02 | 8.90x ± 1.93x |
| 7 | cosine | 97 | float32 | 2.39e-08 ± 1.36e-08 | 5.66e-09 ± 4.83e-09 | 1.31e+04 ± 3.11e+03 | 7.78e+03 ± 1.25e+03 | 1.26e+03 ± 8.08e+02 | 6.92x ± 1.63x |
| 8 | cosine | 97 | float64 | 0.00e+00 ± 0.00e+00 | 6.84e-11 ± 1.10e-10 | 1.25e+04 ± 1.21e+03 | 6.51e+03 ± 1.69e+03 | 1.37e+03 ± 3.63e+02 | 4.89x ± 1.28x |
| 9 | cosine | 97 | int8 | 0.00e+00 ± 0.00e+00 | 1.93e-03 ± 4.20e-03 | 8.37e+03 ± 1.87e+03 | 7.89e+03 ± 7.80e+02 | 2.02e+03 ± 1.66e+03 | 4.34x ± 0.69x |
| 10 | cosine | 1536 | bfloat16 | 2.25e-08 ± 1.61e-08 | 3.53e-09 ± 2.70e-09 | 1.52e+04 ± 2.81e+03 | 8.28e+03 ± 5.26e+02 | 3.07e+03 ± 1.32e+02 | 2.70x ± 0.20x |
| 11 | cosine | 1536 | float16 | 2.00e-02 ± 1.76e-02 | 2.02e-05 ± 1.39e-05 | 1.43e+04 ± 2.37e+03 | 2.74e+04 ± 3.46e+03 | 1.38e+03 ± 1.25e+02 | 19.98x ± 2.35x |
| 12 | cosine | 1536 | float32 | 2.24e-08 ± 1.40e-08 | 3.77e-09 ± 2.84e-09 | 1.36e+04 ± 2.64e+03 | 8.64e+03 ± 8.04e+02 | 1.23e+03 ± 8.10e+01 | 7.06x ± 0.72x |
| 13 | cosine | 1536 | float64 | 0.00e+00 ± 0.00e+00 | 1.35e-11 ± 1.85e-11 | 1.34e+04 ± 1.27e+03 | 7.31e+03 ± 8.12e+02 | 1.98e+03 ± 2.02e+03 | 4.49x ± 1.01x |
| 14 | cosine | 1536 | int8 | 0.00e+00 ± 0.00e+00 | 4.20e-04 ± 4.88e-04 | 9.47e+03 ± 2.09e+03 | 1.01e+04 ± 1.11e+03 | 1.95e+03 ± 1.04e+02 | 5.19x ± 0.56x |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
```

Arm: AWS Graviton 3

Baseline

```sh
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
|    | Metric | NDim | DType    | Baseline Error      | SimSIMD Error       | Accurate Duration   | Baseline Duration   | SimSIMD Duration    | SimSIMD Speedup |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
| 0 | cosine | 11 | bfloat16 | 1.15e-08 ± 1.76e-08 | 3.27e-08 ± 2.04e-08 | 2.15e+04 ± 9.10e+02 | 2.34e+04 ± 7.89e+02 | 2.68e+03 ± 2.09e+03 | 9.91x ± 2.11x |
| 1 | cosine | 11 | float16 | 1.36e-04 ± 2.03e-04 | 1.30e-04 ± 1.46e-04 | 2.12e+04 ± 1.69e+03 | 2.57e+04 ± 3.00e+03 | 9.60e+02 ± 7.11e+01 | 26.76x ± 2.32x |
| 2 | cosine | 11 | float32 | 1.87e-08 ± 1.99e-08 | 3.84e-04 ± 4.15e-04 | 2.08e+04 ± 1.68e+03 | 2.35e+04 ± 3.43e+03 | 8.79e+02 ± 8.79e+01 | 26.85x ± 2.98x |
| 3 | cosine | 11 | float64 | 0.00e+00 ± 0.00e+00 | 6.10e-04 ± 1.27e-03 | 2.50e+04 ± 1.20e+04 | 1.55e+04 ± 1.45e+03 | 1.24e+03 ± 7.82e+02 | 13.99x ± 2.64x |
| 4 | cosine | 11 | int8 | 0.00e+00 ± 0.00e+00 | 2.37e-08 ± 1.57e-08 | 1.59e+04 ± 7.21e+02 | 1.64e+04 ± 3.14e+03 | 1.48e+03 ± 2.97e+02 | 11.38x ± 2.44x |
| 5 | cosine | 97 | bfloat16 | 5.98e-09 ± 6.36e-09 | 2.19e-08 ± 1.39e-08 | 2.14e+04 ± 7.54e+02 | 2.35e+04 ± 1.15e+03 | 2.31e+03 ± 3.70e+02 | 10.45x ± 2.11x |
| 6 | cosine | 97 | float16 | 3.40e-05 ± 2.87e-05 | 5.63e-05 ± 4.57e-05 | 2.13e+04 ± 2.05e+03 | 2.67e+04 ± 1.57e+03 | 9.43e+02 ± 6.95e+01 | 28.48x ± 2.48x |
| 7 | cosine | 97 | float32 | 9.55e-09 ± 7.13e-09 | 9.71e-05 ± 1.50e-04 | 2.06e+04 ± 1.38e+03 | 2.27e+04 ± 1.03e+03 | 8.77e+02 ± 6.84e+01 | 26.02x ± 2.05x |
| 8 | cosine | 97 | float64 | 0.00e+00 ± 0.00e+00 | 1.31e-04 ± 1.89e-04 | 2.07e+04 ± 2.22e+03 | 1.53e+04 ± 1.25e+03 | 1.06e+03 ± 1.08e+02 | 14.52x ± 1.51x |
| 9 | cosine | 97 | int8 | 0.00e+00 ± 0.00e+00 | 2.06e-08 ± 1.53e-08 | 1.59e+04 ± 2.18e+03 | 1.58e+04 ± 1.79e+02 | 1.37e+03 ± 2.12e+02 | 11.81x ± 1.86x |
| 10 | cosine | 1536 | bfloat16 | 1.76e-09 ± 1.55e-09 | 2.07e-08 ± 1.44e-08 | 2.84e+04 ± 1.25e+03 | 2.77e+04 ± 7.20e+02 | 3.20e+03 ± 3.70e+02 | 8.80x ± 1.12x |
| 11 | cosine | 1536 | float16 | 8.31e-06 ± 7.39e-06 | 4.23e-05 ± 3.41e-05 | 2.50e+04 ± 1.64e+03 | 5.42e+04 ± 2.20e+03 | 1.22e+03 ± 1.41e+02 | 44.85x ± 4.00x |
| 12 | cosine | 1536 | float32 | 2.64e-09 ± 1.97e-09 | 2.61e-05 ± 3.13e-05 | 2.44e+04 ± 3.00e+03 | 2.57e+04 ± 1.63e+03 | 1.25e+03 ± 1.69e+02 | 20.80x ± 1.94x |
| 13 | cosine | 1536 | float64 | 0.00e+00 ± 0.00e+00 | 1.59e-05 ± 1.54e-05 | 2.47e+04 ± 1.90e+03 | 1.90e+04 ± 1.83e+03 | 1.90e+03 ± 4.16e+02 | 10.32x ± 1.99x |
| 14 | cosine | 1536 | int8 | 0.00e+00 ± 0.00e+00 | 2.04e-08 ± 1.39e-08 | 1.90e+04 ± 1.38e+03 | 2.15e+04 ± 3.39e+02 | 1.48e+03 ± 2.51e+02 | 14.91x ± 2.54x |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
```

With 2 Iterations

```sh
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
|    | Metric | NDim | DType    | Baseline Error      | SimSIMD Error       | Accurate Duration   | Baseline Duration   | SimSIMD Duration    | SimSIMD Speedup |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
| 0 | cosine | 11 | bfloat16 | 1.54e-08 ± 2.76e-08 | 2.94e-08 ± 2.62e-08 | 2.09e+04 ± 1.18e+03 | 2.37e+04 ± 2.20e+03 | 2.16e+03 ± 4.40e+02 | 11.41x ± 2.71x |
| 1 | cosine | 11 | float16 | 1.32e-04 ± 1.43e-04 | 1.96e-04 ± 2.77e-04 | 2.19e+04 ± 1.90e+03 | 2.59e+04 ± 4.38e+03 | 9.59e+02 ± 9.05e+01 | 27.03x ± 3.00x |
| 2 | cosine | 11 | float32 | 3.44e-08 ± 4.95e-08 | 2.11e-08 ± 2.49e-08 | 2.11e+04 ± 1.36e+03 | 2.37e+04 ± 4.08e+03 | 8.57e+02 ± 8.07e+01 | 27.65x ± 3.71x |
| 3 | cosine | 11 | float64 | 0.00e+00 ± 0.00e+00 | 8.65e-12 ± 1.39e-11 | 2.52e+04 ± 1.22e+04 | 1.56e+04 ± 1.36e+03 | 1.32e+03 ± 7.76e+02 | 13.13x ± 2.61x |
| 4 | cosine | 11 | int8 | 0.00e+00 ± 0.00e+00 | 3.03e-08 ± 3.66e-08 | 1.61e+04 ± 1.03e+03 | 1.60e+04 ± 5.69e+02 | 1.58e+03 ± 3.06e+02 | 10.39x ± 1.62x |
| 5 | cosine | 97 | bfloat16 | 5.22e-09 ± 4.67e-09 | 2.43e-08 ± 1.48e-08 | 2.12e+04 ± 8.81e+02 | 2.38e+04 ± 1.98e+03 | 2.13e+03 ± 4.24e+02 | 11.58x ± 2.17x |
| 6 | cosine | 97 | float16 | 3.17e-05 ± 3.81e-05 | 6.11e-05 ± 5.12e-05 | 2.15e+04 ± 1.56e+03 | 2.70e+04 ± 2.32e+03 | 9.84e+02 ± 9.83e+01 | 27.66x ± 3.59x |
| 7 | cosine | 97 | float32 | 7.65e-09 ± 6.03e-09 | 8.76e-09 ± 5.92e-09 | 2.14e+04 ± 1.90e+03 | 2.31e+04 ± 1.93e+03 | 9.10e+02 ± 8.64e+01 | 25.54x ± 3.07x |
| 8 | cosine | 97 | float64 | 0.00e+00 ± 0.00e+00 | 1.48e-12 ± 2.76e-12 | 2.11e+04 ± 1.81e+03 | 1.53e+04 ± 6.54e+02 | 1.15e+03 ± 1.13e+02 | 13.34x ± 1.24x |
| 9 | cosine | 97 | int8 | 0.00e+00 ± 0.00e+00 | 2.29e-08 ± 1.49e-08 | 1.60e+04 ± 2.33e+03 | 1.61e+04 ± 2.06e+03 | 1.41e+03 ± 2.06e+02 | 11.64x ± 1.95x |
| 10 | cosine | 1536 | bfloat16 | 2.04e-09 ± 1.61e-09 | 2.09e-08 ± 1.50e-08 | 2.84e+04 ± 1.13e+03 | 2.81e+04 ± 1.77e+03 | 2.98e+03 ± 4.43e+02 | 9.62x ± 1.47x |
| 11 | cosine | 1536 | float16 | 8.23e-06 ± 8.19e-06 | 4.81e-05 ± 3.38e-05 | 2.57e+04 ± 2.31e+03 | 5.45e+04 ± 2.38e+03 | 1.23e+03 ± 1.69e+02 | 44.93x ± 5.18x |
| 12 | cosine | 1536 | float32 | 2.41e-09 ± 1.51e-09 | 9.02e-09 ± 7.16e-09 | 2.53e+04 ± 3.03e+03 | 2.59e+04 ± 1.42e+03 | 1.46e+03 ± 2.97e+02 | 18.45x ± 3.57x |
| 13 | cosine | 1536 | float64 | 0.00e+00 ± 0.00e+00 | 2.23e-13 ± 4.67e-13 | 2.57e+04 ± 4.05e+03 | 1.87e+04 ± 9.64e+02 | 2.27e+03 ± 6.14e+02 | 8.75x ± 1.99x |
| 14 | cosine | 1536 | int8 | 0.00e+00 ± 0.00e+00 | 2.32e-08 ± 1.39e-08 | 1.92e+04 ± 2.52e+03 | 2.17e+04 ± 4.07e+02 | 1.48e+03 ± 2.82e+02 | 15.13x ± 2.60x |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
```

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - v5.3: 5x Faster Set Intersections with SVE2, NEON, and AVX-512


  • Need to accelerate TF-IDF search ranking?
  • Joining large tables in an OLAP database?
  • Implementing graph algorithms?

Chances are - you need fast set intersections! It's one of the most common operations in programming, yet one of the hardest to accelerate with SIMD! This PR improves the existing kernels and adds new ones for fast set intersections of sorted arrays of unique u16 and u32 values. Now, SimSIMD is not only practically the only production codebase to use Arm SVE, but also one of the first to use the new SVE2 instructions available on AWS Graviton 4 CPUs, and coming to Nvidia's Grace Hopper, Microsoft Cobalt, and Google Axion! So upgrade to v5.3 and let's make the databases & search systems go 5x faster!
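The kernels accelerate the classic merge-style intersection of two sorted arrays of unique values. As a scalar reference, here's a minimal Python sketch of the baseline logic (illustrative names, not the SimSIMD API):

```python
def intersect_sorted(a: list[int], b: list[int]) -> list[int]:
    # Two-pointer merge: advance the pointer with the smaller head value,
    # emit a value when both heads match. O(|A| + |B|) comparisons.
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:  # match found in both sets
            out.append(a[i])
            i += 1
            j += 1
    return out

print(intersect_sorted([1, 3, 5, 7], [3, 4, 5, 8]))  # [3, 5]
```

The SIMD versions compare whole registers of elements from both arrays at once (e.g. with AVX-512 `vp2intersect`-style matching or SVE2 `MATCH`), which is where the speedups below come from.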

Speedups on x86

The new AVX-512 variant shows significant improvements in pairs/s across most benchmarks:

  • For |A|=128, |B|=128, |A∩B|=1, pairs/s increased
    • from 1.14M/s in the old implementation
    • to 7.73M/s in the new one, a 6.7x improvement.
  • At |A∩B|=64, the pairs/s rose:
    • from 1.13M/s in the old implementation
    • to 8.19M/s in the new one, a 7.2x gain.
  • For larger sets, like |A|=1024 and |B|=8192, with |A∩B|=10, pairs/s increased:
    • from 130.18k/s in the old implementation
    • to 194.50k/s in the new one, a 49% gain.

However, in cases like |A|=128, |B|=8192, with |A∩B|=64, pairs/s decreased noticeably, from 369.7k/s to 222.9k/s. Overall, the new implementation outperforms the previous one, and no case is worse than the serial version.

Speedups on Arm

On the Arm architecture, similar performance gains were achieved using the NEON and SVE2 instruction sets:

  • The optimized NEON implementation showed a 3.9x improvement in pairs/s for |A|=128, |B|=128, |A∩B|=1, going from 1.62M/s to 5.12M/s.
  • For |A∩B|=64 in the same configuration, performance improved from 1.60M/s to 5.51M/s, showing a 3.4x gain.
  • The SVE2 implementation also outperformed the previous SVE setup, achieving 5.60M/s (SVE2) versus 1.27M/s (old SVE) for |A|=128, |B|=128, |A∩B|=1, a 4.4x improvement.
  • In larger datasets, such as |A|=1024 and |B|=8192, the pairs/s increased from 49.3k/s to 110.03k/s with NEON, and a comparable 109.47k/s with SVE2, more than doubling the performance.

x86 Benchmarking Setup

The benchmarking was conducted on r7iz AWS instances with Intel Sapphire Rapids CPUs.

```sh
Running build_release/simsimd_bench
Run on (16 X 3900.51 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 2048 KiB (x8)
  L3 Unified 61440 KiB (x1)
```

Old Serial Baselines

```sh

Benchmark Time CPU Iterations UserCounters...

intersect_u16_serial<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1 567 ns 567 ns 24785678 pairs=1.76263M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1 567 ns 567 ns 24598141 pairs=1.76286M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1 569 ns 569 ns 24741572 pairs=1.75684M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1 568 ns 568 ns 24871638 pairs=1.76073M/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1 2508 ns 2508 ns 5591748 pairs=398.803k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1 2509 ns 2509 ns 5589871 pairs=398.535k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1 2530 ns 2530 ns 5564535 pairs=395.33k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1 2522 ns 2522 ns 5532306 pairs=396.447k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1 4791 ns 4791 ns 2920833 pairs=208.737k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1 4800 ns 4800 ns 2923139 pairs=208.346k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1 4821 ns 4820 ns 2906942 pairs=207.448k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1 4843 ns 4843 ns 2897334 pairs=206.504k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1 4484 ns 4484 ns 3122873 pairs=223.023k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1 4479 ns 4479 ns 3124662 pairs=223.261k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1 4484 ns 4484 ns 3125584 pairs=223.034k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1 4500 ns 4500 ns 3104588 pairs=222.229k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1 20118 ns 20117 ns 696244 pairs=49.7084k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1 20134 ns 20134 ns 696160 pairs=49.6682k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1 20125 ns 20124 ns 695799 pairs=49.6911k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1 20102 ns 20102 ns 695762 pairs=49.7464k/s
```

Existing AVX-512 Implementation

```sh

Benchmark Time CPU Iterations UserCounters...

intersect_u16_ice<|A|=128d,|B|=128,|A∩B|=1>/min_time:10.000/threads:1 875 ns 875 ns 16248886 pairs=1.14342M/s
intersect_u16_ice<|A|=128d,|B|=128,|A∩B|=6>/min_time:10.000/threads:1 873 ns 873 ns 16081249 pairs=1.14555M/s
intersect_u16_ice<|A|=128d,|B|=128,|A∩B|=64>/min_time:10.000/threads:1 882 ns 882 ns 15851609 pairs=1.13354M/s
intersect_u16_ice<|A|=128d,|B|=128,|A∩B|=121>/min_time:10.000/threads:1 916 ns 916 ns 15282595 pairs=1091.32k/s
intersect_u16_ice<|A|=128d,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1 955 ns 955 ns 14660187 pairs=1047.53k/s
intersect_u16_ice<|A|=128d,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1 955 ns 955 ns 14663375 pairs=1047.57k/s
intersect_u16_ice<|A|=128d,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1 952 ns 952 ns 14702462 pairs=1050.17k/s
intersect_u16_ice<|A|=128d,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1 949 ns 949 ns 14743103 pairs=1053.59k/s
intersect_u16_ice<|A|=128d,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1 2718 ns 2718 ns 5168053 pairs=367.871k/s
intersect_u16_ice<|A|=128d,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1 2698 ns 2698 ns 5155819 pairs=370.664k/s
intersect_u16_ice<|A|=128d,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1 2705 ns 2705 ns 5203675 pairs=369.686k/s
intersect_u16_ice<|A|=128d,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1 2693 ns 2693 ns 5187007 pairs=371.377k/s
intersect_u16_ice<|A|=1024d,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1 7310 ns 7310 ns 1910292 pairs=136.8k/s
intersect_u16_ice<|A|=1024d,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1 7312 ns 7312 ns 1913190 pairs=136.759k/s
intersect_u16_ice<|A|=1024d,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1 7365 ns 7365 ns 1900946 pairs=135.781k/s
intersect_u16_ice<|A|=1024d,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1 7439 ns 7439 ns 1882319 pairs=134.43k/s
intersect_u16_ice<|A|=1024d,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1 7682 ns 7681 ns 1821784 pairs=130.183k/s
intersect_u16_ice<|A|=1024d,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1 7695 ns 7695 ns 1821861 pairs=129.955k/s
intersect_u16_ice<|A|=1024d,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1 7643 ns 7643 ns 1829955 pairs=130.842k/s
intersect_u16_ice<|A|=1024d,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1 7617 ns 7617 ns 1838612 pairs=131.279k/s
```

New AVX-512 Implementation

```sh

Benchmark Time CPU Iterations UserCounters...

intersect_u16_ice<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1 129 ns 129 ns 101989513 pairs=7.72559M/s
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1 134 ns 134 ns 107140278 pairs=7.46949M/s
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1 122 ns 122 ns 113134485 pairs=8.18634M/s
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1 114 ns 114 ns 122765163 pairs=8.75268M/s
intersect_u16_ice<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1 1042 ns 1042 ns 13412933 pairs=959.711k/s
intersect_u16_ice<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1 1035 ns 1035 ns 13423867 pairs=966.278k/s
intersect_u16_ice<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1 1038 ns 1038 ns 13401265 pairs=963.267k/s
intersect_u16_ice<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1 1055 ns 1055 ns 13170438 pairs=948.193k/s
intersect_u16_ice<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1 4315 ns 4315 ns 3024069 pairs=231.776k/s
intersect_u16_ice<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1 3999 ns 3999 ns 3371134 pairs=250.088k/s
intersect_u16_ice<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1 4486 ns 4486 ns 3278143 pairs=222.9k/s
intersect_u16_ice<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1 4525 ns 4525 ns 3170802 pairs=220.991k/s
intersect_u16_ice<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1 817 ns 817 ns 17102654 pairs=1.22419M/s
intersect_u16_ice<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1 820 ns 820 ns 17168886 pairs=1.22003M/s
intersect_u16_ice<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1 793 ns 793 ns 17756237 pairs=1.26107M/s
intersect_u16_ice<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1 747 ns 747 ns 18261381 pairs=1.33794M/s
intersect_u16_ice<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1 5142 ns 5142 ns 2728465 pairs=194.496k/s
intersect_u16_ice<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1 5114 ns 5114 ns 2727670 pairs=195.56k/s
intersect_u16_ice<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1 5142 ns 5142 ns 2716714 pairs=194.491k/s
intersect_u16_ice<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1 5151 ns 5151 ns 2721708 pairs=194.148k/s
```

Arm Benchmarking Setup

The benchmarking was conducted on r8g AWS instances with Graviton 4 CPUs.

```sh
Running build_release/simsimd_bench
Run on (2 X 2000 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x2)
  L1 Instruction 64 KiB (x2)
  L2 Unified 2048 KiB (x2)
  L3 Unified 36864 KiB (x1)
```

Old Serial Baselines

```sh

Benchmark Time CPU Iterations UserCounters...

intersect_u16_serial<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1 615 ns 614 ns 22780083 pairs=1.62833M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1 610 ns 608 ns 22727971 pairs=1.64341M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1 622 ns 622 ns 22356453 pairs=1.60786M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1 679 ns 679 ns 20641056 pairs=1.47332M/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1 2542 ns 2542 ns 5511491 pairs=393.332k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1 2539 ns 2539 ns 5512132 pairs=393.822k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1 2535 ns 2535 ns 5511950 pairs=394.436k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1 2546 ns 2546 ns 5504586 pairs=392.843k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1 4122 ns 4122 ns 3374465 pairs=242.586k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1 4117 ns 4117 ns 3372418 pairs=242.884k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1 4138 ns 4138 ns 3374977 pairs=241.657k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1 4142 ns 4142 ns 3361656 pairs=241.412k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1 4569 ns 4564 ns 3072148 pairs=219.129k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1 4557 ns 4557 ns 3075313 pairs=219.419k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1 4577 ns 4577 ns 3052064 pairs=218.472k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1 4728 ns 4728 ns 2980530 pairs=211.504k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1 20278 ns 20273 ns 690191 pairs=49.3276k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1 21192 ns 20272 ns 691680 pairs=49.3302k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1 21438 ns 20268 ns 689617 pairs=49.3384k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1 22010 ns 20317 ns 692675 pairs=49.2207k/s
```

Old SVE Implementation

```sh

Benchmark Time CPU Iterations UserCounters...

intersect_u16_sve<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1 794 ns 788 ns 17715501 pairs=1.26918M/s
intersect_u16_sve<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1 809 ns 785 ns 17579527 pairs=1.27438M/s
intersect_u16_sve<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1 819 ns 810 ns 17229391 pairs=1.23482M/s
intersect_u16_sve<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1 878 ns 856 ns 16347952 pairs=1.16827M/s
intersect_u16_sve<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1 1475 ns 1380 ns 10129190 pairs=724.869k/s
intersect_u16_sve<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1 1400 ns 1361 ns 10312201 pairs=734.514k/s
intersect_u16_sve<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1 1353 ns 1344 ns 10427410 pairs=743.793k/s
intersect_u16_sve<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1 1369 ns 1350 ns 10516190 pairs=740.815k/s
intersect_u16_sve<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1 7156 ns 7009 ns 1991602 pairs=142.677k/s
intersect_u16_sve<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1 7095 ns 6982 ns 2006057 pairs=143.232k/s
intersect_u16_sve<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1 7328 ns 6967 ns 2004803 pairs=143.537k/s
intersect_u16_sve<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1 6966 ns 6963 ns 2013422 pairs=143.624k/s
intersect_u16_sve<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1 7119 ns 6517 ns 2143784 pairs=153.437k/s
intersect_u16_sve<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1 6978 ns 6522 ns 2146331 pairs=153.331k/s
intersect_u16_sve<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1 6721 ns 6533 ns 2141325 pairs=153.067k/s
intersect_u16_sve<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1 7046 ns 6675 ns 2095016 pairs=149.823k/s
intersect_u16_sve<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1 10819 ns 10722 ns 1307796 pairs=93.2695k/s
intersect_u16_sve<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1 11295 ns 10729 ns 1305575 pairs=93.2031k/s
intersect_u16_sve<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1 10596 ns 10596 ns 1317798 pairs=94.3769k/s
intersect_u16_sve<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1 10527 ns 10486 ns 1337148 pairs=95.3626k/s
```

New NEON Implementation

```sh

Benchmark Time CPU Iterations UserCounters...

intersect_u16_neon<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1 195 ns 195 ns 72473251 pairs=5.12346M/s
intersect_u16_neon<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1 193 ns 193 ns 71826322 pairs=5.17983M/s
intersect_u16_neon<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1 181 ns 181 ns 76859132 pairs=5.51211M/s
intersect_u16_neon<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1 161 ns 161 ns 86301671 pairs=6.22906M/s
intersect_u16_neon<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1 1199 ns 1027 ns 13866808 pairs=973.295k/s
intersect_u16_neon<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1 1171 ns 1034 ns 13729254 pairs=966.886k/s
intersect_u16_neon<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1 1120 ns 1038 ns 13671085 pairs=963.804k/s
intersect_u16_neon<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1 1150 ns 1051 ns 13070692 pairs=951.238k/s
intersect_u16_neon<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1 2587 ns 2446 ns 5685615 pairs=408.885k/s
intersect_u16_neon<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1 2595 ns 2490 ns 5538880 pairs=401.615k/s
intersect_u16_neon<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1 2482 ns 2460 ns 5704185 pairs=406.459k/s
intersect_u16_neon<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1 2512 ns 2512 ns 5592948 pairs=398.064k/s
intersect_u16_neon<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1 1599 ns 1573 ns 8893290 pairs=635.781k/s
intersect_u16_neon<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1 1570 ns 1570 ns 8950291 pairs=637.098k/s
intersect_u16_neon<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1 1488 ns 1488 ns 9449103 pairs=672.121k/s
intersect_u16_neon<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1 1332 ns 1332 ns 10582682 pairs=751.007k/s
intersect_u16_neon<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1 8997 ns 8997 ns 1556944 pairs=111.144k/s
intersect_u16_neon<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1 8999 ns 8999 ns 1554324 pairs=111.128k/s
intersect_u16_neon<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1 9126 ns 9070 ns 1543769 pairs=110.257k/s
intersect_u16_neon<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1 9089 ns 9089 ns 1536462 pairs=110.029k/s
```

New SVE2 Implementation

```sh

Benchmark Time CPU Iterations UserCounters...

intersect_u16_sve2<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1 179 ns 178 ns 77997900 pairs=5.60245M/s
intersect_u16_sve2<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1 179 ns 179 ns 77959137 pairs=5.59776M/s
intersect_u16_sve2<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1 170 ns 170 ns 82829421 pairs=5.88598M/s
intersect_u16_sve2<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1 143 ns 143 ns 97771708 pairs=6.9995M/s
intersect_u16_sve2<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1 900 ns 900 ns 15430306 pairs=1.11111M/s
intersect_u16_sve2<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1 909 ns 909 ns 15374525 pairs=1099.58k/s
intersect_u16_sve2<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1 922 ns 922 ns 15025863 pairs=1085.12k/s
intersect_u16_sve2<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1 932 ns 932 ns 15083373 pairs=1072.6k/s
intersect_u16_sve2<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1 2135 ns 2135 ns 6460842 pairs=468.333k/s
intersect_u16_sve2<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1 2118 ns 2118 ns 6509484 pairs=472.238k/s
intersect_u16_sve2<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1 2138 ns 2138 ns 6468742 pairs=467.706k/s
intersect_u16_sve2<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1 2136 ns 2136 ns 6419653 pairs=468.097k/s
intersect_u16_sve2<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1 1502 ns 1502 ns 9329372 pairs=665.698k/s
intersect_u16_sve2<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1 1492 ns 1492 ns 9375601 pairs=670.246k/s
intersect_u16_sve2<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1 1416 ns 1416 ns 9859829 pairs=706.16k/s
intersect_u16_sve2<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1 1274 ns 1274 ns 11052636 pairs=785.05k/s
intersect_u16_sve2<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1 9148 ns 9148 ns 1528714 pairs=109.319k/s
intersect_u16_sve2<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1 9150 ns 9150 ns 1529679 pairs=109.287k/s
intersect_u16_sve2<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1 9148 ns 9147 ns 1527762 pairs=109.32k/s
intersect_u16_sve2<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1 9135 ns 9135 ns 1529316 pairs=109.473k/s
```

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.2.1

Release: v5.2.1 [skip ci]

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - v5.2: 2x Faster `bf16` Euclidean Distance on AMD Genoa

Performance Improvements

The older bf16 Euclidean distance kernel handled mixed-precision vector subtraction inefficiently. The new one is very similar, but avoids a couple of serial operations and doubles the throughput:

```c
SIMSIMD_INTERNAL __m512i simsimd_substract_bf16x32_genoa(__m512i a_i16, __m512i b_i16) {
    union {
        __m512 fvec;
        __m512i ivec;
        simsimd_f32_t f32[16];
        simsimd_u16_t u16[32];
        simsimd_bf16_t bf16[32];
    } d_odd, d_even, d, a_f32_even, b_f32_even, d_f32_even, a_f32_odd, b_f32_odd, d_f32_odd, a, b;
    a.ivec = a_i16;
    b.ivec = b_i16;
    a_f32_odd.ivec = _mm512_and_si512(a_i16, _mm512_set1_epi32(0xFFFF0000));
    a_f32_even.ivec = _mm512_slli_epi32(a_i16, 16);
    b_f32_odd.ivec = _mm512_and_si512(b_i16, _mm512_set1_epi32(0xFFFF0000));
    b_f32_even.ivec = _mm512_slli_epi32(b_i16, 16);
    d_f32_odd.fvec = _mm512_sub_ps(a_f32_odd.fvec, b_f32_odd.fvec);
    d_f32_even.fvec = _mm512_sub_ps(a_f32_even.fvec, b_f32_even.fvec);
    d_f32_even.ivec = _mm512_srli_epi32(d_f32_even.ivec, 16);
    d.ivec = _mm512_mask_blend_epi16(0x55555555, d_f32_odd.ivec, d_f32_even.ivec);
    return d.ivec;
}
```

Old Implementation

```sh

Benchmark Time CPU Iterations UserCounters...

l2sq_bf16_haswell_128d/min_time:10.000/threads:1 15.2 ns 15.2 ns 890417569 abs_delta=41.5895n bytes=33.6296G/s pairs=65.6828M/s relative_error=20.6195n
l2sq_bf16_genoa_128d/min_time:10.000/threads:1 16.3 ns 16.3 ns 867745590 abs_delta=7.74925m bytes=31.3522G/s pairs=61.2348M/s relative_error=3.87658m
l2sq_bf16_serial_128d/min_time:10.000/threads:1 599 ns 599 ns 23382373 abs_delta=489.092n bytes=855.039M/s pairs=1.67M/s relative_error=244.952n
```

New Implementation

```sh

Benchmark Time CPU Iterations UserCounters...

l2sq_bf16_haswell_128d/min_time:10.000/threads:1 14.7 ns 14.7 ns 952634662 abs_delta=37.4399n bytes=34.7926G/s pairs=67.9544M/s relative_error=18.9709n
l2sq_bf16_genoa_128d/min_time:10.000/threads:1 8.45 ns 8.45 ns 1000000000 abs_delta=7.70743m bytes=60.5856G/s pairs=118.331M/s relative_error=3.85599m
l2sq_bf16_serial_128d/min_time:10.000/threads:1 599 ns 599 ns 23376884 abs_delta=471.586n bytes=854.885M/s pairs=1.6697M/s relative_error=236.592n
```

Other Changes

  • @rschu1ze added a very handy feature-detection function simsimd_uses_dynamic_dispatch() 🙌 Do we need to expose it to Rust, Python, and other bindings?
  • @MarkReedZ added missing kernels to the benchmark utility 🙌

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.1.4

Release: v5.1.4 [skip ci]

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.1.3

Release: v5.1.3 [skip ci]

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.1.2

Release: v5.1.2 [skip ci]

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - v5.1.1: Faster Dispatch & Better Docs in 🐍

Preview

Most importantly, this is the first SimSIMD release to drop support for Python 3.6, released back in 2016. Now, 8 years later, we are deprecating it to more broadly utilize the Fast Calling Convention. Read more in a dedicated article on the cost of function argument parsing in Python - a 35% discount on keyword arguments 😄

  • Thanks to @stuartatnosible for noticing dtype= issues 👓
  • Thanks to @MarkReedZ for accelerating bf16 dot-product on Arm 🦾

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - v5.1: Curved Spaces & Sparse Sets

Major additions:

  • [x] Set Intersections & Sparse Distances #100
  • [x] Bilinear Forms & Curved Spaces #149, suggested by @guyrosin
  • [x] Reduced Memory Consumption #142, suggested by @Charlyo
  • [x] Fixing accuracy issues #153, spotted by @cbornet

Minor fixes:

  • Extends benchmarks, thanks to @MarkReedZ
  • Cleans up CMake, suggested by @rschu1ze
  • Exposing f16, i8, b8 to Python buffers in 33f1b13
  • Documenting algorithms in 405bef0
  • Exposing bf16 to Rust, thanks to @Wyctus

Some crazy findings:

  • rsqrt precision on Arm is not documented at all
  • rsqrt approximation for double in AVX-512 is only 6x more accurate than for float
  • SciPy raises the wrong error type on overflow - ValueError instead of OverflowError
  • SciPy occasionally misuses math.sqrt over numpy.sqrt when dealing with NumPy arrays
  • sqrt in libc is bit-precise

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - Release v5.0.1

Release: v5.0.1 [skip ci]

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - New bf16 capabilities on Arm 🦾

Handling bf16 & Dynamic Dispatch on Arm

This major release adds new capability levels for Arm, allowing us to differentiate f16-, bf16-, and i8-supporting generations of CPUs, which are becoming increasingly popular in the datacenter. Similar to the speedups on AMD Genoa, the bf16 kernels perform very well on Arm Graviton3:

dot_bf16_neon_1536d/min_time:10.000/threads:1 183 ns 183 ns 76204478 abs_delta=0 bytes=33.5194G/s pairs=5.45563M/s relative_error=0
cos_bf16_neon_1536d/min_time:10.000/threads:1 239 ns 239 ns 58180403 abs_delta=0 bytes=25.7056G/s pairs=4.18386M/s relative_error=0
l2sq_bf16_neon_1536d/min_time:10.000/threads:1 312 ns 312 ns 43724273 abs_delta=0 bytes=19.7064G/s pairs=3.20742M/s relative_error=0

The bf16 kernels reach 33 GB/s as opposed to 19 GB/s for f16:

dot_f16_neon_1536d/min_time:10.000/threads:1 323 ns 323 ns 43311367 abs_delta=82.3015n bytes=19.0324G/s pairs=3.09772M/s relative_error=109.717n
cos_f16_neon_1536d/min_time:10.000/threads:1 367 ns 367 ns 38007895 abs_delta=1.5456m bytes=16.7349G/s pairs=2.72377M/s relative_error=6.19568m
l2sq_f16_neon_1536d/min_time:10.000/threads:1 341 ns 341 ns 41010555 abs_delta=66.7783n bytes=18.0436G/s pairs=2.93679M/s relative_error=133.449n

Researching MMLA Extensions & Future Work

Arm supports 2x2 matrix multiplications for i8 and bf16. All of our initial attempts with @eknag to use them for faster cosine computations on vectors of different lengths have failed. Old measurements:

cos_i8_neon_16d/min_time:10.000/threads:1 5.41 ns 5.41 ns 1000000000 abs_delta=910.184u bytes=5.91441G/s pairs=184.825M/s relative_error=4.20295m
cos_i8_neon_64d/min_time:10.000/threads:1 7.63 ns 7.63 ns 1000000000 abs_delta=939.825u bytes=16.7729G/s pairs=131.039M/s relative_error=3.82144m
cos_i8_neon_1536d/min_time:10.000/threads:1 101 ns 101 ns 139085845 abs_delta=917.35u bytes=30.394G/s pairs=9.89387M/s relative_error=3.63925m

Attempts with i8 for different dimensionality vectors:

cos_i8_neon_16d/min_time:10.000/threads:1 5.72 ns 5.72 ns 1000000000 abs_delta=0.282084 bytes=5.59562G/s pairs=174.863M/s relative_error=1.15086
cos_i8_neon_64d/min_time:10.000/threads:1 8.40 ns 8.40 ns 1000000000 abs_delta=0.234385 bytes=15.2345G/s pairs=119.02M/s relative_error=0.923009
cos_i8_neon_1536d/min_time:10.000/threads:1 117 ns 117 ns 118998604 abs_delta=0.23264 bytes=26.2707G/s pairs=8.55167M/s relative_error=0.920099

This line of work is to be continued, in parallel with new similarity metrics and distance functions.

Other Improvements

  • Added the missing Haswell kernels for f32 spatial distances on older x86 CPUs. Thanks to @makarr 👏
  • Added Python typing extensions for the native modules. Thanks to @cbornet 👏
  • Fixed offline builds from the .tar.gz packages, that were missing a VERSION file. Thanks to @OD-Ice 👏

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - New `bf16` & faster `i8` kernels ❤️ AMD Genoa

New bf16 kernels

The "brain-float-16" is a popular machine learning format. It's broadly supported in hardware and is very machine-friendly, but software support is still lagging behind - https://github.com/numpy/numpy/issues/19808. Most importantly, low-precision bf16 dot-products are supported by the most recent Zen4-based AMD Genoa CPUs. Those have up to 96 cores, and just one of those cores is capable of computing 86 GB/s worth of such dot-products.

```bash

Benchmark Time CPU Iterations UserCounters...

dot_bf16_haswell_1536d/min_time:10.000/threads:1 203 ns 203 ns 68785823 abs_delta=29.879n bytes=30.1978G/s pairs=4.91501M/s relative_error=39.8289n
dot_bf16_haswell_1536b/min_time:10.000/threads:1 93.0 ns 93.0 ns 150582910 abs_delta=24.8365n bytes=33.0344G/s pairs=10.7534M/s relative_error=33.1108n
dot_bf16_genoa_1536d/min_time:10.000/threads:1 71.0 ns 71.0 ns 197340105 abs_delta=23.6042n bytes=86.5917G/s pairs=14.0937M/s relative_error=31.4977n
dot_bf16_genoa_1536b/min_time:10.000/threads:1 36.1 ns 36.1 ns 387637713 abs_delta=22.3063n bytes=85.0019G/s pairs=27.6699M/s relative_error=29.7341n
dot_bf16_serial_1536d/min_time:10.000/threads:1 15992 ns 15991 ns 874491 abs_delta=311.896n bytes=384.216M/s pairs=62.5352k/s relative_error=415.887n
dot_bf16_serial_1536b/min_time:10.000/threads:1 7979 ns 7978 ns 1754703 abs_delta=193.719n bytes=385.045M/s pairs=125.34k/s relative_error=258.429n
dot_bf16c_serial_1536d/min_time:10.000/threads:1 16430 ns 16429 ns 852438 abs_delta=251.692n bytes=373.964M/s pairs=60.8665k/s relative_error=336.065n
dot_bf16c_serial_1536b/min_time:10.000/threads:1 8207 ns 8202 ns 1707289 abs_delta=165.209n bytes=374.54M/s pairs=121.92k/s relative_error=220.35n
vdot_bf16c_serial_1536d/min_time:10.000/threads:1 16489 ns 16488 ns 849194 abs_delta=247.646n bytes=372.639M/s pairs=60.6509k/s relative_error=330.485n
vdot_bf16c_serial_1536b/min_time:10.000/threads:1 8224 ns 8217 ns 1704397 abs_delta=162.036n bytes=373.839M/s pairs=121.693k/s relative_error=216.042n
```

That's a steep 3x improvement over the single-precision FMA throughput we can otherwise obtain by simply shifting bf16 left by 16 bits and using the _mm256_fmadd_ps intrinsic / vfmadd instruction, available since Intel Haswell.

Faster i8 kernels

We can't directly use _mm512_dpbusd_epi32 every time we want to compute a low-precision integer dot-product, as it's asymmetric with respect to the sign of the input arguments:

Signed(ZeroExtend16(a.byte[4j]) * SignExtend16(b.byte[4j]))

In the past we would just upcast to 16-bit integers and resort to _mm512_dpwssds_epi32. It is a much more costly multiplication circuit, and, assuming that I avoid loop unrolling, also implies 2x fewer scalars per loop. But for cosine distances there is something simple we can do. Assuming that we multiply the vector by itself, even if a certain vector component is negative, its square will always be positive. So we can avoid the expensive 16-bit operation at least where we compute the vector norms:

```c
a_abs_vec = _mm512_abs_epi8(a_vec);
b_abs_vec = _mm512_abs_epi8(b_vec);
a2_i32s_vec = _mm512_dpbusds_epi32(a2_i32s_vec, a_abs_vec, a_abs_vec);
b2_i32s_vec = _mm512_dpbusds_epi32(b2_i32s_vec, b_abs_vec, b_abs_vec);
```

On Intel Sapphire Rapids it resulted in a higher single-thread utilization, but didn't lead to improvements on other platforms.

```sh

Benchmark Time CPU Iterations UserCounters...

cos_i8_haswell_1536d/min_time:10.000/threads:1 92.4 ns 92.4 ns 151487077 abs_delta=105.739u bytes=33.2344G/s pairs=10.8185M/s relative_error=405.868u
cos_i8_haswell_1536b/min_time:10.000/threads:1 92.4 ns 92.4 ns 151478714 abs_delta=0 bytes=33.2383G/s pairs=10.8198M/s relative_error=0
cos_i8_ice_1536d/min_time:10.000/threads:1 61.6 ns 61.6 ns 227445214 abs_delta=0 bytes=49.898G/s pairs=16.2428M/s relative_error=0
cos_i8_ice_1536b/min_time:10.000/threads:1 61.5 ns 61.5 ns 227609621 abs_delta=0 bytes=49.9167G/s pairs=16.2489M/s relative_error=0
cos_i8_serial_1536d/min_time:10.000/threads:1 299 ns 299 ns 46788061 abs_delta=0 bytes=10.2666G/s pairs=3.34198M/s relative_error=0
cos_i8_serial_1536b/min_time:10.000/threads:1 299 ns 299 ns 46787275 abs_delta=0 bytes=10.2663G/s pairs=3.34191M/s relative_error=0
```

New timings:

```sh

Benchmark Time CPU Iterations UserCounters...

cos_i8_haswell_1536d/min_time:10.000/threads:1 92.4 ns 92.4 ns 151463294 abs_delta=105.739u bytes=33.2359G/s pairs=10.819M/s relative_error=405.868u
cos_i8_haswell_1536b/min_time:10.000/threads:1 92.4 ns 92.4 ns 151470519 abs_delta=0 bytes=33.2392G/s pairs=10.82M/s relative_error=0
cos_i8_ice_1536d/min_time:10.000/threads:1 48.1 ns 48.1 ns 292087642 abs_delta=0 bytes=63.8408G/s pairs=20.7815M/s relative_error=0
cos_i8_ice_1536b/min_time:10.000/threads:1 48.2 ns 48.2 ns 291716009 abs_delta=0 bytes=63.7662G/s pairs=20.7572M/s relative_error=0
cos_i8_serial_1536d/min_time:10.000/threads:1 299 ns 299 ns 46784120 abs_delta=0 bytes=10.2647G/s pairs=3.34139M/s relative_error=0
cos_i8_serial_1536b/min_time:10.000/threads:1 299 ns 299 ns 46781350 abs_delta=0 bytes=10.2654G/s pairs=3.3416M/s relative_error=0
```

- C
Published by ashvardanian over 1 year ago

https://github.com/ashvardanian/simsimd - v4.3.1

4.3.1 (2024-04-10)

Make

- C
Published by ashvardanian almost 2 years ago

https://github.com/ashvardanian/simsimd - v4.3.0

4.3.0 (2024-04-08)

Add

  • toBinary for JavaScript (1f1fd3a)

Improve

  • Procedural Rust benchmarks (e01ec6c)
  • Unrolled Rust benchmarks (#108) (508e7a0), closes #108

- C
Published by ashvardanian almost 2 years ago

https://github.com/ashvardanian/simsimd - v4.2.2

4.2.2 (2024-03-31)

Fix

  • f16 casting on Arm (dda096b)
  • Dot-products compilation on SVE (f5fe36d)

Make

- C
Published by ashvardanian almost 2 years ago

https://github.com/ashvardanian/simsimd - v4.2.1

4.2.1 (2024-03-26)

Fix

Make

  • Avoid native f16 by default (bd02af2)
  • Shorter Rust compilation (db9a9d2)

- C
Published by ashvardanian almost 2 years ago

https://github.com/ashvardanian/simsimd - v4.2.0

4.2.0 (2024-03-25)

Add

Fix

Make

- C
Published by ashvardanian almost 2 years ago

https://github.com/ashvardanian/simsimd - v4.1.1

4.1.1 (2024-03-24)

Fix

Make

  • Unused function attributes (c85e098)

- C
Published by ashvardanian almost 2 years ago

https://github.com/ashvardanian/simsimd - SIMD on Windows 🤗 🪟

This release refactors compiler attributes and intrinsics usage to make it compatible with MSVC. Most noticeably, functions defined like this:

```c
__attribute__((target("+simd")))
inline static void simsimd_cos_f32_neon(simsimd_f32_t const* a, simsimd_f32_t const* b, simsimd_size_t n, simsimd_distance_t* result) { }
```

Now look like this:

```c
#pragma GCC push_options
#pragma GCC target("+simd")
#pragma clang attribute push(__attribute__((target("+simd"))), apply_to = function)

inline static void simsimd_cos_f32_neon(simsimd_f32_t const* a, simsimd_f32_t const* b, simsimd_size_t n, simsimd_distance_t* result) { }

#pragma clang attribute pop
#pragma GCC pop_options
```

Thanks to that, SimSIMD on Windows is going to be just as fast as on Linux and macOS 🤗 🪟


4.1.0 vs 4.0.0 Change Log (2024-03-23)

Add

  • Bench against TF, PyTorch, JAX (5e64152)
  • Complex dot-products for Rust (936fe73)
  • Double-precision interfaces for JS (07f2aca)

Docs

Fix

  • Avoid vld2_f16 on MSVC (ce9800e)
  • Complex dispatch in Rust & C (d349bcd)
  • Missing _mm_rsqrt14_ps in MSVC (21e30fe)
  • Missing float16_t in MSVC Arm64 builds (94442c3)

Improve

Make

- C
Published by ashvardanian almost 2 years ago

https://github.com/ashvardanian/simsimd - v4.0.0

This is a packed redesign! Let's start with what's cool about it and later cover the mechanics.

  1. Extends dot products covering the entire matrix:
    • all IEEE 754 floating-point formats (f16, f32, f64)
    • real, complex, complex-conjugate dot-products
    • Arm NEON & SVE, x86 Haswell, Skylake, Ice Lake, Sapphire Rapids
    • Add support for complex32 Python type ... that:

SimSIMD is now the fastest and most popular library for computing half-precision products/similarities for Fourier Series and other complex data 🥳


What breaks:

  • Return types are now 64-bit floats, up from 32.
  • Inner products are now defined as AB, instead of 1 - AB for broader applicability.

- C
Published by ashvardanian almost 2 years ago

https://github.com/ashvardanian/simsimd - v3.9.0

3.9.0 (2024-03-04)

Add

  • Complex numbers support (0a0665a)
  • Hamming & Jaccard for pre-AVX512 CPUs (4f1eba1), closes #69
  • Rust binary distances (960af05), closes #84

Fix

Improve

Make

  • Bump ip from 2.0.0 to 2.0.1 (#92) (559a16d), closes #92

- C
Published by ashvardanian almost 2 years ago

https://github.com/ashvardanian/simsimd - v3.8.1

3.8.1 (2024-02-22)

Docs

  • Reorg contribution guide (e8be593)

Fix

  • Detect tests running in QEMU (80b9fec)
  • Numerical issues (cdd7516)
  • NumPy and SciPy issues with PyPy (9bd4ada)

Improve

  • Accurate tail math on Aarch64 (5b39a8e)
  • Accurate tail math with AVX2 (f61d5be)
  • Drop NumPy dependency (819a406)
  • Rust bench performance and style (#85) (da62146), closes #85
  • Test and ship w/out NumPy and SciPy (5b800d3)

Make

- C
Published by ashvardanian about 2 years ago

https://github.com/ashvardanian/simsimd - v3.8.0

3.8.0 (2024-02-13)

Add

  • enable/disable_capability in Python (c7e90f9), closes #68

Improve

  • Reuse caps in Rust and C standard compliance (aa164f6)

- C
Published by ashvardanian about 2 years ago

https://github.com/ashvardanian/simsimd - v3.7.7

3.7.7 (2024-02-11)

Make

  • crates.io allows only 5 keywords/categories (25ea3c8)

- C
Published by ashvardanian about 2 years ago

https://github.com/ashvardanian/simsimd - v3.7.6

3.7.6 (2024-02-11)

Make

  • Don't download Google Benchmark by default (0d14000)
  • Extend Rust package metadata (6f85161)
  • Remove redundant file (352cfe8)

- C
Published by ashvardanian about 2 years ago

https://github.com/ashvardanian/simsimd - v3.7.4

3.7.4 (2024-01-30)

Fix

  • Rust cosine function (#77) (cdd282d), closes #77

- C
Published by ashvardanian about 2 years ago

https://github.com/ashvardanian/simsimd - v3.7.3

3.7.3 (2024-01-28)

Make

  • Revert Crate dependency versions (3807f28)

- C
Published by ashvardanian about 2 years ago

https://github.com/ashvardanian/simsimd - v3.7.1

3.7.1 (2024-01-28)

Fix

Make

- C
Published by ashvardanian about 2 years ago

https://github.com/ashvardanian/simsimd - v3.7.0

3.7.0 (2024-01-28)

Add

  • Rust binding for SimSIMD (#75) (ec4c686), closes #75

- C
Published by ashvardanian about 2 years ago

https://github.com/ashvardanian/simsimd - v3.6.7

3.6.7 (2024-01-24)

Fix

- C
Published by ashvardanian about 2 years ago

https://github.com/ashvardanian/simsimd - v3.6.6

3.6.6 (2024-01-22)

  • Fallback for Vercel-based apps (#66) (dc6de11), closes #66
  • Memory leak in cdist (#61) (0469ec2), closes #61
  • Py version is inferred from macros (234a282)

Thanks to @sroussey and @smthngslv 👏

- C
Published by ashvardanian about 2 years ago

https://github.com/ashvardanian/simsimd - v3.6.5

3.6.5 (2024-01-18)

Make

  • ESM and CommonJS release with fallbacks (#63) (d57f82b), closes #63

- C
Published by ashvardanian about 2 years ago

https://github.com/ashvardanian/simsimd - v3.6.4

3.6.4 (2024-01-08)

Docs

  • TypeScript declaration file (#53) (5f6a688), closes #53

Make

  • Prebuild JavaScript bindings (#56) (1bd9001), closes #56

- C
Published by ashvardanian about 2 years ago

https://github.com/ashvardanian/simsimd - v3.6.3

3.6.3 (2024-01-06)

Make

- C
Published by ashvardanian about 2 years ago

https://github.com/ashvardanian/simsimd - v3.6.2

3.6.2 (2024-01-06)

Docs

  • Describe usage in C (555ce0c)
  • JS installation, grammar and counters (#50) (ba0e233), closes #50
  • typo in README.md (#49) (330c039), closes #49

Fix

  • Type errors in JS benchmarks (#51) (57ced28), closes #51

- C
Published by ashvardanian about 2 years ago

https://github.com/ashvardanian/simsimd - v3.6.1

3.6.1 (2023-12-19)

Docs

Fix

  • SEGFAULT creating NumPy Array (6cccca9)

Improve

Make

  • Update Python library __version__ (14559ed)

Test

  • Increase error tolerance (d216035)

- C
Published by ashvardanian about 2 years ago