Recent Releases of mlx
mlx - v0.29.0
Highlights
- Support for `mxfp4` quantization (Metal, CPU)
- More performance improvements, bug fixes, and features in the CUDA backend
- `mx.distributed` supports the NCCL back-end for CUDA
What's Changed
- [CUDA] Optimize `set_mm_device_pointers` for small ndim by @zcbenz in https://github.com/ml-explore/mlx/pull/2473
- Fix logsumexp/softmax not fused for some cases by @zcbenz in https://github.com/ml-explore/mlx/pull/2474
- Use CMake <4.1 to avoid the nvpl error by @angeloskath in https://github.com/ml-explore/mlx/pull/2489
- Fix incorrect interpretation of unsigned dtypes in reduce ops by @abeleinin in https://github.com/ml-explore/mlx/pull/2477
- make code blocks copyable by @Dan-Yeh in https://github.com/ml-explore/mlx/pull/2480
- Rename cu::Matmul to CublasGemm by @zcbenz in https://github.com/ml-explore/mlx/pull/2488
- Faster general unary op by @awni in https://github.com/ml-explore/mlx/pull/2472
- The `naive_conv2d` is no longer used by @zcbenz in https://github.com/ml-explore/mlx/pull/2496
- Remove the hack around SmallVector in cpu compile by @zcbenz in https://github.com/ml-explore/mlx/pull/2494
- Clean up code handling both std::vector and SmallVector by @zcbenz in https://github.com/ml-explore/mlx/pull/2493
- [CUDA] Fix conv grads with groups by @zcbenz in https://github.com/ml-explore/mlx/pull/2495
- Update cuDNN Frontend to v1.14 by @zcbenz in https://github.com/ml-explore/mlx/pull/2505
- Ensure small sort doesn't use indices if not argsort by @angeloskath in https://github.com/ml-explore/mlx/pull/2506
- Ensure no oob read in gemv_masked by @angeloskath in https://github.com/ml-explore/mlx/pull/2508
- fix custom kernel test by @awni in https://github.com/ml-explore/mlx/pull/2510
- No segfault with uninitialized array.at by @awni in https://github.com/ml-explore/mlx/pull/2514
- Fix lapack svd by @awni in https://github.com/ml-explore/mlx/pull/2515
- Split cuDNN helpers into a separate header by @zcbenz in https://github.com/ml-explore/mlx/pull/2491
- [CUDA] Add GEMM-based fallback convolution kernels by @zcbenz in https://github.com/ml-explore/mlx/pull/2511
- Fix docs by @russellizadi in https://github.com/ml-explore/mlx/pull/2518
- Fix overflow in large filter small channels by @angeloskath in https://github.com/ml-explore/mlx/pull/2520
- [CUDA] Fix stride of singleton dims before passing to cuDNN by @zcbenz in https://github.com/ml-explore/mlx/pull/2521
- Custom cuda kernel by @angeloskath in https://github.com/ml-explore/mlx/pull/2517
- Fix docs omission by @angeloskath in https://github.com/ml-explore/mlx/pull/2524
- Fix power by @awni in https://github.com/ml-explore/mlx/pull/2523
- NCCL backend by @nastya236 in https://github.com/ml-explore/mlx/pull/2476
- [CUDA] Nccl pypi dep + default for cuda by @awni in https://github.com/ml-explore/mlx/pull/2526
- Fix warning 186-D from nvcc by @zcbenz in https://github.com/ml-explore/mlx/pull/2527
- [CUDA] Update calls to `cudaMemAdvise` and `cudaGraphAddDependencies` for CUDA 13 by @andportnoy in https://github.com/ml-explore/mlx/pull/2525
- nccl default for backend=any by @awni in https://github.com/ml-explore/mlx/pull/2528
- Fix allocation bug in NCCL by @awni in https://github.com/ml-explore/mlx/pull/2530
- Enable `COMPILE_WARNING_AS_ERROR` for linux builds in CI by @zcbenz in https://github.com/ml-explore/mlx/pull/2534
- [CUDA] Remove thrust in arange by @zcbenz in https://github.com/ml-explore/mlx/pull/2535
- Use nccl header only when nccl is not present by @awni in https://github.com/ml-explore/mlx/pull/2539
- Allow pathlib.Path to save/load functions by @awni in https://github.com/ml-explore/mlx/pull/2541
- Remove nccl install in release by @awni in https://github.com/ml-explore/mlx/pull/2542
- [CUDA] Implement DynamicSlice/DynamicSliceUpdate by @zcbenz in https://github.com/ml-explore/mlx/pull/2533
- Remove stream from average grads so it uses default by @awni in https://github.com/ml-explore/mlx/pull/2532
- Enable cuda graph toggle by @awni in https://github.com/ml-explore/mlx/pull/2545
- Tests for save/load with `Path` by @awni in https://github.com/ml-explore/mlx/pull/2543
- Run CPP tests for CUDA build in CI by @zcbenz in https://github.com/ml-explore/mlx/pull/2544
- Separate cpu compilation cache by versions by @zcbenz in https://github.com/ml-explore/mlx/pull/2548
- [CUDA] Link with nccl by @awni in https://github.com/ml-explore/mlx/pull/2546
- [CUDA] Use ConcurrentContext in concatenate_gpu by @zcbenz in https://github.com/ml-explore/mlx/pull/2549
- [CUDA] fix sort by @awni in https://github.com/ml-explore/mlx/pull/2550
- Add mode parameter for quantization by @awni in https://github.com/ml-explore/mlx/pull/2499
- Bump xcode in circle by @awni in https://github.com/ml-explore/mlx/pull/2551
- Fix METAL quantization in JIT + fix release build by @awni in https://github.com/ml-explore/mlx/pull/2553
- Faster contiguous gather for indices in the first axis by @awni in https://github.com/ml-explore/mlx/pull/2552
- version bump by @awni in https://github.com/ml-explore/mlx/pull/2554
- Fix quantized vjp for mxfp4 by @awni in https://github.com/ml-explore/mlx/pull/2555
New Contributors
- @Dan-Yeh made their first contribution in https://github.com/ml-explore/mlx/pull/2480
- @russellizadi made their first contribution in https://github.com/ml-explore/mlx/pull/2518
- @andportnoy made their first contribution in https://github.com/ml-explore/mlx/pull/2525
Full Changelog: https://github.com/ml-explore/mlx/compare/v0.28.0...v0.29.0
Published by awni 6 months ago
mlx - v0.28.0
Highlights
- First version of fused sdpa vector for CUDA
- Convolutions in CUDA
- Speed improvements in CUDA normalization layers, softmax, compiled kernels, overheads and more
What's Changed
- [CUDA] Fix segfault on exit by @awni in https://github.com/ml-explore/mlx/pull/2424
- [CUDA] No occupancy query for launch params by @awni in https://github.com/ml-explore/mlx/pull/2426
- [CUDA] More sizes for gemv by @awni in https://github.com/ml-explore/mlx/pull/2429
- Add more CUDA architectures for PyPi package by @awni in https://github.com/ml-explore/mlx/pull/2427
- Use ccache in CI by @zcbenz in https://github.com/ml-explore/mlx/pull/2414
- [CUDA] Use aligned vector in Layer Norm and RMS norm by @awni in https://github.com/ml-explore/mlx/pull/2433
- Cuda faster softmax by @awni in https://github.com/ml-explore/mlx/pull/2435
- Remove the kernel arg from `get_launch_args` by @zcbenz in https://github.com/ml-explore/mlx/pull/2437
- Move arange to its own file by @zcbenz in https://github.com/ml-explore/mlx/pull/2438
- Use `load_vector` in `arg_reduce` by @zcbenz in https://github.com/ml-explore/mlx/pull/2439
- Make CI faster by @zcbenz in https://github.com/ml-explore/mlx/pull/2440
- [CUDA] Quantized refactoring by @angeloskath in https://github.com/ml-explore/mlx/pull/2442
- fix circular reference by @awni in https://github.com/ml-explore/mlx/pull/2443
- [CUDA] Fix gemv regression by @awni in https://github.com/ml-explore/mlx/pull/2445
- Fix wrong graph key when using concurrent context by @zcbenz in https://github.com/ml-explore/mlx/pull/2447
- Fix custom metal extension by @awni in https://github.com/ml-explore/mlx/pull/2446
- Add tests for export including control flow models and quantized models by @junpeiz in https://github.com/ml-explore/mlx/pull/2430
- [CUDA] Backward convolution by @zcbenz in https://github.com/ml-explore/mlx/pull/2431
- [CUDA] Save primitive inputs faster by @zcbenz in https://github.com/ml-explore/mlx/pull/2449
- [CUDA] Vectorize generated kernels by @angeloskath in https://github.com/ml-explore/mlx/pull/2444
- [CUDA] Matmul utils initial commit by @angeloskath in https://github.com/ml-explore/mlx/pull/2441
- Fix arctan2 grads by @angeloskath in https://github.com/ml-explore/mlx/pull/2453
- Use LRU cache for cuda graph by @zcbenz in https://github.com/ml-explore/mlx/pull/2448
- Add missing algorithm header to jit_compiler.cpp for Linux builds by @zamderax in https://github.com/ml-explore/mlx/pull/2460
- Default install cuda on linux by @awni in https://github.com/ml-explore/mlx/pull/2462
- fix wraps compile by @awni in https://github.com/ml-explore/mlx/pull/2461
- Feat: add `USE_SYSTEM_FMT` CMake option by @GaetanLepage in https://github.com/ml-explore/mlx/pull/2219
- Use SmallVector for shapes and strides by @zcbenz in https://github.com/ml-explore/mlx/pull/2454
- Fix install tags by @awni in https://github.com/ml-explore/mlx/pull/2464
- Faster gather qmm sorted test by @awni in https://github.com/ml-explore/mlx/pull/2463
- Fix cublas on h100 by @awni in https://github.com/ml-explore/mlx/pull/2466
- revert default cuda install by @awni in https://github.com/ml-explore/mlx/pull/2465
- feat: support a destinations based in tree flatten/unflatten by @LVivona in https://github.com/ml-explore/mlx/pull/2450
- Fix typo in metal command encoder by @angeloskath in https://github.com/ml-explore/mlx/pull/2471
- Update CUDA sdpa by @jagrit06 in https://github.com/ml-explore/mlx/pull/2468
- version by @awni in https://github.com/ml-explore/mlx/pull/2470
New Contributors
- @junpeiz made their first contribution in https://github.com/ml-explore/mlx/pull/2430
- @zamderax made their first contribution in https://github.com/ml-explore/mlx/pull/2460
- @GaetanLepage made their first contribution in https://github.com/ml-explore/mlx/pull/2219
- @LVivona made their first contribution in https://github.com/ml-explore/mlx/pull/2450
Full Changelog: https://github.com/ml-explore/mlx/compare/v0.27.1...v0.28.0
Published by angeloskath 7 months ago
mlx - v0.27.1
Highlights
- Initial PyPi release of the CUDA back-end.
- The CUDA back-end works well with mlx-lm:
- Reasonably fast for LLM inference
- Supports single-machine training and LoRA fine-tuning
What's Changed
- Avoid invoking allocator::malloc when creating CUDA event by @zcbenz in https://github.com/ml-explore/mlx/pull/2232
- Share more common code in Compiled by @zcbenz in https://github.com/ml-explore/mlx/pull/2240
- Avoid atomic updates across CPU/GPU in CUDA event by @zcbenz in https://github.com/ml-explore/mlx/pull/2231
- Perf regression fix by @angeloskath in https://github.com/ml-explore/mlx/pull/2243
- Add profiler annotations in common primitives for CUDA backend by @zcbenz in https://github.com/ml-explore/mlx/pull/2244
- Default strict mode for module `update` and `update_modules` by @awni in https://github.com/ml-explore/mlx/pull/2239
- Fix linux linking error by @awni in https://github.com/ml-explore/mlx/pull/2248
- Improve metal elementwise kernels by @awni in https://github.com/ml-explore/mlx/pull/2247
- CUDA backend: matmul by @zcbenz in https://github.com/ml-explore/mlx/pull/2241
- Change layernorms to two pass algorithm by @angeloskath in https://github.com/ml-explore/mlx/pull/2246
- Fix unintuitive metal kernel caching by @awni in https://github.com/ml-explore/mlx/pull/2242
- Refactor the lu test by @emmanuel-ferdman in https://github.com/ml-explore/mlx/pull/2250
- CUDA backend: unary ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2158
- Fix export to work with gather/scatter axis by @awni in https://github.com/ml-explore/mlx/pull/2263
- CUDA backend: binary ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2259
- Report number of missing parameters by @FL33TW00D in https://github.com/ml-explore/mlx/pull/2264
- CUDA backend: sort by @zcbenz in https://github.com/ml-explore/mlx/pull/2262
- CUDA backend: random by @zcbenz in https://github.com/ml-explore/mlx/pull/2261
- Fix conv export by @awni in https://github.com/ml-explore/mlx/pull/2265
- CUDA backend: copy ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2260
- Fix building cpp benchmarks on Linux by @zcbenz in https://github.com/ml-explore/mlx/pull/2268
- Add load_safe to the general conv loaders by @angeloskath in https://github.com/ml-explore/mlx/pull/2258
- start cuda circle config by @awni in https://github.com/ml-explore/mlx/pull/2256
- CUDA backend: reduce by @zcbenz in https://github.com/ml-explore/mlx/pull/2269
- CUDA backend: argreduce by @zcbenz in https://github.com/ml-explore/mlx/pull/2270
- CUDA backend: softmax by @zcbenz in https://github.com/ml-explore/mlx/pull/2272
- CUDA backend: layernorm by @zcbenz in https://github.com/ml-explore/mlx/pull/2271
- Fix warnings from latest CUDA toolkit by @zcbenz in https://github.com/ml-explore/mlx/pull/2275
- Make sliceUpdate general by @awni in https://github.com/ml-explore/mlx/pull/2282
- CUDA backend: compile by @zcbenz in https://github.com/ml-explore/mlx/pull/2276
- [CUDA] RMSNorm and VJP by @awni in https://github.com/ml-explore/mlx/pull/2280
- [CUDA] Fix build by @awni in https://github.com/ml-explore/mlx/pull/2284
- [CUDA] ternary with select op by @awni in https://github.com/ml-explore/mlx/pull/2283
- CUDA backend: indexing ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2277
- Collection of refactors by @jagrit06 in https://github.com/ml-explore/mlx/pull/2274
- Fix complex power and print by @awni in https://github.com/ml-explore/mlx/pull/2286
- fix cuda jit by @awni in https://github.com/ml-explore/mlx/pull/2287
- Fix cuda gemm for bf16 by @awni in https://github.com/ml-explore/mlx/pull/2288
- Fix cuda arg reduce by @awni in https://github.com/ml-explore/mlx/pull/2291
- RoPE for CUDA by @angeloskath in https://github.com/ml-explore/mlx/pull/2293
- Add python testing for cuda with ability to skip list of tests by @awni in https://github.com/ml-explore/mlx/pull/2295
- [CUDA] Fix back-end bugs and enable corresponding tests by @awni in https://github.com/ml-explore/mlx/pull/2296
- Cuda bug fixes 2 by @awni in https://github.com/ml-explore/mlx/pull/2298
- [CUDA] Divmod, Partition, and sort fixes by @awni in https://github.com/ml-explore/mlx/pull/2302
- [CUDA] synch properly waits for all tasks to finish and clear by @awni in https://github.com/ml-explore/mlx/pull/2303
- Make ptx cache settable by environment variable by @angeloskath in https://github.com/ml-explore/mlx/pull/2304
- Build CUDA release in Circle by @awni in https://github.com/ml-explore/mlx/pull/2306
- Cuda perf tuning by @awni in https://github.com/ml-explore/mlx/pull/2307
- Fix `update_modules()` when providing a subset by @angeloskath in https://github.com/ml-explore/mlx/pull/2308
- Compile float64 functions on CPU by @awni in https://github.com/ml-explore/mlx/pull/2311
- Fix get 2d grid dims by @angeloskath in https://github.com/ml-explore/mlx/pull/2316
- Split broadcast so it is always fused in compile by @angeloskath in https://github.com/ml-explore/mlx/pull/2318
- [CUDA] Fix reductions by @angeloskath in https://github.com/ml-explore/mlx/pull/2314
- Fix module update in strict mode by @awni in https://github.com/ml-explore/mlx/pull/2321
- MLX_SWITCH macros to templates by @angeloskath in https://github.com/ml-explore/mlx/pull/2320
- Use fp32 for testing, add more complex ops by @awni in https://github.com/ml-explore/mlx/pull/2322
- Patch bump by @awni in https://github.com/ml-explore/mlx/pull/2324
- Allow parameters to be deleted from a module by @awni in https://github.com/ml-explore/mlx/pull/2325
- Fix compilation error from integral_constant by @zcbenz in https://github.com/ml-explore/mlx/pull/2326
- [CUDA] Switch to CUDA graphs by @awni in https://github.com/ml-explore/mlx/pull/2317
- [CUDA] Fix graphs for older cuda by @awni in https://github.com/ml-explore/mlx/pull/2328
- [CUDA] Add `MLX_CUDA_GRAPH_CACHE_SIZE` env for setting graph cache size by @zcbenz in https://github.com/ml-explore/mlx/pull/2329
- Fix layernorm race condition by @angeloskath in https://github.com/ml-explore/mlx/pull/2340
- Build with all cpu cores by default by @zcbenz in https://github.com/ml-explore/mlx/pull/2336
- [CUDA] Do vectorized store/load in binary ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2330
- Auto build linux release by @awni in https://github.com/ml-explore/mlx/pull/2341
- MoE backward improvements by @angeloskath in https://github.com/ml-explore/mlx/pull/2335
- Fix compilation with CUDA 11 by @zcbenz in https://github.com/ml-explore/mlx/pull/2331
- patch bump by @awni in https://github.com/ml-explore/mlx/pull/2343
- Align mlx::core::max op nan propagation with NumPy by @jhavukainen in https://github.com/ml-explore/mlx/pull/2339
- Add zero for argsort vjp by @awni in https://github.com/ml-explore/mlx/pull/2345
- [CUDA] Do vectorized store/load in contiguous elementwise ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2342
- Align mlx::core::min op nan propagation with NumPy by @jhavukainen in https://github.com/ml-explore/mlx/pull/2346
- [CUDA] Set current device before cudaGraphLaunch by @zcbenz in https://github.com/ml-explore/mlx/pull/2351
- [CUDA] Put version in ptx cache dir path by @zcbenz in https://github.com/ml-explore/mlx/pull/2352
- Fix type promotion in Adam with bias correction by @angeloskath in https://github.com/ml-explore/mlx/pull/2350
- Fix edge check in QuantizedBlockLoader for qmm_n by @angeloskath in https://github.com/ml-explore/mlx/pull/2355
- [CUDA] Implement Scan kernel by @zcbenz in https://github.com/ml-explore/mlx/pull/2347
- [Metal] fix copy dispatch by @awni in https://github.com/ml-explore/mlx/pull/2360
- [CUDA] Bundle CCCL for JIT compilation by @zcbenz in https://github.com/ml-explore/mlx/pull/2357
- [CUDA] Do not put kernels in anonymous namespace by @zcbenz in https://github.com/ml-explore/mlx/pull/2362
- Fix imag() vjp by @angeloskath in https://github.com/ml-explore/mlx/pull/2367
- Add Primitive::name and remove Primitive::print by @zcbenz in https://github.com/ml-explore/mlx/pull/2365
- update linux build by @awni in https://github.com/ml-explore/mlx/pull/2370
- [CUDA] Affine quantize by @awni in https://github.com/ml-explore/mlx/pull/2354
- Fix flaky linux test by @awni in https://github.com/ml-explore/mlx/pull/2371
- Install linux with mlx[cuda] and mlx[cpu] by @awni in https://github.com/ml-explore/mlx/pull/2356
- [CUDA] Use cuda::std::complex in place of cuComplex by @zcbenz in https://github.com/ml-explore/mlx/pull/2372
- lower memory uniform sampling by @awni in https://github.com/ml-explore/mlx/pull/2361
- [CUDA] Fix complex reduce + nan propagation in min and max by @awni in https://github.com/ml-explore/mlx/pull/2377
- Rename the copy util in cpu/copy.h to copy_cpu by @zcbenz in https://github.com/ml-explore/mlx/pull/2378
- fix ring distributed test by @awni in https://github.com/ml-explore/mlx/pull/2380
- Test with CUDA 12.2 by @awni in https://github.com/ml-explore/mlx/pull/2375
- [CUDA] Add work per thread to compile by @angeloskath in https://github.com/ml-explore/mlx/pull/2368
- [CUDA] Fix resource leaks in matmul and graph by @awni in https://github.com/ml-explore/mlx/pull/2383
- [CUDA] Add more ways finding CCCL headers in JIT by @zcbenz in https://github.com/ml-explore/mlx/pull/2382
- Add `contiguous_copy_gpu` util for copying array by @zcbenz in https://github.com/ml-explore/mlx/pull/2379
- Adding support for the Muon Optimizer by @Goekdeniz-Guelmez in https://github.com/ml-explore/mlx/pull/1914
- Patch bump by @awni in https://github.com/ml-explore/mlx/pull/2386
- Fix release build + patch bump by @awni in https://github.com/ml-explore/mlx/pull/2387
- Fix cuda manylinux version to match others by @awni in https://github.com/ml-explore/mlx/pull/2388
- [CUDA] speedup handling scalars by @awni in https://github.com/ml-explore/mlx/pull/2389
- Remove thrust iterators by @zcbenz in https://github.com/ml-explore/mlx/pull/2396
- Add `contiguous_copy_cpu` util for copying array by @zcbenz in https://github.com/ml-explore/mlx/pull/2397
- Fix including stubs in wheel by @awni in https://github.com/ml-explore/mlx/pull/2398
- use size option in binary by @awni in https://github.com/ml-explore/mlx/pull/2399
- [CUDA] Simplify allocator by @awni in https://github.com/ml-explore/mlx/pull/2392
- Add cuda gemv by @awni in https://github.com/ml-explore/mlx/pull/2400
- Fix an error in the comment for mx.dequantize by @csukuangfj in https://github.com/ml-explore/mlx/pull/2409
- Remove unused code in Convolution::vjp by @zcbenz in https://github.com/ml-explore/mlx/pull/2408
- [CUDA] --compress-mode requires CUDA 12.8 by @zcbenz in https://github.com/ml-explore/mlx/pull/2407
- full row mask in sdpa consistently gives nan by @awni in https://github.com/ml-explore/mlx/pull/2406
- Fix uv install and add dev release by @awni in https://github.com/ml-explore/mlx/pull/2411
- [Metal] Release metal events by @awni in https://github.com/ml-explore/mlx/pull/2412
- Test on cuda 12.2 and 12.9 by @awni in https://github.com/ml-explore/mlx/pull/2413
- [CUDA] Initial implementation of Convolution with cuDNN by @zcbenz in https://github.com/ml-explore/mlx/pull/2385
- [DOCS]: Fix eps placement in Adam and AdamW by @Skonor in https://github.com/ml-explore/mlx/pull/2416
- [CUDA] Always use batched matmul by @awni in https://github.com/ml-explore/mlx/pull/2404
- Fix qvm splitk by @awni in https://github.com/ml-explore/mlx/pull/2415
- Update install docs and requirements by @awni in https://github.com/ml-explore/mlx/pull/2419
- version by @awni in https://github.com/ml-explore/mlx/pull/2420
New Contributors
- @emmanuel-ferdman made their first contribution in https://github.com/ml-explore/mlx/pull/2250
- @FL33TW00D made their first contribution in https://github.com/ml-explore/mlx/pull/2264
- @jhavukainen made their first contribution in https://github.com/ml-explore/mlx/pull/2339
- @Goekdeniz-Guelmez made their first contribution in https://github.com/ml-explore/mlx/pull/1914
- @Skonor made their first contribution in https://github.com/ml-explore/mlx/pull/2416
Full Changelog: https://github.com/ml-explore/mlx/compare/v0.26.0...v0.27.0
Published by awni 8 months ago
mlx - v0.26.0
Highlights
- 5 bit quantization
- Significant progress on CUDA back-end by @zcbenz
Core
Features
- 5bit quants
- Allow per-target Metal debug flags
- Add complex eigh
- reduce vjp for `mx.all` and `mx.any`
- `real` and `imag` properties
- Non-symmetric `mx.linalg.eig` and `mx.linalg.eigh`
- convolution vmap
- Add more complex unary ops (`sqrt`, `square`, ...)
- Complex scan
- Add `mx.broadcast_shapes`
- Added `output_padding` parameter in `conv_transpose`
- Add random normal distribution for complex numbers
- Add `mx.fft.fftshift` and `mx.fft.ifftshift` helpers
- Enable vjp for quantized scale and bias
Performance
- Optimizing Complex Matrix Multiplication using Karatsuba's Algorithm
- Much faster 1D conv
Cuda
- Generalize gpu backend
- Use fallbacks in fast primitives when `eval_gpu` is not implemented
- Add memory cache to CUDA backend
- Do not check `event.is_signaled()` in `eval_impl`
- Build for compute capability 70 instead of 75 in CUDA backend
- CUDA backend: backbone
Bug Fixes
- Fix out-of-bounds default value in logsumexp/softmax
- include `mlx::core::version()` symbols in the mlx static library
- Fix Nearest upsample
- Fix large arg reduce
- fix conv grad
- Fix some complex vjps
- Fix typo in `row_reduce_small`
- Fix `put_along_axis` for empty arrays
- Close a couple edge case bugs: `hadamard` and `addmm` on empty inputs
- Fix fft for integer overflow with large batches
- fix: `conv_general` differences between gpu, cpu
- Fix batched vector sdpa
- GPU Hadamard for large N
- Improve bandwidth for elementwise ops
- Fix compile merging
- Fix shapeless export to throw on dim mismatch
- Fix `mx.linalg.pinv` for singular matrices
- Fixed shift operations
- Fix integer overflow in qmm
Contributors
Thanks to some awesome contributors!
@ivanfioravanti, @awni, @angeloskath, @zcbenz, @Jckwind, @iExalt, @thesuryash, @ParamThakkar123, @djphoenix, @ita9naiwa, @hdeng-apple, @Redempt1onzzZZ, @charan-003, @skyzh, @wisefool769, @barronalex, @aturker1
Published by awni 9 months ago
mlx - v0.25.0
Highlights
- Custom logsumexp for reduced memory in training (benchmark)
- Depthwise separable convolutions
- Up to 4x faster than PyTorch
- benchmark
- Batched Gather MM and Gather QMM for ~2x faster prompt processing for MoEs
Core
Performance
- Fused vector attention supports 256 dim
- Tune quantized matrix vector dispatch for small batches of vectors
Features
- Move memory API in the top level mlx.core and enable for CPU only allocator
- Enable using MPI from all platforms and allow only OpenMPI
- Add a ring all gather for the ring distributed backend
- Enable gemm for complex numbers
- Fused attention supports literal "causal" mask
- Log for complex numbers
- Distributed `all_min` and `all_max`, both for MPI and the ring backend
- Add `logcumsumexp`
- Add additive mask for fused vector attention
- Improve the usage of the residency set
NN
- Add sharded layers for model/tensor parallelism
Bugfixes
- Fix possible allocator deadlock when using multiple streams
- Ring backend supports 32 bit platforms and FreeBSD
- Fix FFT bugs
- Fix attention mask type for fused attention kernel
- Fix fused attention numerical instability with masking
- Add a fallback for float16 gemm
- Fix simd sign for uint64
- Fix issues in docs
Published by angeloskath 11 months ago
mlx - v0.24.0
Highlights
- Much faster fused attention with support for causal masking
- Benchmarks
- Improvements in prompt processing speed and memory use, benchmarks
- Much faster small batch fused attention for e.g. speculative decoding, benchmarks
- Major redesign of CPU back-end for faster CPU-GPU synchronization
Core
Performance
- Support fused masking in `scaled_dot_product_attention`
- Support transposed head/seq for fused vector `scaled_dot_product_attention`
- SDPA support for small batch (over sequence) queries
- Enabling fused attention for head dim 128
- Redesign CPU back-end for faster cpu/gpu synch
Features
- Allow debugging in distributed mode
- Support `mx.fast.rms_norm` without scale
- Adds nuclear norm support in `mx.linalg.norm`
- Add XOR on arrays
- Added `mlx::core::version()`
- Allow non-square lu in `mx.linalg.lu`
- Double for lapack ops (`eigh`, `svd`, etc.)
- Add a prepare tb ring script
- Ring docs
- Affine quant always in fp32
Optimizers
- Add a multi optimizer `optimizers.MultiOptimizer`
Bug Fixes
- Do not define `MLX_VERSION` globally
- Reduce binary size post fast synch
- Fix vmap for flatten
- Fix copy for large arrays with JIT
- Fix grad with inplace updates
- Use same accumulation precision in gemv as gemm
- Fix slice data size
- Use a heap for small sizes
- Fix donation in scan
- Ensure linspace always contains start and stop
- Raise an exception in the rope op if input is integer
- Limit compile buffers by
- fix `mx.float64` type promotion
- Fix CPU SIMD erf_inv
- Update `smooth_l1_loss` in losses.
Published by jagrit06 12 months ago
mlx - v0.23.0
Highlights
- 4-bit Mistral 7B generates at 131 toks/sec out of the box on an M2 Ultra
- More performance improvements across the board:
- Faster small batch quantized matmuls. Speeds up speculative decoding on M1, M2
- Faster winograd convolutions, benchmarks
- Up to 3x faster sort, benchmarks
- Much faster `mx.put_along_axis` and `mx.take_along_axis`, benchmarks
- Faster unified CPU back-end with vector operations
- Double precision (`mx.float64`) support on the CPU
Core
Features
- Bitwise invert `mx.bitwise_invert`
- `mx.linalg.lu`, `mx.linalg.lu_factor`, `mx.linalg.solve`, `mx.linalg.solve_triangular`
- Support loading F8_E4M3 from safetensors
- `mx.float64` supported on the CPU
- Matmul JVPs
- Distributed launch helper: `mlx.launch`
- Support non-square QR factorization with `mx.linalg.qr`
- Support ellipsis in `mx.einsum`
- Refactor and unify accelerate and common back-ends
Performance
- Faster synchronization `Fence` for synchronizing CPU-GPU
- Much faster `mx.put_along_axis` and `mx.take_along_axis`, benchmarks
- Fast winograd convolutions, benchmarks
- Allow dynamic ops per buffer based on dispatches and memory, benchmarks
- Up to 3x faster sort, benchmarks
- Faster small batch qmv, benchmarks
- Ring distributed backend
- Uses raw sockets for faster all reduce
- Some CPU ops are much faster with the new `Simd<T, N>`
NN
- Orthogonal initializer `nn.init.orthogonal`
- Add dilation for conv 3d layers
Bug fixes
- Limit grad recursion depth by not recursing through non-grad inputs
- Fix synchronization bug for GPU stream async CPU work
- Fix shapeless compile on ubuntu24
- Recompile when `shapeless` changes
- Fix rope fallback to not upcast
- Fix metal sort for certain cases
- Fix a couple of slicing bugs
- Avoid duplicate malloc with custom kernel init
- Fix compilation error on Windows
- Allow Python garbage collector to break cycles on custom objects
- Fix grad with copies
- Loading empty list is ok when `strict = false`
- Fix split vmap
- Fixes output donation for IO ops on the GPU
- Fix creating an array with an int64 scalar
- Catch stream errors earlier to avoid aborts
Published by awni about 1 year ago
mlx - v0.22.0
Highlights
- Export and import MLX functions to a file (example, bigger example)
- Functions can be exported from Python and run in C++ and vice versa
Core
- Add `slice` and `slice_update` which take arrays for starting locations
- Add an example for using MLX in C++ with CMake
- Fused attention for generation now supports boolean masking (benchmark)
- Allow array offset for `mx.fast.rope`
- Add `mx.finfo`
- Allow negative strides without resorting to copying for `slice` and `as_strided`
- Add `Flatten`, `Unflatten` and `ExpandDims` primitives
- Enable the compilation of lambdas in C++
- Add a lot more primitives for shapeless compilation (full list)
- Fix performance regression in `qvm`
- Introduce separate types for `Shape` and `Strides` and switch to int64 strides from uint64
- Reduced copies for fused-attention kernel
- Recompile a function when the stream changes
- Several steps to improve the linux / x86_64 experience (#1625, #1627, #1635)
- Several steps to improve/enable the windows experience (#1628, #1660, #1662, #1661, #1672, #1663, #1664, ...)
- Update to newer Metal-cpp
- Throw when exceeding the maximum number of buffers possible
- Add `mx.kron`
- `mx.distributed.send` now implements the identity function instead of returning an empty array
- Better error reporting for `mx.compile` on CPU and for unrecoverable errors
NN
- Add optional bias correction in Adam/AdamW
- Enable mixed quantization by `nn.quantize`
- Remove reshapes from `nn.QuantizedEmbedding`
Bug fixes
- Fix qmv/qvm bug for batch size 2-5
- Fix some leaks and races (#1629)
- Fix transformer postnorm in `mlx.nn`
- Fix some `mx.fast` fallbacks
- Fix the hashing for string constants in `compile`
- Fix memory leak of non-evaled arrays with siblings
- Fix concatenate/slice_update vjp in edge-case where the inputs have different type
Published by angeloskath about 1 year ago
mlx - v0.21.0
Highlights
- Support 3 and 6 bit quantization: benchmarks
- Much faster memory efficient attention for head dim 64, 80: benchmarks
- Much faster sdpa inference kernel for longer sequences: benchmarks
Core
- `contiguous` op (C++ only) + primitive
- Bfs width limit to reduce memory consumption during `eval`
- Fast CPU quantization
- Faster indexing math in several kernels:
- unary, binary, ternary, copy, compiled, reduce
- Improve dispatch threads for a few kernels:
- conv, gemm splitk, custom kernels
- More buffer donation with no-ops to reduce memory use
- Use `CMAKE_OSX_DEPLOYMENT_TARGET` to pick Metal version
- Dispatch Metal bf16 type at runtime when using the JIT
NN
- `nn.AvgPool3d` and `nn.MaxPool3d`
- Support `groups` in `nn.Conv2d`
Bug fixes
- Fix per-example mask + docs in sdpa
- Fix FFT synchronization bug (use dispatch method everywhere)
- Throw for invalid `*fft{2,n}` cases
- Fix OOB access in qmv
- Fix donation in sdpa to reduce memory use
- Allocate safetensors header on the heap to avoid stack overflow
- Fix sibling memory leak
- Fix `view` segfault for scalars input
- Fix concatenate vmap
Published by awni over 1 year ago
mlx - v0.20.0
Highlights
- Even faster GEMMs
- Peaking at 23.89 TFlops on M2 Ultra benchmarks
- BFS graph optimizations
- Over 120 tok/s with Mistral 7B!
- Fast batched QMV/QVM for KV quantized attention benchmarks
Core
- New Features
- `mx.linalg.eigh` and `mx.linalg.eigvalsh`
- `mx.nn.init.sparse`
- 64bit type support for `mx.cumprod`, `mx.cumsum`
- Performance
- Faster long column reductions
- Wired buffer support for large models
- Better Winograd dispatch condition for convs
- Faster scatter/gather
- Faster `mx.random.uniform` and `mx.random.bernoulli`
- Better threadgroup sizes for large arrays
- Misc
- Added Python 3.13 to CI
- C++20 compatibility
Bugfixes
- Fix command encoder synchronization
- Fix `mx.vmap` with gather and constant outputs
- Fix fused sdpa with differing key and value strides
- Support `mx.array.__format__` with format spec
- Fix multi-output array leak
- Fix RMSNorm weight mismatch error
- C++
Published by barronalex over 1 year ago
mlx - v0.19.0
Highlights
- Speed improvements
- Up to 6x faster CPU indexing benchmarks
- Faster Metal compiled kernels for strided inputs benchmarks
- Faster generation with fused-attention kernel benchmarks
- Gradient for grouped convolutions
- Due to Python 3.8's end-of-life we no longer test with it on CI
Core
- New features
- Gradient for grouped convolutions
- `mx.roll`
- `mx.random.permutation`
- `mx.real` and `mx.imag`
- Performance
- Up to 6x faster CPU indexing benchmarks
- Faster CPU sort benchmarks
- Faster Metal compiled kernels for strided inputs benchmarks
- Faster generation with fused-attention kernel benchmarks
- Bulk eval in safetensors to avoid unnecessary serialization of work
- Misc
- Bump to nanobind 2.2
- Move testing to python 3.9 due to 3.8's end-of-life
- Make the GPU device more thread safe
- Fix the submodule stubs for better IDE support
- CI generated docs that will never be stale
NN
- Add support for grouped 1D convolutions to the nn API
- Add some missing type annotations
Bugfixes
- Fix and speedup row-reduce with few rows
- Fix normalization primitive segfault with unexpected inputs
- Fix complex power on the GPU
- Fix freeing deep unevaluated graphs details
- Fix race with `array::is_available`
- Consistently handle softmax with all `-inf` inputs
- Fix streams in affine quantize
- Fix CPU compile preamble for some linux machines
- Stream safety in CPU compilation
- Fix CPU compile segfault at program shutdown
- C++
Published by angeloskath over 1 year ago
mlx - v0.18.0
Highlights
- Speed improvements:
- Up to 2x faster I/O: benchmarks.
- Faster transposed copies, unary, and binary ops
- CPU benchmarks here.
- GPU benchmarks here and here.
- Transposed convolutions
- Improvements to `mx.distributed` (send/recv/average_gradients)
Core
New features:
- `mx.conv_transpose{1,2,3}d`
- Allow `mx.take` to work with an integer index
- Add `std` as a method on `mx.array`
- `mx.put_along_axis`
- `mx.cross_product`
- `int()` and `float()` work on scalar `mx.array`
- Add optional headers to `mx.fast.metal_kernel`
- `mx.distributed.send` and `mx.distributed.recv`
- `mx.linalg.pinv`
Performance
- Up to 2x faster I/O
- Much faster CPU convolutions
- Faster general n-dimensional copies, unary, and binary ops for both CPU and GPU
- Put reduction ops in default stream with async for faster comms
- Overhead reductions in `mx.fast.metal_kernel`
- Improve donation heuristics to reduce memory use
Misc
- Support Xcode 16
NN
- Faster RNN layers
- `nn.ConvTranspose{1,2,3}d`
- `mlx.nn.average_gradients`: data-parallel helper for distributed training
Bug Fixes
- Fix boolean all reduce bug
- Fix extension metal library finding
- Fix ternary for large arrays
- Make eval just wait if all arrays are scheduled
- Fix CPU softmax by removing a redundant coefficient in the NEON fast exp
- Fix JIT reductions
- Fix overflow in quantize/dequantize
- Fix compile with byte sized constants
- Fix copy in the sort primitive
- Fix reduce edge case
- Fix slice data size
- Throw for certain cases of non captured inputs in compile
- Fix copying scalars by adding fill_gpu
- Fix bug in module attribute set, reset, set
- Ensure io/comm streams are active before eval
- Fix `mx.clip`
- Override class function in Repr so `mx.array` is not confused with `array.array`
- Avoid using find_library to make the install truly portable
- Remove fmt dependencies from MLX install
- Fix for partition VJP
- Avoid command buffer timeout for IO on large arrays
- C++
Published by awni over 1 year ago
mlx - v0.17.0
Highlights
- `mx.einsum`: PR
- Big speedups in reductions: benchmarks
- 2x faster model loading: PR
- `mx.fast.metal_kernel` for custom GPU kernels: docs
Core
- Faster program exits
- Laplace sampling
- `mx.nan_to_num`
- `nn.tanh` gelu approximation
- Fused GPU quantization ops
- Faster group norm
- bf16 winograd conv
- vmap support for `mx.scatter`
- `mx.pad` "edge" padding
- More numerically stable `mx.var`
- `mx.linalg.cholesky_inv` / `mx.linalg.tri_inv`
- `mx.isfinite`
- Complex `mx.sign` now mirrors NumPy 2.0 behaviour
- More flexible `mx.fast.rope`
- Update to `nanobind` 2.1
Bug Fixes
- gguf zero initialization
- expm1f overflow handling
- bfloat16 hadamard
- large arrays for various ops
- rope fix
- bf16 array creation
- preserve dtype in `nn.Dropout`
- `nn.TransformerEncoder` with `norm_first=False`
- excess copies from contiguity bug
- C++
Published by barronalex over 1 year ago
mlx - v0.16.0
Highlights
- `@mx.custom_function` for custom `vjp`/`jvp`/`vmap` transforms
- Up to 2x faster Metal GEMV and fast masked GEMV
- Fast `hadamard_transform`
Core
- Metal 3.2 support
- Reduced CPU binary size
- Added quantized GPU ops to JIT
- Faster GPU compilation
- Added grads for bitwise ops + indexing
Bug Fixes
- 1D scatter bug
- Strided sort bug
- Reshape copy bug
- Segfault in `mx.compile`
- Donation condition in compilation
- Compilation of Accelerate on iOS
- C++
Published by barronalex over 1 year ago
mlx - v0.15.0
Highlights
- Fast Metal GPU FFTs
- On average ~30x faster than CPU
- More benchmarks
- `mx.distributed` with `all_sum` and `all_gather`
Core
- Added dlpack device `__dlpack_device__`
- Fast GPU FFTs benchmarks
- Add docs for `mx.distributed`
- Add `mx.view` op
NN
- `softmin`, `hardshrink`, and `hardtanh` activations
Bugfixes
- Fix broadcast bug in bitwise ops
- Allow more buffers for JIT compilation
- Fix matvec vector stride bug
- Fix multi-block sort stride management
- Stable cumprod grad at 0
- Bug fix for a race condition in scan
- C++
Published by awni almost 2 years ago
mlx - v0.14.0
Highlights
- Small-size build that JIT compiles kernels and omits the CPU backend, resulting in a binary under 4MB
- `mx.gather_qmm`: quantized equivalent of `mx.gather_mm`, which speeds up MoE inference by ~2x
- Grouped 2D convolutions
Core
- `mx.conjugate`
- `mx.conv3d` and `nn.Conv3d`
- List-based indexing
- Started `mx.distributed`, which uses MPI (if installed) for communication across machines:
  - `mx.distributed.init`
  - `mx.distributed.all_gather`
  - `mx.distributed.all_reduce_sum`
- Support conversion to and from dlpack
- `mx.linalg.cholesky` on CPU
- `mx.quantized_matmul` sped up for vector-matrix products
- `mx.trace`
- `mx.block_masked_mm` now supports floating-point masks!
Fixes
- Error messaging in eval
- Add some missing docs
- Scatter index bug
- The extensions example now compiles and runs
- CPU copy bug with many dimensions
- C++
Published by angeloskath almost 2 years ago
mlx - v0.13.0
Highlights
- Block sparse matrix multiply speeds up MoEs by >2x
- Improved quantization algorithm should work well for all networks
- Improved GPU command submission speeds up training and inference
Core
- Bitwise ops added: `mx.bitwise_[or|and|xor]`, `mx.[left|right]_shift`, operator overloads
- Groups added to Conv1d
- Added `mx.metal.device_info` to get better-informed memory limits
- Added resettable memory stats
- `mlx.optimizers.clip_grad_norm` and `mlx.utils.tree_reduce` added
- Add `mx.arctan2`
- Unary ops now accept array-like inputs, i.e. one can do `mx.sqrt(2)`
Bugfixes
- Fixed shape for slice update
- Bugfix in quantize that used slightly wrong scales/biases
- Fixed memory leak for multi-output primitives encountered with gradient checkpointing
- Fixed conversion from other frameworks for all datatypes
- Fixed index overflow for matmul with large batch size
- Fixed initialization ordering that occasionally caused segfaults
- C++
Published by angeloskath almost 2 years ago
mlx - v0.12.0
Highlights
- Faster quantized matmul
- Up to 40% faster QLoRA or prompt processing, some numbers
Core
- `mx.synchronize` to wait for computation dispatched with `mx.async_eval`
- `mx.radians` and `mx.degrees`
- `mx.metal.clear_cache` to return to the OS the memory held by MLX as a cache for future allocations
- Change quantization to always represent 0 exactly (relevant issue)
Bugfixes
- Fixed quantization of a block with all 0s that produced NaNs
- Fixed the `len` field in the buffer protocol implementation
- C++
Published by angeloskath almost 2 years ago
mlx - v0.11.0
Core
- `mx.block_masked_mm` for block-level sparse matrix multiplication
- Shared events for synchronization and asynchronous evaluation
NN
- `nn.QuantizedEmbedding` layer
- `nn.quantize` for quantizing modules
- `gelu_approx` uses tanh for consistency with PyTorch
- C++
Published by awni almost 2 years ago
mlx - v0.10.0
Highlights
- Improvements for LLM generation
- Reshapeless quant matmul/matvec
- `mx.async_eval`
- Async command encoding
Core
- Slightly faster reshapeless quantized gemms
- Option for precise softmax
- `mx.metal.start_capture` and `mx.metal.stop_capture` for GPU debug/profile
- `mx.expm1`
- `mx.std`
- `mx.meshgrid`
- CPU-only `mx.random.multivariate_normal`
- `mx.cumsum` (and other scans) for `bfloat`
- Async command encoder with explicit barriers / dependency management
NN
- `nn.Upsample` supports bicubic interpolation
Misc
- Updated MLX Extension to work with nanobind
Bugfixes
- Fix buffer donation in softmax and fast ops
- Bug in layer norm vjp
- Bug initializing from lists with scalar
- Bug in indexing
- CPU compilation bug
- Multi-output compilation bug
- Fix stack overflow issues in eval and array destruction
- C++
Published by awni almost 2 years ago
mlx - v0.9.0
Highlights:
- Fast partial RoPE (used by Phi-2)
- Fast gradients for RoPE, RMSNorm, and LayerNorm
- Up to 7x faster, benchmarks
Core
- More overhead reductions
- Partial fast RoPE (fast Phi-2)
- Better buffer donation for copy
- Type hierarchy and issubdtype
- Fast VJPs for RoPE, RMSNorm, and LayerNorm
NN
- `Module.set_dtype`
- Chaining in `nn.Module` (`model.freeze().update(…)`)
Bugfixes
- Fix set item bugs
- Fix scatter vjp
- Check shape integer overflow on array construction
- Fix bug with module attributes
- Fix two bugs for odd shaped QMV
- Fix GPU sort for large sizes
- Fix bug in negative padding for convolutions
- Fix bug in multi-stream race condition for graph evaluation
- Fix random normal generation for half precision
- C++
Published by awni almost 2 years ago
mlx - v0.8.0
Highlights
- More perf!
- `mx.fast.rms_norm` and `mx.fast.layer_norm`
- Switch to nanobind substantially reduces overhead
- Up to 4x faster `__setitem__` (e.g. `a[...] = b`)
Core
- `mx.inverse`, CPU only
- vmap over `mx.matmul` and `mx.addmm`
- Switch to nanobind from pybind11
- Faster setitem indexing
- `mx.fast.rms_norm`, token generation benchmark
- `mx.fast.layer_norm`, token generation benchmark
- vmap for inverse and svd
- Faster non-overlapping pooling
Optimizers
- Set minimum value in cosine decay scheduler
Bugfixes
- Fix bug in multi-dimensional reduction
- C++
Published by awni almost 2 years ago
mlx - v0.7.0
Highlights
- Perf improvements for attention ops:
- No copy broadcast matmul (benchmarks)
- Fewer copies in reshape
Core
- Faster broadcast + gemm
- `mx.linalg.svd` (CPU only)
- Fewer copies in reshape
- Faster small reductions
NN
- `nn.RNN`, `nn.LSTM`, `nn.GRU`
Bugfixes
- Fix bug in depth traversal ordering
- Fix two edge case bugs in compilation
- Fix bug with modules with dictionaries of weights
- Fix bug with scatter which broke MoE training
- Fix bug with compilation kernel collision
- C++
Published by awni almost 2 years ago
mlx - v0.6.0
Highlights:
- Faster quantized matrix-vector multiplies
- `mx.fast.scaled_dot_product_attention` fused op
Core
- Memory allocation API improvements
- Faster GPU reductions for smaller sizes (between 2 and 7x)
- `mx.fast.scaled_dot_product_attention` fused op
- Pickle support for `mx.array`
NN
- Dilation on convolution layers
Bugfixes
- Fix `mx.topk`
- Fix reshape for zero sizes
- C++
Published by angeloskath about 2 years ago
mlx - v0.5.0
Highlights:
- Faster convolutions.
- Up to 14x faster for some common sizes.
- See benchmarks
Core
- `mx.where` properly handles `inf`
- Faster and more general convolutions
- Input and kernel dilation
- Asymmetric padding
- Support for cross-correlation and convolution
- `atleast_{1,2,3}d` accept any number of arrays
NN
- `nn.Upsample` layer
- Supports nearest neighbor and linear interpolation
- Any number of dimensions
Optimizers
- Linear schedule and schedule joiner:
- Use for e.g. linear warmup + cosine decay
Bugfixes
- `arange` throws on `inf` inputs
- Fix CMake build with MLX
- Fix `logsumexp` `inf` edge case
- Fix grad of power w.r.t. the exponent edge case
- Fix compile with `inf` constants
- Fix temporary bug in convolution
- C++
Published by jagrit06 about 2 years ago
mlx - v0.4.0
Highlights:
- Partial shapeless compilation
- Default shapeless compilation for all activations
- Can be more than 5x faster than uncompiled versions
- CPU kernel fusion
- Some functions can be up to 10x faster
Core
- CPU compilation
- Shapeless compilation for some cases: `mx.compile(function, shapeless=True)`
- Up to 10x faster scatter: benchmarks
- `mx.atleast_1d`, `mx.atleast_2d`, `mx.atleast_3d`
Bugfixes
- Bug with `tolist` with `bfloat16` and `float16`
- Bug with `argmax` on M3
- C++
Published by awni about 2 years ago
mlx - v0.2.0
Highlights:
- `mx.compile` makes stuff go fast
- Some functions are up to 10x faster (benchmarks)
- Training models anywhere from 10% to twice as fast (benchmarks)
- Simple syntax for compiling full training steps
Core
- `mx.compile` function transformation
- Find devices properly for iOS
- Up to 10x faster GPU gather
- `__abs__` overload for `abs` on arrays
- `loc` and `scale` parameters for `mx.random.normal`
NN
- Margin ranking loss
- BCE loss with weights
Bugfixes
- Fix for broken eval during function transformations
- Fix `mx.var` to give `inf` with `ddof >= nelem`
- Fix loading empty modules in `nn.Sequential`
- C++
Published by awni about 2 years ago
mlx - v0.1.0
Highlights
- Memory use improvements:
  - Gradient checkpointing for training with `mx.checkpoint`
  - Better graph execution order
  - Buffer donation
Core
- Gradient checkpointing with `mx.checkpoint`
- CPU-only QR factorization `mx.linalg.qr`
- Release Python GIL during `mx.eval`
- Depth-based graph execution order
- Lazy loading arrays from files
- Buffer donation for reduced memory use
- `mx.diag`, `mx.diagonal`
- Breaking: `array.shape` is a Python tuple
- GPU support for `int64` and `uint64` reductions
- vmap over reductions and arg reductions:
  - `sum`, `prod`, `max`, `min`, `all`, `any`
  - `argmax`, `argmin`
NN
- Softshrink activation
Bugfixes
- Comparisons with `inf` work, and fix `mx.isinf`
- Handle empty Matmul on the CPU
- Negative shape checking for `mx.full`
- Correctly propagate `NaN` in some binary ops: `mx.logaddexp`, `mx.maximum`, `mx.minimum`
- Fix > 4D non-contiguous binary ops
- Fix `mx.log1p` with `inf` input
- Fix SGD to apply weight decay even with 0 momentum
- C++
Published by angeloskath about 2 years ago
mlx - v0.0.11
Highlights:
- GGUF improvements:
  - Native quantizations `Q4_0`, `Q4_1`, and `Q8_0`
  - Metadata
Core
- Support for reading and writing GGUF metadata
- Native GGUF quantization (`Q4_0`, `Q4_1`, and `Q8_0`)
- Quantize with group size of 32 (2x32, 4x32, and 8x32)
NN
- `Module.save_weights` supports safetensors
- `nn.init` package with several commonly used neural network initializers
- Binary cross entropy and cross entropy losses can take probabilities as targets
- `Adafactor` in `nn.optimizers`
Bugfixes
- Fix `isinf` and friends for integer types
- Fix array creation from lists of Python ints to `int64`, `uint`, and `float32`
- Fix power VJP for `0` inputs
- Fix out-of-bounds `inf` reads in `gemv`
- `mx.arange` crashes on NaN inputs
- C++
Published by angeloskath about 2 years ago
mlx - v0.0.10
Highlights:
- Faster matmul: up to 2.5x faster for certain sizes, benchmarks
- Fused matmul + addition (for faster linear layers)
Core
- Quantization supports sizes other than multiples of 32
- Faster GEMM (matmul)
- `addmm` primitive (fused addition and matmul)
- `mx.isnan`, `mx.isinf`, `mx.isposinf`, `mx.isneginf`
- `mx.tile`
- VJPs for `scatter_min` and `scatter_max`
- Multi-output split primitive
NN
- Losses: Gaussian negative log-likelihood
Misc
- Performance enhancements for graph evaluation with lots of outputs
- Default PRNG seed is based on current time instead of 0
- Primitive VJP takes output as input. Reduces redundant work without need for simplification
- Format boolean printing in Python style when in Python
Bugfixes
- Scatter < 32 bit precision and integer overflow fix
- Overflow with `mx.eye`
- Change `mx.round` to follow NumPy, which rounds half to even
- C++
Published by awni about 2 years ago
mlx - v0.0.9
Highlights:
- Initial (and experimental) GGUF support
- Support Python buffer protocol (easy interoperability with NumPy, Jax, Tensorflow, PyTorch, etc)
- `at[]` syntax for scatter-style operations: `x.at[idx].add(y)` (`min`, `max`, `prod`, etc.)
Core
- Array creation from other `mx.array`s (`mx.array([x, y])`)
- Complete support for Python buffer protocol
- `mx.inner`, `mx.outer`
- `mx.logical_and`, `mx.logical_or`, and operator overloads
- Array `at` syntax for scatter ops
- Better support for in-place operations (`+=`, `*=`, `-=`, ...)
- VJP for scatter and scatter add
- Constants (`mx.pi`, `mx.inf`, `mx.newaxis`, …)
NN
- GLU activation
- `cosine_similarity` loss
- Cache for `RoPE` and `ALiBi`
Bugfixes / Misc
- Fix data type with `tri`
- Fix saving non-contiguous arrays
- Fix graph retention for in-place state, and remove `retain_graph`
- Multi-output primitives
- Better support for loading devices
- C++
Published by awni about 2 years ago
mlx - v0.0.7
Core
- Support for loading and saving Hugging Face's safetensors format
- Transposed quantization matmul kernels
- `mlx.core.linalg` sub-package with `mx.linalg.norm` (Frobenius, infinity, p-norms)
- `tensordot` and `repeat`
NN
- Layers
- `Bilinear`, `Identity`, `InstanceNorm`
- `Dropout2D`, `Dropout3D`
- More customizable `Transformer` (pre/post norm, dropout)
- More activations: `SoftSign`, `Softmax`, `HardSwish`, `LogSoftmax`
- Configurable scale in `RoPE` positional encodings
- Losses: `hinge`, `huber`, `log_cosh`
Misc
- Faster GPU reductions for certain cases
- Change to memory allocation to allow swapping
- C++
Published by awni about 2 years ago
mlx - v0.0.6
Core
- quantize, dequantize, quantized_matmul
- moveaxis, swapaxes, flatten
- stack
- floor, ceil, clip
- tril, triu, tri
- linspace
Optimizers
- RMSProp, Adamax, Adadelta, Lion
NN
- Layers: `QuantizedLinear`, `ALiBi` positional encodings
- Losses: Label smoothing, Smooth L1 loss, Triplet loss
Misc
- Bug fixes
- C++
Published by angeloskath about 2 years ago