Recent Releases of mlx
mlx - v0.29.0
Highlights
- Support for `mxfp4` quantization (Metal, CPU)
- More performance improvements, bug fixes, and features in the CUDA backend
- `mx.distributed` supports the NCCL back-end for CUDA
What's Changed
- [CUDA] Optimize `set_mm_device_pointers` for small ndim by @zcbenz in https://github.com/ml-explore/mlx/pull/2473
- Fix logsumexp/softmax not fused for some cases by @zcbenz in https://github.com/ml-explore/mlx/pull/2474
- Use CMake <4.1 to avoid the nvpl error by @angeloskath in https://github.com/ml-explore/mlx/pull/2489
- Fix incorrect interpretation of unsigned dtypes in reduce ops by @abeleinin in https://github.com/ml-explore/mlx/pull/2477
- make code blocks copyable by @Dan-Yeh in https://github.com/ml-explore/mlx/pull/2480
- Rename cu::Matmul to CublasGemm by @zcbenz in https://github.com/ml-explore/mlx/pull/2488
- Faster general unary op by @awni in https://github.com/ml-explore/mlx/pull/2472
- The `naive_conv2d` is no longer used by @zcbenz in https://github.com/ml-explore/mlx/pull/2496
- Remove the hack around SmallVector in cpu compile by @zcbenz in https://github.com/ml-explore/mlx/pull/2494
- Clean up code handling both std::vector and SmallVector by @zcbenz in https://github.com/ml-explore/mlx/pull/2493
- [CUDA] Fix conv grads with groups by @zcbenz in https://github.com/ml-explore/mlx/pull/2495
- Update cuDNN Frontend to v1.14 by @zcbenz in https://github.com/ml-explore/mlx/pull/2505
- Ensure small sort doesn't use indices if not argsort by @angeloskath in https://github.com/ml-explore/mlx/pull/2506
- Ensure no oob read in gemv_masked by @angeloskath in https://github.com/ml-explore/mlx/pull/2508
- fix custom kernel test by @awni in https://github.com/ml-explore/mlx/pull/2510
- No segfault with uninitialized array.at by @awni in https://github.com/ml-explore/mlx/pull/2514
- Fix lapack svd by @awni in https://github.com/ml-explore/mlx/pull/2515
- Split cuDNN helpers into a separate header by @zcbenz in https://github.com/ml-explore/mlx/pull/2491
- [CUDA] Add GEMM-based fallback convolution kernels by @zcbenz in https://github.com/ml-explore/mlx/pull/2511
- Fix docs by @russellizadi in https://github.com/ml-explore/mlx/pull/2518
- Fix overflow in large filter small channels by @angeloskath in https://github.com/ml-explore/mlx/pull/2520
- [CUDA] Fix stride of singleton dims before passing to cuDNN by @zcbenz in https://github.com/ml-explore/mlx/pull/2521
- Custom cuda kernel by @angeloskath in https://github.com/ml-explore/mlx/pull/2517
- Fix docs omission by @angeloskath in https://github.com/ml-explore/mlx/pull/2524
- Fix power by @awni in https://github.com/ml-explore/mlx/pull/2523
- NCCL backend by @nastya236 in https://github.com/ml-explore/mlx/pull/2476
- [CUDA] Nccl pypi dep + default for cuda by @awni in https://github.com/ml-explore/mlx/pull/2526
- Fix warning 186-D from nvcc by @zcbenz in https://github.com/ml-explore/mlx/pull/2527
- [CUDA] Update calls to `cudaMemAdvise` and `cudaGraphAddDependencies` for CUDA 13 by @andportnoy in https://github.com/ml-explore/mlx/pull/2525
- nccl default for backend=any by @awni in https://github.com/ml-explore/mlx/pull/2528
- Fix allocation bug in NCCL by @awni in https://github.com/ml-explore/mlx/pull/2530
- Enable `COMPILE_WARNING_AS_ERROR` for linux builds in CI by @zcbenz in https://github.com/ml-explore/mlx/pull/2534
- [CUDA] Remove thrust in arange by @zcbenz in https://github.com/ml-explore/mlx/pull/2535
- Use nccl header only when nccl is not present by @awni in https://github.com/ml-explore/mlx/pull/2539
- Allow pathlib.Path to save/load functions by @awni in https://github.com/ml-explore/mlx/pull/2541
- Remove nccl install in release by @awni in https://github.com/ml-explore/mlx/pull/2542
- [CUDA] Implement DynamicSlice/DynamicSliceUpdate by @zcbenz in https://github.com/ml-explore/mlx/pull/2533
- Remove stream from average grads so it uses default by @awni in https://github.com/ml-explore/mlx/pull/2532
- Enable cuda graph toggle by @awni in https://github.com/ml-explore/mlx/pull/2545
- Tests for save/load with `Path` by @awni in https://github.com/ml-explore/mlx/pull/2543
- Run CPP tests for CUDA build in CI by @zcbenz in https://github.com/ml-explore/mlx/pull/2544
- Separate cpu compilation cache by versions by @zcbenz in https://github.com/ml-explore/mlx/pull/2548
- [CUDA] Link with nccl by @awni in https://github.com/ml-explore/mlx/pull/2546
- [CUDA] Use ConcurrentContext in concatenate_gpu by @zcbenz in https://github.com/ml-explore/mlx/pull/2549
- [CUDA] fix sort by @awni in https://github.com/ml-explore/mlx/pull/2550
- Add mode parameter for quantization by @awni in https://github.com/ml-explore/mlx/pull/2499
- Bump xcode in circle by @awni in https://github.com/ml-explore/mlx/pull/2551
- Fix METAL quantization in JIT + fix release build by @awni in https://github.com/ml-explore/mlx/pull/2553
- Faster contiguous gather for indices in the first axis by @awni in https://github.com/ml-explore/mlx/pull/2552
- version bump by @awni in https://github.com/ml-explore/mlx/pull/2554
- Fix quantized vjp for mxfp4 by @awni in https://github.com/ml-explore/mlx/pull/2555
New Contributors
- @Dan-Yeh made their first contribution in https://github.com/ml-explore/mlx/pull/2480
- @russellizadi made their first contribution in https://github.com/ml-explore/mlx/pull/2518
- @andportnoy made their first contribution in https://github.com/ml-explore/mlx/pull/2525
Full Changelog: https://github.com/ml-explore/mlx/compare/v0.28.0...v0.29.0
Published by awni 6 months ago
mlx - v0.28.0
Highlights
- First version of fused sdpa vector for CUDA
- Convolutions in CUDA
- Speed improvements in CUDA normalization layers, softmax, compiled kernels, overheads and more
What's Changed
- [CUDA] Fix segfault on exit by @awni in https://github.com/ml-explore/mlx/pull/2424
- [CUDA] No occupancy query for launch params by @awni in https://github.com/ml-explore/mlx/pull/2426
- [CUDA] More sizes for gemv by @awni in https://github.com/ml-explore/mlx/pull/2429
- Add more CUDA architectures for PyPi package by @awni in https://github.com/ml-explore/mlx/pull/2427
- Use ccache in CI by @zcbenz in https://github.com/ml-explore/mlx/pull/2414
- [CUDA] Use aligned vector in Layer Norm and RMS norm by @awni in https://github.com/ml-explore/mlx/pull/2433
- Cuda faster softmax by @awni in https://github.com/ml-explore/mlx/pull/2435
- Remove the kernel arg from `get_launch_args` by @zcbenz in https://github.com/ml-explore/mlx/pull/2437
- Move arange to its own file by @zcbenz in https://github.com/ml-explore/mlx/pull/2438
- Use `load_vector` in `arg_reduce` by @zcbenz in https://github.com/ml-explore/mlx/pull/2439
- Make CI faster by @zcbenz in https://github.com/ml-explore/mlx/pull/2440
- [CUDA] Quantized refactoring by @angeloskath in https://github.com/ml-explore/mlx/pull/2442
- fix circular reference by @awni in https://github.com/ml-explore/mlx/pull/2443
- [CUDA] Fix gemv regression by @awni in https://github.com/ml-explore/mlx/pull/2445
- Fix wrong graph key when using concurrent context by @zcbenz in https://github.com/ml-explore/mlx/pull/2447
- Fix custom metal extension by @awni in https://github.com/ml-explore/mlx/pull/2446
- Add tests for export including control flow models and quantized models by @junpeiz in https://github.com/ml-explore/mlx/pull/2430
- [CUDA] Backward convolution by @zcbenz in https://github.com/ml-explore/mlx/pull/2431
- [CUDA] Save primitive inputs faster by @zcbenz in https://github.com/ml-explore/mlx/pull/2449
- [CUDA] Vectorize generated kernels by @angeloskath in https://github.com/ml-explore/mlx/pull/2444
- [CUDA] Matmul utils initial commit by @angeloskath in https://github.com/ml-explore/mlx/pull/2441
- Fix arctan2 grads by @angeloskath in https://github.com/ml-explore/mlx/pull/2453
- Use LRU cache for cuda graph by @zcbenz in https://github.com/ml-explore/mlx/pull/2448
- Add missing algorithm header to jit_compiler.cpp for Linux builds by @zamderax in https://github.com/ml-explore/mlx/pull/2460
- Default install cuda on linux by @awni in https://github.com/ml-explore/mlx/pull/2462
- fix wraps compile by @awni in https://github.com/ml-explore/mlx/pull/2461
- Feat: add `USE_SYSTEM_FMT` CMake option by @GaetanLepage in https://github.com/ml-explore/mlx/pull/2219
- Use SmallVector for shapes and strides by @zcbenz in https://github.com/ml-explore/mlx/pull/2454
- Fix install tags by @awni in https://github.com/ml-explore/mlx/pull/2464
- Faster gather qmm sorted test by @awni in https://github.com/ml-explore/mlx/pull/2463
- Fix cublas on h100 by @awni in https://github.com/ml-explore/mlx/pull/2466
- revert default cuda install by @awni in https://github.com/ml-explore/mlx/pull/2465
- feat: support a destinations based in tree flatten/unflatten by @LVivona in https://github.com/ml-explore/mlx/pull/2450
- Fix typo in metal command encoder by @angeloskath in https://github.com/ml-explore/mlx/pull/2471
- Update CUDA sdpa by @jagrit06 in https://github.com/ml-explore/mlx/pull/2468
- version by @awni in https://github.com/ml-explore/mlx/pull/2470
New Contributors
- @junpeiz made their first contribution in https://github.com/ml-explore/mlx/pull/2430
- @zamderax made their first contribution in https://github.com/ml-explore/mlx/pull/2460
- @GaetanLepage made their first contribution in https://github.com/ml-explore/mlx/pull/2219
- @LVivona made their first contribution in https://github.com/ml-explore/mlx/pull/2450
Full Changelog: https://github.com/ml-explore/mlx/compare/v0.27.1...v0.28.0
Published by angeloskath 7 months ago
mlx - v0.27.1
Highlights
- Initial PyPi release of the CUDA back-end.
- The CUDA back-end works well with mlx-lm:
- Reasonably fast for LLM inference
- Supports single-machine training and LoRA fine-tuning
What's Changed
- Avoid invoking allocator::malloc when creating CUDA event by @zcbenz in https://github.com/ml-explore/mlx/pull/2232
- Share more common code in Compiled by @zcbenz in https://github.com/ml-explore/mlx/pull/2240
- Avoid atomic updates across CPU/GPU in CUDA event by @zcbenz in https://github.com/ml-explore/mlx/pull/2231
- Perf regression fix by @angeloskath in https://github.com/ml-explore/mlx/pull/2243
- Add profiler annotations in common primitives for CUDA backend by @zcbenz in https://github.com/ml-explore/mlx/pull/2244
- Default strict mode for module `update` and `update_modules` by @awni in https://github.com/ml-explore/mlx/pull/2239
- Fix linux linking error by @awni in https://github.com/ml-explore/mlx/pull/2248
- Improve metal elementwise kernels by @awni in https://github.com/ml-explore/mlx/pull/2247
- CUDA backend: matmul by @zcbenz in https://github.com/ml-explore/mlx/pull/2241
- Change layernorms to two pass algorithm by @angeloskath in https://github.com/ml-explore/mlx/pull/2246
- Fix unintuitive metal kernel caching by @awni in https://github.com/ml-explore/mlx/pull/2242
- Refactor the lu test by @emmanuel-ferdman in https://github.com/ml-explore/mlx/pull/2250
- CUDA backend: unary ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2158
- Fix export to work with gather/scatter axis by @awni in https://github.com/ml-explore/mlx/pull/2263
- CUDA backend: binary ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2259
- Report number of missing parameters by @FL33TW00D in https://github.com/ml-explore/mlx/pull/2264
- CUDA backend: sort by @zcbenz in https://github.com/ml-explore/mlx/pull/2262
- CUDA backend: random by @zcbenz in https://github.com/ml-explore/mlx/pull/2261
- Fix conv export by @awni in https://github.com/ml-explore/mlx/pull/2265
- CUDA backend: copy ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2260
- Fix building cpp benchmarks on Linux by @zcbenz in https://github.com/ml-explore/mlx/pull/2268
- Add load_safe to the general conv loaders by @angeloskath in https://github.com/ml-explore/mlx/pull/2258
- start cuda circle config by @awni in https://github.com/ml-explore/mlx/pull/2256
- CUDA backend: reduce by @zcbenz in https://github.com/ml-explore/mlx/pull/2269
- CUDA backend: argreduce by @zcbenz in https://github.com/ml-explore/mlx/pull/2270
- CUDA backend: softmax by @zcbenz in https://github.com/ml-explore/mlx/pull/2272
- CUDA backend: layernorm by @zcbenz in https://github.com/ml-explore/mlx/pull/2271
- Fix warnings from latest CUDA toolkit by @zcbenz in https://github.com/ml-explore/mlx/pull/2275
- Make sliceUpdate general by @awni in https://github.com/ml-explore/mlx/pull/2282
- CUDA backend: compile by @zcbenz in https://github.com/ml-explore/mlx/pull/2276
- [CUDA] RMSNorm and VJP by @awni in https://github.com/ml-explore/mlx/pull/2280
- [CUDA] Fix build by @awni in https://github.com/ml-explore/mlx/pull/2284
- [CUDA] ternary with select op by @awni in https://github.com/ml-explore/mlx/pull/2283
- CUDA backend: indexing ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2277
- Collection of refactors by @jagrit06 in https://github.com/ml-explore/mlx/pull/2274
- Fix complex power and print by @awni in https://github.com/ml-explore/mlx/pull/2286
- fix cuda jit by @awni in https://github.com/ml-explore/mlx/pull/2287
- Fix cuda gemm for bf16 by @awni in https://github.com/ml-explore/mlx/pull/2288
- Fix cuda arg reduce by @awni in https://github.com/ml-explore/mlx/pull/2291
- RoPE for CUDA by @angeloskath in https://github.com/ml-explore/mlx/pull/2293
- Add python testing for cuda with ability to skip list of tests by @awni in https://github.com/ml-explore/mlx/pull/2295
- [CUDA] Fix back-end bugs and enable corresponding tests by @awni in https://github.com/ml-explore/mlx/pull/2296
- Cuda bug fixes 2 by @awni in https://github.com/ml-explore/mlx/pull/2298
- [CUDA] Divmod, Partition, and sort fixes by @awni in https://github.com/ml-explore/mlx/pull/2302
- [CUDA] synch properly waits for all tasks to finish and clear by @awni in https://github.com/ml-explore/mlx/pull/2303
- Make ptx cache settable by environment variable by @angeloskath in https://github.com/ml-explore/mlx/pull/2304
- Build CUDA release in Circle by @awni in https://github.com/ml-explore/mlx/pull/2306
- Cuda perf tuning by @awni in https://github.com/ml-explore/mlx/pull/2307
- Fix `update_modules()` when providing a subset by @angeloskath in https://github.com/ml-explore/mlx/pull/2308
- Compile float64 functions on CPU by @awni in https://github.com/ml-explore/mlx/pull/2311
- Fix get 2d grid dims by @angeloskath in https://github.com/ml-explore/mlx/pull/2316
- Split broadcast so it is always fused in compile by @angeloskath in https://github.com/ml-explore/mlx/pull/2318
- [CUDA] Fix reductions by @angeloskath in https://github.com/ml-explore/mlx/pull/2314
- Fix module update in strict mode by @awni in https://github.com/ml-explore/mlx/pull/2321
- MLX_SWITCH macros to templates by @angeloskath in https://github.com/ml-explore/mlx/pull/2320
- Use fp32 for testing, add more complex ops by @awni in https://github.com/ml-explore/mlx/pull/2322
- Patch bump by @awni in https://github.com/ml-explore/mlx/pull/2324
- Allow parameters to be deleted from a module by @awni in https://github.com/ml-explore/mlx/pull/2325
- Fix compilation error from integral_constant by @zcbenz in https://github.com/ml-explore/mlx/pull/2326
- [CUDA] Switch to CUDA graphs by @awni in https://github.com/ml-explore/mlx/pull/2317
- [CUDA] Fix graphs for older cuda by @awni in https://github.com/ml-explore/mlx/pull/2328
- [CUDA] Add `MLX_CUDA_GRAPH_CACHE_SIZE` env for setting graph cache size by @zcbenz in https://github.com/ml-explore/mlx/pull/2329
- Fix layernorm race condition by @angeloskath in https://github.com/ml-explore/mlx/pull/2340
- Build with all cpu cores by default by @zcbenz in https://github.com/ml-explore/mlx/pull/2336
- [CUDA] Do vectorized store/load in binary ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2330
- Auto build linux release by @awni in https://github.com/ml-explore/mlx/pull/2341
- MoE backward improvements by @angeloskath in https://github.com/ml-explore/mlx/pull/2335
- Fix compilation with CUDA 11 by @zcbenz in https://github.com/ml-explore/mlx/pull/2331
- patch bump by @awni in https://github.com/ml-explore/mlx/pull/2343
- Align mlx::core::max op nan propagation with NumPy by @jhavukainen in https://github.com/ml-explore/mlx/pull/2339
- Add zero for argsort vjp by @awni in https://github.com/ml-explore/mlx/pull/2345
- [CUDA] Do vectorized store/load in contiguous elementwise ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2342
- Align mlx::core::min op nan propagation with NumPy by @jhavukainen in https://github.com/ml-explore/mlx/pull/2346
- [CUDA] Set current device before cudaGraphLaunch by @zcbenz in https://github.com/ml-explore/mlx/pull/2351
- [CUDA] Put version in ptx cache dir path by @zcbenz in https://github.com/ml-explore/mlx/pull/2352
- Fix type promotion in Adam with bias correction by @angeloskath in https://github.com/ml-explore/mlx/pull/2350
- Fix edge check in QuantizedBlockLoader for qmm_n by @angeloskath in https://github.com/ml-explore/mlx/pull/2355
- [CUDA] Implement Scan kernel by @zcbenz in https://github.com/ml-explore/mlx/pull/2347
- [Metal] fix copy dispatch by @awni in https://github.com/ml-explore/mlx/pull/2360
- [CUDA] Bundle CCCL for JIT compilation by @zcbenz in https://github.com/ml-explore/mlx/pull/2357
- [CUDA] Do not put kernels in anonymous namespace by @zcbenz in https://github.com/ml-explore/mlx/pull/2362
- Fix imag() vjp by @angeloskath in https://github.com/ml-explore/mlx/pull/2367
- Add Primitive::name and remove Primitive::print by @zcbenz in https://github.com/ml-explore/mlx/pull/2365
- update linux build by @awni in https://github.com/ml-explore/mlx/pull/2370
- [CUDA] Affine quantize by @awni in https://github.com/ml-explore/mlx/pull/2354
- Fix flaky linux test by @awni in https://github.com/ml-explore/mlx/pull/2371
- Install linux with mlx[cuda] and mlx[cpu] by @awni in https://github.com/ml-explore/mlx/pull/2356
- [CUDA] Use cuda::std::complex in place of cuComplex by @zcbenz in https://github.com/ml-explore/mlx/pull/2372
- lower memory uniform sampling by @awni in https://github.com/ml-explore/mlx/pull/2361
- [CUDA] Fix complex reduce + nan propagation in min and max by @awni in https://github.com/ml-explore/mlx/pull/2377
- Rename the copy util in cpu/copy.h to copy_cpu by @zcbenz in https://github.com/ml-explore/mlx/pull/2378
- fix ring distributed test by @awni in https://github.com/ml-explore/mlx/pull/2380
- Test with CUDA 12.2 by @awni in https://github.com/ml-explore/mlx/pull/2375
- [CUDA] Add work per thread to compile by @angeloskath in https://github.com/ml-explore/mlx/pull/2368
- [CUDA] Fix resource leaks in matmul and graph by @awni in https://github.com/ml-explore/mlx/pull/2383
- [CUDA] Add more ways finding CCCL headers in JIT by @zcbenz in https://github.com/ml-explore/mlx/pull/2382
- Add `contiguous_copy_gpu` util for copying array by @zcbenz in https://github.com/ml-explore/mlx/pull/2379
- Adding support for the Muon Optimizer by @Goekdeniz-Guelmez in https://github.com/ml-explore/mlx/pull/1914
- Patch bump by @awni in https://github.com/ml-explore/mlx/pull/2386
- Fix release build + patch bump by @awni in https://github.com/ml-explore/mlx/pull/2387
- Fix cuda manylinux version to match others by @awni in https://github.com/ml-explore/mlx/pull/2388
- [CUDA] speedup handling scalars by @awni in https://github.com/ml-explore/mlx/pull/2389
- Remove thrust iterators by @zcbenz in https://github.com/ml-explore/mlx/pull/2396
- Add `contiguous_copy_cpu` util for copying array by @zcbenz in https://github.com/ml-explore/mlx/pull/2397
- Fix including stubs in wheel by @awni in https://github.com/ml-explore/mlx/pull/2398
- use size option in binary by @awni in https://github.com/ml-explore/mlx/pull/2399
- [CUDA] Simplify allocator by @awni in https://github.com/ml-explore/mlx/pull/2392
- Add cuda gemv by @awni in https://github.com/ml-explore/mlx/pull/2400
- Fix an error in the comment for mx.dequantize by @csukuangfj in https://github.com/ml-explore/mlx/pull/2409
- Remove unused code in Convolution::vjp by @zcbenz in https://github.com/ml-explore/mlx/pull/2408
- [CUDA] --compress-mode requires CUDA 12.8 by @zcbenz in https://github.com/ml-explore/mlx/pull/2407
- full row mask in sdpa consistently gives nan by @awni in https://github.com/ml-explore/mlx/pull/2406
- Fix uv install and add dev release by @awni in https://github.com/ml-explore/mlx/pull/2411
- [Metal] Release metal events by @awni in https://github.com/ml-explore/mlx/pull/2412
- Test on cuda 12.2 and 12.9 by @awni in https://github.com/ml-explore/mlx/pull/2413
- [CUDA] Initial implementation of Convolution with cuDNN by @zcbenz in https://github.com/ml-explore/mlx/pull/2385
- [DOCS]: Fix eps placement in Adam and AdamW by @Skonor in https://github.com/ml-explore/mlx/pull/2416
- [CUDA] Always use batched matmul by @awni in https://github.com/ml-explore/mlx/pull/2404
- Fix qvm splitk by @awni in https://github.com/ml-explore/mlx/pull/2415
- Update install docs and requirements by @awni in https://github.com/ml-explore/mlx/pull/2419
- version by @awni in https://github.com/ml-explore/mlx/pull/2420
New Contributors
- @emmanuel-ferdman made their first contribution in https://github.com/ml-explore/mlx/pull/2250
- @FL33TW00D made their first contribution in https://github.com/ml-explore/mlx/pull/2264
- @jhavukainen made their first contribution in https://github.com/ml-explore/mlx/pull/2339
- @Goekdeniz-Guelmez made their first contribution in https://github.com/ml-explore/mlx/pull/1914
- @Skonor made their first contribution in https://github.com/ml-explore/mlx/pull/2416
Full Changelog: https://github.com/ml-explore/mlx/compare/v0.26.0...v0.27.0
Published by awni 8 months ago
mlx - v0.26.0
Highlights
- 5 bit quantization
- Significant progress on CUDA back-end by @zcbenz
Core
Features
- 5bit quants
- Allow per-target Metal debug flags
- Add complex eigh
- reduce vjp for `mx.all` and `mx.any`
- `real` and `imag` properties
- Non-symmetric `mx.linalg.eig` and `mx.linalg.eigh`
- convolution vmap
- Add more complex unary ops (`sqrt`, `square`, ...)
- Complex scan
- Add `mx.broadcast_shapes`
- Added `output_padding` parameter in `conv_transpose`
- Add random normal distribution for complex numbers
- Add `mx.fft.fftshift` and `mx.fft.ifftshift` helpers
- Enable vjp for quantized scale and bias
Performance
- Optimizing Complex Matrix Multiplication using Karatsuba's Algorithm
- Much faster 1D conv
Cuda
- Generalize gpu backend
- Use fallbacks in fast primitives when `eval_gpu` is not implemented
- Add memory cache to CUDA backend
- Do not check `event.is_signaled()` in `eval_impl`
- Build for compute capability 70 instead of 75 in CUDA backend
- CUDA backend: backbone
Bug Fixes
- Fix out-of-bounds default value in logsumexp/softmax
- include `mlx::core::version()` symbols in the mlx static library
- Fix Nearest upsample
- Fix large arg reduce
- fix conv grad
- Fix some complex vjps
- Fix typo in `row_reduce_small`
- Fix `put_along_axis` for empty arrays
- Close a couple edge case bugs: `hadamard` and `addmm` on empty inputs
- Fix fft for integer overflow with large batches
- fix: `conv_general` differences between gpu, cpu
- Fix batched vector sdpa
- GPU Hadamard for large N
- Improve bandwidth for elementwise ops
- Fix compile merging
- Fix shapeless export to throw on dim mismatch
- Fix `mx.linalg.pinv` for singular matrices
- Fixed shift operations
- Fix integer overflow in qmm
Contributors
Thanks to some awesome contributors!
@ivanfioravanti, @awni, @angeloskath, @zcbenz, @Jckwind, @iExalt, @thesuryash, @ParamThakkar123, @djphoenix, @ita9naiwa, @hdeng-apple, @Redempt1onzzZZ, @charan-003, @skyzh, @wisefool769, @barronalex, @aturker1
Published by awni 9 months ago
mlx - v0.25.0
Highlights
- Custom logsumexp for reduced memory in training (benchmark)
- Depthwise separable convolutions
- Up to 4x faster than PyTorch
- benchmark
- Batched Gather MM and Gather QMM for ~2x faster prompt processing for MoEs
Core
Performance
- Fused vector attention supports 256 dim
- Tune quantized matrix vector dispatch for small batches of vectors
Features
- Move memory API in the top level mlx.core and enable for CPU only allocator
- Enable using MPI from all platforms and allow only OpenMPI
- Add a ring all gather for the ring distributed backend
- Enable gemm for complex numbers
- Fused attention supports literal "causal" mask
- Log for complex numbers
- Distributed `all_min` and `all_max`, both for MPI and the ring backend
- Add `logcumsumexp`
- Add additive mask for fused vector attention
- Improve the usage of the residency set
NN
- Add sharded layers for model/tensor parallelism
Bugfixes
- Fix possible allocator deadlock when using multiple streams
- Ring backend supports 32 bit platforms and FreeBSD
- Fix FFT bugs
- Fix attention mask type for fused attention kernel
- Fix fused attention numerical instability with masking
- Add a fallback for float16 gemm
- Fix simd sign for uint64
- Fix issues in docs
Published by angeloskath 11 months ago
mlx - v0.24.0
Highlights
- Much faster fused attention with support for causal masking
- Benchmarks
- Improvements in prompt processing speed and memory use, benchmarks
- Much faster small batch fused attention for e.g. speculative decoding, benchmarks
- Major redesign of CPU back-end for faster CPU-GPU synchronization
Core
Performance
- Support fused masking in `scaled_dot_product_attention`
- Support transposed head/seq for fused vector `scaled_dot_product_attention`
- SDPA support for small batch (over sequence) queries
- Enabling fused attention for head dim 128
- Redesign CPU back-end for faster cpu/gpu synch
Features
- Allow debugging in distributed mode
- Support `mx.fast.rms_norm` without scale
- Adds nuclear norm support in `mx.linalg.norm`
- Add XOR on arrays
- Added `mlx::core::version()`
- Allow non-square lu in `mx.linalg.lu`
- Double for lapack ops (`eigh`, `svd`, etc.)
- Add a prepare tb ring script
- Ring docs
- Affine quant always in fp32
Optimizers
- Add a multi optimizer `optimizers.MultiOptimizer`
Bug Fixes
- Do not define `MLX_VERSION` globally
- Reduce binary size post fast synch
- Fix vmap for flatten
- Fix copy for large arrays with JIT
- Fix grad with inplace updates
- Use same accumulation precision in gemv as gemm
- Fix slice data size
- Use a heap for small sizes
- Fix donation in scan
- Ensure linspace always contains start and stop
- Raise an exception in the rope op if input is integer
- Limit compile buffers by
- fix `mx.float64` type promotion
- Fix CPU SIMD erf_inv
- Update `smooth_l1_loss` in losses.
Published by jagrit06 12 months ago
mlx - v0.23.0
Highlights
- 4-bit Mistral 7B generates at 131 toks/sec out of the box on an M2 Ultra
- More performance improvements across the board:
- Faster small batch quantized matmuls. Speeds up speculative decoding on M1, M2
- Faster winograd convolutions, benchmarks
- Up to 3x faster sort, benchmarks
- Much faster `mx.put_along_axis` and `mx.take_along_axis`, benchmarks
- Faster unified CPU back-end with vector operations
- Double precision (`mx.float64`) support on the CPU
Core
Features
- Bitwise invert `mx.bitwise_invert`
- `mx.linalg.lu`, `mx.linalg.lu_factor`, `mx.linalg.solve`, `mx.linalg.solve_triangular`
- Support loading F8_E4M3 from safetensors
- `mx.float64` supported on the CPU
- Matmul JVPs
- Distributed launch helper: `mlx.launch`
- Support non-square QR factorization with `mx.linalg.qr`
- Support ellipsis in `mx.einsum`
- Refactor and unify accelerate and common back-ends
Performance
- Faster synchronization `Fence` for synchronizing CPU-GPU
- Much faster `mx.put_along_axis` and `mx.take_along_axis`, benchmarks
- Fast winograd convolutions, benchmarks
- Allow dynamic ops per buffer based on dispatches and memory, benchmarks
- Up to 3x faster sort, benchmarks
- Faster small batch qmv, benchmarks
- Ring distributed backend
- Uses raw sockets for faster all reduce
- Some CPU ops are much faster with the new `Simd<T, N>`
NN
- Orthogonal initializer `nn.init.orthogonal`
- Add dilation for conv 3d layers
Bug fixes
- Limit grad recursion depth by not recursing through non-grad inputs
- Fix synchronization bug for GPU stream async CPU work
- Fix shapeless compile on ubuntu24
- Recompile when `shapeless` changes
- Fix rope fallback to not upcast
- Fix metal sort for certain cases
- Fix a couple of slicing bugs
- Avoid duplicate malloc with custom kernel init
- Fix compilation error on Windows
- Allow Python garbage collector to break cycles on custom objects
- Fix grad with copies
- Loading empty list is ok when `strict = false`
- Fix split vmap
- Fixes output donation for IO ops on the GPU
- Fix creating an array with an int64 scalar
- Catch stream errors earlier to avoid aborts
Published by awni about 1 year ago
mlx - v0.22.0
Highlights
- Export and import MLX functions to a file (example, bigger example)
- Functions can be exported from Python and run in C++ and vice versa
Core
- Add `slice` and `slice_update` which take arrays for starting locations
- Add an example for using MLX in C++ with CMake
- Fused attention for generation now supports boolean masking (benchmark)
- Allow array offset for `mx.fast.rope`
- Add `mx.finfo`
- Allow negative strides without resorting to copying for `slice` and `as_strided`
- Add `Flatten`, `Unflatten` and `ExpandDims` primitives
- Enable the compilation of lambdas in C++
- Add a lot more primitives for shapeless compilation (full list)
- Fix performance regression in `qvm`
- Introduce separate types for `Shape` and `Strides` and switch to int64 strides from uint64
- Reduced copies for fused-attention kernel
- Recompile a function when the stream changes
- Several steps to improve the linux / x86_64 experience (#1625, #1627, #1635)
- Several steps to improve/enable the windows experience (#1628, #1660, #1662, #1661, #1672, #1663, #1664, ...)
- Update to newer Metal-cpp
- Throw when exceeding the maximum number of buffers possible
- Add `mx.kron`
- `mx.distributed.send` now implements the identity function instead of returning an empty array
- Better error reporting for `mx.compile` on CPU and for unrecoverable errors
NN
- Add optional bias correction in Adam/AdamW
- Enable mixed quantization by `nn.quantize`
- Remove reshapes from `nn.QuantizedEmbedding`
Bug fixes
- Fix qmv/qvm bug for batch size 2-5
- Fix some leaks and races (#1629)
- Fix transformer postnorm in `mlx.nn`
- Fix some `mx.fast` fallbacks
- Fix the hashing for string constants in `compile`
- Fix memory leak of non-evaled arrays with siblings
- Fix concatenate/slice_update vjp in edge-case where the inputs have different type
Published by angeloskath about 1 year ago
mlx - v0.21.0
Highlights
- Support 3 and 6 bit quantization: benchmarks
- Much faster memory efficient attention for head dim 64, 80: benchmarks
- Much faster sdpa inference kernel for longer sequences: benchmarks
Core
- `contiguous` op (C++ only) + primitive
- Bfs width limit to reduce memory consumption during `eval`
- Fast CPU quantization
- Faster indexing math in several kernels:
- unary, binary, ternary, copy, compiled, reduce
- Improve dispatch threads for a few kernels:
- conv, gemm splitk, custom kernels
- More buffer donation with no-ops to reduce memory use
- Use `CMAKE_OSX_DEPLOYMENT_TARGET` to pick Metal version
- Dispatch Metal bf16 type at runtime when using the JIT
NN
- `nn.AvgPool3d` and `nn.MaxPool3d`
- Support `groups` in `nn.Conv2d`
Bug fixes
- Fix per-example mask + docs in sdpa
- Fix FFT synchronization bug (use dispatch method everywhere)
- Throw for invalid `*fft{2,n}` cases
- Fix OOB access in qmv
- Fix donation in sdpa to reduce memory use
- Allocate safetensors header on the heap to avoid stack overflow
- Fix sibling memory leak
- Fix `view` segfault for scalars input
- Fix concatenate vmap
Published by awni over 1 year ago
mlx - v0.20.0
Highlights
- Even faster GEMMs
- Peaking at 23.89 TFlops on M2 Ultra benchmarks
- BFS graph optimizations
- Over 120 tok/s with Mistral 7B!
- Fast batched QMV/QVM for KV quantized attention benchmarks
Core
- New Features
- `mx.linalg.eigh` and `mx.linalg.eigvalsh`
- `mx.nn.init.sparse`
- 64bit type support for `mx.cumprod`, `mx.cumsum`
- Performance
- Faster long column reductions
- Wired buffer support for large models
- Better Winograd dispatch condition for convs
- Faster scatter/gather
- Faster `mx.random.uniform` and `mx.random.bernoulli`
- Better threadgroup sizes for large arrays
- Misc
- Added Python 3.13 to CI
- C++20 compatibility
Bugfixes
- Fix command encoder synchronization
- Fix `mx.vmap` with gather and constant outputs
- Fix fused sdpa with differing key and value strides
- Support `mx.array.__format__` with format spec
- Fix multi-output array leak
- Fix RMSNorm weight mismatch error
- C++
Published by barronalex over 1 year ago
mlx - v0.19.0
Highlights
- Speed improvements
- Up to 6x faster CPU indexing benchmarks
- Faster Metal compiled kernels for strided inputs benchmarks
- Faster generation with fused-attention kernel benchmarks
- Gradient for grouped convolutions
- Due to Python 3.8's end-of-life we no longer test with it on CI
Core
- New features
- Gradient for grouped convolutions
- `mx.roll`
- `mx.random.permutation`
- `mx.real` and `mx.imag`
- Performance
- Up to 6x faster CPU indexing benchmarks
- Faster CPU sort benchmarks
- Faster Metal compiled kernels for strided inputs benchmarks
- Faster generation with fused-attention kernel benchmarks
- Bulk eval in safetensors to avoid unnecessary serialization of work
- Misc
- Bump to nanobind 2.2
- Move testing to python 3.9 due to 3.8's end-of-life
- Make the GPU device more thread safe
- Fix the submodule stubs for better IDE support
- CI generated docs that will never be stale
NN
- Add support for grouped 1D convolutions to the nn API
- Add some missing type annotations
Bugfixes
- Fix and speedup row-reduce with few rows
- Fix normalization primitive segfault with unexpected inputs
- Fix complex power on the GPU
- Fix freeing deep unevaluated graphs details
- Fix race with `array::is_available`
- Consistently handle softmax with all `-inf` inputs
- Fix streams in affine quantize
- Fix CPU compile preamble for some linux machines
- Stream safety in CPU compilation
- Fix CPU compile segfault at program shutdown
- C++
Published by angeloskath over 1 year ago
mlx - v0.18.0
Highlights
- Speed improvements:
- Up to 2x faster I/O: benchmarks.
- Faster transposed copies, unary, and binary ops
- CPU benchmarks here.
- GPU benchmarks here and here.
- Transposed convolutions
- Improvements to `mx.distributed` (send/recv/average_gradients)
Core
New features:
- `mx.conv_transpose{1,2,3}d`
- Allow `mx.take` to work with an integer index
- Add `std` as a method on `mx.array`
- `mx.put_along_axis`
- `mx.cross_product`
- `int()` and `float()` work on scalar `mx.array`
- Add optional headers to `mx.fast.metal_kernel`
- `mx.distributed.send` and `mx.distributed.recv`
- `mx.linalg.pinv`
Performance
- Up to 2x faster I/O
- Much faster CPU convolutions
- Faster general n-dimensional copies, unary, and binary ops for both CPU and GPU
- Put reduction ops in default stream with async for faster comms
- Overhead reductions in `mx.fast.metal_kernel`
- Improve donation heuristics to reduce memory use
Misc
- Support Xcode 16
NN
- Faster RNN layers
- `nn.ConvTranspose{1,2,3}d`
- `mlx.nn.average_gradients`: data-parallel helper for distributed training
Bug Fixes
- Fix boolean all reduce bug
- Fix extension metal library finding
- Fix ternary for large arrays
- Make eval just wait if all arrays are scheduled
- Fix CPU softmax by removing a redundant coefficient in the NEON fast exp
- Fix JIT reductions
- Fix overflow in quantize/dequantize
- Fix compile with byte sized constants
- Fix copy in the sort primitive
- Fix reduce edge case
- Fix slice data size
- Throw for certain cases of non captured inputs in compile
- Fix copying scalars by adding fill_gpu
- Fix bug in module attribute set, reset, set
- Ensure io/comm streams are active before eval
- Fix `mx.clip`
- Override class function in Repr so `mx.array` is not confused with `array.array`
- Avoid using find_library to make the install truly portable
- Remove fmt dependencies from MLX install
- Fix for partition VJP
- Avoid command buffer timeout for IO on large arrays
- C++
Published by awni over 1 year ago
mlx - v0.17.0
Highlights
- `mx.einsum`: PR
- Big speedups in reductions: benchmarks
- 2x faster model loading: PR
- `mx.fast.metal_kernel` for custom GPU kernels: docs
Core
- Faster program exits
- Laplace sampling
- `mx.nan_to_num`
- `nn.tanh` gelu approximation
- Fused GPU quantization ops
- Faster group norm
- bf16 winograd conv
- vmap support for `mx.scatter`
- `mx.pad` "edge" padding
- More numerically stable `mx.var`
- `mx.linalg.cholesky_inv` / `mx.linalg.tri_inv`
- `mx.isfinite`
- Complex `mx.sign` now mirrors NumPy 2.0 behaviour
- More flexible `mx.fast.rope`
- Update to `nanobind` 2.1
Bug Fixes
- gguf zero initialization
- expm1f overflow handling
- bfloat16 hadamard
- large arrays for various ops
- rope fix
- bf16 array creation
- preserve dtype in `nn.Dropout`
- `nn.TransformerEncoder` with `norm_first=False`
- excess copies from contiguity bug
- C++
Published by barronalex over 1 year ago
mlx - v0.16.0
Highlights
- `@mx.custom_function` for custom `vjp`/`jvp`/`vmap` transforms
- Up to 2x faster Metal GEMV and fast masked GEMV
- Fast `hadamard_transform`
Core
- Metal 3.2 support
- Reduced CPU binary size
- Added quantized GPU ops to JIT
- Faster GPU compilation
- Added grads for bitwise ops + indexing
Bug Fixes
- 1D scatter bug
- Strided sort bug
- Reshape copy bug
- Segfault in `mx.compile`
- Donation condition in compilation
- Compilation of Accelerate on iOS
- C++
Published by barronalex over 1 year ago
mlx - v0.15.0
Highlights
- Fast Metal GPU FFTs
- On average ~30x faster than CPU
- More benchmarks
- `mx.distributed` with `all_sum` and `all_gather`
Core
- Added dlpack device `__dlpack_device__`
- Fast GPU FFTs benchmarks
- Add docs for `mx.distributed`
- Add `mx.view` op
NN
- `softmin`, `hardshrink`, and `hardtanh` activations
Bugfixes
- Fix broadcast bug in bitwise ops
- Allow more buffers for JIT compilation
- Fix matvec vector stride bug
- Fix multi-block sort stride management
- Stable cumprod grad at 0
- Bug fix for a race condition in scan
- C++
Published by awni almost 2 years ago
mlx - v0.14.0
Highlights
- Small-size build that JIT compiles kernels and omits the CPU backend, resulting in a binary under 4MB
- `mx.gather_qmm`: quantized equivalent of `mx.gather_mm`, which speeds up MoE inference by ~2x
- Grouped 2D convolutions
Core
- `mx.conjugate`
- `mx.conv3d` and `nn.Conv3d`
- List-based indexing
- Started `mx.distributed`, which uses MPI (if installed) for communication across machines:
  - `mx.distributed.init`
  - `mx.distributed.all_gather`
  - `mx.distributed.all_reduce_sum`
- Support conversion to and from dlpack
- `mx.linalg.cholesky` on CPU
- `mx.quantized_matmul` sped up for vector-matrix products
- `mx.trace`
- `mx.block_masked_mm` now supports floating-point masks!
Fixes
- Error messaging in eval
- Add some missing docs
- Scatter index bug
- The extensions example now compiles and runs
- CPU copy bug with many dimensions
- C++
Published by angeloskath almost 2 years ago
mlx - v0.13.0
Highlights
- Block sparse matrix multiply speeds up MoEs by >2x
- Improved quantization algorithm should work well for all networks
- Improved GPU command submission speeds up training and inference
Core
- Bitwise ops added: `mx.bitwise_[or|and|xor]`, `mx.[left|right]_shift`, operator overloads
- Groups added to Conv1d
- Added `mx.metal.device_info` to get better-informed memory limits
- Added resettable memory stats
- `mlx.optimizers.clip_grad_norm` and `mlx.utils.tree_reduce` added
- Add `mx.arctan2`
- Unary ops now accept array-like inputs, i.e. one can do `mx.sqrt(2)`
Bugfixes
- Fixed shape for slice update
- Bugfix in quantize that used slightly wrong scales/biases
- Fixed memory leak for multi-output primitives encountered with gradient checkpointing
- Fixed conversion from other frameworks for all datatypes
- Fixed index overflow for matmul with large batch size
- Fixed initialization ordering that occasionally caused segfaults
- C++
Published by angeloskath almost 2 years ago
mlx - v0.12.0
Highlights
- Faster quantized matmul
- Up to 40% faster QLoRA or prompt processing, some numbers
Core
- `mx.synchronize` to wait for computation dispatched with `mx.async_eval`
- `mx.radians` and `mx.degrees`
- `mx.metal.clear_cache` to return to the OS the memory held by MLX as a cache for future allocations
- Change quantization to always represent 0 exactly (relevant issue)
Bugfixes
- Fixed quantization of a block with all 0s that produced NaNs
- Fixed the `len` field in the buffer protocol implementation
- C++
Published by angeloskath almost 2 years ago
mlx - v0.11.0
Core
- `mx.block_masked_mm` for block-level sparse matrix multiplication
- Shared events for synchronization and asynchronous evaluation
NN
- `nn.QuantizedEmbedding` layer
- `nn.quantize` for quantizing modules
- `gelu_approx` uses tanh for consistency with PyTorch
- C++
Published by awni almost 2 years ago
mlx - v0.10.0
Highlights
- Improvements for LLM generation
- Reshapeless quant matmul/matvec
- `mx.async_eval`
- Async command encoding
Core
- Slightly faster reshapeless quantized gemms
- Option for precise softmax
- `mx.metal.start_capture` and `mx.metal.stop_capture` for GPU debug/profile
- `mx.expm1`
- `mx.std`
- `mx.meshgrid`
- CPU-only `mx.random.multivariate_normal`
- `mx.cumsum` (and other scans) for `bfloat`
- Async command encoder with explicit barriers / dependency management
NN
- `nn.Upsample` supports bicubic interpolation
Misc
- Updated MLX Extension to work with nanobind
Bugfixes
- Fix buffer donation in softmax and fast ops
- Bug in layer norm vjp
- Bug initializing from lists with scalar
- Bug in indexing
- CPU compilation bug
- Multi-output compilation bug
- Fix stack overflow issues in eval and array destruction
- C++
Published by awni almost 2 years ago
mlx - v0.9.0
Highlights:
- Fast partial RoPE (used by Phi-2)
- Fast gradients for RoPE, RMSNorm, and LayerNorm
- Up to 7x faster, benchmarks
Core
- More overhead reductions
- Partial fast RoPE (fast Phi-2)
- Better buffer donation for copy
- Type hierarchy and issubdtype
- Fast VJPs for RoPE, RMSNorm, and LayerNorm
NN
- `Module.set_dtype`
- Chaining in `nn.Module` (`model.freeze().update(…)`)
Bugfixes
- Fix set item bugs
- Fix scatter vjp
- Check shape integer overflow on array construction
- Fix bug with module attributes
- Fix two bugs for odd shaped QMV
- Fix GPU sort for large sizes
- Fix bug in negative padding for convolutions
- Fix bug in multi-stream race condition for graph evaluation
- Fix random normal generation for half precision
- C++
Published by awni almost 2 years ago
mlx - v0.8.0
Highlights
- More perf!
- `mx.fast.rms_norm` and `mx.fast.layer_norm`
- Switch to nanobind substantially reduces overhead
- Up to 4x faster `__setitem__` (e.g. `a[...] = b`)
Core
- `mx.inverse`, CPU only
- vmap over `mx.matmul` and `mx.addmm`
- Switch to nanobind from pybind11
- Faster setitem indexing
- `mx.fast.rms_norm`, token generation benchmark
- `mx.fast.layer_norm`, token generation benchmark
- vmap for inverse and svd
- Faster non-overlapping pooling
Optimizers
- Set minimum value in cosine decay scheduler
Bugfixes
- Fix bug in multi-dimensional reduction
- C++
Published by awni almost 2 years ago
mlx - v0.7.0
Highlights
- Perf improvements for attention ops:
- No copy broadcast matmul (benchmarks)
- Fewer copies in reshape
Core
- Faster broadcast + gemm
- `mx.linalg.svd` (CPU only)
- Fewer copies in reshape
- Faster small reductions
NN
- `nn.RNN`, `nn.LSTM`, `nn.GRU`
Bugfixes
- Fix bug in depth traversal ordering
- Fix two edge case bugs in compilation
- Fix bug with modules with dictionaries of weights
- Fix bug with scatter which broke MoE training
- Fix bug with compilation kernel collision
- C++
Published by awni almost 2 years ago
mlx - v0.6.0
Highlights:
- Faster quantized matrix-vector multiplies
- `mx.fast.scaled_dot_product_attention` fused op
Core
- Memory allocation API improvements
- Faster GPU reductions for smaller sizes (between 2 and 7x)
- `mx.fast.scaled_dot_product_attention` fused op
- Pickle support for `mx.array`
NN
- Dilation on convolution layers
Bugfixes
- Fix `mx.topk`
- Fix reshape for zero sizes
- C++
Published by angeloskath about 2 years ago
mlx - v0.5.0
Highlights:
- Faster convolutions.
- Up to 14x faster for some common sizes.
- See benchmarks
Core
- `mx.where` properly handles `inf`
- Faster and more general convolutions
- Input and kernel dilation
- Asymmetric padding
- Support for cross-correlation and convolution
- `atleast_{1,2,3}d` accept any number of arrays
NN
- `nn.Upsample` layer
- Supports nearest neighbor and linear interpolation
- Any number of dimensions
Optimizers
- Linear schedule and schedule joiner:
- Use for e.g. linear warmup + cosine decay
Bugfixes
- `arange` throws on `inf` inputs
- Fix CMake build with MLX
- Fix `logsumexp` `inf` edge case
- Fix grad of power w.r.t. the exponent edge case
- Fix compile with `inf` constants
- Fix temporary bug in convolution
- C++
Published by jagrit06 about 2 years ago
mlx - v0.4.0
Highlights:
- Partial shapeless compilation
- Default shapeless compilation for all activations
- Can be more than 5x faster than uncompiled versions
- CPU kernel fusion
- Some functions can be up to 10x faster
Core
- CPU compilation
- Shapeless compilation for some cases: `mx.compile(function, shapeless=True)`
- Up to 10x faster scatter: benchmarks
- `mx.atleast_1d`, `mx.atleast_2d`, `mx.atleast_3d`
Bugfixes
- Bug with `tolist` with `bfloat16` and `float16`
- Bug with `argmax` on M3
- C++
Published by awni about 2 years ago
mlx - v0.2.0
Highlights:
- `mx.compile` makes stuff go fast
- Some functions are up to 10x faster (benchmarks)
- Training models anywhere from 10% to twice as fast (benchmarks)
- Simple syntax for compiling full training steps
Core
- `mx.compile` function transformation
- Find devices properly for iOS
- Up to 10x faster GPU gather
- `__abs__` overload for `abs` on arrays
- `loc` and `scale` parameters for `mx.random.normal`
NN
- Margin ranking loss
- BCE loss with weights
Bugfixes
- Fix for broken eval during function transformations
- Fix `mx.var` to give `inf` with `ddof >= nelem`
- Fix loading empty modules in `nn.Sequential`
- C++
Published by awni about 2 years ago
mlx - v0.1.0
Highlights
- Memory use improvements:
  - Gradient checkpointing for training with `mx.checkpoint`
  - Better graph execution order
  - Buffer donation
Core
- Gradient checkpointing with `mx.checkpoint`
- CPU-only QR factorization `mx.linalg.qr`
- Release Python GIL during `mx.eval`
- Depth-based graph execution order
- Lazy loading arrays from files
- Buffer donation for reduced memory use
- `mx.diag`, `mx.diagonal`
- Breaking: `array.shape` is a Python tuple
- GPU support for `int64` and `uint64` reductions
- vmap over reductions and arg reductions:
  - `sum`, `prod`, `max`, `min`, `all`, `any`
  - `argmax`, `argmin`
NN
- Softshrink activation
Bugfixes
- Comparisons with `inf` work, and fix `mx.isinf`
- Handle empty Matmul on the CPU
- Negative shape checking for `mx.full`
- Correctly propagate `NaN` in some binary ops: `mx.logaddexp`, `mx.maximum`, `mx.minimum`
- Fix > 4D non-contiguous binary ops
- Fix `mx.log1p` with `inf` input
- Fix SGD to apply weight decay even with 0 momentum
- C++
Published by angeloskath about 2 years ago
mlx - v0.0.11
Highlights:
- GGUF improvements:
  - Native quantizations `Q4_0`, `Q4_1`, and `Q8_0`
  - Metadata
Core
- Support for reading and writing GGUF metadata
- Native GGUF quantization (`Q4_0`, `Q4_1`, and `Q8_0`)
- Quantize with group size of 32 (2x32, 4x32, and 8x32)
NN
- `Module.save_weights` supports safetensors
- `nn.init` package with several commonly used neural network initializers
- Binary cross entropy and cross entropy losses can take probabilities as targets
- `Adafactor` in `nn.optimizers`
Bugfixes
- Fix `isinf` and friends for integer types
- Fix array creation from lists of Python ints to `int64`, `uint`, and `float32`
- Fix power VJP for `0` inputs
- Fix out-of-bounds `inf` reads in `gemv`
- `mx.arange` crashes on NaN inputs
- C++
Published by angeloskath about 2 years ago
mlx - v0.0.10
Highlights:
- Faster matmul: up to 2.5x faster for certain sizes, benchmarks
- Fused matmul + addition (for faster linear layers)
Core
- Quantization supports sizes other than multiples of 32
- Faster GEMM (matmul)
- `addmm` primitive (fused addition and matmul)
- `mx.isnan`, `mx.isinf`, `mx.isposinf`, `mx.isneginf`
- `mx.tile`
- VJPs for `scatter_min` and `scatter_max`
- Multi-output split primitive
NN
- Losses: Gaussian negative log-likelihood
Misc
- Performance enhancements for graph evaluation with lots of outputs
- Default PRNG seed is based on current time instead of 0
- Primitive VJP takes output as input. Reduces redundant work without need for simplification
- Format boolean printing in Python style when in Python
Bugfixes
- Scatter < 32 bit precision and integer overflow fix
- Overflow with `mx.eye`
- Change `mx.round` to follow NumPy, which rounds half to even
- C++
Published by awni about 2 years ago
mlx - v0.0.9
Highlights:
- Initial (and experimental) GGUF support
- Support Python buffer protocol (easy interoperability with NumPy, Jax, Tensorflow, PyTorch, etc)
- `at[]` syntax for scatter-style operations: `x.at[idx].add(y)` (`min`, `max`, `prod`, etc.)
Core
- Array creation from other `mx.array`s (`mx.array([x, y])`)
- Complete support for Python buffer protocol
- `mx.inner`, `mx.outer`
- `mx.logical_and`, `mx.logical_or`, and operator overloads
- Array `at` syntax for scatter ops
- Better support for in-place operations (`+=`, `*=`, `-=`, ...)
- VJP for scatter and scatter add
- Constants (`mx.pi`, `mx.inf`, `mx.newaxis`, …)
NN
- GLU activation
- `cosine_similarity` loss
- Cache for `RoPE` and `ALiBi`
Bugfixes / Misc
- Fix data type with `tri`
- Fix saving non-contiguous arrays
- Fix graph retention for in-place state, and remove `retain_graph`
- Multi-output primitives
- Better support for loading devices
- C++
Published by awni about 2 years ago
mlx - v0.0.7
Core
- Support for loading and saving Hugging Face's safetensors format
- Transposed quantization matmul kernels
- `mlx.core.linalg` sub-package with `mx.linalg.norm` (Frobenius, infinity, p-norms)
- `tensordot` and `repeat`
NN
- Layers
- `Bilinear`, `Identity`, `InstanceNorm`
- `Dropout2D`, `Dropout3D`
- More customizable `Transformer` (pre/post norm, dropout)
- More activations: `SoftSign`, `Softmax`, `HardSwish`, `LogSoftmax`
- Configurable scale in `RoPE` positional encodings
- Losses: `hinge`, `huber`, `log_cosh`
Misc
- Faster GPU reductions for certain cases
- Change to memory allocation to allow swapping
- C++
Published by awni about 2 years ago
mlx - v0.0.6
Core
- quantize, dequantize, quantized_matmul
- moveaxis, swapaxes, flatten
- stack
- floor, ceil, clip
- tril, triu, tri
- linspace
Optimizers
- RMSProp, Adamax, Adadelta, Lion
NN
- Layers: `QuantizedLinear`, `ALiBi` positional encodings
- Losses: Label smoothing, Smooth L1 loss, Triplet loss
Misc
- Bug fixes
- C++
Published by angeloskath about 2 years ago