Recent Releases of mlx

mlx - v0.29.0

Highlights

  • Support for mxfp4 quantization (Metal, CPU)
  • More performance improvements, bug fixes, and features in the CUDA backend
  • mx.distributed supports NCCL back-end for CUDA
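
The mxfp4 format in the highlights is the OCP microscaling FP4 type: each value is a 4-bit E2M1 float, and each block of 32 values shares one power-of-two (E8M0) scale. As a rough sketch of what decoding involves (the shared scale is simplified to a plain exponent here; mlx's actual packing and kernels differ):

```python
# The eight non-negative E2M1 magnitudes; the fourth bit is the sign.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_fp4(nibble: int) -> float:
    """Decode one 4-bit E2M1 code: 1 sign bit + 3 magnitude bits."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1[nibble & 0x7]

def decode_block(nibbles, scale_exp):
    """Decode a block of 4-bit codes sharing one power-of-two scale."""
    scale = 2.0 ** scale_exp
    return [decode_fp4(n) * scale for n in nibbles]

print(decode_block([0x2, 0xF], 1))  # [2.0, -12.0]
```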

What's Changed

  • [CUDA] Optimize set_mm_device_pointers for small ndim by @zcbenz in https://github.com/ml-explore/mlx/pull/2473
  • Fix logsumexp/softmax not fused for some cases by @zcbenz in https://github.com/ml-explore/mlx/pull/2474
  • Use CMake <4.1 to avoid the nvpl error by @angeloskath in https://github.com/ml-explore/mlx/pull/2489
  • Fix incorrect interpretation of unsigned dtypes in reduce ops by @abeleinin in https://github.com/ml-explore/mlx/pull/2477
  • make code blocks copyable by @Dan-Yeh in https://github.com/ml-explore/mlx/pull/2480
  • Rename cu::Matmul to CublasGemm by @zcbenz in https://github.com/ml-explore/mlx/pull/2488
  • Faster general unary op by @awni in https://github.com/ml-explore/mlx/pull/2472
  • The naive_conv_2d is no longer used by @zcbenz in https://github.com/ml-explore/mlx/pull/2496
  • Remove the hack around SmallVector in cpu compile by @zcbenz in https://github.com/ml-explore/mlx/pull/2494
  • Clean up code handling both std::vector and SmallVector by @zcbenz in https://github.com/ml-explore/mlx/pull/2493
  • [CUDA] Fix conv grads with groups by @zcbenz in https://github.com/ml-explore/mlx/pull/2495
  • Update cuDNN Frontend to v1.14 by @zcbenz in https://github.com/ml-explore/mlx/pull/2505
  • Ensure small sort doesn't use indices if not argsort by @angeloskath in https://github.com/ml-explore/mlx/pull/2506
  • Ensure no oob read in gemv_masked by @angeloskath in https://github.com/ml-explore/mlx/pull/2508
  • fix custom kernel test by @awni in https://github.com/ml-explore/mlx/pull/2510
  • No segfault with uninitialized array.at by @awni in https://github.com/ml-explore/mlx/pull/2514
  • Fix lapack svd by @awni in https://github.com/ml-explore/mlx/pull/2515
  • Split cuDNN helpers into a separate header by @zcbenz in https://github.com/ml-explore/mlx/pull/2491
  • [CUDA] Add GEMM-based fallback convolution kernels by @zcbenz in https://github.com/ml-explore/mlx/pull/2511
  • Fix docs by @russellizadi in https://github.com/ml-explore/mlx/pull/2518
  • Fix overflow in large filter small channels by @angeloskath in https://github.com/ml-explore/mlx/pull/2520
  • [CUDA] Fix stride of singleton dims before passing to cuDNN by @zcbenz in https://github.com/ml-explore/mlx/pull/2521
  • Custom cuda kernel by @angeloskath in https://github.com/ml-explore/mlx/pull/2517
  • Fix docs omission by @angeloskath in https://github.com/ml-explore/mlx/pull/2524
  • Fix power by @awni in https://github.com/ml-explore/mlx/pull/2523
  • NCCL backend by @nastya236 in https://github.com/ml-explore/mlx/pull/2476
  • [CUDA] Nccl pypi dep + default for cuda by @awni in https://github.com/ml-explore/mlx/pull/2526
  • Fix warning 186-D from nvcc by @zcbenz in https://github.com/ml-explore/mlx/pull/2527
  • [CUDA] Update calls to cudaMemAdvise and cudaGraphAddDependencies for CUDA 13 by @andportnoy in https://github.com/ml-explore/mlx/pull/2525
  • nccl default for backend=any by @awni in https://github.com/ml-explore/mlx/pull/2528
  • Fix allocation bug in NCCL by @awni in https://github.com/ml-explore/mlx/pull/2530
  • Enable COMPILE_WARNING_AS_ERROR for linux builds in CI by @zcbenz in https://github.com/ml-explore/mlx/pull/2534
  • [CUDA] Remove thrust in arange by @zcbenz in https://github.com/ml-explore/mlx/pull/2535
  • Use nccl header only when nccl is not present by @awni in https://github.com/ml-explore/mlx/pull/2539
  • Allow pathlib.Path to save/load functions by @awni in https://github.com/ml-explore/mlx/pull/2541
  • Remove nccl install in release by @awni in https://github.com/ml-explore/mlx/pull/2542
  • [CUDA] Implement DynamicSlice/DynamicSliceUpdate by @zcbenz in https://github.com/ml-explore/mlx/pull/2533
  • Remove stream from average grads so it uses default by @awni in https://github.com/ml-explore/mlx/pull/2532
  • Enable cuda graph toggle by @awni in https://github.com/ml-explore/mlx/pull/2545
  • Tests for save/load with Path by @awni in https://github.com/ml-explore/mlx/pull/2543
  • Run CPP tests for CUDA build in CI by @zcbenz in https://github.com/ml-explore/mlx/pull/2544
  • Separate cpu compilation cache by versions by @zcbenz in https://github.com/ml-explore/mlx/pull/2548
  • [CUDA] Link with nccl by @awni in https://github.com/ml-explore/mlx/pull/2546
  • [CUDA] Use ConcurrentContext in concatenate_gpu by @zcbenz in https://github.com/ml-explore/mlx/pull/2549
  • [CUDA] fix sort by @awni in https://github.com/ml-explore/mlx/pull/2550
  • Add mode parameter for quantization by @awni in https://github.com/ml-explore/mlx/pull/2499
  • Bump xcode in circle by @awni in https://github.com/ml-explore/mlx/pull/2551
  • Fix METAL quantization in JIT + fix release build by @awni in https://github.com/ml-explore/mlx/pull/2553
  • Faster contiguous gather for indices in the first axis by @awni in https://github.com/ml-explore/mlx/pull/2552
  • version bump by @awni in https://github.com/ml-explore/mlx/pull/2554
  • Fix quantized vjp for mxfp4 by @awni in https://github.com/ml-explore/mlx/pull/2555

New Contributors

  • @Dan-Yeh made their first contribution in https://github.com/ml-explore/mlx/pull/2480
  • @russellizadi made their first contribution in https://github.com/ml-explore/mlx/pull/2518
  • @andportnoy made their first contribution in https://github.com/ml-explore/mlx/pull/2525

Full Changelog: https://github.com/ml-explore/mlx/compare/v0.28.0...v0.29.0

- C++
Published by awni 6 months ago

mlx - v0.28.0

Highlights

  • First version of fused sdpa vector for CUDA
  • Convolutions in CUDA
  • Speed improvements in the CUDA backend: normalization layers, softmax, compiled kernels, reduced overheads, and more
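
The fused vector SDPA in the highlights computes the standard attention output softmax(Q Kᵀ / √d) V in a single kernel. A single-head reference in plain Python (illustrative only, not the CUDA kernel):

```python
import math

def sdpa(q, k, v, scale=None):
    """Reference scaled dot-product attention for one head over lists."""
    d = len(q[0])
    scale = scale if scale is not None else 1.0 / math.sqrt(d)
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out
```

The fused kernel avoids materializing the score and weight matrices; a reference like this is useful only for checking outputs.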

What's Changed

  • [CUDA] Fix segfault on exit by @awni in https://github.com/ml-explore/mlx/pull/2424
  • [CUDA] No occupancy query for launch params by @awni in https://github.com/ml-explore/mlx/pull/2426
  • [CUDA] More sizes for gemv by @awni in https://github.com/ml-explore/mlx/pull/2429
  • Add more CUDA architectures for PyPi package by @awni in https://github.com/ml-explore/mlx/pull/2427
  • Use ccache in CI by @zcbenz in https://github.com/ml-explore/mlx/pull/2414
  • [CUDA] Use aligned vector in Layer Norm and RMS norm by @awni in https://github.com/ml-explore/mlx/pull/2433
  • Cuda faster softmax by @awni in https://github.com/ml-explore/mlx/pull/2435
  • Remove the kernel arg from get_launch_args by @zcbenz in https://github.com/ml-explore/mlx/pull/2437
  • Move arange to its own file by @zcbenz in https://github.com/ml-explore/mlx/pull/2438
  • Use load_vector in arg_reduce by @zcbenz in https://github.com/ml-explore/mlx/pull/2439
  • Make CI faster by @zcbenz in https://github.com/ml-explore/mlx/pull/2440
  • [CUDA] Quantized refactoring by @angeloskath in https://github.com/ml-explore/mlx/pull/2442
  • fix circular reference by @awni in https://github.com/ml-explore/mlx/pull/2443
  • [CUDA] Fix gemv regression by @awni in https://github.com/ml-explore/mlx/pull/2445
  • Fix wrong graph key when using concurrent context by @zcbenz in https://github.com/ml-explore/mlx/pull/2447
  • Fix custom metal extension by @awni in https://github.com/ml-explore/mlx/pull/2446
  • Add tests for export including control flow models and quantized models by @junpeiz in https://github.com/ml-explore/mlx/pull/2430
  • [CUDA] Backward convolution by @zcbenz in https://github.com/ml-explore/mlx/pull/2431
  • [CUDA] Save primitive inputs faster by @zcbenz in https://github.com/ml-explore/mlx/pull/2449
  • [CUDA] Vectorize generated kernels by @angeloskath in https://github.com/ml-explore/mlx/pull/2444
  • [CUDA] Matmul utils initial commit by @angeloskath in https://github.com/ml-explore/mlx/pull/2441
  • Fix arctan2 grads by @angeloskath in https://github.com/ml-explore/mlx/pull/2453
  • Use LRU cache for cuda graph by @zcbenz in https://github.com/ml-explore/mlx/pull/2448
  • Add missing algorithm header to jit_compiler.cpp for Linux builds by @zamderax in https://github.com/ml-explore/mlx/pull/2460
  • Default install cuda on linux by @awni in https://github.com/ml-explore/mlx/pull/2462
  • fix wraps compile by @awni in https://github.com/ml-explore/mlx/pull/2461
  • Feat: add USE_SYSTEM_FMT CMake option by @GaetanLepage in https://github.com/ml-explore/mlx/pull/2219
  • Use SmallVector for shapes and strides by @zcbenz in https://github.com/ml-explore/mlx/pull/2454
  • Fix install tags by @awni in https://github.com/ml-explore/mlx/pull/2464
  • Faster gather qmm sorted test by @awni in https://github.com/ml-explore/mlx/pull/2463
  • Fix cublas on h100 by @awni in https://github.com/ml-explore/mlx/pull/2466
  • revert default cuda install by @awni in https://github.com/ml-explore/mlx/pull/2465
  • feat: support destination-based tree flatten/unflatten by @LVivona in https://github.com/ml-explore/mlx/pull/2450
  • Fix typo in metal command encoder by @angeloskath in https://github.com/ml-explore/mlx/pull/2471
  • Update CUDA sdpa by @jagrit06 in https://github.com/ml-explore/mlx/pull/2468
  • version by @awni in https://github.com/ml-explore/mlx/pull/2470

New Contributors

  • @junpeiz made their first contribution in https://github.com/ml-explore/mlx/pull/2430
  • @zamderax made their first contribution in https://github.com/ml-explore/mlx/pull/2460
  • @GaetanLepage made their first contribution in https://github.com/ml-explore/mlx/pull/2219
  • @LVivona made their first contribution in https://github.com/ml-explore/mlx/pull/2450

Full Changelog: https://github.com/ml-explore/mlx/compare/v0.27.1...v0.28.0

- C++
Published by angeloskath 7 months ago

mlx - v0.27.1

Highlights

  • Initial PyPi release of the CUDA back-end.
  • CUDA back-end works well with mlx-lm:
    • Reasonably fast for LLM inference
    • Supports single-machine training and LoRA fine-tuning
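
LoRA fine-tuning, mentioned above, trains only a low-rank update on top of a frozen linear layer: y = W x + (alpha / r) · B A x, with the small A (r × in) and B (out × r) matrices as the only trainable weights. A minimal sketch in plain Python (illustrative; not mlx-lm's actual layer):

```python
def matvec(M, x):
    """Matrix-vector product over lists of lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_linear(W, A, B, x, alpha=1.0):
    """Apply a frozen weight W plus a scaled low-rank LoRA update B @ A."""
    r = len(A)                       # LoRA rank
    base = matvec(W, x)              # frozen base projection
    delta = matvec(B, matvec(A, x))  # low-rank update B (A x)
    return [b + (alpha / r) * d for b, d in zip(base, delta)]
```

Only A and B receive gradients during fine-tuning, which is what keeps the memory footprint small enough for single-machine training.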

What's Changed

  • Avoid invoking allocator::malloc when creating CUDA event by @zcbenz in https://github.com/ml-explore/mlx/pull/2232
  • Share more common code in Compiled by @zcbenz in https://github.com/ml-explore/mlx/pull/2240
  • Avoid atomic updates across CPU/GPU in CUDA event by @zcbenz in https://github.com/ml-explore/mlx/pull/2231
  • Perf regression fix by @angeloskath in https://github.com/ml-explore/mlx/pull/2243
  • Add profiler annotations in common primitives for CUDA backend by @zcbenz in https://github.com/ml-explore/mlx/pull/2244
  • Default strict mode for module update and update_modules by @awni in https://github.com/ml-explore/mlx/pull/2239
  • Fix linux linking error by @awni in https://github.com/ml-explore/mlx/pull/2248
  • Improve metal elementwise kernels by @awni in https://github.com/ml-explore/mlx/pull/2247
  • CUDA backend: matmul by @zcbenz in https://github.com/ml-explore/mlx/pull/2241
  • Change layernorms to two pass algorithm by @angeloskath in https://github.com/ml-explore/mlx/pull/2246
  • Fix unintuitive metal kernel caching by @awni in https://github.com/ml-explore/mlx/pull/2242
  • Refactor the lu test by @emmanuel-ferdman in https://github.com/ml-explore/mlx/pull/2250
  • CUDA backend: unary ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2158
  • Fix export to work with gather/scatter axis by @awni in https://github.com/ml-explore/mlx/pull/2263
  • CUDA backend: binary ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2259
  • Report number of missing parameters by @FL33TW00D in https://github.com/ml-explore/mlx/pull/2264
  • CUDA backend: sort by @zcbenz in https://github.com/ml-explore/mlx/pull/2262
  • CUDA backend: random by @zcbenz in https://github.com/ml-explore/mlx/pull/2261
  • Fix conv export by @awni in https://github.com/ml-explore/mlx/pull/2265
  • CUDA backend: copy ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2260
  • Fix building cpp benchmarks on Linux by @zcbenz in https://github.com/ml-explore/mlx/pull/2268
  • Add load_safe to the general conv loaders by @angeloskath in https://github.com/ml-explore/mlx/pull/2258
  • start cuda circle config by @awni in https://github.com/ml-explore/mlx/pull/2256
  • CUDA backend: reduce by @zcbenz in https://github.com/ml-explore/mlx/pull/2269
  • CUDA backend: argreduce by @zcbenz in https://github.com/ml-explore/mlx/pull/2270
  • CUDA backend: softmax by @zcbenz in https://github.com/ml-explore/mlx/pull/2272
  • CUDA backend: layernorm by @zcbenz in https://github.com/ml-explore/mlx/pull/2271
  • Fix warnings from latest CUDA toolkit by @zcbenz in https://github.com/ml-explore/mlx/pull/2275
  • Make sliceUpdate general by @awni in https://github.com/ml-explore/mlx/pull/2282
  • CUDA backend: compile by @zcbenz in https://github.com/ml-explore/mlx/pull/2276
  • [CUDA] RMSNorm and VJP by @awni in https://github.com/ml-explore/mlx/pull/2280
  • [CUDA] Fix build by @awni in https://github.com/ml-explore/mlx/pull/2284
  • [CUDA] ternary with select op by @awni in https://github.com/ml-explore/mlx/pull/2283
  • CUDA backend: indexing ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2277
  • Collection of refactors by @jagrit06 in https://github.com/ml-explore/mlx/pull/2274
  • Fix complex power and print by @awni in https://github.com/ml-explore/mlx/pull/2286
  • fix cuda jit by @awni in https://github.com/ml-explore/mlx/pull/2287
  • Fix cuda gemm for bf16 by @awni in https://github.com/ml-explore/mlx/pull/2288
  • Fix cuda arg reduce by @awni in https://github.com/ml-explore/mlx/pull/2291
  • RoPE for CUDA by @angeloskath in https://github.com/ml-explore/mlx/pull/2293
  • Add python testing for cuda with ability to skip list of tests by @awni in https://github.com/ml-explore/mlx/pull/2295
  • [CUDA] Fix back-end bugs and enable corresponding tests by @awni in https://github.com/ml-explore/mlx/pull/2296
  • Cuda bug fixes 2 by @awni in https://github.com/ml-explore/mlx/pull/2298
  • [CUDA] Divmod, Partition, and sort fixes by @awni in https://github.com/ml-explore/mlx/pull/2302
  • [CUDA] synch properly waits for all tasks to finish and clear by @awni in https://github.com/ml-explore/mlx/pull/2303
  • Make ptx cache settable by environment variable by @angeloskath in https://github.com/ml-explore/mlx/pull/2304
  • Build CUDA release in Circle by @awni in https://github.com/ml-explore/mlx/pull/2306
  • Cuda perf tuning by @awni in https://github.com/ml-explore/mlx/pull/2307
  • Fix update_modules() when providing a subset by @angeloskath in https://github.com/ml-explore/mlx/pull/2308
  • Compile float64 functions on CPU by @awni in https://github.com/ml-explore/mlx/pull/2311
  • Fix get 2d grid dims by @angeloskath in https://github.com/ml-explore/mlx/pull/2316
  • Split broadcast so it is always fused in compile by @angeloskath in https://github.com/ml-explore/mlx/pull/2318
  • [CUDA] Fix reductions by @angeloskath in https://github.com/ml-explore/mlx/pull/2314
  • Fix module update in strict mode by @awni in https://github.com/ml-explore/mlx/pull/2321
  • MLX_SWITCH macros to templates by @angeloskath in https://github.com/ml-explore/mlx/pull/2320
  • Use fp32 for testing, add more complex ops by @awni in https://github.com/ml-explore/mlx/pull/2322
  • Patch bump by @awni in https://github.com/ml-explore/mlx/pull/2324
  • Allow parameters to be deleted from a module by @awni in https://github.com/ml-explore/mlx/pull/2325
  • Fix compilation error from integral_constant by @zcbenz in https://github.com/ml-explore/mlx/pull/2326
  • [CUDA] Switch to CUDA graphs by @awni in https://github.com/ml-explore/mlx/pull/2317
  • [CUDA] Fix graphs for older cuda by @awni in https://github.com/ml-explore/mlx/pull/2328
  • [CUDA] Add MLX_CUDA_GRAPH_CACHE_SIZE env for setting graph cache size by @zcbenz in https://github.com/ml-explore/mlx/pull/2329
  • Fix layernorm race condition by @angeloskath in https://github.com/ml-explore/mlx/pull/2340
  • Build with all cpu cores by default by @zcbenz in https://github.com/ml-explore/mlx/pull/2336
  • [CUDA] Do vectorized store/load in binary ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2330
  • Auto build linux release by @awni in https://github.com/ml-explore/mlx/pull/2341
  • MoE backward improvements by @angeloskath in https://github.com/ml-explore/mlx/pull/2335
  • Fix compilation with CUDA 11 by @zcbenz in https://github.com/ml-explore/mlx/pull/2331
  • patch bump by @awni in https://github.com/ml-explore/mlx/pull/2343
  • Align mlx::core::max op nan propagation with NumPy by @jhavukainen in https://github.com/ml-explore/mlx/pull/2339
  • Add zero for argsort vjp by @awni in https://github.com/ml-explore/mlx/pull/2345
  • [CUDA] Do vectorized store/load in contiguous elementwise ops by @zcbenz in https://github.com/ml-explore/mlx/pull/2342
  • Align mlx::core::min op nan propagation with NumPy by @jhavukainen in https://github.com/ml-explore/mlx/pull/2346
  • [CUDA] Set current device before cudaGraphLaunch by @zcbenz in https://github.com/ml-explore/mlx/pull/2351
  • [CUDA] Put version in ptx cache dir path by @zcbenz in https://github.com/ml-explore/mlx/pull/2352
  • Fix type promotion in Adam with bias correction by @angeloskath in https://github.com/ml-explore/mlx/pull/2350
  • Fix edge check in QuantizedBlockLoader for qmm_n by @angeloskath in https://github.com/ml-explore/mlx/pull/2355
  • [CUDA] Implement Scan kernel by @zcbenz in https://github.com/ml-explore/mlx/pull/2347
  • [Metal] fix copy dispatch by @awni in https://github.com/ml-explore/mlx/pull/2360
  • [CUDA] Bundle CCCL for JIT compilation by @zcbenz in https://github.com/ml-explore/mlx/pull/2357
  • [CUDA] Do not put kernels in anonymous namespace by @zcbenz in https://github.com/ml-explore/mlx/pull/2362
  • Fix imag() vjp by @angeloskath in https://github.com/ml-explore/mlx/pull/2367
  • Add Primitive::name and remove Primitive::print by @zcbenz in https://github.com/ml-explore/mlx/pull/2365
  • update linux build by @awni in https://github.com/ml-explore/mlx/pull/2370
  • [CUDA] Affine quantize by @awni in https://github.com/ml-explore/mlx/pull/2354
  • Fix flaky linux test by @awni in https://github.com/ml-explore/mlx/pull/2371
  • Install linux with mlx[cuda] and mlx[cpu] by @awni in https://github.com/ml-explore/mlx/pull/2356
  • [CUDA] Use cuda::std::complex in place of cuComplex by @zcbenz in https://github.com/ml-explore/mlx/pull/2372
  • lower memory uniform sampling by @awni in https://github.com/ml-explore/mlx/pull/2361
  • [CUDA] Fix complex reduce + nan propagation in min and max by @awni in https://github.com/ml-explore/mlx/pull/2377
  • Rename the copy util in cpu/copy.h to copy_cpu by @zcbenz in https://github.com/ml-explore/mlx/pull/2378
  • fix ring distributed test by @awni in https://github.com/ml-explore/mlx/pull/2380
  • Test with CUDA 12.2 by @awni in https://github.com/ml-explore/mlx/pull/2375
  • [CUDA] Add work per thread to compile by @angeloskath in https://github.com/ml-explore/mlx/pull/2368
  • [CUDA] Fix resource leaks in matmul and graph by @awni in https://github.com/ml-explore/mlx/pull/2383
  • [CUDA] Add more ways finding CCCL headers in JIT by @zcbenz in https://github.com/ml-explore/mlx/pull/2382
  • Add contiguous_copy_gpu util for copying array by @zcbenz in https://github.com/ml-explore/mlx/pull/2379
  • Adding support for the Muon Optimizer by @Goekdeniz-Guelmez in https://github.com/ml-explore/mlx/pull/1914
  • Patch bump by @awni in https://github.com/ml-explore/mlx/pull/2386
  • Fix release build + patch bump by @awni in https://github.com/ml-explore/mlx/pull/2387
  • Fix cuda manylinux version to match others by @awni in https://github.com/ml-explore/mlx/pull/2388
  • [CUDA] speedup handling scalars by @awni in https://github.com/ml-explore/mlx/pull/2389
  • Remove thrust iterators by @zcbenz in https://github.com/ml-explore/mlx/pull/2396
  • Add contiguous_copy_cpu util for copying array by @zcbenz in https://github.com/ml-explore/mlx/pull/2397
  • Fix including stubs in wheel by @awni in https://github.com/ml-explore/mlx/pull/2398
  • use size option in binary by @awni in https://github.com/ml-explore/mlx/pull/2399
  • [CUDA] Simplify allocator by @awni in https://github.com/ml-explore/mlx/pull/2392
  • Add cuda gemv by @awni in https://github.com/ml-explore/mlx/pull/2400
  • Fix an error in the comment for mx.dequantize by @csukuangfj in https://github.com/ml-explore/mlx/pull/2409
  • Remove unused code in Convolution::vjp by @zcbenz in https://github.com/ml-explore/mlx/pull/2408
  • [CUDA] --compress-mode requires CUDA 12.8 by @zcbenz in https://github.com/ml-explore/mlx/pull/2407
  • full row mask in sdpa consistently gives nan by @awni in https://github.com/ml-explore/mlx/pull/2406
  • Fix uv install and add dev release by @awni in https://github.com/ml-explore/mlx/pull/2411
  • [Metal] Release metal events by @awni in https://github.com/ml-explore/mlx/pull/2412
  • Test on cuda 12.2 and 12.9 by @awni in https://github.com/ml-explore/mlx/pull/2413
  • [CUDA] Initial implementation of Convolution with cuDNN by @zcbenz in https://github.com/ml-explore/mlx/pull/2385
  • [DOCS]: Fix eps placement in Adam and AdamW by @Skonor in https://github.com/ml-explore/mlx/pull/2416
  • [CUDA] Always use batched matmul by @awni in https://github.com/ml-explore/mlx/pull/2404
  • Fix qvm splitk by @awni in https://github.com/ml-explore/mlx/pull/2415
  • Update install docs and requirements by @awni in https://github.com/ml-explore/mlx/pull/2419
  • version by @awni in https://github.com/ml-explore/mlx/pull/2420

New Contributors

  • @emmanuel-ferdman made their first contribution in https://github.com/ml-explore/mlx/pull/2250
  • @FL33TW00D made their first contribution in https://github.com/ml-explore/mlx/pull/2264
  • @jhavukainen made their first contribution in https://github.com/ml-explore/mlx/pull/2339
  • @Goekdeniz-Guelmez made their first contribution in https://github.com/ml-explore/mlx/pull/1914
  • @Skonor made their first contribution in https://github.com/ml-explore/mlx/pull/2416

Full Changelog: https://github.com/ml-explore/mlx/compare/v0.26.0...v0.27.0

- C++
Published by awni 8 months ago

mlx - v0.26.5

🚀

- C++
Published by awni 8 months ago

mlx - v0.26.3

🚀

- C++
Published by awni 8 months ago

mlx - v0.26.2

🚀

- C++
Published by awni 8 months ago

mlx - v0.26.0

Highlights

  • 5-bit quantization
  • Significant progress on CUDA back-end by @zcbenz

Core

Features

  • 5-bit quants
  • Allow per-target Metal debug flags
  • Add complex eigh
  • reduce vjp for mx.all and mx.any
  • real and imag properties
  • Non-symmetric mx.linalg.eig and mx.linalg.eigh
  • convolution vmap
  • Add more complex unary ops (sqrt, square, ...)
  • Complex scan
  • Add mx.broadcast_shapes
  • Added output_padding parameter in conv_transpose
  • Add random normal distribution for complex numbers
  • Add mx.fft.fftshift and mx.fft.ifftshift helpers
  • Enable vjp for quantized scale and bias
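
The new fftshift/ifftshift helpers rotate a spectrum so the zero-frequency bin sits at the center, and back. The 1-D behavior is just a roll by n // 2, sketched here in plain Python:

```python
def fftshift(xs):
    """Rotate so the zero-frequency element moves to the center."""
    k = len(xs) // 2
    return xs[-k:] + xs[:-k] if k else list(xs)

def ifftshift(xs):
    """Inverse rotation; differs from fftshift when len(xs) is odd."""
    k = len(xs) // 2
    return xs[k:] + xs[:k]

print(fftshift([0, 1, 2, 3, 4]))  # [3, 4, 0, 1, 2]
```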

Performance

  • Optimizing Complex Matrix Multiplication using Karatsuba’s Algorithm
  • Much faster 1D conv
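
The Karatsuba-style complex matmul above rests on Gauss's identity for multiplying complex numbers with three real multiplications instead of four; applied blockwise, it trades one real matmul for a few cheap additions. The scalar version of the identity:

```python
def complex_mul_3m(a, b, c, d):
    """(a + bi) * (c + di) with 3 real multiplications (Gauss/Karatsuba)."""
    t1 = c * (a + b)
    t2 = a * (d - c)
    t3 = b * (c + d)
    return t1 - t3, t1 + t2  # (real part, imaginary part)

print(complex_mul_3m(1.0, 2.0, 3.0, 4.0))  # (-5.0, 10.0)
```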

Cuda

  • Generalize gpu backend
  • Use fallbacks in fast primitives when eval_gpu is not implemented
  • Add memory cache to CUDA backend
  • Do not check event.is_signaled() in eval_impl
  • Build for compute capability 70 instead of 75 in CUDA backend
  • CUDA backend: backbone

Bug Fixes

  • Fix out-of-bounds default value in logsumexp/softmax
  • include mlx::core::version() symbols in the mlx static library
  • Fix Nearest upsample
  • Fix large arg reduce
  • fix conv grad
  • Fix some complex vjps
  • Fix typo in row_reduce_small
  • Fix put_along_axis for empty arrays
  • Close a couple edge case bugs: hadamard and addmm on empty inputs
  • Fix fft for integer overflow with large batches
  • fix: conv_general differences between gpu, cpu
  • Fix batched vector sdpa
  • GPU Hadamard for large N
  • Improve bandwidth for elementwise ops
  • Fix compile merging
  • Fix shapeless export to throw on dim mismatch
  • Fix mx.linalg.pinv for singular matrices
  • Fixed shift operations
  • Fix integer overflow in qmm

Contributors

Thanks to some awesome contributors!

@ivanfioravanti, @awni, @angeloskath, @zcbenz, @Jckwind, @iExalt, @thesuryash, @ParamThakkar123, @djphoenix, @ita9naiwa, @hdeng-apple, @Redempt1onzzZZ, @charan-003, @skyzh, @wisefool769, @barronalex, @aturker1

- C++
Published by awni 9 months ago

mlx - v0.25.2

🚀

- C++
Published by awni 10 months ago

mlx - v0.25.1

🚀

- C++
Published by awni 11 months ago

mlx - v0.25.0

Highlights

  • Custom logsumexp for reduced memory in training (benchmark)
  • Depthwise separable convolutions
  • Batched Gather MM and Gather QMM for ~2x faster prompt processing for MoEs

Core

Performance

  • Fused vector attention supports 256 dim
  • Tune quantized matrix vector dispatch for small batches of vectors

Features

  • Move the memory API to the top-level mlx.core and enable it for the CPU-only allocator
  • Enable using MPI on all platforms (OpenMPI only)
  • Add a ring all gather for the ring distributed backend
  • Enable gemm for complex numbers
  • Fused attention supports literal "causal" mask
  • Log for complex numbers
  • Distributed all_min and all_max both for MPI and the ring backend
  • Add logcumsumexp
  • Add additive mask for fused vector attention
  • Improve the usage of the residency set
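
The new logcumsumexp computes a running log(exp(x_0) + … + exp(x_i)) without overflow by renormalizing around a running maximum. A plain-Python sketch of the stable recurrence (not mlx's kernel):

```python
import math

def logcumsumexp(xs):
    """Stable running log-sum-exp over a 1-D sequence."""
    out = []
    acc = -math.inf  # log of an empty sum
    for x in xs:
        m = max(acc, x)  # renormalize around the running max
        acc = m + math.log(math.exp(acc - m) + math.exp(x - m))
        out.append(acc)
    return out
```

Exponentiating after subtracting the running max keeps every intermediate in range even when the inputs are large.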

NN

  • Add sharded layers for model/tensor parallelism

Bugfixes

  • Fix possible allocator deadlock when using multiple streams
  • Ring backend supports 32 bit platforms and FreeBSD
  • Fix FFT bugs
  • Fix attention mask type for fused attention kernel
  • Fix fused attention numerical instability with masking
  • Add a fallback for float16 gemm
  • Fix simd sign for uint64
  • Fix issues in docs

- C++
Published by angeloskath 11 months ago

mlx - v0.24.2

πŸ› πŸš€

- C++
Published by awni 11 months ago

mlx - v0.24.1

πŸ›

- C++
Published by awni 12 months ago

mlx - v0.24.0

Highlights

  • Much faster fused attention with support for causal masking
    • Benchmarks
    • Improvements in prompt processing speed and memory use, benchmarks
    • Much faster small batch fused attention for e.g. speculative decoding, benchmarks
  • Major redesign of CPU back-end for faster CPU-GPU synchronization

Core

Performance

  • Support fused masking in scaled_dot_product_attention
  • Support transposed head/seq for fused vector scaled_dot_product_attention
  • SDPA support for small batch (over sequence) queries
  • Enabling fused attention for head dim 128
  • Redesign CPU back-end for faster cpu/gpu synch

Features

  • Allow debugging in distributed mode
  • Support mx.fast.rms_norm without scale
  • Adds nuclear norm support in mx.linalg.norm
  • Add XOR on arrays
  • Added mlx::core::version()
  • Allow non-square lu in mx.linalg.lu
  • Double for lapack ops (eigh, svd, etc)
  • Add a prepare tb ring script
  • Ring docs
  • Affine quant always in fp32

Optimizers

  • Add a multi optimizer optimizers.MultiOptimizer

Bug Fixes

  • Do not define MLX_VERSION globally
  • Reduce binary size post fast synch
  • Fix vmap for flatten
  • Fix copy for large arrays with JIT
  • Fix grad with inplace updates
  • Use same accumulation precision in gemv as gemm
  • Fix slice data size
  • Use a heap for small sizes
  • Fix donation in scan
  • Ensure linspace always contains start and stop
  • Raise an exception in the rope op if input is integer
  • Limit compile buffers by
  • fix mx.float64 type promotion
  • Fix CPU SIMD erf_inv
  • Update smooth_l1_loss in losses.

- C++
Published by jagrit06 12 months ago

mlx - v0.23.2

🚀

- C++
Published by awni about 1 year ago

mlx - v0.23.1

🐞

- C++
Published by angeloskath about 1 year ago

mlx - v0.23.0

Highlights

  • 4-bit Mistral 7B generates at 131 toks/sec out of the box on an M2 Ultra
  • More performance improvements across the board:
    • Faster small batch quantized matmuls. Speeds up speculative decoding on M1, M2
    • Faster winograd convolutions, benchmarks
    • Up to 3x faster sort, benchmarks
    • Much faster mx.put_along_axis and mx.take_along_axis, benchmarks
    • Faster unified CPU back-end with vector operations
  • Double precision (mx.float64) support on the CPU

Core

Features

  • Bitwise invert mx.bitwise_invert
  • mx.linalg.lu, mx.linalg.lu_factor, mx.linalg.solve, mx.linalg.solve_triangular
  • Support loading F8_E4M3 from safetensors
  • mx.float64 supported on the CPU
  • Matmul JVPs
  • Distributed launch helper: mlx.launch
  • Support non-square QR factorization with mx.linalg.qr
  • Support ellipsis in mx.einsum
  • Refactor and unify accelerate and common back-ends

Performance

  • Faster synchronization Fence for synchronizing CPU-GPU
  • Much faster mx.put_along_axis and mx.take_along_axis, benchmarks
  • Fast winograd convolutions, benchmarks
  • Allow dynamic ops per buffer based on dispatches and memory, benchmarks
  • Up to 3x faster sort, benchmarks
  • Faster small batch qmv, benchmarks
  • Ring distributed backend
  • Some CPU ops are much faster with the new Simd<T, N>

NN

  • Orthogonal initializer nn.init.orthogonal
  • Add dilation for conv 3d layers

Bug fixes

  • Limit grad recursion depth by not recursing through non-grad inputs
  • Fix synchronization bug for GPU stream async CPU work
  • Fix shapeless compile on ubuntu24
  • Recompile when shapeless changes
  • Fix rope fallback to not upcast
  • Fix metal sort for certain cases
  • Fix a couple of slicing bugs
  • Avoid duplicate malloc with custom kernel init
  • Fix compilation error on Windows
  • Allow Python garbage collector to break cycles on custom objects
  • Fix grad with copies
  • Loading empty list is ok when strict = false
  • Fix split vmap
  • Fixes output donation for IO ops on the GPU
  • Fix creating an array with an int64 scalar
  • Catch stream errors earlier to avoid aborts

- C++
Published by awni about 1 year ago

mlx - v0.22.1

🚀

- C++
Published by awni about 1 year ago

mlx - v0.22.0

Highlights

  • Export and import MLX functions to a file (example, bigger example)
    • Functions can be exported from Python and run in C++ and vice versa

Core

  • Add slice and slice_update which take arrays for starting locations
  • Add an example for using MLX in C++ with CMake
  • Fused attention for generation now supports boolean masking (benchmark)
  • Allow array offset for mx.fast.rope
  • Add mx.finfo
  • Allow negative strides without resorting to copying for slice and as_strided
  • Add Flatten, Unflatten and ExpandDims primitives
  • Enable the compilation of lambdas in C++
  • Add a lot more primitives for shapeless compilation (full list)
  • Fix performance regression in qvm
  • Introduce separate types for Shape and Strides and switch to int64 strides from uint64
  • Reduced copies for fused-attention kernel
  • Recompile a function when the stream changes
  • Several steps to improve the linux / x86_64 experience (#1625, #1627, #1635)
  • Several steps to improve/enable the windows experience (#1628, #1660, #1662, #1661, #1672, #1663, #1664, ...)
  • Update to newer Metal-cpp
  • Throw when exceeding the maximum number of buffers possible
  • Add mx.kron
  • mx.distributed.send now implements the identity function instead of returning an empty array
  • Better errors reporting for mx.compile on CPU and for unrecoverable errors
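
mx.kron, added above, computes the Kronecker product: for B of shape p × q, result[i·p + k][j·q + l] = A[i][j] · B[k][l]. A list-of-lists sketch of the same operation:

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of lists."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

print(kron([[1, 2]], [[0, 1]]))  # [[0, 1, 0, 2]]
```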

NN

  • Add optional bias correction in Adam/AdamW
  • Enable mixed quantization by nn.quantize
  • Remove reshapes from nn.QuantizedEmbedding

Bug fixes

  • Fix qmv/qvm bug for batch size 2-5
  • Fix some leaks and races (#1629)
  • Fix transformer postnorm in mlx.nn
  • Fix some mx.fast fallbacks
  • Fix the hashing for string constants in compile
  • Fix small sort in Metal
  • Fix memory leak of non-evaled arrays with siblings
  • Fix concatenate/slice_update vjp in edge-case where the inputs have different type

- C++
Published by angeloskath about 1 year ago

mlx - v0.21.1

🚀 🚀

- C++
Published by awni over 1 year ago

mlx - v0.21.0

Highlights

  • Support 3 and 6 bit quantization: benchmarks
  • Much faster memory-efficient attention for head dims 64 and 80: benchmarks
  • Much faster sdpa inference kernel for longer sequences: benchmarks

Core

  • contiguous op (C++ only) + primitive
  • BFS width limit to reduce memory consumption during eval
  • Fast CPU quantization
  • Faster indexing math in several kernels:
    • unary, binary, ternary, copy, compiled, reduce
  • Improve dispatch threads for a few kernels:
    • conv, gemm splitk, custom kernels
  • More buffer donation with no-ops to reduce memory use
  • Use CMAKE_OSX_DEPLOYMENT_TARGET to pick Metal version
  • Dispatch Metal bf16 type at runtime when using the JIT

NN

  • nn.AvgPool3d and nn.MaxPool3d
  • Support groups in nn.Conv2d

Bug fixes

  • Fix per-example mask + docs in sdpa
  • Fix FFT synchronization bug (use dispatch method everywhere)
  • Throw for invalid *fft{2,n} cases
  • Fix OOB access in qmv
  • Fix donation in sdpa to reduce memory use
  • Allocate safetensors header on the heap to avoid stack overflow
  • Fix sibling memory leak
  • Fix view segfault for scalars input
  • Fix concatenate vmap

- C++
Published by awni over 1 year ago

mlx - v0.20.0

Highlights

  • Even faster GEMMs
  • BFS graph optimizations
    • Over 120 tokens/sec with Mistral 7B!
  • Fast batched QMV/QVM for KV quantized attention benchmarks

Core

  • New Features
    • mx.linalg.eigh and mx.linalg.eigvalsh
    • mx.nn.init.sparse
    • 64bit type support for mx.cumprod, mx.cumsum
  • Performance
    • Faster long column reductions
    • Wired buffer support for large models
    • Better Winograd dispatch condition for convs
    • Faster scatter/gather
    • Faster mx.random.uniform and mx.random.bernoulli
    • Better threadgroup sizes for large arrays
  • Misc
    • Added Python 3.13 to CI
    • C++20 compatibility

Bugfixes

  • Fix command encoder synchronization
  • Fix mx.vmap with gather and constant outputs
  • Fix fused sdpa with differing key and value strides
  • Support mx.array.__format__ with spec
  • Fix multi output array leak
  • Fix RMSNorm weight mismatch error

- C++
Published by barronalex over 1 year ago

mlx - v0.19.3

🚀

- C++
Published by awni over 1 year ago

mlx - v0.19.2

🚀🚀

- C++
Published by angeloskath over 1 year ago

mlx - v0.19.1

🚀

- C++
Published by awni over 1 year ago

mlx - v0.19.0

Highlights

  • Speed improvements
    • Up to 6x faster CPU indexing benchmarks
    • Faster Metal compiled kernels for strided inputs benchmarks
    • Faster generation with fused-attention kernel benchmarks
  • Gradient for grouped convolutions
  • Due to Python 3.8's end-of-life we no longer test with it on CI

Core

  • New features
    • Gradient for grouped convolutions
    • mx.roll
    • mx.random.permutation
    • mx.real and mx.imag
  • Performance
    • Up to 6x faster CPU indexing benchmarks
    • Faster CPU sort benchmarks
    • Faster Metal compiled kernels for strided inputs benchmarks
    • Faster generation with fused-attention kernel benchmarks
    • Bulk eval in safetensors to avoid unnecessary serialization of work
  • Misc
    • Bump to nanobind 2.2
    • Move testing to python 3.9 due to 3.8's end-of-life
    • Make the GPU device more thread safe
    • Fix the submodule stubs for better IDE support
    • CI generated docs that will never be stale

NN

  • Add support for grouped 1D convolutions to the nn API
  • Add some missing type annotations

Bugfixes

  • Fix and speedup row-reduce with few rows
  • Fix normalization primitive segfault with unexpected inputs
  • Fix complex power on the GPU
  • Fix freeing deep unevaluated graphs details
  • Fix race with array::is_available
  • Consistently handle softmax with all -inf inputs
  • Fix streams in affine quantize
  • Fix CPU compile preamble for some linux machines
  • Stream safety in CPU compilation
  • Fix CPU compile segfault at program shutdown

- C++
Published by angeloskath over 1 year ago

mlx - v0.18.1

🚀

- C++
Published by awni over 1 year ago

mlx - v0.18.0

Highlights

  • Speed improvements:
    • Up to 2x faster I/O: benchmarks.
    • Faster transposed copies, unary, and binary ops
    • CPU benchmarks here.
    • GPU benchmarks here and here.
  • Transposed convolutions
  • Improvements to mx.distributed (send/recv/average_gradients)

Core

  • New features:

    • mx.conv_transpose{1,2,3}d
    • Allow mx.take to work with integer index
    • Add std as method on mx.array
    • mx.put_along_axis
    • mx.cross_product
    • int() and float() work on scalar mx.array
    • Add optional headers to mx.fast.metal_kernel
    • mx.distributed.send and mx.distributed.recv
    • mx.linalg.pinv
  • Performance

    • Up to 2x faster I/O
    • Much faster CPU convolutions
    • Faster general n-dimensional copies, unary, and binary ops for both CPU and GPU
    • Put reduction ops in default stream with async for faster comms
    • Overhead reductions in mx.fast.metal_kernel
    • Improve donation heuristics to reduce memory use
  • Misc

    • Support Xcode 16

NN

  • Faster RNN layers
  • nn.ConvTranspose{1,2,3}d
  • mlx.nn.average_gradients data parallel helper for distributed training

Bug Fixes

  • Fix boolean all reduce bug
  • Fix extension metal library finding
  • Fix ternary for large arrays
  • Make eval just wait if all arrays are scheduled
  • Fix CPU softmax by removing redundant coefficient in neon_fast_exp
  • Fix JIT reductions
  • Fix overflow in quantize/dequantize
  • Fix compile with byte sized constants
  • Fix copy in the sort primitive
  • Fix reduce edge case
  • Fix slice data size
  • Throw for certain cases of non captured inputs in compile
  • Fix copying scalars by adding fill_gpu
  • Fix bug in module attribute set, reset, set
  • Ensure io/comm streams are active before eval
  • Fix mx.clip
  • Override class function in Repr so mx.array is not confused with array.array
  • Avoid using find_library to make install truly portable
  • Remove fmt dependencies from MLX install
  • Fix for partition VJP
  • Avoid command buffer timeout for IO on large arrays

- C++
Published by awni over 1 year ago

mlx - v0.17.3

🚀

- C++
Published by angeloskath over 1 year ago

mlx - v0.17.1

πŸ›

- C++
Published by angeloskath over 1 year ago

mlx - v0.17.0

Highlights

  • mx.einsum: PR
  • Big speedups in reductions: benchmarks
  • 2x faster model loading: PR
  • mx.fast.metal_kernel for custom GPU kernels: docs

Core

  • Faster program exits
  • Laplace sampling
  • mx.nan_to_num
  • nn.tanh gelu approximation
  • Fused GPU quantization ops
  • Faster group norm
  • bf16 winograd conv
  • vmap support for mx.scatter
  • mx.pad "edge" padding
  • More numerically stable mx.var
  • mx.linalg.cholesky_inv/mx.linalg.tri_inv
  • mx.isfinite
  • Complex mx.sign now mirrors NumPy 2.0 behaviour
  • More flexible mx.fast.rope
  • Update to nanobind 2.1

Bug Fixes

  • gguf zero initialization
  • expm1f overflow handling
  • bfloat16 hadamard
  • large arrays for various ops
  • rope fix
  • bf16 array creation
  • preserve dtype in nn.Dropout
  • nn.TransformerEncoder with norm_first=False
  • excess copies from contiguity bug

- C++
Published by barronalex over 1 year ago

mlx - v0.16.3

🚀

- C++
Published by awni over 1 year ago

mlx - v0.16.2

🚀🚀

- C++
Published by angeloskath over 1 year ago

mlx - 0.16.1

🚀

- C++
Published by awni over 1 year ago

mlx - v0.16.0

Highlights

  • @mx.custom_function for custom vjp/jvp/vmap transforms
  • Up to 2x faster Metal GEMV and fast masked GEMV
  • Fast hadamard_transform

Core

  • Metal 3.2 support
  • Reduced CPU binary size
  • Added quantized GPU ops to JIT
  • Faster GPU compilation
  • Added grads for bitwise ops + indexing

Bug Fixes

  • 1D scatter bug
  • Strided sort bug
  • Reshape copy bug
  • Seg fault in mx.compile
  • Donation condition in compilation
  • Compilation of accelerate on iOS

- C++
Published by barronalex over 1 year ago

mlx - v0.15.2

🚀

- C++
Published by awni over 1 year ago

mlx - v0.15.1

🚀

- C++
Published by awni over 1 year ago

mlx - v0.15.0

Highlights

  • Fast Metal GPU FFTs
  • mx.distributed with all_sum and all_gather

Core

  • Added the __dlpack_device__ method for dlpack support
  • Fast GPU FFTs benchmarks
  • Add docs for mx.distributed
  • Add mx.view op

NN

  • softmin, hardshrink, and hardtanh activations

Bugfixes

  • Fix broadcast bug in bitwise ops
  • Allow more buffers for JIT compilation
  • Fix matvec vector stride bug
  • Fix multi-block sort stride management
  • Stable cumprod grad at 0
  • Bug fix for a race condition in scan

- C++
Published by awni almost 2 years ago

mlx - v0.14.1

🚀

- C++
Published by awni almost 2 years ago

mlx - v0.14.0

Highlights

  • Small-size build that JIT-compiles kernels and omits the CPU backend, resulting in a binary under 4 MB
    • Series of PRs 1, 2, 3, 4, 5
  • mx.gather_qmm quantized equivalent for mx.gather_mm which speeds up MoE inference by ~2x
  • Grouped 2D convolutions

Core

  • mx.conjugate
  • mx.conv3d and nn.Conv3d
  • List based indexing
  • Started mx.distributed which uses MPI (if installed) for communication across machines
    • mx.distributed.init
    • mx.distributed.all_gather
    • mx.distributed.all_reduce_sum
  • Support conversion to and from dlpack
  • mx.linalg.cholesky on CPU
  • mx.quantized_matmul sped up for vector-matrix products
  • mx.trace
  • mx.block_masked_mm now supports floating point masks!

Fixes

  • Error messaging in eval
  • Add some missing docs
  • Scatter index bug
  • The extensions example now compiles and runs
  • CPU copy bug with many dimensions

- C++
Published by angeloskath almost 2 years ago

mlx - v0.13.1

🚀

- C++
Published by awni almost 2 years ago

mlx - v0.13.0

Highlights

  • Block sparse matrix multiply speeds up MoEs by >2x
  • Improved quantization algorithm should work well for all networks
  • Improved gpu command submission speeds up training and inference

Core

  • Bitwise ops added:
    • mx.bitwise_[or|and|xor], mx.[left|right]_shift, operator overloads
  • Groups added to Conv1d
  • Added mx.metal.device_info to get better informed memory limits
  • Added resettable memory stats
  • mlx.optimizers.clip_grad_norm and mlx.utils.tree_reduce added
  • Add mx.arctan2
  • Unary ops now accept array-like inputs, i.e. one can do mx.sqrt(2)

Bugfixes

  • Fixed shape for slice update
  • Bugfix in quantize that used slightly wrong scales/biases
  • Fixed memory leak for multi-output primitives encountered with gradient checkpointing
  • Fixed conversion from other frameworks for all datatypes
  • Fixed index overflow for matmul with large batch size
  • Fixed initialization ordering that occasionally caused segfaults

- C++
Published by angeloskath almost 2 years ago

mlx - v0.12.2

- C++
Published by awni almost 2 years ago

mlx - v0.12.0

Highlights

  • Faster quantized matmul

Core

  • mx.synchronize to wait for computation dispatched with mx.async_eval
  • mx.radians and mx.degrees
  • mx.metal.clear_cache to return to the OS the memory held by MLX as a cache for future allocations
  • Change quantization to always represent 0 exactly (relevant issue)

Bugfixes

  • Fixed quantization of a block with all 0s that produced NaNs
  • Fixed the len field in the buffer protocol implementation

- C++
Published by angeloskath almost 2 years ago

mlx - v0.11.0

Core

  • mx.block_masked_mm for block-level sparse matrix multiplication
  • Shared events for synchronization and asynchronous evaluation

NN

  • nn.QuantizedEmbedding layer
  • nn.quantize for quantizing modules
  • gelu_approx uses tanh for consistency with PyTorch

- C++
Published by awni almost 2 years ago

mlx - v0.10.0

Highlights

  • Improvements for LLM generation
    • Reshapeless quant matmul/matvec
    • mx.async_eval
    • Async command encoding

Core

  • Slightly faster reshapeless quantized gemms
  • Option for precise softmax
  • mx.metal.start_capture and mx.metal.stop_capture for GPU debug/profile
  • mx.expm1
  • mx.std
  • mx.meshgrid
  • CPU only mx.random.multivariate_normal
  • mx.cumsum (and other scans) for bfloat
  • Async command encoder with explicit barriers / dependency management

NN

  • nn.Upsample supports bicubic interpolation

Misc

  • Updated MLX Extension to work with nanobind

Bugfixes

  • Fix buffer donation in softmax and fast ops
  • Bug in layer norm vjp
  • Bug initializing from lists with scalar
  • Bug in indexing
  • CPU compilation bug
  • Multi-output compilation bug
  • Fix stack overflow issues in eval and array destruction

- C++
Published by awni almost 2 years ago

mlx - v0.9.0

Highlights:

  • Fast partial RoPE (used by Phi-2)
  • Fast gradients for RoPE, RMSNorm, and LayerNorm

Core

  • More overhead reductions
  • Partial fast RoPE (fast Phi-2)
  • Better buffer donation for copy
  • Type hierarchy and issubdtype
  • Fast VJPs for RoPE, RMSNorm, and LayerNorm

NN

  • Module.set_dtype
  • Chaining in nn.Module (model.freeze().update(…))

Bugfixes

  • Fix set item bugs
  • Fix scatter vjp
  • Check shape integer overflow on array construction
  • Fix bug with module attributes
  • Fix two bugs for odd shaped QMV
  • Fix GPU sort for large sizes
  • Fix bug in negative padding for convolutions
  • Fix bug in multi-stream race condition for graph evaluation
  • Fix random normal generation for half precision

- C++
Published by awni almost 2 years ago

mlx - v0.8.0

Highlights

Core

Optimizers

  • Set minimum value in cosine decay scheduler

Bugfixes

  • Fix bug in multi-dimensional reduction

- C++
Published by awni almost 2 years ago

mlx -

Highlights

  • Perf improvements for attention ops:
    • No copy broadcast matmul (benchmarks)
    • Fewer copies in reshape

Core

  • Faster broadcast + gemm
  • mx.linalg.svd (CPU only)
  • Fewer copies in reshape
  • Faster small reductions

NN

  • nn.RNN, nn.LSTM, nn.GRU

Bugfixes

  • Fix bug in depth traversal ordering
  • Fix two edge case bugs in compilation
  • Fix bug with modules with dictionaries of weights
  • Fix bug with scatter which broke MoE training
  • Fix bug with compilation kernel collision

- C++
Published by awni almost 2 years ago

mlx - v0.6.0

Highlights:

  • Faster quantized matrix-vector multiplies
  • mx.fast.scaled_dot_product_attention fused op

Core

  • Memory allocation API improvements
  • Faster GPU reductions for smaller sizes (between 2 and 7x)
  • mx.fast.scaled_dot_product_attention fused op
  • Faster quantized matrix-vector multiplications
  • Pickle support for mx.array

NN

  • Dilation on convolution layers

Bugfixes

  • Fix mx.topk
  • Fix reshape for zero sizes

- C++
Published by angeloskath about 2 years ago

mlx - v0.5.0

Highlights:

  • Faster convolutions.
    • Up to 14x faster for some common sizes.
    • See benchmarks

Core

  • mx.where properly handles inf
  • Faster and more general convolutions
    • Input and kernel dilation
    • Asymmetric padding
    • Support for cross-correlation and convolution
  • atleast_{1,2,3}d accept any number of arrays

NN

  • nn.Upsample layer
    • Supports nearest neighbor and linear interpolation
    • Any number of dimensions

Optimizers

  • Linear schedule and schedule joiner:
    • Use for e.g. linear warmup + cosine decay

Bugfixes

  • arange throws on inf inputs
  • Fix CMake build with MLX
  • Fix logsumexp inf edge case
  • Fix grad of power w.r.t. to exponent edge case
  • Fix compile with inf constants
  • Fix temporary bug in convolution

- C++
Published by jagrit06 about 2 years ago

mlx - v0.4.0

Highlights:

  • Partial shapeless compilation
    • Default shapeless compilation for all activations
    • Can be more than 5x faster than uncompiled versions
  • CPU kernel fusion

Core

  • CPU compilation
  • Shapeless compilation for some cases
    • mx.compile(function, shapeless=True)
  • Up to 10x faster scatter: benchmarks
  • mx.atleast_1d, mx.atleast_2d, mx.atleast_3d

Bugfixes

  • Bug with tolist with bfloat16 and float16
  • Bug with argmax on M3

- C++
Published by awni about 2 years ago

mlx - v0.3.0

- C++
Published by awni about 2 years ago

mlx - v0.2.0

Highlights:

  • mx.compile makes stuff go fast
    • Some functions are up to 10x faster (benchmarks)
    • Training models anywhere from 10% to twice as fast (benchmarks)
    • Simple syntax for compiling full training steps

Core

  • mx.compile function transformation
  • Find devices properly for iOS
  • Up to 10x faster GPU gather
  • __abs__ overload for abs on arrays
  • loc and scale parameters for mx.random.normal

NN

  • Margin ranking loss
  • BCE loss with weights

Bugfixes

  • Fix for broken eval during function transformations
  • Fix mx.var to give inf with ddof >= nelem
  • Fix loading empty modules in nn.Sequential

- C++
Published by awni about 2 years ago

mlx - v0.1.0

Highlights

  • Memory use improvements:
    • Gradient checkpointing for training with mx.checkpoint
    • Better graph execution order
    • Buffer donation

Core

  • Gradient checkpointing with mx.checkpoint
  • CPU only QR factorization mx.linalg.qr
  • Release Python GIL during mx.eval
  • Depth-based graph execution order
  • Lazy loading arrays from files
  • Buffer donation for reduced memory use
  • mx.diag, mx.diagonal
  • Breaking: array.shape is a Python tuple
  • GPU support for int64 and uint64 reductions
  • vmap over reductions and arg reduction:
    • sum, prod, max, min, all, any
    • argmax, argmin

NN

  • Softshrink activation

Bugfixes

  • Comparisons with inf work, and fix mx.isinf
  • Bug fix with RoPE cache
  • Handle empty Matmul on the CPU
  • Negative shape checking for mx.full
  • Correctly propagate NaN in some binary ops
    • mx.logaddexp, mx.maximum, mx.minimum
  • Fix > 4D non-contiguous binary ops
  • Fix mx.log1p with inf input
  • Fix SGD to apply weight decay even with 0 momentum

- C++
Published by angeloskath about 2 years ago

mlx - v0.0.11

Highlights:

  • GGUF improvements:
    • Native quantizations Q4_0, Q4_1, and Q8_0
    • Metadata

Core

  • Support for reading and writing GGUF metadata
  • Native GGUF quantization (Q4_0, Q4_1, and Q8_0)
  • Quantize with group size of 32 (2x32, 4x32, and 8x32)

NN

  • Module.save_weights supports safetensors
  • nn.init package with several commonly used neural network initializers
  • Binary cross entropy and cross entropy losses can take probabilities as targets
  • Adafactor in mlx.optimizers

Bugfixes

  • Fix isinf and friends for integer types
  • Fix array creation from lists of Python ints to int64, uint, and float32
  • Fix power VJP for 0 inputs
  • Fix out of bounds inf reads in gemv
  • mx.arange crashes on NaN inputs

- C++
Published by angeloskath about 2 years ago

mlx - v0.0.10

Highlights:

  • Faster matmul: up to 2.5x faster for certain sizes, benchmarks
  • Fused matmul + addition (for faster linear layers)

Core

  • Quantization supports sizes other than multiples of 32
  • Faster GEMM (matmul)
  • AddMM primitive (fused addition and matmul)
  • mx.isnan, mx.isinf, mx.isposinf, mx.isneginf
  • mx.tile
  • VJPs for scatter_min and scatter_max
  • Multi output split primitive

NN

  • Losses: Gaussian negative log-likelihood

Misc

  • Performance enhancements for graph evaluation with lots of outputs
  • Default PRNG seed is based on current time instead of 0
  • Primitive VJP takes output as input. Reduces redundant work without need for simplification
  • Format boolean printing in Python style when in Python

Bugfixes

  • Scatter < 32 bit precision and integer overflow fix
  • Overflow with mx.eye
  • Report Metal out of memory issues instead of silent failure
  • Change mx.round to follow NumPy which rounds to even

- C++
Published by awni about 2 years ago

mlx - v0.0.9

Highlights:

  • Initial (and experimental) GGUF support
  • Support Python buffer protocol (easy interoperability with NumPy, JAX, TensorFlow, PyTorch, etc.)
  • at[] syntax for scatter-style operations: x.at[idx].add(y) (also min, max, prod, etc.)

Core

  • Array creation from other mx.array’s (mx.array([x, y]))
  • Complete support for Python buffer protocol
  • mx.inner, mx.outer
  • mx.logical_and, mx.logical_or, and operator overloads
  • Array at syntax for scatter ops
  • Better support for in-place operations (+=, *=, -=, ...)
  • VJP for scatter and scatter add
  • Constants (mx.pi, mx.inf, mx.newaxis, …)

NN

  • GLU activation
  • cosine_similarity loss
  • Cache for RoPE and ALiBi

Bugfixes / Misc

  • Fix data type with tri
  • Fix saving non-contiguous arrays
  • Fix graph retention for in-place state, and remove retain_graph
  • Multi-output primitives
  • Better support for loading devices

- C++
Published by awni about 2 years ago

mlx - v0.0.7

Core

  • Support for loading and saving HuggingFace's safetensor format
  • Transposed quantization matmul kernels
  • mlx.core.linalg sub-package with mx.linalg.norm (Frobenius, infinity, p-norms)
  • tensordot and repeat

NN

  • Layers
    • Bilinear, Identity, InstanceNorm
    • Dropout2D, Dropout3D
    • more customizable Transformer (pre/post norm, dropout)
    • More activations: SoftSign, Softmax, HardSwish, LogSoftmax
    • Configurable scale in RoPE positional encodings
  • Losses: hinge, huber, log_cosh

Misc

  • Faster GPU reductions for certain cases
  • Change to memory allocation to allow swapping

- C++
Published by awni about 2 years ago

mlx - v0.0.6

Core

  • quantize, dequantize, quantized_matmul
  • moveaxis, swapaxes, flatten
  • stack
  • floor, ceil, clip
  • tril, triu, tri
  • linspace

Optimizers

  • RMSProp, Adamax, Adadelta, Lion

NN

  • Layers: QuantizedLinear, ALiBi positional encodings
  • Losses: Label smoothing, Smooth L1 loss, Triplet loss

Misc

  • Bug fixes

- C++
Published by angeloskath about 2 years ago

mlx - v0.0.5

  • Core ops remainder, eye, identity
  • Additional functionality in mlx.nn
    • Losses: binary cross entropy, kl divergence, mse, l1
    • Activations: PReLU, Mish, and several others
  • More optimizers: AdamW, Nesterov momentum, Adagrad
  • Bug fixes

- C++
Published by awni about 2 years ago