Recent Releases of burn

burn - v0.18.0

Summary

This release marks a significant step forward in performance, reliability, and optimization, delivering a more robust and efficient system for our users. We've expanded our CI testing suite to catch multi-threading, lazy-evaluation, and async-execution issues, ensuring robust performance across an increasing number of supported platforms.

Matrix Multiplication Improvements

Optimized matrix multiplication kernels with specialized implementations for:

  • Matrix-vector (mat@vec)
  • Vector-matrix (vec@mat)
  • Inner product
  • Outer product

We also made the matrix multiplication kernel generation engine more flexible, surpassing traditional GEMM (General Matrix Multiply) approaches.

For more details, including performance benchmarks, check out our state-of-the-art multiplatform matrix multiplication post.

Fusion Enhancements

  • Improved reliability and performance of Burn Fusion through advanced optimizations.
  • Added support for basic dead code elimination.
  • Introduced a new search engine that optimally reorders operations to maximize optimization opportunities, improving resilience to tensor operation ordering.

Multi-Threading and Memory Management

  • Resolved critical multi-threading issues by adopting a new approach to support multiple concurrent streams.
  • Burn Fusion's lazy evaluation of registered operations across concurrent streams now places greater demands on memory management. To address this:
    • Implemented a robust memory leak test in our CI pipeline to verify the runtime's internal state, ensuring all handles and concurrent streams are properly cleaned up in all test cases.
    • Fixed bugs related to premature memory deallocation, enhancing memory management stability.

CubeCL Config

By default, CubeCL loads its configuration from a TOML file (cubecl.toml or CubeCL.toml) located in your current directory or any parent directory. If no configuration file is found, CubeCL falls back to sensible defaults.

A typical cubecl.toml file might look like this:

```toml
[profiling]
logger = { level = "basic", stdout = true }

[autotune]
level = "balanced"
logger = { level = "minimal", stdout = true }

[compilation]
logger = { level = "basic", file = "cubecl.log", append = true }
```

Each section configures a different aspect of CubeCL:

  • profiling: Controls performance profiling and logging.
  • autotune: Configures the autotuning system, which benchmarks and selects optimal kernel parameters.
  • compilation: Manages kernel compilation logging and cache.

For more info, check out the CubeCL book.

As with previous releases, this version includes various bug fixes, many internal optimizations, and backend upgrades that reinforce the framework's performance and flexibility across platforms.

Changelog

Breaking: the default stride(s) for pooling modules now match the kernel size instead of defaulting to strides of 1. This will affect output shapes if strides were not explicitly set (see the output-shape sketch after the config diffs below). To preserve the old behavior, set the stride explicitly:

MaxPool2dConfig
```diff
let pool = MaxPool2dConfig::new(kernel_size)
+    .with_strides([1, 1])
     .with_padding(PaddingConfig2d::Same)
     .init();
```
MaxPool1dConfig
```diff
let pool = MaxPool1dConfig::new(kernel_size)
+    .with_stride(1)
     .with_padding(PaddingConfig1d::Same)
     .init();
```
AvgPool2dConfig
```diff
let pool = AvgPool2dConfig::new(kernel_size)
+    .with_strides([1, 1])
     .with_padding(PaddingConfig2d::Same)
     .init();
```
AvgPool1dConfig
```diff
let pool = AvgPool1dConfig::new(kernel_size)
+    .with_stride(1)
     .with_padding(PaddingConfig1d::Same)
     .init();
```
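
To gauge the impact on output shapes, here is a minimal sketch of the standard pooling output-size formula (the pool_out_len helper is illustrative, not part of Burn's API):

```rust
// Output length of a 1D pooling window:
// floor((len + 2 * pad - kernel) / stride) + 1
fn pool_out_len(len: usize, kernel: usize, stride: usize, pad: usize) -> usize {
    (len + 2 * pad - kernel) / stride + 1
}

fn main() {
    let (len, kernel) = (8, 2);
    // Previous default (stride = 1): overlapping windows, output length 7.
    assert_eq!(pool_out_len(len, kernel, 1, 0), 7);
    // New default (stride = kernel size): non-overlapping windows, output length 4.
    assert_eq!(pool_out_len(len, kernel, kernel, 0), 4);
}
```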

Module & Tensor

  • Add tensor grid::meshgrid (#3107 #3191) @crutcher
  • Add scalar tensor operations (#3127) @ArthurBrussee
  • Orthogonal initialization (#3109) @dymat
  • Support importing safetensors format (#2721) @wandbrandon @antimora
  • Add burn::linalg norms (#3131) @crutcher
  • Extract Linear.forward to nn::functional::linear (#3147) @crutcher
  • Base impl of matmul for Int tensor (#3201) @crutcher
  • (perf) generate_mask functions optimizations (#3203) @tafia
  • Add CosineEmbeddingLoss module and cosine_similarity function (#3207) @antimora
  • Tensor::slice_fill() (#3221 #3223) @crutcher
  • Base impl of tensor.slice_dim(dim, range) (#3235) @crutcher
  • Support shifting pre-computed RoPE values (#3275) @laggui
  • Improve RoPE partial shift case (#3290) @laggui
  • Add tensor.roll() and improve AsIndex (renamed IndexConversion) (#3281) @crutcher
  • [Breaking] Update pooling default strides to match kernel size (#3338) @lucianyao
  • Add is_finite tensor element wise op and fix is_close/all_close inf (#3341) @jonboh

Backends

  • [Perf] Interpolate optimizations (#3077) @wingertge
  • [Perf] Slice assign (#3069) @wingertge
  • Add multi stage conv (#3105) @wingertge
  • [Perf] Convolution migration to NHWC (#3090) @wingertge
  • Merge different convolution dimensional kernels (#3115) @wingertge
  • Support reduce mixed precision accumulation w/ fusion (#3132) @nathanielsimard
  • Update remote backend (#3175) @Cielbird
  • Feat/autotune optional (#3188) @nathanielsimard
  • cubecl unit matmul (#3214) @louisfd
  • Update CubeCL for client based profiling (#3222) @ArthurBrussee
  • Update cubecl unit matmul double buffered (#3233) @louisfd
  • Burn-remote to_device function (#3189) @Cielbird
  • Add Drop operation for fusion (#3263) @nathanielsimard
  • Lazy tensor downloading in burn-remote (#3276) @Cielbird
  • Improve specialized matmul (#3304) @louisfd
  • Add autotune priority (#3347 #3378) @nathanielsimard
  • Fix local tuner deadlock (#3384) @nathanielsimard
  • Fix fusion wasm unsafe input (#3385 #3386) @nathanielsimard

Bug Fixes

  • Fix WASM deadlock by really properly not capturing locks (#3123) @ArthurBrussee
  • Fix burn-cubecl with autotune disabled (#3141) @wingertge
  • Fix fusion multiple reshapes (#3220) @nathanielsimard
  • Fix/fusion multiple streams (#3297) @nathanielsimard
  • Fix gather broadcasted indices in kernel impl and fusion (#3337) @laggui
  • Fix rand interval (#3321) @laggui
  • Restrict binary op lhs/rhs alias (#3349) @laggui
  • Fix sum fallback when atomic add is not supported (#3369) @laggui

Documentation & Examples

  • Update pytorch-model.md with a new troubleshooting help (#3081) @antimora
  • Contributor example instructions (#3153) @AshAnand34
  • Update README.md with DeepWiki badge (#3192) @antimora
  • Add recursion_limit macro to getting started examples code (#3238) @Marc-AnthonyG
  • KaTeX for Mathematical expressions in docstrings (#3278) @BhavyeMathur
  • Add Metal backend support to custom-image-dataset (#3335 #3354) @TsaoLun
  • Add link to license in README badge (#3356) @Olexandr88

Fixes

  • Fix typo in Burn Book (#3113) @danny-burrows
  • fix typos (#3186) @omahs
  • Fix Typos in Documentation Comments (#3280) @leopardracer
  • Fix typo in code documentation for BurnGraph codegen (#3286) @kilavvy
  • Fix error messages from tensor checks for flatten (#3319) @NoVegetable
  • Fix broken link to burn-tch (#3365) @dbdr
  • Update documentation description for nonzero and nonzero_async (#3368) @catch-twenty-two

ONNX Support

  • ONNX Import: switch to rank inferencing, rename shape to static_shape, decouple tensor shape info (#3037) @antimora
  • Restrict ONNX opset to 16 and up (#3051) @antimora
  • Allow Shape input type for Slice operation (#3092) @antimora
  • Support onnx and, or & xor nodes (#3173) @tye-singwa
  • Add support ONNX instance norm (#3177) @tye-singwa
  • Onnx ceil & round (#3225) @tye-singwa
  • Add support onnx group norm (#3245) @tye-singwa
  • Add onnx SpaceToDepth / DepthToSpace (#3277) @tye-singwa
  • Fix onnx topological sort check (#3284) @tye-singwa
  • Add onnx ArgMin node (#3285) @tye-singwa
  • Add support onnx size (#3301) @tye-singwa
  • Support flexible backend selection for import tests (#3372 #3380) @lucianyao
  • Fix ONNX node name sanitization and allow ai.onnx.ml domain (#3371) @antimora

Enhancements

  • Replace some powf->powi (#3152) @ArthurBrussee
  • Improve fusion compilation speed (#3155) @nathanielsimard
  • Perf/remove repeat dim (#3183) @nathanielsimard
  • Perf: Fusion search for composed optimization (#3258) @nathanielsimard
  • Improve matmul selector (#3307 #3343 #3350 #3376) @nathanielsimard

Refactoring

  • Refactor CubeCL slices (#3104) @nathanielsimard
  • CubeCL init refactor (#3128) @nathanielsimard
  • Refactor narrow, chunk and split (#3137) @laggui
  • Refactor quantization scheme (#3042) @maxtremblay
  • Migrated prng (random) to CubeCL (#3165 #3170) @Cielbird
  • Break down test_onnx.rs into test subdirectories (#3144) @antimora
  • Refactor: Move op_configuration.rs from burn-import to onnx-ir (#3126) @antimora
  • Fix relative cmp + debug tools (#3197) @nathanielsimard
  • Refactor cubecl line size matmul (#3219) @louisfd
  • Absolute tolerance is too tight for strict/balanced/permissive (#3242) @laggui
  • Fix clippy rust 1.88 and cargo run checks usage (#3325 #3320) @laggui
  • Remove hip os cfg flags (#3336) @laggui
  • Update cubecl matmul refactor / docs (#3366) @louisfd

Miscellaneous

  • Fix conv2d test tolerance & disable crates cache on stable linux-std runner (#3114) @laggui
  • Replace run-checks scripts with command alias (#3118) @laggui
  • Relax tolerance transformer autoregressive test (ndarray failure) (#3143) @crutcher
  • Add cubecl.toml config (#3150) @nathanielsimard
  • Use CUBECL_DEBUG_OPTION=profile macos ci (#3164) @laggui
  • Update cubecl: sync_cube (#3163) @louisfd
  • Fix autotune recursive (#3161) @nathanielsimard
  • Bump zip dependency (#3199) @swfsql
  • Import derive_new::new for safetensors feat (#3205) @swfsql
  • Add CUDA, Vulkan and WGPU on-demand self-hosted runners (#3190 #3215 #3334 #3348 #3351 #3352) @syl20bnr
  • Fix: size_of import in quantization tests (#3195) @louisfd
  • burn-dataset: Catch import.py unsuccessful exits (#3236) @drozdziak1
  • Adding image dimensions to ImageDatasetItem (#3251) @catch-twenty-two
  • burn-dataset: Make virtualenv optional when running importer.py (#3255) @drozdziak1
  • Fix cubecl std usage (#3306) @laggui
  • Fix tui legend label placement (#3327) @BenFradet
  • Move blanket Adaptor impl to metrics base (#3346) @dbdr
  • Make metric order consistent in summaries (#3353) @dbdr
  • Fix cubecl normal_respects_68_95_99_rule (#3377) @laggui
  • Bump deps (#3367) @ArthurBrussee
  • Fix fusion rollback, disable autotune checks and other CI issues (#3362) @laggui

Published by laggui 7 months ago

burn - v0.17.1

Bug Fixes & Improvements

  • Downgrade to zip 2.4.2 (fixes #3224) @laggui
  • Fix non contiguous bug with comparison op (#3241) @nathanielsimard
  • Fix/reduce fusion (#3172) @nathanielsimard
  • Fix: fusion multi-block scalar index sharing (#3167) @nathanielsimard
  • Fix to NdArray intmaxdim bug (#3140) @crutcher
  • Make is_contiguous check common (#3083) @laggui
  • Fix clamp min/max line size > 1 (#3078) @laggui
  • Fix vectorization problem with fusion on reshaped not contiguous tensors (#3075) @nathanielsimard

Published by laggui 9 months ago

burn - v0.17.0

Summary

This release brings major upgrades in performance and platform compatibility (most notably, a new Metal backend via WGPU passthrough). CubeCL now powers backends for Cuda, Metal, Rocm, Vulkan and WebGpu. Tensor operation fusion support has been greatly expanded to optimize element-wise operations, reductions, and matrix multiplications.

A new compilation cache and improved autotune cache speed up repeated runs by reusing precompiled binaries and tuned kernel configurations. Data parallel training now scales better across multiple GPUs with automatic batch assignment to each worker. A new tensor slice API offers a simpler, more intuitive way to index tensors.

This version also comes with broad performance gains across tensor operations, especially for reductions, matmul, and convolutions. An initial implementation of quantized matmul is now available, with further quantization improvements planned in the future.

As with previous releases, this includes various bug fixes, further optimizations and enhanced documentation.

Be sure to check out the new burn-bench to compare performance across different versions, hardware and backends.

CubeCL Backends

Burn supports Cuda, Rocm, Vulkan, WebGpu, and the newly added Metal backend.

Each backend can be used through its respective type alias, provided that the appropriate backend feature flag is also enabled.

Metal
```toml
burn = { version = "0.17.0", features = ["metal"] }
```
```rust
use burn::prelude::*;
use burn::backend::wgpu::{Metal, WgpuDevice};

let tensor = Tensor::<Metal, 2>::zeros([2, 4], &WgpuDevice::default());
```
Cuda
```toml
burn = { version = "0.17.0", features = ["cuda"] }
```
```rust
use burn::prelude::*;
use burn::backend::cuda::{Cuda, CudaDevice};

let tensor = Tensor::<Cuda, 2>::zeros([2, 4], &CudaDevice::default());
```
Rocm
```toml
burn = { version = "0.17.0", features = ["rocm"] }
```
```rust
use burn::prelude::*;
use burn::backend::rocm::{Rocm, HipDevice};

let tensor = Tensor::<Rocm, 2>::zeros([2, 4], &HipDevice::default());
```
Vulkan
```toml
burn = { version = "0.17.0", features = ["vulkan"] }
```
```rust
use burn::prelude::*;
use burn::backend::wgpu::{Vulkan, WgpuDevice};

let tensor = Tensor::<Vulkan, 2>::zeros([2, 4], &WgpuDevice::default());
```
WebGpu
```toml
burn = { version = "0.17.0", features = ["webgpu"] }
```
```rust
use burn::prelude::*;
use burn::backend::wgpu::{WebGpu, WgpuDevice};

let tensor = Tensor::<WebGpu, 2>::zeros([2, 4], &WgpuDevice::default());
```


[!WARNING]
When using one of the wgpu backends, you may encounter compilation errors related to recursive type evaluation. This is due to complex type nesting within the wgpu dependency chain.

To resolve this issue, add the following line at the top of your main.rs or lib.rs file:

```rust
#![recursion_limit = "256"]
```

The default recursion limit (128) is often just below the required depth (typically 130-150) due to deeply nested associated types and trait bounds.

Data Loader and Batcher

The Batcher trait has been updated to improve multi-device support. Previously, batcher implementations stored a device internally, which could lead to all data being loaded on the same device. The latest changes make the DataLoader generic over the backend, while the device is passed explicitly:

```diff
-impl<B: Backend> Batcher<MyItem, MyBatch<B>> for MyBatcher<B> {
+impl<B: Backend> Batcher<B, MyItem, MyBatch<B>> for MyBatcher {
-    fn batch(&self, items: Vec<MyItem>) -> MyBatch<B> {
+    fn batch(&self, items: Vec<MyItem>, device: &B::Device) -> MyBatch<B> {
         // The correct `device` is already provided for the batching logic to use
     }
 }
```

The device can now be set when building a data loader:

```diff
 let dataloader = DataLoaderBuilder::new(batcher)
     .batch_size(batch_size)
     .shuffle(seed)
     .num_workers(num_workers)
+    .set_device(device)
     .build(dataset);
```

This step is not required for the Learner, which handles the device configuration automatically.
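
For reference, a minimal sketch of a batcher under the new trait might look like the following (MyItem, MyBatch, and MyBatcher are the placeholder types from the diff above; the field layout is assumed for illustration):

```rust
use burn::data::dataloader::batcher::Batcher;
use burn::prelude::*;

#[derive(Clone, Debug)]
pub struct MyItem {
    pub value: f32,
}

#[derive(Clone, Debug)]
pub struct MyBatch<B: Backend> {
    pub values: Tensor<B, 1>,
}

// No device is stored in the batcher anymore.
#[derive(Clone, Default)]
pub struct MyBatcher;

impl<B: Backend> Batcher<B, MyItem, MyBatch<B>> for MyBatcher {
    fn batch(&self, items: Vec<MyItem>, device: &B::Device) -> MyBatch<B> {
        let values: Vec<f32> = items.iter().map(|item| item.value).collect();
        MyBatch {
            // The data loader passes the correct device for this worker.
            values: Tensor::from_floats(values.as_slice(), device),
        }
    }
}
```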

Better Tensor Slicing & Indexing

Tensor slicing now fully adopts idiomatic Rust range syntax, replacing the older (i64, i64) and Option tuple forms.

For example:

```diff
 let tensor = Tensor::<B, 2>::zeros([m, n], &device);
-let slice = tensor.slice([(0, -1), (0, -2)]);
+let slice = tensor.slice([0..-1, 0..-2]);
```

For more complex or mixed range types, use the s![] macro:

```diff
 let tensor = Tensor::<B, 3>::zeros([b, s, d], &device);
-let slice = tensor.slice([None, Some((t as i64, t as i64 + 1)), None]);
+let slice = tensor.slice(s![.., t..t + 1, ..]);
```

The macro is inspired by ndarray's s![] (at least, by name) and helps build flexible slice patterns.

```rust
use burn::prelude::*;

let tensor = Tensor::<B, 4>::zeros([8, 4, 2, 3], &device);
let slice = tensor.slice(s![..=4, 0..=3, .., -1]);
assert_eq!(slice.dims(), [5, 4, 2, 1]);
```

Changelog

Module & Tensor

  • Feature add new one hot function meeting multi-dimensions (ranks) (#2613) @tiruka
  • Expand GRU support (#2704) @nwhitehead
  • feat: bitwise-ops-for-tensors (#2498) @quinton11
  • Feat: Add PoissonNLL loss (#2765) @salvomcl
  • Add metric parametrized name (#2808) @laggui
  • Add boolean and/or to bool tensors (#2802) @wingertge
  • Add ATOL/RTOL defaults (#2824) @crutcher
  • Feat: Add tan trig function (#2854) @Msa360
  • Refactor quantization schemes (#2849 #3036) @laggui @maxtremblay
  • Vectorize pooling for optimization (#2905) @wingertge
  • Feat: Add Cosh and Sinh (#2959) @Msa360
  • Refactor in-memory recorder load args (#2892) @BjornTheProgrammer
  • Improve gradient checkpointing (#2997) @nathanielsimard
  • Optimize minmax (#3009) @nathanielsimard
  • Improve tensor.slice(...) to support multiple range types (#3061) @laggui

Bug Fixes

  • Fix bce loss log (#2741) @laggui
  • Fix repeat_dim backward w/ dim size > 1 (#2777) @laggui
  • [Fix] tch upgrade (#2834) @wingertge
  • Check channels_in matches in convolution layers (#2944) @chlobes
  • Fixed GroupNorm implementation (#2945) @computer-whisperer

Backends

  • Migrate to type magic autotune (#2710) @wingertge
  • Feat/fused matmul tune (#2726) @nathanielsimard
  • Feat/shared sum (#2737) @maxtremblay
  • Improve fusion for broadcasting, mix vectorization and reshape operation (#2773 #2833) @nathanielsimard
  • Fuse gather (#2793) @nathanielsimard
  • Feat/fuse select (#2797 #2804 #2903) @nathanielsimard
  • Remove from_data conversions in backends (#2783) @laggui
  • Feat fuse swap dims (#2801 #2877) @nathanielsimard
  • [Feature] reduce fuse on read (#2870) @nathanielsimard
  • [Feat] SIMD acceleration for ndarray backend (#2851) @wingertge
  • Perf/reduce fuse on write (#2937) @nathanielsimard
  • [metal] Add CubeCL metal compiler support (#2993) @syl20bnr
  • Compilation Cache (#3020) @nathanielsimard
  • Cubecl quantize matmul (#3022 #3030) @maxtremblay

Bug Fixes

  • Fix from data fusion (#2735 #2778) @laggui @nathanielsimard
  • Fix constant creation in fusion to cast at compile time, not runtime (#2782) @wingertge
  • Fix two autotune issues on wasm (#2899) @ArthurBrussee
  • Fix/reduce out of bounds (#2906) @nathanielsimard
  • Fix fusion bug (#3031) @nathanielsimard
  • Fix metal backend name (#3040) @nathanielsimard
  • Fix matmul dynamic line size support (#3056) @nathanielsimard
  • Fix: matmul lower precision / flex32 (#3059) @nathanielsimard
  • Fix/autotune cache conflicts (#3070) @nathanielsimard

Documentation & Examples

  • Wasserstein Generative Adversarial Network (#2660) @wangjiawen2013
  • Add modern lstm (#2752) @wangjiawen2013
  • Improve tensor docs (#2951) @PtiLuky

Fixes

  • chore: fix some comments (#2717) @sunxunle
  • Add hardsigmoid formula and fix WGAN doc + default lr (#2706) @laggui
  • Fix db-pedia-infer backend (#2736) @laggui
  • Fixed typo in the burn book chapter advanced unit no-std. (#2731) @xmy314
  • typo - correct smp_serde to rmp_serde as per crate's name in url (#2744) @cameronbraid
  • typo - missing tick which was breaking formatting (#2745) @cameronbraid
  • Remove autodiff from generate (#2759) @laggui
  • Remove empty format precision specifier (#2785) @hkBst
  • Update tch instructions (#2844 #2976) @laggui
  • Fix from_embedded and bool ops docs (#2848) @laggui
  • Fix tiny typo in mathematical expression (#2867) @janhohenheim
  • Fix typos (#2927) @crutcher
  • Fix/web example (#2954 #2978) @laggui
  • Fix: burn-book getting-started Use Declarations (#2966) @jerryshell
  • chore: fix comment (#3008) @tsinghuacoder

ONNX Support

  • Code generation bug fix for ONNX import (#2708) @antimora
  • Floor Node (#2792) @akshitgaur2005
  • One hot ONNX (#2784) @akshitgaur2005
  • Onnx op topk (#2305) @oojo12
  • Fix output elem type for unsqueeze and reshape (#2807) @christeefy
  • Feat/Split ONNX Import (#2568) @agelas
  • Refactor GatherNode to support scalar outputs. (#2828) @loloxwg
  • Rename dim to rank for ONNX import (#2831) @antimora
  • Add rank inference for tan (#2868) @Msa360
  • Add Gemm (#2841) @akshitgaur2005
  • Fix RandomNormalLike ONNX node output rank (#2936) @Knight-Ops
  • Support multiple outputs being tracked in BurnGraph during ONNX conversion (#2938) @Knight-Ops
  • Ignore ONNX optional node inputs/outputs (#2935) @Knight-Ops
  • Fix ONNX flatten to match spec (#2940) @catch-twenty-two
  • burn-import: add some tests for ConstantNode (#2623) @jameshiew @laggui
  • Update SUPPORTED-ONNX-OPS.md with the latest info (#3064) @antimora

Enhancements

  • Add new burn-vision crate (#2753 #2810 #2842) @wingertge
  • Improve Burn compilation times (#2815 #2994) @nathanielsimard
  • Support training in no-std (#2830) @ivila
  • Perf: Speed up element and TensorData conversion (#2913) @wingertge
  • Feat/cubecl caching (#2902) @nathanielsimard
  • Improve multi-device data loading strategy (#2890 #3035) @laggui
  • Autotune level matmul double buffering (#2988) @nathanielsimard @louisfd

Refactoring

  • Remove deprecated Data and DataSerialize (#2703) @laggui
  • Clean up train system metrics (#2707) @laggui
  • Move IR to its own crate (#2796 #2798) @laggui
  • Refactor burn jit => burn-cubecl (#2809) @nathanielsimard
  • Cleanup Tensor Registry in fusion (#2826) @nathanielsimard
  • Migrate conv2d to cubecl (#2908 #3018) @wingertge
  • Update to edition 2024 (#2931) @laggui
  • Update runtime names (#2909) @nathanielsimard
  • Migrate backend comparison (#2961) @laggui
  • Improve test tolerance assertions (#3024) @maxtremblay @laggui
  • [hip] Move burn-hip to burn-rocm and rename backend to ROCm (#3062) @syl20bnr

Miscellaneous

  • Fix no default features flags + update cubecl (#2725) @laggui
  • Replace return with terminate (#2742) @maxtremblay
  • Clean up -jit suffix in feature flags and modules (#2705) @laggui
  • Fix types under autotune flag (#2750) @laggui
  • Fix BackendValues in backend-comparison after removal of jit suffix (#2756) @syl20bnr
  • Update cubecl (#2764) @wingertge
  • Fix optional burn-import dep + impl module types for isize (#2774) @laggui
  • Update cubecl with fix to shared_sum (#2779) @maxtremblay
  • feat: using rustls instead of native-tls (#2799) @ShoofLLC
  • bump cubecl version with dummy implementations (#2814) @maxtremblay
  • Add data_dir optional argument to Huggingface DataLoader to enable some manual download use cases (#2817) @Pablo1785
  • Bump xtask to 1.1.9 (#2896) @syl20bnr
  • Fix test checks for macos (#2952) @PtiLuky
  • Update cargo deps (#2962) @Brooooooklyn
  • Add train end event (#2967) @laggui
  • Update cubecl bitcast -> reinterpret (#2985) @maxtremblay
  • Update cubecl (#2869 #2888 #2990 #2996) @louisfd
  • Update wgpu to v25 (#3007) @syl20bnr
  • update cubecl: sync full cyclic checked (#3025) @louisfd
  • Fix autotune measurement (#3043) @nathanielsimard

Published by laggui 10 months ago

burn - v0.16.1

Fixes / Improvements

  • Update bincode dependency (fixes #2876) @laggui
  • Fix TUI renderer display summary (#2967) @laggui

Published by laggui 11 months ago

burn - v0.16.0

Summary

This release significantly enhances GPU utilization through a new tensor transaction mechanism for batched sync operations and simultaneous reads of multiple bindings for CubeCL runtimes. It also includes multiple performance optimizations like mixed precision support for matrix multiplication and convolution operations, as well as notable GEMM improvements.

Backend capabilities have been expanded with a new remote backend for distributed computing, improved SPIR-V support, custom operations fusion and an experimental fused matrix multiplication.

Training components have been expanded to support semantic segmentation and object detection datasets, new training metrics and improved training performance thanks to an async metric processor.

As with previous releases, this version includes various bug fixes, further performance optimizations, new tensor operations and enhanced documentation.

Module & Tensor

  • Add warning in docstring for indices bound checks (#2462) @laggui
  • Add remainder op for tensor (#2427) @med1844
  • Add float cast tensor op (#2483 #2511 #2538 #2586 #2671) @laggui
  • Add step learning rate scheduler (#2423) @towerpark
  • Add tensor split operator (#2490) @agelas
  • Add tensor transaction mechanism to batch multiple sync operations (#2521) @nathanielsimard
  • [Breaking] Make .init() method of LR schedulers return Result (#2527) @towerpark
  • Make optimizer state public (#2561) @ArthurBrussee
  • Accept function pointer or closure for freq scaling (#2634) @laggui
  • Change pad value w/ ElementConversion (#2653) @laggui
  • Add checks for even padding when kernel size is even (#2677) @laggui

Bug Fixes

  • Fix unsqueeze dims with multiple trailing negative indices (#2496) @laggui
  • Fix one_hot implementation for Int Tensors (#2501) @maun
  • Fix tensor prod and prod dim containing nan values (#2515) @quinton11
  • Expose ItemLazy to be able to implement for custom types (#2525) @laggui
  • Check nonzero stride, dilation and groups (#2540) @laggui
  • Module derive types should inherit visibility (#2610) @laggui
  • Add dropout prob check (#2695) @laggui

Backends

  • Add remote Backend (#2463) @nathanielsimard
  • Add support for custom operations fusion (#2486) @ArthurBrussee
  • [Breaking] Remove precision bridge (#2538) @laggui
  • Add fused matmul under fusion experimental feature flag (#2622 #2690) @nathanielsimard

Bug Fixes

  • Prevent various OOB accesses and discontiguous buffer bugs (#2467) @wingertge
  • Fix autodiff memory management by verifying parent nodes' existence (#2488) @jnamika
  • Fix burn remote deadlock + burn fusion draining (#2492) @nathanielsimard
  • Remove dtype rewrite (#2528) @ArthurBrussee
  • Fix reduce autotune key no anchor (#2696) @nathanielsimard

Documentation & Examples

  • Add wgpu-spirv and hip-jit features to text-classification example (#2422) @syl20bnr
  • Add tensor basic ops examples (#2468) @quinton11
  • Add segmentation mask to burn book (#2495) @anthonytorlucci
  • Add numeric tensor examples (#2514) @quinton11
  • Add module mapper book examples (#2621 #2632) @laggui

Fixes

  • Fix output dim in embedding nn docstring (#2452) @getumen
  • Fix tri mask ops return docstring (#2517) @laggui
  • Fix the incorrect link in contributor-books (#2583) @tiruka
  • Fix the broken WGSL link in the README (#2607) @korbexmachina
  • Fix module visitor and mapper trait definition in the book (#2609) @laggui
  • Fix load_file usage to keep using model (#2672) @laggui
  • Don't mention a fixed candle bug (#2689) @kitterion

ONNX Support

  • Format all type names (#2436) @samolego
  • Add ONNX op Random Normal Like (#2441) @tiruka
  • Add ONNX op Random Uniform Like (#2448) @tiruka
  • Infer convolution kernel shape from weight (#2544) @laggui

Enhancements

  • Improve ndarray tensor creation from memory (#2439) @nathanielsimard
  • Dont attempt naive reduction when reduce_dim is too high (#2414) @ArthurBrussee
  • Add more type support for burn-jit (#2454) @wingertge
  • Rewrite legacy cpa kernels (#2455) @wingertge
  • Implicit GEMM optimizations/bug fixes (#2499) @wingertge
  • Add custom NCHW to NHWC kernel for implicit GEMM (optimization) (#2530) @wingertge
  • Support 8-bit bool for JitBackend (#2526) @wingertge
  • Implicit gemm rewrite optimization (#2545) @wingertge
  • Fix autotune error handling (#2670) @nathanielsimard
  • Use float intrinsics for deformconv2d backward, fix intodata for padded tensors (#2681) @wingertge

Refactoring

  • Migrate to cubecl IR refactor (#2418) @wingertge
  • DefaultDevice should be an alias of BestAvailable (#2443) @ArthurBrussee
  • Replace crates by dependi (#2477) @vincentmasse
  • Refactor quantization tensor data representation (#2479) @laggui
  • Use alias for more consistent typing (#2497) @loganbnielsen
  • Add QTensorOps docs + refactor tests to simplify inputs (#2557) @laggui
  • Update for rust 1.83 (#2562 #2605) @laggui
  • Matmul + CubeCL Update (#2551) @nathanielsimard
  • Migrate matmul autotune to macro and fix accelerated (#2584) @wingertge
  • Refactor jit quantized tensor representation (#2604) @laggui
  • [Breaking] Fix alignment issue of TensorData bytes (#2416) @WorldSEnder
  • Refactor quantized bytes representation (#2627) @laggui
  • Update to new cubecl with improved compilation times (#2654) @nathanielsimard
  • Refactor unary + binary kernels (#2665) @nathanielsimard
  • Import code from github-device-flow crate for burnbench (#2667) @syl20bnr
  • Fix web examples and conflicting feature flags w/ default-features = false (#2691) @laggui
  • Use cubecl reduce w/ autotune (#2673) @maxtremblay

Miscellaneous

  • Use core::error::Error for no-std (#2346) @antimora
  • Update deny.toml to follow the spec changes of cargo-deny (#2408) @tiruka
  • Add segmentation mask to ImageFolderDataset (#2426) @anthonytorlucci
  • Add ROC AUC metric (#2466) @vincentmasse
  • Async Processor: run train metrics & dashboard on another thread (#2482) @nathanielsimard
  • Add precision classification metric (#2293) @tsanona
  • Add test int one_hot and change ops docs in the book (#2519) @tsanona
  • Add option to request manual quit on tui (#2489) @vincentmasse
  • Reduce log spam (#2556) @ArthurBrussee
  • Add ImageDatasetItem image path field (#2558) @wangjiawen2013
  • Fix xtask command with last version (#2566 #2582) @syl20bnr
  • Remove duplicate jit conv2d test (#2581) @tiruka
  • Relax Fn requirements for param map (#2620) @ArthurBrussee
  • Extend ImageFolderDataset to support import of COCO detection (#2612) @jin-eld
  • Add recall metric (#2518) @tsanona
  • Propagate audio feature flag (#2633) @laggui
  • Add F-score metric (#2648) @tsanona
  • Implement benchmark for reduce kernel (#2692) @maxtremblay

Published by laggui about 1 year ago

burn - v0.15.0

Summary

This release brings major performance improvements to tensor operations, particularly in matrix multiplication and convolution, along with experimental ROCm/HIP and SPIR-V support enabled by CubeCL runtimes. It also introduces foundational features for multi-backend compatibility and adds new quantization operations.

Support for ONNX models has been expanded, with additional operators and bug fixes for better operator coverage.

As with previous releases, this version includes various bug fixes, further performance optimizations, new tensor operations, and enhanced documentation.

Module & Tensor

  • Remove copy restriction for const generic modules (#2222) @laggui
  • Add deform_conv2d as implemented in torchvision (#2147) @wingertge
  • Add dim checks on output rank for unsqueeze and stack (#2331) @laggui
  • Add Softmin (#2358) @NoahSchiro
  • Add round, floor, ceil for float tensor (#2372) @med1844
  • Make tensor sync (#2392) @kingwingfly
  • Add tensor.one_hot int operation (#2413) @tsanona
  • [Breaking] Change LR schedulers to return the initial LR at first .step() (#2337) @towerpark
  • Move LrSchedule generic to make it easier to use (#2309) @ArthurBrussee
  • Add quantization ops default implementation (#2125 #2275 #2301) @laggui

Bug Fixes

  • Avoid 0 denominator in interpolate frac (#2224) @laggui
  • Nonzero should return an empty vec for zero tensors (#2212) @laggui
  • Change ndarray mask_where implementation to correctly deal with NaNs (#2272) @laggui
  • Fix mask_where broadcasted input (#2381) @laggui
  • Make powf broadcastable (#2398) @laggui

Backends

  • Add candle CudaDevice and MetalDevice to avoid creating a new unique device each time (#2290) @laggui
  • Add fusion mix precision (#2247) @nathanielsimard
  • Add SPIR-V compiler backend to burn-wgpu (#2386) @wingertge
  • Add burn-hip (#2399) @syl20bnr
  • Add BackendRouter to handle multiple backends on the way to distributed (#2353 #2419) @laggui

Bug Fixes

  • Fix autodiff memory leak (#2347) @nathanielsimard
  • Fix autodiff abs NaN when output is 0 (#2249) @AsherJingkongChen

Documentation & Examples

  • Add documentation for custom cubecl kernels, update some outdated docs (#2404) @wingertge
  • Add comments to burn fusion (#2130) @cBournhonesque
  • Improve doc for burn-tch (#2288) @kingwingfly
  • Improve regression example (#2405) @laggui
  • Create CITATION.cff (#2231) @antimora
  • Enable docautocfg to show feature-req-hint in docs.rs (#2271) @kingwingfly

Fixes

  • Fix tensor data elem type conversion in book (#2211) @laggui
  • Fix target convert in batcher and align guide imports (#2215) @laggui
  • Fix huber loss documentation (#2232) @kingwingfly
  • Fix debugger settings doc in contributor book (#2223) @tiruka
  • Fixed raspberry pi pico example not compiling (#2220) @BjornTheProgrammer
  • Fixed path in book (#2262) @mehmetalianil
  • Fix unresolved import regression (#2285) @tiruka
  • Fix burn book links (#2303 #2327) @laggui @tiruka
  • Contributor Book: Fix the link of primitive types in the "Serialization" page (#2362) @towerpark
  • Fix simple regression batch targets (#2379) @wangjiawen2013
  • Fix xtask args which are unmodified when upgrading xtask commands (#2364) @tiruka

ONNX Support

  • Add gather support for multi-dim indices (rank > 1) (#2199) @alteredoxide
  • Allow onnx-import expand op with non-const shapes (#2189) @hexd0t
  • Improve ONNX import tensor shape tracking (#2213) @hexd0t
  • Add missing output padding to conv transpose ONNX (#2216) @laggui
  • Fix ONNX where op for scalar inputs (#2218) @hexd0t
  • simplify scope tracking in burn-import (#2207) @skewballfox
  • Add onnx op trilu (#2323) @tiruka
  • Add ConvTranspose1d ONNX op (#2349) @tiruka

Enhancements

  • Improve slice kernel performance (#2252) @nathanielsimard
  • Fix burn-jit conv2d excessive loop unrolling (#2263) @AsherJingkongChen
  • Introduce autotuning to conv2d and conv_transpose2d with a new im2col/GEMM algorithm (#2287) @wingertge
  • Further data locality optimizations for implicit GEMM (#2300) @wingertge
  • Add utility methods to split gradients to GradientParams (#2311) @ArthurBrussee
  • Add bounds checking to implicit GEMM to allow arbitrary input shapes (#2354) @wingertge
  • Initialize accumulator to bias for implicit GEMM to save an expensive float_add (#2383) @wingertge

Refactoring

  • Select kernel from CPA to CubeCL (#2168) @mepatrick73
  • Migrate cubecl macro (#2266) @wingertge
  • Remove primitves const D generic (#2298) @laggui
  • Refactor elemwise fusion (#2344) @nathanielsimard
  • Refactor Adaptive Avg Pool to CubeCL (#2351) @nathanielsimard
  • Refactor pooling kernels (#2356) @nathanielsimard
  • Refactor burn-tensor: Split conv backward ops to allow conditional gradient computation (#2278) @AsherJingkongChen

Miscellaneous

  • Fix panic messages being invisible in tui mode (#2226) @PaulWagener
  • Refactor xtask to use tracel-xtask and refactor CI workflow (#2063) @syl20bnr
  • Automatic minimum rust version in README (#2227) @syl20bnr
  • Set MSRV to 1.81 (#2388) @nathanielsimard
  • Don't panic when the progress is > 1.0 (#2229) @PaulWagener
  • Fix compile for dataset crate with vision feature (#2228) @PaulWagener
  • Update CI workflow for last version of setup-linux action (#2248) @syl20bnr
  • [CI] Fix llvmpipe, lavapipe install for valgrind and vulnerabilities (#2264) @syl20bnr
  • Use CliMetricsRenderer when not in a terminal (#2307) @lancelet
  • Update rusqlite and associated libraries (#2328) @paulirotta
  • Fix missing fusion feature flag @nathanielsimard
  • Move conv autotune under feature flag (except key) (#2330) @laggui
  • Add should_run for convs instead of panicking (#2403) @ArthurBrussee
  • Make changes for latest ratatui version (#2421) @laggui
  • Add Windows/WindowsIterator/WindowsDataset (#2338) @NicoZweifel

Published by laggui over 1 year ago

burn - v0.14.0

Summary

This release marks the debut of our CubeCL integration, which brings cross-platform GPU programming capabilities directly to Rust. With CubeCL now supporting both CUDA and WebGPU, Burn benefits from a new CUDA backend that can be enabled using the cuda-jit feature. Please note that this backend is still considered experimental, and some operations, particularly those related to vision, may experience issues.

Additionally, this release features significant enhancements to ONNX support, including bug fixes, new operators, and improvements in code generation.

As always, it also includes numerous bug fixes, performance enhancements, new tensor operations, and improved documentation.

Burn 0.14.0 introduces a new tensor data format that significantly enhances serialization and deserialization speeds, along with Quantization, a new beta feature included in this release. The format is not compatible with previous versions of Burn, but you can migrate your previously saved records using this guide.

Module & Tensor

  • (@laggui) Add RoPE init_with_frequency_scaling (#2194)
  • (@laggui) Add 0-dim tensor checks for creation ops and validate TensorData shape w/ num values (#2137)
  • (@wingertge): Add Hard sigmoid activation function (#2112)
  • (@antimora): Add is_nan and contains_nan tensor ops (#2088)
  • (@laggui) Convert compatible prelu weights to rank 1 (#2054)
  • (@laggui) Refactor tensor quantization for q_* ops (#2025)
  • (@RuelYasa): Adding burn::nn::Sigmoid (#2031)
  • (@laggui) Module weight quantization (#2000)
  • (@louisfd): Cube: Matmul tiling (#1994)
  • (@antimora): Enhance slice operation to support more range variation (#1989)
  • (@laggui) Add static tensor quantization (#1963)
  • (@johnhuichen): Enable negative starts and ends for slice op (#1981)
  • (@booti386): Implement 3D and transposed 3D convolutions. (#1945)
  • (@antimora): Print module - implement module display for remaining modules (part2) (#1933)
  • (@antimora): Print model structure like with PyTorch - Part 1 (#1912)
  • (@DieracDelta): Tanh nn wrapper (#1903)
  • (@laggui) Implement Element for bool (#1878)
  • (@LilDojd) Feat: Add movedim tensor operator (#1876)
  • (@ArthurBrussee): Make autodiff compile on wasm (#1889)
  • (@ArthurBrussee): Make Param.id public (#1859)
  • (@kantic) Remainder operator (#1726)
  • (@McArthur-Alford) Indices Operator (#1735)
  • (@laggui) Add seq start position when applying RoPE encoding (#1796)
  • (@JachymPutta): Adding max import (#1769)
  • (@agelas): Feat/squeeze dims (#1779)
  • (@wcshds) Implement bidirectional LSTM (#1035)
  • (@agelas): Feat/remainder (#1597)

Bug Fixes

  • (@laggui) Fix root-mean-square precision issue (#2193)
  • (@laggui) Fix indices dim check in gather_update_outputs (#2149)
  • (@antimora): Fix #2091 bug (in-place after expand) (#2114)
  • (@laggui) Fix aggregation results slice (#2110)
  • (@nathanielsimard): Fix: fusion auto bound checks (#2087)
  • (@laggui) Extend [min, max] range to ensure zero-point (#2055)
  • (@agelas): Bug/Remove Squeeze Panic for Multiple Dimensions (#2035)
  • (@nathanielsimard): Fix wgsl remainder definition (#1979)
  • (@laggui) Fix output tensor dtype (#1938)
  • (@femshima): feat: Make RetroForward public (#1905)
  • (@laggui) Fix conv2d_weight_grad_groups (#1891)
  • (@nathanielsimard): Fix select assign backward (#1739)
  • (@louisfd): Fix repeat for dims > 1 (#1713)
  • (@nathanielsimard): Fix lstm batch size bug (#1695)
  • (@antimora): Reshape bug fix (#1684)
  • (@antimora) Fix bug: Filling tensor containing f32::NEG_INFINITY will result in NaN for burn-ndarray (#2095)

ONNX Support

  • (@hexd0t): Allow ONNX scalar greater/less with scalar (#2146)
  • (@hexd0t): Implement ONNX Gather for scalar indices (#2141)
  • (@mepatrick73): feat: adding shape support for gather ONNX operation (#2128)
  • (@mepatrick73): ONNX Tile operation (#2092)
  • (@cBournhonesque): Add onnx mean (#2119)
  • (@mepatrick73): Repeat operation (#2090)
  • (@antimora): Add 1d and 2d modules for interpolate with scaling (also fix ONNX Resize op) (#2081)
  • (@johnhuichen): Implement ONNX Pad Operator (#2007)
  • (@hexd0t, @antimora): Implement ONNX ConstantOfShape (#1815)
  • (@johnhuichen): Add subtract tensor from scalar for ONNX sub op (#1964)
  • (@Dirleye): Add ReduceProd ONNX Import (#1955)
  • (@JachymPutta) feat: added reduce min onnx import (#1894)
  • (@mosure): feat: resize onnx import (#1863)
  • (@JachymPutta) feat: added slice onnx import (#1856)
  • (@skewballfox): Optimize argument handling and improve ONNX graph building (#1857)
  • (@JachymPutta) feat: add sum onnx import (#1846)
  • (@agelas): Feat/gather import (#1843)
  • (@JachymPutta): feat: expand onnx import (#1813)
  • (@JachymPutta): feat: added range onnx import (#1834)
  • (@will-maclean): Feature/onnx argmax (#1814)
  • (@hexd0t): Feat: Implement ONNX RandomUniform + RandomNormal in burn-import (#1806)
  • (@JachymPutta): feat: Greater + GreaterOrEqual onnx import (#1801)
  • (@JachymPutta): feat: Less + LessOrEqual onnx import (#1800)
  • (@JachymPutta): feat: added min onnx import (#1778)
  • (@agelas): Squeeze Onnx Import (#1753)
  • (@Arjun31415): Added ONNX AvgPool1d (#1744)
  • (@Arjun31415): Add MaxPool1d ONNX Op(#1725)
  • (@AntBlo) Add reduce sum onnx ops to burn imports (#1723)
  • (@Arjun31415): PReLu ONNX import (#1721)
  • (@antimora): Update SUPPORTED-ONNX-OPS.md (#1717)
  • (@antimora): ONNX debug improvements (#1712)
  • (@antimora): Skip updating shape for linear if not present (#1700)
  • (@laggui) Remove leaky relu ONNX file (#1697)
  • (@antimora): ONNX support for scalar unsqueeze (#1690)
  • (@laggui) Add layer norm onnx op support (#1680)
  • (@antimora): Fix reshape bug (support for opset version 1) (#1667)
  • (@wufniks) Add sign ONNX op import support (#1663)
  • (@laggui) Add where onnx op support (#1653)
  • (@laggui) Add matmul ONNX op support (#1638)
  • (@laggui) Add reduce max ONNX op support (#1636)
  • (@laggui) Add shape ONNX op support (#1639)
  • (@laggui) [ONNX] Add not op and extend cast support to tensors (#1634)
  • (@laggui) Add reduce mean ONNX op support (#1637)
  • (@antimora): Update SUPPORTED-ONNX-OPS.md (#1641)
  • (@laggui) Add sin onnx op support (#1633)

Bug Fixes

  • (@mepatrick73) Tensor type indent fix (#2196)
  • (@mepatrick73) pad-input-fix: adding support for pads as attributes (#2195)
  • (@hexd0t) Fix ONNX Gather codegen for Shape input (#2148)
  • (@mepatrick73): bug fix: adding bounds checking to pad ONNX inputs (#2120)
  • (@laggui) Fix checks_channels_div_groups condition and ONNX conv import with groups (#2051)
  • (@nathanielsimard): Support linear 1d (#1682)
  • (@laggui) Fix ONNX and PyTorch import section links in burn book (#1681)
  • (@antimora): Fix bug 1645 (Unsqueeze OpSet 11) (#1661)
  • (@laggui) Fix transpose onnx op (permute) (#1657)

Enhancements

  • (@laggui) Add scientific notation formatting for small metric values (#2136)
  • (@ArthurBrussee): Always derive Cube features from adapter (#1958)
  • (@mepatrick73, @nathanielsimard): Dynamic memory management preset + updated wgpu buffer memory management (#1962)
  • (@mepatrick73): Feat/fixed chunk alloc by class (#1960)
  • (@ArthurBrussee): Consistent sync/async handling, allow more functions to be async for wasm. (#1936)
  • (@varonroy): Replaced str with Path (#1919)
  • (@louisfd, @nathanielsimard): New autodiff graph memory management strategy (#1698)
  • (@syl20bnr): Move HandleContainer and Tensor Ops descriptions from burn-fusion to burn-tensor (#1654)
  • (@NicoZweifel) WindowDataset/windows function (#1553)
  • (@antimora): Improve pickle (CandleTensor) conversions to NestedValue (#1944)

Refactoring

  • (@mepatrick73) Scatter kernel from cpa to cubecl (#2169)
  • (@nathanielsimard): Refactor binary op (#2085)
  • (@omahs): Fix typos (#2098)
  • (@nathanielsimard): Refactor/jit/unary (#1965)
  • (@skewballfox): Separating ONNX parsing from burn-import (#1921)
  • (@laggui) Refactor tensor data (#1916)
  • (@ArthurBrussee): Remove GraphicsAPI generic for WgpuRuntime (#1888)
  • (@skewballfox): add dependency management for python (#1887)
  • (@louisfd): refactor reduce into separate traits (#1798)
  • (@nathanielsimard): Refactor/jit fusion (#1750)
  • (@nathanielsimard): Refactor/burn compute (#1580)

Documentation & Examples

  • (@nathanielsimard) Enable cuda-jit in burn-core + in text classification example (#2160)
  • (@cBournhonesque): Add comments for matmul kernel (#2138)
  • (@laggui) Fix inner backend typo in book guide (#2135)
  • (@antimora): Improve ONNX import book section (#2059)
  • (@antimora): Update slice documentation (#2024)
  • (@syl20bnr): Remove mention of example in backend section of the book (#2014)
  • (@laggui) Fix image-classification-web + autotune flag usage (#2011)
  • (@nathanielsimard): Cube/doc/readme (#1904)
  • (@laggui, @syl20bnr) Add models and examples reference (#1966)
  • (@antimora): Print module part3 - Update book (#1940)
  • (@towerpark): Book: Fix the link to burn-train in "Learner" page (#1920)
  • (@nathanielsimard): Doc: Improve module to_device/fork docs (#1901)
  • (@jwric, @ThierryCantin-Demers, @mepatrick73): Add documentation to burn core nn (#1746)
  • (@towerpark): Book: Fix typos in the name of MessagePack format (#1868)
  • (@Zirconium409122, @kantic): Remainder operator doc (#1836)
  • (@nathanielsimard): Fix wasm examples (#1824)
  • (@eltociear) docs: update README.md (#1810)
  • (@agelas): Contributor Book: Onnx to Burn Conversion (#1771)
  • (@benbaarber): update ARCHITECTURE.md links to project architecture section in contributor book (#1759)
  • (@jwric): Add hidden code snippets to guide example in Burn book redo
  • (@mepatrick73): Fixing various syntax errors in the Burn book (#1740)
  • (@ThierryCantin-Demers) Add indentation to project architecture in contributing book (#1738)
  • (@AntBlo) Add info about enabling debugging for new contributors (#1719)
  • (@syl20bnr): [guide] Remove ambiguity lib vs. executable (#1649)
  • (@wangxiaochuTHU): Update README.md (#1696)
  • (@syl20bnr): [burn-book] Fix broken URL to SUPPORTED-ONNX-OPS.md (#1651)
  • (@syl20bnr): [burn-book] Fix typos in getting started (#1650)
  • (@louisfd): Many superficial fixes to the contributor book (#1644)
  • (@laggui) Fix guide project name in the book (#1631)
  • (@Gadersd): Improve grammar (#1619)
  • (@agelas): Docs/update contributor book (#1622)

CubeCL

  • (@laggui) Remove CubeCL GELU kernel example reference (moved to CubeCL repo) (#2150)
  • (@cBournhonesque) Convert reduce_dim_naive kernel to use the #[cube] derive macro (#2117)
  • (@syl20bnr): Rename revision key to rev for cubecl dependencies in Cargo.toml (#2086)
  • (@syl20bnr): Fix cubecl version in Cargo.toml to correctly fetch the version tag
  • (@louisfd): Refactor/jit cube/mask (#2075)
  • (@nathanielsimard): Chore/update/cubecl (#2067)
  • (@ArthurBrussee): Feat: Dynamic cube count dispatch (#1975)
  • (@nathanielsimard): Refactor cube launch + support inplace operation (#1961)
  • (@nathanielsimard): Feat/cube/cooperative matrix-multiply and accumulate. (#1943)
  • (@nathanielsimard): Refactor/cube/mutability (#1934)
  • (@nathanielsimard): Handle visibility in cube (#1929)
  • (@nathanielsimard): Feat/cube/array assign ops (#1914)
  • (@nathanielsimard): Feat/comptime expr (#1910)
  • (@nathanielsimard): Feat/cube/compile error (#1909)
  • (@nathanielsimard): feat cube support Array (#1907)
  • (@louisfd): Cube: variable reusability + refactor in cube macros (#1885)
  • (@nathanielsimard): Refactor the tuner to be used standalone (#1884)
  • (@ArthurBrussee): Add option to flush queue instead of waiting for completion. (#1864)
  • (@louisfd): Cube: Vectorization + simple matmul implementation (#1866)
  • (@ArthurBrussee): Get resources from server (#1861)
  • (@ArthurBrussee): Speedup client.create for small allocations. (#1858)
  • (@ArthurBrussee): Add a feature to initialize from an existing wgpu adapter/device/queue (#1788)
  • (@laggui) Fix cmma test (#1957)
  • (@nathanielsimard): Perf/dynamic mm (#1906)
  • (@mepatrick73): Feat/dynamic small pool (#1931)
  • (@mepatrick73): Perf/dynamic mm slice addressing (#1917)
  • (@mepatrick73): Feat/dynamic mm basic implementation + small refactor (#1844)
  • (@louisfd): Cube: CubeType (no launch) and Comptime::map (#1853)
  • (@louisfd, @nathanielsimard): Feat/cube/struct support (#1842)
  • (@nathanielsimard): [Refactor - Breaking] Refactor cube operations with better names & Support subgroup operations (#1839)
  • (@louisfd, @nathanielsimard): Cube: Topology constants (#1838)
  • (@louisfd): Cube: cleaner use of topology values (#1835)
  • (@louisfd): Cube: support for shared memory (#1831)
  • (@louisfd): Cube: support method call + prettier tensor metadata (#1829)
  • (@nathanielsimard): Add vectorization support into cube (#1830)
  • (@louisfd): Cube: support for return + conv2d early return (#1828)
  • (@nathanielsimard): Feat/cube/launch (#1827)
  • (@nathanielsimard): Update cuda-jit (#1799)
  • (@louisfd): Feat/cube/remaining ops (#1807)
  • (@louisfd): Cube: first ported kernel + comptime support + variable reuse + cleanup (#1797)
  • (@louisfd): Refactor/cube/vectorization (#1781)
  • (@louisfd, @nathanielsimard): Feat/enable cube cl (#1777)
  • (@nathanielsimard, @louisfd): Feat/cubecl ir (#1776)
  • (@louisfd): CubeCL first iteration (#1756)
  • (@nathanielsimard): First draft CUDA runtime (#1685)
  • (@nathanielsimard): Upgrade wgpu (#1692)

Miscellaneous

  • (@BjornTheProgrammer) Make compatible with thumbv6m-none-eabi + add raspberry pi pico example (#2096)
  • (@antimora): Precision option for tensor display (#2139)
  • (@tiruka): remove lto linker option to make build successful (#2123)
  • (@cBournhonesque): Add top-k accuracy (#2097)
  • (@tiruka): Modify contributing md scripts to solve conflicts between doc and scripts (#2107)
  • (@ragyabraham, @antimora): Add polars DataFrame support for Dataset (#2029)
  • (@tiruka): modify broken link src of ide image (#2079)
  • (@syl20bnr): Bump rust minimal version to 1.79
  • (@Haislich): Added parameter trust_remote_code to hf dataset call. (#2013)
  • (@laggui) Enable optimized handling of bytes (#2003)
  • (@nathanielsimard): Feat: Support trait with CubeCL (#1980)
  • (@syl20bnr): Set DEFAULT_MAX_TASKS to 1 when running tests
  • (@loganbnielsen) remove manual option matching (#1948)
  • (@jwhogg): Remove closed 'future improvements' (#1935)
  • (@nathanielsimard): Fix: launch without generics (#1932)
  • (@antimora): Update candle-core to a released version (#1913)
  • (@ArthurBrussee): Do not use default burn-compute features unless enabled. (#1908)
  • (@louisfd): clippy on rust update (#1886)
  • (@Icekey): LearnerBuilder "with_checkpointing_strategy" should use builder pattern (#1841)
  • (@nathanielsimard): Fix bench load record benchmarks (#1826)
  • (@jwric): Add configurable application logger to learner builder (#1774)
  • (@getumen) Add Clone trait to the OptimizerAdaptor and Clone implementations to the optimizers (#1770)
  • (@benbaarber): Replace opaque return types in optim (#1767)
  • (@ahmedyarub, @syl20bnr) #1747 Upgrade Rust dependencies (#1748)
  • (@sebhtml): Refactor: replace trait TemplateKernel by existing trait JitKernel (#1737)
  • (@louisfd): Autodiff Memory Management: BFS (#1710)
  • (@nathanielsimard): [Fusion] Support multi-precision fusion (#1718)
  • (@laggui) Refactor element type to be decoupled from runtime (#1693)
  • (@AlexErrant) Arc<EventStoreClient> to Rc<EventStoreClient> (#1668)
  • (@louisfd): remove JIT subsequent RNG tests (#1652)
  • (@antimora): Enable native sign operation for Candle backend (#1647)

Bug Fixes

  • (@laggui) Fix module derive with generics (#2127)
  • (@tiruka): modified mnist image link in the Hugging face (#2134)
  • (@NoahSchiro) Fix broken links in contributor book (#2061)
  • (@syl20bnr): Bump gix-tempfile to fix security audit on gix-fs (#2022)
  • (@laggui) Fix warnings when using record-backward-compat (#1977)
  • (@nathanielsimard): Fix: constant record loading (#1902)
  • (@laggui) Fix DataSerialize conversion for elements of the same type (#1832)
  • (@DieracDelta): Fix burn-jit compile error (#1803)
  • (@laggui) Fix record nested value de/serialization (#1751)
  • (@louisfd): fix prng bug during autotune (#1791)
  • (@ThierryCantin-Demers, @jwric) Fix Cargo.toml repository links (#1749)
  • (@AntBlo) Fix unstable tests when run concurrently (#1724)
  • (@lancelet) Handle ndarray matmul broadcasting (#1679)
  • (@laggui) Fix inverted epoch - iteration counts in valid progress (#1699)
  • (@NicoZweifel) fix: window -> pub window in dataset/mod.rs (#1658)

Published by laggui over 1 year ago

burn -

Bugfix

  • Fix autodiff graph memory management strategy to improve performance (#1702 #1710) @louisfd
  • Fix matmul double broadcasting for ndarray (#1646 #1679) @lancelet

Published by nathanielsimard almost 2 years ago

burn -

Bugfix

  • Fix autodiff memory leak and improve performance with a new graph memory management strategy (#1698) @nathanielsimard @louisfd
  • Fix inplace fused operations (#1682) @nathanielsimard

Improvements

  • Linear 1D support, helpful for ONNX support (#1682) @nathanielsimard
  • Upgrade wgpu to 0.19.4 (#1692) @nathanielsimard

Published by nathanielsimard almost 2 years ago

burn - v0.13.0

The Burn Release 0.13 is a significant update introducing numerous new features and performance enhancements. One major change is the removal of the Sync trait implementation from most Burn types, see Core User APIs. Additionally, the release introduces several new tensor operations, module features, optimizers, as well as improvements to the autodiff backend. Notably, a new bridge mechanism facilitates runtime switching between backends, and significant work has been done on the Just-in-Time and Wgpu backends. The release also addresses numerous bug fixes, documentation improvements, infrastructure updates, CI enhancements, and miscellaneous changes to improve code quality and usability.

Core User APIs

A major change in this release is that most Burn types no longer implement the Sync trait, such as modules, optimizers, and tensors. This change should not impact users of the Learner struct for model training. However, it may affect those who implemented their own training loop or inference server. While modules, optimizers and tensors can be sent to other threads, they cannot be accessed concurrently by multiple threads. This aligns with Burn's workflow, where each tensor operation requires an owned version of the tensor. The change was made to safely reduce the number of locks needed when modifying the state of the autodiff graph, fusion state, allocation cache, and various other use cases. While not all locks have been removed, the type signature no longer poses a problem for follow-up optimizations. Note that the same tensor can still be sent to multiple threads without copying the underlying data; it only needs to be cloned before being sent. (#1575) @nathanielsimard
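
In practice, sending a tensor to another thread now looks like this (a minimal sketch using the ndarray backend; the clone is cheap because the underlying data is shared, not copied):

```rust
use burn::backend::NdArray;
use burn::tensor::Tensor;

fn main() {
    let device = Default::default();
    let tensor = Tensor::<NdArray, 1>::from_floats([1.0, 2.0, 3.0], &device);

    // Tensors are Send but no longer Sync: clone the handle before
    // moving it into the spawned thread.
    let sent = tensor.clone();
    let handle = std::thread::spawn(move || sent.sum().into_scalar());

    assert_eq!(handle.join().unwrap(), 6.0);
}
```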

Tensor

  • Support signed value for Tensor::arange #1238 @Nikaidou-Shinku
  • Add Tensor::unsqueeze_dims op (#1236) @skewballfox
  • Add support for Any, All operations to Tensor (#1342) @ashdtu
  • Add not_equal and not_equal_elem tensor ops (#1374) @laggui
  • Element wise min/max between a pair of tensors (#1385) @boondocklabs
  • Add is_close and all_close tensor operators (#1389) @antimora
  • Interpolate tensor operation (Inference Only) (#1246) @Nikaidou-Shinku @antimora @ashdtu
  • Autodiff/training support for Nearest Interpolation (#1414) @Nikaidou-Shinku @ashdtu @antimora
  • Add argwhere and nonzero boolean tensor ops (#1394) @laggui
  • Add bool() op for numerical tensor (#1402) @antimora
  • Tensor permute operator (#1410) @antimora
  • Add sign tensor operator (#1446) @antimora
  • Rename diagonal to eye tensor op and add missing entry for diagonal to Book tensor section (#1449) @antimora
  • Add prod and prod_dim tensor ops (#1460) @antimora
  • Add tril_mask, triu_mask and diag_mask ops (#1479) @antimora
  • Add flip tensor operator (#1468) @carrotflakes
  • Add tensor sorting operations (#1488) (#1494) @laggui
  • Add topk tensor operation (#1497) @laggui
  • Tensor expand operator (#1508) @antimora
  • Provide Tensor Padding Helpers (#960) (#1097) @jcmullwh @antimora
  • Move log_sigmoid to activation ops (#1558) @laggui
  • Add repeat autodiff and fusion support (#1600) @louisfd

Module

  • Feature Addition: PRelu Module (#1328) @Arjun31415
  • Implement Instance Normalization (#1321) @tushushu
  • Add enum module support (#1337) @laggui
  • Make the parameters of conv1d and conv2d public. (#1245) @Arjun31415
  • Parameters are now lazily initialized, so you no longer need to implement both the init and init_with(record) methods for training/inference (see the sketch after this list). (#1539) @nathanielsimard
  • Support multilabel binary cross entropy (#1571)
  • Implement Huber loss (#1444) @WorldSEnder
  • Feat: Add Leaky Relu Model (#1467) @Arjun31415
  • Feat: Add SwiGLU module (#1507) @ashdtu
  • Feat: Add rotary positional encoding to transformer modules (#1604) @ashdtu
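
With lazy initialization, a single init covers both training and inference. A minimal sketch using the built-in Linear module (the wrapper function is illustrative):

```rust
use burn::nn::{Linear, LinearConfig};
use burn::tensor::backend::Backend;

// One `init` for both flows: fresh parameters for training, or initialize
// and then overwrite from a record for inference.
fn build<B: Backend>(device: &B::Device) -> Linear<B> {
    LinearConfig::new(784, 256).init(device)
}

// For inference, load saved parameters afterward instead of an `init_with`:
// let layer = build::<B>(&device).load_record(record);
```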

Optimizer

  • Add linear learning rate scheduler (#1443) @astral4
  • Exponential learning rate scheduler (#1481) @rubenjr0
  • Cosine Annealing learning rate scheduler with cold restarts (#1481) @rubenjr0
  • Add Rank0 variant to AdaptorRecordV1 and AdaptorRecordItemV1 (#1442) @carrotflakes

Train

  • Add multi-label classification dataset and metric (#1572) @laggui
  • Add learner training summary (#1591) @laggui

Backend

This release also introduces the backend bridge, a new mechanism for runtime switching between backends. While an improvement over the previous approach, it remains compatible with earlier methods of supporting mixed precision. (#1529) @nathanielsimard

JIT

Significant effort has been devoted over the past few months to refactor the previous Wgpu backend into a shader-agnostic Just-in-Time backend. All lower-level dependencies have been abstracted into the Just-in-Time Runtime trait, requiring a compiler, compute server, and storage. The bulk of this work was carried out by @nathanielsimard and @louisfd.
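
Conceptually, a runtime bundles those three pieces. The sketch below is purely illustrative of the shape of the abstraction, not the actual burn-jit trait definition:

```rust
// Hypothetical sketch only; the real Runtime trait has more bounds and
// methods than shown here.
trait Runtime {
    /// Translates the JIT intermediate representation into shader code.
    type Compiler;
    /// Schedules and executes compiled kernels.
    type Server;
    /// Owns the raw buffers backing tensor handles.
    type Storage;
}
```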

Commits: #1274 #1280 #1313 #1340 #1356 #1359 #1378 #1391 #1396 #1398 #1417 #1429 #1423 #1424 #1433 #1456 #1474 #1457 #1480 #1472 #1493 #1509 #1530 #1528 #1541 #1550 #1569

Wgpu

  • Enable burn-fusion by default. (#1223) @nathanielsimard
  • Feature/autotune int ops (#1136) @agelas
  • Add runtime options in Wgpu init methods. (#1505) @nathanielsimard
  • Decent speedup of transposed convolution @louisfd

Autodiff

Extensive work has also been undertaken on Burn's autodiff backend. The backend now supports gradient checkpointing to reduce memory usage and has been refactored into a client/server architecture. These updates result in significantly less blocking when tracking gradients, enhancing performance, particularly on smaller models. Furthermore, various bugs have been fixed where some graph nodes weren't used, potentially truncating the autodiff graph. Overall, these changes make the autodiff process more reliable and efficient. (#1575) (#1358) @louisfd @nathanielsimard

Candle

  • Upgrade to Candle 0.4.1. (#1382) @laggui

Data

  • Add an image folder dataset implementation. (#1232) (#1132) @laggui
  • Add burn::data::network::downloader. (#1283) @laggui

Import

  • [PyTorchRecorder] Allow multiple pattern matches in chain. (#1269) @laggui
  • [PyTorchRecorder] Pytorch config extraction (#1323) @antimora
  • [PyTorchRecorder] Pass top-level key to extract state_dict (#1300) @antimora
  • [PyTorchRecorder] print debug option (#1425) @antimora
  • [PyTorchRecorder] Truncate debug display for NestedValue (#1428) @antimora
  • [PyTorchRecorder] Support for non-contiguous indexes in PyTorchFileRecorder keys (#1432) @antimora
  • [PyTorchRecorder] Add Enum module support (#1436) @antimora
  • [ONNX] Parser rewrite (#1296) @skewballfox
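
Putting the PyTorchRecorder additions together, loading a checkpoint looks roughly like this. A sketch: MyModelRecord stands in for your model's generated record type, and exact Recorder signatures vary across versions:

```rust
use burn::record::{FullPrecisionSettings, Recorder};
use burn::tensor::backend::Backend;
use burn_import::pytorch::{LoadArgs, PyTorchFileRecorder};

// `MyModelRecord` is hypothetical; in practice it comes from your module.
fn load_weights<B: Backend>() -> MyModelRecord<B> {
    let args = LoadArgs::new("model.pt".into())
        .with_top_level_key("state_dict") // see #1300
        .with_key_remap("conv\\.(.*)", "$1"); // regex key remapping, see #1269
    PyTorchFileRecorder::<FullPrecisionSettings>::default()
        .load(args)
        .expect("failed to decode PyTorch checkpoint")
}
```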

Benchmarks

We have implemented a system that enables the comparison of backends across a variety of tasks. Currently, most of these tasks consist of micro-benchmarks, but we plan to expand the range of benchmarks in the future. To ensure Burn's portability and performance across different devices, the community can run and upload benchmarks! πŸ”₯

  • Created the burnbench CLI. (#1260) @syl20bnr
  • Added GitHub authentication to the burnbench CLI. (#1285) @syl20bnr
  • Updated GitHub App ID with the official application. (#1397) @syl20bnr
  • Implemented benchmark upload functionality to the server. (#1381) @syl20bnr
  • Compiled benchmarks in a dedicated target directory. (#1435) @syl20bnr
  • Enhanced benchmark result presentation with a neat table and attempted to run every benchmark. (#1464) @akhildevelops
  • Improved access token refreshing and displayed authenticated user name. (#1483) @syl20bnr
  • Added system information to benchmark results. (#1495) @syl20bnr
  • Included Operating System information in benchmark results. (#1531) @syl20bnr
  • Fixed automatic fusion activation issue with Wgpu. (#1542) @syl20bnr
  • Tweaked and added kinds to Gelu benchmark names. (#1533) @syl20bnr
  • Ensured backend names in JSON reports match the burnbench CLI. (#1375) @errordeveloper @syl20bnr
  • Added 'all' choice to --benches and --backends options. (#1567) @syl20bnr
  • Revamped burnbench output for improved readability and compactness. (#1568) @syl20bnr
  • Added URL to browse results on the burn.dev website. (#1573) @syl20bnr

Bug Fix

  • Fix the pow backward pass when one of the tensors wasn't tracking the gradients. (#1225) (#1224) @nathanielsimard
  • Fix batch norm on the LibTorch backend when the aggregation was on the same device. (#1226) @nathanielsimard
  • Fix training dashboard metrics switch on macOS & Linux (#1228) @nathanielsimard
  • Fix a bug introduced in (#1138) where arithmetic could fail on usize type. (#1287) @louisfd
  • [PyTorchRecorder] Fix out of memory bug (#1270) (#1286) @antimora
  • [PyTorchRecorder] Fix chain pattern matching when multiple patterns are provided (#1273) @laggui
  • Fix LogEventStore end epoch log (#1314) @laggui
  • Huggingface dataset importer: check that pa_type is valid before checking if is_binary (#1354) @laggui
  • Fix implicit casting of bool in wgpu backend (#1391) @louisfd
  • Fix switched arguments in reshape_args_usize check (#1409) @jackdarlison
  • Fix tch view data corruption (#1434) @nathanielsimard
  • Missing Debug derive for Group Norm Config (#1482) @Arjun31415
  • Numerically stable log_sigmoid (#1548) @laggui
  • Fix pytorch recorder adapt_linear when using autodiff backend (#1576) @laggui

Infrastructure

The minimum Rust version has been updated to 1.75. (#1297) @syl20bnr

Docs

  • Improve the doc feature flags for docs.rs (#1212) @syl20bnr
  • Include the backends in the documentation (#1229) @nathanielsimard
  • Started the burn developer book. (#1184) @skewballfox @syl20bnr @antimora
  • Update TORCH_CUDA_VERSION usage. (#1284) @laggui
  • fix(book): add missing device parameter to model.init(). (#1302) @apertureless
  • fix(book): add missing second parameter to CrossEntropyLoss constructor (#1301) @apertureless
  • docs(book-&-examples): modify book and examples with new prelude module (#1372) @bioinformatist
  • Update tensor book (#1401) @antimora
  • Fix book MNIST reference (no more huggingface) (#1471) @laggui
  • Update SUPPORTED-ONNX-OPS.md (#1547) @antimora
  • Update book module (#1557) @antimora
  • Update pytorch-model.md (#1570) @antimora
  • Fixes to code examples in section 5.2 (#1594) @hrishim

CI

  • Add a semantic versioning checker. (#1219) @Luni-4
  • Simplify CI binaries updating. (#1235) @Luni-4
  • Trigger test suite when Cargo.lock file is updated (#1326) @syl20bnr
  • Fix codecov and update to weekly the .dependabot file for cargo (#1320) @Luni-4
  • Refactor xtask (#1288) @iamricks
  • Fix broken test and run-checks script (#1347) @antimora
  • Add stale action (#1383) @Luni-4
  • Update Cargo.lock workflow to trigger commit checks (#1399) @syl20bnr
  • Use GitHub's own action to generate GitHub App token (#1437) @syl20bnr
  • Add support for cargo metadata new workspace member format (#1500) @syl20bnr
  • Switch codecov to informational mode (#1540) @syl20bnr
  • Migrate workflows to use Blaze runners (#1596) @dcvz
  • Add a retry on adding ppa for kisak (#1599) @dcvz

Tests

  • Add NaN and Inf detection in assert_approx_eq to catch potential numerical bugs. (#1209) @skewballfox

Misc

  • Make all struct CamelCase (#1316) (#1311) @antimora
  • Move burn crates to their own crates directory (#1336) @syl20bnr
  • Add sub-crates as members of workspace (#1348) @antimora
  • Pytorch message updates (#1344) @antimora
  • Chore: update main README links to crate-specific READMEs (#1415) @ekalosak
  • [Wasm] remove exit in scripts (#1543) @AlexErrant
  • Use num-traits for float ops (#1584) @antimora

burn - v0.12.1

Bugfix

  • Fix wgpu performance issue: revert to wgpu 0.18.0 #1221 @nathanielsimard
  • Fix problem with batch norm on LibTorch backend #1226 @nathanielsimard
  • Fix docs build #1212 #1229 @syl20bnr @nathanielsimard
  • Fix training dashboard metrics switch #1228 @nathanielsimard

Chores

  • Put all dependency versions in the workspace #1210 @nathanielsimard

burn - v0.12.0

This release highlights an optimized Wgpu backend, clearer examples and documentation, and numerous bug fixes. Notably, breaking changes in device management mandate explicit device specification to prevent potential bugs. Additionally, the new PyTorch recorder simplifies model porting by enabling automatic import of PyTorch weights. We also put a lot of effort into improving our CI infrastructure for enhanced reliability, efficiency, and scalability.

Changes

Tensor & Module API

  • Added support for generic modules #1147 @nathanielsimard
  • Added support for tuple modules #1186 @varonroy
  • Enabled loading PyTorch .pt (weights/states) files directly into a module's record, currently available on Linux & macOS #1085 @antimora
  • Added mish and softplus activation functions #1071 @pacowong
  • Improved chunk performance in backends #1032 @Kelvinyu1117
  • [Breaking] Added the device as an argument for tensor operations that require it, replacing the previous optional device usage #1081 #518 #1110 @kpot
    • To update your code, either use Default::default for the previous behavior or specify the desired device; see the sketch after this list.
  • Allowed raw tensors to be serialized/deserialized directly with serde #1041 @jmacglashan
  • [Breaking] Forced the choice of the device for deserialization #1160 #1165 @nathanielsimard
  • Added element-wise pow operation #1133 @skewballfox
  • Refactored the tensor backend API names #1174 @skewballfox
  • [Breaking] Changed the default recorder to NamedMpkFileRecorder #1161 #1151 @laggui
    • After a bit of exploration, we removed compression entirely because it added too much overhead
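
A sketch of the device change mentioned above (signatures approximate):

```rust
use burn::tensor::{backend::Backend, Tensor};

// The device is now an explicit argument for allocating operations.
fn init<B: Backend>(device: &B::Device) -> Tensor<B, 2> {
    Tensor::zeros([2, 3], device)
}

// To keep the previous implicit behavior, pass the backend's default device.
fn init_default<B: Backend>() -> Tensor<B, 2> {
    Tensor::zeros([2, 3], &B::Device::default())
}
```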

Examples & Documentation

  • Updated the text-classification example #1044 @nathanielsimard
  • Fixed import and type redefinitions in mnist-web-inference #1100 @syl20bnr
  • Fixed documentation of Tensor::stack #1105 @PonasKovas
  • Fixed some typos in links in the burn-book #1127 @laggui
  • Added an example for a custom CSV dataset #1129 #1082 @laggui
  • Fixed missing ticks in Burn book and removed unused example dependency #1144 @laggui
  • Added a new example for regression problems #1150 #1148 @ashdtu
  • Added model saving and loading examples in the book #1164 #1156 @laggui
  • Added Rust concept notes and explanations to the Burn Book #1169 #1155 @laggui
  • Fixed jupyter notebook and ONNX IR example #1170 @unrenormalizable
  • Added a custom mnist dataset, removing the Python dependency for running the guide and the mnist example #1176 #1157 @laggui
  • Updated documentation and book sections on PyTorch import #1180 @antimora
  • Updated burn-book with improved tensor documentation #1183 #1103 @ashdtu
  • Updated burn-book with a new dataset transforms section #1183 #1154 @ashdtu
  • Update CONTRIBUTING.md with code guidelines. #1134 @syl20bnr
  • Fixed documentation of Multi Head Attention #1205 @ashdtu

Wgpu Backend

  • Optimized the repeat operation with a new kernel #1068 @louisfd
  • Improved reduce autotune by adding the stride to the autotune key #1070 @louisfd
  • Refactored binary operations to use the new JIT compiler IR #1078 @nathanielsimard
  • Added persistent cache for autotune #1087 @syl20bnr

Fusion

  • Refactored burn-fusion, making it possible to eventually save the JIT state #1104 @nathanielsimard
  • Improved fusion in the Wgpu backend with caching #1069 @nathanielsimard
  • Supported fusing int operations with burn-fusion #1093 @nathanielsimard
  • Supported automatic vectorization of operations fused with burn-fusion in WGPU #1123 #1111 @nathanielsimard
  • Supported automatically executing in-place operations fused with burn-fusion in WGPU #1128 #1124 @nathanielsimard
  • Heavily refactored burn-fusion to better reflect the stream optimization process #1135 @nathanielsimard
  • Heavily refactored burn-fusion to save all execution plans for any trigger #1143 @nathanielsimard
  • Supported multiple concurrent optimization streams #1149 #1117 @nathanielsimard
  • Supported overlapping optimization builders #1162 @nathanielsimard
  • Supported fusing ones, zeroes, and full operations #1159 @nathanielsimard
  • Supported autotuning fused element-wise kernels #1188 #1112 @nathanielsimard

Infra

  • Support testing accelerate (macOS) on the burn-ndarray backend #1050 @dcvz
  • Improved CI output by introducing groups #1024 @dcvz
  • Updated scheduled CI tasks #1028 @Luni-4
  • Added support for Windows Pipeline #925 @Luni-4
  • Fixed CI for testing the wgpu backend by pinning versions #1120 @syl20bnr
  • Fixed burn-compute build command with no-std #1109 @syl20bnr
  • Temporarily disabled unnecessary steps on Windows runners to save CI time #1107 @syl20bnr
  • Refactored serialization of backend comparison benchmarks #1131 @syl20bnr
  • Fixed doc build on docs.rs #1168 @syl20bnr
  • Added cargo xtask commands for dependencies and vulnerabilities checks #1181 #965 @syl20bnr
  • Added cargo xtask command to manage books #1192 @syl20bnr

Chore

  • Shared some properties across the cargo workspace #1039 @dcvz
  • Formatted the codebase with nightly where the stable version falls short #1017 @AlexErrant
  • Improved panic messages on the web #1051 @dcvz
  • Used web-time in wasm #1060 @sigma-andex
  • Refactored some tensor tests #1089 @nathanielsimard
  • Made Embedding weights public #1094 @unrenormalizable
  • Updated candle version and added support for slice_assign #1095 @louisfd
  • Records no longer require Debug and Clone #1137 @nathanielsimard
  • Removed cargo warning #1108 @syl20bnr
  • Updated wgpu version to 0.19.0 #1166 @nathanielsimard
  • Added tests for Slice assign vs Cat in LSTM backward #1146 @louisfd
  • Updated xtask publish task #1189 @Luni-4
  • Enable dependabot daily #1195 @Luni-4
  • Updated Ratatui version #1204 @nathanielsimard
  • Updated tch version #1206 @laggui

Bug Fixes

  • Fixed a slice issue in the LibTorch backend that could corrupt tensors' data #1064 #1055 @nathanielsimard
  • Fixed issues with tensor stack and reshape on ndarray #1053 #1058 @AuruTus
  • Fixed multithread progress aggregation in dataloader #1083 #1063 @louisfd
  • Resolved a numerical bug with tanh on macOS with Wgpu #1086 #1090 @louisfd
  • Fixed burn-fusion, where only operations followed by a sync were fused #1093 @nathanielsimard
  • Removed the requirement for users to add serde as a dependency for Burn #1091 @nathanielsimard
  • Fixed transformer prenorm on the residual path #1054 @Philonoist
  • Fixed conv2d initialization by supporting fan_out #1138 @laggui
  • Resolved the problem of sigmoid gradient generating NaN #1140 #1139 @wcshds
  • Fixed FullPrecisionSettings type for integers #1163 @laggui
  • Fixed batchnorm not working properly when training on multiple devices #1167 @wcshds
  • Fixed powf function in WGPU, with new tests #1193 #1207 @skewballfox @louisfd
  • Fixed regex in PyTorch Recorder #1196 @antimora

burn - v0.11.1

Burn v0.11.1 fixes a few bugs in the recent v0.11.0.

Bugfixes

  • Fix concurrency issue in burn-fusion, related to freeing tensors that are never read @nathanielsimard
  • Fix typos in the book @shanmo
  • Fix Readme @nathanielsimard
  • Fix docs build @dcvz

Thanks

Thanks to all the aforementioned contributors.

burn - v0.11.0

The main feature of Burn v0.11.0 is automatic kernel fusion, which is still in active development but already usable. Many enhancements and new features have been added throughout the framework for better efficiency and reliability.

Warnings:

  • There are some breaking changes, see below.
  • The organization has been renamed from burn-rs to tracel-ai.

Changes

Overall changes

  • [Breaking] Refactor backend names @nathanielsimard

  • [Breaking] Updated the feature flags of burn to improve usability @nathanielsimard

  • Update of Burn's Readme @nathanielsimard @louisfd

Burn Fusion

  • Innovative automatic kernel fusion algorithm @nathanielsimard

  • Relative computation graph cache @nathanielsimard

Burn Core

  • GroupNorm module @dcvz

  • Allow for int and bool constant tensors in modules @nathanielsimard

  • Quiet softmax in transformers @wbrickner

Burn Tensor

  • New operators in tensor API: unsqueeze_dim, narrow, stack, chunk, tril, triu @dcvz

  • Recip operation support on all backends @gzsombor

  • Implement DoubleEndedIterator for DimIter @wcshds

Burn Compute

  • Major Autotune refactor @louisfd

Burn Import

  • ONNX Support for Gather @CohenAriel

  • ONNX Support for Cos, Exp, Gelu, Log, Neg @antimora

  • ONNX Support for ConvTranspose2D @npatsakula @antimora

  • ONNX Support for Sqrt @edmondop

  • Support count_include_pad attr in avg_pool2d ONNX @antimora

Burn Train

  • Add warmup consideration for estimated training time @nathanielsimard

Burn WGPU

  • New Matmul kernels @louisfd

  • New Reduce kernel @louisfd

  • Add Autotune capabilities to Matmul and Reduce @louisfd

  • Support of kernel fusion for element-wise operations @nathanielsimard @louisfd

Burn Candle

  • Support convtranspose1d @louisfd

  • Enable accelerate for macOS CPU @dcvz

Backend Comparison

  • Custom Gelu benchmarks @nathanielsimard

  • Persistence of results in json @louisfd

Bugfixes

  • Allow arbitrary precision threshold for float equality assertion @meteor-lsw

  • Update serde_rusqlite to the new version with MIT/Apache2 license @antimora

  • Fix SQLite database tests on Windows @syl20bnr

  • Fix max_dim and min_dim tensor operations @gzsombor

  • Fix inplace double binary broadcasting in the LibTorch backend @nathanielsimard

Documentation

  • Add Python details in the Book's getting started @antimora

  • Miscellaneous Book fixes @syl20bnr @mks-h

Continuous Integration

  • Add test coverage @Luni-4

  • Speedup typos check @Luni-4

  • Dependency checks @Luni-4

  • Vulnerability checks @Luni-4

Thanks

Thanks to all the aforementioned contributors.

burn - v0.10.0

Burn v0.10.0 sees the addition of the burn-compute crate to simplify the process of creating new custom backends, a new training dashboard and the possibility of using the GPU in the browser along with a web demo. Additionally, numerous new features, bug fixes, and CI improvements have been made.

Warning: there are breaking changes, see below.

Changes

Burn Compute

  • Introduction of burn-compute, a new Burn crate making it easier to create async backends with custom kernels. @nathanielsimard, @louisfd

  • Add new memory management strategies @louisfd, @nathanielsimard

  • Add autotune capabilities @louisfd

Burn Import

  • Add more ONNX record types @antimora

  • Support no-std for ONNX imported models @antimora

  • Add custom file location for loading record with ONNX models @antimora

  • Support importing erf operation to ONNX @AuruTus

Burn Tensor

  • Add covariance and diagonal operations @ArvidHammarlund

  • [Breaking] Reading operations are now async when compiling to wasm, except when the wasm-sync feature is enabled. @nathanielsimard @AlexErrant

  • [Breaking] Improved Clamp API @nathanielsimard

  • Add unfold tensor operation @agelas, @nathanielsimard

  • Improve tensor display implementation with ellipsis for large tensors @macroexpansion

Burn Dataset

  • Improved speed of SQLite Dataset @antimora

  • Use gix-tempfile only when sqlite is enabled @AlexErrant

Burn Common

  • Add benchmark abstraction @louisfd

  • Use thread-local RNG to generate IDs @dae

Burn Autodiff

  • Use AtomicU64 for node ids improving performance @dae

Burn WGPU

  • Enable non-blocking reads when compiling to wasm to fully support WebGPU @nathanielsimard

  • Add another faster matmul kernel @louisfd

  • [Breaking] Massive refactor to use burn-compute @nathanielsimard

Burn Candle

  • Candle backend is now available as a crate and updated with Candle advances @louisfd @agelas

Burn Train

  • New training CLI dashboard using ratatui @nathanielsimard

  • [Breaking] Heavy refactor of burn-train making it more extensible and easier to work with @nathanielsimard

  • Checkpoints can be customized with criteria based on collected metrics @nathanielsimard

  • Add the possibility to do early stopping based on collected metrics @nathanielsimard

Examples

  • Add image classifier web demo using different backends, including WebGPU @antimora

Bugfixes

  • Epoch and iteration were swapped. (#838) @daniel-vainsencher

  • RNN (Gru & LSTM) were not generic over the batch size @agelas, @EddieMataEwy

  • Other device adaptors in WGPU were ignored when the best available device was used @chistophebiocca

Documentation

  • Update book @nathanielsimard

  • Doc improvements with std feature flag @ArvidHammarlund

Chores

  • Update all dependencies @antimora

  • Lots and lots of CI Improvements with coverage information @Luni-4, @DrChat, @antimora, @dae, @nathanielsimard

Thanks

Thanks to all the aforementioned contributors and to our sponsors @smallstepman, @0x0177b11f and @premAI-io.

burn - v0.9.0

Burn v0.9.0 sees the addition of the Burn Book, a new model repository, and many new operations and optimizations.

Burn Book

The Burn Book is available at https://burn-rs.github.io/book/

  • Burn Book setup and plan @nathanielsimard @wdoppenberg @antimora
  • Motivation & Getting started @louisfd @nathanielsimard
  • Basic Workflow: from training to inference @nathanielsimard @louisfd
  • Building blocks @nathanielsimard
  • ONNX models @antimora
  • Advanced sections @nathanielsimard

Model repository

The Model repository is available at https://github.com/burn-rs/models

  • Setup @nathanielsimard
  • Add SqueezeNet @antimora
  • Multiple models made with Burn @gadersd
    • Llama 2
    • Whisper
    • Stable Diffusion v1.4

Changes to Burn

Neural networks

  • Three new optimizers
    • AdamW @wdoppenberg
    • AdaGrad @CohenAriel
    • RMSProp @AuruTus
  • Custom initializer for transformer-related modules @wbrickner
  • Cross Entropy with label smoothing and weights @ArvidHammarlund

Tensors

  • Many new operators
    • cast @trfdeer @nathanielsimard
    • clamp, clamp_min, clamp_max @antimora
    • abs @mmalczak
    • max_pool1d, max_pool with dilation @caiopiccirillo
    • adaptive_avg_pool 1d and 2d @nathanielsimard
    • conv_transpose 1d and 2d, with backward @nathanielsimard
    • Not operator @louisfd
    • Dim iterator @ArvidHammarlund
  • More tests for basic tensor ops @louisfd

Training

  • New training metrics @Elazrod56
    • CPU temperature and use
    • GPU temperature
    • Memory use
  • Custom training and validation metric loggers @nathanielsimard
  • Migration from log4rs to tracing, better integration in a GUI app @dae
  • Training interruption @dae
  • New custom optimize method @nathanielsimard

Backends

  • WGPU backend
    • Autotune @louisfd @nathanielsimard
    • Cache optimization @agelas
    • Pseudo-random number generator @louisfd
    • Fix configs @nathanielsimard
    • Matmul optimization @louisfd
  • ndarray backend
    • Optimization of argmin/argmax @DrChat
    • Optimization of conv2d @DrChat
  • Candle backend @louisfd
    • Support for all basic operations
    • Work in progress

Dataset

  • Option for with or without replacement in dataset sampler @nathanielsimard

Import & ONNX

  • Refactor, performance, tests and fixes @antimora @Luni-4 @nathanielsimard, @gadersd
  • New operators @Luni-4 @antimora @AuruTus
    • Reshape
    • Transpose
    • Binary operators
    • Concat
    • Dropout
    • Avg pool
    • Softmax
    • Conv1d, Conv2d
    • Scalar and constants
    • tanh
    • clip

Fix

  • Hugging Face downloader Windows support @Macil
  • Fix grad replace and autodiff backward broadcast @nathanielsimard
  • Fix processed count at learning completion @dae
  • Adjust some flaky tests @dae
  • Ability to disable experiment logging @dae

Configuration

  • Rewrite publish and checks scripts in Rust, with cargo-xtask @luni-4 @DrChat
  • Add Typos verification to checks @caiopiccirillo @antimora
  • Checks for Python and venv environment @mashirooooo
  • Feature flags for crates in different scenarios @dae

Documentation

  • Configuration doc for vscode environment setup @caiopiccirillo
  • Jupyter notebook examples @antimora
  • Readme updated @louisfd

Thanks

Thanks to all the aforementioned contributors and to our sponsors @smallstepman and @premAI-io.

burn -

In this release, our main focus was on creating a new backend using wgpu. We greatly appreciate the meaningful contributions made by the community across the project. As usual, we have expanded the number of supported operations.

Changes

Tensor

  • Added Max/Minimum operation @nathanielsimard
  • Added average pooling 1D operation @nathanielsimard
  • Added Gather/Scatter operations @nathanielsimard
  • Added Mask Where operation @nathanielsimard
  • Refactor index-related operations @nathanielsimard
    • index, index_assign => slice, slice_assign
    • index_select, index_select_assign => select, select_assign
  • New syntax sugar for transpose @wbrickner
  • Added SiLU activation function @poxxy

Dataset

  • Added a dataset using Sqlite for storage. Now used to store huggingface datasets. @antimora
  • New speech command audio dataset. @antimora
  • Create python virtual environment for huggingface dependencies. @dengelt

Burn-Import

  • Big refactor to make it easier to support new operations. @nathanielsimard
  • Support bool element type. @maekawatoshiki
  • Added Add operator. @luni-4
  • Added MaxPool2d operator. @luni-4
  • Parse convolution 2D config. @luni-4
  • Added sigmoid operation. @luni-4

Backend

  • New burn-wgpu backend πŸ”₯! @nathanielsimard @louisfd
    • Tile 2D matrix multiplication
    • All operations are supported
  • Improve performance of repeat with the tch backend. @nathanielsimard

Neural Networks

  • Added LSTM module. @agelas
  • Added GRU module. @agelas
  • Better weights initialization with added support for Xavier Glorot. @louisfd
  • Added MSE loss. @bioinformatist
  • Cleanup padding for convolution and pooling modules. @luni-4
  • Added sinusoidal positional embedding module. @antimora

Fix

  • Deserialization of constant arrays. @nathanielsimard
  • Concat backward with only one dim. @nathanielsimard
  • Conv1d stride hardcoded to 1. @antimora
  • Fix arange with the tch backend. @nathanielsimard

Documentation

  • Improve documentation across the whole project β™₯! @antimora

Thanks

Thanks to all contributors and to the sponsor @smallstepman.

burn -

Serialization

Serialization has been completely revamped since the last release. Modules, optimizers, and learning rate schedulers now have an associated type, allowing them to determine the type used for serializing and deserializing their state. The solution is documented in the new architecture doc.

State can be saved with any precision, regardless of the backend in use. Precision conversion is performed during serialization and deserialization, ensuring high memory efficiency since the model is not stored twice in memory with different precisions.

All saved states can be loaded from any backend. The precision of the serialized state must be set correctly, but the element types of the backend can be anything.

Multiple (de)serialization recorders are provided:

  • Default (compressed gzip with named message pack format)
  • Bincode
  • Compressed gzip bincode
  • Pretty JSON

Users can extend the current recorder using any serde implementation.

Multiple precision settings are available:

  • Half (f16, i16)
  • Full (f32, i32)
  • Double (f64, i64)

Users can extend the current settings using any supported number type.

Optimizer

The optimizer API has undergone a complete overhaul. It now supports the new serialization paradigm with a simplified trait definition. The learning rate is now passed as a parameter to the step method, making it easier to integrate the new learning rate scheduler. The learning rate configuration is now a part of the learner API. For more information, please refer to the documentation.

Gradient Clipping

You can now clip gradients by norm or by value. Gradient clipping is integrated with optimizers and can be configured from the optimizer configs (Adam & SGD).
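
For example, configuring clipping from the Adam config might look like this (a sketch: type and setter names are assumed from later versions of the API and may differ):

```rust
use burn::grad_clipping::GradientClippingConfig;
use burn::optim::AdamConfig;

fn main() {
    // Sketch: clip gradients by L2 norm; GradientClippingConfig::Value(v)
    // would clip element-wise instead. Names assumed, not verified here.
    let config = AdamConfig::new()
        .with_grad_clipping(Some(GradientClippingConfig::Norm(1.0)));
    let _ = config;
    // In a training loop, the learning rate is then passed to each step,
    // roughly: let model = optimizer.step(lr, model, grads);
}
```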

Learning Rate Scheduler

A new trait has been introduced for creating learning rate schedulers. This trait follows a similar pattern as the Module and Optimizer APIs, utilizing an associated type that implements the Record trait for state (de)serialization.

The following learning rate schedulers are now available:

  • Noam learning rate scheduler
  • Constant learning rate scheduler

Module

The module API has undergone changes. There is no longer a need to wrap modules with the Param struct; only the Tensor struct requires a parameter ID.

All modules can now be created with their configuration and state, eliminating the unnecessary tensor initializations during model deployment for inference.
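
A sketch of what a module looks like under the new API (the struct itself is illustrative):

```rust
use burn::module::{Module, Param};
use burn::tensor::{backend::Backend, Tensor};

// Only tensors are wrapped in `Param` (which carries the parameter ID);
// the module is a plain struct deriving `Module`.
#[derive(Module, Debug)]
struct MyLinear<B: Backend> {
    weight: Param<Tensor<B, 2>>,
    bias: Param<Tensor<B, 1>>,
}
```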

Convolution

Significant improvements have been made to support all convolution configurations. The stride, dilation, and groups can now be set, with full support for both inference and training.

Transposed convolutions are available in the backend API but do not currently support the backward pass. Once they are fully supported for both training and inference, they will be exposed as modules.

Pooling

The implementation of the average pooling module is now available.

Transformer

The transformer decoder has been implemented, offering support for efficient inference and autoregressive decoding by leveraging layer norms, position-wise feed forward, self-attention, and cross-attention caching.

Tensor

The developer experience of the Tensor API has been improved, providing more consistent error messages across different backends for common operations. The Tensor struct now implements Display, allowing values, shape, backend information, and other useful details to be displayed in an easily readable format.

New operations

  • The flatten operation
  • The mask scatter operation

Torch Backend

The Torch backend now supports bf16.

ONNX

The burn-import project now has the capability to generate the required Burn code and model state from an ONNX file, enabling users to easily import pre-trained models into Burn. The code generation utilizes the end user API, allowing the generated model to be fine-tuned and trained using the learner struct.

Please note that not all operations are currently supported, and assistance from the community is highly appreciated. For more details, please refer to the burn-import repository https://github.com/burn-rs/burn/tree/main/burn-import.
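
As a sketch, code generation is typically wired into a build script along these lines (ModelGen as documented for burn-import; the paths are placeholders):

```rust
// build.rs
use burn_import::onnx::ModelGen;

fn main() {
    ModelGen::new()
        .input("src/model/mnist.onnx") // placeholder path to your ONNX file
        .out_dir("model/")             // destination of the generated code
        .run_from_script();
}
```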

Bug Fixes

  • Backward pass issue when there is implicit broadcasting in add https://github.com/burn-rs/burn/issues/181

Thanks πŸ™

Thanks to all contributors @nathanielsimard, @antimora, @agelas, @bioinformatist, @sunny-g. Thanks to current sponsors: @smallstepman.

burn -

Backend API

  • Almost all tensor operations now receive owned tensors instead of references, which enables backend implementations to reuse tensor-allocated memory.
  • Backends now have a different type for their int tensor, with its own set of operations.
  • Removed the IntegerBackend type.
  • Simpler Element trait with fewer functions.
  • New index-related operations (index_select, index_select_assign, index_select_dim, and index_select_dim_assign).

Tensor API

  • The Tensor struct now has a third generic parameter Kind with a default value of Float.
  • There are three kinds of tensors: Float, Bool, and Int.
    • Float Tensor β‡’ Tensor<B, D> or Tensor<B, D, Float>
    • Bool Tensor β‡’ Tensor<B, D, Bool>
    • Int Tensor β‡’ Tensor<B, D, Int>
  • You still don't have to import any trait to use these functions, but they now carry an extra constraint based on the kind of tensor, so you can't call matmul on a bool tensor. All of this is achieved without a single match or if statement, just pure zero-cost abstraction.
  • The BoolTensor struct has been removed.
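
A minimal sketch of the kind parameter (using today's API):

```rust
use burn::tensor::{backend::Backend, Bool, Int, Tensor};

// The third generic parameter selects the kind; Float is the default.
fn kinds<B: Backend>(x: Tensor<B, 2>, mask: Tensor<B, 2, Bool>, idx: Tensor<B, 2, Int>) {
    let _y = x.clone().matmul(x); // fine: matmul is defined for Float tensors
    // mask.matmul(mask);         // would not compile: no matmul on Bool
    let _ = (mask, idx);
}
```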

Autodiff

  • Tensors are no longer tracked by default; you now have to call require_grad.
  • The state is not always captured: operations have to manually clone the state they need for their backward step. This results in a massive performance enhancement.
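
A minimal sketch of opt-in tracking (today's API; creation signatures differ in older releases):

```rust
use burn::tensor::{backend::AutodiffBackend, Tensor};

// Tensors are untracked unless `require_grad` is called.
fn grad_demo<B: AutodiffBackend>(device: &B::Device) {
    let x = Tensor::<B, 2>::ones([2, 2], device).require_grad();
    let y = (x.clone() * 3.0).sum();
    let grads = y.backward();  // gradients exist only for tracked tensors
    let _dx = x.grad(&grads);  // Some(...) here; None for untracked tensors
}
```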

No Std

  • Some Burn crates don't require std anymore, which enables them to run on any platform:
    • burn-core
    • burn-ndarray
    • burn-common
    • burn-tensor
  • We have a WebAssembly demo with MNIST inference. The code is also available, with a lot of details explaining the process of compiling a model to WebAssembly.

Performance

  • The Tch backend now leverages in-place operations.
  • The NdArray backend now leverages in-place operations.
  • The convolution and maxpooling layers in the NdArray backend have been rewritten with much better performance.
  • The cross-entropy loss module leverages the new index_select operation, resulting in a big performance boost when the number of classes is high.

And of course, a lot of fixes and enhancements everywhere.

Thanks to all the contributors for their work @antimora @twitchax @h4rr9

burn -

New Modules for Vision Tasks

  • Conv1D and Conv2D, currently without support for stride, dilation, or group convolution
  • MaxPool2D
  • BatchNorm2D

New General Tensor Operations

  • log1p thanks to @bioinformatist
  • sin, cos, tanh thanks to @makroiss

Breaking Changes

  • Devices are now passed by reference, thanks to feedback from @djdisodo.
  • The shape function now returns an owned struct, and backends no longer need to cache each shape.

burn -

  • Separated backend crates
