Recent Releases of flash-linear-attention

flash-linear-attention - v0.3.1

What's Changed

  • [Misc] Change grid to support long ctx by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/528
  • [RWKV7] Reduce CPU overhead by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/529
  • [Tokenshift] Support SP and cache by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/531
  • [RWKV7] Use tokenshift to save cache by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/532
  • [RWKV7] Fix the issue of RWKV7 initialization with BFloat16 data type on CPU. by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/538
  • [CI] Add compatibility check by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/536
  • [ShortConv] Support cache in prefill by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/535
  • [WIP] Add Log-Linear Attention by @2022tgoel in https://github.com/fla-org/flash-linear-attention/pull/524
  • [Cache] Upgrade to transformers >= v4.48 [skip test] by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/541
  • [Misc.] Set env var TRITON_F32_DEFAULT to ieee when tf32 is not supported on NVIDIA by @KevlarKanou in https://github.com/fla-org/flash-linear-attention/pull/544
  • [CI] Fix mirror for building triton by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/543
  • Log-Linear Attention Tests by @2022tgoel in https://github.com/fla-org/flash-linear-attention/pull/542
  • [CI] Add proxy config for git by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/548
  • [Conv] Fix warning issue by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/549
  • [Misc.] Eliminate recompilation in layer-norm kernels caused by dynam… by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/545
  • [Misc.] Add activations for non-cuda Backends by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/174
  • [TMA] Accelerate solve_tril with TMA descriptors[skip test] by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/550
  • [CI] Upgrade to latest causal-conv1d and fix triton build for 3.4.x by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/551
  • [CI] Fix support for Intel GPU by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/554
  • [Fix] Fix Triton Error for HeadDim < 16[skip test] by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/556
  • [GLA] Fix simple_gla Test by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/558
  • [CI] Fix CI script errors[skip test] by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/566
  • require transformers <= 4.53.3 by @richardodliu in https://github.com/fla-org/flash-linear-attention/pull/570
  • [Deps] Adopt transformers>4.53.3 by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/571
  • [Misc.] Clean codes and make mypy happy by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/572
  • [Models]: Add MoM by @WKX933 in https://github.com/fla-org/flash-linear-attention/pull/442
  • [MoM] Fix lint by @JusenD in https://github.com/fla-org/flash-linear-attention/pull/573
  • [Refactor] Apply GradientCheckpointingLayer to all model layers by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/575
  • [Mamba] Fix errors in Triton backend by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/576
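One entry above sets the Triton environment variable TRITON_F32_DEFAULT to ieee on NVIDIA GPUs without tf32 support (PR #544). As a hedged illustration of what that automation does, the same override can be applied manually; the variable must be set before any Triton kernels are compiled:

```python
import os

# Hypothetical manual override (what the fla change automates internally):
# force IEEE-compliant fp32 math in Triton kernels on hardware without
# tf32 support. Set this before importing/compiling any Triton kernels.
os.environ["TRITON_F32_DEFAULT"] = "ieee"
```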

New Contributors

  • @2022tgoel made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/524
  • @KevlarKanou made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/544
  • @richardodliu made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/570
  • @WKX933 made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/442

Full Changelog: https://github.com/fla-org/flash-linear-attention/compare/v0.3.0...v0.3.1

- Python
Published by yzhangcs 6 months ago

flash-linear-attention - v0.3.0

Highlights

🧠 New Models

We are excited to expand our model library with the addition of four powerful new architectures.

  • 🎉 MesaNet by @sustcsonglin
  • 🛣️ PaTH by @sustcsonglin
  • 🐍 Comba by @AwesomeSeq @yzhangcs
  • 🐳 MLA by @toothacher17 @yzhangcs

What's Changed

  • [MesaNet] add kernel impl. by @sustcsonglin in https://github.com/fla-org/flash-linear-attention/pull/419
  • [GDN] Add support for inference with GVA by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/429
  • [HGRN] remove unused q_conv1d by @yibozhong in https://github.com/fla-org/flash-linear-attention/pull/430
  • Update mesa_net.py by @jovoswald in https://github.com/fla-org/flash-linear-attention/pull/434
  • [Gated DeltaNet] Refactor the kernel to remove one matrix inversion by @sustcsonglin in https://github.com/fla-org/flash-linear-attention/pull/433
  • [Modules] Add L2Warp to maintain bf16 precision by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/438
  • [RWKV]: Set default scale to None by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/445
  • [Typos] Change scale docs to (Optional[float]) [skip test] by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/446
  • [Modules] Enhance Testing of l2warp by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/448
  • [CI] Upgrade CI envs to torch~=2.7.0 by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/450
  • [Mesa] misc. fix by @sustcsonglin in https://github.com/fla-org/flash-linear-attention/pull/449
  • [Models]: Add Comba Implementation by @AwesomeSeq in https://github.com/fla-org/flash-linear-attention/pull/444
  • [Test] Work around the bug of causal_conv1d by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/453
  • [Utils] Add deprecation handling for kwargs with deprecate_kwarg decorator by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/455
  • [ShortConv] Replace use_fast_conv1d with backend parameter by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/456
  • [Docs] Update tensor shape descriptions and deprecate head_first argument by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/457
  • [Simple GLA] Support dg when dht passed by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/459
  • [Mesa] Improve precision by @sustcsonglin in https://github.com/fla-org/flash-linear-attention/pull/460
  • [Comba] Remove problematic safe_exp by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/466
  • [TokenShift] Fix invalid argument on AMD GPUs by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/464
  • [Test] Refactor model testing [skip test] by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/467
  • [Testing] Enhance generation testing by @sustcsonglin in https://github.com/fla-org/flash-linear-attention/pull/468
  • [Simple GLA] Remove unnecessary dg for data-independent decay by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/469
  • [CI] Update workflow by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/473
  • [Misc.] Enhance support for some platforms by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/470
  • [Gated Delta Product] Optimize kernels by @sustcsonglin in https://github.com/fla-org/flash-linear-attention/pull/472
  • [README] Add support for aarch64 by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/475
  • [Cache] Fix bad seen_tokens update by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/478
  • [CI] Revert causal-conv1d to 2a288a1 by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/480
  • [Parallel] Fix all tokens offsets by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/479
  • Use tl.exp2 for all gating operations by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/361
  • Refactor modeling tests by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/482
  • Add L2_norm for p in Recurrent ops to fix generation error by @AwesomeSeq in https://github.com/fla-org/flash-linear-attention/pull/483
  • Refactor benchmark: adapt to latest FLA benchmark interface by @yuweih205 in https://github.com/fla-org/flash-linear-attention/pull/488
  • [GLA] Remove all safe_exp ops by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/489
  • [MesaNet] Remove all safe_exp ops & Refactor tests by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/490
  • [Misc.] Support PT2.5 by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/491
  • [Misc.] Fast testing & Autotune by @sustcsonglin in https://github.com/fla-org/flash-linear-attention/pull/476
  • fix: update import path for causal_conv1d by @yuweih205 in https://github.com/fla-org/flash-linear-attention/pull/492
  • Make RWKV-7 init match official RWKV-LM by @johanwind in https://github.com/fla-org/flash-linear-attention/pull/493
  • Modernize the fused_chunk impls by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/437
  • [ShortConv] Fix bad conv weight input shape during inference by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/495
  • [DeltaProduct] chore: remove unused functions by @timurcarstensen in https://github.com/fla-org/flash-linear-attention/pull/496
  • [CI] Fix pipeline in GPU CIs by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/497
  • [RWKV] Make torch.compile decorator compatible with python3.10 by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/498
  • [GDN] Fuse 64x64 matrix inverse kernel by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/501
  • [L2Norm] Speedup by saving rstd by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/506
  • [Norm] Move eps out of sqrt by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/508
  • Correct types of constructor arguments with issues for configuration classes by @V0XNIHILI in https://github.com/fla-org/flash-linear-attention/pull/509
  • Fix typo: suppoerted -> supported by @zxytim in https://github.com/fla-org/flash-linear-attention/pull/510
  • [RWKV7] Increase Lora shape for headdim>64 by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/512
  • [Delta Rule] Support gk for WY reprs by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/514
  • [PaTH attention] Support headdim 128 & refactor kernel for better stability by @sustcsonglin in https://github.com/fla-org/flash-linear-attention/pull/503
  • [Rotary] Fix max_seqlen under varlen mode by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/516
  • [Misc] Skip testing models on Nvidia 4090 CI by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/517
  • [GDP] Delete duplicated code by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/518
  • [WIP] Add MLA layers into fla by @toothacher17 in https://github.com/fla-org/flash-linear-attention/pull/395
  • [Mamba] Add triton conv1d backend and fix mamba2 test by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/520
  • [Typo] Fix types in all configuration files[skip test] by @V0XNIHILI in https://github.com/fla-org/flash-linear-attention/pull/513
  • [GSA] Fix memory boundary conditions by @JusenD in https://github.com/fla-org/flash-linear-attention/pull/527

New Contributors

  • @jovoswald made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/434
  • @AwesomeSeq made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/444
  • @yuweih205 made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/488
  • @V0XNIHILI made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/509
  • @zxytim made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/510
  • @toothacher17 made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/395
  • @JusenD made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/527

Full Changelog: https://github.com/fla-org/flash-linear-attention/compare/v0.2.2...v0.3.0

Published by yzhangcs 8 months ago

flash-linear-attention - v0.2.2

What's Changed

  • [TokenShift] support fused_token_shift with varlen by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/373
  • [Mamba] Use official init strategies by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/374
  • [Mamba2] Create attn layer by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/375
  • [Mamba] Add attn layer & fix configs by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/376
  • [RWKV7] Update fused_addcmul impls by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/378
  • [RWKV7]: Rewrite docs to match Triton codes. by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/381
  • [RWKV7] Fix convert script by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/383
  • [Misc.] Update triton-nightly.yml by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/382
  • [PaTH] Add PaTH attention model and kernel by @sustcsonglin in https://github.com/fla-org/flash-linear-attention/pull/384
  • [Tests] Enable tests with causal_conv1d on H100 CIs by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/385
  • [GDN]: initializing A_log and dt_bias in _init_weights by @HanGuo97 in https://github.com/fla-org/flash-linear-attention/pull/380
  • [Utils] Add fused pack/unpack fns by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/386
  • [RWKV7] Strictly initialize rwkv7 according to RWKV-LM by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/387
  • [chore] switched to processing_class kwarg inside Trainer invocation by @timurcarstensen in https://github.com/fla-org/flash-linear-attention/pull/391
  • [RWKV7] Update initialization to sync with latest RWKV-LM by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/393
  • [Token Shift]: Fix potential cuda kernel parameter error for varlen by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/397
  • [DeltaProduct] fix query conv cache, remove extraneous query convs by @timurcarstensen in https://github.com/fla-org/flash-linear-attention/pull/396
  • [Misc.] Log warnings when Triton is older than 3.2.0 by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/394
  • [RWKV7]: clean fused_addcmul_rwkv7 impls by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/404
  • [README] Update FoX venue info by @zhixuan-lin in https://github.com/fla-org/flash-linear-attention/pull/406
  • Added details to some formulas, fixed the display error of the L2 Loss formula by @Beortext in https://github.com/fla-org/flash-linear-attention/pull/407
  • [RWKV7] Change fp32 errors to warnings by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/412
  • [Misc.] Add exist_ok=True to all models by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/413
  • Add Rodimus impl into fla by @ziHoHe in https://github.com/fla-org/flash-linear-attention/pull/416
  • Align RWKV7 LoRA Rank Initialization with official Implementation by @WuTianyi321 in https://github.com/fla-org/flash-linear-attention/pull/418
  • [Canon] Add triton impls by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/388
  • [GDN] Support Gated Value Attention (GVA) by @Rafa-zy in https://github.com/fla-org/flash-linear-attention/pull/421
  • [RWKV7]: clean some imps by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/420
  • [RoPE] Fix out-of-boundary bugs by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/423
  • [RWKV] Fix cu_seqlens with gradient checkpoint by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/422

New Contributors

  • @timurcarstensen made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/391
  • @ziHoHe made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/416
  • @WuTianyi321 made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/418
  • @Rafa-zy made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/421

Full Changelog: https://github.com/fla-org/flash-linear-attention/compare/v0.2.1...v0.2.2

Published by yzhangcs 9 months ago

flash-linear-attention - v0.2.1

Highlights

🚀 Performance Boost for DeltaNet

We've achieved a notable performance improvement for (Gated) DeltaNet models. The optimization focused on the fused LayerNormGated layer, particularly for small head dims, and yields a 1.1x speedup.
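For intuition, the layer in question normalizes the per-head output and applies a learned gate before the output projection. A minimal unfused reference (illustration only; fla's actual layer is a fused Triton kernel supporting both layer-norm and RMS-norm modes with sigmoid or swish gates):

```python
import math

def rms_norm_gated(x, g, weight, eps=1e-5):
    """Unfused reference sketch: y_i = x_i / rms(x) * weight_i * swish(g_i).

    x: input vector, g: gate vector, weight: learned scale.
    swish(g) = g * sigmoid(g). Names and exact formula are assumptions for
    illustration, not fla's API.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [xi / rms * wi * (gi / (1.0 + math.exp(-gi)))
            for xi, wi, gi in zip(x, weight, g)]
```

Fusing the normalization, gate, and scale into one kernel avoids materializing intermediates, which is where the benefit for small head dims comes from.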

Below are benchmarks for 1B-parameter models on 4k sequences in varlen mode, using a single H100 GPU:

|                   | TPS (K tokens/s) |
| ----------------- | :--------------: |
| Transformer++     |       53.8       |
| DeltaNet (before) |       48.6       |
| DeltaNet (after)  |       54.0       |

obtained by running:

```sh
python -m benchmarks.benchmark_training_throughput \
  --name delta_net \
  --batch_size 1 \
  --seq_len 32768 \
  --context_len 4096 \
  --varlen \
  --steps 512
```
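The quoted 1.1x figure follows directly from the table:

```python
# Throughput numbers from the table above (K tokens/s, 1B models, H100)
before, after = 48.6, 54.0
speedup = after / before  # ~1.11x, matching the quoted 1.1x
```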

What's Changed

  • [Gated DeltaNet] optimize UT transform by @sustcsonglin in https://github.com/fla-org/flash-linear-attention/pull/349
  • [RWKV] remove duplicate params from autotune key list by @jihaoh98 in https://github.com/fla-org/flash-linear-attention/pull/359
  • Fix some arg passing by @yibozhong in https://github.com/fla-org/flash-linear-attention/pull/358
  • [RWKV7] Update RWKV7 to follow official initialization by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/365
  • Remove all NT: constexpr by @sustcsonglin in https://github.com/fla-org/flash-linear-attention/pull/364
  • [Misc.] Use logger.info instead of print in fla.utils.py by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/366
  • [RWKV]: Prevent initialization when loading pretrained weights by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/369
  • [Norm] Optimize speed for small headdim by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/368
  • [GroupNorm] Optimized speed for small headdims by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/371
  • [LayerNormGated] Fix arg bugs during autotuning by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/372

New Contributors

  • @jihaoh98 made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/359
  • @yibozhong made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/358

Full Changelog: https://github.com/fla-org/flash-linear-attention/compare/v0.2.0...v0.2.1

Published by yzhangcs 10 months ago

flash-linear-attention - v0.2.0

What's Changed

  • [Attn] Delete V reduction & Enable 256 headdim tests by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/273
  • [RWKV7] Add more elementwise kernels by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/271
  • [CI] Remove cache and disable full test on Arc GPU by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/274
  • [Fox] Add model/layer/kernel impls w/ varlen support by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/275
  • [FoX] Simplify some tests and enhance tiling by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/277
  • [Test] Remove some warnings and correct condition checks by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/278
  • [CI] auto-cancel workflows on PR merge via concurrency group by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/280
  • [Test] use tl.float16 instead of tl.bfloat16 by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/281
  • [OP] replace tl.exp, tl.log, tl.log2 with fast ops when FLA_USE_FAST_OPS=1 by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/276
  • [FoX] Rename fox to forgetting_attn by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/282
  • [DeltaNet] WY repr speedup by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/279
  • [README] Add --no-use-pep517 flag for faster installation by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/286
  • [FoX] Skip test D>128 on RTX4090 by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/287
  • [FoX] Test different forget gate initialization ranges by @zhixuan-lin in https://github.com/fla-org/flash-linear-attention/pull/291
  • [FoX] Fix class inheritance for ForgettingTransformerForCausalLM by @zhixuan-lin in https://github.com/fla-org/flash-linear-attention/pull/293
  • [CI] use latest stable triton by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/294
  • [Triton] use tl.gather to enhance performance by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/270
  • [WY representation] Faster lower triangle inverse by @sustcsonglin in https://github.com/fla-org/flash-linear-attention/pull/289
  • [GroupNorm] Add argument is_rms_norm to GroupNorm by @zhixuan-lin in https://github.com/fla-org/flash-linear-attention/pull/295
  • [GroupNorm] Return correct residual in reference implementation by @zhixuan-lin in https://github.com/fla-org/flash-linear-attention/pull/297
  • [CI] Don't show Triton autotune logs in CI by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/298
  • [FoX] Use GroupNorm for QK-norm implementation in FoX by @zhixuan-lin in https://github.com/fla-org/flash-linear-attention/pull/299
  • [Utils] Update H100 and A100 configs by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/306
  • Pass shifted labels and add a warning to RWKV-7 initialization. by @Triang-jyed-driung in https://github.com/fla-org/flash-linear-attention/pull/304
  • [Misc.] Update imports for GatedDeltaProduct by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/309
  • [FAQ] Rewrite the nightly installation instructions by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/305
  • Add unit tests for model forward and variable-length checks by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/310
  • [Test] Improve path handling and test file detection by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/311
  • [ShortConv] Adjust input shape according to cu_seqlens by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/316
  • [Tests] Add unit tests for generation with padding by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/312
  • [Testing] Update testing.py by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/320
  • [DeltaNet] optimize chunk_delta_h by @sustcsonglin in https://github.com/fla-org/flash-linear-attention/pull/315
  • [CI] Only cancel in-progress CI for pull requests by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/321
  • [Test] Skip some tests on arcA770 by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/322
  • [API] Update head_first parameter default to False by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/324
  • [Rotary] Remove max_seqlen parameter and adjust related logic by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/326
  • [DeltaProduct] Remove unnecessary config parameter. by @JulienSiems in https://github.com/fla-org/flash-linear-attention/pull/325
  • fix the training problem of GatedDeltaProduct by @ridgerchu in https://github.com/fla-org/flash-linear-attention/pull/327
  • [Linear Attn] Fix head_first tests by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/330
  • [Deprecated] Remove head_first option in gla variants by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/337
  • [Test] Ensure most tests on Triton 3.2.0 and add 4096 seq_length in tests [skip test] by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/300
  • [FoX] Merge code to FlashAttention | support batch inference by @sustcsonglin in https://github.com/fla-org/flash-linear-attention/pull/333
  • [DeltaNet] Delete head_first option for all by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/338
  • [WIP] Remove head_first option by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/339
  • [RWKV7] add input_precision param [skip test] by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/335
  • [Testing] Add recursive dependency finding for test discovery by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/341
  • [WIP] Delete head_first option for cumsum by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/342
  • [WIP] Delete head_first tests for DeltaNet/GLA by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/344
  • [Attn] Remove head_first & rename offsets to cu_seqlens by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/345
  • [RWKV7] Drop some kernels to enhance speed by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/346
  • Remove the head_first arg from several token mixing layer fns. by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/347
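Many entries above remove the deprecated head_first layout. As an illustration only (not fla's API): head_first shaped tensors as (batch, num_heads, seq_len, head_dim), while the new default is (batch, seq_len, num_heads, head_dim). The conversion is a transpose of axes 1 and 2 (x.transpose(1, 2) on a torch tensor); a pure-Python sketch on nested lists:

```python
def swap_heads_and_seq(x):
    """Swap the head and sequence axes of a 4-level nested list,
    mirroring torch's x.transpose(1, 2)."""
    return [[list(row) for row in zip(*batch)] for batch in x]

# batch=1, heads=2, seq=3, head_dim=1
head_first = [[[[1], [2], [3]],
               [[4], [5], [6]]]]
seq_first = swap_heads_and_seq(head_first)  # shape (1, 3, 2, 1)
```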

New Contributors

  • @sustcsonglin made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/289

Full Changelog: https://github.com/fla-org/flash-linear-attention/compare/v0.1.2...v0.2.0

Published by yzhangcs 11 months ago

flash-linear-attention - v0.1.2

What's Changed

  • [RWKV7] fix RWKV7Attention.__init__ by @exhyy in https://github.com/fla-org/flash-linear-attention/pull/238
  • fix(triton): remove num_warps=8 in bwd_prepare_wy_repr_kernel to avoid MMA layout assertion on non-Ampere GPUs. by @kugwzk in https://github.com/fla-org/flash-linear-attention/pull/240
  • [Fix]: reshape o before oproj in linearattn layer. by @Luther-Sparks in https://github.com/fla-org/flash-linear-attention/pull/243
  • [CI] Separate tests into compile, normal and varlen by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/247
  • [ABC] Add use_rope parameter to ABCAttention and ABCConfig & Fix compiler bugs in kernels by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/248
  • [CI] trigger GPU workflow only on pull_request events by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/249
  • Create test_linearatten.py by @kangyiyang in https://github.com/fla-org/flash-linear-attention/pull/250
  • [CI] Fix all errors and enable testing for PR by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/251
  • [CI] add H100 GPU by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/254
  • [Gated DeltaNet] fix gdn kernel bugs on h100 when vdim=64 by @kugwzk in https://github.com/fla-org/flash-linear-attention/pull/256
  • [Test] Enhance support for NVIDIA Hopper GPU by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/257
  • [FAQ] Update triton-nightly links by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/259
  • [Attn] Add triton impls for MHA/GQA by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/260
  • [Attn] Use larger block size for hopper devices by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/261
  • [Attn] Enable test for attn by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/262
  • [CI] fix a syntax error in triton-nightly by @zhiyuan1i in https://github.com/fla-org/flash-linear-attention/pull/263
  • Bump fla to v0.1.2 by @yzhangcs in https://github.com/fla-org/flash-linear-attention/pull/264

New Contributors

  • @exhyy made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/238
  • @kugwzk made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/240
  • @Luther-Sparks made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/243
  • @yzhangcs made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/248
  • @kangyiyang made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/250

Full Changelog: https://github.com/fla-org/flash-linear-attention/compare/v0.1.1...v0.1.2

Published by yzhangcs 11 months ago

flash-linear-attention - v0.1.1

What's Changed

  • [README] Fix HGRN2 bibs in https://github.com/fla-org/flash-linear-attention/commit/a43b525e397ecb92e3f1335e39029be731b9cc21
  • [LightNet] Use fused norm, output gate and proj in https://github.com/fla-org/flash-linear-attention/commit/bc86b590880e47f3c27753ac377138ab165414a2
  • [LayerNormGated] Support combined sigmoid/swish output gate in https://github.com/fla-org/flash-linear-attention/commit/27de88dbf8382c859a4136c0002bf231908a8235
  • [NSA] Fix missing Cache class in https://github.com/fla-org/flash-linear-attention/commit/a5199200efdce7eae142498f137ea77219191a21
  • [BUG][RWKV7] Fix value head dim mismatch in https://github.com/fla-org/flash-linear-attention/commit/8fd1d3d55c19ccbb012b33f1caa6d271e58ead8d
  • [DeltaNet] Improved kernel speed by finegrained autotuning in https://github.com/fla-org/flash-linear-attention/commit/3b9bba8b9c9397b5ce69677e02c3181a056bc741

Full Changelog: https://github.com/fla-org/flash-linear-attention/compare/v0.1.0...v0.1.1

Published by yzhangcs 11 months ago

flash-linear-attention - 💥 v0.1.0

What's Changed

  • Update README.md by @eltociear in https://github.com/fla-org/flash-linear-attention/pull/2
  • fix simple gla backward by @sunyt32 in https://github.com/fla-org/flash-linear-attention/pull/6
  • Adding RWKV-v4. by @ridgerchu in https://github.com/fla-org/flash-linear-attention/pull/8
  • fixed hgrn.py paper link and title by @ridgerchu in https://github.com/fla-org/flash-linear-attention/pull/10
  • Update recurrent_naive.py by @hypnopump in https://github.com/fla-org/flash-linear-attention/pull/12
  • fix: calculate du on different batch by @uniartisan in https://github.com/fla-org/flash-linear-attention/pull/35
  • fix: enhance state gradient when bf16 by @uniartisan in https://github.com/fla-org/flash-linear-attention/pull/37
  • Add implementations of Mamba 2 into FLA by @DanFosing in https://github.com/fla-org/flash-linear-attention/pull/39
  • Minor mamba-2 fixes by @DanFosing in https://github.com/fla-org/flash-linear-attention/pull/40
  • [DeltaNet] Adds beta as a vector option by @hypnopump in https://github.com/fla-org/flash-linear-attention/pull/42
  • [DRAFT] Beta gradient does not match by @hypnopump in https://github.com/fla-org/flash-linear-attention/pull/43
  • [Attn] fix negative value of seqlen offset during sft by @ChaosCodes in https://github.com/fla-org/flash-linear-attention/pull/45
  • [RWKV6] fix backward if h0 not passed by @hypnopump in https://github.com/fla-org/flash-linear-attention/pull/48
  • Replace mamba2 mamba_chunk_scan_combined triton kernel by simple_gla triton kernel by @learning-chip in https://github.com/fla-org/flash-linear-attention/pull/49
  • benchmark script for simple_gla vs mamba2 kernel by @learning-chip in https://github.com/fla-org/flash-linear-attention/pull/50
  • Update amp `custom_fwd`, `custom_bwd` usage for torch 2.4.0 compatibility by @mirceamironenco in https://github.com/fla-org/flash-linear-attention/pull/54
  • Fix syntax error by @JulienSiems in https://github.com/fla-org/flash-linear-attention/pull/55
  • Add __init__.py in fla/ops/common for automatic package discovery by @zhixuan-lin in https://github.com/fla-org/flash-linear-attention/pull/56
  • [Mamba2] Post Merge Fixes - norm_before_gate and generation with inputs_embeds by @vasqu in https://github.com/fla-org/flash-linear-attention/pull/57
  • Correctly compute max_seqlen when max_position_embeddings is None by @zhixuan-lin in https://github.com/fla-org/flash-linear-attention/pull/59
  • add chunked kl div by @ChaosCodes in https://github.com/fla-org/flash-linear-attention/pull/62
  • Add fine-grained warning category for easier suppression by @mirceamironenco in https://github.com/fla-org/flash-linear-attention/pull/65
  • Update fused_chunk.py by @hypnopump in https://github.com/fla-org/flash-linear-attention/pull/72
  • [Mamba2] Fix slow path by @vasqu in https://github.com/fla-org/flash-linear-attention/pull/84
  • Add BitNet by @DustinWang1 in https://github.com/fla-org/flash-linear-attention/pull/85
  • Fix RWKV6 Cache Problems by @WorldEditors in https://github.com/fla-org/flash-linear-attention/pull/78
  • Bugs in RWKV6 OP by @WorldEditors in https://github.com/fla-org/flash-linear-attention/pull/87
  • fix mamba2 cache bug by @WorldEditors in https://github.com/fla-org/flash-linear-attention/pull/89
  • fix dh0 is None breaking backward pass by @Sxela in https://github.com/fla-org/flash-linear-attention/pull/102
  • support varlen training for conv1d by @LKJacky in https://github.com/fla-org/flash-linear-attention/pull/116
  • blood for the torch.compile gods by @harrisonvanderbyl in https://github.com/fla-org/flash-linear-attention/pull/119
  • Added forward pass for chunkwise ttt-linear, varlen is supported. by @Pan-Yuqi in https://github.com/fla-org/flash-linear-attention/pull/124
  • Add scripts for converting pretrained RWKV7 models to fla format by @Triang-jyed-driung in https://github.com/fla-org/flash-linear-attention/pull/128
  • [Mamba2] Fixes for caching and multiple other small issues by @vasqu in https://github.com/fla-org/flash-linear-attention/pull/129
  • [LinAttn] Fix handling of None scale in `chunk_linear_attn` for output normalization by @HallerPatrick in https://github.com/fla-org/flash-linear-attention/pull/130
  • Fix incorrect kwarg name in fused_recurrent by @fffffgggg54 in https://github.com/fla-org/flash-linear-attention/pull/134
  • RWKV-7 conversion and evals by @Triang-jyed-driung in https://github.com/fla-org/flash-linear-attention/pull/135
  • Fixed dtype mismatch of mamba & mamba2 under `residual_in_fp32` setting by @chengshuang18 in https://github.com/fla-org/flash-linear-attention/pull/137
  • [RWKV7] Fix masking before time shifting modules by @Triang-jyed-driung in https://github.com/fla-org/flash-linear-attention/pull/141
  • [RWKV7, but applicable to all models] Update modeling_rwkv7.py: Fixing `base_model_prefix` by @Triang-jyed-driung in https://github.com/fla-org/flash-linear-attention/pull/143
  • fix bitattn with latest attn implementation. by @ridgerchu in https://github.com/fla-org/flash-linear-attention/pull/146
  • [RWKV7] Remove in-place operations and add gradient checkpointing for v_first by @Triang-jyed-driung in https://github.com/fla-org/flash-linear-attention/pull/145
  • [BitNet] Fix bugs of model definitions by @ridgerchu in https://github.com/fla-org/flash-linear-attention/pull/147
  • Fix #157 by @jannalulu in https://github.com/fla-org/flash-linear-attention/pull/167
  • [Mamba, Samba] Add weight initializations and `reset_parameters()` in `_init_weights()` for compatibility in Flame by @zaydzuhri in https://github.com/fla-org/flash-linear-attention/pull/169
  • fix lint errors by @jannalulu in https://github.com/fla-org/flash-linear-attention/pull/170
  • [RWKV] Follow-up to fix cache management by @jannalulu in https://github.com/fla-org/flash-linear-attention/pull/168
  • one liner by @seanxwzhang in https://github.com/fla-org/flash-linear-attention/pull/178
  • [Attn] Fix cache update of swa by @Pan-Yuqi in https://github.com/fla-org/flash-linear-attention/pull/183
  • [RWKV7] Fix conversion precision by @Triang-jyed-driung in https://github.com/fla-org/flash-linear-attention/pull/188
  • [GRPO]: add grpo functions by @uniartisan in https://github.com/fla-org/flash-linear-attention/pull/189
  • [RWKV] fix logits handling by @jannalulu in https://github.com/fla-org/flash-linear-attention/pull/192
  • [Modules]: Enhance the precision of the fused LayerNorm OP. by @uniartisan in https://github.com/fla-org/flash-linear-attention/pull/200
  • [MISC] fix delta_net logit handling by @jannalulu in https://github.com/fla-org/flash-linear-attention/pull/205
  • [RWKV7] Keep compatibility with Torch Compiler. by @uniartisan in https://github.com/fla-org/flash-linear-attention/pull/208
  • [Misc.] Update wrapper to support contiguous and guard custom device … by @uniartisan in https://github.com/fla-org/flash-linear-attention/pull/212
  • [Models] Fix the error in the judgment of `past_key_values` when inputs… by @uniartisan in https://github.com/fla-org/flash-linear-attention/pull/213
  • [Titans] Update Titans implementation by @rucnyz in https://github.com/fla-org/flash-linear-attention/pull/214
  • [Mamba2] Fix initialization by @HanGuo97 in https://github.com/fla-org/flash-linear-attention/pull/225
  • [TTT] Update fused chunk ops and state bias term by @Pan-Yuqi in https://github.com/fla-org/flash-linear-attention/pull/230
  • Enable utils.py to be imported on CPU-only machines (#231) by @zhuzeyuan in https://github.com/fla-org/flash-linear-attention/pull/232
  • [Utils] use fla.utils.device instead of cuda by @uniartisan in https://github.com/fla-org/flash-linear-attention/pull/163
  • fix(GatedDeltaNet): Ensure integer dimensions when using expand_v by @vladislavalerievich in https://github.com/fla-org/flash-linear-attention/pull/234

New Contributors

  • @eltociear made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/2
  • @sunyt32 made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/6
  • @ridgerchu made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/8
  • @hypnopump made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/12
  • @DanFosing made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/39
  • @ChaosCodes made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/45
  • @learning-chip made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/49
  • @mirceamironenco made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/54
  • @JulienSiems made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/55
  • @zhixuan-lin made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/56
  • @vasqu made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/57
  • @DustinWang1 made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/85
  • @WorldEditors made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/78
  • @Sxela made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/102
  • @LKJacky made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/116
  • @harrisonvanderbyl made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/119
  • @Triang-jyed-driung made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/128
  • @HallerPatrick made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/130
  • @fffffgggg54 made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/134
  • @chengshuang18 made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/137
  • @jannalulu made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/167
  • @zaydzuhri made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/169
  • @seanxwzhang made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/178
  • @zhuzeyuan made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/232
  • @vladislavalerievich made their first contribution in https://github.com/fla-org/flash-linear-attention/pull/234

Full Changelog: https://github.com/fla-org/flash-linear-attention/commits/v0.1.0

Published by yzhangcs 11 months ago