accelerated-scan
Accelerated First Order Parallel Associative Scan
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in README
- ✓ Academic publication links: links to arxiv.org, zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.8%) to scientific vocabulary
Keywords
Repository
Accelerated First Order Parallel Associative Scan
Basic Info
- Host: GitHub
- Owner: proger
- License: MIT
- Language: Python
- Default Branch: main
- Homepage: https://twitter.com/darkproger/status/1745041586394648975
- Size: 1000 KB
Statistics
- Stars: 180
- Watchers: 7
- Forks: 8
- Open Issues: 9
- Releases: 4
Topics
Metadata Files
README.md
Accelerated Scan
This package implements the fastest first-order parallel associative scan on the GPU, for both the forward and backward pass.
The scan efficiently solves first-order recurrences of the form x[t] = gate[t] * x[t-1] + token[t], common in state space models and linear RNNs.
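As a point of reference, here is a minimal sketch of that recurrence evaluated sequentially in PyTorch; the function name and the zero initial state are illustrative, not part of the package:

```python
import torch

def scan_sequential(gates, tokens):
    # x[t] = gate[t] * x[t-1] + token[t], scanned over the last dimension,
    # with the initial state x[-1] taken to be zero
    state = torch.zeros_like(tokens[..., 0])
    outputs = []
    for t in range(tokens.shape[-1]):
        state = gates[..., t] * state + tokens[..., t]
        outputs.append(state)
    return torch.stack(outputs, dim=-1)
```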
The accelerated_scan.warp C++ CUDA kernel uses a chunked processing algorithm that leverages the fastest GPU communication primitives available
on each level of hierarchy: warp shuffles within warps of 32 threads and shared memory (SRAM) between warps within a thread block. One sequence per channel dimension is confined to one thread block.
The derivation of the chunked scan extends the tree-level Blelloch algorithm to the thread-block level.
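The associativity that the chunked derivation relies on can be shown in plain PyTorch: composing two steps x -> g*x + t yields another step of the same form, so prefixes can be combined in logarithmic depth. The sketch below is illustrative only; it uses a Hillis-Steele style scan in whole tensors rather than the chunked Blelloch variant with warp shuffles and shared memory that the CUDA kernel implements:

```python
import torch
import torch.nn.functional as F

def combine(gl, xl, gr, xr):
    # composing x -> gl*x + xl followed by x -> gr*x + xr
    # gives x -> (gl*gr)*x + (gr*xl + xr)
    return gl * gr, gr * xl + xr

def log_depth_scan(gates, tokens):
    # inclusive scan over the last dimension in O(log T) combine rounds
    g, x = gates.clone(), tokens.clone()
    T = g.shape[-1]
    step = 1
    while step < T:
        # shift right by `step`, padding with the identity element (g=1, x=0)
        g_prev = F.pad(g[..., :-step], (step, 0), value=1.0)
        x_prev = F.pad(x[..., :-step], (step, 0), value=0.0)
        g, x = combine(g_prev, x_prev, g, x)
        step *= 2
    return x
```

Its output should match the sequential sketch above up to floating-point error.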
A similar implementation is available in accelerated_scan.triton using Triton's tl.associative_scan primitive. It requires Triton 2.2 for its enable_fp_fusion flag.
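A hedged sketch of how such a scan might look with tl.associative_scan is below. This is not the package's kernel; the kernel and function names, the one-row-per-program grid, and the assumption that an entire row fits in a single program are all illustrative:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def first_order_op(gl, xl, gr, xr):
    # composing x -> gl*x + xl then x -> gr*x + xr gives (gl*gr, gr*xl + xr)
    return gl * gr, gr * xl + xr

@triton.jit
def scan_kernel(gates_ptr, tokens_ptr, out_ptr, SEQLEN: tl.constexpr):
    # one program scans one contiguous row of length SEQLEN (a power of 2)
    row = tl.program_id(0)
    offsets = row * SEQLEN + tl.arange(0, SEQLEN)
    g = tl.load(gates_ptr + offsets)
    x = tl.load(tokens_ptr + offsets)
    _, y = tl.associative_scan((g, x), 0, first_order_op)
    tl.store(out_ptr + offsets, y)

batch, dim, seqlen = 2, 4, 128
gates = 0.999 + 0.001 * torch.rand(batch, dim, seqlen, device="cuda")
tokens = torch.rand(batch, dim, seqlen, device="cuda")
out = torch.empty_like(tokens)
scan_kernel[(batch * dim,)](gates, tokens, out, SEQLEN=seqlen)
```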
Quick Start:

```bash
pip install accelerated-scan
```

```python
import torch
from accelerated_scan.warp import scan    # a pure C++ kernel, faster than CUB
# from accelerated_scan.triton import scan  # uses tl.associative_scan
# from accelerated_scan.ref import scan     # reference torch implementation

# sequence lengths must be a power of 2 between 32 and 65536
# hit me up if you need different lengths!
batch_size, dim, seq_len = 3, 1536, 4096
gates = 0.999 + 0.001 * torch.rand(batch_size, dim, seq_len, device="cuda")
tokens = torch.rand(batch_size, dim, seq_len, device="cuda")

out = scan(gates, tokens)
```
To ensure numerical equivalence, a reference tree-scan implementation is provided in PyTorch. It can be sped up using torch.compile.
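For example, a minimal sketch of wrapping the reference scan in torch.compile; the default backend is assumed, and speedups will vary by GPU and sequence length:

```python
import torch
from accelerated_scan.ref import scan as ref_scan

# compile the reference scan with the default torch.compile backend
compiled_scan = torch.compile(ref_scan)

gates = 0.999 + 0.001 * torch.rand(3, 1536, 4096, device="cuda")
tokens = torch.rand(3, 1536, 4096, device="cuda")
out = compiled_scan(gates, tokens)
```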
Benchmarks:

See more benchmarks in nanokitchen: https://github.com/proger/nanokitchen
Forward pass for inputs of shape (8, 1536, seqlen), inference mode:

| SEQUENCE_LENGTH | accelerated_scan.triton (triton 2.2.0) | accelerated_scan.ref | accelerated_scan.warp |
|---|---|---|---|
| 128 | 0.027382 | 0.380874 | 0.026844 |
| 256 | 0.049104 | 0.567916 | 0.048593 |
| 512 | 0.093008 | 1.067906 | 0.092923 |
| 1024 | 0.181856 | 2.048471 | 0.183581 |
| 2048 | 0.358250 | 3.995369 | 0.355414 |
| 4096 | 0.713511 | 7.897022 | 0.714536 |
| 8192 | 1.433052 | 15.698944 | 1.411390 |
| 16384 | 3.260965 | 31.305046 | 2.817152 |
| 32768 | 31.459671 | 62.557182 | 5.645697 |
| 65536 | 66.787331 | 125.208572 | 11.297921 |
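A timing loop along the lines of the table above might look like the sketch below; the shapes follow the table, but the warm-up and iteration counts and the use of CUDA events are our own choices, not the project's benchmark harness:

```python
import torch
from accelerated_scan.warp import scan

batch, dim, seqlen = 8, 1536, 4096
gates = 0.999 + 0.001 * torch.rand(batch, dim, seqlen, device="cuda")
tokens = torch.rand(batch, dim, seqlen, device="cuda")

with torch.inference_mode():
    for _ in range(10):                      # warm-up
        scan(gates, tokens)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(100):
        scan(gates, tokens)
    end.record()
    torch.cuda.synchronize()
    print(f"{start.elapsed_time(end) / 100:.6f} ms per forward call")
```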
Notes on Precision
When gates and tokens are sampled uniformly from [0, 1], the limited precision of bfloat16 dominates the error relative to the reference implementation:
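A minimal sketch of measuring that error by comparing a bfloat16 run against a float32 reference; whether both kernels accept these exact dtypes is an assumption here:

```python
import torch
from accelerated_scan.warp import scan
from accelerated_scan.ref import scan as ref_scan

torch.manual_seed(0)
gates = torch.rand(1, 64, 4096, device="cuda")
tokens = torch.rand(1, 64, 4096, device="cuda")

out_bf16 = scan(gates.bfloat16(), tokens.bfloat16())
ref_fp32 = ref_scan(gates, tokens)
print("max abs error:", (out_bf16.float() - ref_fp32).abs().max().item())
```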

Owner
- Name: Volodymyr Kyrylov
- Login: proger
- Kind: user
- Location: UTC+2
- Website: https://www.linkedin.com/in/darkproger/
- Twitter: darkproger
- Repositories: 50
- Profile: https://github.com/proger
Citation (CITATION.cff)
cff-version: 1.1.0
message: "If you use Accelerated Scan, please cite it as below."
authors:
  - family-names: Kyrylov
    given-names: Volodymyr
    orcid: https://orcid.org/0000-0001-7462-9918
title: Accelerated Scan
doi: 10.5281/zenodo.10600962
version: 0.1.2
date-released: 2024-01-31
GitHub Events
Total
- Issues event: 4
- Watch event: 33
- Issue comment event: 6
Last Year
- Issues event: 4
- Watch event: 33
- Issue comment event: 6
Committers
Last synced: over 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Alex Nichol | u****e@g****m | 20 |
| Volodymyr Kyrylov | v****l@w****a | 15 |
| Volodymyr Kyrylov | p****r@w****a | 13 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 5
- Total pull requests: 3
- Average time to close issues: about 1 month
- Average time to close pull requests: 3 days
- Total issue authors: 4
- Total pull request authors: 2
- Average comments per issue: 3.0
- Average comments per pull request: 3.67
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: about 2 months
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 2.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- proger (2)
- xaviergonzalez (1)
- ruoyxue (1)
- jeromeku (1)
- JohnAlphaIII (1)
- ekellbuch (1)
- WeihanLikk (1)
Pull Request Authors
- unixpickle (4)
- proger (3)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- actions/checkout master composite
- actions/setup-python v3 composite
- pypa/gh-action-pypi-publish release/v1 composite
- torch >=2.1.0