accelerated-scan

Accelerated First Order Parallel Associative Scan

https://github.com/proger/accelerated-scan

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.8%) to scientific vocabulary

Keywords

cuda cumulative-sum recurrent-neural-networks state-space-model torch
Last synced: 6 months ago

Repository

Accelerated First Order Parallel Associative Scan

Basic Info
Statistics
  • Stars: 180
  • Watchers: 7
  • Forks: 8
  • Open Issues: 9
  • Releases: 4
Topics
cuda cumulative-sum recurrent-neural-networks state-space-model torch
Created about 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Accelerated Scan

PyPI Version DOI

This package implements the fastest first-order parallel associative scan on the GPU, covering both the forward and backward passes.

The scan efficiently solves first-order recurrences of the form x[t] = gate[t] * x[t-1] + token[t], which are common in state space models and linear RNNs.
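As a minimal illustration of this recurrence (a plain-Python sketch, not the package's CUDA or Triton kernels; `sequential_scan` is a hypothetical helper name), unrolling it one step at a time looks like:

```python
def sequential_scan(gates, tokens):
    """Solve x[t] = gate[t] * x[t-1] + token[t] with x[-1] = 0, sequentially."""
    x, out = 0.0, []
    for g, t in zip(gates, tokens):
        x = g * x + t
        out.append(x)
    return out

# with all gates equal to 1 the recurrence degenerates to a cumulative sum
print(sequential_scan([1, 1, 1], [1, 2, 3]))  # → [1.0, 3.0, 6.0]
```

Gates below 1 exponentially decay the contribution of older tokens, which is the behaviour state space models and linear RNNs rely on.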

The accelerated_scan.warp C++ CUDA kernel uses a chunked processing algorithm that leverages the fastest GPU communication primitives available at each level of the hierarchy: warp shuffles within warps of 32 threads, and shared memory (SRAM) between warps within a thread block. Each sequence (one per channel) is confined to a single thread block.

The derivation of the chunked scan extends the tree-level Blelloch algorithm to the block level.
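The key observation behind any parallel formulation, Blelloch-style or chunked, is that (gate, token) pairs compose associatively: (g1, x1) ∘ (g2, x2) = (g1·g2, g2·x1 + x2). A pure-Python sketch of a log-step inclusive scan built on that operator (`combine` and `parallel_scan` are illustrative names, not the package API; real kernels apply the same combine via warp shuffles and shared memory):

```python
def combine(a, b):
    # compose two scan segments: run segment a, then feed its result into segment b
    ga, xa = a
    gb, xb = b
    return (ga * gb, gb * xa + xb)

def parallel_scan(gates, tokens):
    # Hillis-Steele style inclusive scan: log2(n) rounds of the associative combine
    pairs = list(zip(gates, tokens))
    step = 1
    while step < len(pairs):
        pairs = [combine(pairs[i - step], p) if i >= step else p
                 for i, p in enumerate(pairs)]
        step *= 2
    return [x for _, x in pairs]

print(parallel_scan([1, 1, 1, 1], [1, 2, 3, 4]))  # → [1, 3, 6, 10]
```

Because the combine is associative, the elements can be grouped into chunks, scanned independently, and stitched together, which is what allows the work to be split across warps and blocks.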

A similar implementation is available in accelerated_scan.triton, using Triton's tl.associative_scan primitive. It requires Triton 2.2 for its enable_fp_fusion flag.

Quick Start:

```bash
pip install accelerated-scan
```

```python
import torch
from accelerated_scan.warp import scan  # a pure C++ kernel, faster than cub
# from accelerated_scan.triton import scan  # uses tl.associative_scan
# from accelerated_scan.ref import scan  # reference torch implementation

# sequence lengths must be a power of 2 between 32 and 65536
# hit me up if you need different lengths!

batch_size, dim, seqlen = 3, 1536, 4096
gates = 0.999 + 0.001 * torch.rand(batch_size, dim, seqlen, device="cuda")
tokens = torch.rand(batch_size, dim, seqlen, device="cuda")

out = scan(gates, tokens)
```

To check numerical equivalence, a reference tree-scan implementation is provided in Torch; it can be sped up using torch.compile.

Benchmarks:

bench.png

See more benchmarks in nanokitchen: https://github.com/proger/nanokitchen

Forward speed of (8, 1536, seqlen), inference mode:

| SEQUENCE_LENGTH | accelerated_scan.triton (triton 2.2.0) | accelerated_scan.ref | accelerated_scan.warp |
|----------------:|---------------------------------------:|---------------------:|----------------------:|
| 128 | 0.027382 | 0.380874 | 0.026844 |
| 256 | 0.049104 | 0.567916 | 0.048593 |
| 512 | 0.093008 | 1.067906 | 0.092923 |
| 1024 | 0.181856 | 2.048471 | 0.183581 |
| 2048 | 0.358250 | 3.995369 | 0.355414 |
| 4096 | 0.713511 | 7.897022 | 0.714536 |
| 8192 | 1.433052 | 15.698944 | 1.411390 |
| 16384 | 3.260965 | 31.305046 | 2.817152 |
| 32768 | 31.459671 | 62.557182 | 5.645697 |
| 65536 | 66.787331 | 125.208572 | 11.297921 |

Notes on Precision

When gates and tokens are sampled uniformly from 0..1, the limited precision of bfloat16 dominates the error relative to the reference implementation:

max-abs-error.png
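One way to see why bfloat16 dominates the error: bfloat16 keeps only 8 mantissa bits of a float32, so every gate multiply and token add is rounded coarsely. A self-contained sketch that emulates the rounding (`to_bfloat16` is an illustrative helper implementing round-to-nearest-even truncation; it is not part of the package):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Emulate rounding a float32 value to bfloat16 (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounded = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]

# values expressible in 8 mantissa bits survive; most uniform samples do not
print(to_bfloat16(1.0))             # → 1.0
print(abs(to_bfloat16(0.1) - 0.1))  # ≈ 9.8e-05, versus ~1.5e-09 at float32
```

Errors of this magnitude compound across the scan's multiply-adds, which is consistent with the max-abs-error plot above.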

Owner

  • Name: Volodymyr Kyrylov
  • Login: proger
  • Kind: user
  • Location: UTC+2

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use Accelerated Scan, please cite it as below."
authors:
- family-names: Kyrylov
  given-names: Volodymyr
  orcid: https://orcid.org/0000-0001-7462-9918
title: Accelerated Scan
doi: 10.5281/zenodo.10600962
version: 0.1.2
date-released: 2024-01-31

GitHub Events

Total
  • Issues event: 4
  • Watch event: 33
  • Issue comment event: 6
Last Year
  • Issues event: 4
  • Watch event: 33
  • Issue comment event: 6

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 48
  • Total Committers: 3
  • Avg Commits per committer: 16.0
  • Development Distribution Score (DDS): 0.583
Past Year
  • Commits: 48
  • Committers: 3
  • Avg Commits per committer: 16.0
  • Development Distribution Score (DDS): 0.583
Top Committers
Name Email Commits
Alex Nichol u****e@g****m 20
Volodymyr Kyrylov v****l@w****a 15
Volodymyr Kyrylov p****r@w****a 13
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 5
  • Total pull requests: 3
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 3 days
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 3.0
  • Average comments per pull request: 3.67
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: about 2 months
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 2.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • proger (2)
  • xaviergonzalez (1)
  • ruoyxue (1)
  • jeromeku (1)
  • JohnAlphaIII (1)
  • ekellbuch (1)
  • WeihanLikk (1)
Pull Request Authors
  • unixpickle (4)
  • proger (3)
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels

Dependencies

.github/workflows/release.yml actions
  • actions/checkout master composite
  • actions/setup-python v3 composite
  • pypa/gh-action-pypi-publish release/v1 composite
pyproject.toml pypi
  • torch >=2.1.0