accelerated-scan

Accelerated First Order Parallel Associative Scan

https://github.com/proger/accelerated-scan

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.8%) to scientific vocabulary

Keywords

cuda cumulative-sum recurrent-neural-networks state-space-model torch
Last synced: 6 months ago

Repository

Accelerated First Order Parallel Associative Scan

Basic Info
Statistics
  • Stars: 180
  • Watchers: 7
  • Forks: 8
  • Open Issues: 9
  • Releases: 4
Topics
cuda cumulative-sum recurrent-neural-networks state-space-model torch
Created about 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Accelerated Scan

PyPI Version DOI

This package implements the fastest first-order parallel associative scan on the GPU, covering both the forward and backward passes.

The scan efficiently solves first-order recurrences of the form x[t] = gate[t] * x[t-1] + token[t], which are common in state space models and linear RNNs.
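As a minimal illustration of this recurrence (a plain-Python sketch, not the package's CUDA or Triton kernels; `sequential_scan` is a hypothetical helper name), unrolling it one step at a time looks like:

```python
def sequential_scan(gates, tokens):
    """Solve x[t] = gate[t] * x[t-1] + token[t] with x[-1] = 0, sequentially."""
    x, out = 0.0, []
    for g, t in zip(gates, tokens):
        x = g * x + t
        out.append(x)
    return out

# with all gates equal to 1 the recurrence degenerates to a cumulative sum
print(sequential_scan([1, 1, 1], [1, 2, 3]))  # → [1.0, 3.0, 6.0]
```

Gates below 1 exponentially decay the contribution of older tokens, which is the behaviour state space models and linear RNNs rely on.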

The accelerated_scan.warp C++ CUDA kernel uses a chunked processing algorithm that leverages the fastest GPU communication primitives available at each level of the hierarchy: warp shuffles within warps of 32 threads, and shared memory (SRAM) between warps within a thread block. Each sequence (one per channel) is confined to a single thread block.

The derivation of the chunked scan extends the tree-level Blelloch algorithm to the block level.
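The key observation behind any parallel formulation, Blelloch-style or chunked, is that (gate, token) pairs compose associatively: (g1, x1) ∘ (g2, x2) = (g1·g2, g2·x1 + x2). A pure-Python sketch of a log-step inclusive scan built on that operator (`combine` and `parallel_scan` are illustrative names, not the package API; real kernels apply the same combine via warp shuffles and shared memory):

```python
def combine(a, b):
    # compose two scan segments: run segment a, then feed its result into segment b
    ga, xa = a
    gb, xb = b
    return (ga * gb, gb * xa + xb)

def parallel_scan(gates, tokens):
    # Hillis-Steele style inclusive scan: log2(n) rounds of the associative combine
    pairs = list(zip(gates, tokens))
    step = 1
    while step < len(pairs):
        pairs = [combine(pairs[i - step], p) if i >= step else p
                 for i, p in enumerate(pairs)]
        step *= 2
    return [x for _, x in pairs]

print(parallel_scan([1, 1, 1, 1], [1, 2, 3, 4]))  # → [1, 3, 6, 10]
```

Because the combine is associative, the elements can be grouped into chunks, scanned independently, and stitched together, which is what allows the work to be split across warps and blocks.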

A similar implementation is available in accelerated_scan.triton, using Triton's tl.associative_scan primitive. It requires Triton 2.2 for its enable_fp_fusion flag.

Quick Start:

```bash
pip install accelerated-scan
```

```python
import torch
from accelerated_scan.warp import scan  # a pure C++ kernel, faster than cub
# from accelerated_scan.triton import scan  # uses tl.associative_scan
# from accelerated_scan.ref import scan  # reference torch implementation

# sequence lengths must be a power of 2 between 32 and 65536
# hit me up if you need different lengths!

batch_size, dim, seqlen = 3, 1536, 4096
gates = 0.999 + 0.001 * torch.rand(batch_size, dim, seqlen, device="cuda")
tokens = torch.rand(batch_size, dim, seqlen, device="cuda")

out = scan(gates, tokens)
```

To check numerical equivalence, a reference tree-scan implementation is provided in Torch; it can be sped up using torch.compile.

Benchmarks:

bench.png

See more benchmarks in nanokitchen: https://github.com/proger/nanokitchen

Forward speed of (8, 1536, seqlen), inference mode:

| SEQUENCE_LENGTH | accelerated_scan.triton (triton 2.2.0) | accelerated_scan.ref | accelerated_scan.warp |
|----------------:|---------------------------------------:|---------------------:|----------------------:|
| 128 | 0.027382 | 0.380874 | 0.026844 |
| 256 | 0.049104 | 0.567916 | 0.048593 |
| 512 | 0.093008 | 1.067906 | 0.092923 |
| 1024 | 0.181856 | 2.048471 | 0.183581 |
| 2048 | 0.358250 | 3.995369 | 0.355414 |
| 4096 | 0.713511 | 7.897022 | 0.714536 |
| 8192 | 1.433052 | 15.698944 | 1.411390 |
| 16384 | 3.260965 | 31.305046 | 2.817152 |
| 32768 | 31.459671 | 62.557182 | 5.645697 |
| 65536 | 66.787331 | 125.208572 | 11.297921 |

Notes on Precision

When gates and tokens are sampled uniformly from 0..1, the limited precision of bfloat16 dominates the error relative to the reference implementation:

max-abs-error.png
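One way to see why bfloat16 dominates the error: bfloat16 keeps only 8 mantissa bits of a float32, so every gate multiply and token add is rounded coarsely. A self-contained sketch that emulates the rounding (`to_bfloat16` is an illustrative helper implementing round-to-nearest-even truncation; it is not part of the package):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Emulate rounding a float32 value to bfloat16 (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounded = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]

# values expressible in 8 mantissa bits survive; most uniform samples do not
print(to_bfloat16(1.0))             # → 1.0
print(abs(to_bfloat16(0.1) - 0.1))  # ≈ 9.8e-05, versus ~1.5e-09 at float32
```

Errors of this magnitude compound across the scan's multiply-adds, which is consistent with the max-abs-error plot above.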

Owner

  • Name: Volodymyr Kyrylov
  • Login: proger
  • Kind: user
  • Location: UTC+2

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use Accelerated Scan, please cite it as below."
authors:
- family-names: Kyrylov
  given-names: Volodymyr
  orcid: https://orcid.org/0000-0001-7462-9918
title: Accelerated Scan
doi: 10.5281/zenodo.10600962
version: 0.1.2
date-released: 2024-01-31

GitHub Events

Total
  • Issues event: 4
  • Watch event: 33
  • Issue comment event: 6
Last Year
  • Issues event: 4
  • Watch event: 33
  • Issue comment event: 6

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 48
  • Total Committers: 3
  • Avg Commits per committer: 16.0
  • Development Distribution Score (DDS): 0.583
Past Year
  • Commits: 48
  • Committers: 3
  • Avg Commits per committer: 16.0
  • Development Distribution Score (DDS): 0.583
Top Committers
Name Email Commits
Alex Nichol u****e@g****m 20
Volodymyr Kyrylov v****l@w****a 15
Volodymyr Kyrylov p****r@w****a 13
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 5
  • Total pull requests: 3
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 3 days
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 3.0
  • Average comments per pull request: 3.67
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: about 2 months
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 2.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • proger (2)
  • xaviergonzalez (1)
  • ruoyxue (1)
  • jeromeku (1)
  • JohnAlphaIII (1)
  • ekellbuch (1)
  • WeihanLikk (1)
Pull Request Authors
  • unixpickle (4)
  • proger (3)
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels

Dependencies

.github/workflows/release.yml actions
  • actions/checkout master composite
  • actions/setup-python v3 composite
  • pypa/gh-action-pypi-publish release/v1 composite
pyproject.toml pypi
  • torch >=2.1.0