Recent Releases of accelerated-scan
accelerated-scan - 0.2.0 — faster training!
@unixpickle fused the sequence reversal required by the backward pass into the kernel and vectorized the loads and stores; training is 30-40 percent faster on a 3090.
- Python
Published by proger almost 2 years ago
accelerated-scan - 0.1.2 — reverse reference scan
This release adds a reverse=True flag to accelerated_scan.ref.scan.
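One way to picture the reverse scan (a plain-Python sketch, not the package's API; the exact indexing convention used by accelerated_scan.ref.scan is an assumption here): reversing the inputs and outputs of the forward scan solves the suffix recurrence x[t] = gate[t] * x[t+1] + token[t].

```python
def scan(gates, tokens, reverse=False):
    """Solve x[t] = gate[t] * x[t-1] + token[t] with x[-1] = 0, or the
    suffix recurrence x[t] = gate[t] * x[t+1] + token[t] when reverse=True."""
    if reverse:
        gates, tokens = gates[::-1], tokens[::-1]
    x, out = 0.0, []
    for g, t in zip(gates, tokens):
        x = g * x + t
        out.append(x)
    return out[::-1] if reverse else out

# Forward: [1.0, 1.5, 1.75]; reverse: [1.75, 1.5, 1.0]
print(scan([0.5, 0.5, 0.5], [1.0, 1.0, 1.0]))
print(scan([0.5, 0.5, 0.5], [1.0, 1.0, 1.0], reverse=True))
```

This reversal is exactly what the 0.2.0 release later fuses into the backward kernel.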
Full Changelog: https://github.com/proger/accelerated-scan/compare/0.1.1...0.1.2
Published by proger about 2 years ago
accelerated-scan - 0.1.1 — 16 bit support
This release adds support for float16 and bfloat16 by templating the warp kernel. Below is a plot of the maximum absolute error between the reference implementation and the kernel:
Published by proger about 2 years ago
accelerated-scan - 0.1
This package implements the fastest first-order parallel associative scan on the GPU, for both the forward and backward passes.
The scan efficiently solves first-order recurrences of the form x[t] = gate[t] * x[t-1] + token[t], common in state space models and linear RNNs.
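The recurrence itself is easy to state as a sequential reference (a plain-Python sketch of the math, standing in for the package's reference implementation):

```python
def ref_scan(gates, tokens):
    """Sequentially solve x[t] = gate[t] * x[t-1] + token[t], with x[-1] = 0."""
    x, out = 0.0, []
    for g, t in zip(gates, tokens):
        x = g * x + t
        out.append(x)
    return out

print(ref_scan([0.5, 0.5, 0.5], [1.0, 1.0, 1.0]))  # -> [1.0, 1.5, 1.75]
```

The kernel computes the same values in parallel by exploiting the associativity of composing these update steps.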
The accelerated_scan.warp C++ CUDA kernel uses a chunked processing algorithm that leverages the fastest GPU communication primitives available at each level of the hierarchy: warp shuffles within warps of 32 threads, and shared memory (SRAM) between warps within a thread block. Each sequence of each channel is confined to a single thread block.
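The scan parallelizes because each step (gate, token) acts as the affine map x ↦ gate·x + token, and two consecutive steps compose associatively into (g2·g1, g2·b1 + b2). A minimal sketch of the chunked idea in plain Python (standing in for the CUDA kernel; the chunks here play the role of warps and blocks, and the carry between chunks is what the kernel exchanges via shuffles and shared memory):

```python
def combine(p1, p2):
    # Compose two affine steps x -> g*x + b: apply p1 first, then p2.
    g1, b1 = p1
    g2, b2 = p2
    return (g2 * g1, g2 * b1 + b2)

def chunked_scan(gates, tokens, chunk=4):
    """Two-level scan: local scans inside each chunk, with a carried summary
    (an affine step) propagated across chunk boundaries."""
    n = len(gates)
    out = [0.0] * n
    carry = (1.0, 0.0)  # identity affine step
    for s in range(0, n, chunk):
        acc = carry
        for i in range(s, min(s + chunk, n)):
            acc = combine(acc, (gates[i], tokens[i]))
            out[i] = acc[1]  # acc applied to x[-1] = 0 yields x[i]
        carry = acc  # summary of all steps seen so far
    return out
```

In the kernel the per-chunk summaries are themselves scanned in parallel rather than carried sequentially as above; the sketch only shows why chunking preserves the exact result.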
The derivation of the Chunked Scan extends the tree-level Blelloch algorithm to the block level.
A similar implementation is available in accelerated_scan.triton using Triton's tl.associative_scan primitive. It requires Triton 2.2 for its enable_fp_fusion flag.
Published by proger about 2 years ago