https://github.com/bytedance/flux

A fast communication-overlapping library for tensor/expert parallelism on GPUs.

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.1%) to scientific vocabulary

Keywords

cuda cutlass gpu pytorch

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: bytedance
License: apache-2.0
Language: C++
Default Branch: main
Homepage:
Size: 2.81 MB

Statistics

Stars: 1,085
Watchers: 14
Forks: 79
Open Issues: 34
Releases: 6

Created over 2 years ago · Last pushed 11 months ago

Metadata Files

Readme Contributing License

README.md

👋 Hi, everyone!
We are the ByteDance Seed team.

Flux: Fine-grained Computation-communication Overlapping GPU Kernel Library

Flux is a communication-overlapping library for dense/MoE models on GPUs, providing high-performance and pluggable kernels to support various parallelisms in model training/inference.

Flux's efficient kernels are compatible with PyTorch and can be integrated into existing frameworks easily, supporting various NVIDIA GPU architectures and data types.

News

[2025/03/10]🔥We have released COMET: Computation-communication Overlapping for Mixture-of-Experts.

Getting started

Install Flux either from source or from PyPI.

Install from Source

```bash
git clone --recursive https://github.com/bytedance/flux.git && cd flux

# For Ampere (sm80) GPU
./build.sh --arch 80 --nvshmem
# For Ada Lovelace (sm89) GPU
./build.sh --arch 89 --nvshmem
# For Hopper (sm90) GPU
./build.sh --arch 90 --nvshmem
```
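If you are unsure which `--arch` value applies to your GPU, the mapping in the commands above can be written as a small lookup. This is a minimal sketch, not part of Flux: `arch_flag` and `SUPPORTED_ARCHS` are hypothetical names, and the table contains only the three architectures shown above.

```python
# Hypothetical helper (not part of Flux): map a CUDA compute capability
# (major, minor) to the value passed to `./build.sh --arch`.
# Only the architectures shown in the build commands above are listed.
SUPPORTED_ARCHS = {
    (8, 0): "80",  # Ampere
    (8, 9): "89",  # Ada Lovelace
    (9, 0): "90",  # Hopper
}


def arch_flag(major: int, minor: int) -> str:
    """Return the --arch value for a compute capability, or raise ValueError."""
    try:
        return SUPPORTED_ARCHS[(major, minor)]
    except KeyError:
        raise ValueError(
            f"sm{major}{minor} is not among the architectures shown in the build commands"
        ) from None


# With PyTorch installed and a GPU visible, the capability can be read with
# torch.cuda.get_device_capability(), which returns a (major, minor) tuple:
#     arch_flag(*torch.cuda.get_device_capability())
```

Raising on an unknown capability, rather than guessing a nearby architecture, keeps the helper from suggesting a build that the commands above do not cover.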

Install in a virtual environment

The snippet below installs Flux in a virtual environment with CUDA 12.4, torch 2.6.0 and Python 3.11.

```bash
conda create -n flux python=3.11
conda activate flux
pip3 install packaging
pip3 install ninja
pip3 install torch==2.6.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

./build.sh --clean-all
./build.sh --arch "80;89;90" --nvshmem --package
```

The build should then produce a wheel package under the dist/ folder that is suitable for your virtual environment.

Install from PyPI

We also provide pre-built wheels for Flux, which you can install directly with pip if the version you need is available. Wheels are currently provided for the following configurations: torch (2.4.0, 2.5.0, 2.6.0), python (3.10, 3.11), cuda (12.4).

```bash
# Make sure that PyTorch is installed.
pip install byte-flux
```
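Before running `pip install byte-flux`, it can help to check your environment against the wheel configurations listed above. The following is a minimal sketch under that assumption: `has_prebuilt_wheel` and `PREBUILT` are hypothetical names, not a Flux API, and the table simply restates the configurations from this section.

```python
# Hypothetical helper (not part of Flux): check an environment against the
# pre-built wheel configurations listed in this section.
PREBUILT = {
    "torch": {"2.4.0", "2.5.0", "2.6.0"},
    "python": {"3.10", "3.11"},
    "cuda": {"12.4"},
}


def has_prebuilt_wheel(torch_version: str, python_version: str, cuda_version: str) -> bool:
    """Return True if all three versions match a listed configuration."""
    # Drop a local version suffix such as "+cu124" from the torch version.
    torch_base = torch_version.split("+", 1)[0]
    # Compare only major.minor for Python, e.g. "3.11.9" -> "3.11".
    python_mm = ".".join(python_version.split(".")[:2])
    return (
        torch_base in PREBUILT["torch"]
        and python_mm in PREBUILT["python"]
        and cuda_version in PREBUILT["cuda"]
    )
```

For example, `has_prebuilt_wheel("2.6.0+cu124", "3.11.9", "12.4")` evaluates to `True`; a `False` result suggests building from source instead.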

Customized Installation

Build options for source installation

Add --nvshmem to build Flux with NVSHMEM support. It is essential for the MoE kernels.
To skip the CMake step on rebuilds, set the environment variable FLUX_BUILD_SKIP_CMAKE to 1; CMake is skipped only if build/CMakeCache.txt already exists.
If you want to build a wheel package, add --package to the build command. Find the output wheel file under dist/.
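The options above can be combined programmatically, for instance in a CI script. This is a minimal sketch, assuming it is run from the repository root; `build_invocation` is a hypothetical helper, and it uses only the flags and the environment variable named in this section.

```python
# Hypothetical helper (not part of Flux): assemble a build.sh command line
# from the options described in this section.
def build_invocation(archs, nvshmem=True, package=False, skip_cmake=False):
    """Return (argv, env_overrides) for a Flux source build."""
    if not archs:
        raise ValueError("at least one architecture is required")
    # Multiple architectures are joined with ';', as in --arch "80;89;90".
    argv = ["./build.sh", "--arch", ";".join(str(a) for a in archs)]
    if nvshmem:
        argv.append("--nvshmem")  # needed for the MoE kernels
    if package:
        argv.append("--package")  # wheel is written under dist/
    # Per this section, the variable only has an effect when
    # build/CMakeCache.txt already exists.
    env_overrides = {"FLUX_BUILD_SKIP_CMAKE": "1"} if skip_cmake else {}
    return argv, env_overrides
```

A caller could then run the build with `subprocess.run(argv, env={**os.environ, **env_overrides}, check=True)`.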

Dependencies

Flux depends on NCCL and CUTLASS, which are located under 3rdparty/, and NVSHMEM, which you can install by pip.

1. NCCL: Managed by git submodule automatically.
2. NVSHMEM: It's suggested that you install nvshmem by pip install nvidia-nvshmem-cu12. If you want to build nvshmem from source, you can download it from https://developer.nvidia.com/nvshmem. Flux is tested with nvshmem 3.2.5/3.3.9.
3. CUTLASS: Flux leverages CUTLASS to generate high-performance GEMM kernels. We currently use CUTLASS 4.0.0.
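To see whether the pip-installed NVSHMEM matches one of the tested versions, a check along these lines may be useful. It is a sketch rather than a Flux utility: `nvshmem_status` and `classify_nvshmem` are hypothetical names, and it assumes the version string reported by pip for `nvidia-nvshmem-cu12` has the same form as the versions quoted above.

```python
from importlib import metadata

# Versions this section says Flux is tested with.
TESTED_NVSHMEM = {"3.2.5", "3.3.9"}


def classify_nvshmem(version):
    """Classify a version string (or None when the package is absent)."""
    if version is None:
        return "missing"
    return "tested" if version in TESTED_NVSHMEM else "untested"


def nvshmem_status(dist_name="nvidia-nvshmem-cu12"):
    """Look up the installed distribution and classify its version."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    return classify_nvshmem(version)
```

An "untested" result does not mean the build will fail, only that the installed version is not one of those listed above.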

Quick Start

Below are commands to run some basic demos once you have installed Flux successfully.

```bash
# gemm only
python3 test/python/gemm_only/test_gemm_only.py 4096 12288 6144 --dtype=float16

# all-gather fused with gemm (dense MLP layer0)
./launch.sh test/python/ag_gemm/test_ag_kernel.py 4096 49152 12288 --dtype=float16 --iters=10

# gemm fused with reduce-scatter (dense MLP layer1)
./launch.sh test/python/gemm_rs/test_gemm_rs.py 4096 12288 49152 --dtype=float16 --iters=10

# all-gather fused with grouped gemm (MoE MLP layer0)
./launch.sh test/python/moe_ag_scatter/test_moe_ag.py

# grouped gemm fused with reduce-scatter (MoE MLP layer1)
./launch.sh test/python/moe_gather_rs/test_moe_gather_rs.py
```
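To make the two dense demos concrete, here is a conceptual, single-process reference for what "all-gather fused with gemm" and "gemm fused with reduce-scatter" compute when unfused. It is a sketch only and does not use Flux: the function names are hypothetical, ranks are simulated as list entries, and the sharding layout (inputs split by rows, layer0 weights split by columns, layer1 weights split by rows) is an assumption based on common tensor-parallel practice rather than something this README specifies.

```python
def matmul(a, b):
    """Plain matrix product of a (m x k) and b (k x n) as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def ag_gemm_reference(x_shards, w_shards):
    """Unfused all-gather + GEMM.

    x_shards[r]: rows of the input held by rank r.
    w_shards[r]: columns of the weight held by rank r.
    Returns one output per rank: (all rows) x (that rank's columns).
    """
    gathered = [row for shard in x_shards for row in shard]  # all-gather
    return [matmul(gathered, w) for w in w_shards]


def gemm_rs_reference(x_shards, w_shards):
    """Unfused GEMM + reduce-scatter.

    Each rank multiplies its slice along the reduction dimension, producing
    a partial full-size output; partials are summed and rank r keeps its
    block of rows.
    """
    partials = [matmul(x, w) for x, w in zip(x_shards, w_shards)]
    world, m, n = len(partials), len(partials[0]), len(partials[0][0])
    summed = [[sum(p[i][j] for p in partials) for j in range(n)] for i in range(m)]
    rows = m // world
    return [summed[r * rows:(r + 1) * rows] for r in range(world)]
```

In this baseline the communication step (gather or reduce) and the matrix product run one after the other; a fused kernel aims to overlap them.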

You can check out the documentation for more details!

For more detailed usage of the MoE kernels, please refer to Flux MoE Usage. Try some examples as a quick start. A minimal MoE layer can be implemented in only a few tens of lines of code using Flux!
For some performance numbers, please refer to Performance Doc.
To learn more about the design principles of Flux, please refer to Design Doc.

License

The Flux Project is under the Apache License v2.0.

Citation

If you use Flux in a scientific publication, we encourage you to add the following reference to the related papers:

```
@misc{chang2024flux,
  title={FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion},
  author={Li-Wen Chang and Wenlei Bao and Qi Hou and Chengquan Jiang and Ningxin Zheng and Yinmin Zhong and Xuanrun Zhang and Zuquan Song and Ziheng Jiang and Haibin Lin and Xin Jin and Xin Liu},
  year={2024},
  eprint={2406.06858},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

@misc{zhang2025comet,
  title={Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts},
  author={Shulai Zhang and Ningxin Zheng and Haibin Lin and Ziheng Jiang and Wenlei Bao and Chengquan Jiang and Qi Hou and Weihao Cui and Size Zheng and Li-Wen Chang and Quan Chen and Xin Liu},
  year={2025},
  eprint={2502.19811},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}
```

Reference

About ByteDance Seed Team

Founded in 2023, ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society.

Owner

Name: Bytedance Inc.
Login: bytedance
Kind: organization
Location: Singapore

Website: https://opensource.bytedance.com
Twitter: ByteDanceOSS
Repositories: 255
Profile: https://github.com/bytedance

Committers

Last synced: about 1 year ago

All Time

Total Commits: 39
Total Committers: 7
Avg Commits per committer: 5.571
Development Distribution Score (DDS): 0.641

Past Year

Commits: 35
Committers: 7
Avg Commits per committer: 5.0
Development Distribution Score (DDS): 0.6

Top Committers

Name	Email	Commits
Ningxin Zheng	4****n	14
zhangshulai	z**i@b**m	12
Li-Wen Chang	1****z	5
Wenlei Bao	1****o	5
houqi	h**3@g**m	1
Tyler Michael Smith	t**r@n**m	1
Burness Duan	b**0@g**m	1

Committer Domains (Top 20 + Academic)

neuralmagic.com: 1
bytedance.com: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 109
Total pull requests: 58
Average time to close issues: 25 days
Average time to close pull requests: 1 day
Total issue authors: 56
Total pull request authors: 9
Average comments per issue: 2.47
Average comments per pull request: 0.22
Merged pull requests: 44
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 92
Pull requests: 39
Average time to close issues: 23 days
Average time to close pull requests: 2 days
Issue authors: 48
Pull request authors: 7
Average comments per issue: 1.84
Average comments per pull request: 0.31
Merged pull requests: 25
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

jinchen89 (17)
lancelly (8)
rajagond (7)
tlrmchlsmth (5)
sushe2111 (4)
Mulbetty (3)
sallyjunjun (3)
Dmax001 (3)
victor250214384 (3)
ll000x (3)
Rainlin007 (3)
Linus-Voss (2)
chenhongyu2048 (2)
lucifer1004 (2)
wenscarl (2)

Pull Request Authors

zheng-ningxin (27)
ZSL98 (24)
wenlei-bao (10)
houqi (5)
burness (3)
tlrmchlsmth (2)
liwenchangbdbz (2)
ad8e (1)

Top Labels

Issue Labels

bug (6) question (5) enhancement (2)

Pull Request Labels

enhancement (2)

Dependencies

setup.py pypi

torch *