https://github.com/bytedance/bytetransformer
optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.2%) to scientific vocabulary
Keywords
Repository
optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052
Basic Info
Statistics
- Stars: 473
- Watchers: 9
- Forks: 37
- Open Issues: 4
- Releases: 0
Topics
Metadata Files
README.md
ByteTransformer: Optimized BERT Transformer Inference on NVIDIA GPUs
Introduction
ByteTransformer is a high-performance inference library for BERT-like transformers that offers the following features:
- Provides Python and C++ APIs, with the PyTorch plugin allowing users to enhance transformer inference with just a few lines of Python code.
- Supports both fixed-length and variable-length transformers.
- Includes end-to-end, architecture-aware optimizations for the padding-free algorithm across the BERT routines, including QKV encoding, softmax, the feed-forward network, activation, layer normalization, and multi-head attention (see the padding-removal sketch after this list).
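To make the padding-free idea concrete, here is a minimal PyTorch sketch of the general remove-padding technique, not ByteTransformer's actual kernels or API: valid tokens of variable-length sequences are packed into a dense tensor using a mask, the expensive matrix multiplications run only on those tokens, and the result is scattered back to the padded layout when a downstream op needs it. All shapes and names below are illustrative assumptions.

```python
import torch

# Hypothetical shapes: batch of 3 sequences padded to max_len = 6, hidden = 8.
batch, max_len, hidden = 3, 6, 8
seq_lens = torch.tensor([6, 3, 4])            # true length of each sequence
x = torch.randn(batch, max_len, hidden)       # padded input [B, S, H]

# Mask of valid (non-padding) positions, flattened into token indices.
mask = torch.arange(max_len).unsqueeze(0) < seq_lens.unsqueeze(1)   # [B, S] bool
valid_idx = mask.flatten().nonzero(as_tuple=True)[0]                # [total_tokens]

# "Remove padding": pack only valid tokens into a dense [total_tokens, H] tensor.
packed = x.reshape(-1, hidden).index_select(0, valid_idx)

# Compute-heavy work (a stand-in linear layer here) runs on packed tokens only,
# so no FLOPs are spent on padding.
w = torch.randn(hidden, hidden)
packed_out = packed @ w

# "Rebuild padding": scatter results back into the padded layout when required.
out = torch.zeros(batch * max_len, hidden)
out[valid_idx] = packed_out
out = out.reshape(batch, max_len, hidden)
```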
ByteTransformer has been widely deployed to improve in-house transformer inference serving systems at ByteDance, delivering superior performance over other transformer implementations for both fixed-length and variable-length inputs. The technical details have been published at IEEE IPDPS 2023.
Cite Us
If you use our library, please cite our research paper.
```bibtex
@article{zhai2022bytetransformer,
  title={ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs},
  author={Zhai, Yujia and Jiang, Chengquan and Wang, Leyuan and Jia, Xiaoying and Zhang, Shang and Chen, Zizhong and Liu, Xin and Zhu, Yibo},
  journal={arXiv preprint arXiv:2210.03052},
  year={2022}
}
```
Performance and Speedup
We compared ByteTransformer with PyTorch, TensorFlow, FasterTransformer, and DeepSpeed on an A100 GPU. The benchmark script is available in benchmark/bert_bench.sh.
1. Standard BERT, batch size = 1, average sequence length = 0.6 * maximum sequence length, execution time in milliseconds:
| Max seq len | PyTorch | TensorFlow | FasterTransformer | FasterTransformer with remove padding | DeepSpeed | ByteTransformer |
|------|---------|------------|-------------------|---------------------------------------|-----------|-----------------|
| 64   | 2.93 | 2.46 | 1.05 | 1.23 | 1.17 | 0.90 |
| 128  | 3.18 | 2.6  | 1.10 | 1.43 | 1.28 | 0.97 |
| 192  | 3.18 | 2.81 | 1.26 | 1.43 | 1.40 | 1.36 |
| 256  | 2.81 | 2.9  | 1.35 | 1.55 | 1.51 | 1.43 |
| 320  | 3.11 | 3.24 | 1.63 | 1.66 | 1.84 | 1.69 |
| 384  | 2.87 | 3.43 | 1.64 | 1.64 | 1.95 | 1.72 |
| 448  | 2.99 | 3.61 | 2.26 | 2.35 | 2.23 | 1.86 |
| 512  | 2.89 | 3.74 | 2.28 | 2.43 | 2.37 | 2.00 |
| 576  | 2.99 | 4.03 | 2.51 | 2.59 | 2.70 | 2.19 |
| 640  | 2.99 | 4.54 | 2.85 | 2.83 | 3.17 | 2.23 |
| 704  | 3.21 | 4.67 | 3.16 | 3.44 | 3.32 | 2.47 |
| 768  | 3.33 | 4.88 | 3.26 | 3.63 | 3.46 | 2.51 |
| 832  | 3.78 | 5.39 | 3.75 | 3.87 | 3.97 | 2.80 |
| 896  | 3.86 | 5.81 | 4.08 | 4.95 | 4.37 | 2.86 |
| 960  | 4.02 | 6.27 | 4.30 | 5.23 | 4.66 | 3.12 |
| 1024 | 4.2  | 6.37 | 4.51 | 4.96 | 4.86 | 3.16 |
2. Standard BERT, batch size = 16, average sequence length = 0.6 * maximum sequence length, execution time in milliseconds:
| Max seq len | PyTorch | TensorFlow | FasterTransformer | FasterTransformer with remove padding | DeepSpeed | ByteTransformer |
|------|---------|------------|-------------------|---------------------------------------|-----------|-----------------|
| 64   | 3.2   | 4.57  | 2.24  | 1.93  | 2.81  | 2.09  |
| 128  | 4.97  | 6.97  | 3.62  | 3.33  | 4.54  | 3.18  |
| 192  | 7.65  | 9.37  | 5.26  | 5.29  | 6.68  | 5.08  |
| 256  | 9.56  | 12.17 | 6.77  | 5.49  | 9.03  | 6.85  |
| 320  | 13.21 | 15.87 | 8.85  | 6.47  | 12.81 | 7.49  |
| 384  | 15.01 | 18.56 | 10.37 | 7.05  | 15.19 | 8.44  |
| 448  | 19.06 | 23.01 | 15.97 | 12.54 | 18.83 | 8.89  |
| 512  | 21    | 26.03 | 18.03 | 13.79 | 21.55 | 9.22  |
| 576  | 24.33 | 31.24 | 21.11 | 17.65 | 26.2  | 10.15 |
| 640  | 28.03 | 35.07 | 24.52 | 20.34 | 30.24 | 12.04 |
| 704  | 32.33 | 41.43 | 28.94 | 24.52 | 34.65 | 13.55 |
| 768  | 35.31 | 44.62 | 32.09 | 28.21 | 37.95 | 16.3  |
| 832  | 40.75 | 51.87 | 36.33 | 31.69 | 45.32 | 16.92 |
| 896  | 44.47 | 55.65 | 42.17 | 38.05 | 49.48 | 20.67 |
| 960  | 49.72 | 63.59 | 47.01 | 42.98 | 55.72 | 23.27 |
| 1024 | 53.21 | 65.94 | 50.28 | 45.22 | 59.96 | 24.70 |
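As a quick check on the batch-16 numbers above, at maximum sequence length 1024 ByteTransformer takes 24.70 ms versus 50.28 ms for FasterTransformer, roughly a 2x speedup. The values below are taken directly from the table:

```python
# Speedup over FasterTransformer at batch 16, max seq len 1024 (milliseconds from the table above).
faster_transformer_ms = 50.28
byte_transformer_ms = 24.70
print(f"speedup: {faster_transformer_ms / byte_transformer_ms:.2f}x")  # -> speedup: 2.04x
```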
Supported Models
Currently, only the standard BERT transformer encoder is available in this repository.
Environment requirements
- CUDA: 11.6
- CMake: >= 3.13
- PyTorch: >= 1.8
- GPU compute capability: 7.0 (V100), 7.5 (T4), or 8.0 (A100)
- Python: >= 3.7
Tested on: A100 + CUDA 11.6 + PyTorch 1.13.0+cu116 + Python 3.9.16
Building from Source
To build from source, run the following commands:
```bash
git submodule update --init
mkdir build && cd build
cmake -DTORCH_CUDA_ARCH_LIST="8.0" -DDataType=FP16 -DBUILD_THS=ON -DCUDAARCHS="80" ..
make
```
Getting Started with Unit Tests
Unit Tests in C++
To generate test data, run the following commands:

```bash
cd build
# batch_size = 16, seq_len = 64, head_num = 12, head_size = 64, avg_seqlen = 32
python3 bert_transformer_test.py 16 64 12 64 --avg_seqlen 32 --dtype fp16 --export_data
```
Here, 16, 64, 12, and 64 represent batch size, sequence length, number of heads, and head size, respectively. The --avg_seqlen 32 flag is used to set the average sequence length, --dtype fp16 sets the data type, and --export_data exports the test data.
After test data is generated (*.in and *.out files are saved under the current directory), run the following command:
```bash
./bin/bert_transformer_test 16 64 12 64
```
Here, the arguments represent the same parameters as used in generating the test data.
Unit Tests for the PyTorch Plugin in Python
To run the unit tests for the PyTorch plugin in Python, use the same script as for C++, but without the --export_data flag. Run the following command in the terminal:
```bash
# batch_size = 16, seq_len = 64, head_num = 12, head_size = 64, avg_seqlen = 32
python3 bert_transformer_test.py 16 64 12 64 --avg_seqlen 32 --dtype fp16
```
Again, the arguments represent the same parameters as used in generating the test data.
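For orientation, loading the compiled plugin from your own Python code generally follows PyTorch's standard custom-op pattern sketched below. The shared-library path and op namespace in this sketch are hypothetical placeholders, not ByteTransformer's documented names; consult the build output and the test script above for the real ones.

```python
import torch

# Hypothetical sketch of the standard PyTorch custom-op loading pattern.
# The .so path and op namespace below are placeholder assumptions, not the
# library's documented names.
torch.ops.load_library("build/lib/libths_bytetransformer.so")

# Once loaded, the registered op(s) become callable as torch.ops.<namespace>.<op>
# and can be mixed with ordinary PyTorch modules in an inference loop.
```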
Benchmark
```bash
cd build
../benchmark/bert_bench.sh
```
Owner
- Name: Bytedance Inc.
- Login: bytedance
- Kind: organization
- Location: Singapore
- Website: https://opensource.bytedance.com
- Twitter: ByteDanceOSS
- Repositories: 255
- Profile: https://github.com/bytedance
GitHub Events
Total
- Watch event: 21
- Fork event: 4
Last Year
- Watch event: 21
- Fork event: 4
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| jiangchengquan | j****n@b****m | 9 |
| liuxin.ai | l****i@b****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 10
- Total pull requests: 2
- Average time to close issues: 7 days
- Average time to close pull requests: about 19 hours
- Total issue authors: 10
- Total pull request authors: 2
- Average comments per issue: 1.9
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- niezhiyang (1)
- woskii (1)
- chenhongyu2048 (1)
- qingchanghan (1)
- zhanghaoie (1)
- xiyuecangxin (1)
- LHengyi (1)
- aska-0096 (1)
- GuWei007 (1)
Pull Request Authors
- Meteorix (2)
- aska-0096 (1)