https://github.com/bytedance/byteps

A high performance and generic framework for distributed DNN training

Keywords

deep-learning distributed-training keras machine-learning mxnet pytorch tensorflow

Last synced: 6 months ago · JSON representation

Repository

A high performance and generic framework for distributed DNN training

Basic Info

Host: GitHub
Owner: bytedance
License: other
Language: Python
Default Branch: master
Homepage:
Size: 20.2 MB

Statistics

Stars: 3,697
Watchers: 82
Forks: 494
Open Issues: 113
Releases: 1

Topics

deep-learning distributed-training keras machine-learning mxnet pytorch tensorflow

Created over 6 years ago · Last pushed over 2 years ago

Metadata Files

Readme Changelog Contributing License

BytePS

Pypi

BytePS is a high performance and general distributed training framework. It supports TensorFlow, Keras, PyTorch, and MXNet, and can run on either TCP or RDMA network.

BytePS outperforms existing open-sourced distributed training frameworks by a large margin. For example, on BERT-large training, BytePS can achieve ~90% scaling efficiency with 256 GPUs (see below), which is much higher than Horovod+NCCL. In certain scenarios, BytePS can double the training speed compared with Horovod+NCCL.

News

BytePS paper has been accepted to OSDI'20. The code to reproduce the end-to-end evaluation is available here.
Support gradient compression.
v0.2.4
- Fix compatibility issue with tf2 + standalone keras
- Add support for tensorflow.keras
- Improve robustness of broadcast
v0.2.3
- Add DistributedDataParallel module for PyTorch
- Fix the problem of different CPU tensor using the same name
- Add skip_synchronize api for PyTorch
- Add the option for lazy/non-lazy init
v0.2.0
- Largely improve RDMA performance by enforcing page aligned memory.
- Add IPC support for RDMA. Now support colocating servers and workers without sacrificing much performance.
- Fix a hanging bug in BytePS server.
- Fix RDMA-related segmentation fault problem during fork() (e.g., used by PyTorch data loader).
- New feature: Enable mixing use of colocate and non-colocate servers, along with a smart tensor allocation strategy.
- New feature: Add bpslaunch as the command to launch tasks.
- Add support for pip install: pip3 install byteps

Performance

We show our experiment on BERT-large training, which is based on GluonNLP toolkit. The model uses mixed precision.

We use Tesla V100 32GB GPUs and set batch size equal to 64 per GPU. Each machine has 8 V100 GPUs (32GB memory) with NVLink-enabled. Machines are inter-connected with 100 Gbps RDMA network. This is the same hardware setup you can get on AWS.

BytePS achieves ~90% scaling efficiency for BERT-large with 256 GPUs. The code is available here. As a comparison, Horovod+NCCL has only ~70% scaling efficiency even after expert parameter tunning.

BERT-Large

With slower network, BytePS offers even more performance advantages -- up to 2x of Horovod+NCCL. You can find more evaluation results at performance.md.

Goodbye MPI, Hello Cloud

How can BytePS outperform Horovod by so much? One of the main reasons is that BytePS is designed for cloud and shared clusters, and throws away MPI.

MPI was born in the HPC world and is good for a cluster built with homogeneous hardware and for running a single job. However, cloud (or in-house shared clusters) is different.

This leads us to rethink the best communication strategy, as explained in here. In short, BytePS only uses NCCL inside a machine, while re-implements the inter-machine communication.

BytePS also incorporates many acceleration techniques such as hierarchical strategy, pipelining, tensor partitioning, NUMA-aware local communication, priority-based scheduling, etc.

Quick Start

We provide a step-by-step tutorial for you to run benchmark training tasks. The simplest way to start is to use our docker images. Refer to Documentations for how to launch distributed jobs and more detailed configurations. After you can start BytePS, read best practice to get the best performance.

Below, we explain how to install BytePS by yourself. There are two options.

Install by pip

pip3 install byteps

Build from source code

You can try out the latest features by directly installing from master branch:

git clone --recursive https://github.com/bytedance/byteps cd byteps python3 setup.py install

Notes for above two options: - BytePS assumes that you have already installed one or more of the following frameworks: TensorFlow / PyTorch / MXNet. - BytePS depends on CUDA and NCCL. You should specify the NCCL path with export BYTEPS_NCCL_HOME=/path/to/nccl. By default it points to /usr/local/nccl. - The installation requires gcc>=4.9. If you are working on CentOS/Redhat and have gcc<4.9, you can try yum install devtoolset-7 before everything else. In general, we recommend using gcc 4.9 for best compatibility (how to pin gcc). - RDMA support: During setup, the script will automatically detect the RDMA header file. If you want to use RDMA, make sure your RDMA environment has been properly installed and tested before install (install on Ubuntu-18.04).

Examples

Basic examples are provided under the example folder.

To reproduce the end-to-end evaluation in our OSDI'20 paper, find the code at this repo.

Use BytePS in Your Code

Though being totally different at its core, BytePS is highly compatible with Horovod interfaces (Thank you, Horovod community!). We chose Horovod interfaces in order to minimize your efforts for testing BytePS.

If your tasks only rely on Horovod's allreduce and broadcast, you should be able to switch to BytePS in 1 minute. Simply replace import horovod.tensorflow as hvd by import byteps.tensorflow as bps, and then replace all hvd in your code by bps. If your code invokes hvd.allreduce directly, you should also replace it by bps.push_pull.

Many of our examples were copied from Horovod and modified in this way. For instance, compare the MNIST example for BytePS and Horovod.

BytePS also supports other native APIs, e.g., PyTorch Distributed Data Parallel and TensorFlow Mirrored Strategy. See DistributedDataParallel.md and MirroredStrategy.md for usage.

Limitations and Future Plans

BytePS does not support pure CPU training for now. One reason is that the cheap PS assumption of BytePS do not hold for CPU training. Consequently, you need CUDA and NCCL to build and run BytePS.

We would like to have below features, and there is no fundamental difficulty to implement them in BytePS architecture. However, they are not implemented yet: * Sparse model training * Fault-tolerance * Straggler-mitigation

Publications

[OSDI'20] "A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters". Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, Chuanxiong Guo.
[SOSP'19] "A Generic Communication Scheduler for Distributed DNN Training Acceleration". Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, Chuanxiong Guo. (Code is at bytescheduler branch)

Owner

Name: Bytedance Inc.
Login: bytedance
Kind: organization
Location: Singapore

Website: https://opensource.bytedance.com
Twitter: ByteDanceOSS
Repositories: 255
Profile: https://github.com/bytedance

GitHub Events

Total

Watch event: 91
Issue comment event: 1
Member event: 2
Fork event: 12

Last Year

Watch event: 91
Issue comment event: 1
Member event: 2
Fork event: 12

Committers

Last synced: 12 months ago

All Time

Total Commits: 426
Total Committers: 20
Avg Commits per committer: 21.3
Development Distribution Score (DDS): 0.575

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
jiangyimin	j**n@b**m	181
Yibo Zhu	z**o@b**m	140
Yulu Jia	6****t	41
Chang Lan	l**g@b**m	27
Haibin Lin	l**c@g**m	9
Yuchen Zhong	i**n@g**m	7
yhpeng-git	w**h@g**m	4
yhpeng	y**g@c**k	3
Maybe	h**u@g**m	2
Chang Lan	c**n@e**u	2
Juncheng Gu	6****u	1
Hugo Zhang	6**9@q**m	1
Heyan Liu	j**7@g**m	1
Hanpeng	j**r@g**m	1
Junxian Ye	y**d@g**m	1
VincentLeeMax	l**e@y**t	1
Yifan Wu	y**u@p**n	1
dbonner	2****r	1
gongwei-130	5****0	1
laochanlam	l**m@g**m	1

Committer Domains (Top 20 + Academic)

bytedance.com: 3 pku.edu.cn: 1 yeah.net: 1 qq.com: 1 eecs.berkeley.edu: 1 cs.hku.hk: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 69
Total pull requests: 35
Average time to close issues: about 2 months
Average time to close pull requests: about 2 months
Total issue authors: 48
Total pull request authors: 9
Average comments per issue: 4.58
Average comments per pull request: 1.49
Merged pull requests: 24
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

wuyujiji (6)
Ruinhuang (6)
anj-s (3)
showerage (3)
hamidralmasi (2)
QingQingR (2)
starkeisntein (2)
qingfengmingyue (2)
ruipeterpan (2)
themoonstone (2)
powermano (2)
hengruo (1)
jazka (1)
YouhuiBai (1)
liaopeiyuan (1)

Pull Request Authors

pleasantrabbit (20)
ymjiang (4)
jasperzhong (3)
dbonner (2)
rainj-me (2)
VincentLeeMax (2)
anj-s (1)
oliverhu (1)
gongwei-130 (1)

Top Labels

Issue Labels

distributed (1) bug (1) bytescheduler (1) enhancement (1) good first issue (1) bps.torch.ddp (1)

Pull Request Labels

Packages

Total packages: 2
Total downloads:
- pypi 63 last-month
Total docker downloads: 54

Total dependent packages: 0
(may contain duplicates)
Total dependent repositories: 1
(may contain duplicates)
Total versions: 10
Total maintainers: 1

proxy.golang.org: github.com/bytedance/byteps

Documentation: https://pkg.go.dev/github.com/bytedance/byteps#section-documentation
License: other
Latest release: v0.2.5
published over 5 years ago

Versions: 4
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 5.4%

Average: 5.6%

Dependent repos count: 5.8%

Last synced: 6 months ago

pypi.org: byteps

A high-performance cross-framework Parameter Server for Deep Learning

Homepage: https://github.com/bytedance/byteps
Documentation: https://byteps.readthedocs.io/
License: Apache
Latest release: 0.2.4
published over 5 years ago

Versions: 6
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 63 Last month
Docker Downloads: 54

Rankings

Stargazers count: 1.3%

Forks count: 2.5%

Docker downloads count: 3.0%

Dependent packages count: 10.0%

Average: 10.1%

Dependent repos count: 21.8%

Downloads: 22.3%

Maintainers (1)

byteps

Last synced: 6 months ago

https://github.com/bytedance/byteps

Science Score: 10.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

BytePS

News

Performance

Goodbye MPI, Hello Cloud

Quick Start

Install by pip

Build from source code

Examples

Use BytePS in Your Code

Limitations and Future Plans

Publications

Owner

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Committer Domains (Top 20 + Academic)

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

proxy.golang.org: github.com/bytedance/byteps

Rankings

pypi.org: byteps

Rankings

Maintainers (1)