https://github.com/awslabs/optimizing-multitask-training-through-dynamic-pipelines

Official repository for the paper DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines


Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.6%) to scientific vocabulary
Last synced: 9 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: awslabs
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 542 KB
Statistics
  • Stars: 18
  • Watchers: 1
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed over 2 years ago
Metadata Files
  • Readme
  • Contributing
  • License
  • Code of conduct

README.md

Optimizing Multi-task Training through Dynamic Pipelines

Official repository for the paper DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines (Paper).

During multi-task training, the model commonly receives input sequences of widely varying lengths, because different tasks come with different contexts. Padding (extending every sequence to the same length) or packing (concatenating short examples into long sequences of the same length) is usually used to prepare input samples for training, but neither is space- or computation-efficient. This project instead uses dynamic micro-batching to handle sequence length variation. Each input global batch is split into multiple variable-length micro-batches, each comprising a (potentially different) number of samples of similar sequence lengths. These micro-batches are then organized into pipelines, enabling efficient 3D-parallel (data, tensor, and pipeline) multi-task model training.
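To make the cost of padding concrete, the following sketch counts how many tokens are processed when a batch is padded to its longest sequence, and compares a single padded batch against two batches grouped by similar length. The sequence lengths and the 128-token threshold are made-up values for illustration only; this is not the project's splitting logic.

```python
from typing import List


def padded_tokens(batches: List[List[int]]) -> int:
    """Total tokens processed when each batch is padded to its longest sequence."""
    return sum(len(b) * max(b) for b in batches if b)


# Hypothetical sequence lengths mixing short-context and long-context tasks.
lengths = [32, 48, 40, 512, 480, 36, 500, 44]
real_tokens = sum(lengths)

# Baseline: one batch, every sample padded to the global maximum length.
single = padded_tokens([lengths])

# Grouping samples of similar length (threshold of 128 is an arbitrary choice).
short = [n for n in lengths if n < 128]
long_ = [n for n in lengths if n >= 128]
grouped = padded_tokens([short, long_])

print(f"real={real_tokens} single-batch={single} grouped={grouped}")
```

In this toy example the single padded batch processes 4096 tokens for 1692 real tokens, while the grouped version processes 1776.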

Main features of this project include:

  • An efficient dynamic programming algorithm to compute the optimal micro-batching plan for each input global batch.
  • A pipeline schedule robust to variable-sized micro-batches, minimizing pipeline bubbles.
  • A pipeline executor supporting highly dynamic pipelines (the pipeline schedule, the size and number of micro-batches can vary each iteration), based on an instruction-based abstraction of pipeline operations.
  • Overlapped execution plan generation with model training.
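As a rough illustration of the first feature, the sketch below uses dynamic programming to partition sorted sequence lengths into contiguous micro-batches. The cost model here (padded tokens plus a fixed per-micro-batch overhead, under a padded-token budget) is a placeholder assumption, and the function name `split_micro_batches` is hypothetical; the project uses its own cost models and objective, which are not reproduced here.

```python
from typing import List, Tuple


def split_micro_batches(
    lengths: List[int], overhead: int, max_tokens: int
) -> Tuple[int, List[List[int]]]:
    """Partition sorted lengths into contiguous groups with minimal toy cost.

    Toy cost of one micro-batch = (#samples * longest sample) + overhead.
    A micro-batch is only allowed if its padded size fits in max_tokens.
    """
    s = sorted(lengths)
    n = len(s)
    inf = float("inf")
    best = [inf] * (n + 1)  # best[j]: minimal cost to cover the first j samples
    prev = [-1] * (n + 1)   # prev[j]: start index of the last group in that optimum
    best[0] = 0
    for j in range(1, n + 1):
        for i in range(j):
            padded = (j - i) * s[j - 1]  # s[j - 1] is the longest sample in s[i:j]
            if padded > max_tokens:
                continue
            cost = best[i] + padded + overhead
            if cost < best[j]:
                best[j], prev[j] = cost, i
    if best[n] == inf:
        raise ValueError("a single sample exceeds max_tokens")
    # Walk the back-pointers to recover the chosen groups.
    groups: List[List[int]] = []
    j = n
    while j > 0:
        groups.append(s[prev[j]:j])
        j = prev[j]
    groups.reverse()
    return best[n], groups


cost, groups = split_micro_batches(
    [32, 48, 40, 512, 480, 36, 500, 44], overhead=100, max_tokens=2048
)
print(cost, groups)
```

Sorting first means each micro-batch only ever contains neighbors in length, so the search space reduces to choosing cut points, which is what makes a quadratic-time DP sufficient for this toy objective.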

System Diagram

[Figure: system diagram]

Getting Started

Dependencies

Redis

The distributed instruction store uses Redis as the underlying key-value store. A Redis server must be installed on every machine participating in training. Our code will set up and initialize the Redis server automatically.

Note: The Redis server is not protected by authentication and may pose security risks. Please make sure that the code is only run in a secure environment.

Python Dependencies

Please see requirements.txt for the required Python packages. Install them by running

pip3 install -r requirements.txt

Installation

Clone this repository and run

pip3 install -e .

Then, build the C++ extensions by running

cd dynapipe/data_opt
make
cd ../memory_opt
python3 setup.py build

Pipeline Instructions

To use this project, the Pipeline Instructions (defined here) need to be implemented using the intended training framework (e.g., Megatron-LM). A reference implementation of the instructions in Megatron-LM can be found here.

Using this project

Please note that this project is experimental and has only been tested with Megatron-LM (please refer to the linked repository for detailed usage).

This project interacts with the training framework mainly through the following two interfaces:

Data Loader

We wrap the micro-batch splitting and execution plan generation process into a DynaPipeDataLoader. It takes the normal PyTorch data loader arguments plus a few additional ones. Please see here for the full list of arguments. The returned iterator generates a tuple of micro-batched data and the corresponding execution plan for each iteration. This iterator is to be used by the pipeline executor. See here for an example of using the DynaPipeDataLoader in Megatron-LM.
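The shape of what such an iterator yields can be sketched with a toy generator. This is not the DynaPipeDataLoader and does not use its arguments: `toy_dynamic_loader`, `max_len_gap`, and the string-based plan are all hypothetical stand-ins chosen only to show "one (micro-batches, plan) tuple per iteration, with a varying number of micro-batches".

```python
from typing import Iterator, List, Tuple

Sample = List[int]            # a tokenized sequence
MicroBatch = List[Sample]
Plan = List[str]              # placeholder; a real execution plan is a richer object


def toy_dynamic_loader(
    dataset: List[Sample], global_batch_size: int, max_len_gap: int
) -> Iterator[Tuple[List[MicroBatch], Plan]]:
    """Yield (micro_batches, plan) for each global batch of the dataset."""
    for start in range(0, len(dataset), global_batch_size):
        batch = sorted(dataset[start:start + global_batch_size], key=len)
        micro_batches: List[MicroBatch] = []
        for sample in batch:
            # Join the current micro-batch if the sample is close in length
            # to that micro-batch's shortest sample; otherwise start a new one.
            if micro_batches and len(sample) - len(micro_batches[-1][0]) <= max_len_gap:
                micro_batches[-1].append(sample)
            else:
                micro_batches.append([sample])
        count = len(micro_batches)
        plan = [f"forward:{i}" for i in range(count)]
        plan += [f"backward:{i}" for i in reversed(range(count))]
        yield micro_batches, plan


dataset = [[0] * n for n in (3, 50, 4, 52, 5, 200)]
for micro_batches, plan in toy_dynamic_loader(dataset, global_batch_size=3, max_len_gap=8):
    print([[len(s) for s in mb] for mb in micro_batches], plan)
```

Note that the two iterations produce different numbers of micro-batches, which is why the plan has to travel together with the data.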

Pipeline Executor

The pipeline executor reads in execution plans and calls the Pipeline Instruction implementations. These implementations are registered with the executor through the register_handler function. To run the pipeline executor, call the execute function with the corresponding execution plan in each iteration. See here for an example of using the pipeline executor in Megatron-LM.
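The register-then-execute pattern described above can be sketched with a toy dispatcher. The method names `register_handler` and `execute` come from the text, but the class `ToyExecutor`, the signatures, and the `(name, micro-batch index)` instruction format are assumptions made for illustration; the project's actual executor and instruction classes may look quite different.

```python
from typing import Callable, Dict, List, Tuple

Instruction = Tuple[str, int]  # assumed format: (instruction name, micro-batch index)


class ToyExecutor:
    """Minimal instruction dispatcher; a stand-in, not the project's executor."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[int], None]] = {}

    def register_handler(self, name: str, handler: Callable[[int], None]) -> None:
        # The training framework supplies one implementation per instruction type.
        self._handlers[name] = handler

    def execute(self, plan: List[Instruction]) -> None:
        # Run the instructions in plan order, dispatching each to its handler.
        for name, micro_batch in plan:
            if name not in self._handlers:
                raise KeyError(f"no handler registered for {name!r}")
            self._handlers[name](micro_batch)


trace: List[str] = []
executor = ToyExecutor()
executor.register_handler("forward", lambda mb: trace.append(f"fwd {mb}"))
executor.register_handler("backward", lambda mb: trace.append(f"bwd {mb}"))

# A different plan can be passed in every iteration without touching the handlers.
executor.execute([("forward", 0), ("forward", 1), ("backward", 0), ("backward", 1)])
print(trace)
```

Keeping the handlers separate from the plan is what lets the schedule and the number of micro-batches change per iteration while the framework-specific code stays fixed.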

Environment Variables

In addition to the two interfaces above, this project can be configured through the following environment variables:

  • DYNAPIPE_KV_HOST: The host IP of the Redis kv store server. Defaults to 'localhost'. Must be set for multi-node training.
  • DYNAPIPE_KV_PORT: The port for the Redis kv store server. Defaults to 29500.
  • DYNAPIPE_DEBUG: Logging level. Defaults to 'INFO'. Set to 'DEBUG' for more detailed logging.
  • DYNAPIPE_LOGGING_DEBUG_DIR: The directory to store all generated logs.
  • DYNAPIPE_DEBUG_DUMP_EP_STATS: If set to true, dump the generated execution plans, seen sequence lengths, shapes of the generated micro-batches, estimated memory, and simulated traces for each iteration during training. Used for debugging and for collecting statistics during our experiments.
  • DYNAPIPE_DEBUG_DUMP_EP_PREFIX: The directory for dumping the above artifacts.
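For reference, the variables and defaults listed above can be collected in one place as follows. This is a sketch of how a launcher script might read them, not the project's own parsing code: the helper name `read_dynapipe_env` is hypothetical, and treating "1" or "true" as a true value for DYNAPIPE_DEBUG_DUMP_EP_STATS is an assumption.

```python
import os
from typing import Dict, Mapping, Optional


def read_dynapipe_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Collect the DYNAPIPE_* settings, applying the documented defaults."""
    env = os.environ if env is None else env
    return {
        "kv_host": env.get("DYNAPIPE_KV_HOST", "localhost"),
        "kv_port": int(env.get("DYNAPIPE_KV_PORT", "29500")),
        "log_level": env.get("DYNAPIPE_DEBUG", "INFO"),
        "log_dir": env.get("DYNAPIPE_LOGGING_DEBUG_DIR"),
        # Assumption: "1" or "true" (any case) counts as enabled.
        "dump_ep_stats": env.get("DYNAPIPE_DEBUG_DUMP_EP_STATS", "").lower() in ("1", "true"),
        "dump_ep_prefix": env.get("DYNAPIPE_DEBUG_DUMP_EP_PREFIX"),
    }


# Example: a multi-node run pointing at a (made-up) head-node address.
settings = read_dynapipe_env({"DYNAPIPE_KV_HOST": "10.0.0.1", "DYNAPIPE_DEBUG": "DEBUG"})
print(settings)
```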

Code Structure

├── dynapipe
│   : main source folder
│   ├── data_opt
│   │   : code for micro-batch splitting and cost models
│   ├── memory_opt
│   │   : contains the modified cuda caching memory allocator
│   │     from PyTorch
│   ├── pipe
│   │   : contains implementation of pipeline instructions,
│   │     executor, and the distributed instruction store
│   ├── schedule_opt
│   │   : code for computing pipeline schedule
│   └── utils
│       : other util codes like logger
├── scripts
│   : utility scripts for various purposes
├── tests
│   : unit tests of different modules

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Owner

  • Name: Amazon Web Services - Labs
  • Login: awslabs
  • Kind: organization
  • Location: Seattle, WA

AWS Labs

GitHub Events

Total
  • Watch event: 5
  • Fork event: 1
Last Year
  • Watch event: 5
  • Fork event: 1

Issues and Pull Requests

Last synced: about 2 years ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: 9 days
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 1.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: 9 days
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 1.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • robotsp (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

dynapipe/memory_opt/setup.py pypi
requirements.txt pypi
  • numpy *
  • prtpy *
  • pybind11 *
  • pytest *
  • redis *
  • scikit-learn >=1.2.0
  • scipy *
  • setuptools *
  • sortedcontainers *
  • tqdm *
setup.py pypi