cords

Reduce end to end training time from days to hours (or hours to minutes), and energy requirements/costs by an order of magnitude using coresets and data selection.

https://github.com/decile-team/cords

Science Score: 64.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
    2 of 22 committers (9.1%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.0%) to scientific vocabulary

Keywords

compute-efficient-ml deep-learning energy energy-requirements machine-learning speedups-training
Last synced: 6 months ago

Repository

Reduce end to end training time from days to hours (or hours to minutes), and energy requirements/costs by an order of magnitude using coresets and data selection.

Basic Info
Statistics
  • Stars: 340
  • Watchers: 12
  • Forks: 58
  • Open Issues: 30
  • Releases: 2
Topics
compute-efficient-ml deep-learning energy energy-requirements machine-learning speedups-training
Created about 5 years ago · Last pushed almost 3 years ago
Metadata Files
Readme License Citation

README.md

COResets and Data Subset selection


Reduce end to end training time from days to hours (or hours to minutes), and energy requirements/costs by an order of magnitude using coresets and data selection.


What is CORDS?

CORDS is a COReset and Data Selection library for making machine learning time-, energy-, cost-, and compute-efficient. CORDS is built on top of PyTorch. Today, deep learning systems are extremely compute-intensive, with significant turnaround times, energy inefficiencies, high costs, and heavy resource requirements [7, 8]. CORDS is an effort to make deep learning more energy-, cost-, resource-, and time-efficient without sacrificing accuracy. The following are the goals CORDS tries to achieve:

  • Data Efficiency
  • Reducing End-to-End Training Time
  • Reducing Energy Requirements
  • Faster Hyper-parameter Tuning
  • Reducing Resource (GPU) Requirements and Costs

The primary purpose of CORDS is to select suitable, representative data subsets from massive datasets, and it does so iteratively. CORDS uses recent advances in data subset selection, particularly ideas from coresets and submodularity, to select such subsets. CORDS implements several state-of-the-art data subset/coreset selection algorithms for efficient supervised learning (SL) and semi-supervised learning (SSL).

Some of the algorithms currently implemented in CORDS include:

For Efficient and Robust Supervised Learning:

  • GLISTER
  • GradMatch
  • CRAIG
  • SubmodularSelection (Facility Location, Feature-Based Functions, Coverage, Diversity; see the sketch after these lists)
  • RandomSelection

For Efficient and Robust Semi-supervised Learning:

  • RETRIEVE
  • GradMatch
  • CRAIG
  • RandomSelection
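To give a concrete flavor of the submodular objectives behind SubmodularSelection, below is a minimal, self-contained NumPy sketch of greedy facility location maximization. It is illustrative only and not the CORDS API; in CORDS, submodular optimization is handled by libraries such as apricot-select and submodlib.

```python
import numpy as np

def facility_location_greedy(sim, budget):
    """Greedily maximize f(S) = sum_i max_{j in S} sim[i, j] over a
    nonnegative similarity matrix, returning `budget` selected indices."""
    n = sim.shape[0]
    cur_max = np.zeros(n)          # current best coverage of each point
    selected = []
    for _ in range(budget):
        # Marginal gain of adding each candidate j to the selected set
        gains = np.maximum(sim, cur_max[:, None]).sum(axis=0) - cur_max.sum()
        gains[selected] = -np.inf  # never re-pick an already selected point
        j = int(np.argmax(gains))
        selected.append(j)
        cur_max = np.maximum(cur_max, sim[:, j])
    return selected

# Toy usage: RBF similarities over random features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
sim = np.exp(-d2 / d2.mean())
subset = facility_location_greedy(sim, budget=10)
```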

We are continuously incorporating newer and better algorithms into CORDS. Some of the features of CORDS include:

  • Reproducibility of SOTA in Data Selection and Coresets: Enable easy reproducibility of the SOTA results described above. We are also trying to add more algorithms, so if you have an algorithm you would like us to include, please let us know.
  • Benchmarking: We have benchmarked CORDS (and the algorithms present right now) on several datasets, including CIFAR-10, CIFAR-100, MNIST, SVHN, and ImageNet.
  • Ease of Use: One of the main goals of CORDS is that it is easy to use and easy to extend. Feel free to contribute to CORDS!
  • Modular design: The data selection algorithms are directly incorporated into data loaders, allowing one to use their own training loop for varied utility scenarios.
  • A broad range of use cases: CORDS is currently implemented for simple image classification tasks and hyperparameter tuning, but we are working on integrating several additional use cases like Auto-ML, object detection, speech recognition, semi-supervised learning, etc.

Highlights

  • 3x to 5x speedups, cost reduction, and energy reductions in the training of deep models in supervised learning
  • 3x+ speedups, cost/energy reduction for deep model training in semi-supervised learning
  • 3x to 30x speedups and cost/energy reduction for Hyper-parameter tuning using subset selection with SOTA schedulers (Hyperband and ASHA) and algorithms (TPE, Random)

Starting with CORDS

Pip Installation

To install the latest version of the CORDS package using PyPI:

```bash
pip install cords
```
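A quick way to verify the installation is to import one of the subset selection data loaders used later in this README (this assumes a working PyTorch installation):

```python
# Sanity check: the module path matches the usage example further below.
from cords.utils.data.dataloader.SL.adaptive import GLISTERDataLoader
print(GLISTERDataLoader.__name__)
```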

From Git Repository

To install from source:

```bash
git clone https://github.com/decile-team/cords.git
cd cords
pip install -r requirements/requirements.txt
```

First Steps

To better understand CORDS's functionality, we have provided example Jupyter notebooks and Python code in the examples folder, which can be easily executed using Google Colab. We also provide simple SL, SSL, and HPO training loops that run experiments using a provided configuration file. To run these loops, you can look at the following code examples:

Using subset selection based data loaders

Create a subset selection based data loader at train time, and use it with your own training loop.

With subset selection-based data loaders, using a subset selection strategy is straightforward: each strategy is integrated directly into its own data loader, so users can adopt a strategy simply by instantiating the corresponding loader.

Below is an example showing how the subset selection process reduces to just calling a data loader in a supervised learning setting:

```python
from dotmap import DotMap
from cords.utils.data.dataloader.SL.adaptive import GLISTERDataLoader

# Pass on the necessary arguments for GLISTERDataLoader
dss_args = dict(model=model, loss=criterion_nored, eta=0.01, num_classes=10,
                num_epochs=300, device='cuda', fraction=0.1, select_every=20,
                kappa=0, linear_layer=False, selection_type='SL',
                greedy='Stochastic')
dss_args = DotMap(dss_args)

# Create the GLISTER subset selection dataloader
dataloader = GLISTERDataLoader(trainloader, valloader, dss_args, logger,
                               batch_size=20, shuffle=True, pin_memory=False)

for epoch in range(num_epochs):
    for _, (inputs, targets, weights) in enumerate(dataloader):
        """
        Standard PyTorch training loop using a weighted loss.

        This loop differs from a standard PyTorch training loop in that, along
        with the data samples and their associated target labels, the subset
        data loader also yields per-sample weights, which can be used to
        compute a weighted loss for gradient descent via the default PyTorch
        loss functions with no reduction.
        """
```
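For concreteness, here is a standalone sketch of the weighted-loss computation described in the docstring above; the linear model, optimizer, and dummy tensors are hypothetical stand-ins for the real network and the batches yielded by the subset data loader:

```python
import torch

model = torch.nn.Linear(32, 10)                           # stand-in network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion_nored = torch.nn.CrossEntropyLoss(reduction='none')

inputs = torch.randn(20, 32)             # a batch from the subset data loader
targets = torch.randint(0, 10, (20,))    # associated target labels
weights = torch.rand(20)                 # per-sample weights from the loader

optimizer.zero_grad()
losses = criterion_nored(model(inputs), targets)   # per-sample losses
loss = (losses * weights).sum() / weights.sum()    # weighted mean loss
loss.backward()
optimizer.step()
```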

In our current version, subset selection data loaders are available for both supervised and semi-supervised learning settings.

Using the default supervised training loop:

```python
from train_sl import TrainClassifier
from cords.utils.config_utils import load_config_data

config_file = '/content/cords/configs/SL/config_glister_cifar10.py'
cfg = load_config_data(config_file)
clf = TrainClassifier(cfg)
clf.train()
```

Using the default semi-supervised training loop:

```python
from train_ssl import TrainClassifier
from cords.utils.config_utils import load_config_data

config_file = '/content/cords/configs/SSL/config_retrieve-warm_vat_cifar10.py'
cfg = load_config_data(config_file)
clf = TrainClassifier(cfg)
clf.train()
```

You can use the default configurations that we have provided in the configs folder, or you can create a custom configuration. For writing your own configuration file for training, please refer to the CORDS Configuration File Documentation.
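As an illustration of the general shape of such a file, here is a hypothetical configuration sketch; the exact keys, nesting, and accepted values should be taken from the CORDS Configuration File Documentation rather than from this example:

```python
# Hypothetical sketch of an SL configuration file (keys are illustrative).
config = dict(
    setting="SL",                                     # supervised learning
    dataset=dict(name="cifar10", datadir="./data"),   # which data to load
    model=dict(architecture="ResNet18", numclasses=10),
    dss_args=dict(type="GLISTER",                     # subset selection strategy
                  fraction=0.1,                       # keep 10% of the data
                  select_every=20),                   # reselect every 20 epochs
    train_args=dict(num_epochs=300, device="cuda"),
)
```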

Applications

Efficient Hyper-parameter Optimization (HPO)

The subset selection strategies for efficient supervised learning in CORDS allow one to train models faster, and this faster subset-based training can in turn be used for quicker configuration evaluations in hyper-parameter tuning. A detailed pipeline figure of efficient hyper-parameter tuning using subset-based training for faster configuration evaluations can be seen below:



Any existing data subset selection strategy in CORDS can currently be combined with existing hyper-parameter search and scheduling algorithms. We use the Ray Tune library for hyper-parameter tuning and search algorithms.

Please see the tutorial notebook explaining the usage of CORDS subset selection strategies for efficient hyper-parameter optimization.
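To illustrate the shape of such a pipeline, here is a minimal sketch using the classic Ray Tune function API with the ASHA scheduler. The train_with_subset objective is a hypothetical stand-in: a real version would train on a CORDS-selected subset and report validation accuracy, while the stand-in score below just keeps the sketch runnable end to end.

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_with_subset(config):
    # Stand-in for subset-based training followed by validation
    score = 1.0 / (1.0 + abs(config["lr"] - 0.01))
    tune.report(accuracy=score)

analysis = tune.run(
    train_with_subset,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    scheduler=ASHAScheduler(metric="accuracy", mode="max"),
    num_samples=10,
)
print(analysis.get_best_config(metric="accuracy", mode="max"))
```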

Speedups achieved using CORDS

To achieve significantly faster training, one can use the subset selection data loaders from CORDS while keeping the training algorithm the same. The speedups achievable with these data loaders are shown below:

Speedups in Supervised Learning



Speedups in Semi-supervised Learning



Speedups in Hyperparameter Tuning



Tutorials

We have added example Python code and tutorial notebooks under the examples folder. See this link.

Documentation

The documentation for the latest version of CORDS can always be found here.

Contributing to CORDS

We value and encourage contributions from the open-source community to enhance the CORDS library. Here are some guidelines for contributing:

  1. Report issues: If you come across any bugs or have suggestions for improvements, please raise an issue on our GitHub repository. Provide detailed information about the problem or feature request, including steps to reproduce the issue if applicable.

  2. Feature requests: If you have ideas for new features or enhancements, feel free to submit a feature request on GitHub. Clearly describe the proposed functionality and how it aligns with the goals of the CORDS library.

  3. Code contributions: We welcome code contributions to improve CORDS. If you plan to contribute code, please follow these steps:

    • Fork the CORDS repository on GitHub.
    • Create a new branch for your work based on the develop branch.
    • Make your changes and ensure they are well-documented and tested.
    • Submit a pull request, providing a clear explanation of the changes made and their purpose.
  4. Code style: When contributing code, please adhere to the existing code style and formatting conventions used in the CORDS library. Consistency in code style helps maintain readability and makes it easier to review and merge contributions.

  5. Testing: Ensure that your code changes pass the existing tests.

Mailing List

To receive updates about CORDS and to be a part of the community, join the Decile_CORDS_Dev group: https://groups.google.com/forum/#!forum/Decile_CORDS_Dev/join

Acknowledgment

This library takes inspiration from, builds upon, and uses pieces of code from several open-source codebases. These include Teppei Suzuki's consistency-based SSL repository and Richard Liaw's Tune repository. CORDS also uses submodlib for submodular optimization.

Team

CORDS is created and maintained by Krishnateja Killamsetty, Dheeraj N Bhat, Rishabh Iyer, and Ganesh Ramakrishnan. We look forward to making CORDS more community-driven. Please use it and contribute to it for your efficient learning research, and feel free to use it for your commercial projects. We will add the major contributors here.

Resources

Blog Articles

Publications

[1]: Krishnateja Killamsetty, Guttu Sai Abhishek, Aakriti, Alexandre V. Evfimievski, Lucian Popa, Ganesh Ramakrishnan, Rishabh Iyer, “AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning”. arXiv [cs.LG], 2022. arXiv:2203.08212.

[2]: Krishnateja Killamsetty, Xujiang Zhao, Feng Chen, and Rishabh Iyer, “RETRIEVE: Coreset Selection for Efficient and Robust Semi-Supervised Learning”. In Advances in Neural Information Processing Systems, NeurIPS 2021.

[3]: Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Abir De, Rishabh Iyer. “GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training”. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, 139:5464–5474. Proceedings of Machine Learning Research. PMLR, 2021.

[4]: Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Rishabh Iyer. “GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning”. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Event, February 2-9, 2021, 8110–8118. AAAI Press, 2021.

[5]: Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. “Coresets for Data-efficient Training of Machine Learning Models”. In International Conference on Machine Learning (ICML), July 2020.

[6]: Vishal Kaushal, Rishabh Iyer, Suraj Kothiwade, Rohan Mahadev, Khoshrav Doctor, and Ganesh Ramakrishnan, “Learning From Less Data: A Unified Data Subset Selection and Active Learning Framework for Computer Vision”. 7th IEEE Winter Conference on Applications of Computer Vision (WACV), 2019, Hawaii, USA.

[7]: Schwartz, Roy, et al. "Green AI." arXiv preprint arXiv:1907.10597 (2019).

[8]: Strubell, Emma, Ananya Ganesh, and Andrew McCallum. “Energy and policy considerations for deep learning in NLP.” In ACL 2019.

[9]: Kai Wei, Rishabh Iyer, Jeff Bilmes, “Submodularity in Data Subset Selection and Active Learning”. International Conference on Machine Learning (ICML), 2015.

[10]: Wei, Kai, et al. “Submodular subset selection for large-scale speech training data”. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.

Owner

  • Name: decile-team
  • Login: decile-team
  • Kind: organization
  • Email: developer@decile.org

DECILE: Data EffiCient machIne LEarning

Citation (CITATION.CFF)

cff-version: 1.1.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Killamsetty"
    given-names: "Krishnateja"
    orcid: "https://orcid.org/0000-0001-5565-9126"
  - family-names: "Bhat"
    given-names: "Dheeraj"
    orcid: "https://orcid.org/0000-0000-0000-0000"
  - family-names: "Ramakrishnan"
    given-names: "Ganesh"
    orcid: "https://orcid.org/0000-0003-4533-2490" 
  - family-names: "Iyer"
    given-names: "Rishabh"
    orcid: "https://orcid.org/0000-0001-9851-463X"  
title: "CORDS: COResets and Data Subset selection for Efficient Learning"
version: v0.0.1
date-released: 2022-03-23
url: "https://github.com/decile-team/cords"

GitHub Events

Total
  • Issues event: 3
  • Watch event: 18
  • Issue comment event: 1
  • Fork event: 4
Last Year
  • Issues event: 3
  • Watch event: 18
  • Issue comment event: 1
  • Fork event: 4

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 498
  • Total Committers: 22
  • Avg Commits per committer: 22.636
  • Development Distribution Score (DDS): 0.606
Past Year
  • Commits: 37
  • Committers: 3
  • Avg Commits per committer: 12.333
  • Development Distribution Score (DDS): 0.432
Top Committers
Name Email Commits
krishnatejakk k****y@u****u 196
Krishnateja Killamsetty 6****k 118
Dheeraj Bhat d****t@g****m 48
krishnatejakk k****y@g****m 38
Rishabh Iyer r****d@g****m 19
suraksha d****2@g****m 15
gsaiabhishek g****5@g****m 11
Krishnateja Killamsetty k****k@i****m 10
Krishnateja Killamsetty k****t@g****m 8
Krishnateja-Killamsetty1 k****1@i****m 6
krishnatejakk k****1@u****u 5
dssresearch 7****h 4
Sahasra Ranjan s****n@g****m 3
noil-reed p****2@g****m 3
atul04 a****1@g****m 3
Savan Visalpara s****7 3
Krishnateja Killamsetty t****t@m****m 2
Rishabh Iyer r****r@R****t 2
Krishnateja Killamsetty k****y@u****m 1
Aakriti a****0@g****m 1
Dennis Duan d****7 1
durga y****u@e****m 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 50
  • Total pull requests: 19
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 14 days
  • Total issue authors: 19
  • Total pull request authors: 7
  • Average comments per issue: 0.82
  • Average comments per pull request: 0.05
  • Merged pull requests: 18
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • krishnatejakk (22)
  • rishabhk108 (8)
  • meghbhalerao (2)
  • wgcban (2)
  • noilreed (2)
  • Hayoung93 (1)
  • Janghyun1230 (1)
  • chengwuxinlin (1)
  • lishaguo (1)
  • eendee (1)
  • football-prince (1)
  • HaoKang-Timmy (1)
  • youdutaidi (1)
  • INF800 (1)
  • victor-ribeiro (1)
Pull Request Authors
  • krishnatejakk (11)
  • sahasrarjn (2)
  • noilreed (2)
  • savan77 (1)
  • dheerajnbhat (1)
  • dduan97 (1)
  • ghost (1)
Top Labels
Issue Labels
enhancement (14) high priority (5) documentation (5) help wanted (3) New benchmarks (3) in progress (2) good first issue (2) normal priority (1) low priority (1) bug (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 99 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 4
  • Total maintainers: 1
pypi.org: cords

cords is a package for data subset selection for efficient and robust machine learning.

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 99 Last month
Rankings
Stargazers count: 3.7%
Forks count: 6.0%
Dependent packages count: 7.3%
Average: 12.0%
Downloads: 21.1%
Dependent repos count: 22.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/codeql-analysis.yml actions
  • actions/checkout v2 composite
  • github/codeql-action/analyze v2 composite
  • github/codeql-action/autobuild v2 composite
  • github/codeql-action/init v2 composite
.github/workflows/run_tests.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
requirements/requirements.txt pypi
  • apricot-select >=0.6.0
  • datasets *
  • dotmap *
  • matplotlib *
  • numba >=0.43.0
  • numpy >=1.19.0
  • pandas >=1.1.0
  • pillow >=8.4.0
  • pyyaml *
  • ray *
  • scikit-image >=0.17.0
  • scikit-learn *
  • scipy >=1.5.0
  • setuptools *
  • sphinx-rtd-theme *
  • sphinxcontrib-bibtex *
  • sphinxcontrib-napoleon *
  • torch >=1.8.0
  • torchtext *
  • torchvision *
  • tqdm >=4.24.0
setup.py pypi
  • apricot-select >=0.6.0
  • dotmap *
  • matplotlib *
  • numba >=0.43.0
  • numpy >=1.19.0
  • pandas >=1.1.0
  • pillow >=8.4.0
  • pyyaml *
  • ray *
  • scikit-image >=0.17.0
  • scikit-learn *
  • scipy >=1.5.0
  • setuptools >=58.0.4
  • sphinx-rtd-theme *
  • sphinxcontrib-bibtex *
  • sphinxcontrib-napoleon *
  • torch >=1.8.0
  • torchtext *
  • torchvision >=0.10.1
  • torchvision *
  • tqdm >=4.24.0