cords

Reduce end to end training time from days to hours (or hours to minutes), and energy requirements/costs by an order of magnitude using coresets and data selection.

https://github.com/decile-team/cords

Science Score: 64.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
    2 of 22 committers (9.1%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.0%) to scientific vocabulary

Keywords

compute-efficient-ml deep-learning energy energy-requirements machine-learning speedups-training
Last synced: 6 months ago

Repository

Reduce end to end training time from days to hours (or hours to minutes), and energy requirements/costs by an order of magnitude using coresets and data selection.

Basic Info
Statistics
  • Stars: 340
  • Watchers: 12
  • Forks: 58
  • Open Issues: 30
  • Releases: 2
Topics
compute-efficient-ml deep-learning energy energy-requirements machine-learning speedups-training
Created about 5 years ago · Last pushed almost 3 years ago
Metadata Files
Readme License Citation

README.md

COResets and Data Subset selection


Reduce end to end training time from days to hours (or hours to minutes), and energy requirements/costs by an order of magnitude using coresets and data selection.


What is CORDS?

CORDS is a COReset and Data Selection library for making machine learning time-, energy-, cost-, and compute-efficient. CORDS is built on top of PyTorch. Today, deep learning systems are extremely compute-intensive, with significant turnaround times, energy inefficiencies, high costs, and heavy resource requirements [7, 8]. CORDS is an effort to make deep learning more energy-, cost-, resource-, and time-efficient without sacrificing accuracy. The following are the goals CORDS tries to achieve:

  • Data Efficiency
  • Reducing End-to-End Training Time
  • Reducing Energy Requirements
  • Faster Hyper-parameter Tuning
  • Reducing Resource (GPU) Requirements and Costs

The primary purpose of CORDS is to select suitable, representative data subsets from massive datasets, and it does so iteratively. CORDS uses recent advances in data subset selection, particularly ideas from coresets and submodularity, to select such subsets. CORDS implements several state-of-the-art data subset/coreset selection algorithms for efficient supervised learning (SL) and semi-supervised learning (SSL).

Some of the algorithms currently implemented in CORDS include:

For Efficient and Robust Supervised Learning:

  • GLISTER
  • GradMatch
  • CRAIG
  • SubmodularSelection (Facility Location, Feature-Based Functions, Coverage, Diversity; see the sketch after these lists)
  • RandomSelection

For Efficient and Robust Semi-supervised Learning:

  • RETRIEVE
  • GradMatch
  • CRAIG
  • RandomSelection
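To give a concrete flavor of the submodular objectives behind SubmodularSelection, below is a minimal, self-contained NumPy sketch of greedy facility location maximization. It is illustrative only and not the CORDS API; in CORDS, submodular optimization is handled by libraries such as apricot-select and submodlib.

```python
import numpy as np

def facility_location_greedy(sim, budget):
    """Greedily maximize f(S) = sum_i max_{j in S} sim[i, j] over a
    nonnegative similarity matrix, returning `budget` selected indices."""
    n = sim.shape[0]
    cur_max = np.zeros(n)          # current best coverage of each point
    selected = []
    for _ in range(budget):
        # Marginal gain of adding each candidate j to the selected set
        gains = np.maximum(sim, cur_max[:, None]).sum(axis=0) - cur_max.sum()
        gains[selected] = -np.inf  # never re-pick an already selected point
        j = int(np.argmax(gains))
        selected.append(j)
        cur_max = np.maximum(cur_max, sim[:, j])
    return selected

# Toy usage: RBF similarities over random features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
sim = np.exp(-d2 / d2.mean())
subset = facility_location_greedy(sim, budget=10)
```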

We are continuously incorporating newer and better algorithms into CORDS. Some of the features of CORDS include:

  • Reproducibility of SOTA in Data Selection and Coresets: Enable easy reproducibility of the SOTA results described above. We are also trying to add more algorithms, so if you have an algorithm you would like us to include, please let us know.
  • Benchmarking: We have benchmarked CORDS (and the algorithms present right now) on several datasets, including CIFAR-10, CIFAR-100, MNIST, SVHN, and ImageNet.
  • Ease of Use: One of the main goals of CORDS is that it is easy to use and easy to extend. Feel free to contribute to CORDS!
  • Modular design: The data selection algorithms are directly incorporated into data loaders, allowing one to use their own training loop for varied utility scenarios.
  • A broad range of use cases: CORDS is currently implemented for simple image classification tasks and hyperparameter tuning, but we are working on integrating several additional use cases like Auto-ML, object detection, speech recognition, semi-supervised learning, etc.

Highlights

  • 3x to 5x speedups, cost reduction, and energy reductions in the training of deep models in supervised learning
  • 3x+ speedups, cost/energy reduction for deep model training in semi-supervised learning
  • 3x to 30x speedups and cost/energy reduction for Hyper-parameter tuning using subset selection with SOTA schedulers (Hyperband and ASHA) and algorithms (TPE, Random)

Starting with CORDS

Pip Installation

To install the latest version of the CORDS package using PyPI:

```bash
pip install cords
```
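A quick way to verify the installation is to import one of the subset selection data loaders used later in this README (this assumes a working PyTorch installation):

```python
# Sanity check: the module path matches the usage example further below.
from cords.utils.data.dataloader.SL.adaptive import GLISTERDataLoader
print(GLISTERDataLoader.__name__)
```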

From Git Repository

To install from source:

```bash
git clone https://github.com/decile-team/cords.git
cd cords
pip install -r requirements/requirements.txt
```

First Steps

To better understand CORDS's functionality, we have provided example Jupyter notebooks and Python code in the examples folder, which can be easily executed using Google Colab. We also provide simple SL, SSL, and HPO training loops that run experiments using a provided configuration file. To run these loops, you can look at the following code examples:

Using subset selection based data loaders

Create a subset selection based data loader at train time, and use it with your own training loop.

With subset selection-based data loaders, using a subset selection strategy is straightforward: each strategy is integrated directly into its own data loader, so users can adopt a strategy simply by instantiating the corresponding loader.

Below is an example showing how the subset selection process reduces to just calling a data loader in a supervised learning setting:

```python
from dotmap import DotMap
from cords.utils.data.dataloader.SL.adaptive import GLISTERDataLoader

# Pass on the necessary arguments for GLISTERDataLoader
dss_args = dict(model=model, loss=criterion_nored, eta=0.01, num_classes=10,
                num_epochs=300, device='cuda', fraction=0.1, select_every=20,
                kappa=0, linear_layer=False, selection_type='SL',
                greedy='Stochastic')
dss_args = DotMap(dss_args)

# Create the GLISTER subset selection dataloader
dataloader = GLISTERDataLoader(trainloader, valloader, dss_args, logger,
                               batch_size=20, shuffle=True, pin_memory=False)

for epoch in range(num_epochs):
    for _, (inputs, targets, weights) in enumerate(dataloader):
        """
        Standard PyTorch training loop using a weighted loss.

        This loop differs from a standard PyTorch training loop in that, along
        with the data samples and their associated target labels, the subset
        data loader also yields per-sample weights, which can be used to
        compute a weighted loss for gradient descent via the default PyTorch
        loss functions with no reduction.
        """
```
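For concreteness, here is a standalone sketch of the weighted-loss computation described in the docstring above; the linear model, optimizer, and dummy tensors are hypothetical stand-ins for the real network and the batches yielded by the subset data loader:

```python
import torch

model = torch.nn.Linear(32, 10)                           # stand-in network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion_nored = torch.nn.CrossEntropyLoss(reduction='none')

inputs = torch.randn(20, 32)             # a batch from the subset data loader
targets = torch.randint(0, 10, (20,))    # associated target labels
weights = torch.rand(20)                 # per-sample weights from the loader

optimizer.zero_grad()
losses = criterion_nored(model(inputs), targets)   # per-sample losses
loss = (losses * weights).sum() / weights.sum()    # weighted mean loss
loss.backward()
optimizer.step()
```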

In our current version, subset selection data loaders are available for both supervised and semi-supervised learning settings.

Using the default supervised training loop:

```python
from train_sl import TrainClassifier
from cords.utils.config_utils import load_config_data

config_file = '/content/cords/configs/SL/config_glister_cifar10.py'
cfg = load_config_data(config_file)
clf = TrainClassifier(cfg)
clf.train()
```

Using the default semi-supervised training loop:

```python
from train_ssl import TrainClassifier
from cords.utils.config_utils import load_config_data

config_file = '/content/cords/configs/SSL/config_retrieve-warm_vat_cifar10.py'
cfg = load_config_data(config_file)
clf = TrainClassifier(cfg)
clf.train()
```

You can use the default configurations that we have provided in the configs folder, or you can create a custom configuration. For writing your own configuration file for training, please refer to the CORDS Configuration File Documentation.
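As an illustration of the general shape of such a file, here is a hypothetical configuration sketch; the exact keys, nesting, and accepted values should be taken from the CORDS Configuration File Documentation rather than from this example:

```python
# Hypothetical sketch of an SL configuration file (keys are illustrative).
config = dict(
    setting="SL",                                     # supervised learning
    dataset=dict(name="cifar10", datadir="./data"),   # which data to load
    model=dict(architecture="ResNet18", numclasses=10),
    dss_args=dict(type="GLISTER",                     # subset selection strategy
                  fraction=0.1,                       # keep 10% of the data
                  select_every=20),                   # reselect every 20 epochs
    train_args=dict(num_epochs=300, device="cuda"),
)
```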

Applications

Efficient Hyper-parameter Optimization (HPO)

The subset selection strategies for efficient supervised learning in CORDS allow one to train models faster, and this faster subset-based training can in turn be used for quicker configuration evaluations in hyper-parameter tuning. A detailed pipeline figure of efficient hyper-parameter tuning using subset-based training for faster configuration evaluations can be seen below:



Any existing data subset selection strategy in CORDS can currently be combined with existing hyper-parameter search and scheduling algorithms. We use the Ray Tune library for hyper-parameter tuning and search algorithms.

Please see the tutorial notebook explaining the usage of CORDS subset selection strategies for efficient hyper-parameter optimization.
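To illustrate the shape of such a pipeline, here is a minimal sketch using the classic Ray Tune function API with the ASHA scheduler. The train_with_subset objective is a hypothetical stand-in: a real version would train on a CORDS-selected subset and report validation accuracy, while the stand-in score below just keeps the sketch runnable end to end.

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_with_subset(config):
    # Stand-in for subset-based training followed by validation
    score = 1.0 / (1.0 + abs(config["lr"] - 0.01))
    tune.report(accuracy=score)

analysis = tune.run(
    train_with_subset,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    scheduler=ASHAScheduler(metric="accuracy", mode="max"),
    num_samples=10,
)
print(analysis.get_best_config(metric="accuracy", mode="max"))
```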

Speedups achieved using CORDS

To achieve significantly faster training, one can use the subset selection data loaders from CORDS while keeping the training algorithm the same. The speedups achievable with these data loaders are shown below:

Speedups in Supervised Learning



Speedups in Semi-supervised Learning



Speedups in Hyperparameter Tuning



Tutorials

We have added example Python code and tutorial notebooks under the examples folder. See this link.

Documentation

The documentation for the latest version of CORDS can always be found here.

Contributing to CORDS

We value and encourage contributions from the open-source community to enhance the CORDS library. Here are some guidelines for contributing:

  1. Report issues: If you come across any bugs or have suggestions for improvements, please raise an issue on our GitHub repository. Provide detailed information about the problem or feature request, including steps to reproduce the issue if applicable.

  2. Feature requests: If you have ideas for new features or enhancements, feel free to submit a feature request on GitHub. Clearly describe the proposed functionality and how it aligns with the goals of the CORDS library.

  3. Code contributions: We welcome code contributions to improve CORDS. If you plan to contribute code, please follow these steps:

    • Fork the CORDS repository on GitHub.
    • Create a new branch for your work based on the develop branch.
    • Make your changes and ensure they are well-documented and tested.
    • Submit a pull request, providing a clear explanation of the changes made and their purpose.
  4. Code style: When contributing code, please adhere to the existing code style and formatting conventions used in the CORDS library. Consistency in code style helps maintain readability and makes it easier to review and merge contributions.

  5. Testing: Ensure that your code changes pass the existing tests.

Mailing List

To receive updates about CORDS and to be a part of the community, join the Decile_CORDS_Dev group: https://groups.google.com/forum/#!forum/Decile_CORDS_Dev/join

Acknowledgment

This library takes inspiration from, builds upon, and uses pieces of code from several open-source codebases. These include Teppei Suzuki's consistency-based SSL repository and Richard Liaw's Tune repository. CORDS also uses submodlib for submodular optimization.

Team

CORDS is created and maintained by Krishnateja Killamsetty, Dheeraj N Bhat, Rishabh Iyer, and Ganesh Ramakrishnan. We look forward to making CORDS more community-driven. Please use it and contribute to it for your efficient learning research, and feel free to use it for your commercial projects. We will add the major contributors here.

Resources

Blog Articles

Publications

[1]: Krishnateja Killamsetty, Guttu Sai Abhishek, Aakriti, Alexandre V. Evfimievski, Lucian Popa, Ganesh Ramakrishnan, Rishabh Iyer, “AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning”. arXiv [cs.LG], 2022. arXiv:2203.08212.

[2]: Krishnateja Killamsetty, Xujiang Zhao, Feng Chen, and Rishabh Iyer, “RETRIEVE: Coreset Selection for Efficient and Robust Semi-Supervised Learning”. In Advances in Neural Information Processing Systems, NeurIPS 2021.

[3]: Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Abir De, Rishabh Iyer. “GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training”. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, 139:5464–5474. Proceedings of Machine Learning Research. PMLR, 2021.

[4]: Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Rishabh Iyer. “GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning”. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Event, February 2-9, 2021, 8110–8118. AAAI Press, 2021.

[5]: Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. “Coresets for Data-efficient Training of Machine Learning Models”. In International Conference on Machine Learning (ICML), July 2020.

[6]: Vishal Kaushal, Rishabh Iyer, Suraj Kothiwade, Rohan Mahadev, Khoshrav Doctor, and Ganesh Ramakrishnan, “Learning From Less Data: A Unified Data Subset Selection and Active Learning Framework for Computer Vision”. 7th IEEE Winter Conference on Applications of Computer Vision (WACV), 2019, Hawaii, USA.

[7]: Schwartz, Roy, et al. "Green AI." arXiv preprint arXiv:1907.10597 (2019).

[8]: Strubell, Emma, Ananya Ganesh, and Andrew McCallum. “Energy and policy considerations for deep learning in NLP.” In ACL 2019.

[9]: Kai Wei, Rishabh Iyer, Jeff Bilmes, “Submodularity in Data Subset Selection and Active Learning”. International Conference on Machine Learning (ICML), 2015.

[10]: Wei, Kai, et al. “Submodular subset selection for large-scale speech training data”. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.

Owner

  • Name: decile-team
  • Login: decile-team
  • Kind: organization
  • Email: developer@decile.org

DECILE: Data EffiCient machIne LEarning

Citation (CITATION.CFF)

cff-version: 1.1.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Killamsetty"
    given-names: "Krishnateja"
    orcid: "https://orcid.org/0000-0001-5565-9126"
  - family-names: "Bhat"
    given-names: "Dheeraj"
    orcid: "https://orcid.org/0000-0000-0000-0000"
  - family-names: "Ramakrishnan"
    given-names: "Ganesh"
    orcid: "https://orcid.org/0000-0003-4533-2490" 
  - family-names: "Iyer"
    given-names: "Rishabh"
    orcid: "https://orcid.org/0000-0001-9851-463X"  
title: "CORDS: COResets and Data Subset selection for Efficient Learning"
version: v0.0.1
date-released: 2022-03-23
url: "https://github.com/decile-team/cords"

GitHub Events

Total
  • Issues event: 3
  • Watch event: 18
  • Issue comment event: 1
  • Fork event: 4
Last Year
  • Issues event: 3
  • Watch event: 18
  • Issue comment event: 1
  • Fork event: 4

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 498
  • Total Committers: 22
  • Avg Commits per committer: 22.636
  • Development Distribution Score (DDS): 0.606
Past Year
  • Commits: 37
  • Committers: 3
  • Avg Commits per committer: 12.333
  • Development Distribution Score (DDS): 0.432
Top Committers
Name Email Commits
krishnatejakk k****y@u****u 196
Krishnateja Killamsetty 6****k 118
Dheeraj Bhat d****t@g****m 48
krishnatejakk k****y@g****m 38
Rishabh Iyer r****d@g****m 19
suraksha d****2@g****m 15
gsaiabhishek g****5@g****m 11
Krishnateja Killamsetty k****k@i****m 10
Krishnateja Killamsetty k****t@g****m 8
Krishnateja-Killamsetty1 k****1@i****m 6
krishnatejakk k****1@u****u 5
dssresearch 7****h 4
Sahasra Ranjan s****n@g****m 3
noil-reed p****2@g****m 3
atul04 a****1@g****m 3
Savan Visalpara s****7 3
Krishnateja Killamsetty t****t@m****m 2
Rishabh Iyer r****r@R****t 2
Krishnateja Killamsetty k****y@u****m 1
Aakriti a****0@g****m 1
Dennis Duan d****7 1
durga y****u@e****m 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 50
  • Total pull requests: 19
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 14 days
  • Total issue authors: 19
  • Total pull request authors: 7
  • Average comments per issue: 0.82
  • Average comments per pull request: 0.05
  • Merged pull requests: 18
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • krishnatejakk (22)
  • rishabhk108 (8)
  • meghbhalerao (2)
  • wgcban (2)
  • noilreed (2)
  • Hayoung93 (1)
  • Janghyun1230 (1)
  • chengwuxinlin (1)
  • lishaguo (1)
  • eendee (1)
  • football-prince (1)
  • HaoKang-Timmy (1)
  • youdutaidi (1)
  • INF800 (1)
  • victor-ribeiro (1)
Pull Request Authors
  • krishnatejakk (11)
  • sahasrarjn (2)
  • noilreed (2)
  • savan77 (1)
  • dheerajnbhat (1)
  • dduan97 (1)
  • ghost (1)
Top Labels
Issue Labels
enhancement (14) high priority (5) documentation (5) help wanted (3) New benchmarks (3) in progress (2) good first issue (2) normal priority (1) low priority (1) bug (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 99 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 4
  • Total maintainers: 1
pypi.org: cords

cords is a package for data subset selection for efficient and robust machine learning.

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 99 Last month
Rankings
Stargazers count: 3.7%
Forks count: 6.0%
Dependent packages count: 7.3%
Average: 12.0%
Downloads: 21.1%
Dependent repos count: 22.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/codeql-analysis.yml actions
  • actions/checkout v2 composite
  • github/codeql-action/analyze v2 composite
  • github/codeql-action/autobuild v2 composite
  • github/codeql-action/init v2 composite
.github/workflows/run_tests.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
requirements/requirements.txt pypi
  • apricot-select >=0.6.0
  • datasets *
  • dotmap *
  • matplotlib *
  • numba >=0.43.0
  • numpy >=1.19.0
  • pandas >=1.1.0
  • pillow >=8.4.0
  • pyyaml *
  • ray *
  • scikit-image >=0.17.0
  • scikit-learn *
  • scipy >=1.5.0
  • setuptools *
  • sphinx-rtd-theme *
  • sphinxcontrib-bibtex *
  • sphinxcontrib-napoleon *
  • torch >=1.8.0
  • torchtext *
  • torchvision *
  • tqdm >=4.24.0
setup.py pypi
  • apricot-select >=0.6.0
  • dotmap *
  • matplotlib *
  • numba >=0.43.0
  • numpy >=1.19.0
  • pandas >=1.1.0
  • pillow >=8.4.0
  • pyyaml *
  • ray *
  • scikit-image >=0.17.0
  • scikit-learn *
  • scipy >=1.5.0
  • setuptools >=58.0.4
  • sphinx-rtd-theme *
  • sphinxcontrib-bibtex *
  • sphinxcontrib-napoleon *
  • torch >=1.8.0
  • torchtext *
  • torchvision >=0.10.1
  • torchvision *
  • tqdm >=4.24.0