https://github.com/cohere-labs-community/iterative-data-selection
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.1%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Cohere-Labs-Community
- Language: Python
- Default Branch: main
- Size: 45.8 MB
Statistics
- Stars: 28
- Watchers: 7
- Forks: 5
- Open Issues: 3
- Releases: 0
Metadata Files
README.md
Selecting Diverse Instructions
This repository contains the official code for the paper: Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement.

Dataset
To download the datasets used in this project, run this script. We used Alpaca, ShareGPT and WizardLM datasets for training and evaluation.
After downloading, datasets will be stored in the data/processed directory of the project.
Coreset Selection
Hyperparameters and configurations are managed by Hydra; the configuration files are stored in selection/config/.
To run the selection, execute main.py from the selection directory. Hyperparameters can also be set through command-line arguments.
```bash
cd selection
# Choose one dataset: data=sharegpt or data=wizardlm
python main.py data=sharegpt encoder=miniLM coreset=random
```
The selected indices are stored under selection/indices/.
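As a rough illustration of what a `coreset=random` baseline could look like, the sketch below samples a fixed budget of example indices and writes them to disk. It is not the repository's implementation: the function names, the JSON file format, and the seed handling are all assumptions made for this example.

```python
import json
import random
from pathlib import Path


def select_random_indices(num_examples: int, budget: int, seed: int = 0) -> list[int]:
    """Pick `budget` distinct example indices uniformly at random (hypothetical helper)."""
    if not 0 <= budget <= num_examples:
        raise ValueError("budget must be between 0 and num_examples")
    rng = random.Random(seed)  # local RNG so the selection is reproducible
    return sorted(rng.sample(range(num_examples), budget))


def save_indices(indices: list[int], path: Path) -> None:
    """Write the indices as a JSON list (assumed format), creating parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(indices))


def load_indices(path: Path) -> list[int]:
    """Read back a JSON list of indices written by `save_indices`."""
    return json.loads(path.read_text())
```

Sorting the sampled indices keeps the saved file stable across runs with the same seed, which makes selections easier to compare.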
Finetuning
```bash
# Llama-2-7b-hf (with accelerate and deepspeed)
bash scripts/finetune_llama_with_accelerate.sh [INDICES]
```
Iterative selection is implemented in the `scripts/iter/` directory.
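The fine-tuning script takes the selected indices as its argument. How the script consumes them is not described here, so the following is only a sketch of one plausible way to reduce a dataset to the selected subset before training. The JSONL layout and both helper names are assumptions for this example.

```python
import json
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per non-empty line (assumed dataset layout)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def subset_by_indices(examples: list[dict], indices: list[int]) -> list[dict]:
    """Return the examples at the given positions, in the order listed."""
    n = len(examples)
    bad = [i for i in indices if not 0 <= i < n]
    if bad:
        # Fail early rather than silently training on the wrong subset.
        raise IndexError(f"indices out of range for {n} examples: {bad[:5]}")
    return [examples[i] for i in indices]
```

Validating the indices against the dataset length guards against pairing an indices file with a different dataset than the one it was selected from.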
Evaluation
```bash
bash scripts/eval/{eval}.sh
```
Reference
This code is based on the following repository:
- open-instruct
Citation
If you find this code useful, please cite our paper:
@misc{yu2024diversify,
title={Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement},
author={Simon Yu and Liangyu Chen and Sara Ahmadian and Marzieh Fadaee},
year={2024},
eprint={2409.11378},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Owner
- Name: Cohere Labs Community
- Login: Cohere-Labs-Community
- Kind: organization
- Email: info@for.ai
- Location: Toronto, Canada
- Website: https://cohere.com/research
- Twitter: Cohere_Labs
- Repositories: 3
- Profile: https://github.com/Cohere-Labs-Community
Cohere Labs is Cohere's non-profit research lab. It seeks to solve complex ML problems and focuses on creating more points of entry to the field.
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1