https://github.com/cohere-labs-community/iterative-data-selection

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.1%) to scientific vocabulary
Last synced: 10 months ago · JSON representation

Repository

Basic Info
  • Host: GitHub
  • Owner: Cohere-Labs-Community
  • Language: Python
  • Default Branch: main
  • Size: 45.8 MB
Statistics
  • Stars: 28
  • Watchers: 7
  • Forks: 5
  • Open Issues: 3
  • Releases: 0
Created over 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme

README.md

Selecting Diverse Instructions

This repository contains the official code for the paper: Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement.

KMQ Visualization

Dataset

To download the datasets used in this project, run this script. We used the Alpaca, ShareGPT, and WizardLM datasets for training and evaluation.

After downloading, datasets will be stored in the data/processed directory of the project.

Coreset Selection

The hyperparameters and configurations are managed by Hydra, and the configuration files are stored in selection/config/. Run the code by executing main.py in the selection directory. Hyperparameters can also be overridden with command-line arguments:

```bash
cd selection
python main.py data=[sharegpt|wizardlm] encoder=miniLM coreset=random
```

The selected indices are stored under selection/indices/.
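To make the role of the selected indices concrete, here is a minimal sketch of what a `coreset=random` baseline could look like: draw a fixed budget of distinct indices, save them, and later use them to subset the training records. The function names, the JSON file format, and the file path are assumptions made for illustration; they are not taken from the repository, whose actual index format may differ.

```python
import json
import random
from pathlib import Path


def select_random_indices(dataset_size: int, budget: int, seed: int = 0) -> list[int]:
    """Return `budget` distinct indices drawn uniformly from range(dataset_size)."""
    if not 0 <= budget <= dataset_size:
        raise ValueError("budget must be between 0 and dataset_size")
    rng = random.Random(seed)  # local RNG so the selection is reproducible
    return sorted(rng.sample(range(dataset_size), budget))


def save_indices(indices: list[int], path: Path) -> None:
    """Write the selected indices to a JSON file (hypothetical format)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(indices))


def load_subset(records: list[dict], path: Path) -> list[dict]:
    """Keep only the records whose positions appear in the saved index file."""
    indices = json.loads(path.read_text())
    return [records[i] for i in indices]


if __name__ == "__main__":
    # Toy stand-in for an instruction dataset; real data would come from data/processed.
    records = [{"id": i, "instruction": f"example {i}"} for i in range(1000)]
    index_file = Path("indices_demo/random.json")  # hypothetical output location
    save_indices(select_random_indices(len(records), budget=100, seed=42), index_file)
    print(len(load_subset(records, index_file)), "records selected")
```

Fixing the seed keeps the selection reproducible across runs, which matters when the same index file is reused for finetuning and for comparison against other selection strategies.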

Finetuning

```bash
# Llama-2-7b-hf (with accelerate and deepspeed)
bash scripts/finetune_llama_with_accelerate.sh [INDICES]
```

Iterative selection is implemented in the `scripts/iter/` directory.

Evaluation

```bash
bash scripts/eval/{eval}.sh
```

Reference

This code is based on the following repository:

- open-instruct

Citation

If you find this code useful, please cite our paper:

```bibtex
@misc{yu2024diversify,
  title={Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement},
  author={Simon Yu and Liangyu Chen and Sara Ahmadian and Marzieh Fadaee},
  year={2024},
  eprint={2409.11378},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

Owner

  • Name: Cohere Labs Community
  • Login: Cohere-Labs-Community
  • Kind: organization
  • Email: info@for.ai
  • Location: Toronto, Canada

Cohere Labs is Cohere's non-profit research lab, which seeks to solve complex ML problems and is focused on creating more points of entry to the field.

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1