https://github.com/alon-albalak/data-selection-survey

A Survey on Data Selection for Language Models

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 14 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, acm.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.1%) to scientific vocabulary

Keywords

data-selection language-model llm survey
Last synced: 5 months ago

Repository

A Survey on Data Selection for Language Models

Basic Info
  • Host: GitHub
  • Owner: alon-albalak
  • License: cc0-1.0
  • Default Branch: main
  • Homepage:
  • Size: 1.55 MB
Statistics
  • Stars: 227
  • Watchers: 5
  • Forks: 13
  • Open Issues: 1
  • Releases: 0
Topics
data-selection language-model llm survey
Created almost 2 years ago · Last pushed 10 months ago
Metadata Files
Readme License

README.md

A Survey on Data Selection for Language Models

This repo is a convenient listing of papers relevant to data selection for language models, during all stages of training. This is meant to be a resource for the community, so please contribute if you see anything missing!

For more detail on these works and others, see our survey paper, A Survey on Data Selection for Language Models, by this incredible team: Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, and William Yang Wang.

A conceptual demonstration of the data pipeline for language model training

Data Selection for Pretraining

Conceptualization of objectives and constraints on data selection for pretraining

Language Filtering

Heuristic Approaches

Data Quality

Domain-Specific Selection

Data Deduplication

Filtering Toxic and Explicit Content

Specialized Selection for Multilingual Models

Data Mixing

Data Selection for Instruction-Tuning and Multitask Training

Conceptualization of objectives and constraints on data selection for instruction-tuning

Data Selection for Preference Fine-tuning: Alignment

Conceptualization of objectives and constraints on data selection for alignment

Data Selection for In-Context Learning

Conceptualization of objectives and constraints on data selection for in-context learning

Data Selection for Task-specific Fine-tuning

Conceptualization of objectives and constraints on data selection for task-specific fine-tuning

Contribution

There are likely some amazing works in the field that we missed, so please contribute to the repo.

Feel free to open a pull request with new papers or create an issue and we can add them for you. Thank you in advance for your efforts!

Citation

We hope this work serves as inspiration for many impactful future works. If you found our work useful, please cite this paper as:

@article{albalak2024survey,
  title={A Survey on Data Selection for Language Models},
  author={Alon Albalak and Yanai Elazar and Sang Michael Xie and Shayne Longpre and Nathan Lambert and Xinyi Wang and Niklas Muennighoff and Bairu Hou and Liangming Pan and Haewon Jeong and Colin Raffel and Shiyu Chang and Tatsunori Hashimoto and William Yang Wang},
  year={2024},
  journal={arXiv preprint arXiv:2402.16827},
  note={\url{https://arxiv.org/abs/2402.16827}}
}

Owner

  • Name: Alon Albalak
  • Login: alon-albalak
  • Kind: user
  • Location: Santa Barbara, CA

PhD student, Natural Language Processing and Deep Learning

GitHub Events

Total
  • Watch event: 75
  • Push event: 1
  • Pull request review event: 1
  • Pull request event: 3
  • Fork event: 5
Last Year
  • Watch event: 75
  • Push event: 1
  • Pull request review event: 1
  • Pull request event: 3
  • Fork event: 5