dolma

Data and tools for generating and inspecting OLMo pre-training data.

https://github.com/allenai/dolma

Science Score: 64.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
    1 of 22 committers (4.5%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.9%) to scientific vocabulary

Keywords

data-processing large-language-models llm machine-learning nlp

Keywords from Contributors

transformers cryptocurrencies language-model jax embedded cryptography optim interactive spacy-extension tokenizer
Last synced: 4 months ago

Repository

Data and tools for generating and inspecting OLMo pre-training data.

Basic Info
Statistics
  • Stars: 1,293
  • Watchers: 24
  • Forks: 147
  • Open Issues: 24
  • Releases: 31
Topics
data-processing large-language-models llm machine-learning nlp
Created over 2 years ago · Last pushed 4 months ago
Metadata Files
Readme License Citation

README.md

Dolma's official logo. It's dolma written in yellow, round lowercase letters over a blue background.

Dolma is two things:

  1. Dolma Dataset: an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
  2. Dolma Toolkit: a high-performance toolkit for curating datasets for language modeling -- this repo contains the source code for the Dolma Toolkit.

Dolma Dataset

Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. It was created as a training corpus for OLMo, a language model from the Allen Institute for AI (AI2).

Dolma is available for download on the HuggingFace 🤗 Hub: huggingface.co/datasets/allenai/dolma. Dolma is licensed under ODC-BY; see our blog post for explanation.
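
A minimal loading sketch (not taken from the Dolma documentation): the snippet below streams a few documents from the Hub copy of the dataset with the Hugging Face datasets library. The split name, the "text" field, and streaming support are assumptions based on common Hub conventions; check the dataset card for the exact configurations.

# Hedged sketch: stream a handful of Dolma documents from the Hugging Face Hub.
# Split name, field names, and streaming support are assumptions, not taken
# from the Dolma docs; verify against the dataset card before use.
from datasets import load_dataset

ds = load_dataset("allenai/dolma", split="train", streaming=True)
for i, doc in enumerate(ds):
    print(doc["text"][:200])  # assumed field name for the raw document text
    if i >= 2:
        break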

You can also read more about Dolma in our announcement, as well as by consulting its data sheet.

Dolma Toolkit

This repository houses the Dolma Toolkit, which enables curation of large datasets for (pre)-training ML models. Its key features are:

  1. High Performance ⚡: Can process billions of documents concurrently thanks to built-in parallelism.
  2. Portability 🧳: Works on a single machine, a cluster, or cloud environment.
  3. Built-In Taggers 🏷: Includes ready-to-use taggers commonly used to curate datasets such as Gopher, C4, and OpenWebText.
  4. Fast Deduplication 🗑: Speedy document deduplication using a Rust Bloom filter (a toy sketch of the idea follows this list).
  5. Extensibility 🧩 & Cloud Support ☁: Supports custom taggers and AWS S3-compatible locations.
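
The Bloom-filter deduplication in item 4 is implemented in Rust inside the toolkit; the toy Python sketch below is not the toolkit's code and only illustrates the underlying idea: hash every document into a shared bit array and treat documents whose bits are already all set as probable duplicates.

# Toy illustration of Bloom-filter deduplication (the toolkit's actual deduper
# is written in Rust for speed). False positives are possible; false negatives
# are not.
import hashlib

class ToyBloomFilter:
    def __init__(self, num_bits=2**20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, text):
        # Derive several bit positions from seeded SHA-256 digests of the text.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{text}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def seen_before(self, text):
        # Report whether the document was (probably) seen already, then record it.
        positions = list(self._positions(text))
        already = all(self.bits[p // 8] & (1 << (p % 8)) for p in positions)
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)
        return already

bloom = ToyBloomFilter()
docs = ["the quick brown fox", "hello world", "the quick brown fox"]
print([bloom.seen_before(d) for d in docs])  # -> [False, False, True]

Because membership tests are probabilistic, a Bloom filter can occasionally mark a new document as a duplicate but never misses a true repeat, a trade-off usually accepted in exchange for speed and low memory use.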

To install, simply type pip install dolma in your terminal.

To learn more about how to use the Dolma Toolkit, please visit the documentation.
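
As a rough orientation for the custom-tagger support mentioned in feature 5, the sketch below shows what a tagger can look like. The module paths, base class, and registry decorator are assumptions based on the project's documentation and may not match the current API exactly; consult the documentation linked above before relying on them.

# Hedged sketch of a custom tagger. Module paths, class names, and the registry
# decorator are assumptions drawn from the Dolma Toolkit documentation; verify
# against the docs before use.
from dolma.core.data_types import DocResult, Document, Span
from dolma.core.registry import TaggerRegistry
from dolma.core.taggers import BaseTagger

@TaggerRegistry.add("short_docs_v1")  # hypothetical tagger name
class ShortDocumentTagger(BaseTagger):
    """Flags documents shorter than 100 characters so they can be filtered later."""

    def predict(self, doc: Document) -> DocResult:
        spans = []
        if len(doc.text) < 100:
            spans.append(Span(start=0, end=len(doc.text), type="short_doc", score=1.0))
        return DocResult(doc=doc, spans=spans)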

Citation

If you use the Dolma dataset or toolkit, please cite the following items:

@article{dolma,
  title   = {{Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research}},
  author  = {Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and Russell Authur and Ben Bogin and Khyathi Chandu and Jennifer Dumas and Yanai Elazar and Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and Nathan Lambert and Ian Magnusson and Jacob Morrison and Niklas Muennighoff and Aakanksha Naik and Crystal Nam and Matthew E. Peters and Abhilasha Ravichander and Kyle Richardson and Zejiang Shen and Emma Strubell and Nishant Subramani and Oyvind Tafjord and Pete Walsh and Luke Zettlemoyer and Noah A. Smith and Hannaneh Hajishirzi and Iz Beltagy and Dirk Groeneveld and Jesse Dodge and Kyle Lo},
  year    = {2024},
  journal = {arXiv preprint},
  url     = {https://arxiv.org/abs/2402.00159}
}

Owner

  • Name: AI2
  • Login: allenai
  • Kind: organization
  • Email: ai2-info@allenai.org
  • Location: Seattle, WA

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Dolma: an Open Corpus of Three Trillion Tokens for
  Language Model Pretraining Research
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - family-names: Soldaini
    given-names: Luca
    email: lucas@allenai.org
    affiliation: Allen Institute For AI
    orcid: 'https://orcid.org/0000-0001-6998-9863'
  - family-names: Kinney
    given-names: Rodney
    email: rodneyk@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Bhagia
    given-names: Akshita
    email: akshitab@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Schwenk
    given-names: Dustin
    email: dustins@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Atkinson
    given-names: David
    email: davida@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Authur
    given-names: Russell
    email: russell.authur@gmail.com
    affiliation: Allen Institute For AI
  - family-names: Bogin
    given-names: Ben
    email: benb@allenai.org
    affiliation: 'Allen Institute For AI, University of Washington'
  - family-names: Chandu
    given-names: Khyathi
    email: khyathic@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Dumas
    given-names: Jennifer
    email: jend@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Elazar
    given-names: Yanai
    email: yanaiela@gmail.com
    affiliation: 'Allen Institute For AI, University of Washington'
  - family-names: Hofmann
    given-names: Valentin
    email: valentinh@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Jha
    given-names: Ananya Harsh
    email: ananyah@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Kumar
    given-names: Sachin
    email: sachink@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Lucy
    given-names: Li
    email: lucy3_li@berkeley.edu
    affiliation: 'University of California, Berkeley, Allen Institute For AI'
  - family-names: Lyu
    given-names: Xinxi
    email: alrope@cs.washington.edu
    affiliation: Allen Institute For AI
  - family-names: Lambert
    given-names: Nathan
    email: nathanl@allenai.org
    affiliation: Allen Institute For AI
    orcid: 'https://orcid.org/0000-0002-9997-6817'
  - family-names: Magnusson
    given-names: Ian
    email: ianm@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Morrison
    given-names: Jacob
    email: jacobm@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Muennighoff
    given-names: Niklas
    email: n.muennighoff@gmail.com
  - family-names: Naik
    given-names: Aakanksha
    email: aakankshan@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Nam
    given-names: Crystal
    email: crystaln@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Peters
    given-names: Matthew E
    affiliation: Spiffy AI
    email: matt@spiffy.ai
  - family-names: Ravichander
    given-names: Abhilasha
    email: abhilashar@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Richardson
    given-names: Kyle
    email: kyler@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Shen
    given-names: Shannon Zejiang
    email: zejiangshen@gmail.com
    affiliation: Massachusetts Institute of Technology
  - family-names: Strubell
    given-names: Emma
    email: strubell@cmu.edu
    affiliation: 'Carnegie Mellon University, Allen Institute For AI'
    orcid: 'https://orcid.org/0000-0003-2798-0726'
  - family-names: Subramani
    given-names: Nishant
    email: nishant.subramani23@gmail.com
    affiliation: 'Carnegie Mellon University, Allen Institute For AI'
  - family-names: Tafjord
    given-names: Oyvind
    email: oyvindt@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Walsh
    given-names: Pete
    email: petew@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Zettlemoyer
    given-names: Luke
    email: lsz@cs.washington.edu
    affiliation: University of Washington
    orcid: 'https://orcid.org/0009-0008-8296-0764'
  - family-names: Smith
    given-names: Noah A
    email: noah@allenai.org
    affiliation: 'Allen Institute For AI, University of Washington'
    orcid: 'https://orcid.org/0000-0002-2310-6380'
  - family-names: Hajishirzi
    given-names: Hannaneh
    email: hannah@allenai.org
    affiliation: 'Allen Institute For AI, University of Washington'
    orcid: 'https://orcid.org/0000-0002-1055-6657'
  - family-names: Beltagy
    given-names: Iz
    email: beltagy@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Groeneveld
    given-names: Dirk
    email: dirkg@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Dodge
    given-names: Jesse
    email: jessed@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Lo
    given-names: Kyle
    email: kylel@allenai.org
    affiliation: Allen Institute For AI
identifiers:
  - type: url
    value: 'https://arxiv.org/abs/2402.00159'
    description: arXiv
  - type: url
    value: 'https://huggingface.co/datasets/allenai/dolma'
    description: Dataset
repository-code: 'https://github.com/allenai/dolma'
url: 'https://github.com/allenai/dolma'
abstract: >
  Language models have become a critical technology to
  tackling a wide range of natural language processing
  tasks, yet many details about how the best-performing
  language models were developed are not reported. In
  particular, information about their pretraining corpora is
  seldom discussed: commercial language models rarely
  provide any information about their data; even open models
  rarely release datasets they are trained on, or an exact
  recipe to reproduce them. As a result, it is challenging
  to conduct certain threads of language modeling research,
  such as understanding how training data impacts model
  capabilities and shapes their limitations. To facilitate
  open research on language model pretraining, we release
  Dolma, a three trillion tokens English corpus, built from
  a diverse mixture of web content, scientific papers, code,
  public-domain books, social media, and encyclopedic
  materials. In addition, we open source our data curation
  toolkit to enable further experimentation and reproduction
  of our work. In this report, we document Dolma, including
  its design principles, details about its construction, and
  a summary of its contents. We interleave this report with
  analyses and experimental results from training language
  models on intermediate states of Dolma to share what we
  have learned about important data curation practices,
  including the role of content or quality filters,
  deduplication, and multi-source mixing. Dolma has been
  used to train OLMo, a state-of-the-art, open language
  model and framework designed to build and study the
  science of language modeling.
license: Apache-2.0

GitHub Events

Total
  • Create event: 75
  • Release event: 4
  • Issues event: 42
  • Watch event: 304
  • Delete event: 30
  • Issue comment event: 52
  • Push event: 517
  • Pull request review comment event: 31
  • Pull request review event: 53
  • Pull request event: 62
  • Fork event: 43
Last Year
  • Create event: 75
  • Release event: 4
  • Issues event: 42
  • Watch event: 304
  • Delete event: 30
  • Issue comment event: 52
  • Push event: 517
  • Pull request review comment event: 31
  • Pull request review event: 53
  • Pull request event: 62
  • Fork event: 43

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 283
  • Total Committers: 22
  • Avg Commits per committer: 12.864
  • Development Distribution Score (DDS): 0.689
Past Year
  • Commits: 174
  • Committers: 20
  • Avg Commits per committer: 8.7
  • Development Distribution Score (DDS): 0.603
Top Committers
Name Email Commits
Luca Soldaini l****s@a****g 88
Luca Soldaini l****a@s****t 78
chris-ha458 h****9@g****m 69
kyleclo k****o@u****u 13
dependabot[bot] 4****] 7
Peter Bjørn Jørgensen p****n@g****m 5
David Graham d****1@g****m 3
Niklas Muennighoff n****f@g****m 3
Tyler Murray t****m@a****g 3
Rodney Kinney r****k@a****g 2
Arnavi Chheda a****c@l****m 1
Ben Bogin b****9@g****m 1
Dirk Groeneveld d****g@a****g 1
Ishan Anand g****b@i****g 1
Kenneth Enevoldsen k****n@g****m 1
Rohit Singh Rathaur r****5@g****m 1
Simon Willison s****n@g****m 1
Tyler Murray t****7@g****m 1
Ikko Eltociear Ashimine e****r@g****m 1
Ian Magnusson 4****n 1
Dustin Schwenk d****k 1
epwalsh p****w@a****g 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 86
  • Total pull requests: 185
  • Average time to close issues: 5 months
  • Average time to close pull requests: 11 days
  • Total issue authors: 45
  • Total pull request authors: 33
  • Average comments per issue: 1.6
  • Average comments per pull request: 0.38
  • Merged pull requests: 144
  • Bot issues: 0
  • Bot pull requests: 11
Past Year
  • Issues: 18
  • Pull requests: 62
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 18 days
  • Issue authors: 15
  • Pull request authors: 15
  • Average comments per issue: 0.83
  • Average comments per pull request: 0.32
  • Merged pull requests: 38
  • Bot issues: 0
  • Bot pull requests: 2
Top Authors
Issue Authors
  • hannahzacharski55 (20)
  • soldni (8)
  • peterbjorgensen (7)
  • mihara-bot (3)
  • wannaphong (3)
  • yushengsu-thu (2)
  • mrqorib (2)
  • chschroeder (2)
  • codefly13 (2)
  • zxnie (1)
  • Jackwaterveg (1)
  • ehartford (1)
  • silverriver (1)
  • joellliu (1)
  • XevWright (1)
Pull Request Authors
  • soldni (123)
  • undfined (30)
  • dependabot[bot] (17)
  • Whattabatt (9)
  • kyleclo (8)
  • peterbjorgensen (8)
  • chris-ha458 (5)
  • cmwilhelm (5)
  • no0p (4)
  • revbucket (4)
  • yushengsu-thu (3)
  • Muennighoff (3)
  • rodneykinney (3)
  • mariia-iureva (2)
  • power10dan (2)
Top Labels
Issue Labels
enhancement (5)
Pull Request Labels
dependencies (17) python (2) github_actions (2) rust (1)

Packages

  • Total packages: 2
  • Total downloads: 11,143 last month (pypi)
  • Total dependent packages: 0 (may contain duplicates)
  • Total dependent repositories: 0 (may contain duplicates)
  • Total versions: 79
  • Total maintainers: 3
proxy.golang.org: github.com/allenai/dolma
  • Versions: 38
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.4%
Average: 5.6%
Dependent repos count: 5.8%
Last synced: 4 months ago
pypi.org: dolma

Toolkit for pre-processing LLM training data.

  • Versions: 41
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 11,143 Last month
Rankings
Stargazers count: 4.9%
Downloads: 5.2%
Dependent packages count: 7.5%
Forks count: 10.6%
Average: 19.6%
Dependent repos count: 69.8%
Maintainers (3)
Last synced: 4 months ago

Dependencies

.github/workflows/CI.yml actions
  • PyO3/maturin-action v1 composite
  • actions-rs/toolchain v1 composite
  • actions/checkout v1 composite
  • actions/checkout v3 composite
  • actions/download-artifact v3 composite
  • actions/setup-python v2 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
Cargo.lock cargo
  • 233 dependencies
pyproject.toml pypi
  • anyascii >=0.3.2
  • blingfire ==0.1.8
  • boto3 *
  • cached-path ==1.3.4
  • detect-secrets ==1.4.0
  • fasttext-wheel ==0.9.2
  • fsspec *
  • msgspec >=0.14.2
  • nltk ==3.8.1
  • omegaconf >=2.3.0
  • presidio_analyzer ==2.2.32
  • pycld2 ==0.41
  • pyyaml *
  • requests *
  • rich *
  • s3fs *
  • smart-open *
  • tokenizers >=0.13.3,<1.0.0
  • tqdm *
  • uniseg *
.github/workflows/ISSUE_TEMPLATE/bug_report.yml actions
.github/workflows/ISSUE_TEMPLATE/documentation.yml actions
.github/workflows/ISSUE_TEMPLATE/feature_request.yml actions
.github/workflows/ISSUE_TEMPLATE/question.yml actions
Cargo.toml cargo
sources/reddit/atomic_content_v3/requirements.txt pypi
  • apache-beam *
  • jsonlines *
sources/reddit/atomic_content_v3/setup.py pypi
  • jsonlines *
sources/reddit/atomic_content_v5/requirements.txt pypi
  • apache-beam *
  • jsonlines *
sources/reddit/atomic_content_v5/setup.py pypi
  • jsonlines *
sources/reddit/comment_threads_v1/requirements.txt pypi
  • apache-beam *
  • datasets *
  • jsonlines *
sources/reddit/comment_threads_v1/setup.py pypi
  • jsonlines *
sources/reddit/comment_threads_v2/requirements.txt pypi
  • apache-beam *
  • datasets *
  • jsonlines *
sources/reddit/comment_threads_v2/setup.py pypi
  • jsonlines *
sources/reddit/complete_threads_codelike_v4/requirements.txt pypi
  • apache-beam *
  • datasets *
  • jsonlines *
sources/reddit/complete_threads_codelike_v4/setup.py pypi
  • jsonlines *
sources/starcoder/requirements.txt pypi
  • pyarrow *