dolma
Data and tools for generating and inspecting OLMo pre-training data.
Science Score: 64.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ✓ Committers with academic emails: 1 of 22 committers (4.5%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (15.9%) to scientific vocabulary
Keywords
Keywords from Contributors
Repository
Data and tools for generating and inspecting OLMo pre-training data.
Basic Info
- Host: GitHub
- Owner: allenai
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://allenai.github.io/dolma/
- Size: 62.8 MB
Statistics
- Stars: 1,293
- Watchers: 24
- Forks: 147
- Open Issues: 24
- Releases: 31
Topics
Metadata Files
README.md

Dolma is two things:
- Dolma Dataset: an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
- Dolma Toolkit: a high-performance toolkit for curating datasets for language modeling -- this repo contains the source code for the Dolma Toolkit.
Dolma Dataset
Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. It was created as a training corpus for OLMo, a language model from the Allen Institute for AI (AI2).
Dolma is available for download on the HuggingFace 🤗 Hub: huggingface.co/datasets/allenai/dolma. Dolma is licensed under ODC-BY; see our blog post for explanation.
You can also read more about Dolma in our announcement, as well as by consulting its data sheet.
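For a quick look at the data, the dataset can be streamed directly from the Hub. The snippet below is an illustrative sketch using the `datasets` library; the config name (`v1_6`) and the record fields are assumptions that should be checked against the dataset card.

```python
# Illustrative sketch: stream a few Dolma documents from the Hugging Face Hub.
# Assumes the `datasets` library is installed; the config name "v1_6" is an
# example only -- check the dataset card for the configs currently available.
# Depending on your `datasets` version, `trust_remote_code=True` may be needed.
from datasets import load_dataset

dolma = load_dataset("allenai/dolma", name="v1_6", split="train", streaming=True)

for i, doc in enumerate(dolma):
    # Dolma records follow the Dolma data format; "text" carries the document body.
    print(doc.get("id"), doc.get("source"))
    print(doc["text"][:200])
    if i >= 2:
        break
```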
Dolma Toolkit
This repository houses the Dolma Toolkit, which enables curation of large datasets for (pre)-training ML models. Its key features are:
- High Performance ⚡: Can process billions of documents concurrently thanks to built-in parallelism.
- Portability 🧳: Works on a single machine, a cluster, or cloud environment.
- Built-In Taggers 🏷: Includes ready-to-use taggers commonly used to curate datasets such as Gopher, C4, and OpenWebText.
- Fast Deduplication 🗑: Speedy document deduplication using a Rust Bloom filter.
- Extensibility 🧩 & Cloud Support ☁: Supports custom taggers (sketched below) and AWS S3-compatible locations.
To install, simply run `pip install dolma` in your terminal.
To learn more about how to use the Dolma Toolkit, please visit the documentation.
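As one example of the extensibility noted above, the documentation describes registering custom taggers that emit spans over documents. The sketch below follows that pattern; the `add_tagger`/`BaseTagger` names and the `Document`/`Span`/`DocResult` types are taken from the project documentation and should be verified against the installed version.

```python
# Minimal sketch of a custom Dolma tagger, assuming the extensibility interface
# described in the toolkit documentation (add_tagger / BaseTagger); verify the
# exact import paths against the version of dolma you have installed.
import re

from dolma import add_tagger, BaseTagger
from dolma.core.data_types import DocResult, Document, Span


@add_tagger("olmo_mention_v1")  # hypothetical tagger name used for illustration
class OlmoMentionTagger(BaseTagger):
    """Emit a span for every case-insensitive occurrence of the string 'olmo'."""

    def predict(self, doc: Document) -> DocResult:
        spans = [
            Span(start=m.start(), end=m.end(), type="olmo_mention", score=1.0)
            for m in re.finditer(r"olmo", doc.text, re.IGNORECASE)
        ]
        return DocResult(doc=doc, spans=spans)
```

Once registered, such a tagger can be invoked by name alongside the built-in taggers; see the documentation for the exact command-line invocation.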
Citation
If you use the Dolma dataset or toolkit, please cite the following items:
```bibtex
@article{dolma,
  title   = {{Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research}},
  author  = {Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and Russell Authur and Ben Bogin and Khyathi Chandu and Jennifer Dumas and Yanai Elazar and Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and Nathan Lambert and Ian Magnusson and Jacob Morrison and Niklas Muennighoff and Aakanksha Naik and Crystal Nam and Matthew E. Peters and Abhilasha Ravichander and Kyle Richardson and Zejiang Shen and Emma Strubell and Nishant Subramani and Oyvind Tafjord and Pete Walsh and Luke Zettlemoyer and Noah A. Smith and Hannaneh Hajishirzi and Iz Beltagy and Dirk Groeneveld and Jesse Dodge and Kyle Lo},
  year    = {2024},
  journal = {arXiv preprint},
  url     = {https://arxiv.org/abs/2402.00159}
}
```
Owner
- Name: AI2
- Login: allenai
- Kind: organization
- Email: ai2-info@allenai.org
- Location: Seattle, WA
- Website: http://www.allenai.org
- Repositories: 454
- Profile: https://github.com/allenai
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
Dolma: an Open Corpus of Three Trillion Tokens for
Language Model Pretraining Research
message: >-
If you use this software, please cite it using the
metadata from this file.
type: software
authors:
- family-names: Soldaini
given-names: Luca
email: lucas@allenai.org
affiliation: Allen Institute For AI
orcid: 'https://orcid.org/0000-0001-6998-9863'
- family-names: Kinney
given-names: Rodney
email: rodneyk@allenai.org
affiliation: Allen Institute For AI
- family-names: Bhagia
given-names: Akshita
email: akshitab@allenai.org
affiliation: Allen Institute For AI
- family-names: Schwenk
given-names: Dustin
email: dustins@allenai.org
affiliation: Allen Institute For AI
- family-names: Atkinson
given-names: David
email: davida@allenai.org
affiliation: Allen Institute For AI
- family-names: Authur
given-names: Russell
email: russell.authur@gmail.com
affiliation: Allen Institute For AI
- family-names: Bogin
given-names: Ben
email: benb@allenai.org
affiliation: 'Allen Institute For AI, University of Washington'
- family-names: Chandu
given-names: Khyathi
email: khyathic@allenai.org
affiliation: Allen Institute For AI
- family-names: Dumas
given-names: Jennifer
email: jend@allenai.org
affiliation: Allen Institute For AI
- family-names: Elazar
given-names: Yanai
email: yanaiela@gmail.com
affiliation: 'Allen Institute For AI, University of Washington'
- family-names: Hofmann
given-names: Valentin
email: valentinh@allenai.org
affiliation: Allen Institute For AI
- family-names: Jha
given-names: Ananya Harsh
email: ananyah@allenai.org
affiliation: Allen Institute For AI
- family-names: Kumar
given-names: Sachin
email: sachink@allenai.org
affiliation: Allen Institute For AI
- family-names: Lucy
given-names: Li
email: lucy3_li@berkeley.edu
affiliation: 'University of California, Berkeley, Allen Institute For AI'
- family-names: Lyu
given-names: Xinxi
email: alrope@cs.washington.edu
affiliation: Allen Institute For AI
- family-names: Lambert
given-names: Nathan
email: nathanl@allenai.org
affiliation: Allen Institute For AI
orcid: 'https://orcid.org/0000-0002-9997-6817'
- family-names: Magnusson
given-names: Ian
email: ianm@allenai.org
affiliation: Allen Institute For AI
- family-names: Morrison
given-names: Jacob
email: jacobm@allenai.org
affiliation: Allen Institute For AI
- family-names: Muennighoff
given-names: Niklas
email: n.muennighoff@gmail.com
- family-names: Naik
given-names: Aakanksha
email: aakankshan@allenai.org
affiliation: Allen Institute For AI
- family-names: Nam
given-names: Crystal
email: crystaln@allenai.org
affiliation: Allen Institute For AI
- family-names: Peters
given-names: Matthew E
affiliation: Spiffy AI
email: matt@spiffy.ai
- family-names: Ravichander
given-names: Abhilasha
email: abhilashar@allenai.org
affiliation: Allen Institute For AI
- family-names: Richardson
given-names: Kyle
email: kyler@allenai.org
affiliation: Allen Institute For AI
- family-names: Shen
given-names: Shannon Zejiang
email: zejiangshen@gmail.com
affiliation: Massachusetts Institute of Technology
- family-names: Strubell
given-names: Emma
email: strubell@cmu.edu
affiliation: 'Carnegie Mellon University, Allen Institute For AI'
orcid: 'https://orcid.org/0000-0003-2798-0726'
- family-names: Subramani
given-names: Nishant
email: nishant.subramani23@gmail.com
affiliation: 'Carnegie Mellon University, Allen Institute For AI'
- family-names: Tafjord
given-names: Oyvind
email: oyvindt@allenai.org
affiliation: Allen Institute For AI
- family-names: Walsh
given-names: Pete
email: petew@allenai.org
affiliation: Allen Institute For AI
- family-names: Zettlemoyer
given-names: Luke
email: lsz@cs.washington.edu
affiliation: University of Washington
orcid: 'https://orcid.org/0009-0008-8296-0764'
- family-names: Smith
given-names: Noah A
email: noah@allenai.org
affiliation: 'Allen Institute For AI, University of Washington'
orcid: 'https://orcid.org/0000-0002-2310-6380'
- family-names: Hajishirzi
given-names: Hannaneh
email: hannah@allenai.org
affiliation: 'Allen Institute For AI, University of Washington'
orcid: 'https://orcid.org/0000-0002-1055-6657'
- family-names: Beltagy
given-names: Iz
email: beltagy@allenai.org
affiliation: Allen Institute For AI
- family-names: Groeneveld
given-names: Dirk
email: dirkg@allenai.org
affiliation: Allen Institute For AI
- family-names: Dodge
given-names: Jesse
email: jessed@allenai.org
affiliation: Allen Institute For AI
- family-names: Lo
given-names: Kyle
email: kylel@allenai.org
affiliation: Allen Institute For AI
identifiers:
- type: url
value: 'https://arxiv.org/abs/2402.00159'
description: arXiv
- type: url
value: 'https://huggingface.co/datasets/allenai/dolma'
description: Dataset
repository-code: 'https://github.com/allenai/dolma'
url: 'https://github.com/allenai/dolma'
abstract: >
Language models have become a critical technology to
tackling a wide range of natural language processing
tasks, yet many details about how the best-performing
language models were developed are not reported. In
particular, information about their pretraining corpora is
seldom discussed: commercial language models rarely
provide any information about their data; even open models
rarely release datasets they are trained on, or an exact
recipe to reproduce them. As a result, it is challenging
to conduct certain threads of language modeling research,
such as understanding how training data impacts model
capabilities and shapes their limitations. To facilitate
open research on language model pretraining, we release
Dolma, a three trillion tokens English corpus, built from
a diverse mixture of web content, scientific papers, code,
public-domain books, social media, and encyclopedic
materials. In addition, we open source our data curation
toolkit to enable further experimentation and reproduction
of our work. In this report, we document Dolma, including
its design principles, details about its construction, and
a summary of its contents. We interleave this report with
analyses and experimental results from training language
models on intermediate states of Dolma to share what we
have learned about important data curation practices,
including the role of content or quality filters,
deduplication, and multi-source mixing. Dolma has been
used to train OLMo, a state-of-the-art, open language
model and framework designed to build and study the
science of language modeling.
license: Apache-2.0
GitHub Events
Total
- Create event: 75
- Release event: 4
- Issues event: 42
- Watch event: 304
- Delete event: 30
- Issue comment event: 52
- Push event: 517
- Pull request review comment event: 31
- Pull request review event: 53
- Pull request event: 62
- Fork event: 43
Last Year
- Create event: 75
- Release event: 4
- Issues event: 42
- Watch event: 304
- Delete event: 30
- Issue comment event: 52
- Push event: 517
- Pull request review comment event: 31
- Pull request review event: 53
- Pull request event: 62
- Fork event: 43
Committers
Last synced: over 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Luca Soldaini | l****s@a****g | 88 |
| Luca Soldaini | l****a@s****t | 78 |
| chris-ha458 | h****9@g****m | 69 |
| kyleclo | k****o@u****u | 13 |
| dependabot[bot] | 4****] | 7 |
| Peter Bjørn Jørgensen | p****n@g****m | 5 |
| David Graham | d****1@g****m | 3 |
| Niklas Muennighoff | n****f@g****m | 3 |
| Tyler Murray | t****m@a****g | 3 |
| Rodney Kinney | r****k@a****g | 2 |
| Arnavi Chheda | a****c@l****m | 1 |
| Ben Bogin | b****9@g****m | 1 |
| Dirk Groeneveld | d****g@a****g | 1 |
| Ishan Anand | g****b@i****g | 1 |
| Kenneth Enevoldsen | k****n@g****m | 1 |
| Rohit Singh Rathaur | r****5@g****m | 1 |
| Simon Willison | s****n@g****m | 1 |
| Tyler Murray | t****7@g****m | 1 |
| Ikko Eltociear Ashimine | e****r@g****m | 1 |
| Ian Magnusson | 4****n | 1 |
| Dustin Schwenk | d****k | 1 |
| epwalsh | p****w@a****g | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 86
- Total pull requests: 185
- Average time to close issues: 5 months
- Average time to close pull requests: 11 days
- Total issue authors: 45
- Total pull request authors: 33
- Average comments per issue: 1.6
- Average comments per pull request: 0.38
- Merged pull requests: 144
- Bot issues: 0
- Bot pull requests: 11
Past Year
- Issues: 18
- Pull requests: 62
- Average time to close issues: about 2 months
- Average time to close pull requests: 18 days
- Issue authors: 15
- Pull request authors: 15
- Average comments per issue: 0.83
- Average comments per pull request: 0.32
- Merged pull requests: 38
- Bot issues: 0
- Bot pull requests: 2
Top Authors
Issue Authors
- hannahzacharski55 (20)
- soldni (8)
- peterbjorgensen (7)
- mihara-bot (3)
- wannaphong (3)
- yushengsu-thu (2)
- mrqorib (2)
- chschroeder (2)
- codefly13 (2)
- zxnie (1)
- Jackwaterveg (1)
- ehartford (1)
- silverriver (1)
- joellliu (1)
- XevWright (1)
Pull Request Authors
- soldni (123)
- undfined (30)
- dependabot[bot] (17)
- Whattabatt (9)
- kyleclo (8)
- peterbjorgensen (8)
- chris-ha458 (5)
- cmwilhelm (5)
- no0p (4)
- revbucket (4)
- yushengsu-thu (3)
- Muennighoff (3)
- rodneykinney (3)
- mariia-iureva (2)
- power10dan (2)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 2
- Total downloads: pypi 11,143 last month
- Total dependent packages: 0 (may contain duplicates)
- Total dependent repositories: 0 (may contain duplicates)
- Total versions: 79
- Total maintainers: 3
proxy.golang.org: github.com/allenai/dolma
- Documentation: https://pkg.go.dev/github.com/allenai/dolma#section-documentation
- License: apache-2.0
- Latest release: v1.2.1 (published 6 months ago)
Rankings
pypi.org: dolma
Toolkit for pre-processing LLM training data.
- Homepage: https://github.com/allenai/dolma
- Documentation: https://dolma.readthedocs.io/
- License: Apache-2.0
- Latest release: 1.2.1 (published 6 months ago)
Rankings
Dependencies
- PyO3/maturin-action v1 composite
- actions-rs/toolchain v1 composite
- actions/checkout v1 composite
- actions/checkout v3 composite
- actions/download-artifact v3 composite
- actions/setup-python v2 composite
- actions/setup-python v4 composite
- actions/upload-artifact v3 composite
- 233 dependencies
- anyascii >=0.3.2
- blingfire ==0.1.8
- boto3 *
- cached-path ==1.3.4
- detect-secrets ==1.4.0
- fasttext-wheel ==0.9.2
- fsspec *
- msgspec >=0.14.2
- nltk ==3.8.1
- omegaconf >=2.3.0
- presidio_analyzer ==2.2.32
- pycld2 ==0.41
- pyyaml *
- requests *
- rich *
- s3fs *
- smart-open *
- tokenizers >=0.13.3,<1.0.0
- tqdm *
- uniseg *
- apache-beam *
- datasets *
- jsonlines *
- pyarrow *