dolma

Data and tools for generating and inspecting OLMo pre-training data.

https://github.com/allenai/dolma

Science Score: 64.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
    1 of 22 committers (4.5%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.9%) to scientific vocabulary

Keywords

data-processing large-language-models llm machine-learning nlp

Keywords from Contributors

transformers cryptocurrencies language-model jax embedded cryptography optim interactive spacy-extension tokenizer
Last synced: 4 months ago

Repository

Data and tools for generating and inspecting OLMo pre-training data.

Basic Info
Statistics
  • Stars: 1,293
  • Watchers: 24
  • Forks: 147
  • Open Issues: 24
  • Releases: 31
Topics
data-processing large-language-models llm machine-learning nlp
Created over 2 years ago · Last pushed 4 months ago
Metadata Files
Readme License Citation

README.md

Dolma's official logo. It's dolma written in yellow, round lowercase letters over a blue background.

Dolma is two things:

  1. Dolma Dataset: an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
  2. Dolma Toolkit: a high-performance toolkit for curating datasets for language modeling -- this repo contains the source code for the Dolma Toolkit.

Dolma Dataset

Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. It was created as a training corpus for OLMo, a language model from the Allen Institute for AI (AI2).

Dolma is available for download on the HuggingFace 🤗 Hub: huggingface.co/datasets/allenai/dolma. Dolma is licensed under ODC-BY; see our blog post for explanation.
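
A minimal loading sketch (not taken from the Dolma documentation): the snippet below streams a few documents from the Hub copy of the dataset with the Hugging Face datasets library. The split name, the "text" field, and streaming support are assumptions based on common Hub conventions; check the dataset card for the exact configurations.

# Hedged sketch: stream a handful of Dolma documents from the Hugging Face Hub.
# Split name, field names, and streaming support are assumptions, not taken
# from the Dolma docs; verify against the dataset card before use.
from datasets import load_dataset

ds = load_dataset("allenai/dolma", split="train", streaming=True)
for i, doc in enumerate(ds):
    print(doc["text"][:200])  # assumed field name for the raw document text
    if i >= 2:
        break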

You can also read more about Dolma in our announcement, as well as by consulting its data sheet.

Dolma Toolkit

This repository houses the Dolma Toolkit, which enables curation of large datasets for (pre)-training ML models. Its key features are:

  1. High Performance ⚡: Can process billions of documents concurrently thanks to built-in parallelism.
  2. Portability 🧳: Works on a single machine, a cluster, or cloud environment.
  3. Built-In Taggers 🏷: Includes ready-to-use taggers commonly used to curate datasets such as Gopher, C4, and OpenWebText.
  4. Fast Deduplication 🗑: Speedy document deduplication using a Rust Bloom filter (a toy sketch of the idea follows this list).
  5. Extensibility 🧩 & Cloud Support ☁: Supports custom taggers and AWS S3-compatible locations.
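
The Bloom-filter deduplication in item 4 is implemented in Rust inside the toolkit; the toy Python sketch below is not the toolkit's code and only illustrates the underlying idea: hash every document into a shared bit array and treat documents whose bits are already all set as probable duplicates.

# Toy illustration of Bloom-filter deduplication (the toolkit's actual deduper
# is written in Rust for speed). False positives are possible; false negatives
# are not.
import hashlib

class ToyBloomFilter:
    def __init__(self, num_bits=2**20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, text):
        # Derive several bit positions from seeded SHA-256 digests of the text.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{text}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def seen_before(self, text):
        # Report whether the document was (probably) seen already, then record it.
        positions = list(self._positions(text))
        already = all(self.bits[p // 8] & (1 << (p % 8)) for p in positions)
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)
        return already

bloom = ToyBloomFilter()
docs = ["the quick brown fox", "hello world", "the quick brown fox"]
print([bloom.seen_before(d) for d in docs])  # -> [False, False, True]

Because membership tests are probabilistic, a Bloom filter can occasionally mark a new document as a duplicate but never misses a true repeat, a trade-off usually accepted in exchange for speed and low memory use.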

To install, simply type pip install dolma in your terminal.

To learn more about how to use the Dolma Toolkit, please visit the documentation.
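
As a rough orientation for the custom-tagger support mentioned in feature 5, the sketch below shows what a tagger can look like. The module paths, base class, and registry decorator are assumptions based on the project's documentation and may not match the current API exactly; consult the documentation linked above before relying on them.

# Hedged sketch of a custom tagger. Module paths, class names, and the registry
# decorator are assumptions drawn from the Dolma Toolkit documentation; verify
# against the docs before use.
from dolma.core.data_types import DocResult, Document, Span
from dolma.core.registry import TaggerRegistry
from dolma.core.taggers import BaseTagger

@TaggerRegistry.add("short_docs_v1")  # hypothetical tagger name
class ShortDocumentTagger(BaseTagger):
    """Flags documents shorter than 100 characters so they can be filtered later."""

    def predict(self, doc: Document) -> DocResult:
        spans = []
        if len(doc.text) < 100:
            spans.append(Span(start=0, end=len(doc.text), type="short_doc", score=1.0))
        return DocResult(doc=doc, spans=spans)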

Citation

If you use the Dolma dataset or toolkit, please cite the following items:

@article{dolma,
  title   = {{Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research}},
  author  = {Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and Russell Authur and Ben Bogin and Khyathi Chandu and Jennifer Dumas and Yanai Elazar and Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and Nathan Lambert and Ian Magnusson and Jacob Morrison and Niklas Muennighoff and Aakanksha Naik and Crystal Nam and Matthew E. Peters and Abhilasha Ravichander and Kyle Richardson and Zejiang Shen and Emma Strubell and Nishant Subramani and Oyvind Tafjord and Pete Walsh and Luke Zettlemoyer and Noah A. Smith and Hannaneh Hajishirzi and Iz Beltagy and Dirk Groeneveld and Jesse Dodge and Kyle Lo},
  year    = {2024},
  journal = {arXiv preprint},
  url     = {https://arxiv.org/abs/2402.00159}
}

Owner

  • Name: AI2
  • Login: allenai
  • Kind: organization
  • Email: ai2-info@allenai.org
  • Location: Seattle, WA

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Dolma: an Open Corpus of Three Trillion Tokens for
  Language Model Pretraining Research
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - family-names: Soldaini
    given-names: Luca
    email: lucas@allenai.org
    affiliation: Allen Institute For AI
    orcid: 'https://orcid.org/0000-0001-6998-9863'
  - family-names: Kinney
    given-names: Rodney
    email: rodneyk@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Bhagia
    given-names: Akshita
    email: akshitab@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Schwenk
    given-names: Dustin
    email: dustins@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Atkinson
    given-names: David
    email: davida@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Authur
    given-names: Russell
    email: russell.authur@gmail.com
    affiliation: Allen Institute For AI
  - family-names: Bogin
    given-names: Ben
    email: benb@allenai.org
    affiliation: 'Allen Institute For AI, University of Washington'
  - family-names: Chandu
    given-names: Khyathi
    email: khyathic@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Dumas
    given-names: Jennifer
    email: jend@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Elazar
    given-names: Yanai
    email: yanaiela@gmail.com
    affiliation: 'Allen Institute For AI, University of Washington'
  - family-names: Hofmann
    given-names: Valentin
    email: valentinh@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Jha
    given-names: Ananya Harsh
    email: ananyah@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Kumar
    given-names: Sachin
    email: sachink@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Lucy
    given-names: Li
    email: lucy3_li@berkeley.edu
    affiliation: 'University of California, Berkeley, Allen Institute For AI'
  - family-names: Lyu
    given-names: Xinxi
    email: alrope@cs.washington.edu
    affiliation: Allen Institute For AI
  - family-names: Lambert
    given-names: Nathan
    email: nathanl@allenai.org
    affiliation: Allen Institute For AI
    orcid: 'https://orcid.org/0000-0002-9997-6817'
  - family-names: Magnusson
    given-names: Ian
    email: ianm@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Morrison
    given-names: Jacob
    email: jacobm@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Muennighoff
    given-names: Niklas
    email: n.muennighoff@gmail.com
  - family-names: Naik
    given-names: Aakanksha
    email: aakankshan@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Nam
    given-names: Crystal
    email: crystaln@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Peters
    given-names: Matthew E
    affiliation: Spiffy AI
    email: matt@spiffy.ai
  - family-names: Ravichander
    given-names: Abhilasha
    email: abhilashar@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Richardson
    given-names: Kyle
    email: kyler@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Shen
    given-names: Shannon Zejiang
    email: zejiangshen@gmail.com
    affiliation: Massachusetts Institute of Technology
  - family-names: Strubell
    given-names: Emma
    email: strubell@cmu.edu
    affiliation: 'Carnegie Mellon University, Allen Institute For AI'
    orcid: 'https://orcid.org/0000-0003-2798-0726'
  - family-names: Subramani
    given-names: Nishant
    email: nishant.subramani23@gmail.com
    affiliation: 'Carnegie Mellon University, Allen Institute For AI'
  - family-names: Tafjord
    given-names: Oyvind
    email: oyvindt@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Walsh
    given-names: Pete
    email: petew@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Zettlemoyer
    given-names: Luke
    email: lsz@cs.washington.edu
    affiliation: University of Washington
    orcid: 'https://orcid.org/0009-0008-8296-0764'
  - family-names: Smith
    given-names: Noah A
    email: noah@allenai.org
    affiliation: 'Allen Institute For AI, University of Washington'
    orcid: 'https://orcid.org/0000-0002-2310-6380'
  - family-names: Hajishirzi
    given-names: Hannaneh
    email: hannah@allenai.org
    affiliation: 'Allen Institute For AI, University of Washington'
    orcid: 'https://orcid.org/0000-0002-1055-6657'
  - family-names: Beltagy
    given-names: Iz
    email: beltagy@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Groeneveld
    given-names: Dirk
    email: dirkg@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Dodge
    given-names: Jesse
    email: jessed@allenai.org
    affiliation: Allen Institute For AI
  - family-names: Lo
    given-names: Kyle
    email: kylel@allenai.org
    affiliation: Allen Institute For AI
identifiers:
  - type: url
    value: 'https://arxiv.org/abs/2402.00159'
    description: arXiv
  - type: url
    value: 'https://huggingface.co/datasets/allenai/dolma'
    description: Dataset
repository-code: 'https://github.com/allenai/dolma'
url: 'https://github.com/allenai/dolma'
abstract: >
  Language models have become a critical technology to
  tackling a wide range of natural language processing
  tasks, yet many details about how the best-performing
  language models were developed are not reported. In
  particular, information about their pretraining corpora is
  seldom discussed: commercial language models rarely
  provide any information about their data; even open models
  rarely release datasets they are trained on, or an exact
  recipe to reproduce them. As a result, it is challenging
  to conduct certain threads of language modeling research,
  such as understanding how training data impacts model
  capabilities and shapes their limitations. To facilitate
  open research on language model pretraining, we release
  Dolma, a three trillion tokens English corpus, built from
  a diverse mixture of web content, scientific papers, code,
  public-domain books, social media, and encyclopedic
  materials. In addition, we open source our data curation
  toolkit to enable further experimentation and reproduction
  of our work. In this report, we document Dolma, including
  its design principles, details about its construction, and
  a summary of its contents. We interleave this report with
  analyses and experimental results from training language
  models on intermediate states of Dolma to share what we
  have learned about important data curation practices,
  including the role of content or quality filters,
  deduplication, and multi-source mixing. Dolma has been
  used to train OLMo, a state-of-the-art, open language
  model and framework designed to build and study the
  science of language modeling.
license: Apache-2.0

GitHub Events

Total
  • Create event: 75
  • Release event: 4
  • Issues event: 42
  • Watch event: 304
  • Delete event: 30
  • Issue comment event: 52
  • Push event: 517
  • Pull request review comment event: 31
  • Pull request review event: 53
  • Pull request event: 62
  • Fork event: 43
Last Year
  • Create event: 75
  • Release event: 4
  • Issues event: 42
  • Watch event: 304
  • Delete event: 30
  • Issue comment event: 52
  • Push event: 517
  • Pull request review comment event: 31
  • Pull request review event: 53
  • Pull request event: 62
  • Fork event: 43

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 283
  • Total Committers: 22
  • Avg Commits per committer: 12.864
  • Development Distribution Score (DDS): 0.689
Past Year
  • Commits: 174
  • Committers: 20
  • Avg Commits per committer: 8.7
  • Development Distribution Score (DDS): 0.603
Top Committers
Name Email Commits
Luca Soldaini l****s@a****g 88
Luca Soldaini l****a@s****t 78
chris-ha458 h****9@g****m 69
kyleclo k****o@u****u 13
dependabot[bot] 4****] 7
Peter Bjørn Jørgensen p****n@g****m 5
David Graham d****1@g****m 3
Niklas Muennighoff n****f@g****m 3
Tyler Murray t****m@a****g 3
Rodney Kinney r****k@a****g 2
Arnavi Chheda a****c@l****m 1
Ben Bogin b****9@g****m 1
Dirk Groeneveld d****g@a****g 1
Ishan Anand g****b@i****g 1
Kenneth Enevoldsen k****n@g****m 1
Rohit Singh Rathaur r****5@g****m 1
Simon Willison s****n@g****m 1
Tyler Murray t****7@g****m 1
Ikko Eltociear Ashimine e****r@g****m 1
Ian Magnusson 4****n 1
Dustin Schwenk d****k 1
epwalsh p****w@a****g 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 86
  • Total pull requests: 185
  • Average time to close issues: 5 months
  • Average time to close pull requests: 11 days
  • Total issue authors: 45
  • Total pull request authors: 33
  • Average comments per issue: 1.6
  • Average comments per pull request: 0.38
  • Merged pull requests: 144
  • Bot issues: 0
  • Bot pull requests: 11
Past Year
  • Issues: 18
  • Pull requests: 62
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 18 days
  • Issue authors: 15
  • Pull request authors: 15
  • Average comments per issue: 0.83
  • Average comments per pull request: 0.32
  • Merged pull requests: 38
  • Bot issues: 0
  • Bot pull requests: 2
Top Authors
Issue Authors
  • hannahzacharski55 (20)
  • soldni (8)
  • peterbjorgensen (7)
  • mihara-bot (3)
  • wannaphong (3)
  • yushengsu-thu (2)
  • mrqorib (2)
  • chschroeder (2)
  • codefly13 (2)
  • zxnie (1)
  • Jackwaterveg (1)
  • ehartford (1)
  • silverriver (1)
  • joellliu (1)
  • XevWright (1)
Pull Request Authors
  • soldni (123)
  • undfined (30)
  • dependabot[bot] (17)
  • Whattabatt (9)
  • kyleclo (8)
  • peterbjorgensen (8)
  • chris-ha458 (5)
  • cmwilhelm (5)
  • no0p (4)
  • revbucket (4)
  • yushengsu-thu (3)
  • Muennighoff (3)
  • rodneykinney (3)
  • mariia-iureva (2)
  • power10dan (2)
Top Labels
Issue Labels
enhancement (5)
Pull Request Labels
dependencies (17) python (2) github_actions (2) rust (1)

Packages

  • Total packages: 2
  • Total downloads: 11,143 last month (pypi)
  • Total dependent packages: 0 (may contain duplicates)
  • Total dependent repositories: 0 (may contain duplicates)
  • Total versions: 79
  • Total maintainers: 3
proxy.golang.org: github.com/allenai/dolma
  • Versions: 38
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.4%
Average: 5.6%
Dependent repos count: 5.8%
Last synced: 4 months ago
pypi.org: dolma

Toolkit for pre-processing LLM training data.

  • Versions: 41
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 11,143 Last month
Rankings
Stargazers count: 4.9%
Downloads: 5.2%
Dependent packages count: 7.5%
Forks count: 10.6%
Average: 19.6%
Dependent repos count: 69.8%
Maintainers (3)
Last synced: 4 months ago

Dependencies

.github/workflows/CI.yml actions
  • PyO3/maturin-action v1 composite
  • actions-rs/toolchain v1 composite
  • actions/checkout v1 composite
  • actions/checkout v3 composite
  • actions/download-artifact v3 composite
  • actions/setup-python v2 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
Cargo.lock cargo
  • 233 dependencies
pyproject.toml pypi
  • anyascii >=0.3.2
  • blingfire ==0.1.8
  • boto3 *
  • cached-path ==1.3.4
  • detect-secrets ==1.4.0
  • fasttext-wheel ==0.9.2
  • fsspec *
  • msgspec >=0.14.2
  • nltk ==3.8.1
  • omegaconf >=2.3.0
  • presidio_analyzer ==2.2.32
  • pycld2 ==0.41
  • pyyaml *
  • requests *
  • rich *
  • s3fs *
  • smart-open *
  • tokenizers >=0.13.3,<1.0.0
  • tqdm *
  • uniseg *
.github/workflows/ISSUE_TEMPLATE/bug_report.yml actions
.github/workflows/ISSUE_TEMPLATE/documentation.yml actions
.github/workflows/ISSUE_TEMPLATE/feature_request.yml actions
.github/workflows/ISSUE_TEMPLATE/question.yml actions
Cargo.toml cargo
sources/reddit/atomic_content_v3/requirements.txt pypi
  • apache-beam *
  • jsonlines *
sources/reddit/atomic_content_v3/setup.py pypi
  • jsonlines *
sources/reddit/atomic_content_v5/requirements.txt pypi
  • apache-beam *
  • jsonlines *
sources/reddit/atomic_content_v5/setup.py pypi
  • jsonlines *
sources/reddit/comment_threads_v1/requirements.txt pypi
  • apache-beam *
  • datasets *
  • jsonlines *
sources/reddit/comment_threads_v1/setup.py pypi
  • jsonlines *
sources/reddit/comment_threads_v2/requirements.txt pypi
  • apache-beam *
  • datasets *
  • jsonlines *
sources/reddit/comment_threads_v2/setup.py pypi
  • jsonlines *
sources/reddit/complete_threads_codelike_v4/requirements.txt pypi
  • apache-beam *
  • datasets *
  • jsonlines *
sources/reddit/complete_threads_codelike_v4/setup.py pypi
  • jsonlines *
sources/starcoder/requirements.txt pypi
  • pyarrow *