Recent Releases of dolma

dolma - v1.2.1

What's Changed

  • FIM script modifications by @regan-huff in https://github.com/allenai/dolma/pull/258
  • Adding tool to reshard npy files based on maximum desired size. by @soldni in https://github.com/allenai/dolma/pull/269

New Contributors

  • @regan-huff made their first contribution in https://github.com/allenai/dolma/pull/258

Full Changelog: https://github.com/allenai/dolma/compare/v1.2.0...v1.2.1

- Python
Published by soldni 11 months ago

dolma - v1.2.0

What's Changed

  • Code-prose-composition tagger. by @no0p in https://github.com/allenai/dolma/pull/247
  • Support for WARC resource record types. by @no0p in https://github.com/allenai/dolma/pull/248
  • Bump artifacts version to 4.4.1 by @no0p in https://github.com/allenai/dolma/pull/252
  • Sanitize-concat-fim by @cmwilhelm in https://github.com/allenai/dolma/pull/253
  • Use original s3 path to delete local cache by @CodeCreator in https://github.com/allenai/dolma/pull/257
  • Safe tokenization by skipping failing docs. by @soldni in https://github.com/allenai/dolma/pull/245
  • Update RedPajama branch link by @guspan-tanadi in https://github.com/allenai/dolma/pull/263
  • Skipping empty tagger key instead of erroring out by @Whattabatt in https://github.com/allenai/dolma/pull/262
  • Tokenizer over custom fields and w/o IDs; BOS/EOS tokens. by @soldni in https://github.com/allenai/dolma/pull/266

New Contributors

  • @no0p made their first contribution in https://github.com/allenai/dolma/pull/247
  • @CodeCreator made their first contribution in https://github.com/allenai/dolma/pull/257
  • @guspan-tanadi made their first contribution in https://github.com/allenai/dolma/pull/263

Full Changelog: https://github.com/allenai/dolma/compare/v1.1.2...v1.2.0

- Python
Published by soldni 12 months ago

dolma - v1.1.2

What's Changed

  • Matrix targets stomping each other by @undfined in https://github.com/allenai/dolma/pull/236
  • Artifact name change in release too by @undfined in https://github.com/allenai/dolma/pull/237
  • Pattern match for all artifacts by @undfined in https://github.com/allenai/dolma/pull/239
  • Handle retries in the aws client, use adaptive backoff by @undfined in https://github.com/allenai/dolma/pull/241
  • Bump version to 1.1.2 for release by @undfined in https://github.com/allenai/dolma/pull/243

Full Changelog: https://github.com/allenai/dolma/compare/v1.1.1...v1.1.2

- Python
Published by undfined over 1 year ago

dolma - v1.1.1

What's Changed

  • Fixes checks by @cmwilhelm in https://github.com/allenai/dolma/pull/232
  • Adds strict filter mechanism to Mixer by @undfined in https://github.com/allenai/dolma/pull/231
  • v3 download-artifact GH action deprecated by @undfined in https://github.com/allenai/dolma/pull/235

Full Changelog: https://github.com/allenai/dolma/compare/v1.1.0...v1.1.1

- Python
Published by undfined over 1 year ago

dolma - v1.1.0

What's Changed

  • Bump version with postfix for PyPI by @undfined in https://github.com/allenai/dolma/pull/206
  • Pin maturin in CI by @undfined in https://github.com/allenai/dolma/pull/207
  • Also pin maturin in action by @undfined in https://github.com/allenai/dolma/pull/208
  • 'File partition' option and 'document' directory specification by @Whattabatt in https://github.com/allenai/dolma/pull/213
  • Fixed issues and improved documentation in getting-started.md by @aman-17 in https://github.com/allenai/dolma/pull/216
  • Mixer validator by @mariia-iureva in https://github.com/allenai/dolma/pull/215
  • Typo in optional dependency (lingua) check by @soldni in https://github.com/allenai/dolma/pull/221

New Contributors

  • @aman-17 made their first contribution in https://github.com/allenai/dolma/pull/216
  • @mariia-iureva made their first contribution in https://github.com/allenai/dolma/pull/215

Full Changelog: https://github.com/allenai/dolma/compare/v1.0.14...v1.0.15

- Python
Published by soldni over 1 year ago

dolma - v1.0.14.post1

What's Changed

  • Bump version with postfix for PyPI by @undfined in https://github.com/allenai/dolma/pull/206
  • Pin maturin in CI by @undfined in https://github.com/allenai/dolma/pull/207
  • Also pin maturin in action by @undfined in https://github.com/allenai/dolma/pull/208

Full Changelog: https://github.com/allenai/dolma/compare/v1.0.14...v1.0.14.post1

- Python
Published by undfined over 1 year ago

dolma - v1.0.14

What's Changed

  • Adds dclm fasttext classifier by @undfined in https://github.com/allenai/dolma/pull/205

Full Changelog: https://github.com/allenai/dolma/compare/v1.0.13...v1.0.14

- Python
Published by undfined over 1 year ago

dolma - v1.0.13

What's Changed

  • Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows in the github_actions group across 1 directory by @dependabot in https://github.com/allenai/dolma/pull/198
  • Fix bug in length filtering for deduping by @soldni in https://github.com/allenai/dolma/pull/197
  • Polymorphic span replacement by @undfined in https://github.com/allenai/dolma/pull/200
  • Dependabot fail, match upload/download action versions by @undfined in https://github.com/allenai/dolma/pull/202
  • [Json format error in line 133] Update getting-started.md by @yushengsu-thu in https://github.com/allenai/dolma/pull/196
  • Revert upload/download to v3 for now by @undfined in https://github.com/allenai/dolma/pull/203
  • Undfined/runner v3 by @undfined in https://github.com/allenai/dolma/pull/204

Full Changelog: https://github.com/allenai/dolma/compare/v1.0.12...v1.0.13

- Python
Published by undfined over 1 year ago

dolma - v1.0.12

What's Changed

  • Added tokenizers for length by @soldni in https://github.com/allenai/dolma/pull/189
  • Update getting-started.md by @yushengsu-thu in https://github.com/allenai/dolma/pull/193
  • Bump nltk from 3.8.1 to 3.9 in the pip group across 1 directory by @dependabot in https://github.com/allenai/dolma/pull/187
  • Use Numpy v1.x instead of 2.x by @soldni in https://github.com/allenai/dolma/pull/195

New Contributors

  • @yushengsu-thu made their first contribution in https://github.com/allenai/dolma/pull/193

Full Changelog: https://github.com/allenai/dolma/compare/v1.0.11...v1.0.12

- Python
Published by soldni almost 2 years ago

dolma - v1.0.11

What's Changed

  • Changed entrypoint to increase IntelliJ compatibility by @soldni in https://github.com/allenai/dolma/pull/188

Full Changelog: https://github.com/allenai/dolma/compare/v1.0.10...v1.0.11

- Python
Published by soldni almost 2 years ago

dolma - v1.0.10

What's Changed

  • Fix local installation on MacOS by @epwalsh in https://github.com/allenai/dolma/pull/185
  • Count Bytes and Docs by @soldni in https://github.com/allenai/dolma/pull/186

New Contributors

  • @epwalsh made their first contribution in https://github.com/allenai/dolma/pull/185

Full Changelog: https://github.com/allenai/dolma/compare/v1.0.9...v1.0.10

- Python
Published by soldni almost 2 years ago

dolma - v1.0.9

What's Changed

  • Fix Tests to pass with new mixer behavior by @soldni in https://github.com/allenai/dolma/pull/184

Full Changelog: https://github.com/allenai/dolma/compare/v1.0.8...v1.0.9

- Python
Published by soldni almost 2 years ago

dolma - v1.0.8

What's Changed

  • Always use inferred extension by @undfined in https://github.com/allenai/dolma/pull/183

Full Changelog: https://github.com/allenai/dolma/compare/v1.0.7...v1.0.8

- Python
Published by undfined almost 2 years ago

dolma - v1.0.7

What's Changed

  • Better Filters Error Handling by @soldni in https://github.com/allenai/dolma/pull/171
  • Bump openssl from 0.10.64 to 0.10.66 in the cargo group by @dependabot in https://github.com/allenai/dolma/pull/178
  • Bump to 1.0.7 by @undfined in https://github.com/allenai/dolma/pull/182

Full Changelog: https://github.com/allenai/dolma/compare/v1.0.6...v1.0.7

- Python
Published by soldni almost 2 years ago

dolma - v1.0.6

What's Changed

  • V2 of Gopher tagger by @soldni in https://github.com/allenai/dolma/pull/181

Full Changelog: https://github.com/allenai/dolma/compare/v1.0.5...v1.0.6

- Python
Published by undfined almost 2 years ago

dolma - v1.0.5

What's Changed

  • Cherry pick zstd compressor by @undfined in https://github.com/allenai/dolma/pull/180

Full Changelog: https://github.com/allenai/dolma/compare/v1.0.4...v1.0.5

- Python
Published by undfined almost 2 years ago

dolma - v1.0.4

What's Changed

  • Bump rustls from 0.21.10 to 0.21.11 in the cargo group across 1 directory by @dependabot in https://github.com/allenai/dolma/pull/149
  • fix divide by 0 in gopher tagger by @peterbjorgensen in https://github.com/allenai/dolma/pull/148
  • Fixing dtype option not being correctly propagated by @soldni in https://github.com/allenai/dolma/pull/154
  • Add support for parsing WARC by @soldni in https://github.com/allenai/dolma/pull/153
  • Reducing hash calls by @Whattabatt in https://github.com/allenai/dolma/pull/156
  • Bump rustls from 0.21.11 to 0.21.12 in the cargo group across 1 directory by @dependabot in https://github.com/allenai/dolma/pull/155
  • Adding Quality Classifier from Dolma 1.7 by @soldni in https://github.com/allenai/dolma/pull/163
  • Adds ZST support in Deduper and Mixer by @soldni in https://github.com/allenai/dolma/pull/170
  • Workaround to fix memory leak in HuggingFace tokenizer by @soldni in https://github.com/allenai/dolma/pull/169
  • Adding partition logic by @Whattabatt in https://github.com/allenai/dolma/pull/161
  • added option for tokenizer to split on special tokens by @soldni in https://github.com/allenai/dolma/pull/176
  • Version bump for new release (1.0.4) by @soldni in https://github.com/allenai/dolma/pull/179

New Contributors

  • @Whattabatt made their first contribution in https://github.com/allenai/dolma/pull/156

Full Changelog: https://github.com/allenai/dolma/compare/v1.0.3...v1.0.4

- Python
Published by soldni almost 2 years ago

dolma - v1.0.3

What's Changed

  • Fix local shuffling failure by @soldni in https://github.com/allenai/dolma/pull/140
  • Fix issue in getting started tutorial using wikipedia data by @RohitRathore1 in https://github.com/allenai/dolma/pull/117
  • Add an option to improve tokenization shuffling by @soldni in https://github.com/allenai/dolma/pull/141
  • Optionally add total/sum to output of analyzer by @soldni in https://github.com/allenai/dolma/pull/144
  • Add extra tests for multi-byte unicode spans in deduper. by @soldni in https://github.com/allenai/dolma/pull/145
  • Bump s3 client lib and parameterize region in s3 tests + devcontainer by @undfined in https://github.com/allenai/dolma/pull/147

New Contributors

  • @RohitRathore1 made their first contribution in https://github.com/allenai/dolma/pull/117
  • @undfined made their first contribution in https://github.com/allenai/dolma/pull/147

Full Changelog: https://github.com/allenai/dolma/compare/v1.0.2...v1.0.3

- Python
Published by soldni about 2 years ago

dolma - v1.0.2

What's Changed

  • Taggers for URL filtering by @soldni in https://github.com/allenai/dolma/pull/112
  • Updated CFF and Bibtex by @soldni in https://github.com/allenai/dolma/pull/118
  • Add preliminary Dolma v1.7 configurations, fix corner case in tokens. by @soldni in https://github.com/allenai/dolma/pull/120
  • Update CITATION.cff by @soldni in https://github.com/allenai/dolma/pull/126
  • Option to use ngram overlap to dedupe paragraphs by @rodneykinney in https://github.com/allenai/dolma/pull/122
  • Tagger modules import (fix for #128) by @soldni in https://github.com/allenai/dolma/pull/129
  • Added Support for JQ syntax in include/exclude mixer config by @soldni in https://github.com/allenai/dolma/pull/131
  • Added JQ syntax for replacements + added minimum score. by @soldni in https://github.com/allenai/dolma/pull/133
  • Bump the cargo group with 1 update by @dependabot in https://github.com/allenai/dolma/pull/132
  • Improves tool to compute statistics; adds deduplication options. by @soldni in https://github.com/allenai/dolma/pull/135
  • use precompiled regex when loading url blocklists by @peterbjorgensen in https://github.com/allenai/dolma/pull/137

Full Changelog: https://github.com/allenai/dolma/compare/v1.0.1...v1.0.2

- Python
Published by soldni about 2 years ago

dolma - v1.0.1

What's Changed

  • Update README.md by @eltociear in https://github.com/allenai/dolma/pull/115
  • do not overwrite tagger outputs with the same output path, fixes #113 by @peterbjorgensen in https://github.com/allenai/dolma/pull/114
  • Fix broken data sheet link in README by @simonw in https://github.com/allenai/dolma/pull/107
  • Modify CI to build when version is incremented; increment to v1.0.1 by @soldni in https://github.com/allenai/dolma/pull/116

New Contributors

  • @eltociear made their first contribution in https://github.com/allenai/dolma/pull/115
  • @simonw made their first contribution in https://github.com/allenai/dolma/pull/107

Full Changelog: https://github.com/allenai/dolma/compare/v1.0.0...v1.0.1

- Python
Published by soldni over 2 years ago

dolma - v1.0.0

What's Changed

  • Add robust median to gopher filter by @KennethEnevoldsen in https://github.com/allenai/dolma/pull/98
  • Disambiguating that the repo is for the dolma toolkit in various docs by @arnavic in https://github.com/allenai/dolma/pull/104
  • V1.0 candidate; new deduper options, new taggers by @soldni in https://github.com/allenai/dolma/pull/100
  • Fixing Errors in Linux Build by @soldni in https://github.com/allenai/dolma/pull/105

New Contributors

  • @KennethEnevoldsen made their first contribution in https://github.com/allenai/dolma/pull/98
  • @arnavic made their first contribution in https://github.com/allenai/dolma/pull/104

Full Changelog: https://github.com/allenai/dolma/compare/v0.9.4...v1.0.0

- Python
Published by soldni over 2 years ago

dolma - v0.9.4

What's Changed

  • Bump h2 from 0.3.20 to 0.3.24 by @dependabot in https://github.com/allenai/dolma/pull/101
  • BOS/EOS/PAD options in tokens cli; speed up tokenization by segmenting paragraphs. by @soldni in https://github.com/allenai/dolma/pull/102
  • Fixed Dangling CLI Options; E2E Tokenizer Tests by @soldni in https://github.com/allenai/dolma/pull/103

Full Changelog: https://github.com/allenai/dolma/compare/v0.9.2...v0.9.4

- Python
Published by soldni over 2 years ago

dolma - v0.9.2

What's Changed

  • Remove unnecessary spawn in tokenizer, fix config with multiple paths by @soldni in https://github.com/allenai/dolma/pull/67
  • Add tagger_modules option to tagger cli by @peterbjorgensen in https://github.com/allenai/dolma/pull/69
  • feature to get the complement of a hash sample by @IanMagnusson in https://github.com/allenai/dolma/pull/72
  • Fix Hardcoded Tokenizer by @soldni in https://github.com/allenai/dolma/pull/71
  • Fix a few issues of the FixedBucketsValTracker by @peterbjorgensen in https://github.com/allenai/dolma/pull/73
  • Add attribute correlations by @Muennighoff in https://github.com/allenai/dolma/pull/68
  • Porting missing code filtering rules to dolma repo by @soldni in https://github.com/allenai/dolma/pull/86
  • Disable cache in CI to prevent build failures by @soldni in https://github.com/allenai/dolma/pull/90
  • Reddit processing code by @drschwenk in https://github.com/allenai/dolma/pull/74
  • update readme by @kyleclo in https://github.com/allenai/dolma/pull/95
  • code/reasoning evaluation script by @benbogin in https://github.com/allenai/dolma/pull/94
  • Add The Stack statistics by @Muennighoff in https://github.com/allenai/dolma/pull/92
  • Fixing Build Config Issues by @soldni in https://github.com/allenai/dolma/pull/99

New Contributors

  • @peterbjorgensen made their first contribution in https://github.com/allenai/dolma/pull/69
  • @IanMagnusson made their first contribution in https://github.com/allenai/dolma/pull/72
  • @drschwenk made their first contribution in https://github.com/allenai/dolma/pull/74
  • @benbogin made their first contribution in https://github.com/allenai/dolma/pull/94

Full Changelog: https://github.com/allenai/dolma/compare/v0.9.1...v0.9.2

- Python
Published by soldni over 2 years ago

dolma - v0.9.1

What's Changed

  • Fix Jekyll Docs Build by @soldni in https://github.com/allenai/dolma/pull/55
  • Adding Citation text back to README by @soldni in https://github.com/allenai/dolma/pull/56
  • Bump rustix from 0.37.20 to 0.37.25 by @dependabot in https://github.com/allenai/dolma/pull/59
  • Documentation on BaseParallelProcessor by @soldni in https://github.com/allenai/dolma/pull/62
  • Add download instruction by @Muennighoff in https://github.com/allenai/dolma/pull/63
  • Fix spawn method for multiprocessing by @soldni in https://github.com/allenai/dolma/pull/64
  • Fix hardcoded URL by @soldni in https://github.com/allenai/dolma/pull/65
  • Fix Accidental Override of Boolean Value by @soldni in https://github.com/allenai/dolma/pull/66

New Contributors

  • @Muennighoff made their first contribution in https://github.com/allenai/dolma/pull/63

Full Changelog: https://github.com/allenai/dolma/compare/v0.9.0...v0.9.1

- Python
Published by soldni over 2 years ago

dolma - v0.9.0

What's Changed

  • Skipping AWS checks when aws access key is not available by @soldni in https://github.com/allenai/dolma/pull/28
  • env variable is not passed to tests by @soldni in https://github.com/allenai/dolma/pull/29
  • Fix make by @chris-ha458 in https://github.com/allenai/dolma/pull/24
  • Fix make more by @chris-ha458 in https://github.com/allenai/dolma/pull/31
  • ff by @soldni in https://github.com/allenai/dolma/pull/36
  • Adding C4 example, dryrun mode, profiling taggers by @soldni in https://github.com/allenai/dolma/pull/37
  • Only run Python style checks on source and tests by @soldni in https://github.com/allenai/dolma/pull/38
  • fix rust parts by @chris-ha458 in https://github.com/allenai/dolma/pull/23
  • Add rust unit tests by @chris-ha458 in https://github.com/allenai/dolma/pull/35
  • Bump webpki from 0.22.0 to 0.22.2 by @dependabot in https://github.com/allenai/dolma/pull/52
  • Adding Tokenizer, Writing Documentation, Misc Bugs & CLI improvements by @soldni in https://github.com/allenai/dolma/pull/54

New Contributors

  • @chris-ha458 made their first contribution in https://github.com/allenai/dolma/pull/24
  • @dependabot made their first contribution in https://github.com/allenai/dolma/pull/52

Full Changelog: https://github.com/allenai/dolma/compare/v0.8.0...v0.9.0

- Python
Published by soldni over 2 years ago

dolma - v0.8.0

What's Changed

  • Analyzer to save and plot taggers distribution by @soldni in https://github.com/allenai/dolma/pull/21
  • Scripts to compute statistics by @soldni in https://github.com/allenai/dolma/pull/22

Full Changelog: https://github.com/allenai/dolma/compare/v0.7.0...v0.8.0

- Python
Published by soldni almost 3 years ago

dolma - v0.7.0

What's Changed

  • CLI improvements, remove need of experiment name by @soldni in https://github.com/allenai/dolma/pull/20

Full Changelog: https://github.com/allenai/dolma/compare/v0.6.5...v0.7.0

- Python
Published by soldni almost 3 years ago

dolma - v0.6.5

What's Changed

  • added validation of configs, tagger bugfixes by @soldni in https://github.com/allenai/dolma/pull/18
  • upping version by @soldni in https://github.com/allenai/dolma/pull/19

Full Changelog: https://github.com/allenai/dolma/compare/v0.6.4...v0.6.5

- Python
Published by soldni almost 3 years ago

dolma - v0.6.4

What's Changed

  • adding tests in CI by @soldni in https://github.com/allenai/dolma/pull/17
  • Added tests for local/remote bindings for deduper/mixer by @soldni in https://github.com/allenai/dolma/pull/15

Full Changelog: https://github.com/allenai/dolma/compare/v0.6.3...v0.6.4

- Python
Published by soldni almost 3 years ago

dolma - v0.6.3

What's Changed

  • Mixer can use s3 or local paths by @rodneykinney in https://github.com/allenai/dolma/pull/14

New Contributors

  • @rodneykinney made their first contribution in https://github.com/allenai/dolma/pull/14

Full Changelog: https://github.com/allenai/dolma/compare/v0.6.2...v0.6.3

- Python
Published by soldni almost 3 years ago

dolma - v0.6.2

What's Changed

  • Add Dirk as an author by @dirkgr in https://github.com/allenai/dolma/pull/2
  • README, tests by @kyleclo in https://github.com/allenai/dolma/pull/1
  • ff main by @soldni in https://github.com/allenai/dolma/pull/4
  • Kylel/test taggers by @kyleclo in https://github.com/allenai/dolma/pull/5
  • ff soldni/cli branch by @soldni in https://github.com/allenai/dolma/pull/7
  • add tests for data types by @kyleclo in https://github.com/allenai/dolma/pull/8
  • ff by @soldni in https://github.com/allenai/dolma/pull/9
  • CLI for dolma by @soldni in https://github.com/allenai/dolma/pull/6
  • Readme and instructions by @soldni in https://github.com/allenai/dolma/pull/11
  • Update README.md by @ianand in https://github.com/allenai/dolma/pull/12
  • Fixing Build Issues by @soldni in https://github.com/allenai/dolma/pull/13

New Contributors

  • @dirkgr made their first contribution in https://github.com/allenai/dolma/pull/2
  • @kyleclo made their first contribution in https://github.com/allenai/dolma/pull/1
  • @soldni made their first contribution in https://github.com/allenai/dolma/pull/4
  • @ianand made their first contribution in https://github.com/allenai/dolma/pull/12

Full Changelog: https://github.com/allenai/dolma/commits/v0.6.2

- Python
Published by soldni almost 3 years ago