Recent Releases of dolma
dolma - v1.2.1
What's Changed
- FIM script modifications by @regan-huff in https://github.com/allenai/dolma/pull/258
- Adding tool to reshard npy files based on maximum desired size. by @soldni in https://github.com/allenai/dolma/pull/269
New Contributors
- @regan-huff made their first contribution in https://github.com/allenai/dolma/pull/258
Full Changelog: https://github.com/allenai/dolma/compare/v1.2.0...v1.2.1
- Python
Published by soldni 11 months ago
dolma - v1.2.0
What's Changed
- Code-prose-composition tagger. by @no0p in https://github.com/allenai/dolma/pull/247
- Support for WARC resource record types. by @no0p in https://github.com/allenai/dolma/pull/248
- Bump artifacts version to 4.4.1 by @no0p in https://github.com/allenai/dolma/pull/252
- Sanitize-concat-fim by @cmwilhelm in https://github.com/allenai/dolma/pull/253
- Use original s3 path to delete local cache by @CodeCreator in https://github.com/allenai/dolma/pull/257
- Safe tokenization by skipping failing docs. by @soldni in https://github.com/allenai/dolma/pull/245
- Update RedPajama branch link by @guspan-tanadi in https://github.com/allenai/dolma/pull/263
- Skipping empty tagger key instead of erroring out by @Whattabatt in https://github.com/allenai/dolma/pull/262
- Tokenizer over custom fields and w/o IDs; BOS/EOS tokens. by @soldni in https://github.com/allenai/dolma/pull/266
New Contributors
- @no0p made their first contribution in https://github.com/allenai/dolma/pull/247
- @CodeCreator made their first contribution in https://github.com/allenai/dolma/pull/257
- @guspan-tanadi made their first contribution in https://github.com/allenai/dolma/pull/263
Full Changelog: https://github.com/allenai/dolma/compare/v1.1.2...v1.2.0
Published by soldni 12 months ago
dolma - v1.1.2
What's Changed
- Matrix targets stomping each other by @undfined in https://github.com/allenai/dolma/pull/236
- Artifact name change in release too by @undfined in https://github.com/allenai/dolma/pull/237
- Pattern match for all artifacts by @undfined in https://github.com/allenai/dolma/pull/239
- Handle retries in the aws client, use adaptive backoff by @undfined in https://github.com/allenai/dolma/pull/241
- Bump version to 1.1.2 for release by @undfined in https://github.com/allenai/dolma/pull/243
Full Changelog: https://github.com/allenai/dolma/compare/v1.1.1...v1.1.2
Published by undfined over 1 year ago
dolma - v1.1.1
What's Changed
- Fixes checks by @cmwilhelm in https://github.com/allenai/dolma/pull/232
- Adds strict filter mechanism to Mixer by @undfined in https://github.com/allenai/dolma/pull/231
- v3 download-artifact GH action deprecated by @undfined in https://github.com/allenai/dolma/pull/235
Full Changelog: https://github.com/allenai/dolma/compare/v1.1.0...v1.1.1
Published by undfined over 1 year ago
dolma - v1.1.0
What's Changed
- Bump version with postfix for PyPI by @undfined in https://github.com/allenai/dolma/pull/206
- Pin maturin in CI by @undfined in https://github.com/allenai/dolma/pull/207
- Also pin maturin in action by @undfined in https://github.com/allenai/dolma/pull/208
- 'File partition' option and 'document' directory specification by @Whattabatt in https://github.com/allenai/dolma/pull/213
- Fixed issues and improved documentation in getting-started.md by @aman-17 in https://github.com/allenai/dolma/pull/216
- Mixer validator by @mariia-iureva in https://github.com/allenai/dolma/pull/215
- Typo in optional dependency (lingua) check by @soldni in https://github.com/allenai/dolma/pull/221
New Contributors
- @aman-17 made their first contribution in https://github.com/allenai/dolma/pull/216
- @mariia-iureva made their first contribution in https://github.com/allenai/dolma/pull/215
Full Changelog: https://github.com/allenai/dolma/compare/v1.0.14...v1.0.15
Published by soldni over 1 year ago
dolma - v1.0.14.post1
What's Changed
- Bump version with postfix for PyPI by @undfined in https://github.com/allenai/dolma/pull/206
- Pin maturin in CI by @undfined in https://github.com/allenai/dolma/pull/207
- Also pin maturin in action by @undfined in https://github.com/allenai/dolma/pull/208
Full Changelog: https://github.com/allenai/dolma/compare/v1.0.14...v1.0.14.post1
Published by undfined over 1 year ago
dolma - v1.0.13
What's Changed
- Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows in the github_actions group across 1 directory by @dependabot in https://github.com/allenai/dolma/pull/198
- Fix bug in length filtering for deduping by @soldni in https://github.com/allenai/dolma/pull/197
- Polymorphic span replacement by @undfined in https://github.com/allenai/dolma/pull/200
- Dependabot fail, match upload/download action versions by @undfined in https://github.com/allenai/dolma/pull/202
- [JSON format error in line 133] Update getting-started.md by @yushengsu-thu in https://github.com/allenai/dolma/pull/196
- Revert upload/download to v3 for now by @undfined in https://github.com/allenai/dolma/pull/203
- Undfined/runner v3 by @undfined in https://github.com/allenai/dolma/pull/204
Full Changelog: https://github.com/allenai/dolma/compare/v1.0.12...v1.0.13
Published by undfined over 1 year ago
dolma - v1.0.12
What's Changed
- Added tokenizers for length by @soldni in https://github.com/allenai/dolma/pull/189
- Update getting-started.md by @yushengsu-thu in https://github.com/allenai/dolma/pull/193
- Bump nltk from 3.8.1 to 3.9 in the pip group across 1 directory by @dependabot in https://github.com/allenai/dolma/pull/187
- Use Numpy v1.x instead of 2.x by @soldni in https://github.com/allenai/dolma/pull/195
New Contributors
- @yushengsu-thu made their first contribution in https://github.com/allenai/dolma/pull/193
Full Changelog: https://github.com/allenai/dolma/compare/v1.0.11...v1.0.12
Published by soldni almost 2 years ago
dolma - v1.0.10
What's Changed
- Fix local installation on MacOS by @epwalsh in https://github.com/allenai/dolma/pull/185
- Count Bytes and Docs by @soldni in https://github.com/allenai/dolma/pull/186
New Contributors
- @epwalsh made their first contribution in https://github.com/allenai/dolma/pull/185
Full Changelog: https://github.com/allenai/dolma/compare/v1.0.9...v1.0.10
Published by soldni almost 2 years ago
dolma - v1.0.7
What's Changed
- Better Filters Error Handling by @soldni in https://github.com/allenai/dolma/pull/171
- Bump openssl from 0.10.64 to 0.10.66 in the cargo group by @dependabot in https://github.com/allenai/dolma/pull/178
- Bump to 1.0.7 by @undfined in https://github.com/allenai/dolma/pull/182
Full Changelog: https://github.com/allenai/dolma/compare/v1.0.6...v1.0.7
Published by soldni almost 2 years ago
dolma - v1.0.4
What's Changed
- Bump rustls from 0.21.10 to 0.21.11 in the cargo group across 1 directory by @dependabot in https://github.com/allenai/dolma/pull/149
- fix divide by 0 in gopher tagger by @peterbjorgensen in https://github.com/allenai/dolma/pull/148
- Fixing dtype option not being correctly propagated by @soldni in https://github.com/allenai/dolma/pull/154
- Add support for parsing WARC by @soldni in https://github.com/allenai/dolma/pull/153
- Reducing hash calls by @Whattabatt in https://github.com/allenai/dolma/pull/156
- Bump rustls from 0.21.11 to 0.21.12 in the cargo group across 1 directory by @dependabot in https://github.com/allenai/dolma/pull/155
- Adding Quality Classifier from Dolma 1.7 by @soldni in https://github.com/allenai/dolma/pull/163
- Adds ZST support in Deduper and Mixer by @soldni in https://github.com/allenai/dolma/pull/170
- Workaround to fix memory leak in HuggingFace tokenizer by @soldni in https://github.com/allenai/dolma/pull/169
- Adding partition logic by @Whattabatt in https://github.com/allenai/dolma/pull/161
- added option for tokenizer to split on special tokens by @soldni in https://github.com/allenai/dolma/pull/176
- Version bump for new release (1.0.4) by @soldni in https://github.com/allenai/dolma/pull/179
New Contributors
- @Whattabatt made their first contribution in https://github.com/allenai/dolma/pull/156
Full Changelog: https://github.com/allenai/dolma/compare/v1.0.3...v1.0.4
Published by soldni almost 2 years ago
dolma - v1.0.3
What's Changed
- Fix local shuffling failure by @soldni in https://github.com/allenai/dolma/pull/140
- Fix issue in getting started tutorial using wikipedia data by @RohitRathore1 in https://github.com/allenai/dolma/pull/117
- Add an option to improve tokenization shuffling by @soldni in https://github.com/allenai/dolma/pull/141
- Optionally add total/sum to output of analyzer by @soldni in https://github.com/allenai/dolma/pull/144
- Add extra tests for multi-byte unicode spans in deduper. by @soldni in https://github.com/allenai/dolma/pull/145
- Bump s3 client lib and parameterize region in s3 tests + devcontainer by @undfined in https://github.com/allenai/dolma/pull/147
New Contributors
- @RohitRathore1 made their first contribution in https://github.com/allenai/dolma/pull/117
- @undfined made their first contribution in https://github.com/allenai/dolma/pull/147
Full Changelog: https://github.com/allenai/dolma/compare/v1.0.2...v1.0.3
Published by soldni about 2 years ago
dolma - v1.0.2
What's Changed
- Taggers for URL filtering by @soldni in https://github.com/allenai/dolma/pull/112
- Updated CFF and Bibtex by @soldni in https://github.com/allenai/dolma/pull/118
- Add preliminary Dolma v1.7 configurations, fix corner case in tokens. by @soldni in https://github.com/allenai/dolma/pull/120
- Update CITATION.cff by @soldni in https://github.com/allenai/dolma/pull/126
- Option to use ngram overlap to dedupe paragraphs by @rodneykinney in https://github.com/allenai/dolma/pull/122
- Tagger modules import (fix for #128) by @soldni in https://github.com/allenai/dolma/pull/129
- Added Support for JQ syntax in include/exclude mixer config by @soldni in https://github.com/allenai/dolma/pull/131
- Added JQ syntax for replacements + added minimum score. by @soldni in https://github.com/allenai/dolma/pull/133
- Bump the cargo group group with 1 update by @dependabot in https://github.com/allenai/dolma/pull/132
- Improves tool to compute statistics; adds deduplication options. by @soldni in https://github.com/allenai/dolma/pull/135
- use precompiled regex when loading url blocklists by @peterbjorgensen in https://github.com/allenai/dolma/pull/137
Full Changelog: https://github.com/allenai/dolma/compare/v1.0.1...v1.0.2
Published by soldni about 2 years ago
dolma - v1.0.1
What's Changed
- Update README.md by @eltociear in https://github.com/allenai/dolma/pull/115
- do not overwrite tagger outputs with the same output path, fixes #113 by @peterbjorgensen in https://github.com/allenai/dolma/pull/114
- Fix broken data sheet link in README by @simonw in https://github.com/allenai/dolma/pull/107
- Modify CI to build when version is incremented; increment to v1.0.1 by @soldni in https://github.com/allenai/dolma/pull/116
New Contributors
- @eltociear made their first contribution in https://github.com/allenai/dolma/pull/115
- @simonw made their first contribution in https://github.com/allenai/dolma/pull/107
Full Changelog: https://github.com/allenai/dolma/compare/v1.0.0...v1.0.1
Published by soldni over 2 years ago
dolma - v1.0.0
What's Changed
- Add robust median to gopher filter by @KennethEnevoldsen in https://github.com/allenai/dolma/pull/98
- Disambiguating that the repo is for the dolma toolkit in various docs by @arnavic in https://github.com/allenai/dolma/pull/104
- V1.0 candidate; new deduper options, new taggers by @soldni in https://github.com/allenai/dolma/pull/100
- Fixing Errors in Linux Build by @soldni in https://github.com/allenai/dolma/pull/105
New Contributors
- @KennethEnevoldsen made their first contribution in https://github.com/allenai/dolma/pull/98
- @arnavic made their first contribution in https://github.com/allenai/dolma/pull/104
Full Changelog: https://github.com/allenai/dolma/compare/v0.9.4...v1.0.0
Published by soldni over 2 years ago
dolma - v0.9.4
What's Changed
- Bump h2 from 0.3.20 to 0.3.24 by @dependabot in https://github.com/allenai/dolma/pull/101
- BOS/EOS/PAD options in tokens cli; speed up tokenization by segmenting paragraphs. by @soldni in https://github.com/allenai/dolma/pull/102
- Fixed Dangling CLI Options; E2E Tokenizer Tests by @soldni in https://github.com/allenai/dolma/pull/103
Full Changelog: https://github.com/allenai/dolma/compare/v0.9.2...v0.9.4
Published by soldni over 2 years ago
dolma - v0.9.2
What's Changed
- Remove unnecessary spawn in tokenizer, fix config with multiple paths by @soldni in https://github.com/allenai/dolma/pull/67
- Add tagger_modules option to tagger cli by @peterbjorgensen in https://github.com/allenai/dolma/pull/69
- feature to get the complement of a hash sample by @IanMagnusson in https://github.com/allenai/dolma/pull/72
- Fix Hardcoded Tokenizer by @soldni in https://github.com/allenai/dolma/pull/71
- Fix a few issues of the FixedBucketsValTracker by @peterbjorgensen in https://github.com/allenai/dolma/pull/73
- Add attribute correlations by @Muennighoff in https://github.com/allenai/dolma/pull/68
- Porting missing code filtering rules to dolma repo by @soldni in https://github.com/allenai/dolma/pull/86
- Disable cache in CI to prevent build failures by @soldni in https://github.com/allenai/dolma/pull/90
- Reddit processing code by @drschwenk in https://github.com/allenai/dolma/pull/74
- update readme by @kyleclo in https://github.com/allenai/dolma/pull/95
- code/reasoning evaluation script by @benbogin in https://github.com/allenai/dolma/pull/94
- Add The Stack statistics by @Muennighoff in https://github.com/allenai/dolma/pull/92
- Fixing Build Config Issues by @soldni in https://github.com/allenai/dolma/pull/99
New Contributors
- @peterbjorgensen made their first contribution in https://github.com/allenai/dolma/pull/69
- @IanMagnusson made their first contribution in https://github.com/allenai/dolma/pull/72
- @drschwenk made their first contribution in https://github.com/allenai/dolma/pull/74
- @benbogin made their first contribution in https://github.com/allenai/dolma/pull/94
Full Changelog: https://github.com/allenai/dolma/compare/v0.9.1...v0.9.2
Published by soldni over 2 years ago
dolma - v0.9.1
What's Changed
- Fix Jekyll Docs Build by @soldni in https://github.com/allenai/dolma/pull/55
- Adding Citation text back to README by @soldni in https://github.com/allenai/dolma/pull/56
- Bump rustix from 0.37.20 to 0.37.25 by @dependabot in https://github.com/allenai/dolma/pull/59
- Documentation on BaseParallelProcessor by @soldni in https://github.com/allenai/dolma/pull/62
- Add download instruction by @Muennighoff in https://github.com/allenai/dolma/pull/63
- Fix spawn method for multiprocessing by @soldni in https://github.com/allenai/dolma/pull/64
- Fix hardcoded URL by @soldni in https://github.com/allenai/dolma/pull/65
- Fix Accidental Override of Boolean Value by @soldni in https://github.com/allenai/dolma/pull/66
New Contributors
- @Muennighoff made their first contribution in https://github.com/allenai/dolma/pull/63
Full Changelog: https://github.com/allenai/dolma/compare/v0.9.0...v0.9.1
Published by soldni over 2 years ago
dolma - v0.9.0
What's Changed
- Skipping AWS checks when aws access key is not available by @soldni in https://github.com/allenai/dolma/pull/28
- env variable is not passed to tests by @soldni in https://github.com/allenai/dolma/pull/29
- Fix make by @chris-ha458 in https://github.com/allenai/dolma/pull/24
- Fix make more by @chris-ha458 in https://github.com/allenai/dolma/pull/31
- ff by @soldni in https://github.com/allenai/dolma/pull/36
- Adding C4 example, dryrun mode, profiling taggers by @soldni in https://github.com/allenai/dolma/pull/37
- Only run Python style checks on source and tests by @soldni in https://github.com/allenai/dolma/pull/38
- fix rust parts by @chris-ha458 in https://github.com/allenai/dolma/pull/23
- Add rust unit tests by @chris-ha458 in https://github.com/allenai/dolma/pull/35
- Bump webpki from 0.22.0 to 0.22.2 by @dependabot in https://github.com/allenai/dolma/pull/52
- Adding Tokenizer, Writing Documentation, Misc Bugs & CLI improvements by @soldni in https://github.com/allenai/dolma/pull/54
New Contributors
- @chris-ha458 made their first contribution in https://github.com/allenai/dolma/pull/24
- @dependabot made their first contribution in https://github.com/allenai/dolma/pull/52
Full Changelog: https://github.com/allenai/dolma/compare/v0.8.0...v0.9.0
Published by soldni over 2 years ago
dolma - v0.8.0
What's Changed
- Analyzer to save and plot taggers distribution by @soldni in https://github.com/allenai/dolma/pull/21
- Scripts to compute statistics by @soldni in https://github.com/allenai/dolma/pull/22
Full Changelog: https://github.com/allenai/dolma/compare/v0.7.0...v0.8.0
Published by soldni almost 3 years ago
dolma - v0.6.5
What's Changed
- added validation of configs, tagger bugfixes by @soldni in https://github.com/allenai/dolma/pull/18
- upping version by @soldni in https://github.com/allenai/dolma/pull/19
Full Changelog: https://github.com/allenai/dolma/compare/v0.6.4...v0.6.5
Published by soldni almost 3 years ago
dolma - v0.6.4
What's Changed
- adding tests in CI by @soldni in https://github.com/allenai/dolma/pull/17
- Added tests for local/remote bindings for deduper/mixer by @soldni in https://github.com/allenai/dolma/pull/15
Full Changelog: https://github.com/allenai/dolma/compare/v0.6.3...v0.6.4
Published by soldni almost 3 years ago
dolma - v0.6.3
What's Changed
- Mixer can use s3 or local paths by @rodneykinney in https://github.com/allenai/dolma/pull/14
New Contributors
- @rodneykinney made their first contribution in https://github.com/allenai/dolma/pull/14
Full Changelog: https://github.com/allenai/dolma/compare/v0.6.2...v0.6.3
Published by soldni almost 3 years ago
dolma - v0.6.2
What's Changed
- Add Dirk as an author by @dirkgr in https://github.com/allenai/dolma/pull/2
- README, tests by @kyleclo in https://github.com/allenai/dolma/pull/1
- ff main by @soldni in https://github.com/allenai/dolma/pull/4
- Kylel/test taggers by @kyleclo in https://github.com/allenai/dolma/pull/5
- ff soldni/cli branch by @soldni in https://github.com/allenai/dolma/pull/7
- add tests for data types by @kyleclo in https://github.com/allenai/dolma/pull/8
- ff by @soldni in https://github.com/allenai/dolma/pull/9
- CLI for dolma by @soldni in https://github.com/allenai/dolma/pull/6
- Readme and instructions by @soldni in https://github.com/allenai/dolma/pull/11
- Update README.md by @ianand in https://github.com/allenai/dolma/pull/12
- Fixing Build Issues by @soldni in https://github.com/allenai/dolma/pull/13
New Contributors
- @dirkgr made their first contribution in https://github.com/allenai/dolma/pull/2
- @kyleclo made their first contribution in https://github.com/allenai/dolma/pull/1
- @soldni made their first contribution in https://github.com/allenai/dolma/pull/4
- @ianand made their first contribution in https://github.com/allenai/dolma/pull/12
Full Changelog: https://github.com/allenai/dolma/commits/v0.6.2
Published by soldni almost 3 years ago