Recent Releases of datatrove
datatrove - v0.5.0
What's Changed
- fix by @kylematoba in https://github.com/huggingface/datatrove/pull/314
- Changed FTFY defaults by @guipenedo in https://github.com/huggingface/datatrove/pull/319
- Adding Megatron Tokenization pipeline by @TJ-Solergibert in https://github.com/huggingface/datatrove/pull/304
- Add job_id_position Parameter to launch_slurm_job Method by @StephenRebel in https://github.com/huggingface/datatrove/pull/282
- load_tokenizer can now load local hf folder by @ceferisbarov in https://github.com/huggingface/datatrove/pull/306
- Add glob pattern for hash index by @jordane95 in https://github.com/huggingface/datatrove/pull/313
- fix(utils): Enhance the dependencies check to include pip distribution by @aiqwe in https://github.com/huggingface/datatrove/pull/317
- Update README.md by @saforem2 in https://github.com/huggingface/datatrove/pull/323
- Fix issues with URL Deduplication when using the Index by @muzzynine in https://github.com/huggingface/datatrove/pull/327
- Add customization for fetching SLURM job id by @BramVanroy in https://github.com/huggingface/datatrove/pull/320
- fixes stopwords implementation by @guipenedo in https://github.com/huggingface/datatrove/pull/329
- Allow custom parquet schema by @BramVanroy in https://github.com/huggingface/datatrove/pull/330
- [draft] Add chunking option to DocumentTokenizer by @craffel in https://github.com/huggingface/datatrove/pull/342
- Revert "[draft] Add chunking option to DocumentTokenizer" by @guipenedo in https://github.com/huggingface/datatrove/pull/343
- fix: root condition for SENTINEL by @jordane95 in https://github.com/huggingface/datatrove/pull/349
- correct metadata parsing for finemath by @VivienCabannes in https://github.com/huggingface/datatrove/pull/355
- add oom score + shorter polling by @hynky1999 in https://github.com/huggingface/datatrove/pull/361
- Resolve issue 308 by @habanoz in https://github.com/huggingface/datatrove/pull/309
- [draft] Add chunking option to DocumentTokenizer by @craffel in https://github.com/huggingface/datatrove/pull/344
- Add RayPipelineExecutor by @nelson-liu in https://github.com/huggingface/datatrove/pull/331
- Bump ring from 0.17.8 to 0.17.14 in /src/datatrove/tools/fast_mh3 by @dependabot in https://github.com/huggingface/datatrove/pull/363
- Bump tokio from 1.41.1 to 1.43.1 in /src/datatrove/tools/fast_mh3 by @dependabot in https://github.com/huggingface/datatrove/pull/362
- Fix signatures priority queue initialization in MinhashBuildIndex by @nelson-liu in https://github.com/huggingface/datatrove/pull/334
- Shuffle by chunks support in DocumentTokenizerMerger by @guipenedo in https://github.com/huggingface/datatrove/pull/364
- return positions based on .index if return_positions=True in the data… by @guipenedo in https://github.com/huggingface/datatrove/pull/356
New Contributors
- @kylematoba made their first contribution in https://github.com/huggingface/datatrove/pull/314
- @StephenRebel made their first contribution in https://github.com/huggingface/datatrove/pull/282
- @ceferisbarov made their first contribution in https://github.com/huggingface/datatrove/pull/306
- @saforem2 made their first contribution in https://github.com/huggingface/datatrove/pull/323
- @muzzynine made their first contribution in https://github.com/huggingface/datatrove/pull/327
- @craffel made their first contribution in https://github.com/huggingface/datatrove/pull/342
- @VivienCabannes made their first contribution in https://github.com/huggingface/datatrove/pull/355
- @habanoz made their first contribution in https://github.com/huggingface/datatrove/pull/309
- @nelson-liu made their first contribution in https://github.com/huggingface/datatrove/pull/331
- @dependabot made their first contribution in https://github.com/huggingface/datatrove/pull/363
Full Changelog: https://github.com/huggingface/datatrove/compare/v0.4.0...v0.5.0
Published by guipenedo 11 months ago
datatrove - v0.4.0
What's Changed
- Readme nits by @hynky1999 in https://github.com/huggingface/datatrove/pull/280
- Fixed a bug in the reader pipeline where the document count was always lower than the actual number of documents by the number of files. by @lyuwen in https://github.com/huggingface/datatrove/pull/286
- Fix languages listify bug by @BramVanroy in https://github.com/huggingface/datatrove/pull/294
- [Fixbug] Ensure only one task will be launched for each srun cmd by @silverriver in https://github.com/huggingface/datatrove/pull/296
- [fixbug]: Fixed the issue in MinhashBuildIndex where get_datafolder w… by @Youggls in https://github.com/huggingface/datatrove/pull/307
- FineWeb-2: multilingual, numpy 2.0, minhash improvements by @guipenedo and @hynky1999 in https://github.com/huggingface/datatrove/pull/285:
  - upgrades to support numpy 2.0
  - added additional word tokenizers and revamped word tokenizer assignment mechanism
  - MinHash optimizations + new rust tool to speed up step3
  - MinHash cluster sizes feature
  - fixed memory leaks from some word tokenizers
  - updated url blocklists
  - added caching to some word tokenization calls
  - glotlid support
  - general bugfixes
New Contributors
- @lyuwen made their first contribution in https://github.com/huggingface/datatrove/pull/286
- @BramVanroy made their first contribution in https://github.com/huggingface/datatrove/pull/294
- @silverriver made their first contribution in https://github.com/huggingface/datatrove/pull/296
- @Youggls made their first contribution in https://github.com/huggingface/datatrove/pull/307
Full Changelog: https://github.com/huggingface/datatrove/compare/v0.3.0...v0.4.0
Published by guipenedo over 1 year ago
datatrove - v0.3.0
What's Changed
- Added c4 badwords filter, added batch tokenization to tokenscounter by @guipenedo in https://github.com/huggingface/datatrove/pull/160
- Add a skip parameter to all readers (defaults to zero) by @rantav in https://github.com/huggingface/datatrove/pull/167
- Adds n-gram based decontamination by @guipenedo in https://github.com/huggingface/datatrove/pull/172
- Fix: Handle Non-dict Objects in to_dict Without Errors by @justHungryMan in https://github.com/huggingface/datatrove/pull/139
- Adds tasks_per_job to slurm executor by @guipenedo in https://github.com/huggingface/datatrove/pull/153
- Unsigned int tokenizer and srun args by @marianna13 in https://github.com/huggingface/datatrove/pull/154
- Enhance BaseReader to allow custom adapters access to instance variables by @justHungryMan in https://github.com/huggingface/datatrove/pull/169
- remove ListFilter from the process_common_crawl_dump example by @QasidSaleem in https://github.com/huggingface/datatrove/pull/181
- Hf dataset update by @hynky1999 in https://github.com/huggingface/datatrove/pull/170
- Optimize URLFilter and add option to disable integrated wordlists by @its5Q in https://github.com/huggingface/datatrove/pull/174
- Add progress for files by @hynky1999 in https://github.com/huggingface/datatrove/pull/176
- Make colorization configurable for both files and console output by @guipenedo in https://github.com/huggingface/datatrove/pull/185
- Migrate dedup to xxhash by @guipenedo in https://github.com/huggingface/datatrove/pull/179
- [WIP] Multi-Lingual Tokenization by @beme248 in https://github.com/huggingface/datatrove/pull/147
- Add more word tokenizers by @vsabolcec in https://github.com/huggingface/datatrove/pull/187
- Speed up CI with uv by @guipenedo in https://github.com/huggingface/datatrove/pull/188
- Url Index + missing hash_config struct inference by @hynky1999 in https://github.com/huggingface/datatrove/pull/191
- Migrate pipeline blocks to new word tokenizers by @guipenedo in https://github.com/huggingface/datatrove/pull/189
- Fix snapshot representation and numeric conversion in example Code (fineweb) by @justHungryMan in https://github.com/huggingface/datatrove/pull/192
- Extend randomize_start feature to local executor by @justHungryMan in https://github.com/huggingface/datatrove/pull/193
- Add description for randomize_start by @justHungryMan in https://github.com/huggingface/datatrove/pull/194
- Allow an integer parameter for 'randomize_start' in executor/base.py by @justHungryMan in https://github.com/huggingface/datatrove/pull/199
- Issues w/ DatatroveFolderDataset by @TJ-Solergibert in https://github.com/huggingface/datatrove/pull/203
- code consistency about randomize_start_duration by @justHungryMan in https://github.com/huggingface/datatrove/pull/207
- feat(ci): add trufflehog secrets detection by @McPatate in https://github.com/huggingface/datatrove/pull/211
- fix(ci): remove unnecessary permissions by @McPatate in https://github.com/huggingface/datatrove/pull/212
- Add label_only option to LanguageFilter by @justHungryMan in https://github.com/huggingface/datatrove/pull/210
- Fixes text normalization by @hynky1999 in https://github.com/huggingface/datatrove/pull/218
- Summary stats by @hynky1999 in https://github.com/huggingface/datatrove/pull/158
- Speedup json writer by @its5Q in https://github.com/huggingface/datatrove/pull/175
- add alternative fasttext lid models by @guipenedo in https://github.com/huggingface/datatrove/pull/226
- Adds paths_file to readers by @guipenedo in https://github.com/huggingface/datatrove/pull/228
- Add an example for filtering an HF dataset and push to hub by @loubnabnl in https://github.com/huggingface/datatrove/pull/201
- checks if min_num_sentences is disabled or not before computing the n… by @QasidSaleem in https://github.com/huggingface/datatrove/pull/232
- DocumentTokenizerContextShuffler fixes by @sippycoder in https://github.com/huggingface/datatrove/pull/229
- add dependencies lid.py, io.py #239 by @aiqwe in https://github.com/huggingface/datatrove/pull/241
- Add withdirs to extra_options only when not using glob_pattern by @olga1988olga in https://github.com/huggingface/datatrove/pull/244
- Add token and char count to histogram stats by @guipenedo in https://github.com/huggingface/datatrove/pull/251
- fix correct type inference for cached filesystems by @hynky1999 in https://github.com/huggingface/datatrove/pull/257
- Simple enhancement for readability by @aiqwe in https://github.com/huggingface/datatrove/pull/253
- Fix test_basic_article_trafilatura test failure by @tylerjthomas9 in https://github.com/huggingface/datatrove/pull/264
- Update MinhashConfig with detailed settings and add default language … by @justHungryMan in https://github.com/huggingface/datatrove/pull/252
- Update README.md by @shizhediao in https://github.com/huggingface/datatrove/pull/276
- Implement zstd Compression Support for JSONL and Parquet Files by @justHungryMan in https://github.com/huggingface/datatrove/pull/230
- Update filter_hf_dataset.py by @shizhediao in https://github.com/huggingface/datatrove/pull/274
- Add expand_metadata Option to JsonlWriter by @justHungryMan in https://github.com/huggingface/datatrove/pull/268
- Add shuffle option on huggingface reader by @justHungryMan in https://github.com/huggingface/datatrove/pull/224
New Contributors
- @rantav made their first contribution in https://github.com/huggingface/datatrove/pull/167
- @QasidSaleem made their first contribution in https://github.com/huggingface/datatrove/pull/181
- @its5Q made their first contribution in https://github.com/huggingface/datatrove/pull/174
- @beme248 made their first contribution in https://github.com/huggingface/datatrove/pull/147
- @vsabolcec made their first contribution in https://github.com/huggingface/datatrove/pull/187
- @TJ-Solergibert made their first contribution in https://github.com/huggingface/datatrove/pull/203
- @McPatate made their first contribution in https://github.com/huggingface/datatrove/pull/211
- @loubnabnl made their first contribution in https://github.com/huggingface/datatrove/pull/201
- @sippycoder made their first contribution in https://github.com/huggingface/datatrove/pull/229
- @aiqwe made their first contribution in https://github.com/huggingface/datatrove/pull/241
- @olga1988olga made their first contribution in https://github.com/huggingface/datatrove/pull/244
- @tylerjthomas9 made their first contribution in https://github.com/huggingface/datatrove/pull/264
- @shizhediao made their first contribution in https://github.com/huggingface/datatrove/pull/276
Full Changelog: https://github.com/huggingface/datatrove/compare/v0.2.0...v0.3.0
Published by guipenedo over 1 year ago
datatrove - v0.2.0
What's Changed
- Adds multi node parallelism to local executor by @guipenedo in https://github.com/huggingface/datatrove/pull/85
- Changed fsx default filepath for logging output to user's home by @Anacheron51 in https://github.com/huggingface/datatrove/pull/86
- [Docs] Fix typos by @standardAI in https://github.com/huggingface/datatrove/pull/91
- bugfix stats file not being saved to s3 by @guipenedo in https://github.com/huggingface/datatrove/pull/92
- Fix url stats by @thomwolf in https://github.com/huggingface/datatrove/pull/89
- Efficiency: np.fromiter instead of np.array by @giorgioangel in https://github.com/huggingface/datatrove/pull/88
- Adds language option for nltk by @guipenedo in https://github.com/huggingface/datatrove/pull/94
- Fix compression type by @jordane95 in https://github.com/huggingface/datatrove/pull/95
- Decoupled reading logic from DedupReader by @guipenedo in https://github.com/huggingface/datatrove/pull/98
- Support for arbitrary fasttext models by @guipenedo in https://github.com/huggingface/datatrove/pull/99
- Adds citation by @guipenedo in https://github.com/huggingface/datatrove/pull/101
- Adds parquet writer by @guipenedo in https://github.com/huggingface/datatrove/pull/103
- Utilities to efficiently parallelize the upload of dataset files to the HuggingFace hub by @guipenedo in https://github.com/huggingface/datatrove/pull/105
- Adding doc strings + adding a faster tokenized doc merger by @thomwolf in https://github.com/huggingface/datatrove/pull/90
- Add email on slurm and extend fasttext filter functionalities by @thomwolf in https://github.com/huggingface/datatrove/pull/111
- Add jobs_status command. by @lvwerra in https://github.com/huggingface/datatrove/pull/113
- Re-enable datasets test by @mariosasko in https://github.com/huggingface/datatrove/pull/114
- Update warc.py by @jordane95 in https://github.com/huggingface/datatrove/pull/115
- Bug fix: when file is empty by @jordane95 in https://github.com/huggingface/datatrove/pull/126
- Load tokenizer using from_file by @guipenedo in https://github.com/huggingface/datatrove/pull/122
- Adds depends= to LocalPipelineExecutor by @guipenedo in https://github.com/huggingface/datatrove/pull/100
- Improve C4 filter and dedup by @guipenedo in https://github.com/huggingface/datatrove/pull/124
- Adds option to shuffle input files in readers by @guipenedo in https://github.com/huggingface/datatrove/pull/128
- update Trafilatura version by @adbar in https://github.com/huggingface/datatrove/pull/130
- Changes to text normalization + FTFY and lines symbol formatters by @guipenedo in https://github.com/huggingface/datatrove/pull/133
- Minor Terminology and Documentation Updates for Local Tokenizer Loading by @justHungryMan in https://github.com/huggingface/datatrove/pull/134
- add requeue and QOS slurm options by @marianna13 in https://github.com/huggingface/datatrove/pull/144
- Fix substring dedup range by @jordane95 in https://github.com/huggingface/datatrove/pull/132
- Line dedup min remove words option by @guipenedo in https://github.com/huggingface/datatrove/pull/146
- New options for FastTextClassifierFilter: apply on sentence or paragraph (line) level by @guipenedo in https://github.com/huggingface/datatrove/pull/151
- Url deduplication by @hynky1999 in https://github.com/huggingface/datatrove/pull/145
- Fix race conditions during download/extraction by @hynky1999 in https://github.com/huggingface/datatrove/pull/155
- Adds PII removal by @guipenedo in https://github.com/huggingface/datatrove/pull/156
- Pypi Publish Action by @hynky1999 in https://github.com/huggingface/datatrove/pull/159
New Contributors
- @Anacheron51 made their first contribution in https://github.com/huggingface/datatrove/pull/86
- @standardAI made their first contribution in https://github.com/huggingface/datatrove/pull/91
- @giorgioangel made their first contribution in https://github.com/huggingface/datatrove/pull/88
- @lvwerra made their first contribution in https://github.com/huggingface/datatrove/pull/113
- @adbar made their first contribution in https://github.com/huggingface/datatrove/pull/130
- @justHungryMan made their first contribution in https://github.com/huggingface/datatrove/pull/134
- @marianna13 made their first contribution in https://github.com/huggingface/datatrove/pull/144
- @hynky1999 made their first contribution in https://github.com/huggingface/datatrove/pull/145
Full Changelog: https://github.com/huggingface/datatrove/compare/v0.0.1...v0.2.0
Published by guipenedo almost 2 years ago