Recent Releases of datatrove

datatrove - v0.5.0

What's Changed

  • fix by @kylematoba in https://github.com/huggingface/datatrove/pull/314
  • Changed FTFY defaults by @guipenedo in https://github.com/huggingface/datatrove/pull/319
  • Adding Megatron Tokenization pipeline by @TJ-Solergibert in https://github.com/huggingface/datatrove/pull/304
  • Add job_id_position Parameter to launch_slurm_job Method by @StephenRebel in https://github.com/huggingface/datatrove/pull/282
  • load_tokenizer can now load local hf folder by @ceferisbarov in https://github.com/huggingface/datatrove/pull/306
  • Add glob pattern for hash index by @jordane95 in https://github.com/huggingface/datatrove/pull/313
  • fix(utils): Enhance the dependencies check to include pip distribution by @aiqwe in https://github.com/huggingface/datatrove/pull/317
  • Update README.md by @saforem2 in https://github.com/huggingface/datatrove/pull/323
  • Fix issues with URL Deduplication when using the Index by @muzzynine in https://github.com/huggingface/datatrove/pull/327
  • Add customization for fetching SLURM job id by @BramVanroy in https://github.com/huggingface/datatrove/pull/320
  • Fixes stopwords implementation by @guipenedo in https://github.com/huggingface/datatrove/pull/329
  • Allow custom parquet schema by @BramVanroy in https://github.com/huggingface/datatrove/pull/330
  • [draft] Add chunking option to DocumentTokenizer by @craffel in https://github.com/huggingface/datatrove/pull/342
  • Revert "[draft] Add chunking option to DocumentTokenizer" by @guipenedo in https://github.com/huggingface/datatrove/pull/343
  • fix: root condition for SENTINEL by @jordane95 in https://github.com/huggingface/datatrove/pull/349
  • correct metadata parsing for finemath by @VivienCabannes in https://github.com/huggingface/datatrove/pull/355
  • add oom score + shorter polling by @hynky1999 in https://github.com/huggingface/datatrove/pull/361
  • Resolve issue 308 by @habanoz in https://github.com/huggingface/datatrove/pull/309
  • [draft] Add chunking option to DocumentTokenizer by @craffel in https://github.com/huggingface/datatrove/pull/344
  • Add RayPipelineExecutor by @nelson-liu in https://github.com/huggingface/datatrove/pull/331
  • Bump ring from 0.17.8 to 0.17.14 in /src/datatrove/tools/fast_mh3 by @dependabot in https://github.com/huggingface/datatrove/pull/363
  • Bump tokio from 1.41.1 to 1.43.1 in /src/datatrove/tools/fast_mh3 by @dependabot in https://github.com/huggingface/datatrove/pull/362
  • Fix signatures priority queue initialization in MinhashBuildIndex by @nelson-liu in https://github.com/huggingface/datatrove/pull/334
  • Shuffle by chunks support in DocumentTokenizerMerger by @guipenedo in https://github.com/huggingface/datatrove/pull/364
  • return positions based on .index if return_positions=True in the data… by @guipenedo in https://github.com/huggingface/datatrove/pull/356

New Contributors

  • @kylematoba made their first contribution in https://github.com/huggingface/datatrove/pull/314
  • @StephenRebel made their first contribution in https://github.com/huggingface/datatrove/pull/282
  • @ceferisbarov made their first contribution in https://github.com/huggingface/datatrove/pull/306
  • @saforem2 made their first contribution in https://github.com/huggingface/datatrove/pull/323
  • @muzzynine made their first contribution in https://github.com/huggingface/datatrove/pull/327
  • @craffel made their first contribution in https://github.com/huggingface/datatrove/pull/342
  • @VivienCabannes made their first contribution in https://github.com/huggingface/datatrove/pull/355
  • @habanoz made their first contribution in https://github.com/huggingface/datatrove/pull/309
  • @nelson-liu made their first contribution in https://github.com/huggingface/datatrove/pull/331
  • @dependabot made their first contribution in https://github.com/huggingface/datatrove/pull/363

Full Changelog: https://github.com/huggingface/datatrove/compare/v0.4.0...v0.5.0

- Python
Published by guipenedo 11 months ago

datatrove - v0.4.0

What's Changed

  • Readme nits by @hynky1999 in https://github.com/huggingface/datatrove/pull/280
  • Fixed a bug in the reader pipeline where the document count was always lower than the actual number of documents by the number of files by @lyuwen in https://github.com/huggingface/datatrove/pull/286
  • Fix languages listify bug by @BramVanroy in https://github.com/huggingface/datatrove/pull/294
  • [Fixbug] Ensure only one task will be launched for each srun cmd by @silverriver in https://github.com/huggingface/datatrove/pull/296
  • [fixbug]: Fixed the issue in MinhashBuildIndex where get_datafolder w… by @Youggls in https://github.com/huggingface/datatrove/pull/307
  • FineWeb-2: multilingual, numpy 2.0, minhash improvements by @guipenedo and @hynky1999 in https://github.com/huggingface/datatrove/pull/285:
    • upgrades to support numpy 2.0
    • added additional word tokenizers and revamped word tokenizer assignment mechanism
    • MinHash optimizations + new rust tool to speed up step3
    • MinHash cluster sizes feature
    • fixed memory leaks from some word tokenizers
    • updated url blocklists
    • added caching to some word tokenization calls
    • glotlid support
    • general bugfixes
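
For readers unfamiliar with the MinHash technique mentioned in the list above, the following is a minimal, standard-library-only sketch of the general idea: hash each document's word n-grams under several seeds, keep the minimum per seed, and compare signatures to estimate Jaccard similarity. It does not reproduce datatrove's MinHash pipeline or the new rust tool; the function names (shingles, minhash_signature, estimated_jaccard), the 3-word shingle size, and the 64 hash functions are assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text: str, n: int = 3) -> set:
    """Return the set of word n-grams (as strings) of a text."""
    words = re.findall(r"\w+", text.lower())
    # max(..., 1) guarantees at least one shingle for very short texts.
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def minhash_signature(items: set, num_hashes: int = 64) -> list:
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for item in items
            )
        )
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical example documents.
doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "stock markets closed higher today after a volatile trading session"

sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))

same = estimated_jaccard(sig_a, sig_a)       # identical signatures match everywhere
different = estimated_jaccard(sig_a, sig_b)  # unrelated texts share almost nothing
```

Deduplication pipelines usually add a bucketing step on top of such signatures so that only likely duplicates are compared; that step is omitted here.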

New Contributors

  • @lyuwen made their first contribution in https://github.com/huggingface/datatrove/pull/286
  • @BramVanroy made their first contribution in https://github.com/huggingface/datatrove/pull/294
  • @silverriver made their first contribution in https://github.com/huggingface/datatrove/pull/296
  • @Youggls made their first contribution in https://github.com/huggingface/datatrove/pull/307

Full Changelog: https://github.com/huggingface/datatrove/compare/v0.3.0...v0.4.0

Published by guipenedo over 1 year ago

datatrove - v0.3.0

What's Changed

  • Added c4 badwords filter, added batch tokenization to tokenscounter by @guipenedo in https://github.com/huggingface/datatrove/pull/160
  • Add a skip parameter to all readers (defaults to zero) by @rantav in https://github.com/huggingface/datatrove/pull/167
  • Adds n-gram based decontamination by @guipenedo in https://github.com/huggingface/datatrove/pull/172
  • Fix: Handle Non-dict Objects in to_dict Without Errors by @justHungryMan in https://github.com/huggingface/datatrove/pull/139
  • Adds tasks_per_job to slurm executor by @guipenedo in https://github.com/huggingface/datatrove/pull/153
  • Unsigned int tokenizer and srun args by @marianna13 in https://github.com/huggingface/datatrove/pull/154
  • Enhance BaseReader to allow custom adapters access to instance variables by @justHungryMan in https://github.com/huggingface/datatrove/pull/169
  • Remove ListFilter from the process_common_crawl_dump example by @QasidSaleem in https://github.com/huggingface/datatrove/pull/181
  • Hf dataset update by @hynky1999 in https://github.com/huggingface/datatrove/pull/170
  • Optimize URLFilter and add option to disable integrated wordlists by @its5Q in https://github.com/huggingface/datatrove/pull/174
  • Add progress for files by @hynky1999 in https://github.com/huggingface/datatrove/pull/176
  • Make colorization configurable for both files and console output by @guipenedo in https://github.com/huggingface/datatrove/pull/185
  • Migrate dedup to xxhash by @guipenedo in https://github.com/huggingface/datatrove/pull/179
  • [WIP] Multi-Lingual Tokenization by @beme248 in https://github.com/huggingface/datatrove/pull/147
  • Add more word tokenizers by @vsabolcec in https://github.com/huggingface/datatrove/pull/187
  • Speed up CI with uv by @guipenedo in https://github.com/huggingface/datatrove/pull/188
  • Url Index + missing hash_config struct inference by @hynky1999 in https://github.com/huggingface/datatrove/pull/191
  • Migrate pipeline blocks to new word tokenizers by @guipenedo in https://github.com/huggingface/datatrove/pull/189
  • Fix snapshot representation and numeric conversion in example Code (fineweb) by @justHungryMan in https://github.com/huggingface/datatrove/pull/192
  • Extend randomize_start feature to local executor by @justHungryMan in https://github.com/huggingface/datatrove/pull/193
  • Add description for randomize_start by @justHungryMan in https://github.com/huggingface/datatrove/pull/194
  • Allow an integer parameter for 'randomize_start' in executor/base.py by @justHungryMan in https://github.com/huggingface/datatrove/pull/199
  • Issues w/ DatatroveFolderDataset by @TJ-Solergibert in https://github.com/huggingface/datatrove/pull/203
  • Code consistency about randomize_start_duration by @justHungryMan in https://github.com/huggingface/datatrove/pull/207
  • feat(ci): add trufflehog secrets detection by @McPatate in https://github.com/huggingface/datatrove/pull/211
  • fix(ci): remove unnecessary permissions by @McPatate in https://github.com/huggingface/datatrove/pull/212
  • Add label_only option to LanguageFilter by @justHungryMan in https://github.com/huggingface/datatrove/pull/210
  • Fixes text normalization by @hynky1999 in https://github.com/huggingface/datatrove/pull/218
  • Summary stats by @hynky1999 in https://github.com/huggingface/datatrove/pull/158
  • Speedup json writer by @its5Q in https://github.com/huggingface/datatrove/pull/175
  • add alternative fasttext lid models by @guipenedo in https://github.com/huggingface/datatrove/pull/226
  • Adds paths_file to readers by @guipenedo in https://github.com/huggingface/datatrove/pull/228
  • Add an example for filtering an HF dataset and push to hub by @loubnabnl in https://github.com/huggingface/datatrove/pull/201
  • checks if min_num_sentences is disabled or not before computing the n… by @QasidSaleem in https://github.com/huggingface/datatrove/pull/232
  • DocumentTokenizerContextShuffler fixes by @sippycoder in https://github.com/huggingface/datatrove/pull/229
  • add dependencies lid.py, io.py #239 by @aiqwe in https://github.com/huggingface/datatrove/pull/241
  • Add with_dirs to extra_options only when not using glob_pattern by @olga1988olga in https://github.com/huggingface/datatrove/pull/244
  • Add token and char count to histogram stats by @guipenedo in https://github.com/huggingface/datatrove/pull/251
  • fix correct type inference for cached filesystems by @hynky1999 in https://github.com/huggingface/datatrove/pull/257
  • Simple enhancement for readability by @aiqwe in https://github.com/huggingface/datatrove/pull/253
  • Fix test_basic_article_trafilatura test failure by @tylerjthomas9 in https://github.com/huggingface/datatrove/pull/264
  • Update MinhashConfig with detailed settings and add default language … by @justHungryMan in https://github.com/huggingface/datatrove/pull/252
  • Update README.md by @shizhediao in https://github.com/huggingface/datatrove/pull/276
  • Implement zstd Compression Support for JSONL and Parquet Files by @justHungryMan in https://github.com/huggingface/datatrove/pull/230
  • Update filter_hf_dataset.py by @shizhediao in https://github.com/huggingface/datatrove/pull/274
  • Add expand_metadata Option to JsonlWriter by @justHungryMan in https://github.com/huggingface/datatrove/pull/268
  • Add shuffle option on huggingface reader by @justHungryMan in https://github.com/huggingface/datatrove/pull/224
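
As an illustration of the n-gram based decontamination entry in the list above, here is a minimal sketch of the general technique: collect every word n-gram from a set of benchmark texts, then drop any training document that shares at least one of them. This is not datatrove's implementation; the helper names (ngrams, build_index, is_contaminated), the 5-word window, and the sample texts are all assumptions made for the example.

```python
import re


def ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams (as tuples) in a text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(benchmark_texts: list, n: int) -> set:
    """Union of all n-grams found in the benchmark texts."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(document: str, index: set, n: int) -> bool:
    """A document is flagged if it shares any n-gram with the benchmark index."""
    return not ngrams(document, n).isdisjoint(index)


N = 5  # hypothetical n-gram size

# Hypothetical benchmark sample and training documents.
benchmark = ["What is the capital of France? Paris"]
docs = [
    "Quiz answer: what is the capital of France? Paris, obviously.",
    "The Eiffel Tower is a popular landmark in Paris.",
]

index = build_index(benchmark, N)
clean = [doc for doc in docs if not is_contaminated(doc, index, N)]
```

Smaller values of N flag more documents, including ones that merely share common phrases with the benchmark, so the window size is a trade-off between recall and false positives.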

New Contributors

  • @rantav made their first contribution in https://github.com/huggingface/datatrove/pull/167
  • @QasidSaleem made their first contribution in https://github.com/huggingface/datatrove/pull/181
  • @its5Q made their first contribution in https://github.com/huggingface/datatrove/pull/174
  • @beme248 made their first contribution in https://github.com/huggingface/datatrove/pull/147
  • @vsabolcec made their first contribution in https://github.com/huggingface/datatrove/pull/187
  • @TJ-Solergibert made their first contribution in https://github.com/huggingface/datatrove/pull/203
  • @McPatate made their first contribution in https://github.com/huggingface/datatrove/pull/211
  • @loubnabnl made their first contribution in https://github.com/huggingface/datatrove/pull/201
  • @sippycoder made their first contribution in https://github.com/huggingface/datatrove/pull/229
  • @aiqwe made their first contribution in https://github.com/huggingface/datatrove/pull/241
  • @olga1988olga made their first contribution in https://github.com/huggingface/datatrove/pull/244
  • @tylerjthomas9 made their first contribution in https://github.com/huggingface/datatrove/pull/264
  • @shizhediao made their first contribution in https://github.com/huggingface/datatrove/pull/276

Full Changelog: https://github.com/huggingface/datatrove/compare/v0.2.0...v0.3.0

Published by guipenedo over 1 year ago

datatrove - v0.2.0

What's Changed

  • Adds multi node parallelism to local executor by @guipenedo in https://github.com/huggingface/datatrove/pull/85
  • Changed fsx default filepath for logging output to user's home by @Anacheron51 in https://github.com/huggingface/datatrove/pull/86
  • [Docs] Fix typos by @standardAI in https://github.com/huggingface/datatrove/pull/91
  • bugfix stats file not being saved to s3 by @guipenedo in https://github.com/huggingface/datatrove/pull/92
  • Fix url stats by @thomwolf in https://github.com/huggingface/datatrove/pull/89
  • Efficiency: np.fromiter instead of np.array by @giorgioangel in https://github.com/huggingface/datatrove/pull/88
  • Adds language option for nltk by @guipenedo in https://github.com/huggingface/datatrove/pull/94
  • Fix compression type by @jordane95 in https://github.com/huggingface/datatrove/pull/95
  • Decoupled reading logic from DedupReader by @guipenedo in https://github.com/huggingface/datatrove/pull/98
  • Support for arbitrary fasttext models by @guipenedo in https://github.com/huggingface/datatrove/pull/99
  • Adds citation by @guipenedo in https://github.com/huggingface/datatrove/pull/101
  • Adds parquet writer by @guipenedo in https://github.com/huggingface/datatrove/pull/103
  • Utilities to efficiently parallelize the upload of dataset files to the HuggingFace hub by @guipenedo in https://github.com/huggingface/datatrove/pull/105
  • Adding doc strings + adding a faster tokenized doc merger by @thomwolf in https://github.com/huggingface/datatrove/pull/90
  • Add email on slurm and extend fasttext filter functionalities by @thomwolf in https://github.com/huggingface/datatrove/pull/111
  • Add jobs_status command. by @lvwerra in https://github.com/huggingface/datatrove/pull/113
  • Re-enable datasets test by @mariosasko in https://github.com/huggingface/datatrove/pull/114
  • Update warc.py by @jordane95 in https://github.com/huggingface/datatrove/pull/115
  • Bug fix: when file is empty by @jordane95 in https://github.com/huggingface/datatrove/pull/126
  • Load tokenizer using from_file by @guipenedo in https://github.com/huggingface/datatrove/pull/122
  • Adds depends= to LocalPipelineExecutor by @guipenedo in https://github.com/huggingface/datatrove/pull/100
  • Improve C4 filter and dedup by @guipenedo in https://github.com/huggingface/datatrove/pull/124
  • Adds option to shuffle input files in readers by @guipenedo in https://github.com/huggingface/datatrove/pull/128
  • update Trafilatura version by @adbar in https://github.com/huggingface/datatrove/pull/130
  • Changes to text normalization + FTFY and lines symbol formatters by @guipenedo in https://github.com/huggingface/datatrove/pull/133
  • Minor Terminology and Documentation Updates for Local Tokenizer Loading by @justHungryMan in https://github.com/huggingface/datatrove/pull/134
  • add requeue and QOS slurm options by @marianna13 in https://github.com/huggingface/datatrove/pull/144
  • Fix substring dedup range by @jordane95 in https://github.com/huggingface/datatrove/pull/132
  • Line dedup min remove words option by @guipenedo in https://github.com/huggingface/datatrove/pull/146
  • New options for FastTextClassifierFilter: apply on sentence or paragraph (line) level by @guipenedo in https://github.com/huggingface/datatrove/pull/151
  • Url deduplication by @hynky1999 in https://github.com/huggingface/datatrove/pull/145
  • Fix race conditions during download/extraction by @hynky1999 in https://github.com/huggingface/datatrove/pull/155
  • Adds PII removal by @guipenedo in https://github.com/huggingface/datatrove/pull/156
  • Pypi Publish Action by @hynky1999 in https://github.com/huggingface/datatrove/pull/159

New Contributors

  • @Anacheron51 made their first contribution in https://github.com/huggingface/datatrove/pull/86
  • @standardAI made their first contribution in https://github.com/huggingface/datatrove/pull/91
  • @giorgioangel made their first contribution in https://github.com/huggingface/datatrove/pull/88
  • @lvwerra made their first contribution in https://github.com/huggingface/datatrove/pull/113
  • @adbar made their first contribution in https://github.com/huggingface/datatrove/pull/130
  • @justHungryMan made their first contribution in https://github.com/huggingface/datatrove/pull/134
  • @marianna13 made their first contribution in https://github.com/huggingface/datatrove/pull/144
  • @hynky1999 made their first contribution in https://github.com/huggingface/datatrove/pull/145

Full Changelog: https://github.com/huggingface/datatrove/compare/v0.0.1...v0.2.0

Published by guipenedo almost 2 years ago

datatrove - v0.0.1

First release

Published by guipenedo about 2 years ago