Recent Releases of curator
curator - NVIDIA NeMo Curator 0.9.0
Major Features and Enhancements
- New How-to Data Recipes (Tutorials)
- Multimodal DAPT Curation w/ PDF Extraction
- Llama Nemotron Data Curation
- LLM NIM - PII Redaction
- Performance and Code Optimizations
- Simplified Clustering Logic: Significantly improved semantic deduplication clustering performance
- Removed convoluted backend switching logic that caused performance issues
- Eliminated expensive length assertions that could cause timeouts on large datasets
- Improved GPU utilization during KMeans clustering operations
- Tested on 37M embedding dataset (80GB) across 7 GPUs with substantial performance gains
Bug Fixes
- FastText Download URL Fix
- Corrected the
fasttextmodel download URL in nemotron-cc tutorial - Changed from
dl.fbaipublicfiles.com/fastText/todl.fbaipublicfiles.com/fasttext/ - Ensures reliable model downloads for language identification
- Corrected the
- NeMo Retriever Tutorial Bug Fix
- Fixed lambda function bug in
RetrieverEvalSetGenerator - Corrected score assignment from
df["question"].apply(lambda: 1)todf["score"] = 1
- Fixed lambda function bug in
- API Usage Updates
- Updated examples and tutorials to use correct
DocumentDatasetAPI - Replaced deprecated
write_to_disk(result, output_dir, output_type="parquet")withresult.to_parquet(output_dir) - Updated exact deduplication workflows:
deduplicator.remove()now returnsDocumentDatasetdirectly
- Updated examples and tutorials to use correct
- Python
Published by chtruong814 10 months ago
curator - NVIDIA NeMo Curator 0.8.0
- Llama Based PII Redaction
- Trafilatura Text Extractor
- Chinese & Japanese Stopwords for Text Extractors
- Writing gzip compressed jsonl datasets
- Training dataset curation for retriever customization using hard-negative mining
- Implemented a memory efficient pairwise similarity in Semantic Deduplication
- Python
Published by ryantwolf about 1 year ago
curator - NVIDIA NeMo Curator 0.8.0rc3.dev0
Prerelease: NVIDIA NeMo Curator 0.8.0rc3.dev0 (2025-04-15)
- Python
Published by ryantwolf about 1 year ago
curator - NVIDIA NeMo Curator 0.8.0rc2.dev0
Prerelease: NVIDIA NeMo Curator 0.8.0rc2.dev0 (2025-04-07)
- Python
Published by ryantwolf about 1 year ago
curator - NVIDIA NeMo Curator 0.7.1
- Fix Transformers + Cuda Context bug
- Fix rate limit in SDG Retriever Eval Tutorial
- Python
Published by ryantwolf about 1 year ago
curator - NVIDIA NeMo Curator 0.7.0
- Python 3.12 Support
- Curator on Blackwell
- Nemotron-CC Dataset Recipe
- Performant S3 for Fuzzy Deduplication
- Python
Published by ryantwolf about 1 year ago
curator - NVIDIA NeMo Curator 0.7.0rc2.dev0
Prerelease: NVIDIA NeMo Curator 0.7.0rc2.dev0 (2025-02-25)
- Python
Published by ryantwolf over 1 year ago
curator - NVIDIA NeMo Curator 0.7.0rc1.dev1
Prerelease: NVIDIA NeMo Curator 0.7.0rc1.dev1 (2025-02-19)
- Python
Published by ryantwolf over 1 year ago
curator - NVIDIA NeMo Curator 0.7.0rc0.dev1
Prerelease: NVIDIA NeMo Curator 0.7.0rc0.dev1 (2025-02-04)
- Python
Published by ko3n1g over 1 year ago
curator - NVIDIA NeMo Curator 0.6.0
What's changed
- Synthetic Data Generation for Text Retrieval
- LLM-based Filters
- Easiness
- Answerability
- Q&A Retrieval Generation Pipeline
- Parallel Dataset Curation for Machine Translation
- Load/Write Bitext Files
- Heuristic filtering (Histogram, Length Ratio)
- Classifier filtering (Comet, Cometoid)
- Python
Published by ryantwolf over 1 year ago
curator - NVIDIA NeMo Curator 0.6.0rc2.dev1
Prerelease: NVIDIA NeMo Curator 0.6.0rc2.dev1 (2025-01-03)
- Python
Published by ryantwolf over 1 year ago
curator - NVIDIA NeMo Curator 0.6.0rc1.dev1
Prerelease: NVIDIA NeMo Curator 0.6.0rc1.dev1 (2024-12-20)
- Python
Published by ko3n1g over 1 year ago
curator - v0.5.0
Highlights
- Image Curation
- Image Embedding Creation
- Aesthetic Classifier
- NSFW Classifier
- Semantic Deduplication
- Text Curation
- Quality Classifier
- Aegis Classifier
- FineWeb-Edu Classifier
Full Changelog: https://github.com/NVIDIA/NeMo-Curator/commits/v0.5.0
- Python
Published by ryantwolf over 1 year ago
curator - v0.4.0
Highlights
- Semantic Deduplication
- Resiliparse for Text Extraction
- Improve Distributed Data Classification - Domain classifier is 1.55x faster through intelligent batching
- Synthetic data generation for fine-tuning
What's Changed
- Update README by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/6
- [Tutorials] Add a readme file for the TinyStories tutorial by @Maghoumi in https://github.com/NVIDIA/NeMo-Curator/pull/5
- Add workflow for running cpu pytests by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/13
- Add pre-commit style checks by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/14
- Add citation by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/15
- Fix Noisy CUDA Shutdown by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/20
- Bump Python and RAPIDS versions by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/16
- Add batched decorator by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/18
- Add issue templates by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/22
- Add dependency to fix justext by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/24
- Fix metadata inference with pandas and dask by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/35
- Disable PyTorch Compile Multiprocessing by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/34
- Improve speed of AddId module by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/36
- Make GPU dependencies optional by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/27
- Fix failing GPU tests with latest pandas bump by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/41
- Adds Nemo Curator K8s example by @terrykong in https://github.com/NVIDIA/NeMo-Curator/pull/40
- Move common dedup utils and remove unused code by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/42
- Fix lang id example by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/37
- Add dataset blending tool by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/32
- High level fuzzy duplicates module by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/46
- Fix indexing in PII Modifier by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/55
- Disable string conversion globally by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/56
- Fix issue #43 (empty files creation) and improve reading/writing speed by @miguelusque in https://github.com/NVIDIA/NeMo-Curator/pull/57
- [Tutorials] Add a tutorial for PEFT data curation by @Maghoumi in https://github.com/NVIDIA/NeMo-Curator/pull/45
- Only import PII constants during Curator import by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/61
- Align
extract_partitioning_indexlogic with upstream shuffling by @rjzamora in https://github.com/NVIDIA/NeMo-Curator/pull/60 - [REVIEW] Switch Models to use Crossfit by @VibhuJawa in https://github.com/NVIDIA/NeMo-Curator/pull/58
- Remove argparse from get_client function signature by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/12
- Fuzzy Dedup: Use text_field instead of hardcoded text column by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/74
- Add pull request template by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/78
- Add jupyter notebook tutorial for single node mulilingual dataset by @nicoleeeluo in https://github.com/NVIDIA/NeMo-Curator/pull/30
- Update issue templates by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/81
- Fix #91 - Incorrect reference to domainclassifierexample.py by @miguelusque in https://github.com/NVIDIA/NeMo-Curator/pull/92
- Fix #63. Add --input-meta parameter to explicitly specify the jsonl field dtypes by @miguelusque in https://github.com/NVIDIA/NeMo-Curator/pull/75
- Update readme by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/93
- Update documentation for new version by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/83
- Update requirements documentation. by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/98
- Make sure query-planning is disabled for now by @rjzamora in https://github.com/NVIDIA/NeMo-Curator/pull/97
- Applying SEO Best Pratices by @aschilling-nv in https://github.com/NVIDIA/NeMo-Curator/pull/104
- Shuffle CC result on group before writing out by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/110
- Added tutorials to index.rst by @jgerh in https://github.com/NVIDIA/NeMo-Curator/pull/113
- Pin to numpy<2 to avoid spacy compat issues by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/119
- Fix #116. Fix broken links by @miguelusque in https://github.com/NVIDIA/NeMo-Curator/pull/117
- Update index.rst by @aschilling-nv in https://github.com/NVIDIA/NeMo-Curator/pull/129
- Fix nemo_curator import in CPU only environment when GPU packages are installed. by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/123
- Improve Common Crawl download by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/82
- Update README.md by @Maghoumi in https://github.com/NVIDIA/NeMo-Curator/pull/126
- Allow multiple filenames per partition when separating by metadata by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/99
- [REVIEW] Add Resiliparse option for text extraction by @sarahyurick in https://github.com/NVIDIA/NeMo-Curator/pull/128
- Fix 69 - Refactor how arguments are added to scripts by @miguelusque in https://github.com/NVIDIA/NeMo-Curator/pull/102
- Stricter check for query planning. by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/107
- Add DataFrame example to Distributed Data Classification tutorial by @sarahyurick in https://github.com/NVIDIA/NeMo-Curator/pull/137
- Enable Sem-dedup by @VibhuJawa in https://github.com/NVIDIA/NeMo-Curator/pull/130
- Remove lxml installation by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/140
- Nemotron 340 SDG Pipeline Tutorial by @chrisalexiuk-nvidia in https://github.com/NVIDIA/NeMo-Curator/pull/144
- Add Synthetic Data Generation Module by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/136
- Skip explicit comms shuffle for dask-cuda 24.06 by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/147
- Add support for NeMo SDK by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/131
- [REVIEW] Fix SemDedup bugs by @VibhuJawa in https://github.com/NVIDIA/NeMo-Curator/pull/151
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/NVIDIA/NeMo-Curator/pull/135
- Fix bug with torch rmm and nemo by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/155
New Contributors
- @ayushdg made their first contribution in https://github.com/NVIDIA/NeMo-Curator/pull/13
- @terrykong made their first contribution in https://github.com/NVIDIA/NeMo-Curator/pull/40
- @rjzamora made their first contribution in https://github.com/NVIDIA/NeMo-Curator/pull/60
- @nicoleeeluo made their first contribution in https://github.com/NVIDIA/NeMo-Curator/pull/30
- @aschilling-nv made their first contribution in https://github.com/NVIDIA/NeMo-Curator/pull/104
- @pre-commit-ci made their first contribution in https://github.com/NVIDIA/NeMo-Curator/pull/135
Full Changelog: https://github.com/NVIDIA/NeMo-Curator/commits/v0.4.0
- Python
Published by ryantwolf almost 2 years ago
curator - v0.3.0
What's Changed
- Update README by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/6
- [Tutorials] Add a readme file for the TinyStories tutorial by @Maghoumi in https://github.com/NVIDIA/NeMo-Curator/pull/5
- Add workflow for running cpu pytests by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/13
- Add pre-commit style checks by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/14
- Add citation by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/15
- Fix Noisy CUDA Shutdown by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/20
- Bump Python and RAPIDS versions by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/16
- Add batched decorator by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/18
- Add issue templates by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/22
- Add dependency to fix justext by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/24
- Fix metadata inference with pandas and dask by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/35
- Disable PyTorch Compile Multiprocessing by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/34
- Improve speed of AddId module by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/36
- Make GPU dependencies optional by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/27
- Fix failing GPU tests with latest pandas bump by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/41
- Adds Nemo Curator K8s example by @terrykong in https://github.com/NVIDIA/NeMo-Curator/pull/40
- Move common dedup utils and remove unused code by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/42
- Fix lang id example by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/37
- Add dataset blending tool by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/32
- High level fuzzy duplicates module by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/46
- Fix indexing in PII Modifier by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/55
- Disable string conversion globally by @ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/56
- Fix issue #43 (empty files creation) and improve reading/writing speed by @miguelusque in https://github.com/NVIDIA/NeMo-Curator/pull/57
- [Tutorials] Add a tutorial for PEFT data curation by @Maghoumi in https://github.com/NVIDIA/NeMo-Curator/pull/45
- Only import PII constants during Curator import by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/61
- Align
extract_partitioning_indexlogic with upstream shuffling by @rjzamora in https://github.com/NVIDIA/NeMo-Curator/pull/60
New Contributors
- @Maghoumi made their first contribution in https://github.com/NVIDIA/NeMo-Curator/pull/5
- @terrykong made their first contribution in https://github.com/NVIDIA/NeMo-Curator/pull/40
- @miguelusque made their first contribution in https://github.com/NVIDIA/NeMo-Curator/pull/57
- @rjzamora made their first contribution in https://github.com/NVIDIA/NeMo-Curator/pull/60
Full Changelog: https://github.com/NVIDIA/NeMo-Curator/commits/v0.3.0
PyPi
https://pypi.org/project/nemo-curator/0.3.0/
- Python
Published by ayushdg almost 2 years ago