curator

Scalable data pre processing and curation toolkit for LLMs

https://github.com/nvidia-nemo/curator

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.5%) to scientific vocabulary

Keywords

data data-curation data-prep data-preparation data-processing data-processing-pipelines data-quality datacuration datarecipes deduplication fast-data-processing fine-tuning large-language-models large-scale-data-processing llm llm-data-quality llmapps python semantic-deduplication
Last synced: 4 months ago · JSON representation ·

Repository

Scalable data pre processing and curation toolkit for LLMs

Basic Info
  • Host: GitHub
  • Owner: NVIDIA-NeMo
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 16.1 MB
Statistics
  • Stars: 1,111
  • Watchers: 17
  • Forks: 166
  • Open Issues: 95
  • Releases: 18
Topics
data data-curation data-prep data-preparation data-processing data-processing-pipelines data-quality datacuration datarecipes deduplication fast-data-processing fine-tuning large-language-models large-scale-data-processing llm llm-data-quality llmapps python semantic-deduplication
Created almost 2 years ago · Last pushed 4 months ago
Metadata Files
Readme Changelog Contributing License Citation Security

README.md

![https://pypi.org/project/nemo-curator](https://img.shields.io/github/license/NVIDIA-NeMo/Curator) ![https://pypi.org/project/nemo-curator/](https://img.shields.io/pypi/pyversions/nemo-curator.svg) ![NVIDIA-NeMo/Curator](https://img.shields.io/github/contributors/NVIDIA-NeMo/Curator) ![https://github.com/NVIDIA-NeMo/Curator/releases](https://img.shields.io/github/release/NVIDIA-NeMo/Curator) ![https://github.com/Naereen/badges/](https://badgen.net/badge/open%20source/❤/blue?icon=github)

Accelerate Data Processing and Streamline Synthetic Data Generation with NVIDIA NeMo Curator

NeMo Curator, part of the NVIDIA NeMo software suite for managing the AI agent lifecycle, is a Python library specifically designed for fast and scalable data processing and curation for generative AI use cases such as foundation language model pretraining, text-to-image model training, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT).

It greatly accelerates data processing and curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model convergence through the preparation of high-quality tokens.

NeMo Curator also provides pre-built pipelines for synthetic data generation for customization and evaluation of generative AI systems. You can use any OpenAI API compatible model and plug it in NeMo Curator's synthetic data generation pipelines to process and curate high-quality synthetic data for various use cases.

Getting Started

New to NeMo Curator? Start with our quickstart guides for hands-on experience:

For production deployments and advanced configurations, see our Setup & Deployment documentation.


Key Features

With NeMo Curator, you can process raw data and curate high-quality data for training and customizing generative AI models such as LLMs, VLMs and WFMs. NeMo Curator provides a collection of scalable data processing modules for text and image curation.

Text Curation

All of our text pipelines have great multilingual support. With NeMo Curator, you can pick and choose the features you want and build your data curation pipelines. Text curation follows a three-stage workflow: LoadProcessGenerate. A typical pipeline starts by downloading raw data from public resources, then applies cleaning and filtering steps, and optionally generates synthetic data for training enhancement.

Load Data

  • Download and Extraction - Default implementations for Common Crawl, Wikipedia, and ArXiv sources with easy customization for other sources

Process Data

  • Quality Assessment & Filtering

  • Deduplication

  • Content Processing & Cleaning

    • Text Cleaning - Remove improperly decoded Unicode characters, inconsistent line spacing, and excessive URLs
    • PII Redaction - Identify and remove personally identifiable information from training datasets
  • Specialized Processing

Generate Data


Image Curation

NeMo Curator provides powerful image curation features to curate high-quality image data for training generative AI models such as LLMs, VLMs, and WFMs. Image curation follows a LoadProcess workflow: download datasets in WebDataset format, create embeddings, apply quality filters (NSFW and Aesthetic), and remove duplicates using semantic deduplication.

Load Data

Process Data


Module Ablation and Compute Performance

The modules within NeMo Curator were primarily designed to process and curate high-quality documents at scale. To evaluate the quality of the data, we curated Common Crawl documents and conducted a series of ablation experiments. In these experiments, we trained a 357M-parameter GPT-style model using datasets generated at various stages of our data curation pipeline, which was implemented in NeMo Curator.

The following figure shows that the use of different data curation modules implemented in NeMo Curator led to improved model zero-shot downstream task performance.

drawing

NeMo Curator leverages NVIDIA RAPIDS™ libraries like cuDF, cuML, and cuGraph along with Dask to scale workloads across multi-node, multi-GPU environments, significantly reducing data processing time. With NeMo Curator, developers achieve approximately 16× faster fuzzy‑deduplication on an 8 TB RedPajama‑v2 subset, with ~40% lower TCO and near‑linear scaling on 1–4 H100 80 GB nodes. Refer to the chart below to learn more details.

drawing

NeMo Curator exhibits near‑linear scaling for fuzzy deduplication. On an 8 TB RedPajama‑v2 subset (~1.78 trillion tokens), processing time drops from 2.05 hours on one H100 80 GB node to 0.50 hours on four nodes. Refer to the scaling chart below to learn more:

drawing

Contribute to NeMo Curator

We welcome community contributions! Please refer to CONTRIBUTING.md for the process.

Owner

  • Name: NVIDIA-NeMo
  • Login: NVIDIA-NeMo
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use this software, please cite it as below."
title: "NeMo-Curator: a toolkit for data curation"
repository-code: https://github.com/NVIDIA/NeMo-Curator
authors:
  - family-names: Jennings
    given-names: Joseph
  - family-names: Patwary
    given-names: Mostofa
  - family-names: Subramanian
    given-names: Sandeep
  - family-names: Prabhumoye
    given-names: Shrimai
  - family-names: Dattagupta
    given-names: Ayush
  - family-names: Jawa
    given-names: Vibhu
  - family-names: Liu
    given-names: Jiwei
  - family-names: Wolf
    given-names: Ryan
  - family-names: Yurick
    given-names: Sarah
  - family-names: Singh
    given-names: Varun

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 38
  • Total pull requests: 168
  • Average time to close issues: 6 months
  • Average time to close pull requests: 13 days
  • Total issue authors: 16
  • Total pull request authors: 25
  • Average comments per issue: 0.63
  • Average comments per pull request: 1.38
  • Merged pull requests: 81
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 36
  • Pull requests: 167
  • Average time to close issues: 3 months
  • Average time to close pull requests: 8 days
  • Issue authors: 15
  • Pull request authors: 24
  • Average comments per issue: 0.5
  • Average comments per pull request: 1.37
  • Merged pull requests: 81
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • sarahyurick (9)
  • VibhuJawa (6)
  • abhinavg4 (4)
  • praateekmahajan (4)
  • sithape2025 (2)
  • ayushdg (2)
  • chtruong814 (2)
  • richardliaw (1)
  • ronjer30 (1)
  • CharlieTruong (1)
  • bschifferer (1)
  • suiyoubi (1)
  • leekaimao (1)
  • miguelusque (1)
  • QuyAnh2005 (1)
Pull Request Authors
  • praateekmahajan (29)
  • chtruong814 (28)
  • suiyoubi (17)
  • thomasdhc (16)
  • sarahyurick (15)
  • ayushdg (13)
  • abhinavg4 (10)
  • lbliii (8)
  • Copilot (5)
  • huvunvidia (4)
  • VibhuJawa (3)
  • Maghoumi (3)
  • TsukiSama9292 (3)
  • arhamm1 (2)
  • karpnv (2)
Top Labels
Issue Labels
enhancement (13) bug (10) Stale (6) jira (5) documentation (2) community-request (1) Run CICD (1) cherry-pick (1)
Pull Request Labels
Run CICD (12) cherry-pick (12) ray-api (11) r0.9.0 (5) Stale (5) community-request (4) gpuci (4) r1.0.0 (2) documentation (1) dependencies (1) python (1)

Dependencies

setup.py pypi
  • dask *
.github/workflows/test.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
pyproject.toml pypi