https://github.com/modelscope/data-juicer

Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Science Score: 46.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ○ CITATION.cff file
  • ✓ codemeta.json file
    Found codemeta.json file
  • ✓ .zenodo.json file
    Found .zenodo.json file
  • ○ DOI references
  • ✓ Academic publication links
    Links to: arxiv.org, ieee.org
  • ✓ Committers with academic emails
    2 of 33 committers (6.1%) from academic institutions
  • ○ Institutional organization owner
  • ○ JOSS paper metadata
  • ○ Scientific vocabulary similarity
    Low similarity (8.3%) to scientific vocabulary

Keywords

data data-analysis data-pipeline data-processing data-science data-visualization foundation-models instruction-tuning large-language-models llm llms multi-modal pre-training synthetic-data

Keywords from Contributors

transformer graph-computation
Last synced: 5 months ago

Repository

Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Basic Info
Statistics
  • Stars: 5,142
  • Watchers: 20
  • Forks: 267
  • Open Issues: 68
  • Releases: 19
Topics
data data-analysis data-pipeline data-processing data-science data-visualization foundation-models instruction-tuning large-language-models llm llms multi-modal pre-training synthetic-data
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
Readme License

README.md

[Chinese Homepage] | [DJ-Cookbook] | [OperatorZoo] | [API] | [Awesome LLM Data]

Data Processing for and with Foundation Models

Data-Juicer

Data-Juicer is a one-stop system to process text and multimodal data for and with foundation models (typically LLMs). We provide a playground with a managed JupyterLab. Try Data-Juicer straight away in your browser! If you find Data-Juicer useful for your research or development, please kindly support us by starring it (you will then be instantly notified of our new releases) and citing our works.

Platform for AI of Alibaba Cloud (PAI) has deeply integrated Data-Juicer into its data processing products. PAI is an AI Native large model and AIGC engineering platform that provides dataset management, computing power management, model tool chain, model development, model training, model deployment, and AI asset management. For documentation on data processing, please refer to: PAI-Data Processing for Large Models.

Data-Juicer is being actively updated and maintained. We will periodically enhance and add more features, data recipes and datasets. We welcome you to join us in promoting data-model co-development, along with research and applications of foundation models!

[Demo Video] DataJuicer-Agent: Quick start your data processing journey!

https://github.com/user-attachments/assets/6eb726b7-6054-4b0c-905e-506b2b9c7927

[Demo Video] DataJuicer-Sandbox: Better data-model co-dev at a lower cost!

https://github.com/user-attachments/assets/a45f0eee-0f0e-4ffe-9a42-d9a55370089d

News

History News:

- [2024-12-17] We propose *HumanVBench*, which comprises 16 human-centric tasks with synthetic data, benchmarking 22 video-MLLMs' capabilities from views of inner emotion and outer manifestations. See more details in our [paper](https://arxiv.org/abs/2412.17574), and try to [evaluate](https://github.com/modelscope/data-juicer/tree/HumanVBench) your models with it.
- [2024-11-22] We release DJ [v1.0.0](https://github.com/modelscope/data-juicer/releases/tag/v1.0.0), in which we refactored Data-Juicer's *Operator*, *Dataset*, *Sandbox* and many other modules for better usability, such as supporting fault-tolerant, FastAPI and adaptive resource management.
- [2024-08-25] We give a [tutorial](https://modelscope.github.io/data-juicer/_static/tutorial_kdd24.html) about data processing for multimodal LLMs in KDD'2024.
- [2024-08-09] We propose Img-Diff, which enhances the performance of multimodal large language models through *contrastive data synthesis*, achieving a score that is 12 points higher than GPT-4V on the [MMVP benchmark](https://tsb0601.github.io/mmvp_blog/). See more details in our [paper](https://arxiv.org/abs/2408.04594), and download the dataset from [huggingface](https://huggingface.co/datasets/datajuicer/Img-Diff) and [modelscope](https://modelscope.cn/datasets/Data-Juicer/Img-Diff).
- [2024-07-24] "Tianchi Better Synth Data Synthesis Competition for Multimodal Large Models": our 4th data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532251) for more information.
- [2024-07-17] We utilized the Data-Juicer [Sandbox Laboratory Suite](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox.md) to systematically optimize data and models through a co-development workflow between data and models, achieving a new top spot on the [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) text-to-video leaderboard. The related achievements have been compiled and published in a [paper](http://arxiv.org/abs/2407.11784), and the model has been released on the [ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V) and [HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V) platforms.
- [2024-07-12] Our *awesome list of MLLM-Data* has evolved into a systemic [survey](https://arxiv.org/abs/2407.08583) from model-data co-development perspective. Welcome to [explore](docs/awesome_llm_data.md) and contribute!
- [2024-06-01] ModelScope-Sora "Data Directors" creative sprint: our third data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532219) for more information.
- [2024-03-07] We release **Data-Juicer [v0.2.0](https://github.com/modelscope/data-juicer/releases/tag/v0.2.0)** now! In this new version, we support more features for **multimodal data (including video now)**, and introduce **[DJ-SORA](docs/DJ_SORA.md)** to provide open large-scale, high-quality datasets for SORA-like models.
- [2024-02-20] We have actively maintained an *awesome list of LLM-Data*, welcome to [visit](docs/awesome_llm_data.md) and contribute!
- [2024-02-05] Our paper has been accepted by SIGMOD'24 industrial track!
- [2024-01-10] Discover new horizons in "Data Mixture": our second data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532174) for more information.
- [2024-01-05] We release **Data-Juicer v0.1.3** now! In this new version, we support **more Python versions** (3.8-3.10), and support **multimodal** dataset [converting](tools/fmt_conversion/multimodal/README.md)/[processing](docs/Operators.md) (Including texts, images, and audios. More modalities will be supported in the future). Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033).
- [2023-10-13] Our first data-centric LLM competition begins! Please visit the competition's official websites, FT-Data Ranker ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.

Why Data-Juicer?

  • Systematic & Reusable: Empowering users with a systematic library of 100+ core OPs, and 50+ reusable config recipes and dedicated toolkits, designed to function independently of specific multimodal LLM datasets and processing pipelines. Supporting data analysis, cleaning, and synthesis in pre-training, post-tuning, en, zh, and more scenarios.

  • User-Friendly & Extensible: Designed for simplicity and flexibility, with easy-start guides and a DJ-Cookbook full of demo use cases. Feel free to implement your own OPs for customized data processing.

Data-Juicer now uses AI to automatically rewrite and optimize operator docstrings, generating detailed operator documentation to help users quickly understand the functionality and usage of each operator.
For details about the implementation of this documentation enhancement workflow, please visit the demos/opdocenhanceworkflow folder under the `djagents` branch.

  • Efficient & Robust: Providing performance-optimized parallel data processing (Aliyun-PAI\Ray\CUDA\OP Fusion), faster with less resource usage, verified in large-scale production environments.

  • Effect-Proven & Sandbox: Supporting data-model co-development, enabling rapid iteration through the sandbox laboratory, and providing features such as feedback loops and visualization, so that you can better understand and improve your data and models. Many effect-proven datasets and models have been derived from DJ, in scenarios such as pre-training, text-to-video and image-to-text generation.
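
To make the idea of operators (OPs) and reusable recipes concrete, here is a minimal, library-free sketch. It does not use Data-Juicer's actual API: `whitespace_mapper`, `min_length_filter`, and `run_recipe` are hypothetical names invented for illustration, and a "recipe" is assumed here to be just an ordered list of steps that either transform a sample or decide whether to keep it.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Sample = Dict[str, str]
# A recipe step is a (kind, op) pair; kind is "map" or "filter".
Recipe = List[Tuple[str, Callable]]


def whitespace_mapper(sample: Sample) -> Sample:
    """Mapper-style OP (hypothetical): collapse runs of whitespace in 'text'."""
    return {**sample, "text": " ".join(sample["text"].split())}


def min_length_filter(min_chars: int) -> Callable[[Sample], bool]:
    """Filter-style OP factory (hypothetical): keep samples with enough characters."""
    def keep(sample: Sample) -> bool:
        return len(sample["text"]) >= min_chars
    return keep


def run_recipe(samples: Iterable[Sample], recipe: Recipe) -> List[Sample]:
    """Apply each step of the recipe, in order, to the whole dataset."""
    current = list(samples)
    for kind, op in recipe:
        if kind == "map":
            current = [op(s) for s in current]
        elif kind == "filter":
            current = [s for s in current if op(s)]
        else:
            raise ValueError(f"unknown step kind: {kind!r}")
    return current


raw = [{"text": "  hello   world  "}, {"text": " hi "}]
recipe: Recipe = [("map", whitespace_mapper), ("filter", min_length_filter(5))]
cleaned = run_recipe(raw, recipe)
print(cleaned)
```

Because the steps are plain data, the same recipe can be reused on a different dataset or reordered without touching the OP implementations, which is the property the "Systematic & Reusable" bullet describes.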

Documentation

License

Data-Juicer is released under Apache License 2.0.

Contribution and Acknowledgements

Data-Juicer has benefited greatly from and continues to welcome contributions at all levels: new operators (from simple functions to advanced algorithms based on existing papers), data-recipes & processing scenarios, feature requests, efficiency enhancements, bug fixes, better documentation and usage feedback. Please refer to our Developer Guide to get started. Spreading the word in the community and giving the repository a star ⭐ are also invaluable forms of support!

Our sincere gratitude goes to all our code contributors who are the cornerstone of this project. We strive to keep the list below updated and look forward to including more names (alphabetical order); please reach out if we have missed any acknowledgements.

- Initiated by: Alibaba Tongyi Lab
- Co-developed and Optimized with: Alibaba Cloud PAI, Anyscale (Ray Team), Sun Yat-sen University (Knowledge Engineering Lab), NVIDIA (NeMo Team), ...
- Used by & Valuable Feedback from: AgentScope, Alibaba Group, Ant Group, BYD Auto, Bytedance, CAS, DiffSynth-Studio, EasyAnimate, Eval-Scope, JD.com, LLaMA-Factory, Nanjing University, OPPO, Peking University, RM-Gallery, RUC, Tsinghua University, Trinity-RFT, UCAS, Xiaohongshu, Xiaomi, Ximalaya, Zhejiang University, ...
- Inspired by: Data-Juicer also thanks pioneering open-source projects such as Apache Arrow, BLOOM, RedPajama-Data, Ray, Hugging Face Datasets, ...

We look forward to your feedback and collaboration, including partnership inquiries or proposals for new sub-projects related to Data-Juicer. Feel free to contact us via issues, PRs, the Slack channel, the DingDing group, or e-mail.

References

If you find Data-Juicer useful for your research or development, please kindly cite the following works: 1.0paper, 2.0paper.

```
@inproceedings{djv1,
  title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},
  author={Daoyuan Chen and Yilun Huang and Zhijian Ma and Hesen Chen and Xuchen Pan and Ce Ge and Dawei Gao and Yuexiang Xie and Zhaoyang Liu and Jinyang Gao and Yaliang Li and Bolin Ding and Jingren Zhou},
  booktitle={International Conference on Management of Data},
  year={2024}
}

@article{djv2,
  title={Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models},
  author={Chen, Daoyuan and Huang, Yilun and Pan, Xuchen and Jiang, Nana and Wang, Haibin and Ge, Ce and Chen, Yushuo and Zhang, Wenhao and Ma, Zhijian and Zhang, Yilei and Huang, Jun and Lin, Wei and Li, Yaliang and Ding, Bolin and Zhou, Jingren},
  journal={arXiv preprint arXiv:2501.14755},
  year={2024}
}
```

More data-related papers from the Data-Juicer Team:

- (ICML'25 Spotlight) [Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development](https://arxiv.org/abs/2407.11784)
- (CVPR'25) [ImgDiff: Contrastive Data Synthesis for Vision Large Language Models](https://arxiv.org/abs/2408.04594)
- (TPAMI'25) [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583)
- (Benchmark Data) [HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data](https://arxiv.org/abs/2412.17574)
- (Benchmark Data) [DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?](https://www.arxiv.org/abs/2505.16915)
- (Data Synthesis) [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://www.arxiv.org/abs/2502.04380)
- (Data Synthesis) [MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?](https://arxiv.org/abs/2503.09499)
- (Data Scaling) [BiMix: A Bivariate Data Mixing Law for Language Model Pretraining](https://arxiv.org/abs/2405.14908)

Owner

  • Name: ModelScope
  • Login: modelscope
  • Kind: organization
  • Email: contact@modelscope.cn

Model-as-a-Service in the making: bring accessible AI to all.

GitHub Events

Total
  • Create event: 145
  • Release event: 16
  • Issues event: 163
  • Watch event: 2,073
  • Delete event: 173
  • Member event: 6
  • Issue comment event: 274
  • Push event: 1,129
  • Pull request review event: 477
  • Pull request review comment event: 353
  • Pull request event: 359
  • Fork event: 93
Last Year
  • Create event: 145
  • Release event: 16
  • Issues event: 163
  • Watch event: 2,073
  • Delete event: 173
  • Member event: 6
  • Issue comment event: 274
  • Push event: 1,129
  • Pull request review event: 477
  • Pull request review comment event: 353
  • Pull request event: 359
  • Fork event: 93

Committers

Last synced: 6 months ago

All Time
  • Total Commits: 410
  • Total Committers: 33
  • Avg Commits per committer: 12.424
  • Development Distribution Score (DDS): 0.712
Past Year
  • Commits: 217
  • Committers: 23
  • Avg Commits per committer: 9.435
  • Development Distribution Score (DDS): 0.677
Top Committers
Name Email Commits
Yilun Huang l****l@a****m 118
BeachWang 1****7@p****n 48
Daoyuan Chen 6****c 45
Ce Ge (戈策) g****e@f****m 35
zhijianma z****j@a****m 30
Cathy0908 3****8 20
garyzhang99 4****9 16
chenhesen h****s@a****m 12
Xuchen Pan 3****c 11
Cyrus Zhang c****g@g****m 11
co63oc c****c 10
Yuhan Liu 3****x 9
cmgzn 8****n 8
Zhen Qin z****n@g****m 7
chenyushuo 2****6@q****m 5
Qirui-jiao 1****o 3
lingzhq 1****q 3
2108038773 1****3 2
JamieYu y****a@f****m 2
Yuexiang XIE y****x@a****m 2
weijie 3****o 1
simplaj 3****j 1
seanzhang-zhichen 7****n 1
ricksun2023 1****3 1
panghu 5****i 1
jackylee q****1@g****m 1
Yanyi Liu w****u@1****m 1
ShenQianli s****i@u****u 1
Ruidong-X x****g@g****m 1
NuODaniel z****n@b****m 1
and 3 more...
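
The all-time committer statistics above can be reproduced from the listed counts. The sketch below assumes the commonly used definition of the Development Distribution Score (DDS) as one minus the top committer's share of all commits; this page does not state its formula, so treat that definition as an assumption. The function names are illustrative only.

```python
def avg_commits_per_committer(total_commits: int, total_committers: int) -> float:
    """Mean number of commits per committer."""
    return total_commits / total_committers


def development_distribution_score(top_committer_commits: int, total_commits: int) -> float:
    """Assumed DDS definition: 1 - (top committer's commits / total commits)."""
    return 1 - top_committer_commits / total_commits


# All-time figures from this page: 410 commits, 33 committers,
# and 118 commits by the top committer (Yilun Huang).
avg_all_time = round(avg_commits_per_committer(410, 33), 3)
dds_all_time = round(development_distribution_score(118, 410), 3)

# Past-year figures: 217 commits by 23 committers.
avg_past_year = round(avg_commits_per_committer(217, 23), 3)

print(avg_all_time, dds_all_time, avg_past_year)  # 12.424 0.712 9.435
```

Under this definition a DDS close to 0 means one person authored nearly all commits, while values closer to 1 indicate work spread across many committers. The past-year DDS of 0.677 cannot be rechecked here because the page does not list per-committer counts for that window.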
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 172
  • Total pull requests: 593
  • Average time to close issues: 30 days
  • Average time to close pull requests: 9 days
  • Total issue authors: 114
  • Total pull request authors: 34
  • Average comments per issue: 1.8
  • Average comments per pull request: 0.42
  • Merged pull requests: 433
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 113
  • Pull requests: 420
  • Average time to close issues: 26 days
  • Average time to close pull requests: 6 days
  • Issue authors: 83
  • Pull request authors: 28
  • Average comments per issue: 1.12
  • Average comments per pull request: 0.34
  • Merged pull requests: 303
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • BeachWang (10)
  • yxdyc (10)
  • drcege (9)
  • HYLcool (6)
  • abchbx (4)
  • simplew2011 (4)
  • javapythonphp (3)
  • charonkk (3)
  • HunterLG (3)
  • DietDietDiet (2)
  • wangpi26 (2)
  • tian969 (2)
  • baiyi-os (2)
  • Yang-QW (2)
  • xiafeng-nb (2)
Pull Request Authors
  • HYLcool (121)
  • BeachWang (82)
  • drcege (62)
  • yxdyc (48)
  • Cathy0908 (38)
  • liuyuhanalex (28)
  • cyruszhang (27)
  • garyzhang99 (26)
  • cmgzn (23)
  • co63oc (19)
  • Qirui-jiao (17)
  • pan-x-c (16)
  • chenyushuo (12)
  • zhenqincn (11)
  • lingzhq (10)
Top Labels
Issue Labels
question (78) bug (43) stale-issue (35) enhancement (35) dj:op (8) dj:multimodal (5) dj:dist (3) priority:high (2) good first issue (2) documentation (2) competition:BetterSynth (1) dj:dataset (1) dj:post-tuning (1) environment (1) dj:core (1) help wanted (1)
Pull Request Labels
enhancement (178) documentation (98) dj:op (91) bug (89) dj:multimodal (46) dj:ci/cd (39) dj:core (37) dj:dist (25) environment (23) dj:efficiency (16) dj:dataset (12) dj:cookbook (12) dj:post-tuning (8) stale-pr (8) priority:high (8) agent (6) good first issue (6) invalid (4) dj:tools (4) duplicate (2) dj:text (2) dj:lite (2) dj-ci/cd (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 1,330 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 21
  • Total maintainers: 1
pypi.org: py-data-juicer

Data Processing for and with Foundation Models.

  • Versions: 21
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 1,330 Last month
Rankings
Stargazers count: 7.1%
Dependent packages count: 7.4%
Forks count: 12.0%
Downloads: 16.9%
Average: 22.4%
Dependent repos count: 68.9%
Maintainers (1)
Last synced: 6 months ago