data-juicer

https://github.com/modelscope/data-juicer - Release v1.4.2: Python > 3.10 are supported; Data Attribution OPs; External OPs are supported; Install with "uv"

Major Updates

💪🏻 Data-Juicer is now compatible with Python 3.11 & 3.12. #749
🧩 5 OPs for data attribution are added. #735
🤝 Data-Juicer now supports registering and applying custom OPs from external paths using the argument custom_operator_paths. #758
🔧 "uv" is now the first choice for installing Data-Juicer thanks to its ability to resolve dependency conflicts. #760
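
The external-OP support listed above amounts to importing Python files from user-supplied paths so that the classes they define end up in an operator registry. The sketch below shows that general plugin pattern only. `REGISTRY`, `register`, and `load_external_ops` are hypothetical names invented for illustration, and Data-Juicer's own registry, decorator, and handling of custom_operator_paths may work differently.

```python
import importlib.util
import tempfile
from pathlib import Path

# Hypothetical registry; not Data-Juicer's actual registry object.
REGISTRY = {}


def register(name):
    """Return a class decorator that stores the class under `name`."""
    def decorator(cls):
        REGISTRY[name] = cls
        return cls
    return decorator


def load_external_ops(paths):
    """Import each Python file so that its decorated classes get registered."""
    for index, path in enumerate(paths):
        spec = importlib.util.spec_from_file_location(f"external_ops_{index}", path)
        module = importlib.util.module_from_spec(spec)
        # Expose the decorator to the external file (an illustrative shortcut).
        module.register = register
        spec.loader.exec_module(module)


# Demo: write a tiny "external" operator file, then load it by path.
EXTERNAL_SOURCE = '''
@register("upper_case_mapper")
class UpperCaseMapper:
    def process(self, sample):
        sample["text"] = sample["text"].upper()
        return sample
'''

with tempfile.TemporaryDirectory() as tmp_dir:
    op_file = Path(tmp_dir) / "my_ops.py"
    op_file.write_text(EXTERNAL_SOURCE)
    load_external_ops([str(op_file)])

result = REGISTRY["upper_case_mapper"]().process({"text": "hello"})
```

Keeping registration in a decorator means the loader never needs to know which classes a file defines; importing the file is enough.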

New Operators

Filter

Validation-free
- llm_perplexity_filter: Filter to keep samples whose perplexity score, computed using a specified LLM, falls within a specific range. #735
- instruction_following_difficulty_filter: Filter to keep texts whose instruction-following difficulty (IFD, https://arxiv.org/abs/2308.12032) falls within a specific range. #735
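
To make the two validation-free scores concrete, here is a minimal sketch of the arithmetic. It assumes perplexity is the exponential of the mean negative log-likelihood, and that IFD is the ratio of the answer's mean loss with the instruction to its mean loss without it, which is how the cited paper defines it as far as I can tell. The token log-probabilities are hand-made stand-ins for model outputs, and all function names are hypothetical.

```python
import math


def mean_nll(token_logprobs):
    """Average negative log-likelihood over the scored tokens."""
    return -sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs):
    """Perplexity as exp of the mean negative log-likelihood."""
    return math.exp(mean_nll(token_logprobs))


def ifd(answer_logprobs_with_instruction, answer_logprobs_alone):
    """Assumed IFD: conditioned answer loss divided by direct answer loss."""
    return mean_nll(answer_logprobs_with_instruction) / mean_nll(answer_logprobs_alone)


def in_range(score, min_score, max_score):
    """Range check shared by both filters in this sketch."""
    return min_score <= score <= max_score


# Hand-made log-probabilities standing in for real model outputs.
with_instruction = [math.log(0.5)] * 4  # each answer token has p = 0.5
alone = [math.log(0.25)] * 4            # each answer token has p = 0.25

ppl = perplexity(with_instruction)
score = ifd(with_instruction, alone)
keep = in_range(score, 0.2, 0.9)
```

With these numbers the instruction makes the answer easier to predict, so the ratio is below 1; a ratio near or above 1 would suggest the instruction contributes little.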
Validation-based
- in_context_influence_filter: Filter to keep texts whose in-context influence upon the validation set falls within a specific range. #735
- llm_task_relevance_filter: Filter to keep samples with a high relevance score to validation tasks, as estimated by an LLM. #735
- text_embd_similarity_filter: Filter to keep texts whose average embedding similarity to a set of given validation texts falls within a specific range. #735
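
As an illustration of the embedding-similarity criterion in the last item, the sketch below averages cosine similarity between one text embedding and a set of validation embeddings, then applies a range check. Cosine similarity is an assumption here (the release note does not name the metric), the vectors are toy values rather than model embeddings, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def avg_similarity(text_vec, validation_vecs):
    """Mean similarity of one text embedding to all validation embeddings."""
    return sum(cosine(text_vec, v) for v in validation_vecs) / len(validation_vecs)


def keep_sample(text_vec, validation_vecs, min_score, max_score):
    """Keep the sample only if its average similarity lies in the range."""
    return min_score <= avg_similarity(text_vec, validation_vecs) <= max_score


# Toy 2-D "embeddings" standing in for real model outputs.
validation = [[1.0, 0.0], [0.0, 1.0]]
score = avg_similarity([1.0, 0.0], validation)
kept = keep_sample([1.0, 0.0], validation, 0.4, 1.0)
```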

Enhancements

A new environment variable, DATA_JUICER_EXTERNAL_MODELS_HOME, is added so that users can specify private or read-only paths for storing external and extra models. #740
Optimize the video link transformation and multi-version maintenance in the docs. Update demo videos with higher-resolution versions. #746
Support custom save_dir for OPs that produce extra multimodal data. #751
Add official and detailed docs about Data-Juicer Agent. #759
Enhance unit tests: show the name of the current test case; recycle resources after each test case in ray mode. #749
Refine the developer guide for better practice on building new OPs. #760

Bugs Fixed

Move the updating of special tokens of multimodal data into the initialization of base_op, which fixes the bug that special tokens might not be synced with the main process when processing data in parallel. #752
Fix some test cases. #754

Acknowledgement

@ShenQianli made their first contribution to 5 new OPs. #735

Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.4.1...v1.4.2

- Python
Published by HYLcool 10 months ago

https://github.com/modelscope/data-juicer - Release v1.4.1: MCP server; GPU-based Minhash deduplicator; Improved unit test coverage.

Major Updates

🔧 Introduce Data-Juicer MCP server. Users can make use of the data processing capabilities in the MCP way conveniently. #690 #737
💪🏻 Unit test coverage rate is improved to 85%+ and several bugs in test cases are resolved (OOM, encoding error, and so on), which makes Data-Juicer more reliable. #698 #717 #720 #727
🤝 GPU-based Minhash deduplication is supported, in collaboration with developers from NVIDIA. #694 #644
🧩 RayExporter supports more formats to export a ray dataset in addition to json/jsonl. #687
🎥 Two demo videos are added to introduce the Data-Juicer core functions, agentic usages, and sandbox. #738
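
For readers unfamiliar with the MinHash deduplication mentioned above, the following CPU-only sketch shows the core idea: hash each document's word shingles under several seeds, keep the minimum per seed, and estimate Jaccard similarity as the fraction of matching minima. It is a textbook illustration, not the GPU implementation from the release. The shingle size, the number of hash functions, and the SHA-1-based hashing are all arbitrary choices made for this example.

```python
import hashlib


def shingles(text, k=3):
    """Set of k-word shingles (a single shingle if the text is shorter than k)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def _hash(seed, item):
    """Seeded 64-bit hash built from SHA-1 (an arbitrary illustrative choice)."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_perm=64):
    """One minimum hash value per seed."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
doc_c = "completely unrelated sentence about data processing pipelines"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
near_duplicate = estimated_jaccard(sig_a, sig_b)
unrelated = estimated_jaccard(sig_a, sig_c)
```

A deduplicator would then treat pairs whose estimate exceeds a chosen threshold as duplicates; in practice LSH banding is used to avoid comparing every pair.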

New Operators

download_file_mapper downloads data from URLs to local files or specified fields. #709

Enhancements

New analysis method: correlation analysis among stats is added. #663
Several core dependencies are updated and fixed to a newer version, and dependency conflicts are resolved. #715 #717 #723
The EasyAnimate pipelines in the sandbox are updated to follow the refactoring of sandbox. #710
Apply more reliable pre-commit tools to improve the code style of Data-Juicer. #714
Support storing and processing bytes data of images in the dataset. #725
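
The correlation analysis among stats listed above can be pictured as computing a pairwise correlation matrix over per-sample statistics. The sketch below uses the Pearson coefficient, which is an assumption because the release note does not say which method is used. The stat names and values are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(stats):
    """Nested dict mapping each pair of stat names to their correlation."""
    names = list(stats)
    return {a: {b: pearson(stats[a], stats[b]) for b in names} for a in names}


# Hypothetical per-sample stats; the names are not taken from Data-Juicer.
stats = {
    "text_len": [10, 20, 30, 40],
    "num_words": [2, 4, 6, 8],
    "special_char_ratio": [0.4, 0.3, 0.2, 0.1],
}
matrix = correlation_matrix(stats)
```

Strongly correlated stats are candidates for redundancy: filtering on one of them may already constrain the other.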

Bugs Fixed

The wheel & docker image building bug is fixed. #706
Fix bugs in log_summarization. #710
Fix "no module named data_juicer" error after installing from the wheel file. #727

Acknowledgement

@fanronghai helps to fix the param error in the dataset_splitting_by_language tool. #713
@ayushdg helps to support a GPU-version Minhash deduplicator. #644
@ricksun2023 helps to fix the bugs that occur when the configs contain multiple OPs with the same name. #730

Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.4.0...v1.4.1

- Python
Published by HYLcool 11 months ago

https://github.com/modelscope/data-juicer - v1.4.0 Major Refactor for Env Management, Doc, Sandbox; Derivative Works (TPAMI Survey; Trinity-RFT & DetailMaster)

Summary: 200+ files changed with 18,535 additions and 3,720 deletions.

🔧 Major Refactors & Improvements

🔄 Sandbox Usability (#686):
- Support for multiple pipelines, context info, and an environment manager to run different commands in various environments.
- Includes the InternVL example as a showcase.
📘 DJ-Doc Redesign (https://github.com/modelscope/data-juicer/pull/675):
- Now with multilingual support (English / Chinese) and a modernized style.
📦 Dependency Management Update (#660, #680):
- Migrated to uv for faster dependency resolution.
- Added sub-groups for better organization.

🌍 New Features & Integrations (#683, #688, #692)

🆕 Additional Repo Supported:
- Trinity-RFT now supported by Data-Juicer.
📜 DJ-Awesome-List:
- A survey paper accepted by TPAMI'25!
🧪 Synthetic Benchmark Added:
- DetailMaster – a new benchmark for synthetic data evaluation.
🛠️ New Operators Introduced (#673, #701):
- llm_analysis_filter
- general_field_filter

🚀 Core Optimizations & Bug Fixes

✅ Ray Executor Enhancements (#697):
- File extension detection added.
- Support for more data formats.
⏱️ Startup Time Optimization:
- Improved startup performance. (#684)
🧠 Text Embedding Support:
- Added support for text embedding via API and local model. (#681)
🐳 Docker Build Improvement:
- Ignore installed distutils libraries during Docker image building. (#668)
🛠️ Mapper Module Fix:
- Fixed issue with module initialization. (#700)
🗑️ Warning Suppression:
- Suppressed unnecessary warnings from fasttext. (#696)

📚 Full Changelog

View all changes since v1.3.3 →

- Python
Published by yxdyc 12 months ago

https://github.com/modelscope/data-juicer - Release v1.3.3: Sandbox is accepted as Spotlight by ICML 2025; Add Img-Diff recipes.

Major Updates

🎉 Our work of Data-Juicer Sandbox has been accepted as a Spotlight by ICML 2025 (top 2.6% of all submissions)!
Add new OPs and recipes for Img-Diff. #658

Enhancements

Support Hugging Face LLMs for the two llm_xxx_score_filter OPs. #655
Sync the docker image to Aliyun OSS for downloading when Docker Hub is not accessible. #657
Split standalone and distributed unit tests to save time when re-running failed ones. #666

Bugs Fixed

Address possibly missing cfg in unify_format. #653
Improve clarity & fix bad links for some docs. #659

Acknowledgement

@co63oc helps to fix some typos. #654

Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.3.2...v1.3.3

- Python
Published by HYLcool about 1 year ago

https://github.com/modelscope/data-juicer - Release v1.3.2: Enhancements on usability & two OPs; some bugs fixes

What's Changed

Human OP enhancements, in https://github.com/modelscope/data-juicer/pull/642 https://github.com/modelscope/data-juicer/pull/645
- update label-studio version
- make service script more robust
- add documentation
- optimizing fields mapping
OP efficiency optimization of document_minhash_deduplicator, in https://github.com/modelscope/data-juicer/pull/639
set temp_parser.usage to argparse.SUPPRESS, skip too much help log in https://github.com/modelscope/data-juicer/pull/643
fix date typo in https://github.com/modelscope/data-juicer/pull/648
Fix docker building failure in https://github.com/modelscope/data-juicer/pull/650
Fix StreamToLoguru compatibility issue with torch._dynamo in https://github.com/modelscope/data-juicer/pull/651
add init file for annotation module, fix dj-process command error in https://github.com/modelscope/data-juicer/pull/652

New Contributor

@cmgzn made their first contribution in https://github.com/modelscope/data-juicer/pull/651

- Python
Published by yxdyc about 1 year ago

https://github.com/modelscope/data-juicer - Release v1.3.1: added HumanOPs & fixed some bugs

Major Updates

💥 Prototype implementation for HumanOps (annotation). #617 Included features:
- boilerplate code for supporting label studio powered human annotation ops
- a human preference annotation reference implementation is provided
- label studio service script; can start up local instance using docker or pip, whichever is available
- reference configs and data
- event driven and notification mixins framework for ops

New OPs

extract_tables_from_html_mapper: extract tables from html texts. #634
general_fused_op: an explicitly fused operator designed to execute multiple sequential operations (OPs) on the same batch, enabling fine-grained control over data processing. #626
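
The idea behind an explicitly fused operator can be sketched as a single callable that runs several OPs in order on the same batch, so the batch is handed over once rather than once per OP. This is a simplified illustration with hypothetical names (`fuse`, the "map" and "filter" tags) and does not reflect how general_fused_op is configured or implemented.

```python
def fuse(ops):
    """Build one callable that runs `ops` in order on the same batch."""
    def fused(batch):
        for kind, fn in ops:
            if kind == "map":
                batch = [fn(sample) for sample in batch]
            elif kind == "filter":
                batch = [sample for sample in batch if fn(sample)]
            else:
                raise ValueError(f"unknown op kind: {kind}")
        return batch
    return fused


# Three toy OPs: strip whitespace, drop empty texts, lowercase.
strip_text = ("map", lambda s: {**s, "text": s["text"].strip()})
non_empty = ("filter", lambda s: len(s["text"]) > 0)
lower_text = ("map", lambda s: {**s, "text": s["text"].lower()})

fused_op = fuse([strip_text, non_empty, lower_text])
out = fused_op([{"text": "  Hello "}, {"text": "   "}, {"text": "World"}])
```

Because the order is fixed by the caller, a cheap filter can be placed before an expensive mapper on purpose.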

Bug Fixed

fix dataset builder initialization failure #630
update Executor references from Executor to DefaultExecutor #632 #633
switch the backend of plt to avoid sub-process/thread error #633
fix some boundary condition bugs in several deduplicators #635 #637

Others

check the dataset when loading so that a dataset can be passed to the DefaultExecutor.run method. #633
update docs to highlight light env installation part. #636

Acknowledgement

@liuyuhanalex helps to add a new OP and fix some of the boundary condition bugs. #634 #635

Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.3.0...v1.3.1

- Python
Published by HYLcool about 1 year ago

https://github.com/modelscope/data-juicer - Refactor of dataset builder and executor!

The Big Change 🚀

Refactor of dataset builder and executor, see https://github.com/modelscope/data-juicer/pull/537, @cyruszhang
- 📜 YAML explicitly defines different sources of datasets; local and remote are defined separately.
- 🔧 More flexible parameterized control; supports source-specific parameters, validations, and extensible configurations.
- 🔌 Unbind Executor's hardcode support: No longer restricted to local JSON formats; input format is determined dynamically via formatters/downloaders.
- 🚀 Enhanced Executor extensibility to natively support engines like Nemo, Dask, Spark, etc.
- 🔍 Add data format validation to ensure consistency and correctness.
- 🌐 Expanded data source support:
  a. 📦 ModelScope integration.
  b. 📚 ArXiv dataset import (download, decompress, ingest).
  c. 📚 Wikipedia dataset support (download, decompress, ingest).
  d. 🌐 Common Crawl integration (download, decompress, ingest).
- 🔗 Backward compatibility with existing dataset_path command-line syntax.
- 🔀 Support for data mixtures to combine multiple datasets dynamically.
- 🔧 Support for empty formatters/generated datasets without pre-defined config files.

Others 💡

- 🔊 New audio processing operator: audio_add_gaussian_noise (PR #622), @liuyuhanalex
- 📊 Added dynamic coverage rate badge to the README for transparency (PR #625)

- Python
Published by yxdyc about 1 year ago

https://github.com/modelscope/data-juicer - Release v1.2.2

Major Updates

🧪 Add document for API service. Add parameter transmission using json.dumps to support API calls for arbitrary registration functions and classes. #613
🚀 Add unit tests for the analysis module and utils module to increase test coverage. #604 #616
A new data synthesis method is proposed, which encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency, cross-modality generalization, and SFT effects over SOTA baselines (e.g., 16% gain on MathVision using only 400 samples). See more details in MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions.

New OPs

llm_quality_score_filter: Filter to keep samples with a high quality score estimated by an LLM; supports both API calling and local vLLM calling. #606 #614 #620
llm_difficulty_score_filter: Filter to keep samples with a high difficulty score estimated by an LLM; supports both API calling and local vLLM calling. #606 #614 #620

Others

Fix config in LLaVa pretrain recipe. #610
Update news for MindGYM and fix doc. #615
Fix decode error through UTF-8 decoding. #618

- Python
Published by BeachWang about 1 year ago

https://github.com/modelscope/data-juicer - Release v1.2.1

Major Updates

DJ has been integrated in Ray's official Ecosystem and Example Gallery. Besides, our patch in DJ2.0 for the streaming JSON reader has been officially integrated by Apache Arrow.
Our work on contrastive data synthesis, ImgDiff, has been accepted by CVPR 2025!
Unit test optimization:
- split unit tests to partial and regression: partial test is triggered by PR and only test on corresponding test cases of changed files; regression test on all cases and triggered at 7:00 on every Friday in Beijing time. #598
- use primitive @unittest.skip and remove SKIPPED_TESTS. #586
- upload test coverage reports to GitHub artifacts. #586

New OPs

image_remove_background_mapper: remove the background of images. #589

Others

add missing LOADED_AUDIOS to ALL_INTER_VARS to enable OP fusion and context sharing. #585
only build doc for py3.10. #586
move dependency on ray to minimal requirements. #586 #594 #595
allow executor and other tool functions to consume a loaded dataset in addition to the config file. #596 #597
fix undefined fileno bug of the logger. #594

Acknowledgement

@liuyuhanalex helps simplify the code logic of OP fusion, add a new OP image_remove_background_mapper, and fix some minor bugs. #581 #585 #589
@co63oc helps to fix typos in code and documents. #582 #583 #588 #591 #593
@danielhjz helps to fix the implicit memory leak problem in image_nsfw_filter. #590

- Python
Published by HYLcool over 1 year ago

https://github.com/modelscope/data-juicer - v1.2.0 Doc refactored; New algorithm proposed

What's New

📚 The DJ doc is refactored and improved, e.g., RecipeGallery, DeveloperGuide, DistributedProcess, DJ-related Competitions, typos, and bad links.
🔎 More unit-tests added.
🎛 The data pre-split and export are improved.
🔮 A new data selection method, DaaR, is proposed. See Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data.

Detailed PRs

fix export error when export_stats columns is null in https://github.com/modelscope/data-juicer/pull/557
Resplit input dataset in ray mode in https://github.com/modelscope/data-juicer/pull/549
Refactor and improve doc for RecipeGallery, DeveloperGuide, DistributedProcess and DJ-related Competitions in https://github.com/modelscope/data-juicer/pull/561
Resolve most skipped unit-tests in https://github.com/modelscope/data-juicer/pull/559
fix translation error in https://github.com/modelscope/data-juicer/pull/562
Add unittest for ray text dedup in https://github.com/modelscope/data-juicer/pull/540
[Typo]correct a small typo in https://github.com/modelscope/data-juicer/pull/563
update the 2.0 paper link & the DaaR news in https://github.com/modelscope/data-juicer/pull/566
Fix typos in https://github.com/modelscope/data-juicer/pull/571
Optimization for sdxl_prompt2prompt_mapper dependency importing in https://github.com/modelscope/data-juicer/pull/570
Fix typos in https://github.com/modelscope/data-juicer/pull/572

Acknowledgment

@liuyuhanalex @co63oc made their first PRs

Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.1.0...v1.2.0

- Python
Published by yxdyc over 1 year ago

https://github.com/modelscope/data-juicer - Release v1.1.0

Major Updates

🧪 Users can now run ray-based distributed data processing under the guidance of the added docs. #523
🧪 The DJ-Cookbook has gathered numerous high-quality data processing recipes from various vertical fields, and the related documents have been updated on the homepage. #542
💥 Change Task mode to Actor mode for ray deduplication, allowing users to use these operators without installing Redis. #526
🚀 Append a log summary of warnings and errors at the end of each run so that they remain noticeable under the sample fault tolerance mechanism. #534
🚀 Automatically update relevant documents when adding OPs to reduce the development burden. #527
🛝 Add usability tags for OPs:
- alpha tag for OPs in which only the basic OP implementations are finished;
- beta tag for OPs in which unittests are added based on the alpha version;
- stable tag for OPs in which OP optimizations related to DJ (e.g. model management, batched processing, OP fusion, ...) are added based on the beta version.

New OPs

image_segment_mapper: Perform segment-anything on images and return the bounding boxes. #550
mllm_mapper: Mapper to use MLLMs to generate texts for images. #550
sdxl_prompt2prompt_mapper: Use the generative model SDXL and image editing technique Prompt-to-Prompt to generate pairs of similar images. #550
sentence_augmentation_mapper: Augment sentences using LLMs. #550
text_pair_similarity_filter: Filter samples according to the similarity score between the text pair. #550

Bug Fixed

Add a global skip_op_error param to enable fault tolerance when executing the Data-Juicer analyzer and executor, while keeping fault tolerance disabled for unit tests. #528
Fix model force download bug. #529
Fix IndexError if the number of samples in the result dataset is less than the number of workers when saving dataset to disk. #536
Fix missing field meta tag on ray mode. #538
Update max_tokens or max_new_tokens for vllm-based OPs to avoid too short generation. #544
Fix bug in the role playing data generation demo. #545
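
The interaction between a skip_op_error switch and an end-of-run error summary can be sketched as follows. This is an illustration under stated assumptions: failed samples are simply dropped here, which may not match what Data-Juicer does with them, and `run_op`, `summarize`, and `word_count` are hypothetical helpers.

```python
from collections import Counter


def run_op(op, samples, skip_op_error=True):
    """Apply `op` to each sample; collect errors instead of raising if allowed."""
    kept, errors = [], []
    for index, sample in enumerate(samples):
        try:
            kept.append(op(sample))
        except Exception as exc:
            if not skip_op_error:
                raise
            errors.append((index, type(exc).__name__, str(exc)))
    return kept, errors


def summarize(errors):
    """One-line summary of error types, suitable for printing after a run."""
    counts = Counter(name for _, name, _ in errors)
    return ", ".join(f"{name} x{count}" for name, count in sorted(counts.items()))


def word_count(sample):
    """Toy OP that fails on samples without a usable 'text' field."""
    return {**sample, "num_words": len(sample["text"].split())}


samples = [{"text": "a b c"}, {"text": None}, {}]
kept, errors = run_op(word_count, samples)
summary = summarize(errors)
```

Passing `skip_op_error=False` makes the first failure propagate, which is the behavior one would want in unit tests.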

Others

Enhance unit test for API calling OPs. #528
Remove sandbox requirements installation from Dockerfile. #530
Update the datasource related APIs to be compatible with the latest version of Ray. #532
Limit the generated qa num for each text in generate_qa_from_text_mapper. #541
Update docs for preparing DJ2.0 release. #542
Update a quick cdn link for arch figure. #543
Add a video demo for role playing data generation. #545
Optimize op doc for global textual search. #552
Use a more stable and fast translator than google translator for automatic OP doc building. #554

Acknowledgement

@Qirui-jiao made great contributions to enrich the Data-Juicer OP pool. #550

- Python
Published by BeachWang over 1 year ago

https://github.com/modelscope/data-juicer - Release v1.0.3: More Powerful Distributed MinHashLSH Deduplicator; Post-Tuning Formats & OPs; Ray Actor for GPU-based OPs

Major Updates

💥 Support Ray-based MinHashLSH deduplicator, which implements a multi-process Union-Find set based on Ray Actor and the BTS algorithm to complete equivalence class merging. #502
💥 Support post-tuning dataset formats in LLaMA-Factory and ModelScope-Swift.
- Data-Juicer chooses the Query-Response format as the intermediate format for the post-tuning dataset. #514
- Refine the overall intermediate format of Data-Juicer to support various dataset formats better. (meta, stats) #514 #518
- Provide several format conversion tools for converting to Data-Juicer format and vice versa. #514
🚀 Add 10 more post-tuning OPs to process post-tuning datasets better. They are listed in detail in the New OPs section below. #513
🚀 Support Ray Actor mode for GPU-based OPs. #511
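
Equivalence class merging with a Union-Find set, as mentioned for the MinHashLSH deduplicator above, can be illustrated with a single-process version. The sketch below is not the distributed Ray Actor and BTS implementation from the release; it only shows how candidate duplicate pairs collapse into groups with one representative each. The document ids and pairs are made up.

```python
class UnionFind:
    """Minimal single-process Union-Find with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Keep the smaller id as representative so results are deterministic.
            self.parent[max(root_a, root_b)] = min(root_a, root_b)


# Candidate duplicate pairs, e.g. documents that collided in an LSH bucket.
pairs = [(1, 2), (2, 3), (5, 6)]
uf = UnionFind()
for a, b in pairs:
    uf.union(a, b)

# Keep exactly one document per equivalence class (its representative).
kept = {doc for doc in range(1, 7) if uf.find(doc) == doc}
```

Documents 1, 2, and 3 end up in one class even though 1 and 3 were never compared directly, which is why a Union-Find structure is a natural fit for this step.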

New OPs

Post-tuning OPs for fine-grained analysis of dialog data. #513

Mapper

dialog_intent_detection_mapper: Mapper to generate user's intent labels in feed back dialog data.
dialog_sentiment_detection_mapper: Mapper to generate user's sentiment labels in feed back dialog data.
dialog_sentiment_intensity_mapper: Mapper to predict user's sentiment intensity (from -5 to 5 in default prompt) in feed back dialog data.
dialog_topic_detection_mapper: Mapper to generate user's topic labels in feed back dialog data.
query_intent_detection_mapper: Mapper to predict user's intent label in a query.
query_sentiment_detection_mapper: Mapper to predict user's sentiment label ('negative', 'neutral' and 'positive') in a query.
query_topic_detection_mapper: Mapper to predict user's topic label in a query.

Aggregator

meta_tags_aggregator: Merge similar meta tags to one tag.

Selector

tags_specified_field_selector: Select samples based on the tags of specified field.

Grouper

naive_reverse_grouper: Split batched sample to samples.

Bug Fixed

Fix the wrong argument passing in generate_qa_from_example_mapper. #517
Update the out-of-date Dingding QR code on the main page. #513

Acknowledgement

@jackylee-ch made their first contribution to help fix several invalid links in the document. #521

Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.0.2...v1.0.3

- Python
Published by HYLcool over 1 year ago

https://github.com/modelscope/data-juicer - Release v1.0.2

Major Updates

Added more mapper/grouper/aggregator OPs for post-tuning scenarios.
Optimized the distributed mode performance and usability with more automatic features.

DJ-Operators

extract_support_text_mapper, relation_identity_mapper, python_file_mapper, https://github.com/modelscope/data-juicer/pull/500
naive_grouper, key_value_grouper, https://github.com/modelscope/data-juicer/pull/500
nested_aggregator, entity_attribute_aggregator, most_relavant_entities_aggregator, https://github.com/modelscope/data-juicer/pull/500
video_extract_frames_mapper, https://github.com/modelscope/data-juicer/pull/507

Performance

Optimize ray mode performance, https://github.com/modelscope/data-juicer/pull/442
Patch for Performance Benchmark in CI/CD workflows, https://github.com/modelscope/data-juicer/pull/506
DJ Ray mode supports streaming loading of jsonl files, https://github.com/modelscope/data-juicer/pull/515

Usability and Analysis

support dj-install in recipe-level, https://github.com/modelscope/data-juicer/pull/508
support dj-analyze with --auto mode, https://github.com/modelscope/data-juicer/pull/512
support op-wise insight auto mining, https://github.com/modelscope/data-juicer/pull/516

Acknowledgment

Thanks to Data-Juicer users and contributors for their helpful feedback, issues and PRs!

- Python
Published by yxdyc over 1 year ago

https://github.com/modelscope/data-juicer - Release v1.0.1

Major Updates

🚀 Supports automatically arranging operators from fastest to slowest based on their execution speed, and also supports automating the operator batch size according to the execution speed. #464
🚀 [UnitTest] Performance benchmark for efficiency tests of 4 modalities. Reports will be uploaded to internal wandb server. #483
💥 Added some useful OPs, including the construction of DPO training data and a lightweight user-customizable OP interface. See more details below~ #491 #492 #493

OPs

Text OPs

pair_preference_mapper: Mapper to construct preference answers for QA pairs. #491

Script OPs

python_lambda_mapper: Mapper for executing customized Python lambda functions on data samples. #492
python_file_mapper: Mapper for executing customized Python functions on data samples. #493
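
A lambda-based mapper of the kind described above needs to turn a user-supplied string into a callable. One cautious way to do that, sketched here, is to parse the string and accept it only if it is a single one-argument lambda expression before evaluating it. The helper name `build_lambda` and the validation rules are assumptions for illustration; the release note does not describe how python_lambda_mapper validates its input.

```python
import ast


def build_lambda(lambda_str):
    """Parse `lambda_str` and return the callable if it is a one-argument lambda."""
    tree = ast.parse(lambda_str, mode="eval")
    if not isinstance(tree.body, ast.Lambda):
        raise ValueError("expected a single lambda expression")
    if len(tree.body.args.args) != 1:
        raise ValueError("lambda must take exactly one argument")
    # Only evaluated after the structural checks above have passed.
    return eval(compile(tree, "<lambda_str>", "eval"))


add_length = build_lambda("lambda sample: {**sample, 'length': len(sample['text'])}")
mapped = add_length({"text": "abc"})
```

Evaluating the lambda only creates the function object, but calling it still runs arbitrary user code, so this check is a sanity guard and not a sandbox.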

Bugs Fixed

Add an argument to control whether to open Monitor for data processing. It's True by default. #483
For the mp start method of monitor, set it to "spawn" for Windows systems and "fork" for others. #483
Update transformers version to >=4.47.0 to avoid "shape mismatch" bug from older version 4.46.3. #483
Fix the logic errors in Turbo acceleration and batch processing, and ensure that map and filter are consistent in this part of the logic. #504
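
The platform-dependent start method noted in this list ("spawn" on Windows, "fork" elsewhere) can be selected with the standard library alone. The sketch below shows only that selection; `pick_start_method` is a hypothetical helper, and how the Monitor actually wires this in is not shown in the release note.

```python
import multiprocessing as mp
import platform


def pick_start_method():
    """Use "spawn" on Windows (no fork available) and "fork" on other systems."""
    return "spawn" if platform.system() == "Windows" else "fork"


# A context object scopes the choice without changing the global default.
ctx = mp.get_context(pick_start_method())
chosen = ctx.get_start_method()
```

Using `get_context` instead of `set_start_method` avoids interfering with other libraries in the same process that rely on the default.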

Others

Pin the PyAV version to prevent inconsistent updates. #504
Skip some unit test for audio OPs to avoid lazy_loader failure during multiprocessing. #503
Remove unnecessary UNFORKABLE marks for some OPs. #491
Refine the docker image building. Add a new self-hosted runner for docker image building, optimize the building logic for auto docker image building on release, change the default full image to a GPU-version image. #494 #501

Acknowledgment

Here we thank public contributors for their PRs and issues to make Data-Juicer better!

- Python
Published by BeachWang over 1 year ago

https://github.com/modelscope/data-juicer - Release v1.0.0: Refactor DJ-Dataset & DJ-Operator, Sandbox, and more exciting features!

Major Updates

🚀 Refactor Data-Juicer Operator & Dataset for better usability! We combine our two backends, HuggingFace Dataset and Ray Dataset, into a unified DJ-Dataset, and unify and introduce new invoking interfaces. Based on this, we add a fault-tolerant strategy during the data processing, helping users to know the actual reasons for processing failure. #359 #366
🧪 [Experimental] Data-Juicer Sandbox toolkit is now available! Users are allowed to develop datasets and models in a co-development way with the highly customizable Sandbox to obtain better performance. For more details, please refer to the docs. #273 #291 #312 #332 #364
🚀 Basic API server based on FastAPI is now available in Data-Juicer! Now users can make use of the capabilities of OPs with API service. #468
🚀 Support adaptive resource management:
- Adaptive number of processors for model-based OPs according to the GPU memory and other types of resource utilization. #270 #329 #354
- Adaptive batch size for batched OPs according to their resource utilization to maximize the OP speed. #429
💥 We presented a tutorial, Multi-modal Data Processing for Foundation Models: Practical Guidance and Use Cases, at KDD'24. #310
💥 A lot of additions and improvements were made to OPs, DJ-Engine, and CI/CD. See more details below~
🛝 A playground for Data-Juicer is opened for user trial. #277 #368

OPs

Text

ray_document_deduplicator: supports Ray-based distributed exact-match deduplication for text-only datasets. #263
Support sentencepiece tokenizer for MinHash deduplicators. #269
generate_qa_from_text_mapper: generates question and answer pairs from input texts. #333 #454
generate_qa_from_examples_mapper: generates question and answer pairs based on examples. #338 #454
optimize_qa_mapper: optimizes the question-answer pairs in question-answering samples. #338 #454
optimize_query_mapper: optimizes the query in question-answering samples. #338 #454
optimize_response_mapper: optimizes the response in question-answering samples. #454
calibrate_qa_mapper: calibrates question-answer pairs based on reference text. #463
calibrate_query_mapper: calibrates query in question-answer pairs based on reference text. #463
calibrate_response_mapper: calibrates response in question-answer pairs based on reference text. #463
text_chunk_mapper: splits input text to chunks. #481
extract_entity_attribute_mapper: extracts attributes for given entities from the text. #481
extract_entity_relation_mapper: extracts entities and relations in the text for knowledge graph. #481
extract_event_mapper: extracts events and relevant characters in the text. #481
extract_keyword_mapper: generates keywords for the text. #481
extract_nickname_mapper: extracts nickname relationship in the text. #481

Image

image_face_blur_mapper: blurs faces detected in images. #249
image_nsfw_filter: keeps samples containing images with NSFW scores below the threshold. #252
image_watermark_filter: keeps samples containing images with predicted watermark probabilities below the threshold. #256
ray_image_deduplicator: supports Ray-based distributed exact-match deduplication for image or image-text datasets. #263
image_pair_similarity_filter: keeps image pairs with image feature cosine similarity within the specified range based on a CLIP model. #393
image_tagging_mapper: generates image tags from the input images. #423
image_face_count_filter: keeps samples containing images with face counts within the specified range. #446

Video

video_face_blur_mapper: blurs faces detected in videos. #253
video_remove_watermark_mapper: removes the watermarks in given regions from the videos. #236
video_nsfw_filter: keeps samples containing videos with NSFW scores below the threshold. #252
video_watermark_filter: keeps samples containing videos with predicted watermark probabilities below the threshold. #256
ray_video_deduplicator: supports Ray-based distributed exact-match deduplication for video or video-text datasets. #263
video_tagging_from_frames_filter: keeps samples containing videos with given tags. #260
video_captioning_from_frames_mapper: generates samples whose captions are generated based on an image-to-text model and sampled video frames. Captions from different frames will be concatenated into a single string. #257
video_captioning_from_summarizer_mapper: generates video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, ...). #250
video_motion_score_raft_filter: keeps samples with video motion scores (based on RAFT model) within a specific range. #478
Enhance the video_motion_score_filter to support float sampling FPS, frame resizing, optical flow magnitude normalization, and so on. #361

Misc.

Switch face detection used in 3 OPs (image_face_ratio_filter, image_face_blur_mapper, video_face_blur_mapper) from dlib to OpenCV to avoid dependency problems. #320
Deduplicators for multimodal datasets are allowed to consider text information as well. #313
Support batched processing for some OPs. #406 #435

Others (Engine, Job Control and Tools)

Support more multimodal (video) dataset conversion tools: MSR-VTT #248
Support distributed processing script for Slurm. #242
Support Minhash-LSH deduplication tools based on Spark. #290
Enable GPU usage for Ray executor. #274
Add debug mode for Data-Juicer. #303
Add video generation tools for several metrics. #273 #312
Deploy a self-hosted runner for unit tests and enable unit tests for Ray mode. #304
Add sampled frames from videos for video OPs to support OP fusion. #271
Allow to save stats for each OP respectively by specifying the exporting paths for them. #309
Add a new field to record the source files of multimodal data when they are augmented or regenerated by some OPs, so it's convenient to trace back. #317
Support turbo mode to disable some processing-unrelated functions to maximize the processing speed and save resource utilization. #402
Update type annotations from jsonargparse to Pydantic. #422
Add a Monitor module to monitor the resource utilization during data processing for each OP. #429
Allow lazy importing for third-party libraries and installing dependencies if they are not installed. #414 #443
Allow batched processing for all OPs based on the single-sample version of compute_stats/process methods to avoid modifying them to a batched version manually. #448
Enable unit test coverage report. #460
Support invoking API models for interaction with OpenAI-compatible APIs. #463 #479

Document Updates

Refine documentation system based on Sphinx. #245
Regular document updates. #234 #246
Update the class importing and document building logics for better automation. #299
Reorganize the operator documents for better reading. #472

Bugs Fixed

Fix the bug of non-existent videos returned by the video splitting function given a short duration. #243
Fix the bug that the produced multimodal data would be stored in nested dirs in different ops. #247
Fix some problems in demos. #244
Fix "Undefined punctuation_pattern" error in two OPs. #301
Exceptions and errors can be reraised to the upper level and the status code can be returned to the system correctly. #287
Fix the bug of out-of-work type hint checking for config files. #302
Fix the bug of parameters in the base classes that can not be parsed in some OPs. #311
Fix the memory leaking of video OPs. #374
Fix the bug of two OPs (video_aesthetics_filter and image_diffusion_mapper) that can not make use of GPUs. #389
Fix the bug of checkpoints not being restored correctly when the current process list has fewer OPs than the previous one. #391

Acknowledgment

Here we thank public contributors for their PRs to make Data-Juicer better!

@chg0901 helps to fix typos in documents. #237
@lingzhq helps to update the paper list in Awesome Data-Model Co-Development of MLLMs. #289
@shiweijiezero helps fix the bugs in updating the data keys. #300
@seanzhang-zhichen helps to support multiple patterns for replace_content_mapper. #319
@simplaj helps to fix a bug of a non-predefined attribute for video_captioning_from_summarizer_mapper. #343
@zhenqincn helps to reorganize the paper list and add more papers from our survey in Awesome Data-Model Co-Development of MLLMs. #352 #381 #456 #461
@2108038773 helps to add trust_remote_code argument for some public models on HuggingFace. #382 #385
@TobyJasper helps to fix typos in documents and contribute a new OP image_face_count_filter. #392 #452
@co63oc helps to fix some typos in documents and code. #427

- Python
Published by yxdyc over 1 year ago

https://github.com/modelscope/data-juicer - Release v0.2.0: Multimodal Support & DJ-SORA

New Features

🚀 We introduce DJ-SORA to provide open large-scale, high-quality datasets for SORA-like models. #227
🚀 We introduce hundreds of dedicated video, image, audio, text, and other multi-modal data processing operators and tools.
💥 Our paper has been accepted by SIGMOD'24 industrial track! #211
💥 "BetterMixture" — Our second data-centric LLM competition has kicked off and is about to end soon. #174

New OPs

Multimodal

video_frames_text_similarity_filter: keeps samples whose similarities between sampled video frame images and text are within a specific range. #227
video_tagging_from_frames_mapper: generates video tags from frames extracted from the video. #227
video_tagging_from_audio_mapper: generates video tags from audio streams extracted from videos. #227
video_captioning_from_video_mapper: generates captions from frame images extracted from video to augment datasets. #227
video_captioning_from_audio_mapper: captions a video according to its audio streams. #227
image_captioning_mapper: generates captions based on a language model and the image. This OP will increase the number of samples in the dataset. #131 #191 #227
image_captioning_from_gpt4v_mapper: generates captions based on GPT-4-Vision and the image. This OP will increase the number of samples in the dataset. #214 #227
image_diffusion_mapper: generates and augments the images based on the Stable Diffusion model and their original images and texts. This OP will increase the number of samples in the dataset. #200

Video

Filter
video_duration_filter: keeps samples whose videos' durations are within a specified range. #227
video_aspect_ratio_filter: filters samples according to the aspect ratios of videos (a fraction of width by height, r=w/h) in them. #227
video_resolution_filter: filters samples according to the resolution of videos in them. #227
video_ocr_area_ratio_filter: keeps samples whose detected text area ratios for specified frames in the video are within a specified range. #227
video_aesthetics_filter: filters samples according to the aesthetics score of frame images extracted from videos. #227
video_motion_score_filter: keeps samples with video motion scores within a specific range. #227

Mapper
video_split_by_scene_mapper: splits videos into scene clips. #227
video_split_by_duration_mapper: splits videos by specified duration interval. #227
video_split_by_key_frame_mapper: splits videos by their keyframes. #227
video_resize_aspect_ratio_mapper: resizes aspect ratios of videos (a fraction of width by height, r=w/h) to a specified range. #227
video_resize_resolution_mapper: maps videos to ones with a given resolution range. #227
video_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to video data more conveniently. #227

Deduplicator
video_deduplicator: deduplicates samples at document-level using exact matching of videos between documents. #227
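
Exact-match deduplication of the kind described for video_deduplicator can be sketched by hashing each sample's raw bytes and keeping only the first sample seen for each digest. This is a minimal illustration under assumptions: the choice of SHA-256, the video_bytes key, and the function names are hypothetical and are not taken from the Data-Juicer code.

```python
import hashlib
from typing import Dict, Iterable, List


def content_hash(data: bytes) -> str:
    """Return a hex digest that identifies the exact byte content."""
    return hashlib.sha256(data).hexdigest()


def deduplicate_exact(
    samples: Iterable[Dict[str, bytes]],
    key: str = "video_bytes",  # hypothetical field holding the raw video bytes
) -> List[Dict[str, bytes]]:
    """Keep the first sample for each distinct byte content, drop the rest."""
    seen = set()
    kept = []
    for sample in samples:
        digest = content_hash(sample[key])
        if digest in seen:
            continue  # an identical video was already kept
        seen.add(digest)
        kept.append(sample)
    return kept
```

Because only digests are stored, memory use grows with the number of distinct videos rather than with their total size.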

Audio
audio_duration_filter: keeps samples whose audios' durations are within a specified range. #177
audio_size_filter: keeps samples whose audios' sizes are within a specified range. #184
audio_nmf_snr_filter: keeps samples whose audios' signal-to-noise ratios (SNRs, computed with the Non-Negative Matrix Factorization algorithm) are within a specified range. #189
audio_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to audio data more conveniently. #227

Image
image_blur_mapper: adds random noise to images to blur them. #180
image_aesthetics_filter: filters samples according to the aesthetics scores of images. #227

Document Updates
"Bad" Data Exhibition EN ZH: shows how Data-Juicer finds "bad" data and what they look like.
Awesome LLM Data EN: a collection of awesome LLM datasets with fine-grained tags.
Developer Guide enhancement EN ZH: adds guides on how to accelerate the models in your OP with GPUs and how to implement a batched OP for sample augmentation. #203 #220
OP Insight Visualization Demo code: adds a demo to visualize how each OP works.

Bugs Fixed
Fix stats computation error in the ray mode due to the inappropriate initialization method. #173
Fix the bug that some images will be lost when converting their paths to absolute paths. #178
Fix the dependency problems of OPs that depend on other OPs. #181
Fix the bug that the predict.py tool gets stuck on the help page. #183
Fix face_area_filter: constrains the detection coordinates within the image. #202
Fix MMC4 conversion tools: resolves the situation where multiple images match the same sentence. #195
Fix or update invalid links in Data-Juicer. #201 #219

Others
Optimize the model management module. #196 #227
Optimize the unit test actions. #195 #196 #216 #227
Optimize the multiprocessing strategy; model inference efficiency can also be increased thanks to GPU support. #203 #217 #222 #227
Update the docker image with JDK. #208
Support more multimodal (video) dataset conversion tools: #227
- InternVid: 234M video-caption data
- Youku-mPLUG: 36TB video-caption data
- Video-ChatGPT: 100k video-instruction data
Optimize the generated multimodal data storage. #227
Support running data-juicer process jobs on Aliyun PAI-DLC. #227
Better support for multi-machine distributed data processing in Ray mode. #227

Acknowledgment

Here we thank public contributors for their PRs to make Data-Juicer better!
@liuyanyi helps to fix a bug in quality classifier tools. #183
@co63oc helps to fix some typos. #215
@liuyanyi helps to provide the solution to add JDK in the docker image. #182 #208
@zhenqincn helps to add more papers to the Awesome LLM Data doc. #226

- Python
Published by HYLcool about 2 years ago

https://github.com/modelscope/data-juicer - Release v0.1.3: support more Python versions; support multimodal data; more OPs; bugs fixed

New Features

Data-Juicer now supports Python 3.7-3.10!
- We released a pybind version of simhash-py library named simhash-pybind to solve the Python version limitation problem.
- We tested several version-dependent third-party libraries (e.g. dill, kenlm, ...) and validated their availability on different Python versions.
Multimodal dataset analysis and processing are now supported. #64 #91 #95 #106
- A novel intermediate multimodal sample format: using some special tokens to split text chunks and represent non-text information.
- Several dataset format conversion tools for popular multimodal datasets: LLaVA, MMC4, WavCaps, ......
- Lots of multimodal OPs are also released: see categories Image and Multimodal in the section New OPs below.
Auto-HPO tools are now available, which can help users find better hyperparameters for OPs according to specified objective functions or with simple 3-sigma rules only. #65 #140
Some content cleaning mappers (e.g. email, IP, ...) now support replacing regex patterns with specified strings, not just with empty ones. Additionally, a general version OP is implemented as a new OP replace_content_mapper. #143
Some collectors, metrics, and drawing functions are added to the analysis module to help users measure the token distribution of a single dataset or distribution difference between different datasets. #160

New OPs

Text
chinese_convert_mapper: converts Chinese between Traditional Chinese, Simplified Chinese, and Japanese Kanji (by opencc) #51
remove_non_chinese_character_mapper: removes non-Chinese characters in text samples. #51
text_action_filter: keeps samples containing action verbs in their texts. #122
text_entity_dependency_filter: keeps samples containing entity nouns related to other tokens in the dependency tree of the texts. #122
replace_content_mapper: replaces all content in the text that matches a specific regular expression pattern with a designated replacement string. #143
remove_repeat_sentences_mapper: removes repeated sentences in the text. #149

Image
image_shape_filter: keeps samples containing images with widths and heights within the specified ranges. #74
image_aspect_ratio_filter: keeps samples containing images with aspect ratios (w/h) within the specified range. #64
image_size_filter: keeps samples containing images whose sizes in bytes are within the specified range. #73
face_area_filter: keeps samples containing images with face area ratios within the specified range. #110
image_deduplicator: deduplicates samples at document-level using exact matching of images between documents. #72

Multimodal
image_text_similarity_filter: keeps samples with image-text feature cosine similarity within the specified range based on a CLIP model. #69
image_text_matching_filter: keeps samples with image-text classification matching scores within the specified range based on a BLIP model. #100
phrase_grounding_recall_filter: keeps samples whose locating/grounding recalls of phrases extracted from text in the images are within a specified range. #139

Bugs Fixed
Pin the dependency versions to pandas==2.0.0 and fsspec==2023.3.0 to avoid unexpected errors from third-party dependencies. #38 #42
Fix the bug when OPs nlpaug_en_mapper and nlpcda_zh_mapper generate indefinite numbers of augmented samples. #76
Fix the bug that maximum_line_length_filter might generate stats of inconsistent types (int vs. float), which leads to an error when processing datasets. #147
Fix the bug of missing attribute dataset_dir when the input dataset path is remote or a mixture of several datasets. #155 #157
Fix the command-line argument parsing error that occurs in some cases. #108 #165
Store simhash value as string type to avoid errors from PyArrow. #168 #170

Others
Dependency importing optimization: only require and import some dependencies when using. #35 #82
Release demos and datasets on HuggingFace, and release models trained with our refined datasets on both ModelScope and HuggingFace. #42 #54
Optimize the cache directory selection logic. #43
Support limiting the number of samples when mixing datasets. #86
Avoid extra unnecessary model preparation when enabling tokenization in some OPs. #99
OP language_id_score_filter now supports keeping samples in multiple languages. #125 #151

Acknowledgment

Here we thank public contributors for their PRs to make Data-Juicer better!
@JONGSKY helps to remove some unnecessary code. #85
@xuruidong helps to fix several broken links in the README doc. #142

- Python
Published by HYLcool over 2 years ago

https://github.com/modelscope/data-juicer - Release v0.1.2: more core functions are available now.

New OPs

nlpaug_en_mapper: simple data augmentation using nlpaug library for English corpus. #17
nlpcda_zh_mapper: simple data augmentation using nlpcda library for Chinese corpus. #17
token_num_filter: filter out samples by the number of tokens in them. HF tokenizers are supported. #24

New features

OP Fusion #14
- Filters that share the same contextual variables can now be fused into one OP, saving up to 25% of the time when processing datasets.
Cache management #19
- Cache management now works in Data-Juicer thanks to a new serialization method.
- Cache compression is supported: caches are automatically compressed when they are not needed and decompressed when they are needed again, which saves up to 50% of disk space.
Distributed data processing with Ray is supported now. #21
Config sys optimization:
- Only keep text_keys and remove previous misleading arg text_key(s)_to_process/load. #13
- A new argument export_in_parallel is added to control whether to export the result datasets in parallel. #17
- Display the config table after config parsing is ready. #17
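
The idea behind OP Fusion (#14) can be sketched as follows: when several Filters need the same intermediate value for a sample (here, the list of words), a fused OP computes it once, stores it in a shared context, and lets every Filter read it from there. This is a simplified illustration under assumptions, not the Data-Juicer implementation; all class and function names, and the split_calls counter, are hypothetical.

```python
from typing import Any, Dict, List, Optional

Context = Dict[str, Any]


def get_words(text: str, context: Context) -> List[str]:
    """Return the words of the text, computing them once per shared context."""
    if "words" not in context:
        context["words"] = text.split()
        # Counter kept only to show how often the expensive step runs.
        context["split_calls"] = context.get("split_calls", 0) + 1
    return context["words"]


class WordCountFilter:
    """Toy filter: keep texts whose word count is within [min_words, max_words]."""

    def __init__(self, min_words: int, max_words: int) -> None:
        self.min_words = min_words
        self.max_words = max_words

    def keep(self, text: str, context: Context) -> bool:
        return self.min_words <= len(get_words(text, context)) <= self.max_words


class UniqueWordRatioFilter:
    """Toy filter: keep texts whose ratio of distinct words is at least min_ratio."""

    def __init__(self, min_ratio: float) -> None:
        self.min_ratio = min_ratio

    def keep(self, text: str, context: Context) -> bool:
        words = get_words(text, context)
        if not words:
            return False
        return len(set(words)) / len(words) >= self.min_ratio


class FusedFilter:
    """Run several filters on one sample with a single shared context."""

    def __init__(self, filters: List[Any]) -> None:
        self.filters = filters

    def keep(self, text: str, context: Optional[Context] = None) -> bool:
        context = {} if context is None else context
        return all(f.keep(text, context) for f in self.filters)


fused = FusedFilter([WordCountFilter(2, 10), UniqueWordRatioFilter(0.5)])
```

Run separately, each Filter would split the text again with its own context; fused, the split happens once per sample, which is where the time saving comes from.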

Others

Replace original string constants with constant enums. #13
Expand the checkpoint protection range to cover the exporting process. #14
Remove extra intermediate variables storage in document_simhash_deduplicator to save more memory. #14
Docs updates. #15 #16
The PyPI package is available. You can now install data-juicer via pip install py-data-juicer. #23
Docker building is available now. The official docker image for Docker Hub is in progress. #23
Deploy the unit tests for Data-Juicer. #29

- Python
Published by HYLcool over 2 years ago

https://github.com/modelscope/data-juicer - Release v0.1.0, the first internal version for open-source

Summarization - Table of Contents

Data-Juicer: A Data-Centric Text Processing System for Large Language Models
Table of Contents
- Features
- Prerequisites
- Installation
- Quick Start
  - Data Processing
  - Data Analysis
  - Data Visualization
  - Build Up Config Files
  - Preprocess raw data (Optional)
- Documentation | 文档
- Data Recipes
- Demos
- License
- Contributing
- References

Features

Broad Range of Operators: Equipped with 50+ core operators (OPs), including Formatters, Mappers, Filters, Deduplicators, and beyond.
Specialized Toolkits: Feature-rich specialized toolkits such as Text Quality Classifier, Dataset Splitter, Analysers, Evaluators, and more that elevate your dataset handling capabilities.
Systematic & Reusable: Empowering users with a systematic library of reusable config recipes and OPs, designed to function independently of specific datasets, models, or tasks.
Data-in-the-loop: Allowing detailed data analyses with an automated report generation feature for a deeper understanding of your dataset. Coupled with real-time multi-dimension automatic evaluation capabilities, it supports a feedback loop at multiple stages in the LLM development process.
Comprehensive Processing Recipes: Offering tens of pre-built data processing recipes for pre-training, SFT, en, zh, and more scenarios.
User-Friendly Experience: Designed for simplicity, with comprehensive documentation, easy start guides and demo configs, and intuitive configuration with simple adding/removing OPs from existing configs.
Flexible & Extensible: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to implement your own OPs for customizable data processing.
Enhanced Efficiency: Providing a speedy data processing pipeline requiring less memory, optimized for maximum productivity.

- Python
Published by yxdyc almost 3 years ago