Recent Releases of https://github.com/modelscope/data-juicer
https://github.com/modelscope/data-juicer - Release v1.4.2: Python 3.11 & 3.12 are supported; Data Attribution OPs; External OPs are supported; Install with "uv"
Major Updates
- Data-Juicer is now compatible with Python 3.11 & 3.12. #749
- 5 OPs for data attribution are added. #735
- Data-Juicer now supports registering and applying custom OPs in external paths using the argument custom_operator_paths. #758
- "uv" is now the first choice for installing Data-Juicer due to its capability to resolve dependency conflicts. #760
New Operators
Filter
- Validation-free
  - llm_perplexity_filter: Filter to keep samples whose perplexity score, computed using a specified LLM, falls within a specific range. #735
  - instruction_following_difficulty_filter: Filter to keep texts whose instruction-following difficulty (IFD, https://arxiv.org/abs/2308.12032) falls within a specific range. #735
- Validation-based
  - in_context_influence_filter: Filter to keep texts whose in-context influence on the validation set falls within a specific range. #735
  - llm_task_relevance_filter: Filter to keep samples with a high relevance score to validation tasks, as estimated by an LLM. #735
  - text_embd_similarity_filter: Filter to keep texts whose average embedding similarity to a set of given validation texts falls within a specific range. #735
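The idea behind an embedding-similarity filter such as text_embd_similarity_filter can be sketched in a few lines. This is a minimal illustration, not Data-Juicer's implementation: the function names, the min_score/max_score parameters, and the use of cosine similarity are assumptions made for the example, and the embeddings are plain Python lists rather than model outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def avg_similarity(embedding, validation_embeddings):
    """Average similarity of one sample embedding to all validation embeddings."""
    sims = [cosine(embedding, v) for v in validation_embeddings]
    return sum(sims) / len(sims)


def keep_sample(embedding, validation_embeddings, min_score=0.1, max_score=1.0):
    """Hypothetical filter rule: keep the sample if its score is inside the range."""
    score = avg_similarity(embedding, validation_embeddings)
    return min_score <= score <= max_score
```

A sample whose embedding points in the same direction as the validation set is kept, while an orthogonal one scores 0 and is dropped under the default range.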
Enhancements
- A new environment variable DATA_JUICER_EXTERNAL_MODELS_HOME is added to allow specifying private or read-only paths for storing external and extra models. #740
- Optimize the video link transformation and multi-version maintenance in the docs. Update demo videos with higher-resolution versions. #746
- Support custom save_dir for OPs that produce extra multimodal data. #751
- Add official and detailed docs about Data-Juicer Agent. #759
- Enhance unit tests: show the name of the current test case; recycle resources after each test case in ray mode. #749
- Refine the developer guide for better practices on building new OPs. #760
Bugs Fixed
- Move the updating of special tokens of multimodal data into the initialization of base_op, which fixes the bug that special tokens might not be synced with the main process when processing data in parallel. #752
- Fix some test cases. #754
Acknowledgement
- @ShenQianli made their first contribution to 5 new OPs. #735
Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.4.1...v1.4.2
- Python
Published by HYLcool 10 months ago
https://github.com/modelscope/data-juicer - Release v1.4.1: MCP server; GPU-based Minhash deduplicator; Improved unit test coverage.
Major Updates
- Introduce the Data-Juicer MCP server. Users can conveniently use the data processing capabilities through MCP. #690 #737
- Unit test coverage rate is improved to 85%+ and several bugs in test cases are resolved (OOM, encoding error, and so on), which makes Data-Juicer more reliable. #698 #717 #720 #727
- GPU-based Minhash deduplication is supported, in collaboration with developers from Nvidia. #694 #644
- RayExporter supports more formats to export a ray dataset in addition to json/jsonl. #687
- Two demo videos are added to introduce the Data-Juicer core functions, agentic usages, and sandbox. #738
New Operators
- download_file_mapper: downloads data from URLs to local files or specified fields. #709
Enhancements
- New analysis method: correlation analysis among stats is added. #663
- Several core dependencies are updated and fixed to a newer version, and dependency conflicts are resolved. #715 #717 #723
- The EasyAnimate pipelines in the sandbox are updated to follow the refactoring of sandbox. #710
- Apply more reliable pre-commit tools to improve the code style of Data-Juicer. #714
- Support storing and processing bytes data of images in the dataset. #725
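As a rough illustration of what a correlation analysis among per-sample stats computes, the sketch below builds a pairwise correlation table. The release notes do not say which coefficient is used, so the choice of Pearson correlation, the function names, and the stat names in the usage note are all assumptions for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if one is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(stats):
    """Map each (stat_a, stat_b) pair to its correlation across samples.

    `stats` is a dict of stat name -> list of per-sample values.
    """
    keys = sorted(stats)
    return {(a, b): pearson(stats[a], stats[b]) for a in keys for b in keys}
```

For example, correlation_matrix({"text_len": [1, 2, 3, 4], "num_words": [2, 4, 6, 8]}) reports a correlation close to 1 between the two hypothetical stats, which suggests that filtering on both would be largely redundant.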
Bugs Fixed
- The wheel & docker image building bug is fixed. #706
- Fix bugs in log_summarization. #710
- Fix "no module named data_juicer" error after installing from the wheel file. #727
Acknowledgement
- @fanronghai helps to fix the param error in the dataset_splitting_by_language tool. #713
- @ayushdg helps to support a GPU-version Minhash deduplicator. #644
- @ricksun2023 helps to fix the bugs when there are more than one same-name OPs in the configs. #730
Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.4.0...v1.4.1
Published by HYLcool 11 months ago
https://github.com/modelscope/data-juicer - v1.4.0 Major Refactor for Env Management, Doc, Sandbox; Derivative Works (TPAMI Survey; Trinity-RFT & DetailMaster)
Summary: 200+ files changed with 18,535 additions and 3,720 deletions.
Major Refactors & Improvements
Sandbox Usability (#686):
- Support for multiple pipelines, context info, and an environment manager to run different commands in various environments.
- Includes the InternVL example as a showcase.
DJ-Doc Redesign (https://github.com/modelscope/data-juicer/pull/675):
- Now with multilingual support (English / Chinese) and a modernized style.
Dependency Management Update (#660, #680):
- Migrated to uv for faster dependency resolution.
- Added sub-groups for better organization.
New Features & Integrations (#683, #688, #692)
Additional Repo Supported:
- Trinity-RFT is now supported by Data-Juicer.
DJ-Awesome-List:
- A survey paper accepted by TPAMI'25!
Synthetic Benchmark Added:
- DetailMaster: a new benchmark for synthetic data evaluation.
New Operators Introduced (#673, #701):
- llm_analysis_filter
- general_field_filter
Core Optimizations & Bug Fixes
Ray Executor Enhancements (#697):
- File extension detection added.
- Support for more data formats.
Startup Time Optimization:
- Improved startup performance. (#684)
Text Embedding Support:
- Added support for text embedding via API and local model. (#681)
Docker Build Improvement:
- Ignore installed distutils libraries during Docker image building. (#668)
Mapper Module Fix:
- Fixed issue with module initialization. (#700)
Warning Suppression:
- Suppressed unnecessary warnings from fasttext. (#696)
Full Changelog
View all changes since v1.3.3
Published by yxdyc 12 months ago
https://github.com/modelscope/data-juicer - Release v1.3.3: Sandbox is accepted as Spotlight by ICML 2025; Add Img-Diff recipes.
Major Updates
- Our work on the Data-Juicer Sandbox has been accepted as a Spotlight by ICML 2025 (top 2.6% of all submissions)!
- Add new OPs and recipes for Img-Diff. #658
Enhancements
- Support HF LLMs for the two llm_xxx_score_filter OPs. #655
- Sync the docker image to Aliyun OSS for downloading when Docker Hub is not accessible. #657
- Split standalone and distributed unit tests to save time when re-running failed ones. #666
Bugs Fixed
- Address possibly missing cfg in unify_format. #653
- Improve clarity & fix bad links for some docs. #659
Acknowledgement
- @co63oc helps to fix some typos. #654
Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.3.2...v1.3.3
Published by HYLcool about 1 year ago
https://github.com/modelscope/data-juicer - Release v1.3.2: Enhancements on usability & two OPs; some bugs fixes
What's Changed
- Human OP enhancements, in https://github.com/modelscope/data-juicer/pull/642 https://github.com/modelscope/data-juicer/pull/645
- update label-studio version
- make service script more robust
- add documentation
- optimizing fields mapping
- OP efficiency optimization of document_minhash_deduplicator, in https://github.com/modelscope/data-juicer/pull/639
- set temp_parser.usage to argparse.SUPPRESS to skip excessive help logs, in https://github.com/modelscope/data-juicer/pull/643
- fix date typo in https://github.com/modelscope/data-juicer/pull/648
- Fix docker building failure in https://github.com/modelscope/data-juicer/pull/650
- Fix StreamToLoguru compatibility issue with torch._dynamo in https://github.com/modelscope/data-juicer/pull/651
- add init file for annotation module, fix dj-process command error in https://github.com/modelscope/data-juicer/pull/652
New Contributor
- @cmgzn made their first contribution in https://github.com/modelscope/data-juicer/pull/651
Published by yxdyc about 1 year ago
https://github.com/modelscope/data-juicer - Release v1.3.1: added HumanOPs & fixed some bugs
Major Updates
- Prototype implementation for HumanOps (annotation). #617 Included features:
- boilerplate code for supporting label studio powered human annotation ops
- a human preference annotation reference implementation is provided
- label studio service script; can start up local instance using docker or pip, whichever is available
- reference configs and data
- event driven and notification mixins framework for ops
New OPs
- extract_tables_from_html_mapper: extract tables from html texts. #634
- general_fused_op: an explicitly fused operator designed to execute multiple sequential operations (OPs) on the same batch, enabling fine-grained control over data processing. #626
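The notion of an explicitly fused operator can be sketched as a function that runs several OPs back to back on one batch before moving to the next batch. This is only a conceptual sketch: the fuse helper and the two toy OPs below are hypothetical and do not mirror the general_fused_op interface.

```python
def fuse(ops):
    """Return one callable that applies `ops` in order to the same batch."""

    def fused(batch):
        for op in ops:
            batch = op(batch)
        return batch

    return fused


def lowercase(batch):
    """Toy mapper: lowercase the text of every sample."""
    return [{**sample, "text": sample["text"].lower()} for sample in batch]


def drop_short(batch, min_len=5):
    """Toy filter: drop samples whose text is shorter than `min_len` characters."""
    return [sample for sample in batch if len(sample["text"]) >= min_len]


pipeline = fuse([lowercase, drop_short])
result = pipeline([{"text": "Hello World"}, {"text": "Hi"}])
```

Here result contains only the lowercased "hello world" sample. Running the OPs together on one batch avoids materializing the intermediate dataset between them.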
Bug Fixed
- fix dataset builder initialization failure #630
- update Executor references from Executor to DefaultExecutor #632 #633
- switch the backend of plt to avoid sub-process/thread error #633
- fix some boundary condition bugs in several deduplicators #635 #637
Others
- check the dataset when loading to support passing a dataset to the DefaultExecutor.run method. #633
- update docs to highlight the light env installation part. #636
Acknowledgement
- @liuyuhanalex helps to add a new OP and fix some of the boundary condition bugs. #634 #635
Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.3.0...v1.3.1
Published by HYLcool about 1 year ago
https://github.com/modelscope/data-juicer - Refactor of dataset builder and executor!
The Big Change
Refactor of dataset builder and executor, see https://github.com/modelscope/data-juicer/pull/537, @cyruszhang
- YAML explicitly defines different sources of datasets; local and remote are defined separately.
- More flexible parameterized control; supports source-specific parameters, validations, and extensible configurations.
- Remove the Executor's hardcoded format support: it is no longer restricted to local JSON formats; the input format is determined dynamically via formatters/downloaders.
- Enhanced Executor extensibility to natively support engines like Nemo, Dask, Spark, etc.
- Add data format validation to ensure consistency and correctness.
- Expanded data source support:
  - ModelScope integration.
  - ArXiv dataset import (download, decompress, ingest).
  - Wikipedia dataset support (download, decompress, ingest).
  - Common Crawl integration (download, decompress, ingest).
- Backward compatibility with existing dataset_path command-line syntax.
- Support for data mixtures to combine multiple datasets dynamically.
- Support for empty formatters/generated datasets without pre-defined config files.
Others
- New audio processing operator: audio_add_gaussian_noise (PR #622), @liuyuhanalex
- Added dynamic coverage rate badge to the README for transparency (PR #625)
Published by yxdyc about 1 year ago
https://github.com/modelscope/data-juicer - Release v1.2.2
Major Updates
- Add document for API service. Add parameter transmission using json.dumps to support API calls for arbitrary registration functions and classes. #613
- Add unit tests for the analysis module and utils module to increase test coverage. #604 #616
A new data synthesis method is proposed, which encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency, cross-modality generalization, and SFT effects over SOTA baselines (e.g., 16% gain on MathVision using only 400 samples). See more details in MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions.
New OPs
- llm_quality_score_filter: Filter to keep samples with a high quality score estimated by an LLM, supporting API calling and local vLLM calling. #606 #614 #620
- llm_difficulty_score_filter: Filter to keep samples with a high difficulty score estimated by an LLM, supporting API calling and local vLLM calling. #606 #614 #620
Others
- Fix config in LLaVa pretrain recipe. #610
- Update news for MindGYM and fix doc. #615
- Fix decode error through UTF-8 decoding. #618
Published by BeachWang about 1 year ago
https://github.com/modelscope/data-juicer - Release v1.2.1
Major Updates
- DJ has been integrated into Ray's official Ecosystem and Example Gallery. Besides, our patch in DJ2.0 for the streaming JSON reader has been officially integrated by Apache Arrow.
- Our work on contrastive data synthesis, ImgDiff, has been accepted by CVPR 2025!
- Unit test optimization:
  - split unit tests into partial and regression tests: the partial test is triggered by a PR and only runs the test cases corresponding to the changed files; the regression test runs all cases and is triggered at 7:00 every Friday, Beijing time. #598
  - use the built-in @unittest.skip and remove SKIPPED_TESTS. #586
  - upload test coverage reports to GitHub artifacts. #586
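For readers unfamiliar with the standard-library decorator mentioned above, the following sketch shows how @unittest.skip marks a test as skipped in place, with no separate list of skipped tests to maintain. The test class and the skip reason are made up for illustration and are not taken from the Data-Juicer test suite.

```python
import unittest


class DemoOpTest(unittest.TestCase):
    @unittest.skip("hypothetical reason: needs a GPU that CI does not provide")
    def test_gpu_only(self):
        # Never executed: the decorator reports the test as skipped instead.
        self.fail("should not run")

    def test_cpu(self):
        self.assertEqual("Data-Juicer".lower(), "data-juicer")


def run_demo():
    """Run the demo test case and return the collected TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DemoOpTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

The returned result lists the skipped test together with its reason in result.skipped, while the run as a whole still counts as successful.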
New OPs
image_remove_background_mapper: remove the background of images. #589
Others
- add missing LOADED_AUDIOS to ALL_INTER_VARS to enable OP fusion and context sharing. #585
- only build doc for py3.10. #586
- move dependency on ray to minimal requirements. #586 #594 #595
- allow executor and other tool functions to consume a loaded dataset in addition to the config file. #596 #597
- fix undefined fileno bug of the logger. #594
Acknowledgement
- @liuyuhanalex helps simplify the code logic of OP fusion, add a new OP image_remove_background_mapper, and fix some minor bugs. #581 #585 #589
- @co63oc helps to fix typos in code and documents. #582 #583 #588 #591 #593
- @danielhjz helps to fix the implicit memory leak problem in image_nsfw_filter. #590
Published by HYLcool over 1 year ago
https://github.com/modelscope/data-juicer - v1.2.0 Doc refactored; New algorithm proposed
What's New
- The DJ doc is refactored and improved, e.g., RecipeGallery, DeveloperGuide, DistributedProcess, DJ-related Competitions, typos, and bad links.
- More unit tests added.
- The data pre-split and export are improved.
- A new data selection method, DaaR, is proposed. See Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data.
Detailed PRs
- fix export error when export_stats columns is null in https://github.com/modelscope/data-juicer/pull/557
- Resplit input dataset in ray mode in https://github.com/modelscope/data-juicer/pull/549
- Refactor and improve doc for RecipeGallery, DeveloperGuide, DistributedProcess and DJ-related Competitions in https://github.com/modelscope/data-juicer/pull/561
- Resolve most skipped unit-tests in https://github.com/modelscope/data-juicer/pull/559
- fix translation error in https://github.com/modelscope/data-juicer/pull/562
- Add unittest for ray text dedup in https://github.com/modelscope/data-juicer/pull/540
- [Typo]correct a small typo in https://github.com/modelscope/data-juicer/pull/563
- update the 2.0 paper link & the DaaR news in https://github.com/modelscope/data-juicer/pull/566
- Fix typos in https://github.com/modelscope/data-juicer/pull/571
- Optimization for sdxl_prompt2prompt_mapper dependency importing in https://github.com/modelscope/data-juicer/pull/570
- Fix typos in https://github.com/modelscope/data-juicer/pull/572
Acknowledgment
- @liuyuhanalex @co63oc made their first PRs
Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.1.0...v1.2.0
Published by yxdyc over 1 year ago
https://github.com/modelscope/data-juicer - Release v1.1.0
Major Updates
- Users can now run ray-based distributed data processing under the guidance of added docs. #523
- The DJ-Cookbook has gathered numerous high-quality data processing recipes from various vertical fields, and the related documents have been updated on the homepage. #542
- Change Task mode to Actor mode for ray deduplication, allowing users to use these operators without installing Redis. #526
- Append a log summary of warnings and errors at the end of a run so that they remain visible under the sample fault-tolerance mechanism. #534
- Automatically update relevant documents when adding OPs to reduce the development burden. #527
- Add usability tags for OPs:
  - alpha tag for OPs in which only the basic OP implementations are finished;
  - beta tag for OPs in which unittests are added based on the alpha version;
  - stable tag for OPs in which OP optimizations related to DJ (e.g. model management, batched processing, OP fusion, ...) are added based on the beta version.
New OPs
- image_segment_mapper: Perform segment-anything on images and return the bounding boxes. #550
- mllm_mapper: Mapper to use MLLMs to generate texts for images. #550
- sdxl_prompt2prompt_mapper: Use the generative model SDXL and image editing technique Prompt-to-Prompt to generate pairs of similar images. #550
- sentence_augmentation_mapper: Augment sentences using LLMs. #550
- text_pair_similarity_filter: Filter samples according to the similarity score between the text pair. #550
Bug Fixed
- Add a global skip_op_error param to enable fault tolerance when executing the Data-Juicer analyzer and executor, but disable fault tolerance for unit tests. #528
- Fix model force download bug. #529
- Fix IndexError raised when the number of samples in the result dataset is less than the number of workers when saving the dataset to disk. #536
- Fix missing field meta tag on ray mode. #538
- Update max_tokens or max_new_tokens for vllm-based OPs to avoid too short generation. #544
- Fix bug in the role playing data generation demo. #545
Others
- Enhance unit test for API calling OPs. #528
- Remove sandbox requirements installation from Dockerfile. #530
- Update the datasource related APIs to be compatible with the latest version of Ray. #532
- Limit the generated qa num for each text in generate_qa_from_text_mapper. #541
- Update docs for preparing DJ2.0 release. #542
- Update a quick cdn link for arch figure. #543
- Add a video demo for role playing data generation. #545
- Optimize op doc for global textual search. #552
- Use a more stable and fast translator than google translator for automatic OP doc building. #554
Acknowledgement
- @Qirui-jiao made great contributions to enrich the Data-Juicer OP pool. #550
Published by BeachWang over 1 year ago
https://github.com/modelscope/data-juicer - Release v1.0.3: More Powerful Distributed MinHashLSH Deduplicator; Post-Tuning Formats & OPs; Ray Actor for GPU-based OPs
Major Updates
- Support Ray-based MinHashLSH deduplicator, which implements a multi-process Union-Find set based on Ray Actor and the BTS algorithm to complete equivalence class merging. #502
- Support post-tuning dataset formats in LLaMA-Factory and ModelScope-Swift.
  - Data-Juicer chooses the Query-Response format as the intermediate format for the post-tuning dataset. #514
  - Refine the overall intermediate format of Data-Juicer to support various dataset formats better. (meta, stats) #514 #518
  - Provide several format conversion tools for converting to Data-Juicer format and vice versa. #514
- Add 10 more post-tuning OPs to process post-tuning datasets better. They are listed in detail in the New OPs section below. #513
- Support Ray Actor mode for GPU-based OPs. #511
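Equivalence class merging, the step that the Ray-based MinHashLSH deduplicator distributes across actors, can be illustrated with a plain single-process Union-Find. The sketch below is not the distributed BTS-based implementation referenced above: the class, the deduplicate helper, and the rule of keeping the smallest ID per class are assumptions chosen to keep the example short.

```python
class UnionFind:
    """Single-process disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # The smaller ID becomes the root, so each class root is its minimum ID.
            self.parent[max(root_a, root_b)] = min(root_a, root_b)


def deduplicate(sample_ids, duplicate_pairs):
    """Keep one representative (the smallest ID) per class of near-duplicates."""
    uf = UnionFind()
    for a, b in duplicate_pairs:
        uf.union(a, b)
    return [i for i in sample_ids if uf.find(i) == i]
```

Given candidate pairs produced by LSH bucketing, for example (1, 2) and (2, 4), samples 1, 2, and 4 collapse into one class and only sample 1 survives.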
New OPs
Post-tuning OPs for fine-grained analysis of dialog data. #513
Mapper
- dialog_intent_detection_mapper: Mapper to generate user's intent labels in feedback dialog data.
- dialog_sentiment_detection_mapper: Mapper to generate user's sentiment labels in feedback dialog data.
- dialog_sentiment_intensity_mapper: Mapper to predict user's sentiment intensity (from -5 to 5 in default prompt) in feedback dialog data.
- dialog_topic_detection_mapper: Mapper to generate user's topic labels in feedback dialog data.
- query_intent_detection_mapper: Mapper to predict user's intent label in a query.
- query_sentiment_detection_mapper: Mapper to predict user's sentiment label ('negative', 'neutral' and 'positive') in a query.
- query_topic_detection_mapper: Mapper to predict user's topic label in a query.
Aggregator
- meta_tags_aggregator: Merge similar meta tags into one tag.
Selector
- tags_specified_field_selector: Select samples based on the tags of a specified field.
Grouper
- naive_reverse_grouper: Split batched samples into individual samples.
Bug Fixed
- Fix the wrong argument passing in generate_qa_from_example_mapper. #517
- Update the out-of-date Dingding QR code on the main page. #513
Acknowledgement
- @jackylee-ch made their first contribution to help fix several invalid links in the document. #521
Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.0.2...v1.0.3
Published by HYLcool over 1 year ago
https://github.com/modelscope/data-juicer - Release v1.0.2
Major Updates
- Added more mapper/grouper/aggregator OPs for post-tuning scenarios.
- Optimized the distributed mode performance and usability with more automatic features.
DJ-Operators
- extract_support_text_mapper, relation_identity_mapper, python_file_mapper, https://github.com/modelscope/data-juicer/pull/500
- naive_grouper, key_value_grouper, https://github.com/modelscope/data-juicer/pull/500
- nested_aggregator, entity_attribute_aggregator, most_relavant_entities_aggregator, https://github.com/modelscope/data-juicer/pull/500
- video_extract_frames_mapper, https://github.com/modelscope/data-juicer/pull/507
Performance
- Optimize ray mode performance, https://github.com/modelscope/data-juicer/pull/442
- Patch for Performance Benchmark in CI/CD workflows, https://github.com/modelscope/data-juicer/pull/506
- DJ Ray mode supports streaming loading of jsonl files, https://github.com/modelscope/data-juicer/pull/515
Usability and Analysis
- support dj-install in recipe-level, https://github.com/modelscope/data-juicer/pull/508
- support dj-analyze with --auto mode, https://github.com/modelscope/data-juicer/pull/512
- support op-wise insight auto mining, https://github.com/modelscope/data-juicer/pull/516
Acknowledgment
Thanks to Data-Juicer users and contributors for their helpful feedback, issues and PRs!
Published by yxdyc over 1 year ago
https://github.com/modelscope/data-juicer - Release v1.0.1
Major Updates
- Support automatically arranging operators from fastest to slowest based on their execution speed, and automatically setting the operator batch size according to the execution speed. #464
- [UnitTest] Performance benchmark for efficiency tests of 4 modalities. Reports will be uploaded to an internal wandb server. #483
- Added some useful OPs, including the construction of DPO training data and a lightweight user-customizable OP interface. See more details below. #491 #492 #493
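The speed-based arrangement described above can be pictured as a sort over measured throughputs, followed by a batch size derived from the same measurement. The sketch below is a guess at the general shape rather than the actual scheduling logic: both functions, the samples-per-second unit, and the one-second target per batch are hypothetical.

```python
def reorder_by_speed(op_speeds):
    """Return OP names ordered from fastest to slowest.

    `op_speeds` maps an OP name to its measured throughput in samples/second.
    """
    return sorted(op_speeds, key=op_speeds.get, reverse=True)


def batch_size_for(speed, target_seconds=1.0, min_size=1, max_size=1000):
    """Hypothetical rule: size each batch to take about `target_seconds` to process."""
    return max(min_size, min(max_size, int(speed * target_seconds)))


measured = {"language_filter": 10.0, "length_filter": 500.0, "clean_mapper": 50.0}
order = reorder_by_speed(measured)
sizes = {name: batch_size_for(measured[name]) for name in order}
```

Placing cheap filters first means that slower OPs later in the chain see fewer samples, which is the usual motivation for this kind of reordering.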
OPs
Text OPs
- pair_preference_mapper: Mapper to construct preference answers for QA pairs. #491
Script OPs
- python_lambda_mapper: Mapper for executing customized Python lambda functions on data samples. #492
- python_file_mapper: Mapper for executing customized Python functions on data samples. #493
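To make the idea of a lambda-driven mapper concrete, the sketch below turns a user-supplied lambda string into a function and applies it to each sample. It is an illustration under assumptions, not the python_lambda_mapper implementation: build_lambda, apply_mapper, and the validation step are invented for the example, and eval offers no sandboxing, so the string must come from a trusted source.

```python
import ast


def build_lambda(lambda_str):
    """Parse `lambda_str` and return the function, rejecting anything but a lambda."""
    tree = ast.parse(lambda_str, mode="eval")
    if not isinstance(tree.body, ast.Lambda):
        raise ValueError("expected a single lambda expression")
    return eval(compile(tree, "<lambda_str>", "eval"))


def apply_mapper(samples, lambda_str):
    """Apply the lambda to a shallow copy of each sample and collect the results."""
    fn = build_lambda(lambda_str)
    return [fn(dict(sample)) for sample in samples]


upper = apply_mapper(
    [{"text": "Hi"}, {"text": "there"}],
    "lambda s: {**s, 'text': s['text'].upper()}",
)
```

After this runs, upper holds the two samples with their text upper-cased. Checking the parsed tree before evaluation catches strings that are not lambdas, such as plain expressions or statements.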
Bugs Fixed
- Add an argument to control whether to open Monitor for data processing. It's True by default. #483
- For the mp start method of monitor, set it to "spawn" for Windows systems and "fork" for others. #483
- Update transformers version to >=4.47.0 to avoid "shape mismatch" bug from older version 4.46.3. #483
- Fix the logic errors in Turbo acceleration and batch processing, and ensure that map and filter are consistent in this part of the logic. #504
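The platform-dependent start method mentioned in the monitor fix above maps directly onto the standard multiprocessing API. The following sketch shows one way to express that rule; the helper names are hypothetical and the code is not taken from the Data-Juicer monitor.

```python
import multiprocessing
import platform


def pick_start_method(system=None):
    """Return "spawn" on Windows and "fork" elsewhere, as the fix describes."""
    system = system or platform.system()
    return "spawn" if system == "Windows" else "fork"


def monitor_context():
    """Build a multiprocessing context that uses the chosen start method."""
    return multiprocessing.get_context(pick_start_method())
```

Using get_context rather than set_start_method keeps the choice local to the caller, so other code in the same process can still use a different start method.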
Others
- Pin the PyAV version to prevent inconsistent updates. #504
- Skip some unit test for audio OPs to avoid lazy_loader failure during multiprocessing. #503
- Remove unnecessary UNFORKABLE marks for some OPs. #491
- Refine the docker image building. Add a new self-hosted runner for docker image building, optimize the building logic for auto docker image building on release, change the default full image to a GPU-version image. #494 #501
Acknowledgment
Here we thank public contributors for their PRs and issues to make Data-Juicer better!
Published by BeachWang over 1 year ago
https://github.com/modelscope/data-juicer - Release v1.0.0: Refactor DJ-Dataset & DJ-Operator, Sandbox, and more exciting features!
Major Updates
- Refactor Data-Juicer Operator & Dataset for better usability! We combine our two backends, HuggingFace Dataset and Ray Dataset, into a unified DJ-Dataset, and unify and introduce new invoking interfaces. Based on this, we add a fault-tolerant strategy during the data processing, helping users to know the actual reasons for processing failure. #359 #366
- [Experimental] Data-Juicer Sandbox toolkit is now available! Users are allowed to develop datasets and models in a co-development way with the highly customizable Sandbox to obtain better performance. For more details, please refer to the docs. #273 #291 #312 #332 #364
- Basic API server based on FastAPI is now available in Data-Juicer! Now users can make use of the capabilities of OPs with API service. #468
- Support adaptive resource management:
  - Adaptive number of processors for model-based OPs according to the GPU memory and other types of resource utilization. #270 #329 #354
  - Adaptive batch size for batched OPs according to their resource utilization to maximize the OP speed. #429
- We presented a tutorial, Multi-modal Data Processing for Foundation Models: Practical Guidance and Use Cases, at KDD'24. #310
- A lot of additions and improvements were made to OPs, DJ-Engine, and CI/CD. See more details below.
- A playground for Data-Juicer is opened for user trial. #277 #368
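Adaptive resource management for model-based OPs essentially caps the number of worker processes by what the GPU can hold. The function below is a hypothetical sketch of such a cap: its name, its parameters, and the simple floor-division rule are assumptions and do not reproduce Data-Juicer's actual heuristics.

```python
def adaptive_num_proc(requested, cpu_count, free_gpu_mem_gb=None, mem_required_gb=None):
    """Cap the worker count by CPU cores and, for model-based OPs, by GPU memory.

    `mem_required_gb` is the assumed memory footprint of one model replica.
    """
    limit = min(requested, cpu_count)
    if free_gpu_mem_gb is not None and mem_required_gb:
        # Each worker loads its own model copy, so memory bounds the replica count.
        limit = min(limit, int(free_gpu_mem_gb // mem_required_gb))
    # Always run at least one worker, even if memory looks insufficient.
    return max(1, limit)
```

With 24 GB free and a model that needs 5 GB per replica, a request for 16 workers on an 8-core machine is reduced to 4.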
OPs
Text
- ray_document_deduplicator: supports Ray-based distributed exact-match deduplication for text-only datasets. #263
- Support sentencepiece tokenizer for MinHash deduplicators. #269
- generate_qa_from_text_mapper: generates question and answer pairs from input texts. #333 #454
- generate_qa_from_examples_mapper: generates question and answer pairs based on examples. #338 #454
- optimize_qa_mapper: optimizes the question-answer pairs in question-answering samples. #338 #454
- optimize_query_mapper: optimizes the query in question-answering samples. #338 #454
- optimize_response_mapper: optimizes the response in question-answering samples. #454
- calibrate_qa_mapper: calibrates question-answer pairs based on reference text. #463
- calibrate_query_mapper: calibrates query in question-answer pairs based on reference text. #463
- calibrate_response_mapper: calibrates response in question-answer pairs based on reference text. #463
- text_chunk_mapper: splits input text to chunks. #481
- extract_entity_attribute_mapper: extracts attributes for given entities from the text. #481
- extract_entity_relation_mapper: extracts entities and relations in the text for knowledge graph. #481
- extract_event_mapper: extracts events and relevant characters in the text. #481
- extract_keyword_mapper: generates keywords for the text. #481
- extract_nickname_mapper: extracts nickname relationship in the text. #481
Image
- image_face_blur_mapper: blurs faces detected in images. #249
- image_nsfw_filter: keeps samples containing images with NSFW scores below the threshold. #252
- image_watermark_filter: keeps samples containing images with predicted watermark probabilities below the threshold. #256
- ray_image_deduplicator: supports Ray-based distributed exact-match deduplication for image or image-text datasets. #263
- image_pair_similarity_filter: keeps image pairs with image feature cosine similarity within the specified range based on a CLIP model. #393
- image_tagging_mapper: generates image tags from the input images. #423
- image_face_count_filter: keeps samples containing images with face counts within the specified range. #446
Video
- video_face_blur_mapper: blurs faces detected in videos. #253
- video_remove_watermark_mapper: removes the watermarks in given regions from the videos. #236
- video_nsfw_filter: keeps samples containing videos with NSFW scores below the threshold. #252
- video_watermark_filter: keeps samples containing videos with predicted watermark probabilities below the threshold. #256
- ray_video_deduplicator: supports Ray-based distributed exact-match deduplication for video or video-text datasets. #263
- video_tagging_from_frames_filter: keeps samples containing videos with given tags. #260
- video_captioning_from_frames_mapper: generates samples whose captions are generated based on an image-to-text model and sampled video frames. Captions from different frames will be concatenated into a single string. #257
- video_captioning_from_summarizer_mapper: generates video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, ...). #250
- video_motion_score_raft_filter: keeps samples with video motion scores (based on RAFT model) within a specific range. #478
- Enhance the video_motion_score_filter to support float sampling FPS, frame resizing, optical flow magnitude normalization, and so on. #361
Misc.
- Switch face detection used in 3 OPs (image_face_ratio_filter, image_face_blur_mapper, video_face_blur_mapper) from dlib to OpenCV to avoid dependency problems. #320
- Deduplicators for multimodal datasets are allowed to consider text information as well. #313
- Support batched processing for some OPs. #406 #435
Others (Engine, Job Control and Tools)
- Support more multimodal (video) dataset conversion tools: MSR-VTT #248
- Support distributed processing script for Slurm. #242
- Support Minhash-LSH deduplication tools based on Spark. #290
- Enable GPU usage for Ray executor. #274
- Add debug mode for Data-Juicer. #303
- Add video generation tools for several metrics. #273 #312
- Deploy a self-hosted runner for unit tests and enable unit tests for Ray mode. #304
- Add sampled frames from videos for video OPs to support OP fusion. #271
- Allow to save stats for each OP respectively by specifying the exporting paths for them. #309
- Add a new field to record the source files of multimodal data when they are augmented or regenerated by some OPs, so it's convenient to trace back. #317
- Support turbo mode to disable some processing-unrelated functions to maximize the processing speed and save resource utilization. #402
- Update type annotations from jsonargparse to Pydantic. #422
- Add a Monitor module to monitor the resource utilization during data processing for each OP. #429
- Allow lazy importing for third-party libraries and installing dependencies if they are not installed. #414 #443
- Allow batched processing for all OPs based on the single-sample version of compute_stats/process methods to avoid modifying them to a batched version manually. #448
- Enable unit test coverage report. #460
- Support invoking API models for interaction with OpenAI-compatible APIs. #463 #479
Document Updates
- Refine documentation system based on Sphinx. #245
- Regular document updates. #234 #246
- Update the class importing and document building logics for better automation. #299
- Reorganize the operator documents for better reading. #472
Bugs Fixed
- Fix the bug of non-existent videos returned by the video splitting function given a short duration. #243
- Fix the bug that the produced multimodal data would be stored in nested dirs in different ops. #247
- Fix some problems in demos. #244
- Fix "Undefined punctuation_pattern" error in two OPs. #301
- Exceptions and errors can be reraised to the upper level and the status code can be returned to the system correctly. #287
- Fix the bug of out-of-work type hint checking for config files. #302
- Fix the bug of parameters in the base classes that can not be parsed in some OPs. #311
- Fix the memory leaking of video OPs. #374
- Fix the bug of two OPs (video_aesthetics_filter and image_diffusion_mapper) that can not make use of GPUs. #389
- Fix the bug of checkpoints not being restored correctly when the current process list has fewer OPs than the previous one. #391
Acknowledgment
Here we thank public contributors for their PRs to make Data-Juicer better!
- @chg0901 helps to fix typos in documents. #237
- @lingzhq helps to update the paper list in Awesome Data-Model Co-Development of MLLMs. #289
- @shiweijiezero helps fix the bugs in updating the data keys. #300
- @seanzhang-zhichen helps to support multiple patterns for replace_content_mapper. #319
- @simplaj helps to fix a bug of a non-predefined attribute for video_captioning_from_summarizer_mapper. #343
- @zhenqincn helps to reorganize the paper list and add more papers from our survey in Awesome Data-Model Co-Development of MLLMs. #352 #381 #456 #461
- @2108038773 helps to add trust_remote_code argument for some public models on HuggingFace. #382 #385
- @TobyJasper helps to fix typos in documents and contribute a new OP image_face_count_filter. #392 #452
- @co63oc helps to fix some typos in documents and code. #427
Published by yxdyc over 1 year ago
https://github.com/modelscope/data-juicer - Release v0.2.0: Multimodal Support & DJ-SORA
New Features
- We introduce DJ-SORA to provide open large-scale, high-quality datasets for SORA-like models. #227
- We introduce hundreds of dedicated video, image, audio, text, and other multi-modal data processing operators and tools.
- Our paper has been accepted by SIGMOD'24 industrial track! #211
- "BetterMixture", our second data-centric LLM competition, has kicked off and is about to end soon. #174
New OPs
Multimodal
- video_frames_text_similarity_filter: keeps samples whose similarities between sampled video frame images and text are within a specific range. #227
- video_tagging_from_frames_mapper: generates video tags from frames extracted from the video. #227
- video_tagging_from_audio_mapper: generates video tags from audio streams extracted from videos. #227
- video_captioning_from_video_mapper: generates captions from frame images extracted from video to augment datasets. #227
- video_captioning_from_audio_mapper: captions a video according to its audio streams. #227
- image_captioning_mapper: generates captions based on a language model and the image. This OP will increase the number of samples in the dataset. #131 #191 #227
- image_captioning_from_gpt4v_mapper: generates captions based on GPT-4-Vision and the image. This OP will increase the number of samples in the dataset. #214 #227
- image_diffusion_mapper: generates and augments the images based on the Stable Diffusion model and their original images and texts. This OP will increase the number of samples in the dataset. #200
Video
Filter
- video_duration_filter: keeps samples whose videos' durations are within a specified range. #227
- video_aspect_ratio_filter: filters samples according to the aspect ratios of videos (a fraction of width by height, r=w/h) in them. #227
- video_resolution_filter: filters samples according to the resolution of videos in them. #227
- video_ocr_area_ratio_filter: keeps samples whose detected text area ratios for specified frames in the video are within a specified range. #227
- video_aesthetics_filter: filters samples according to the aesthetics score of frame images extracted from videos. #227
- video_motion_score_filter: keeps samples with video motion scores within a specific range. #227
Mapper
- video_split_by_scene_mapper: splits videos into scene clips. #227
- video_split_by_duration_mapper: splits videos by specified duration interval. #227
- video_split_by_key_frame_mapper: splits videos by their keyframes. #227
- video_resize_aspect_ratio_mapper: resizes aspect ratios of videos (a fraction of width by height, r=w/h) to a specified range. #227
- video_resize_resolution_mapper: maps videos to ones with a given resolution range. #227
- video_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to video data more conveniently. #227
Deduplicator
- video_deduplicator: deduplicates samples at document-level using exact matching of videos between documents. #227
Audio
- audio_duration_filter: keeps samples whose audios' durations are within a specified range. #177
- audio_size_filter: keeps samples whose audios' sizes are within a specified range. #184
- audio_nmf_snr_filter: keeps samples whose audios' Signal Noise Ratios (computed based on Non-Negative Matrix Factorization algorithm) are within a specified range. #189
- audio_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to audio data more conveniently. #227
Image
- image_blur_mapper: adds random noises to images to blur them. #180
- image_aesthetics_filter: filters samples according to the aesthetics scores of images. #227
Document Updates
- "Bad" Data Exhibition EN ZH: shows how Data-Juicer finds "bad" data and what they look like.
- Awesome LLM Data EN: a collection of awesome LLM datasets with fine-grained tags.
- Developer Guide enhancement EN ZH: adds guides on how to accelerate the models in your OP with GPUs and how to implement a batched OP for sample augmentation. #203 #220
- OP Insight Visualization Demo code: adds a demo to visualize how each OP works.
Bugs Fixed
- Fix stats computation error in the ray mode due to the inappropriate initialization method. #173
- Fix the bug that some images will be lost when converting their paths to absolute paths. #178
- Fix the dependency problems of OPs that depend on other OPs. #181
- Fix the bug that the predict.py tool gets stuck on the help page. #183
- Fix face_area_filter: constrains the detection coordinates within the image. #202
- Fix MMC4 conversion tools: resolves the situation where multiple images match the same sentence. #195
- Fix or update invalid links in Data-Juicer. #201 #219
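The face_area_filter fix above (#202) constrains detection coordinates to the image. A minimal sketch of that idea follows, assuming boxes are given as (x1, y1, x2, y2) in pixels; `clamp_box` and `area_ratio` are hypothetical helpers for illustration and are not taken from the Data-Juicer code base.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed pixel coordinates


def clamp_box(box: Box, width: float, height: float) -> Box:
    """Clip a detection box so that it lies inside a width x height image."""
    x1, y1, x2, y2 = box
    x1 = min(max(x1, 0), width)
    x2 = min(max(x2, 0), width)
    y1 = min(max(y1, 0), height)
    y2 = min(max(y2, 0), height)
    return (x1, y1, x2, y2)


def area_ratio(box: Box, width: float, height: float) -> float:
    """Fraction of the image area covered by the clipped box."""
    x1, y1, x2, y2 = clamp_box(box, width, height)
    return ((x2 - x1) * (y2 - y1)) / (width * height)


# A detector may return coordinates slightly outside the image bounds.
clipped = clamp_box((-10, 20, 120, 90), width=100, height=80)
ratio = area_ratio((-10, 20, 120, 90), width=100, height=80)
```

Without the clipping step, a box that spills over the border could yield an area ratio above 1.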
Others
- Optimize the model management module. #196 #227
- Optimize the unit test actions. #195 #196 #216 #227
- Optimize the multiprocessing strategy; model inference efficiency can be increased thanks to GPU support. #203 #217 #222 #227
- Update the docker image with JDK. #208
- Support more multimodal (video) dataset conversion tools: #227
- InternVid: 234M video-caption data
- Youku-mPLUG: 36TB video-caption data
- Video-ChatGPT: 100k video-instruction data
- Optimize the generated multimodal data storage. #227
- Support running data-juicer process jobs on Aliyun PAI-DLC. #227
- Better support for multi-machine distributed data processing in Ray mode. #227
Acknowledgment
Here we thank public contributors for their PRs to make Data-Juicer better!
- @liuyanyi helps to fix a bug in quality classifier tools. #183
- @co63oc helps to fix some typos. #215
- @liuyanyi helps to provide the solution to add JDK in the docker image. #182 #208
- @zhenqincn helps to add more papers to the Awesome LLM Data doc. #226
Published by HYLcool about 2 years ago
https://github.com/modelscope/data-juicer - Release v0.1.3: support more Python versions; support multimodal data; more OPs; bugs fixed
New Features
- Data-Juicer now supports Python 3.7-3.10!
- We released a pybind version of simhash-py library named simhash-pybind to solve the Python version limitation problem.
- We test several version-dependent third-party libraries (e.g. dill, kenlm, ...) and validate their availability on different Python versions.
- Multimodal dataset analysis and processing are now supported. #64 #91 #95 #106
- A novel intermediate multimodal sample format: using some special tokens to split text chunks and represent non-text information.
- Several dataset format conversion tools for popular multimodal datasets: LLaVA, MMC4, WavCaps, and more.
- Lots of multimodal OPs are also released: see categories Image and Multimodal in the section New OPs below.
- Auto-HPO tools are now available, which can help users find better hyperparameters for OPs according to specified objective functions or with simple 3-sigma rules only. #65 #140
- Some content cleaning mappers (e.g. email, IP, ...) now support replacing regex patterns with specified strings, not just with empty ones. Additionally, a general version OP is implemented as a new OP replace_content_mapper. #143
- Some collectors, metrics, and drawing functions are added to the analysis module to help users measure the token distribution of a single dataset or distribution difference between different datasets. #160
New OPs
Text
- chinese_convert_mapper: converts Chinese between Traditional Chinese, Simplified Chinese, and Japanese Kanji (by opencc). #51
- remove_non_chinese_character_mapper: removes non-Chinese characters in text samples. #51
- text_action_filter: keeps samples containing action verbs in their texts. #122
- text_entity_dependency_filter: keeps samples containing entity nouns related to other tokens in the dependency tree of the texts. #122
- replace_content_mapper: replaces all content in the text that matches a specific regular expression pattern with a designated replacement string. #143
- remove_repeat_sentences_mapper: removes repeated sentences in the text. #149
Image
- image_shape_filter: keeps samples containing images with widths and heights within the specified ranges. #74
- image_aspect_ratio_filter: keeps samples containing images with aspect ratios (w/h) within the specified range. #64
- image_size_filter: keeps samples containing images whose sizes in bytes are within the specified range. #73
- face_area_filter: keeps samples containing images with face area ratios within the specified range. #110
- image_deduplicator: deduplicates samples at document-level using exact matching of images between documents. #72
Multimodal
- image_text_similarity_filter: keeps samples with image-text feature cosine similarity within the specified range based on a CLIP model. #69
- image_text_matching_filter: keeps samples with image-text classification matching scores within the specified range based on a BLIP model. #100
- phrase_grounding_recall_filter: keeps samples whose locating/grounding recalls of phrases extracted from text in the images are within a specified range. #139
Bugs fixed
- Pin pandas==2.0.0 and fsspec==2023.3.0 to avoid unexpected errors from third-party dependencies. #38 #42
- Fix the bug that OPs nlpaug_en_mapper and nlpcda_zh_mapper generate indefinite numbers of augmented samples. #76
- Fix the bug that maximum_line_length_filter might generate stats of mismatched types (int vs. float), which leads to an error when processing datasets. #147
- Fix the bug of missing attribute dataset_dir when the input dataset path is remote or a mixture of several datasets. #155 #157
- Fix the bug of commandline arguments parsing error in some cases. #108 #165
- Store simhash value as string type to avoid errors from PyArrow. #168 #170
Others
- Dependency importing optimization: only require and import some dependencies when using. #35 #82
- Release demos and datasets on HuggingFace, and release models trained with our refined datasets on both ModelScope and HuggingFace. #42 #54
- Optimize the cache directory selection logic. #43
- Support limiting the number of samples when mixing datasets. #86
- Avoid extra unnecessary model preparation when enabling tokenization in some OPs. #99
- OP language_id_score_filter supports keeping samples in multiple languages now. #125 #151
Acknowledgement
Here we thank public contributors for their PRs to make Data-Juicer better!
- @JONGSKY helps to remove some unnecessary code. #85
- @xuruidong helps to fix several broken links in the README doc. #142
Published by HYLcool over 2 years ago
https://github.com/modelscope/data-juicer - Release v0.1.2: more core functions are available now.
New OPs
- nlpaug_en_mapper: simple data augmentation using nlpaug library for English corpus. #17
- nlpcda_zh_mapper: simple data augmentation using nlpcda library for Chinese corpus. #17
- token_num_filter: filter out samples by the number of tokens in them. HF tokenizers are supported. #24
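As a rough illustration of what a token-count filter such as token_num_filter does, the sketch below keeps samples whose token count falls inside a range. It is not the OP's real code: `filter_by_token_num` and its parameters are hypothetical, and a whitespace split stands in for an HF tokenizer so the example has no dependencies.

```python
from typing import Callable, Dict, List


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption); the real OP supports HF tokenizers.
    return text.split()


def filter_by_token_num(
    samples: List[Dict[str, str]],
    min_num: int,
    max_num: int,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> List[Dict[str, str]]:
    """Keep samples whose token count lies within [min_num, max_num]."""
    return [s for s in samples if min_num <= len(tokenize(s["text"])) <= max_num]


samples = [
    {"text": "a b c"},
    {"text": "one"},
    {"text": "a b c d e f"},
]
kept = filter_by_token_num(samples, min_num=2, max_num=5)
```

Any callable that maps a string to a list of tokens can be passed as `tokenize`.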
New features
- OP Fusion #14
- Filters that share the same contextual variables can now be fused into one OP, saving up to 25% of the time when processing datasets.
- Cache management #19
- Cache management now works for Data-Juicer thanks to the new serialization method.
- Cache compression is supported: caches are automatically compressed when they are not in use and decompressed when needed, which saves up to 50% of disk space.
- Distributed data processing with Ray is supported now. #21
- Config sys optimization:
- Only keep text_keys and remove previous misleading arg text_key(s)_to_process/load. #13
- A new argument export_in_parallel is added to control whether to export the result datasets in parallel. #17
- Display the config table after config parsing is ready. #17
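The OP Fusion item above describes Filters that share contextual variables being fused into one OP. The sketch below shows the general idea with two toy filters that both need the word list of a text: the fused version tokenizes once per sample and shares the result through a context dict. All names here (`get_words`, `fuse`, the two filters, the thresholds) are hypothetical and only illustrate the technique; they are not Data-Juicer's implementation.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Filter = Callable[[str, Context], bool]

stats = {"tokenize_calls": 0}  # counts how often the shared variable is computed


def get_words(text: str, context: Context) -> List[str]:
    # Compute the shared contextual variable once and cache it in the context.
    if "words" not in context:
        stats["tokenize_calls"] += 1
        context["words"] = text.split()
    return context["words"]


def word_count_ok(text: str, context: Context) -> bool:
    return len(get_words(text, context)) >= 3


def word_repetition_ok(text: str, context: Context) -> bool:
    words = get_words(text, context)
    return bool(words) and len(set(words)) / len(words) >= 0.5


def fuse(filters: List[Filter]) -> Callable[[str], bool]:
    """Combine filters so that they share one context per sample."""

    def fused(text: str) -> bool:
        context: Context = {}
        return all(f(text, context) for f in filters)

    return fused


fused_filter = fuse([word_count_ok, word_repetition_ok])
decisions = [fused_filter(t) for t in ["the quick brown fox", "spam spam spam spam", "hi"]]
```

Run separately, the two filters would each tokenize every sample; fused, the word list is built once per sample.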
Others
- Replace original string constants with constant enums. #13
- Expand the checkpoint protection range to cover the exporting process. #14
- Remove extra intermediate variables storage in document_simhash_deduplicator to save more memory. #14
- Docs updates. #15 #16
- PyPI package is available. You can install data-juicer by pip install py-data-juicer now. #23
- Docker building is available now. The official docker image for Docker Hub is in progress. #23
- Deploy the unit tests for Data-Juicer. #29
Published by HYLcool over 2 years ago
https://github.com/modelscope/data-juicer - Release v0.1.0, the first internal version for open-source
Summarization - Table of Contents
- Data-Juicer: A Data-Centric Text Processing System for Large Language Models
- Table of Contents
- Features
- Prerequisites
- Installation
- Quick Start
- Data Processing
- Data Analysis
- Data Visualization
- Build Up Config Files
- Preprocess raw data (Optional)
- Documentation
- Data Recipes
- Demos
- License
- Contributing
- References
Features
Broad Range of Operators: Equipped with 50+ core operators (OPs), including Formatters, Mappers, Filters, Deduplicators, and beyond.
Specialized Toolkits: Feature-rich specialized toolkits such as Text Quality Classifier, Dataset Splitter, Analysers, Evaluators, and more that elevate your dataset handling capabilities.
Systematic & Reusable: Empowering users with a systematic library of reusable config recipes and OPs, designed to function independently of specific datasets, models, or tasks.
Data-in-the-loop: Allowing detailed data analyses with an automated report generation feature for a deeper understanding of your dataset. Coupled with real-time multi-dimension automatic evaluation capabilities, it supports a feedback loop at multiple stages in the LLM development process.
Comprehensive Processing Recipes: Offering tens of pre-built data processing recipes for pre-training, SFT, en, zh, and more scenarios.
User-Friendly Experience: Designed for simplicity, with comprehensive documentation, easy start guides and demo configs, and intuitive configuration with simple adding/removing OPs from existing configs.
Flexible & Extensible: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to implement your own OPs for customizable data processing.
Enhanced Efficiency: Providing a speedy data processing pipeline requiring less memory, optimized for maximum productivity.
Published by yxdyc almost 3 years ago