Recent Releases of https://github.com/modelscope/data-juicer

https://github.com/modelscope/data-juicer - Release v1.4.2: Python > 3.10 are supported; Data Attribution OPs; External OPs are supported; Install with "uv"

Major Updates

  • Data-Juicer is now compatible with Python 3.11 & 3.12. #749
  • 5 OPs for data attribution are added. #735
  • Data-Juicer now supports registering and applying custom OPs from external paths via the custom_operator_paths argument. #758
  • "uv" is now the recommended way to install Data-Juicer because it can resolve dependency conflicts. #760

New Operators

Filter

  • Validation-free
    • llm_perplexity_filter: Filter to keep samples whose perplexity score, computed using a specified LLM, falls within a specific range. #735
    • instruction_following_difficulty_filter: Filter to keep texts whose instruction-following difficulty (IFD, https://arxiv.org/abs/2308.12032) falls within a specific range. #735
  • Validation-based
    • in_context_influence_filter: Filter to keep texts whose in-context influence on the validation set falls within a specific range. #735
    • llm_task_relevance_filter: Filter to keep samples with a high relevance score to validation tasks, as estimated by an LLM. #735
    • text_embd_similarity_filter: Filter to keep texts whose average embedding similarity to a set of given validation texts falls within a specific range. #735
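
The two validation-free filters above both reduce to a score-in-range check. The following is a minimal sketch, not the OPs' actual implementation. It assumes per-token log-probabilities have already been obtained from the chosen LLM. Perplexity is the exponential of the mean negative log-likelihood, and IFD, following the cited paper, is the ratio of the answer's mean loss given the instruction to its mean loss without it. All function and parameter names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def mean_loss(token_logprobs):
    """Average cross-entropy loss over the given tokens."""
    return -sum(token_logprobs) / len(token_logprobs)


def ifd_score(answer_logprobs_with_instruction, answer_logprobs_alone):
    """IFD: loss of the answer given the instruction, divided by
    the loss of the same answer without the instruction."""
    return mean_loss(answer_logprobs_with_instruction) / mean_loss(answer_logprobs_alone)


def keep_in_range(score, min_score, max_score):
    """Range check shared by both filters (bounds are hypothetical parameters)."""
    return min_score <= score <= max_score


# Hypothetical log-probabilities for a 4-token sample.
ppl = perplexity([math.log(0.5)] * 4)
ifd = ifd_score([math.log(0.5)] * 2, [math.log(0.25)] * 2)
keep = keep_in_range(ppl, 1.0, 10.0) and keep_in_range(ifd, 0.2, 0.9)
```

A sample whose IFD is close to 1 gains little from its instruction, which is why a range rather than a single threshold is useful.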

Enhancements

  • A new environment variable DATA_JUICER_EXTERNAL_MODELS_HOME is added to let users specify private or read-only paths for storing external and extra models. #740
  • Optimize the video link transformation and multi-version maintenance in the docs. Update demo videos with higher-resolution versions. #746
  • Support custom save_dir for OPs that produce extra multimodal data. #751
  • Add official and detailed docs about Data-Juicer Agent. #759
  • Enhance unit tests: show the name of the current test case; recycle resources after each test case in ray mode. #749
  • Refine the developer guide for better practice on building new OPs. #760

Bugs Fixed

  • Move the update of special tokens for multimodal data into the initialization of base_op, which fixes the bug that special tokens might not be synced with the main process when processing data in parallel. #752
  • Fix some test cases. #754

Acknowledgement

  • @ShenQianli made their first contribution to 5 new OPs. #735

Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.4.1...v1.4.2

- Python
Published by HYLcool 10 months ago

https://github.com/modelscope/data-juicer - Release v1.4.1: MCP server; GPU-based Minhash deduplicator; Improved unit test coverage.

Major Updates

  • Introduce Data-Juicer MCP server. Users can conveniently access the data processing capabilities through MCP. #690 #737
  • Unit test coverage is improved to 85%+ and several bugs in test cases are resolved (OOM, encoding errors, and so on), which makes Data-Juicer more reliable. #698 #717 #720 #727
  • GPU-based Minhash deduplication is supported, in collaboration with developers from NVIDIA. #694 #644
  • RayExporter supports more formats to export a ray dataset in addition to json/jsonl. #687
  • Two demo videos are added to introduce the Data-Juicer core functions, agentic usages, and sandbox. #738
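
For readers unfamiliar with Minhash deduplication, here is a CPU-only sketch of the underlying idea in pure Python. It is not the GPU implementation referenced above. Each document becomes a set of word shingles, and the fraction of matching positions between two Minhash signatures estimates their Jaccard similarity. The helper names, the shingle size, and the number of hash functions are assumptions chosen for illustration.

```python
import hashlib


def shingles(text, n=3):
    """Set of word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_perm=64):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions where both sets share the same minimum."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "data juicer is a data processing system for foundation models"
doc_b = "completely unrelated sentence about cooking pasta at home tonight"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
```

Documents whose estimated similarity exceeds a chosen threshold are treated as near-duplicates; LSH banding is normally added on top to avoid comparing every pair.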

New Operators

  • download_file_mapper downloads data from URLs to local files or specified fields. #709

Enhancements

  • New analysis method: correlation analysis among stats is added. #663
  • Several core dependencies are updated and pinned to newer versions, and dependency conflicts are resolved. #715 #717 #723
  • The EasyAnimate pipelines in the sandbox are updated to follow the refactoring of sandbox. #710
  • Apply more reliable pre-commit tools to improve the code style of Data-Juicer. #714
  • Support storing and processing image bytes data in the dataset. #725
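
As an illustration of what correlation analysis among stats can look like, the sketch below computes a pairwise Pearson correlation matrix over per-sample stats. The choice of Pearson, the stat names, and the data layout are assumptions; the release note does not specify which correlation method the analyzer uses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def correlation_matrix(stats):
    """Nested dict mapping each pair of stat names to their correlation."""
    names = list(stats)
    return {a: {b: pearson(stats[a], stats[b]) for b in names} for a in names}


# Hypothetical per-sample stats, one list entry per sample.
stats = {
    "text_len": [10, 20, 30, 40],
    "num_words": [2, 4, 6, 8],
    "special_char_ratio": [0.4, 0.3, 0.2, 0.1],
}
matrix = correlation_matrix(stats)
```

Strongly correlated stats are candidates for being redundant filter criteria.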

Bugs Fixed

  • The wheel & docker image building bug is fixed. #706
  • Fix bugs in log_summarization. #710
  • Fix "no module named data_juicer" error after installing from the wheel file. #727

Acknowledgement

  • @fanronghai helps to fix the param error in the dataset_splitting_by_language tool. #713
  • @ayushdg helps to support a GPU-version Minhash deduplicator. #644
  • @ricksun2023 helps to fix the bugs that occur when the configs contain multiple OPs with the same name. #730

Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.4.0...v1.4.1

- Python
Published by HYLcool 11 months ago

https://github.com/modelscope/data-juicer - v1.4.0 Major Refactor for Env Management, Doc, Sandbox; Derivative Works (TPAMI Survey; Trinity-RFT & DetailMaster)

Summary: 200+ files changed with 18,535 additions and 3,720 deletions.


Major Refactors & Improvements

  • Sandbox Usability (#686):
    • Support for multiple pipelines, context info, and an environment manager to run different commands in various environments.
    • Includes the InternVL example as a showcase.
  • DJ-Doc Redesign (https://github.com/modelscope/data-juicer/pull/675):
    • Now with multilingual support (English / Chinese) and a modernized style.
  • Dependency Management Update (#660, #680):
    • Migrated to uv for faster dependency resolution.
    • Added sub-groups for better organization.

New Features & Integrations (#683, #688, #692)

  • Additional Repo Supported:
  • DJ-Awesome-List:
    • A survey paper accepted by TPAMI'25!
  • Synthetic Benchmark Added:
    • DetailMaster, a new benchmark for synthetic data evaluation.
  • New Operators Introduced (#673, #701):
    • llm_analysis_filter
    • general_field_filter

Core Optimizations & Bug Fixes

  • Ray Executor Enhancements (#697):
    • File extension detection added.
    • Support for more data formats.
  • Startup Time Optimization:
    • Improved startup performance. (#684)
  • Text Embedding Support:
    • Added support for text embedding via API and local model. (#681)
  • Docker Build Improvement:
    • Ignore installed distutils libraries during Docker image building. (#668)
  • Mapper Module Fix:
    • Fixed issue with module initialization. (#700)
  • Warning Suppression:
    • Suppressed unnecessary warnings from fasttext. (#696)

Full Changelog

View all changes since v1.3.3

- Python
Published by yxdyc 12 months ago

https://github.com/modelscope/data-juicer - Release v1.3.3: Sandbox is accepted as Spotlight by ICML 2025; Add Img-Diff recipes.

Major Updates

  • Our work on Data-Juicer Sandbox has been accepted as a Spotlight by ICML 2025 (top 2.6% of all submissions)!
  • Add new OPs and recipes for Img-Diff. #658

Enhancements

  • Support HF LLMs for the two llm_xxx_score_filter OPs. #655
  • Sync the docker image to Aliyun OSS for downloading when Docker Hub is not accessible. #657
  • Split standalone and distributed unit tests to save time when re-running failed ones. #666

Bugs Fixed

  • Address possibly missing cfg in unify_format. #653
  • Improve clarity & fix bad links for some docs. #659

Acknowledgement

  • @co63oc helps to fix some typos. #654

Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.3.2...v1.3.3

- Python
Published by HYLcool about 1 year ago

https://github.com/modelscope/data-juicer - Release v1.3.2: Enhancements on usability & two OPs; some bugs fixes

What's Changed

  • Human OP enhancements, in https://github.com/modelscope/data-juicer/pull/642 https://github.com/modelscope/data-juicer/pull/645
    • update label-studio version
    • make service script more robust
    • add documentation
    • optimize field mapping
  • OP efficiency optimization of document_minhash_deduplicator, in https://github.com/modelscope/data-juicer/pull/639
  • set temp_parser.usage to argparse.SUPPRESS to skip excessive help logs, in https://github.com/modelscope/data-juicer/pull/643
  • fix date typo in https://github.com/modelscope/data-juicer/pull/648
  • Fix docker building failure in https://github.com/modelscope/data-juicer/pull/650
  • Fix StreamToLoguru compatibility issue with torch._dynamo in https://github.com/modelscope/data-juicer/pull/651
  • add init file for annotation module, fix dj-process command error in https://github.com/modelscope/data-juicer/pull/652

New Contributor

  • @cmgzn made their first contribution in https://github.com/modelscope/data-juicer/pull/651

- Python
Published by yxdyc about 1 year ago

https://github.com/modelscope/data-juicer - Release v1.3.1: added HumanOPs & fixed some bugs

Major Updates

  • Prototype implementation for HumanOps (annotation). #617 Included features:
    • boilerplate code for supporting Label Studio powered human annotation ops
    • a reference implementation of human preference annotation
    • Label Studio service script; can start up a local instance using docker or pip, whichever is available
    • reference configs and data
    • event-driven and notification mixins framework for ops
New OPs

  • extract_tables_from_html_mapper: extract tables from html texts. #634
  • general_fused_op: an explicitly fused operator designed to execute multiple sequential operations (OPs) on the same batch, enabling fine-grained control over data processing. #626
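
The idea behind an explicitly fused operator can be sketched as follows: instead of passing the whole dataset through each OP in turn, every batch goes through the full OP sequence before the next batch is touched. This is an illustrative sketch only. The batch layout (a dict of equally long lists) and the two toy OPs are assumptions, not the general_fused_op interface.

```python
def lowercase_mapper(batch):
    """Toy mapper: lowercase every text in the batch."""
    return {**batch, "text": [t.lower() for t in batch["text"]]}


def min_length_filter(batch, min_len=5):
    """Toy filter: drop samples whose text is shorter than min_len."""
    keep = [len(t) >= min_len for t in batch["text"]]
    return {key: [v for v, ok in zip(values, keep) if ok] for key, values in batch.items()}


def fused_apply(batch, ops):
    """Run all OPs sequentially on the same batch."""
    for op in ops:
        batch = op(batch)
    return batch


batch = {"text": ["Hello World", "Hi"], "id": [1, 2]}
result = fused_apply(batch, [lowercase_mapper, min_length_filter])
```

Keeping the batch in memory across OPs avoids intermediate dataset writes between steps.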

Bug Fixed

  • fix dataset builder initialization failure #630
  • update Executor references from Executor to DefaultExecutor #632 #633
  • switch the backend of plt to avoid sub-process/thread error #633
  • fix some boundary condition bugs in several deduplicators #635 #637

Others

  • check the dataset when loading so that a dataset can be passed to the DefaultExecutor.run method. #633
  • update docs to highlight light env installation part. #636

Acknowledgement

  • @liuyuhanalex helps to add a new OP and fix some of the boundary condition bugs. #634 #635

Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.3.0...v1.3.1

- Python
Published by HYLcool about 1 year ago

https://github.com/modelscope/data-juicer - Refactor of dataset builder and executor!

The Big Change

Refactor of dataset builder and executor, see https://github.com/modelscope/data-juicer/pull/537, @cyruszhang

  • YAML explicitly defines different sources of datasets; local and remote are defined separately.
  • More flexible parameterized control; supports source-specific parameters, validations, and extensible configurations.
  • Unbind Executor's hardcoded support: no longer restricted to local JSON formats; input format is determined dynamically via formatters/downloaders.
  • Enhanced Executor extensibility to natively support engines like Nemo, Dask, Spark, etc.
  • Add data format validation to ensure consistency and correctness.
  • Expanded data source support:
    • ModelScope integration.
    • ArXiv dataset import (download, decompress, ingest).
    • Wikipedia dataset support (download, decompress, ingest).
    • Common Crawl integration (download, decompress, ingest).
  • Backward compatibility with existing dataset_path command-line syntax.
  • Support for data mixtures to combine multiple datasets dynamically.
  • Support for empty formatters/generated datasets without pre-defined config files.

Others

  • New audio processing operator: audio_add_gaussian_noise (PR #622), @liuyuhanalex
  • Added dynamic coverage rate badge to the README for transparency (PR #625)

- Python
Published by yxdyc about 1 year ago

https://github.com/modelscope/data-juicer - Release v1.2.2

Major Updates

  • Add document for API service. Add parameter transmission using json.dumps to support API calls for arbitrary registered functions and classes. #613
  • Add unit tests for the analysis module and utils module to increase test coverage. #604 #616
  • A new data synthesis method is proposed, which encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency, cross-modality generalization, and SFT effects over SOTA baselines (e.g., 16% gain on MathVision using only 400 samples). See more details in MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions.

New OPs

  • llm_quality_score_filter: Filter to keep samples with a high quality score estimated by an LLM, supporting both API calls and local vLLM calls. #606 #614 #620
  • llm_difficulty_score_filter: Filter to keep samples with a high difficulty score estimated by an LLM, supporting both API calls and local vLLM calls. #606 #614 #620

Others

  • Fix config in LLaVa pretrain recipe. #610
  • Update news for MindGYM and fix doc. #615
  • Fix decode error through UTF-8 decoding. #618

- Python
Published by BeachWang about 1 year ago

https://github.com/modelscope/data-juicer - Release v1.2.1

Major Updates

  • DJ has been integrated into Ray's official Ecosystem and Example Gallery. In addition, our patch in DJ2.0 for the streaming JSON reader has been officially integrated by Apache Arrow.
  • Our work on contrastive data synthesis, ImgDiff, has been accepted by CVPR 2025!
  • Unit test optimization:
    • split unit tests into partial and regression tests: the partial test is triggered by PRs and only runs the test cases corresponding to changed files; the regression test runs all cases and is triggered at 7:00 every Friday, Beijing time. #598
    • use the built-in @unittest.skip and remove SKIPPED_TESTS. #586
    • upload test coverage reports to GitHub artifacts. #586

New OPs

  • image_remove_background_mapper: remove the background of images. #589

Others

  • add missing LOADED_AUDIOS to ALL_INTER_VARS to enable OP fusion and context sharing. #585
  • only build doc for py3.10. #586
  • move dependency on ray to minimal requirements. #586 #594 #595
  • allow executor and other tool functions to consume a loaded dataset in addition to the config file. #596 #597
  • fix undefined fileno bug of the logger. #594

Acknowledgement

  • @liuyuhanalex helps simplify the code logic of OP fusion, add a new OP image_remove_background_mapper, and fix some minor bugs. #581 #585 #589
  • @co63oc helps to fix typos in code and documents. #582 #583 #588 #591 #593
  • @danielhjz helps to fix the implicit memory leak problem in image_nsfw_filter. #590

- Python
Published by HYLcool over 1 year ago

https://github.com/modelscope/data-juicer - v1.2.0 Doc refactored; New algorithm proposed

What's New

Detailed PRs

  • fix export error when export_stats columns is null in https://github.com/modelscope/data-juicer/pull/557
  • Resplit input dataset in ray mode in https://github.com/modelscope/data-juicer/pull/549
  • Refactor and improve doc for RecipeGallery, DeveloperGuide, DistributedProcess and DJ-related Competitions in https://github.com/modelscope/data-juicer/pull/561
  • Resolve most skipped unit-tests in https://github.com/modelscope/data-juicer/pull/559
  • fix translation error in https://github.com/modelscope/data-juicer/pull/562
  • Add unittest for ray text dedup in https://github.com/modelscope/data-juicer/pull/540
  • [Typo]correct a small typo in https://github.com/modelscope/data-juicer/pull/563
  • update the 2.0 paper link & the DaaR news in https://github.com/modelscope/data-juicer/pull/566
  • Fix typos in https://github.com/modelscope/data-juicer/pull/571
  • Optimization for sdxl_prompt2prompt_mapper dependency importing in https://github.com/modelscope/data-juicer/pull/570
  • Fix typos in https://github.com/modelscope/data-juicer/pull/572

Acknowledgment

  • @liuyuhanalex @co63oc made their first PRs

Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.1.0...v1.2.0

- Python
Published by yxdyc over 1 year ago

https://github.com/modelscope/data-juicer - Release v1.1.0

Major Updates

  • Users can now run ray-based distributed data processing under the guidance of the added docs. #523
  • The DJ-Cookbook has gathered numerous high-quality data processing recipes from various vertical fields, and the related documents have been updated on the homepage. #542
  • Change Task mode to Actor mode for ray deduplication, allowing users to use these operators without installing Redis. #526
  • Append a log summarization of warnings and errors at the end of a run to make them recognizable under the sample fault tolerance mechanism. #534
  • Automatically update relevant documents when adding OPs to reduce the development burden. #527
  • Add usability tags for OPs:
    • alpha tag for OPs in which only the basic OP implementations are finished;
    • beta tag for OPs in which unittests are added based on the alpha version;
    • stable tag for OPs in which OP optimizations related to DJ (e.g. model management, batched processing, OP fusion, ...) are added based on the beta version.

New OPs

  • image_segment_mapper: Perform segment-anything on images and return the bounding boxes. #550
  • mllm_mapper: Mapper to use MLLMs to generate texts for images. #550
  • sdxl_prompt2prompt_mapper: Use the generative model SDXL and image editing technique Prompt-to-Prompt to generate pairs of similar images. #550
  • sentence_augmentation_mapper: Augment sentences using LLMs. #550
  • text_pair_similarity_filter: Filter samples according to the similarity score between the text pair. #550

Bug Fixed

  • Add a global skip_op_error param to enable fault tolerance when executing the Data-Juicer analyzer and executor, but disable it for unit tests. #528
  • Fix model force download bug. #529
  • Fix IndexError if the number of samples in the result dataset is less than the number of workers when saving dataset to disk. #536
  • Fix missing field meta tag on ray mode. #538
  • Update max_tokens or max_new_tokens for vllm-based OPs to avoid too short generation. #544
  • Fix bug in the role playing data generation demo. #545

Others

  • Enhance unit test for API calling OPs. #528
  • Remove sandbox requirements installation from Dockerfile. #530
  • Update the datasource related APIs to be compatible with the latest version of Ray. #532
  • Limit the generated qa num for each text in generate_qa_from_text_mapper. #541
  • Update docs for preparing DJ2.0 release. #542
  • Update a quick cdn link for arch figure. #543
  • Add a video demo for role playing data generation. #545
  • Optimize op doc for global textual search. #552
  • Use a more stable and fast translator than google translator for automatic OP doc building. #554

Acknowledgement

  • @Qirui-jiao made great contributions to enrich the Data-Juicer OP pool. #550

- Python
Published by BeachWang over 1 year ago

https://github.com/modelscope/data-juicer - Release v1.0.3: More Powerful Distributed MinHashLSH Deduplicator; Post-Tuning Formats & OPs; Ray Actor for GPU-based OPs

Major Updates

  • Support a Ray-based MinHashLSH deduplicator, which implements a multi-process Union-Find set based on Ray Actor and the BTS algorithm to complete equivalence class merging. #502
  • Support post-tuning dataset formats in LLaMA-Factory and ModelScope-Swift.
    • Data-Juicer chooses the Query-Response format as the intermediate format for the post-tuning dataset. #514
    • Refine the overall intermediate format of Data-Juicer to support various dataset formats better. (meta, stats) #514 #518
    • Provide several format conversion tools for converting to Data-Juicer format and vice versa. #514
  • Add 10 more post-tuning OPs to process post-tuning datasets better. They are listed in detail in the New OPs section below. #513
  • Support Ray Actor mode for GPU-based OPs. #511
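
To make the equivalence class merging step concrete, here is a single-process Union-Find sketch. The release describes a multi-process variant built on Ray Actors and the BTS algorithm, which this example does not attempt to reproduce. Given pairs of sample ids that LSH flagged as near-duplicates, it merges them into classes and keeps one representative (the smallest id) per class. All names are hypothetical.

```python
class UnionFind:
    """Disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # The smaller id becomes the representative of the merged class.
            self.parent[max(root_a, root_b)] = min(root_a, root_b)


def representatives(sample_ids, duplicate_pairs):
    """Return the ids kept after merging all duplicate pairs."""
    uf = UnionFind()
    for a, b in duplicate_pairs:
        uf.union(a, b)
    return [i for i in sample_ids if uf.find(i) == i]


kept = representatives([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (5, 4)])
```

Samples 1, 2, and 3 collapse into one class even though 1 and 3 were never compared directly, which is the reason a transitive merge is needed after LSH.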

New OPs

Post-tuning OPs for fine-grained analysis of dialog data. #513

Mapper

  • dialog_intent_detection_mapper: Mapper to generate user's intent labels in feed back dialog data.
  • dialog_sentiment_detection_mapper: Mapper to generate user's sentiment labels in feed back dialog data.
  • dialog_sentiment_intensity_mapper: Mapper to predict user's sentiment intensity (from -5 to 5 in default prompt) in feed back dialog data.
  • dialog_topic_detection_mapper: Mapper to generate user's topic labels in feed back dialog data.
  • query_intent_detection_mapper: Mapper to predict user's Intent label in a query.
  • query_sentiment_detection_mapper: Mapper to predict user's sentiment label ('negative', 'neutral' and 'positive') in a query.
  • query_topic_detection_mapper: Mapper to predict user's topic label in a query.

Aggregator

  • meta_tags_aggregator: Merge similar meta tags to one tag.

Selector

  • tags_specified_field_selector: Select samples based on the tags of specified field.

Grouper

  • naive_reverse_grouper: Split batched samples into individual samples.

Bug Fixed

  • Fix the wrong argument passing in generate_qa_from_example_mapper. #517
  • Update the out-of-date Dingding QR code on the main page. #513

Acknowledgement

  • @jackylee-ch made their first contribution to help fix several invalid links in the document. #521

Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.0.2...v1.0.3

- Python
Published by HYLcool over 1 year ago

https://github.com/modelscope/data-juicer - Release v1.0.2

Major Updates

  • Added more mapper/grouper/aggregator OPs for post-tuning scenarios.
  • Optimized the distributed mode performance and usability with more automatic features.

DJ-Operators

  • extract_support_text_mapper, relation_identity_mapper, python_file_mapper, https://github.com/modelscope/data-juicer/pull/500
  • naive_grouper, key_value_grouper, https://github.com/modelscope/data-juicer/pull/500
  • nested_aggregator, entity_attribute_aggregator, most_relavant_entities_aggregator, https://github.com/modelscope/data-juicer/pull/500
  • video_extract_frames_mapper, https://github.com/modelscope/data-juicer/pull/507

Performance

  • Optimize ray mode performance, https://github.com/modelscope/data-juicer/pull/442
  • Patch for Performance Benchmark in CI/CD workflows, https://github.com/modelscope/data-juicer/pull/506
  • DJ Ray mode supports streaming loading of jsonl files, https://github.com/modelscope/data-juicer/pull/515

Usability and Analysis

  • support dj-install in recipe-level, https://github.com/modelscope/data-juicer/pull/508
  • support dj-analyze with --auto mode, https://github.com/modelscope/data-juicer/pull/512
  • support op-wise insight auto mining, https://github.com/modelscope/data-juicer/pull/516

Acknowledgment

Thanks to Data-Juicer users and contributors for their helpful feedback, issues and PRs!

- Python
Published by yxdyc over 1 year ago

https://github.com/modelscope/data-juicer - Release v1.0.1

Major Updates

  • Supports automatically arranging operators from fastest to slowest based on their execution speed, and also supports automatically setting the operator batch size according to the execution speed. #464
  • [UnitTest] Performance benchmark for efficiency tests of 4 modalities. Reports will be uploaded to the internal wandb server. #483
  • Added some useful OPs, including the construction of DPO training data and a lightweight user-customizable OP interface. See more details below. #491 #492 #493
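
The operator reordering mentioned above can be pictured with the following sketch: probe each OP on a few samples, record its throughput, and run the fastest OPs first so that slower ones see less data. This is an assumption-laden illustration, not the mechanism from #464. In particular, it ignores which OPs may safely be swapped, and every name in it is hypothetical.

```python
import time


def measure_speed(op, probe_samples):
    """Samples processed per second on a small probe set."""
    start = time.perf_counter()
    for sample in probe_samples:
        op(sample)
    elapsed = time.perf_counter() - start
    return len(probe_samples) / max(elapsed, 1e-9)


def order_fastest_first(op_names, speeds):
    """Sort OP names by measured speed, highest throughput first."""
    return sorted(op_names, key=lambda name: speeds[name], reverse=True)


# Hypothetical measured speeds in samples per second.
speeds = {"image_nsfw_filter": 35.0, "text_length_filter": 90000.0, "language_id_filter": 4200.0}
plan = order_fastest_first(list(speeds), speeds)
```

Reordering like this only preserves results for OPs that commute, such as independent filters.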

OPs

Text OPs

  • pair_preference_mapper: Mapper to construct preference answers for QA pairs. #491

Script OPs

  • python_lambda_mapper: Mapper for executing customized Python lambda functions on data samples. #492
  • python_file_mapper: Mapper for executing customized Python functions on data samples. #493
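
A lambda-based mapper of this kind can be sketched as below. This is an illustration of the general technique, not the python_lambda_mapper source. It parses a lambda given as a string, checks that it is a single one-argument lambda, and applies it to each sample. The function names and the dict-in, dict-out contract are assumptions.

```python
import ast


def build_lambda(lambda_str):
    """Parse and compile a string that must contain exactly one lambda."""
    tree = ast.parse(lambda_str, mode="eval")
    if not isinstance(tree.body, ast.Lambda):
        raise ValueError("expected a single lambda expression")
    if len(tree.body.args.args) != 1:
        raise ValueError("the lambda must take exactly one argument")
    # eval runs arbitrary code, so lambda_str must come from a trusted source.
    return eval(compile(tree, "<lambda_str>", "eval"))


def apply_lambda_mapper(samples, lambda_str):
    """Apply the lambda to every sample and require a dict result."""
    fn = build_lambda(lambda_str)
    mapped = []
    for sample in samples:
        result = fn(sample)
        if not isinstance(result, dict):
            raise TypeError("the lambda must return a dict")
        mapped.append(result)
    return mapped


out = apply_lambda_mapper([{"text": "Hi"}], "lambda s: {**s, 'text': s['text'].upper()}")
```

Validating the syntax tree before evaluation rejects arbitrary statements, but it does not sandbox what the lambda itself does.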

Bugs Fixed

  • Add an argument to control whether to open Monitor for data processing. It's True by default. #483
  • For the mp start method of monitor, set it to "spawn" for Windows systems and "fork" for others. #483
  • Update transformers version to >=4.47.0 to avoid "shape mismatch" bug from older version 4.46.3. #483
  • Fix the logic errors in Turbo acceleration and batch processing, and ensure that map and filter are consistent in this part of the logic. #504
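
The platform-dependent start method described above can be expressed with the standard multiprocessing module, as in this small sketch. The function names are hypothetical, and the Monitor's own code may be organized differently.

```python
import multiprocessing
import platform


def monitor_start_method(system=None):
    """Return "spawn" on Windows (fork is unavailable there) and "fork" elsewhere."""
    system = system or platform.system()
    return "spawn" if system == "Windows" else "fork"


def get_monitor_context():
    """Multiprocessing context bound to the chosen start method."""
    return multiprocessing.get_context(monitor_start_method())
```

Using a dedicated context instead of multiprocessing.set_start_method keeps the choice local to the monitor and leaves the global default untouched.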

Others

  • Pin the PyAV version to prevent inconsistent updates. #504
  • Skip some unit test for audio OPs to avoid lazy_loader failure during multiprocessing. #503
  • Remove unnecessary UNFORKABLE marks for some OPs. #491
  • Refine the docker image building. Add a new self-hosted runner for docker image building, optimize the building logic for auto docker image building on release, change the default full image to a GPU-version image. #494 #501

Acknowledgment

Here we thank public contributors for their PRs and issues to make Data-Juicer better!

- Python
Published by BeachWang over 1 year ago

https://github.com/modelscope/data-juicer - Release v1.0.0: Refactor DJ-Dataset & DJ-Operator, Sandbox, and more exciting features!

Major Updates

  • ๐Ÿš€ Refactor Data-Juicer Operator & Dataset for better usability! We combine our two backends, HuggingFace Dataset and Ray Dataset, into a unified DJ-Dataset, and unify and introduce new invoking interfaces. Based on this, we add a fault-tolerant strategy during the data processing, helping users to know the actual reasons for processing failure. #359 #366
  • ๐Ÿงช [Experimental] Data-Juicer Sandbox toolkit is now available! Users are allowed to develop datasets and models in a co-development way with the highly customizable Sandbox to obtain better performance. For more details, please refer to the docs. #273 #291 #312 #332 #364
  • ๐Ÿš€ Basic API server based on FastAPI is now available in Data-Juicer! Now users can make use of the capabilities of OPs with API service. #468
  • ๐Ÿš€ Support adaptive resource management:
    • Adaptive number of processors for model-based OPs according to the GPU memory and other types of resource utilization. #270 #329 #354
    • Adaptive batch size for batched OPs according to their resource utilization to maximize the OP speed. #429
  • We presented a tutorial on Multi-modal Data Processing for Foundation Models: Practical Guidance and Use Cases at KDD'24. #310
  • A lot of additions and improvements were made to OPs, DJ-Engine, and CI/CD. See more details below.
  • ๐Ÿ› A playground for Data-Juicer is opened for user trial. #277 #368

OPs

Text

  • ray_document_deduplicator: supports Ray-based distributed exact-match deduplication for text-only datasets. #263
  • Support sentencepiece tokenizer for MinHash deduplicators. #269
  • generate_qa_from_text_mapper: generates question and answer pairs from input texts. #333 #454
  • generate_qa_from_examples_mapper: generates question and answer pairs based on examples. #338 #454
  • optimize_qa_mapper: optimizes the question-answer pairs in question-answering samples. #338 #454
  • optimize_query_mapper: optimizes the query in question-answering samples. #338 #454
  • optimize_response_mapper: optimizes the response in question-answering samples. #454
  • calibrate_qa_mapper: calibrates question-answer pairs based on reference text. #463
  • calibrate_query_mapper: calibrates query in question-answer pairs based on reference text. #463
  • calibrate_response_mapper: calibrates response in question-answer pairs based on reference text. #463
  • text_chunk_mapper: splits input text to chunks. #481
  • extract_entity_attribute_mapper: extracts attributes for given entities from the text. #481
  • extract_entity_relation_mapper: extracts entities and relations in the text for knowledge graph. #481
  • extract_event_mapper: extracts events and relevant characters in the text. #481
  • extract_keyword_mapper: generates keywords for the text. #481
  • extract_nickname_mapper: extracts nickname relationships in the text. #481
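
As one concrete example from the list above, splitting text into chunks can be sketched as a sliding window with optional overlap. This is a character-based illustration with hypothetical parameter names. The actual text_chunk_mapper may split on patterns or tokens instead.

```python
def chunk_text(text, max_len, overlap_len=0):
    """Split text into chunks of at most max_len characters,
    where consecutive chunks share overlap_len characters."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap_len < max_len:
        raise ValueError("overlap_len must be in [0, max_len)")
    step = max_len - overlap_len
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


chunks = chunk_text("abcdefghij", max_len=4, overlap_len=1)
```

Overlap keeps some context on both sides of a chunk boundary, which helps downstream extraction OPs that work chunk by chunk.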

Image

  • image_face_blur_mapper: blurs faces detected in images. #249
  • image_nsfw_filter: keeps samples containing images with NSFW scores below the threshold. #252
  • image_watermark_filter: keeps samples containing images with predicted watermark probabilities below the threshold. #256
  • ray_image_deduplicator: supports Ray-based distributed exact-match deduplication for image or image-text datasets. #263
  • image_pair_similarity_filter: keeps image pairs with image feature cosine similarity within the specified range based on a CLIP model. #393
  • image_tagging_mapper: generates image tags from the input images. #423
  • image_face_count_filter: keeps samples containing images with face counts within the specified range. #446

Video

  • video_face_blur_mapper: blurs faces detected in videos. #253
  • video_remove_watermark_mapper: removes the watermarks in given regions from the videos. #236
  • video_nsfw_filter: keeps samples containing videos with NSFW scores below the threshold. #252
  • video_watermark_filter: keeps samples containing videos with predicted watermark probabilities below the threshold. #256
  • ray_video_deduplicator: supports Ray-based distributed exact-match deduplication for video or video-text datasets. #263
  • video_tagging_from_frames_filter: keeps samples containing videos with given tags. #260
  • video_captioning_from_frames_mapper: generates samples whose captions are generated based on an image-to-text model and sampled video frames. Captions from different frames will be concatenated into a single string. #257
  • video_captioning_from_summarizer_mapper: generates video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, ...). #250
  • video_motion_score_raft_filter: keeps samples with video motion scores (based on RAFT model) within a specific range. #478
  • Enhance the video_motion_score_filter to support float sampling FPS, frame resizing, optical flow magnitude normalization, and so on. #361

Misc.

  • Switch face detection used in 3 OPs (image_face_ratio_filter, image_face_blur_mapper, video_face_blur_mapper) from dlib to OpenCV to avoid dependency problems. #320
  • Deduplicators for multimodal datasets are allowed to consider text information as well. #313
  • Support batched processing for some OPs. #406 #435

Others (Engine, Job Control and Tools)

  • Support more multimodal (video) dataset conversion tools: MSR-VTT #248
  • Support distributed processing script for Slurm. #242
  • Support Minhash-LSH deduplication tools based on Spark. #290
  • Enable GPU usage for Ray executor. #274
  • Add debug mode for Data-Juicer. #303
  • Add video generation tools for several metrics. #273 #312
  • Deploy a self-hosted runner for unit tests and enable unit tests for Ray mode. #304
  • Add sampled frames from videos for video OPs to support OP fusion. #271
  • Allow saving stats for each OP separately by specifying their export paths. #309
  • Add a new field to record the source files of multimodal data when they are augmented or regenerated by some OPs, so it's convenient to trace back. #317
  • Support turbo mode to disable some processing-unrelated functions to maximize the processing speed and save resource utilization. #402
  • Update type annotations from jsonargparse to Pydantic. #422
  • Add a Monitor module to monitor the resource utilization during data processing for each OP. #429
  • Allow lazy importing for third-party libraries and installing dependencies if they are not installed. #414 #443
  • Allow batched processing for all OPs based on the single-sample version of compute_stats/process methods to avoid modifying them to a batched version manually. #448
  • Enable unit test coverage report. #460
  • Support invoking API models for interaction with OpenAI-compatible APIs. #463 #479
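
The automatic batching of single-sample methods mentioned above can be illustrated with a small wrapper. This sketch assumes a batch is a dict of equally long lists and a sample is a plain dict. The decorator name is hypothetical and is not Data-Juicer's internal helper.

```python
def batched(single_sample_fn):
    """Lift a function over one sample into a function over a batch."""

    def wrapper(batch):
        keys = list(batch)
        num_samples = len(batch[keys[0]]) if keys else 0
        # Convert the dict of lists to one dict per sample and process each.
        results = [
            single_sample_fn({key: batch[key][i] for key in keys})
            for i in range(num_samples)
        ]
        # Convert back to a dict of lists, including any newly added keys.
        out_keys = list(results[0]) if results else keys
        return {key: [r[key] for r in results] for key in out_keys}

    return wrapper


@batched
def add_text_length(sample):
    """Single-sample logic: record the text length as a new field."""
    return {**sample, "text_len": len(sample["text"])}


out_batch = add_text_length({"text": ["a", "bcd"]})
```

With such a wrapper, OP authors write only the per-sample logic while the framework still processes data batch by batch.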

Document Updates

  • Refine documentation system based on Sphinx. #245
  • Regular document updates. #234 #246
  • Update the class importing and document building logics for better automation. #299
  • Reorganize the operator documents for better reading. #472

Bugs Fixed

  • Fix the bug of non-existent videos returned by the video splitting function given a short duration. #243
  • Fix the bug that the produced multimodal data would be stored in nested dirs in different ops. #247
  • Fix some problems in demos. #244
  • Fix "Undefined punctuation_pattern" error in two OPs. #301
  • Fix error propagation so that exceptions and errors are re-raised to the upper level and the status code is returned to the system correctly. #287
  • Fix the bug where type hint checking for config files did not work. #302
  • Fix the bug where parameters in the base classes could not be parsed in some OPs. #311
  • Fix the memory leak in video OPs. #374
  • Fix the bug where two OPs (video_aesthetics_filter and image_diffusion_mapper) could not make use of GPUs. #389
  • Fix the bug of checkpoints not being restored correctly when the current process list has fewer OPs than the previous one. #391

Acknowledgment

Here we thank public contributors for their PRs to make Data-Juicer better!

  • @chg0901 helps to fix typos in documents. #237
  • @lingzhq helps to update the paper list in Awesome Data-Model Co-Development of MLLMs. #289
  • @shiweijiezero helps fix the bugs in updating the data keys. #300
  • @seanzhang-zhichen helps to support multiple patterns for replace_content_mapper. #319
  • @simplaj helps to fix a bug of a non-predefined attribute for video_captioning_from_summarizer_mapper. #343
  • @zhenqincn helps to reorganize the paper list and add more papers from our survey in Awesome Data-Model Co-Development of MLLMs. #352 #381 #456 #461
  • @2108038773 helps to add trust_remote_code argument for some public models on HuggingFace. #382 #385
  • @TobyJasper helps to fix typos in documents and contribute a new OP image_face_count_filter. #392 #452
  • @co63oc helps to fix some typos in documents and code. #427

- Python
Published by yxdyc over 1 year ago

https://github.com/modelscope/data-juicer - Release v0.2.0: Multimodal Support & DJ-SORA

New Features

  • 🚀 We introduce DJ-SORA to provide open large-scale, high-quality datasets for SORA-like models. #227
  • 🚀 We introduce hundreds of dedicated video, image, audio, text, and other multi-modal data processing operators and tools.
  • 💥 Our paper has been accepted by SIGMOD'24 industrial track! #211
  • 💥 "BetterMixture": our second data-centric LLM competition has kicked off and will end soon. #174

New OPs

Multimodal

  • video_frames_text_similarity_filter: keeps samples whose similarities between sampled video frame images and text are within a specific range. #227
  • video_tagging_from_frames_mapper: generates video tags from frames extracted from the video. #227
  • video_tagging_from_audio_mapper: generates video tags from audio streams extracted from videos. #227
  • video_captioning_from_video_mapper: generates captions from frame images extracted from video to augment datasets. #227
  • video_captioning_from_audio_mapper: captions a video according to its audio streams. #227
  • image_captioning_mapper: generates captions based on a language model and the image. This OP will increase the number of samples in the dataset. #131 #191 #227
  • image_captioning_from_gpt4v_mapper: generates captions based on GPT-4-Vision and the image. This OP will increase the number of samples in the dataset. #214 #227
  • image_diffusion_mapper: generates and augments the images based on the Stable Diffusion model and their original images and texts. This OP will increase the number of samples in the dataset. #200

Video

Filter

  • video_duration_filter: keeps samples whose videos' durations are within a specified range. #227

  • video_aspect_ratio_filter: filters samples according to the aspect ratios of videos (a fraction of width by height, r=w/h) in them. #227

  • video_resolution_filter: filters samples according to the resolution of videos in them. #227

  • video_ocr_area_ratio_filter: keeps samples whose detected text area ratios for specified frames in the video are within a specified range. #227

  • video_aesthetics_filter: filters samples according to the aesthetics score of frame images extracted from videos. #227

  • video_motion_score_filter: keeps samples with video motion scores within a specific range. #227
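
The range-based filters above share one pattern: compute a statistic per sample, then keep the sample if the statistic lies within a configured range. The sketch below illustrates this for the aspect ratio r = w/h. It is a minimal illustration under stated assumptions, not the code of video_aspect_ratio_filter: the function names and the default bounds (9/21 and 21/9) are hypothetical.

```python
from fractions import Fraction


def aspect_ratio(width: int, height: int) -> Fraction:
    # r = w / h, kept as an exact fraction to avoid float rounding at the bounds.
    return Fraction(width, height)


def keep_by_aspect_ratio(
    width: int,
    height: int,
    min_ratio: Fraction = Fraction(9, 21),  # hypothetical lower bound
    max_ratio: Fraction = Fraction(21, 9),  # hypothetical upper bound
) -> bool:
    # Keep the sample only if its ratio falls inside the closed range.
    return min_ratio <= aspect_ratio(width, height) <= max_ratio


keep_full_hd = keep_by_aspect_ratio(1920, 1080)   # 16/9 is inside the range
keep_ultra_wide = keep_by_aspect_ratio(3000, 1000)  # 3/1 exceeds 21/9
```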

Mapper

  • video_split_by_scene_mapper: splits videos into scene clips. #227

  • video_split_by_duration_mapper: splits videos by specified duration interval. #227

  • video_split_by_key_frame_mapper: splits videos by their keyframes. #227

  • video_resize_aspect_ratio_mapper: resizes aspect ratios of videos (a fraction of width by height, r=w/h) to a specified range. #227

  • video_resize_resolution_mapper: maps videos to ones with a given resolution range. #227

  • video_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to video data more conveniently. #227

Deduplicator

  • video_deduplicator: deduplicates samples at document-level using exact matching of videos between documents. #227
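
Exact-match deduplication of the kind described for video_deduplicator can be sketched by fingerprinting each video's raw bytes and keeping only the first sample per fingerprint. This is an assumption-laden sketch rather than the OP's actual code: the choice of SHA-256, the function names, and the "video_bytes" sample field are all hypothetical.

```python
import hashlib
from typing import Dict, List


def video_fingerprint(video_bytes: bytes) -> str:
    # Identical byte content always yields an identical digest.
    return hashlib.sha256(video_bytes).hexdigest()


def deduplicate(samples: List[Dict]) -> List[Dict]:
    """Keep the first sample for each distinct video fingerprint."""
    seen = set()
    kept = []
    for sample in samples:
        fingerprint = video_fingerprint(sample["video_bytes"])
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(sample)
    return kept


samples = [
    {"id": 0, "video_bytes": b"\x00\x01\x02"},
    {"id": 1, "video_bytes": b"\x00\x01\x02"},  # exact duplicate of id 0
    {"id": 2, "video_bytes": b"\x03\x04\x05"},
]
unique_samples = deduplicate(samples)
```

Because only byte-identical videos collide, re-encoded or trimmed copies of the same clip are not detected by this approach.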

Audio

  • audio_duration_filter: keeps samples whose audios' durations are within a specified range. #177

  • audio_size_filter: keeps samples whose audios' sizes are within a specified range. #184

  • audio_nmf_snr_filter: keeps samples whose audios' signal-to-noise ratios (computed with the Non-Negative Matrix Factorization algorithm) are within a specified range. #189

  • audio_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to audio data more conveniently. #227

Image

  • image_blur_mapper: adds random noises to images to blur them. #180

  • image_aesthetics_filter: filters samples according to the aesthetics scores of images. #227

Document Updates

  • "Bad" Data Exhibition EN ZH: shows how Data-Juicer finds "bad" data and what they look like.

  • Awesome LLM Data EN: a collection of awesome LLM datasets with fine-grained tags.

  • Developer Guide enhancement EN ZH: adds guides on how to accelerate the models in your OP with GPUs and how to implement a batched OP for sample augmentation. #203 #220

  • OP Insight Visualization Demo code: adds a demo to visualize how each OP works.

Bugs Fixed

  • Fix stats computation error in the ray mode due to the inappropriate initialization method. #173

  • Fix the bug that some images will be lost when converting their paths to absolute paths. #178

  • Fix the dependency problems of OPs that depend on other OPs. #181

  • Fix the bug that the predict.py tool gets stuck on the help page. #183

  • Fix face_area_filter: constrains the detection coordinates within the image. #202

  • Fix MMC4 conversion tools: resolves the situation where multiple images match the same sentence. #195

  • Fix or update invalid links in Data-Juicer. #201 #219

Others

  • Optimize the model management module. #196 #227

  • Optimize the unit test actions. #195 #196 #216 #227

  • Optimize the multiprocessing strategy; model inference efficiency can now be improved thanks to GPU support. #203 #217 #222 #227

  • Update the docker image with JDK. #208

  • Support more multimodal (video) dataset conversion tools: #227

    • InternVid: 234M video-caption data
    • Youku-mPLUG: 36TB video-caption data
    • Video-ChatGPT: 100k video-instruction data
  • Optimize the generated multimodal data storage. #227

  • Support running data-juicer process jobs on Aliyun PAI-DLC. #227

  • Better support for multi-machine distributed data processing in Ray mode. #227

Acknowledgment

Here we thank public contributors for their PRs to make Data-Juicer better!

  • @liuyanyi helps to fix a bug in quality classifier tools. #183

  • @co63oc helps to fix some typos. #215

  • @liuyanyi helps to provide the solution to add JDK in the docker image. #182 #208

  • @zhenqincn helps to add more papers to the Awesome LLM Data doc. #226

- Python
Published by HYLcool about 2 years ago

https://github.com/modelscope/data-juicer - Release v0.1.3: support more Python versions; support multimodal data; more OPs; bugs fixed

New Features

  • Data-Juicer now supports Python 3.7-3.10!
    • We released a pybind version of the simhash-py library named simhash-pybind to solve the Python version limitation problem.
    • We tested several version-dependent third-party libraries (e.g. dill, kenlm, ...) and validated their availability on different Python versions.
  • Multimodal dataset analysis and processing are now supported. #64 #91 #95 #106
    • A novel intermediate multimodal sample format: using some special tokens to split text chunks and represent non-text information.
    • Several dataset format conversion tools for popular multimodal datasets: LLaVA, MMC4, WavCaps, and more.
    • Lots of multimodal OPs are also released: see categories Image and Multimodal in the section New OPs below.
  • Auto-HPO tools are now available, which can help users find better hyperparameters for OPs according to specified objective functions or with simple 3-sigma rules only. #65 #140
  • Some content cleaning mappers (e.g. email, IP, ...) now support replacing regex patterns with specified strings, not just with empty ones. Additionally, a general version OP is implemented as a new OP replace_content_mapper. #143
  • Some collectors, metrics, and drawing functions are added to the analysis module to help users measure the token distribution of a single dataset or distribution difference between different datasets. #160

New OPs

Text

  • chinese_convert_mapper: converts Chinese between Traditional Chinese, Simplified Chinese, and Japanese Kanji (by opencc) #51
  • remove_non_chinese_character_mapper: removes non-Chinese characters in text samples. #51
  • text_action_filter: keeps samples containing action verbs in their texts. #122
  • text_entity_dependency_filter: keeps samples containing entity nouns related to other tokens in the dependency tree of the texts. #122
  • replace_content_mapper: replaces all content in the text that matches a specific regular expression pattern with a designated replacement string. #143
  • remove_repeat_sentences_mapper: removes repeated sentences in the text. #149

Image

  • image_shape_filter: keeps samples containing images with widths and heights within the specified ranges. #74
  • image_aspect_ratio_filter: keeps samples containing images with aspect ratios (w/h) within the specified range. #64
  • image_size_filter: keeps samples containing images whose sizes in bytes are within the specified range. #73
  • face_area_filter: keeps samples containing images with face area ratios within the specified range. #110
  • image_deduplicator: deduplicates samples at document-level using exact matching of images between documents. #72

Multimodal

  • image_text_similarity_filter: keeps samples with image-text feature cosine similarity within the specified range based on a CLIP model. #69
  • image_text_matching_filter: keeps samples with image-text classification matching scores within the specified range based on a BLIP model. #100
  • phrase_grounding_recall_filter: keeps samples whose locating/grounding recalls of phrases extracted from text in the images are within a specified range. #139

Bugs fixed

  • Pin pandas==2.0.0 and fsspec==2023.3.0 to avoid unexpected errors from third-party dependencies. #38 #42
  • Fix the bug where OPs nlpaug_en_mapper and nlpcda_zh_mapper generate an indeterminate number of augmented samples. #76
  • Fix the bug where maximum_line_length_filter might generate stats of inconsistent types (int vs. float), which leads to an error when processing datasets. #147
  • Fix the bug of missing attribute dataset_dir when the input dataset path is remote or a mixture of several datasets. #155 #157
  • Fix the bug of commandline arguments parsing error in some cases. #108 #165
  • Store simhash value as string type to avoid errors from PyArrow. #168 #170

Others

  • Dependency importing optimization: only require and import some dependencies when using. #35 #82
  • Release demos and datasets on HuggingFace, and release models trained with our refined datasets on both ModelScope and HuggingFace. #42 #54
  • Optimize the cache directory selection logic. #43
  • Support limiting the number of samples when mixing datasets. #86
  • Avoid extra unnecessary model preparation when enabling tokenization in some OPs. #99
  • OP language_id_score_filter supports keeping samples in multiple languages now. #125 #151

Acknowledgement

Here we thank public contributors for their PRs to make Data-Juicer better!

  • @JONGSKY helps to remove some unnecessary code. #85
  • @xuruidong helps to fix several broken links in the README doc. #142

- Python
Published by HYLcool over 2 years ago

https://github.com/modelscope/data-juicer - Release v0.1.2: more core functions are available now.

New OPs

  • nlpaug_en_mapper: simple data augmentation using nlpaug library for English corpus. #17
  • nlpcda_zh_mapper: simple data augmentation using nlpcda library for Chinese corpus. #17
  • token_num_filter: filter out samples by the number of tokens in them. HF tokenizers are supported. #24

New features

  • OP Fusion #14
    • Filters that share the same contextual variables can now be fused into one OP, saving up to 25% of the time when processing datasets.
  • Cache management #19
    • Cache management now works in Data-Juicer thanks to a new serialization method.
    • Cache compression is supported: caches are automatically compressed when they are not in use and decompressed when needed, which saves up to 50% disk space.
  • Distributed data processing with Ray is supported now. #21
  • Config system optimization:
    • Only keep text_keys and remove previous misleading arg text_key(s)_to_process/load. #13
    • A new argument export_in_parallel is added to control whether to export the result datasets in parallel. #17
    • Display the config table after config parsing is ready. #17
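
The idea behind OP Fusion can be sketched as follows: when several Filters need the same contextual variable (here, the word list of a text), a fused OP computes it once and shares it through a per-sample context. This is only a sketch under assumptions, not Data-Juicer's fusion code: every name below (get_words, fuse, SPLIT_CALLS, the two filters, and their thresholds) is hypothetical.

```python
from typing import Callable, Dict, List

Context = Dict[str, List[str]]
FilterFn = Callable[[str, Context], bool]

# Counts how often the shared intermediate result is actually computed.
SPLIT_CALLS = {"n": 0}


def get_words(text: str, context: Context) -> List[str]:
    # Compute the word list once per sample and cache it in the context.
    if "words" not in context:
        SPLIT_CALLS["n"] += 1
        context["words"] = text.split()
    return context["words"]


def word_count_filter(text: str, context: Context, min_words: int = 3) -> bool:
    return len(get_words(text, context)) >= min_words


def avg_word_length_filter(text: str, context: Context, max_avg: float = 10.0) -> bool:
    words = get_words(text, context)
    return bool(words) and sum(len(w) for w in words) / len(words) <= max_avg


def fuse(filters: List[FilterFn]) -> Callable[[str], bool]:
    """Combine filters into one OP that shares a single context per sample."""

    def fused(text: str) -> bool:
        context: Context = {}
        return all(f(text, context) for f in filters)

    return fused


fused_filter = fuse([word_count_filter, avg_word_length_filter])
keep_long = fused_filter("a quick brown fox")  # splitting happens only once
keep_short = fused_filter("hi")                # rejected by word_count_filter
```

Run separately, the two filters would each split the text; fused, the split happens once per sample, which is where the time saving comes from.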

Others

  • Replace original string constants with constant enums. #13
  • Expand the checkpoint protection range to cover the exporting process. #14
  • Remove extra intermediate variables storage in document_simhash_deduplicator to save more memory. #14
  • Docs updates. #15 #16
  • PyPI package is available. You can install data-juicer by pip install py-data-juicer now. #23
  • Docker building is available now. The official docker image for Docker Hub is in progress. #23
  • Deploy the unit tests for Data-Juicer. #29

- Python
Published by HYLcool over 2 years ago

https://github.com/modelscope/data-juicer - Release v0.1.0, the first internal version for open-source

Summarization - Table of Contents

  • Data-Juicer: A Data-Centric Text Processing System for Large Language Models
  • Table of Contents
    • Features
    • Prerequisites
    • Installation
    • Quick Start
      • Data Processing
      • Data Analysis
      • Data Visualization
      • Build Up Config Files
      • Preprocess raw data (Optional)
    • Documentation
    • Data Recipes
    • Demos
    • License
    • Contributing
    • References

Features

  • Broad Range of Operators: Equipped with 50+ core operators (OPs), including Formatters, Mappers, Filters, Deduplicators, and beyond.

  • Specialized Toolkits: Feature-rich specialized toolkits such as Text Quality Classifier, Dataset Splitter, Analysers, Evaluators, and more that elevate your dataset handling capabilities.

  • Systematic & Reusable: Empowering users with a systematic library of reusable config recipes and OPs, designed to function independently of specific datasets, models, or tasks.

  • Data-in-the-loop: Allowing detailed data analyses with an automated report generation feature for a deeper understanding of your dataset. Coupled with real-time multi-dimension automatic evaluation capabilities, it supports a feedback loop at multiple stages in the LLM development process.

  • Comprehensive Processing Recipes: Offering tens of pre-built data processing recipes for pre-training, SFT, en, zh, and more scenarios.

  • User-Friendly Experience: Designed for simplicity, with comprehensive documentation, easy start guides and demo configs, and intuitive configuration with simple adding/removing OPs from existing configs.

  • Flexible & Extensible: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to implement your own OPs for customizable data processing.

  • Enhanced Efficiency: Providing a speedy data processing pipeline requiring less memory, optimized for maximum productivity.

- Python
Published by yxdyc almost 3 years ago