Recent Releases of https://github.com/OpenDCAI/DataFlow
https://github.com/OpenDCAI/DataFlow - Dataflow v1.0.5 Release Note
DataFlow v1.0.5 Key Feature Updates
- Add General Reasoning Pipeline : Add a new pipeline to support general reasoning data and DIY prompts, fix some bugs, and rework some reasoning ops by @scuuy in https://github.com/OpenDCAI/DataFlow/pull/137
- Add Batch Wrapper : Upload batch_wrapper for batching an operator in a pipeline. by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/157
- Pandas Operator Release : Release GeneralFilter for pandas by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/170
- Add Multiturn Function Call Operators : Add example data for FuncCallPipeline & rename MultiTurnDialogueGenerator by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/136
- Add Math Problem Extractor : Add VQAServing, Add mathbookpromblemextractor to KBC Pipeline by @HeRunming in https://github.com/OpenDCAI/DataFlow/pull/152
- Refine General Text Operators : Customizable prompt for sft generators by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/139
- Fix Local Serving Bug : Fix Local Model Serving, apply chat_template to sys & user prompt by @haolpku in https://github.com/OpenDCAI/DataFlow/pull/158
- Speed Up Text2SQL Pipeline : Reconstruct the database manager to improve the efficiency of the text2sql pipeline by @TechNomad-ds in https://github.com/OpenDCAI/DataFlow/pull/174
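The batch wrapper entry above describes running one operator over a dataset in batches. The following is a minimal sketch of that general idea only; `BatchedOperator`, `iter_batches`, and `uppercase_operator` are hypothetical names invented for illustration and are not DataFlow's actual `batch_wrapper` API.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(rows: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


class BatchedOperator:
    """Hypothetical wrapper: runs `operator` on one batch at a time."""

    def __init__(self, operator: Callable[[Sequence[str]], List[str]], batch_size: int):
        self.operator = operator
        self.batch_size = batch_size
        self.calls = 0  # how many times the wrapped operator was invoked

    def run(self, rows: Sequence[str]) -> List[str]:
        results: List[str] = []
        for batch in iter_batches(rows, self.batch_size):
            results.extend(self.operator(batch))
            self.calls += 1
        return results


def uppercase_operator(batch: Sequence[str]) -> List[str]:
    """Toy operator standing in for a real pipeline step."""
    return [text.upper() for text in batch]


wrapped = BatchedOperator(uppercase_operator, batch_size=2)
output = wrapped.run(["a", "b", "c", "d", "e"])
```

With five rows and a batch size of 2, the wrapped operator is called three times, and the results are concatenated in input order. Batching like this mainly helps when each operator call is expensive or memory-bound.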
Notable Changes
- Add Dataflow WebUI : Add Gradio WebUI for all operators by @HeRunming in https://github.com/OpenDCAI/DataFlow/pull/169
- Add Dataflow-Agent WebUI : Add agent gradio UI by @DeepMindLiuZhou in https://github.com/OpenDCAI/DataFlow/pull/175
- Add MinerU for KBCPipeline : Add MinerU2.0 by @Niujunbo2002 in https://github.com/OpenDCAI/DataFlow/pull/132 and add support for fetching arXiv PDF links by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/171
- Add Sglang Support : Add `tensor_parallel` and `data_parallel` to `LocalLLMServing_sglang` by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/147
What's Changed
- add get_desc for all general text operators by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/133
- [Feature] GeneralFilter for GeneralText release! by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/135
- fix problem by @YqjMartin in https://github.com/OpenDCAI/DataFlow/pull/138
- add example data for FuncCallPipeline & rename MultiTurnDialogueGenerator by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/136
- add examples to get_desc by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/134
- SFT generators with customizable prompts by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/139
- (new) add new pipeline to support general reasoning data and diy prompt, and fix some bugs, reform some reasoning ops by @scuuy in https://github.com/OpenDCAI/DataFlow/pull/137
- Support MinerU2 for KnowledgeCleaning by @Niujunbo2002 in https://github.com/OpenDCAI/DataFlow/pull/132
- [serving] set default `vllm_seed` param for `LocalModelLLMServing_vllm` to `None` to avoid warning by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/143
- Fix GPU reasoning pipeline bug by @scuuy in https://github.com/OpenDCAI/DataFlow/pull/145
- refine the get_desc func for each operator for text2sql pipeline by @TechNomad-ds in https://github.com/OpenDCAI/DataFlow/pull/142
- [Serving] Add `tensor_parallel` and `data_parallel` to `LocalLLMServing_sglang` by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/147
- Add get_desc function to all text operators by @scuuy in https://github.com/OpenDCAI/DataFlow/pull/146
- Fix storage column parsing error, expand the data field into the dataframe, adjust the version, and fix a bug in the AnswerNgramFilter operator by @leaderwolfpipi in https://github.com/OpenDCAI/DataFlow/pull/115
- add medical pipeline, generated by agent by @DeepMindLiuZhou in https://github.com/OpenDCAI/DataFlow/pull/148
- [serving] add sglang for all scripts for option by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/150
- implement kbc batch process operators and pipeline by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/151
- [Serving, KBC]Add VQAServing, Add mathbookpromblemextractor to KBC Pipeline. by @HeRunming in https://github.com/OpenDCAI/DataFlow/pull/152
- fix bug for RemoveEmojiRefiner by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/153
- fix bugs in batch_kbc by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/156
- [batchwrapper] upload batchwrapper for batching an operator in a pipeline. by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/157
- [Serving] Fix Local Model Serving, apply chat_template to sys & user prompt by @haolpku in https://github.com/OpenDCAI/DataFlow/pull/158
- add API-based languagefilter & customized MetaScorer by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/161
- fix quickstart bug by @haolpku in https://github.com/OpenDCAI/DataFlow/pull/162
- Unify the embedding attribute names and adjust the SQLVariationGenerator operator's filling logic so results are added to the original data by @leaderwolfpipi in https://github.com/OpenDCAI/DataFlow/pull/160
- add publications by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/163
- Fix forward-compatibility issues of other operators in the reasoning pipeline by @leaderwolfpipi in https://github.com/OpenDCAI/DataFlow/pull/165
- add desc for func call & add statics for meta score by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/168
- [webui] Add Gradio WebUI for experiencing all operators. by @HeRunming in https://github.com/OpenDCAI/DataFlow/pull/169
- [Feature] PandasOperator release! [Update] GeneralFilter updated by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/170
- Support for fetching arxiv pdf links by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/171
- add new reasoning operator “answermodeljudge”, to check reference answer via llm by @scuuy in https://github.com/OpenDCAI/DataFlow/pull/172
- [WebUI] Add API Pipeline UI by @HeRunming in https://github.com/OpenDCAI/DataFlow/pull/173
- Add agent gradio UI by @DeepMindLiuZhou in https://github.com/OpenDCAI/DataFlow/pull/175
- reconstruct the database manager to improve the efficiency of the text2sql pipeline by @TechNomad-ds in https://github.com/OpenDCAI/DataFlow/pull/174
- fix bug by @TechNomad-ds in https://github.com/OpenDCAI/DataFlow/pull/176
- Update Gradio and Bug Fix by @DeepMindLiuZhou in https://github.com/OpenDCAI/DataFlow/pull/177
- add operator in readme by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/178
- change kbc script in playground & manage kbc pipelines by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/179
- Unify backend and frontend by @DeepMindLiuZhou in https://github.com/OpenDCAI/DataFlow/pull/180
- add mathbook extract to playground by @HeRunming in https://github.com/OpenDCAI/DataFlow/pull/181
- add gradio in readme by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/182
- add safety checks in fetching pdf by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/184
- Add a fix for multi-turn dialogues where some generated user turns were missing an assistant reply by @Arunshmily in https://github.com/OpenDCAI/DataFlow/pull/185
New Contributors
- @Niujunbo2002 made their first contribution in https://github.com/OpenDCAI/DataFlow/pull/132
- @Arunshmily made their first contribution in https://github.com/OpenDCAI/DataFlow/pull/185
Full Changelog: https://github.com/OpenDCAI/DataFlow/compare/v1.0.4...v1.0.5
- Python
Published by haolpku 10 months ago
https://github.com/OpenDCAI/DataFlow - Dataflow v1.0.4 Release Notes
DataFlow v1.0.4 Key Feature Updates
- Automatic Operator Code Generation: Introduced new features for automatic operator code generation by @DeepMindLiuZhou (PR #61).
- Myscale Storage Support: Added support for myscale storage by @leaderwolfpipi (PR #60).
- Dialogue Function Generation: Implemented a function to generate from conversations by @MOLYHECI (PR #59).
- QA Generator and Translator: Added a QA generator and translation feature by @haolpku (PR #65).
- Text2SQL Pipeline Update: Refactored the text2sql pipeline by @TechNomad-ds (PR #113).
- AgenticRAG Pipeline Enhancements: Enhanced the AgenticRAG pipeline to fully support embedding models by @wongzhenhao (PR #86).
- Lazy Load Framework Support: The entire framework now supports lazy loading, significantly improving loading speeds, by @MOLYHECI (PR #87).
- GeneralText Optimization: @zzy1127 optimized information related to GeneralText. #102 #112 #125
- Removal of Legacy Code: @HeRunming removed outdated code logic from the repository. #118
Notable Changes
- Operator Naming Rules: Updated the naming rules for all operators by @SunnyHaze (PR #81).
- FuncCall Pipeline: Introduced a new FuncCall Pipeline by @MOLYHECI (PR #88).
- Batch PDF Extractor: Added functionality for batch PDF extraction by @haolpku (PR #111).
- Bug Fixes and Improvements: Various contributors, including @YqjMartin and @ZhaoyangHan04, worked on code refactoring, dependency fixes, and bug resolutions.
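The lazy loading mentioned in the key feature updates defers imports until an operator is first used. The sketch below shows one common way to do this in Python, under the assumption that names map to importable module paths. `LazyNamespace` and the two registered names are hypothetical and use standard-library modules as stand-ins; this is not DataFlow's actual loader.

```python
import importlib
from types import ModuleType
from typing import Dict, List


class LazyNamespace:
    """Hypothetical registry that imports a module only on first attribute access."""

    def __init__(self, name_to_module: Dict[str, str]):
        self._name_to_module = name_to_module
        self._loaded: Dict[str, ModuleType] = {}

    def __getattr__(self, name: str) -> ModuleType:
        # Only called when normal attribute lookup fails. Private names are
        # rejected early so a half-initialized object cannot recurse here.
        if name.startswith("_") or name not in self._name_to_module:
            raise AttributeError(name)
        if name not in self._loaded:
            self._loaded[name] = importlib.import_module(self._name_to_module[name])
        return self._loaded[name]

    def loaded_names(self) -> List[str]:
        """Names whose modules have actually been imported through this registry."""
        return sorted(self._loaded)


# Standard-library modules stand in for heavy operator modules.
ops = LazyNamespace({"json_tools": "json", "stats_tools": "statistics"})
before = ops.loaded_names()                   # nothing resolved yet
mean_value = ops.stats_tools.mean([1, 2, 3])  # triggers the import of `statistics`
after = ops.loaded_names()
```

Only `stats_tools` is resolved after the call; `json_tools` stays unloaded until something touches it. That is where the startup saving comes from when a package registers many operators with heavy dependencies.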
What's Changed
- Dataflow agent new features for automatic operator code generation by @DeepMindLiuZhou in https://github.com/OpenDCAI/DataFlow/pull/61
- Support myscale storage by @leaderwolfpipi in https://github.com/OpenDCAI/DataFlow/pull/60
- Add function generate from conversations (dialogue) by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/59
- add QA generator and translator by @haolpku in https://github.com/OpenDCAI/DataFlow/pull/65
- change face and add acknowledgements by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/68
- change face by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/69
- delete the api aisuite (fix #32) by @scuuy in https://github.com/OpenDCAI/DataFlow/pull/70
- Rename all operators naming rules. by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/81
- adding missing numpy import by @JimmyAwoe in https://github.com/OpenDCAI/DataFlow/pull/76
- [Rename] unused file deleted by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/82
- rename RARE operators by @mi-iro in https://github.com/OpenDCAI/DataFlow/pull/83
- [Update] APILLMServing_request now supports embedding models & AgenticRAG pipeline fully supports API requests by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/86
- Support litellm by @Sucran in https://github.com/OpenDCAI/DataFlow/pull/84
- Add Lazyloader feature for GeneralText by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/87
- Dataflow agent by @DeepMindLiuZhou in https://github.com/OpenDCAI/DataFlow/pull/91
- fix dependency conflicts in kbc pipeline by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/89
- solve issue #92 and #85 by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/94
- add TYPE_CHECKING if-else for VSCode static check by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/93
- [oper] rename `promptgenerator` to `promptedgenerator` by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/95
- [Update] AgenticRAG pipeline now supports APILLMServing for embedding by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/96
- [Update] AgenticRAG pipeline now supports APILLMServing for embedding models by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/97
- reduce logger content by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/98
- Add auto generate importstructure function & fix import issues for dataflow/statics/ by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/99
- Add FuncCall Pipeline by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/88
- add prompts for consistentchat and fix some bugs by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/102
- Add local QA generation and translation by @haolpku in https://github.com/OpenDCAI/DataFlow/pull/104
- Dataflow agent update, with demo for writing some operators by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/105
- fix translation bug and add data by @haolpku in https://github.com/OpenDCAI/DataFlow/pull/107
- fix agentic RAG problem and add eval operators by @YqjMartin in https://github.com/OpenDCAI/DataFlow/pull/106
- add abbreviation module by @haolpku in https://github.com/OpenDCAI/DataFlow/pull/108
- [storage] add error logging when step is not called before the first run. by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/110
- add batch pdf extractor by @haolpku in https://github.com/OpenDCAI/DataFlow/pull/111
- modify code position by @YqjMartin in https://github.com/OpenDCAI/DataFlow/pull/109
- [register] update register which could return type of operators by `get_type_of_operator` by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/112
- update readme by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/114
- update readme about agent by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/117
- fix import bugs for sub-folder used operators by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/116
- remove outdated function in dataflow/utils/utils.py by @HeRunming in https://github.com/OpenDCAI/DataFlow/pull/118
- modify file path and remove redundant file by @YqjMartin in https://github.com/OpenDCAI/DataFlow/pull/121
- Delete Operator.json by @DeepMindLiuZhou in https://github.com/OpenDCAI/DataFlow/pull/120
- add sft syn pipeline by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/122
- new rename generators by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/125
- Move SFT synthesis into the playground by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/126
- [Update] Improve AgenticRAG code readability by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/129
- update text2sql pipeline by @TechNomad-ds in https://github.com/OpenDCAI/DataFlow/pull/113
- fix the db not exist bug by @TechNomad-ds in https://github.com/OpenDCAI/DataFlow/pull/131
New Contributors
- @JimmyAwoe made their first contribution in https://github.com/OpenDCAI/DataFlow/pull/76
- @Sucran made their first contribution in https://github.com/OpenDCAI/DataFlow/pull/84
Full Changelog: https://github.com/OpenDCAI/DataFlow/compare/v1.0.3...v1.0.4
Published by SunnyHaze 11 months ago
https://github.com/OpenDCAI/DataFlow - Dataflow v1.0.3 Release Notes
What's changed
- Update more scorers (operators) to `GeneralText` pipeline. (#38 and #48). Thanks @zzy1127 @MOLYHECI
- Update more operators to `AgenticRAG` pipeline. (#50, #41). Thanks @wongzhenhao @YqjMartin
- Revise APIKEY env variable passing logic in the `APIServing` class. The default variable is `DF_API_KEY` to avoid conflicts (#57). Thanks @SunnyHaze
- Rename `llmserving` to `serving` for future extension of other kinds of web services. #44. Thanks @SunnyHaze
- Update the Readme. (#40, #52, #53) Thanks @Qmeiyi
- Fix some bugs and parameter issues in `AgenticRAG` pipeline. #49. Thanks @TheRoadQaQ
- Fix some bugs and parameter issues in `Knowledge base cleaning` pipeline. #47. Thanks @ZhaoyangHan04
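The API key change above moves the key into a dedicated environment variable so it does not collide with keys used by other tools. As a minimal sketch of that pattern, assuming the `DF_API_KEY` name given in the PR title in the detailed list: `read_api_key` is a hypothetical helper written for illustration, not a function from the `APIServing` class.

```python
import os
from typing import Mapping, Optional


def read_api_key(env: Optional[Mapping[str, str]] = None,
                 key_name: str = "DF_API_KEY") -> str:
    """Return the API key stored under `key_name`, or raise a clear error."""
    source = os.environ if env is None else env
    value = source.get(key_name, "").strip()
    if not value:
        raise RuntimeError(f"environment variable {key_name} is not set")
    return value


# A dict stands in for the process environment so the example is reproducible.
fake_env = {"DF_API_KEY": "sk-example", "OPENAI_API_KEY": "other-tool-key"}
key = read_api_key(fake_env)
```

Because the lookup uses its own variable name, a key exported for another tool (here the placeholder `OPENAI_API_KEY`) is never picked up by accident. In a real shell session, the variable would be exported before the pipeline script is launched.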
Detailed list for all changed PRs
- update readme by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/40
- [New Operators] A lite implementation of OPPO TaskCraft by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/41
- add scorers by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/38
- [update] rename `llmserving` to `serving` to fit future extension by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/44
- agentic rag para revise by @TheRoadQaQ in https://github.com/OpenDCAI/DataFlow/pull/49
- add remaining operators by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/48
- normalize file path and params by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/47
- update readme by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/52
- update readme by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/53
- Add some methods to improve agenticRAG generation by @YqjMartin in https://github.com/OpenDCAI/DataFlow/pull/50
- [serving] set default API serving key to `DF_API_KEY` and this key ca… by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/57
Full Changelog: https://github.com/OpenDCAI/DataFlow/compare/v1.0.2...v1.0.3
Published by SunnyHaze 11 months ago
https://github.com/OpenDCAI/DataFlow - Dataflow v1.0.2 Release Notes
New features
- Add implementation of Dataflow Agents #34. Thanks @DeepMindLiuZhou
Debug
- Fix get-desc issue #35. Thanks @leaderwolfpipi
- Fix including bug for `/example/KBC/test.doc` and `/example/KBC/test.pdf` in manifest.ini. Thanks @SunnyHaze
Published by SunnyHaze 11 months ago
https://github.com/OpenDCAI/DataFlow - Dataflow v1.0.1 Release Notes
New features
- add RARE pipeline (#33) @mi-iro
- add API calling to `text pipeline`, i.e. `test_sft_filter.py` (#29) @zzy1127
Thanks for your contribution.
Debug
Fix the PyPI issue that made `pip install open-dataflow` fail. @SunnyHaze. Thanks to @leaderwolfpipi for reporting this bug.
Published by SunnyHaze 11 months ago
https://github.com/OpenDCAI/DataFlow - Dataflow v1.0.0 Release Notes
🎉🎉🎉 We are thrilled to release our Data-centric AI system, DataFlow! 🎉🎉🎉
Version: v1.0.0
Modular and AI-assisted data preparation system for high-efficiency pipelines.
🚀 Introduction
DataFlow is a high-efficiency data preparation system composed of advanced operators and multi-stage data processing pipelines. It integrates rule-based methods, deep learning models, and large language models (LLMs) to provide a modular, scalable, and reconfigurable design.
It aims to improve the quality and efficiency of data cleaning, augmentation, and construction — supporting the development of next-generation large-scale models.
Designed for researchers and engineers working on data-centric AI, LLM training, and scalable data workflows.
🧠 Core Features
- 🔁 Modular Operator Design: Inspired by PyTorch, each operator is configurable and reusable.
- 🧩 Multi-stage Pipelines: Flexibly chain operators for end-to-end data processing.
- 🤖 Agent for DataFlow: LLM-powered automation for pipeline orchestration and operator generation.
- ⚙️ Hybrid Techniques: Seamlessly combines rule-based, neural, and LLM-based methods.
- 💾 Built-in Storage Layer: Manage intermediate data and caching.
- 🔌 LLM Backend Support: Easily plug into GPT-style backends with `LLMServing`.
🧱 Framework Overview
DataFlow consists of the following core modules:
| Module | Description |
|--------------|-----------------------------------------------------------------------------|
| operator | Basic data processing units, reusable across pipelines. |
| pipeline | Manages multi-step workflows by chaining multiple operators. |
| storage | Manages data cache, storage, and I/O between steps. |
| LLMServing | Integrates large models for reasoning, filtering, and generation. |
| Agent | Automatically generates, orchestrates, and manages data workflows. |
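To make the relationship between the operator, pipeline, and storage modules in the table concrete, here is a minimal sketch under stated assumptions. `Pipeline`, `strip_whitespace`, and `drop_short` are hypothetical names invented for illustration, and the in-memory `cache` list only stands in for a storage layer; none of this is DataFlow's actual API.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Operator = Callable[[List[Record]], List[Record]]


def strip_whitespace(records: List[Record]) -> List[Record]:
    """Rule-based refiner: trim the text of every record."""
    return [{**r, "text": r["text"].strip()} for r in records]


def drop_short(records: List[Record]) -> List[Record]:
    """Rule-based filter: keep records with at least three words."""
    return [r for r in records if len(r["text"].split()) >= 3]


class Pipeline:
    """Hypothetical pipeline: runs operators in order and keeps each step's output."""

    def __init__(self, operators: List[Operator]):
        self.operators = operators
        self.cache: List[List[Record]] = []  # stand-in for the storage module

    def run(self, records: List[Record]) -> List[Record]:
        self.cache = [records]
        for op in self.operators:
            records = op(records)
            self.cache.append(records)  # intermediate result of this step
        return records


pipeline = Pipeline([strip_whitespace, drop_short])
result = pipeline.run([
    {"text": "  hello  "},
    {"text": " data preparation at scale "},
])
```

Each operator takes and returns a list of records, so operators can be reordered or reused across pipelines. The cache holds the input plus one snapshot per step, which is the kind of intermediate data a storage layer would persist.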
🛠️ Example Usage and Operators
To get started quickly with real examples, please refer to our documentation:
📘 Example Pipelines: Text Pipeline Tutorial
🧩 Available Operators: Operator Reference for Text Evaluation
These guides provide hands-on usage of core modules including Pipeline, Operator, and Agent, and demonstrate how to configure, extend, and run a complete data processing workflow using DataFlow.
🔍 Why DataFlow?
| Feature            | Benefit                                   |
|--------------------|-------------------------------------------|
| PyTorch-style API  | Easy to learn and integrate               |
| LLM + Rules + NN   | Flexible and powerful hybrid workflows    |
| Auto Agent Support | Reduces manual data prep burden           |
| Storage Layer      | Efficient checkpointing and result reuse  |
| Fully Modular      | Easy to extend, test, and compose         |
📫 Contact
For issues, contributions, or questions, feel free to reach out:
GitHub: https://github.com/OpenDCAI/DataFlow
Email: hao.liang@stu.pku.edu.cn
Published by SunnyHaze 11 months ago
https://github.com/OpenDCAI/DataFlow - Dataflow v0.0.3 Release Notes
First Release for Dataflow system
- Now the Dataflow codespace has been fully implemented with all features.
- You can easily experience our powerful data-centric system with the `pip install open-dataflow` and `dataflow init` commands.
Published by SunnyHaze 11 months ago