Recent Releases of https://github.com/OpenDCAI/DataFlow

https://github.com/OpenDCAI/DataFlow - Dataflow v1.0.5 Release Note

DataFlow v1.0.5 Key Feature Updates

  • Add General Reasoning Pipeline : Add a new pipeline to support general reasoning data and custom (DIY) prompts, fix some bugs, and rework some reasoning operators by @scuuy in https://github.com/OpenDCAI/DataFlow/pull/137
  • Add Batch Wrapper : Upload batch_wrapper for batching an operator in a pipeline. by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/157
  • Pandas Operator Release : Release GeneralFilter for pandas by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/170
  • Add Multiturn Function Call Operators : Add example data for FuncCallPipeline & rename MultiTurnDialogueGenerator by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/136
  • Add Math Problem Extractor : Add VQAServing, Add mathbookpromblemextractor to KBC Pipeline by @HeRunming in https://github.com/OpenDCAI/DataFlow/pull/152
  • Refine General Text Operators : Customizable prompt for sft generators by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/139
  • Fix Local Serving Bug : Fix Local Model Serving, apply chat_template to sys & user prompt by @haolpku in https://github.com/OpenDCAI/DataFlow/pull/158
  • Speed Up Text2SQL Pipeline : Reconstruct the database manager to improve the efficiency of the text2sql pipeline by @TechNomad-ds in https://github.com/OpenDCAI/DataFlow/pull/174
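
The batch wrapper item above describes running one operator over a dataset in batches. The following is a minimal sketch of that general idea only; `HypotheticalBatchWrapper`, `iter_batches`, and `add_length` are illustrative names and are not taken from the DataFlow codebase, whose actual batch_wrapper interface may differ.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(rows: Sequence[dict], batch_size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of `rows`, each holding at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


class HypotheticalBatchWrapper:
    """Illustrative only: applies a per-batch callable to a dataset chunk by chunk."""

    def __init__(self, operator: Callable[[Sequence[dict]], List[dict]], batch_size: int):
        self.operator = operator
        self.batch_size = batch_size

    def run(self, rows: Sequence[dict]) -> List[dict]:
        results: List[dict] = []
        for batch in iter_batches(rows, self.batch_size):
            # Each call sees only one batch, which bounds peak memory per step.
            results.extend(self.operator(batch))
        return results


def add_length(batch: Sequence[dict]) -> List[dict]:
    """Stand-in operator: annotate every row with the length of its text."""
    return [{**row, "length": len(row["text"])} for row in batch]


rows = [{"text": "a"}, {"text": "bb"}, {"text": "ccc"}]
wrapped = HypotheticalBatchWrapper(add_length, batch_size=2)
output = wrapped.run(rows)
```

With `batch_size=2`, the three rows are processed as one batch of two followed by one batch of one, and the results are concatenated in the original order.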

Notable Changes

  • Add Dataflow WebUI : Add Gradio WebUI for all operators by @HeRunming in https://github.com/OpenDCAI/DataFlow/pull/169
  • Add Dataflow-Agent WebUI : Add agent gradio UI by @DeepMindLiuZhou in https://github.com/OpenDCAI/DataFlow/pull/175
  • Add MinerU for KBCPipeline : Add MinerU2.0 by @Niujunbo2002 in https://github.com/OpenDCAI/DataFlow/pull/132 and support for fetching arXiv PDF links by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/171
  • Add Sglang Support : Add tensor_parallel and data_parallel to LocalLLMServing_sglang by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/147

What's Changed

  • add get_desc for all general text operators by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/133
  • [Feature] GeneralFilter for GeneralText release! by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/135
  • fix problem by @YqjMartin in https://github.com/OpenDCAI/DataFlow/pull/138
  • add example data for FuncCallPipeline & rename MultiTurnDialogueGenerator by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/136
  • add examples to get_desc by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/134
  • SFT generators with customizable prompts by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/139
  • (new) add new pipeline to support general reasoning data and diy prompt, and fix some bugs, reform some reasoning ops by @scuuy in https://github.com/OpenDCAI/DataFlow/pull/137
  • Support MinerU2 for KnowledgeCleaning by @Niujunbo2002 in https://github.com/OpenDCAI/DataFlow/pull/132
  • [serving] set default vllm_seed param for LocalModelLLMServing_vllm to None to avoid warning by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/143
  • Fix GPU reasoning pipeline bug by @scuuy in https://github.com/OpenDCAI/DataFlow/pull/145
  • refine the get_desc func for each operator for text2sql pipeline by @TechNomad-ds in https://github.com/OpenDCAI/DataFlow/pull/142
  • [Serving] Add tensor_parallel and data_parallel to LocalLLMServing_sglang by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/147
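  • Add get_desc function to all text operators by @scuuy in https://github.com/OpenDCAI/DataFlow/pull/146
  • Fix storage column parsing error by expanding the data field into the dataframe, adjust the version, and fix a bug in the AnswerNgramFilter operator by @leaderwolfpipi in https://github.com/OpenDCAI/DataFlow/pull/115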
  • add medical pipeline, generated by agent by @DeepMindLiuZhou in https://github.com/OpenDCAI/DataFlow/pull/148
  • [serving] add sglang for all scripts for option by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/150
  • implement kbc batch process operators and pipeline by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/151
  • [Serving, KBC]Add VQAServing, Add mathbookpromblemextractor to KBC Pipeline. by @HeRunming in https://github.com/OpenDCAI/DataFlow/pull/152
  • fix bug for RemoveEmojiRefiner by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/153
  • fix bugs in batch_kbc by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/156
  • [batchwrapper] upload batchwrapper for batching an operator in a pipeline. by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/157
  • [Serving] Fix Local Model Serving, apply chat_template to sys & user prompt by @haolpku in https://github.com/OpenDCAI/DataFlow/pull/158
  • add API-based languagefilter & customized MetaScorer by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/161
  • fix quickstart bug by @haolpku in https://github.com/OpenDCAI/DataFlow/pull/162
  • Unify the embedding attribute names; adjust the SQLVariationGenerator operator's fill logic to append results to the original data by @leaderwolfpipi in https://github.com/OpenDCAI/DataFlow/pull/160
  • add publications by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/163
  • Fix forward-compatibility issues of other operators in the reasoning pipeline by @leaderwolfpipi in https://github.com/OpenDCAI/DataFlow/pull/165
  • add desc for func call & add statics for meta score by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/168
  • [webui] Add Gradio WebUI for experiencing all operators. by @HeRunming in https://github.com/OpenDCAI/DataFlow/pull/169
  • [Feature] PandasOperator release! [Update] GeneralFilter updated by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/170
  • Support for fetching arxiv pdf links by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/171
  • add new reasoning operator “answermodeljudge” to check reference answers via LLM by @scuuy in https://github.com/OpenDCAI/DataFlow/pull/172
  • [WebUI] Add API Pipeline UI by @HeRunming in https://github.com/OpenDCAI/DataFlow/pull/173
  • Add agent gradio UI by @DeepMindLiuZhou in https://github.com/OpenDCAI/DataFlow/pull/175
  • reconstruct the database manager to improve the efficiency of the text2sql pipeline by @TechNomad-ds in https://github.com/OpenDCAI/DataFlow/pull/174
  • fix bug by @TechNomad-ds in https://github.com/OpenDCAI/DataFlow/pull/176
  • Update Gradio and Bug Fix by @DeepMindLiuZhou in https://github.com/OpenDCAI/DataFlow/pull/177
  • add operator in readme by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/178
  • change kbc script in playground & manage kbc pipelines by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/179
  • Unify backend and frontend by @DeepMindLiuZhou in https://github.com/OpenDCAI/DataFlow/pull/180
  • add mathbook extract to playground by @HeRunming in https://github.com/OpenDCAI/DataFlow/pull/181
  • add gradio in readme by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/182
  • add safety checks in fetching pdf by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/184
  • Add a fix for multi-turn dialogues in which some generated user turns lack an assistant reply by @Arunshmily in https://github.com/OpenDCAI/DataFlow/pull/185

New Contributors

  • @Niujunbo2002 made their first contribution in https://github.com/OpenDCAI/DataFlow/pull/132
  • @Arunshmily made their first contribution in https://github.com/OpenDCAI/DataFlow/pull/185

Full Changelog: https://github.com/OpenDCAI/DataFlow/compare/v1.0.4...v1.0.5

- Python
Published by haolpku 10 months ago

https://github.com/OpenDCAI/DataFlow - Dataflow v1.0.4 Release Notes

DataFlow v1.0.4 Key Feature Updates

  • Automatic Operator Code Generation: Introduced new features for automatic operator code generation by @DeepMindLiuZhou (PR #61).
  • Myscale Storage Support: Added support for myscale storage by @leaderwolfpipi (PR #60).
  • Dialogue Function Generation: Implemented a function to generate from conversations by @MOLYHECI (PR #59).
  • QA Generator and Translator: Added a QA generator and translation feature by @haolpku (PR #65).
  • Text2SQL Pipeline Update: Refactored the text2sql pipeline by @TechNomad-ds (PR #113).
  • AgenticRAG Pipeline Enhancements: Enhanced the AgenticRAG pipeline to fully support embedding models by @wongzhenhao (PR #86).
  • Lazy Load Framework Support: @MOLYHECI The entire framework now supports lazy loading, significantly improving loading speeds. https://github.com/OpenDCAI/DataFlow/pull/87
  • GeneralText Optimization: @zzy1127 optimized information related to GeneralText. #102 #112 #125
  • Removal of Legacy Code: @HeRunming removed outdated code logic from the repository. #118

Notable Changes

  • Operator Naming Rules: Revised the naming rules for all operators by @SunnyHaze (PR #81).
  • FuncCall Pipeline: Introduced a new FuncCall Pipeline by @MOLYHECI (PR #88).
  • Batch PDF Extractor: Added functionality for batch PDF extraction by @haolpku (PR #111).
  • Bug Fixes and Improvements: Various contributors, including @YqjMartin and @ZhaoyangHan04, worked on code refactoring, dependency fixes, and bug resolutions.
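
The lazy loading item in the key feature updates above says deferring imports improves loading speed. As a rough illustration of that general technique, and not of DataFlow's actual loader, the sketch below defers importing a module until one of its attributes is first used. `LazyModule` is a hypothetical name, and the standard-library `json` module stands in for a heavy dependency.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


class LazyModule:
    """Illustrative proxy that imports its target module on first attribute access."""

    def __init__(self, module_name: str):
        self._module_name = module_name
        self._module: Optional[ModuleType] = None

    @property
    def loaded(self) -> bool:
        """True once the underlying module has actually been imported."""
        return self._module is not None

    def __getattr__(self, name: str) -> Any:
        # __getattr__ runs only for names not found on the proxy itself,
        # so every lookup that reaches here is meant for the target module.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, name)


json_lazy = LazyModule("json")       # nothing is imported through the proxy yet
was_loaded_before = json_lazy.loaded
encoded = json_lazy.dumps({"ok": True})  # first use triggers the import
```

The cost of the import is paid at first use rather than at package import time, which is why a package with many optional operators can start faster when their modules are wrapped this way.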

What's Changed

  • Dataflow agent new features for automatic operator code generation by @DeepMindLiuZhou in https://github.com/OpenDCAI/DataFlow/pull/61
  • Support myscale storage by @leaderwolfpipi in https://github.com/OpenDCAI/DataFlow/pull/60
  • Add function generate from conversations (dialogue) by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/59
  • add QA generator and translator by @haolpku in https://github.com/OpenDCAI/DataFlow/pull/65
  • change face and add acknowledgements by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/68
  • change face by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/69
  • delete the api aisuite (fix #32) by @scuuy in https://github.com/OpenDCAI/DataFlow/pull/70
  • Rename all operators naming rules. by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/81
  • adding missing numpy import by @JimmyAwoe in https://github.com/OpenDCAI/DataFlow/pull/76
  • [Rename] unused file deleted by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/82
  • rename RARE operators by @mi-iro in https://github.com/OpenDCAI/DataFlow/pull/83
  • [Update] APILLMServing_request now supports embedding models & AgenticRAG pipeline fully supports API requests by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/86
  • Support litellm by @Sucran in https://github.com/OpenDCAI/DataFlow/pull/84
  • Add Lazyloader feature for GeneralText by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/87
  • Dataflow agent by @DeepMindLiuZhou in https://github.com/OpenDCAI/DataFlow/pull/91
  • fix dependency conflicts in kbc pipeline by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/89
  • solve issue #92 and #85 by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/94
  • add TYPE_CHECKING if-else for VSCode static check by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/93
  • [oper] rename promptgenerator to promptedgenerator by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/95
  • [Update] AgenticRAG pipeline now supports APILLMServing for embedding by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/96
  • [Update] AgenticRAG pipeline now supports APILLMServing for embedding models by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/97
  • reduce logger content by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/98
  • Add auto generate importstructure function & fix import issues for dataflow/statics/ by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/99
  • Add FuncCall Pipeline by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/88
  • add prompts for consistentchat and fix some bugs by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/102
  • Add local QA generation and translation by @haolpku in https://github.com/OpenDCAI/DataFlow/pull/104
  • Dataflow agent update, with demo for writing some operators by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/105
  • fix translation bug and add data by @haolpku in https://github.com/OpenDCAI/DataFlow/pull/107
  • fix agentic RAG problem and add eval operators by @YqjMartin in https://github.com/OpenDCAI/DataFlow/pull/106
  • add abbreviation module by @haolpku in https://github.com/OpenDCAI/DataFlow/pull/108
  • [storage] add error logging when step is not called before the first run. by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/110
  • add batch pdf extractor by @haolpku in https://github.com/OpenDCAI/DataFlow/pull/111
  • modify code position by @YqjMartin in https://github.com/OpenDCAI/DataFlow/pull/109
  • [register] update register which could return type of operators by get_type_of_operator by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/112
  • update readme by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/114
  • update readme about agent by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/117
  • fix import bugs for sub-folder used operators by @MOLYHECI in https://github.com/OpenDCAI/DataFlow/pull/116
  • remove outdated function in dataflow/utils/utils.py by @HeRunming in https://github.com/OpenDCAI/DataFlow/pull/118
  • modify file path and redundant file by @YqjMartin in https://github.com/OpenDCAI/DataFlow/pull/121
  • Delete Operator.json by @DeepMindLiuZhou in https://github.com/OpenDCAI/DataFlow/pull/120
  • add sft syn pipeline by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/122
  • new rename generators by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/125
  • Move SFT synthesis into the playground by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/126
  • [Update] Improve AgenticRAG code readability by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/129
  • update text2sql pipeline by @TechNomad-ds in https://github.com/OpenDCAI/DataFlow/pull/113
  • fix the db not exist bug by @TechNomad-ds in https://github.com/OpenDCAI/DataFlow/pull/131

New Contributors

  • @JimmyAwoe made their first contribution in https://github.com/OpenDCAI/DataFlow/pull/76
  • @Sucran made their first contribution in https://github.com/OpenDCAI/DataFlow/pull/84

Full Changelog: https://github.com/OpenDCAI/DataFlow/compare/v1.0.3...v1.0.4

Published by SunnyHaze 11 months ago

https://github.com/OpenDCAI/DataFlow - Dataflow v1.0.3 Release Notes

What's changed

  • Update more scorers (operators) to GeneralText pipeline. (#38 and #48 ). Thanks @zzy1127 @MOLYHECI
  • Update more operators to AgenticRAG pipeline. (#50 , #41). Thanks @wongzhenhao @YqjMartin
  • Revise the API key environment variable passing logic in the APIServing class. The default variable is `DF_API_KEY` to avoid conflicts (#57 ). Thanks @SunnyHaze
  • Rename llmserving to serving for future extension of other kinds of web services. #44 . Thanks @SunnyHaze
  • Update the Readme. (#40 , #52 , #53 ) Thanks @Qmeiyi
  • Fix some bugs and parameter issues in the AgenticRAG pipeline. #49 . Thanks @TheRoadQaQ
  • Fix some bugs and parameter issues in the Knowledge base cleaning pipeline. #47 . Thanks @ZhaoyangHan04
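
The API key item above moves the key into a dedicated environment variable; the title of PR #57 spells it `DF_API_KEY`. The sketch below shows one plausible way a serving class could resolve such a key, under the assumption that a missing or blank value should fail early. `resolve_api_key` is a hypothetical helper, not a DataFlow function.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(key_name: str = "DF_API_KEY",
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key stored under `key_name`, or raise if it is missing.

    `env` defaults to the process environment; passing a plain dict makes
    the function easy to test without touching real environment variables.
    """
    source = os.environ if env is None else env
    value = source.get(key_name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {key_name} is not set")
    return value


# Usage with an explicit mapping (hypothetical key value):
api_key = resolve_api_key(env={"DF_API_KEY": " sk-test "})
```

Using a project-specific variable name rather than a widely used one keeps this key from colliding with credentials that other tools read from the same shell.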

Detailed list for all changed PRs

  • update readme by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/40
  • [New Operators] A lite implementation of OPPO TaskCraft by @wongzhenhao in https://github.com/OpenDCAI/DataFlow/pull/41
  • add scorers by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/38
  • [update] rename llmserving to serving to fit future extension by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/44
  • agentic rag para revise by @TheRoadQaQ in https://github.com/OpenDCAI/DataFlow/pull/49
  • add remaining operators by @zzy1127 in https://github.com/OpenDCAI/DataFlow/pull/48
  • normalize file path and params by @ZhaoyangHan04 in https://github.com/OpenDCAI/DataFlow/pull/47
  • update readme by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/52
  • update readme by @Qmeiyi in https://github.com/OpenDCAI/DataFlow/pull/53
  • Add several methods to improve AgenticRAG generation by @YqjMartin in https://github.com/OpenDCAI/DataFlow/pull/50
  • [serving] set default API serving key to DF_API_KEY and this key ca… by @SunnyHaze in https://github.com/OpenDCAI/DataFlow/pull/57

Full Changelog: https://github.com/OpenDCAI/DataFlow/compare/v1.0.2...v1.0.3

Published by SunnyHaze 11 months ago

https://github.com/OpenDCAI/DataFlow - Dataflow v1.0.2 Release Notes

New features

  • Add implementation of Dataflow Agents #34 . Thanks @DeepMindLiuZhou

Debug

  • Fix get-desc issue #35 , Thanks @leaderwolfpipi
  • Fix including bug for /example/KBC/test.doc and /example/KBC/test.pdf in manifest.ini. Thanks @SunnyHaze

Published by SunnyHaze 11 months ago

https://github.com/OpenDCAI/DataFlow - Dataflow v1.0.1 Release Notes

New features

  • add RARE pipeline (#33) @mi-iro
  • add API calling to text pipeline, i.e. test_sft_filter.py (#29) @zzy1127

Thanks for your contribution.

Debug

Fix the PyPI issue that made `pip install open-dataflow` fail (@SunnyHaze). Thanks to @leaderwolfpipi for reporting this bug.

Published by SunnyHaze 11 months ago

https://github.com/OpenDCAI/DataFlow - Dataflow v1.0.0 Release Notes

🎉🎉🎉 We are thrilled to release our Data-centric AI system, DataFlow! 🎉🎉🎉

Version: v1.0.0
Modular and AI-assisted data preparation system for high-efficiency pipelines.


🚀 Introduction

DataFlow is a high-efficiency data preparation system composed of advanced operators and multi-stage data processing pipelines. It integrates rule-based methods, deep learning models, and large language models (LLMs) to provide a modular, scalable, and reconfigurable design.

It aims to improve the quality and efficiency of data cleaning, augmentation, and construction — supporting the development of next-generation large-scale models.

Designed for researchers and engineers working on data-centric AI, LLM training, and scalable data workflows.


🧠 Core Features

  • 🔁 Modular Operator Design: Inspired by PyTorch, each operator is configurable and reusable.
  • 🧩 Multi-stage Pipelines: Flexibly chain operators for end-to-end data processing.
  • 🤖 Agent for DataFlow: LLM-powered automation for pipeline orchestration and operator generation.
  • ⚙️ Hybrid Techniques: Seamlessly combines rule-based, neural, and LLM-based methods.
  • 💾 Built-in Storage Layer: Manage intermediate data and caching.
  • 🔌 LLM Backend Support: Easily plug into GPT-style backends with LLMServing.

🧱 Framework Overview

DataFlow consists of the following core modules:

| Module | Description |
|------------|---------------------------------------------------------------------|
| operator | Basic data processing units, reusable across pipelines. |
| pipeline | Manages multi-step workflows by chaining multiple operators. |
| storage | Manages data cache, storage, and I/O between steps. |
| LLMServing | Integrates large models for reasoning, filtering, and generation. |
| Agent | Automatically generates, orchestrates, and manages data workflows. |
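
To make the operator and pipeline roles above concrete, here is a minimal sketch of chaining reusable operators into a multi-step workflow. All class names (`Operator`, `MinLengthFilter`, `LowercaseRefiner`, `Pipeline`) are hypothetical stand-ins chosen for illustration; DataFlow's real classes, constructors, and storage handling are not shown here and may look quite different.

```python
from typing import Dict, List, Protocol

Row = Dict[str, object]


class Operator(Protocol):
    """Anything with a run() method that maps a list of rows to a list of rows."""

    def run(self, rows: List[Row]) -> List[Row]: ...


class MinLengthFilter:
    """Rule-based operator: keep rows whose text has at least `min_chars` characters."""

    def __init__(self, min_chars: int):
        self.min_chars = min_chars

    def run(self, rows: List[Row]) -> List[Row]:
        return [r for r in rows if len(str(r["text"])) >= self.min_chars]


class LowercaseRefiner:
    """Refining operator: normalize the text field to lowercase."""

    def run(self, rows: List[Row]) -> List[Row]:
        return [{**r, "text": str(r["text"]).lower()} for r in rows]


class Pipeline:
    """Runs operators in order, feeding each one the previous operator's output."""

    def __init__(self, operators: List[Operator]):
        self.operators = operators

    def run(self, rows: List[Row]) -> List[Row]:
        for op in self.operators:
            rows = op.run(rows)
        return rows


pipeline = Pipeline([MinLengthFilter(min_chars=3), LowercaseRefiner()])
cleaned = pipeline.run([{"text": "Hi"}, {"text": "Hello World"}])
```

Because every operator exposes the same `run` signature, operators can be reordered or reused across pipelines without changing the pipeline class itself.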

🛠️ Example Usage and Operators

To get started quickly with real examples, please refer to our documentation:

These guides provide hands-on usage of core modules including Pipeline, Operator, and Agent, and demonstrate how to configure, extend, and run a complete data processing workflow using DataFlow.

🔍 Why DataFlow?

| Feature | Benefit |
|--------------------|-------------------------------------------|
| PyTorch-style API | Easy to learn and integrate |
| LLM + Rules + NN | Flexible and powerful hybrid workflows |
| Auto Agent Support | Reduces manual data prep burden |
| Storage Layer | Efficient checkpointing and result reuse |
| Fully Modular | Easy to extend, test, and compose |


📫 Contact

For issues, contributions, or questions, feel free to reach out:

GitHub: https://github.com/OpenDCAI/DataFlow
Email: hao.liang@stu.pku.edu.cn

Published by SunnyHaze 11 months ago

https://github.com/OpenDCAI/DataFlow - Dataflow v0.0.3 Release Notes

First Release for Dataflow system

  • The Dataflow codebase is now fully implemented with all features.
  • You can easily try our data-centric system with `pip install open-dataflow` and the `dataflow init` command.

Published by SunnyHaze 11 months ago