https://github.com/OpenDCAI/DataFlow

Easy Data Preparation with latest LLMs-based Operators and Pipelines.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (3.9%) to scientific vocabulary

Keywords

data data-agent data-cleaning data-pipelines data-processing data-science data-synthesis gradio-interface llms operators quick-data-processing sglang-bankend vllm-backend

Last synced: 5 months ago

Repository

Basic Info

Host: GitHub
Owner: OpenDCAI
License: apache-2.0
Language: Python
Default Branch: main
Homepage: https://OpenDCAI.github.io/DataFlow-Doc/
Size: 74.4 MB

Statistics

Stars: 1,178
Watchers: 15
Forks: 77
Open Issues: 13
Releases: 7

Topics

data data-agent data-cleaning data-pipelines data-processing data-science data-synthesis gradio-interface llms operators quick-data-processing sglang-bankend vllm-backend

Created over 1 year ago · Last pushed 6 months ago

Metadata Files

Readme License

README-dev.md

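DataFlow-Preview Development Guide

First, create a clean runtime environment with python==3.10.

Then clone this repository and install it locally in editable mode:

    pip install -e .

After installation, verify that it succeeded with the following commands:

    dataflow -v
    dataflow env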
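
The installation check can also be scripted. The following is a minimal sketch, not part of the DataFlow codebase: the helper name `check_cli` is hypothetical, and the only thing it assumes about DataFlow is that a `dataflow` executable accepting `-v` is placed on PATH by the install, as the commands in this guide suggest.

```python
import shutil
import subprocess


def check_cli(command="dataflow", args=("-v",)):
    """Return the command's combined output, or None if it is not on PATH."""
    executable = shutil.which(command)
    if executable is None:
        return None
    result = subprocess.run(
        [executable, *args],
        capture_output=True,
        text=True,
        check=False,  # report the output even if the exit code is non-zero
    )
    # Some CLIs print version info to stderr, so keep both streams.
    return (result.stdout + result.stderr).strip()


if __name__ == "__main__":
    output = check_cli()
    if output is None:
        print("dataflow not found on PATH; was `pip install -e .` run in this environment?")
    else:
        print(output)
```

Returning `None` for a missing executable, rather than raising, keeps the check usable inside a larger setup script that wants to print its own guidance.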

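Testing the Reasoning Pipeline

The current test entry point is /test/test_reasoning.py, which by default uses /dataflow/example/ReasoningPipeline/pipeline_math_short.json as its sample input.

Export your API key as a global environment variable:

    export API_KEY=<your key>

Then change the working directory to /test and run the script directly to try out a very short pipeline:

    python test_reasoning.py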
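
Because the test script depends on the exported key, a missing variable is easier to diagnose if it is checked up front. Below is a small sketch of such a guard. The function `require_env` is a hypothetical helper and does not come from the repository; only the variable name `API_KEY` is taken from this guide.

```python
import os


def require_env(name="API_KEY", environ=None):
    """Return the value of an environment variable, or raise a clear error."""
    # Accept an explicit mapping so the check can be exercised without
    # touching the real process environment.
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; run `export {name}=<your key>` first")
    return value
```

Calling `require_env()` at the top of a local wrapper script fails fast with a readable message instead of an authentication error partway through the pipeline.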

Owner

Name: OpenDCAI
Login: OpenDCAI
Kind: organization
Email: PKU_DCML@hotmail.com

Repositories: 1
Profile: https://github.com/OpenDCAI

Define the future of Data-centric AI together

GitHub Events

Total

Create event: 16
Issues event: 42
Release event: 5
Watch event: 571
Delete event: 5
Member event: 1
Issue comment event: 71
Push event: 132
Pull request review comment event: 30
Pull request review event: 48
Pull request event: 214
Fork event: 49

Last Year

Create event: 16
Issues event: 42
Release event: 5
Watch event: 571
Delete event: 5
Member event: 1
Issue comment event: 71
Push event: 132
Pull request review comment event: 30
Pull request review event: 48
Pull request event: 214
Fork event: 49

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 29
Total pull requests: 117
Average time to close issues: 3 days
Average time to close pull requests: about 10 hours
Total issue authors: 22
Total pull request authors: 27
Average comments per issue: 0.66
Average comments per pull request: 0.2
Merged pull requests: 72
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 29
Pull requests: 117
Average time to close issues: 3 days
Average time to close pull requests: about 10 hours
Issue authors: 22
Pull request authors: 27
Average comments per issue: 0.66
Average comments per pull request: 0.2
Merged pull requests: 72
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

SunnyHaze (7)
ElsaReedz (2)
ninjaX2o (1)
foxFallingSkies (1)
linzx3501 (1)
tpoisonooo (1)
samure1995 (1)
jansonheal (1)
miharakator (1)
ixeraby (1)
ackevjajameda (1)
miaode74 (1)
acie-1 (1)
haolpku (1)
lcysyzxdxc (1)

Pull Request Authors

haolpku (12)
zzy1127 (12)
ZhaoyangHan04 (11)
SunnyHaze (10)
Qmeiyi (8)
MOLYHECI (7)
wongzhenhao (7)
HeRunming (7)
scuuy (6)
TechNomad-ds (6)
gty1829 (5)
YqjMartin (4)
DeepMindLiuZhou (4)
leaderwolfpipi (3)
Yalin-Feng (2)

Top Labels

Issue Labels

bug (9) enhancement (8) question (4)

Pull Request Labels

enhancement (1)

Dependencies

requirements.txt pypi

PyYAML ==6.0.2
av ==12.3.0
decord ==0.6.0
einops ==0.8.0
fasttext ==0.9.3
filelock ==3.15.4
fsspec ==2024.6.1
ftfy ==6.2.3
google-api-core ==2.19.1
google-api-python-client ==2.140.0
google-auth ==2.33.0
google-auth-httplib2 ==0.2.0
googleapis-common-protos ==1.63.2
jsonargparse ==4.32.0
kenlm ==0.2.0
langkit ==0.0.33
loguru ==0.7.2
matplotlib ==3.9.2
multiprocess ==0.70.16
nltk ==3.8
numpy ==1.26.4
openai ==1.44.1
pandas ==2.2.2
prettytable ==3.11.0
pyspark ==3.5.2
regex ==2024.7.24
safetensors ==0.4.4
scikit-learn ==1.5.1
scikit-video ==1.1.11
scipy ==1.13.1
sentencepiece ==0.2.0
setuptools ==72.1.0
timm ==1.0.8
torch ==2.4.0
torchvision ==0.19.0
tqdm ==4.66.5
transformers ==4.44.2
vendi-score ==0.0.3
vllm ==0.6.0
wget ==3.2

.github/workflows/python-publish.yml actions

actions/checkout v4 composite
actions/download-artifact v4 composite
actions/setup-python v5 composite
actions/upload-artifact v4 composite
pypa/gh-action-pypi-publish release/v1 composite

.github/workflows/test.yml actions

actions/checkout v4 composite
actions/setup-python v3 composite

pyproject.toml pypi

requirements-kbc.txt pypi

chonkie *
fairy-doc *
trafilatura *

requirements-muxi.txt pypi

accelerate *
addict *
aisuite *
appdirs *
colorlog *
datasets *
datasketch *
math_verify *
modelscope *
numpy <2.0.0
pytest *
rapidfuzz *
scipy *
torch *
tqdm *
transformers *
word2number *

requirements-text.txt pypi

bert_score *
datasketch *
fasttext ==0.9.3
filelock ==3.15.4
gdown *
gensim *
google-api-core ==2.19.1
google-api-python-client ==2.140.0
google-auth ==2.33.0
google-auth-httplib2 ==0.2.0
googleapis-common-protos ==1.63.2
hlepor *
kenlm ==0.3.0
langkit ==0.0.33
loguru ==0.7.2
matplotlib ==3.9.2
multiprocess ==0.70.16
nltk *
nptyping *
openai =
pot *
presidio_analyzer *
presidio_anonymizer *
prettytable ==3.11.0
pyspark ==3.5.2
sacrebleu *
sentencepiece ==0.2.0
simhash *
vendi-score ==0.0.3
wget ==3.2
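
Several packages are pinned in more than one manifest above, and the pins do not always agree: kenlm is listed as 0.2.0 in requirements.txt and 0.3.0 in requirements-text.txt. The sketch below shows one way such disagreements could be detected. It assumes input lines in the `name ==version` form shown on this page, and `parse_pins` and `find_conflicts` are hypothetical helpers, not part of DataFlow or pip.

```python
def parse_pins(lines):
    """Map package name -> pinned version for lines like 'kenlm ==0.2.0'."""
    pins = {}
    for line in lines:
        name, sep, version = line.partition("==")
        # Skip unpinned entries such as 'nltk *' and malformed pins.
        if sep and version.strip():
            pins[name.strip().lower()] = version.strip()
    return pins


def find_conflicts(first, second):
    """Return {name: (version_in_first, version_in_second)} where pins differ."""
    return {
        name: (first[name], second[name])
        for name in sorted(first.keys() & second.keys())
        if first[name] != second[name]
    }


# A few entries copied from the two manifests listed above.
base = parse_pins(["kenlm ==0.2.0", "fasttext ==0.9.3", "numpy ==1.26.4"])
text = parse_pins(["kenlm ==0.3.0", "fasttext ==0.9.3", "nltk *"])
print(find_conflicts(base, text))  # prints {'kenlm': ('0.2.0', '0.3.0')}
```

Only exact `==` pins are compared here. Wildcard and range entries are ignored, so a fuller check would need a real requirement parser.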
