VeridicalFlow

VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS - Published in JOSS (2022)

https://github.com/yu-group/veridical-flow

Science Score: 100.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
    7 of 9 committers (77.8%) from academic institutions
  • Institutional organization owner
    Organization yu-group has institutional domain (www.stat.berkeley.edu)
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

ai data-science ensembling machine-learning ml pandas preprocessing python3 stability statistics tutorial workflow

Keywords from Contributors

explainable-ai interpretability supervised-learning

Scientific Fields

Sociology Social Sciences - 87% confidence
Last synced: 4 months ago

Repository

Making it easier to build stable, trustworthy data-science pipelines based on the PCS framework.

Basic Info
  • Host: GitHub
  • Owner: Yu-Group
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: master
  • Homepage: https://vflow.csinva.io
  • Size: 13.4 MB
Statistics
  • Stars: 71
  • Watchers: 5
  • Forks: 7
  • Open Issues: 7
  • Releases: 5
Created almost 5 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Contributing License Citation

README.md


A library for making stability analysis simple. Easily evaluate the effect of judgment calls on your data-science pipeline (e.g. choice of imputation strategy)!

Badges: MIT license · Python 3.9+ · tests · JOSS · PyPI version

Why use vflow?

Using vflow's simple wrappers facilitates many best practices for data science, as laid out in the predictability, computability, and stability (PCS) framework for veridical data science. The goal of vflow is to make it easy to build data-science pipelines that follow PCS by providing intuitive low-code syntax, efficient and flexible computational backends via Ray, and well-documented, reproducible experimentation via MLflow.

| Computation | Reproducibility | Prediction | Stability |
| ----------- | --------------- | ---------- | --------- |
| Automatic parallelization and caching throughout the pipeline | Automatic experiment tracking and saving | Filter the pipeline by training and validation performance | Replace a single function (e.g. preprocessing) with a set of functions and easily assess the stability of downstream results |
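The "Stability" idea in the table can be illustrated without vflow at all. The following is a minimal plain-Python sketch, not vflow's API: the toy column `values`, the helper `impute`, and the `imputers` dictionary are all hypothetical names introduced here. It swaps one preprocessing step (imputation) for a set of alternatives and runs the same downstream computation under each choice.

```python
from statistics import mean, median

# Hypothetical toy column with one missing value (None); not taken from the vflow docs.
values = [1.0, 2.0, None, 4.0, 100.0]


def impute(column, fill_fn):
    """Fill missing entries with a statistic computed from the observed entries."""
    observed = [v for v in column if v is not None]
    fill = fill_fn(observed)
    return [fill if v is None else v for v in column]


# The judgment call: a set of interchangeable imputation strategies
# in place of a single hard-coded one.
imputers = {"mean": mean, "median": median}

# Run the same downstream step (here, the column mean) under each choice.
downstream = {name: mean(impute(values, fn)) for name, fn in imputers.items()}

for name, result in downstream.items():
    print(f"{name} imputation -> downstream mean = {result}")
```

If the downstream result changes substantially between strategies, the conclusion is sensitive to that judgment call. vflow's `Vset` is meant to take over this bookkeeping across several pipeline steps at once.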

Here we show a simple example of an entire data-science pipeline with several perturbations (e.g. different data subsamples, models, and metrics) written simply using vflow.

```python
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from vflow import Vset, init_args

# initialize data
X, y = make_classification()
X_train, X_test, y_train, y_test = init_args(
    train_test_split(X, y),
    names=["X_train", "X_test", "y_train", "y_test"],  # optionally name the args
)

# subsample data
subsampling_funcs = [sklearn.utils.resample for _ in range(3)]
subsampling_set = Vset(
    name="subsampling", vfuncs=subsampling_funcs, output_matching=True
)
X_trains, y_trains = subsampling_set(X_train, y_train)

# fit models
models = [LogisticRegression(), DecisionTreeClassifier()]
modeling_set = Vset(name="modeling", vfuncs=models, vfunc_keys=["LR", "DT"])
modeling_set.fit(X_trains, y_trains)
preds_test = modeling_set.predict(X_test)

# get metrics
binary_metrics_set = Vset(
    name="binary_metrics",
    vfuncs=[accuracy_score, balanced_accuracy_score],
    vfunc_keys=["Acc", "BalAcc"],
)
binary_metrics = binary_metrics_set.evaluate(preds_test, y_test)
```

Once we've written this pipeline, we can easily measure the stability of metrics (e.g. "Accuracy") to our choice of subsampling or model.
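As a rough illustration of what such a stability measurement involves, here is a plain-Python sketch that does not use vflow's own helpers. The accuracy values in `toy_accuracy` are made up for illustration and are not output from the pipeline above, and `stability_by` is a hypothetical helper. It groups scores by one perturbation (subsample or model) and reports the mean and standard deviation across the other.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Toy accuracies keyed by (subsample, model).
# Made-up illustrative values, NOT results produced by vflow.
toy_accuracy = {
    ("subsample_0", "LR"): 0.90,
    ("subsample_1", "LR"): 0.88,
    ("subsample_2", "LR"): 0.92,
    ("subsample_0", "DT"): 0.80,
    ("subsample_1", "DT"): 0.90,
    ("subsample_2", "DT"): 0.70,
}


def stability_by(results, axis):
    """Group scores by one key position (0 = subsample, 1 = model).

    Returns {level: (mean, population std)} computed over the other perturbation.
    """
    grouped = defaultdict(list)
    for key, score in results.items():
        grouped[key[axis]].append(score)
    return {level: (mean(s), pstdev(s)) for level, s in grouped.items()}


per_model = stability_by(toy_accuracy, axis=1)      # spread across subsamples
per_subsample = stability_by(toy_accuracy, axis=0)  # spread across models

for model, (avg, std) in per_model.items():
    print(f"{model}: mean accuracy {avg:.3f}, std across subsamples {std:.3f}")
```

A larger standard deviation for one model means its accuracy depends more heavily on which subsample it was trained on, which is the kind of instability the PCS framework asks analysts to check.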

Documentation

See the docs for the API reference.

Notebook examples

Note that some of these require more dependencies than just those required for vflow. To install them all, run `pip install vflow[nb]`.

Synthetic classification

Enhancer genomics

fMRI voxel prediction

Fashion-MNIST classification

Feature importance stability

Clinical decision rule vetting

Installation

Stable version

`pip install vflow`

Development version (unstable)

`pip install vflow@git+https://github.com/Yu-Group/veridical-flow`

References

@software{duncan2020vflow,
  author = {Duncan, James and Kapoor, Rush and Agarwal, Abhineet and Singh, Chandan and Yu, Bin},
  doi = {10.21105/joss.03895},
  month = {1},
  title = {{VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS}},
  url = {https://doi.org/10.21105/joss.03895},
  year = {2022}
}

Owner

  • Name: Yu-Group
  • Login: Yu-Group
  • Kind: organization
  • Email: chandan_singh@berkeley.edu
  • Location: Berkeley, CA

Bin Yu Group at UC Berkeley

JOSS Publication

VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS
Published
January 12, 2022
Volume 7, Issue 69, Page 3895
Authors
James Duncan ORCID
Statistics Department, University of California, Berkeley
Rush Kapoor
EECS Department, University of California, Berkeley
Abhineet Agarwal
Physics Department, University of California, Berkeley
Chandan Singh ORCID
EECS Department, University of California, Berkeley
Bin Yu
Statistics Department, University of California, Berkeley, EECS Department, University of California, Berkeley
Editor
Mehmet Hakan Satman ORCID
Tags
python stability reproducibility data science caching

Citation (citation.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Duncan"
  given-names: "James"
- family-names: "Kapoor"
  given-names: "Rush"
- family-names: "Agarwal"
  given-names: "Abhineet"
- family-names: "Singh"
  given-names: "Chandan"
- family-names: "Yu"
  given-names: "Bin"
title: "VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS"
journal: "Journal of Open Source Software"
doi: 10.21105/joss.03895
date-released: 2022-01-12
url: "https://doi.org/10.21105/joss.03895"

GitHub Events

Total
  • Watch event: 3
  • Fork event: 1
Last Year
  • Watch event: 3
  • Fork event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 247
  • Total Committers: 9
  • Avg Commits per committer: 27.444
  • Development Distribution Score (DDS): 0.623
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
James Duncan j****n@b****u 93
Chandan Singh c****h@b****u 82
Rushk014 r****r@b****u 57
Abhineet a****7@b****u 5
Sahil Saxena s****8@b****u 3
Michał Kuźba k****8@g****m 2
Mehmet Hakan Satman m****n@g****m 2
Daniel S. Katz d****z@i****g 2
Matthew Feickert m****t@c****h 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 21
  • Total pull requests: 37
  • Average time to close issues: 17 days
  • Average time to close pull requests: 5 days
  • Total issue authors: 7
  • Total pull request authors: 8
  • Average comments per issue: 1.19
  • Average comments per pull request: 0.86
  • Merged pull requests: 32
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • kmichael08 (9)
  • rushk014 (3)
  • richrobe (3)
  • GiannisPikoulis (2)
  • jpdunc23 (2)
  • csinva (1)
  • ssaxena00 (1)
Pull Request Authors
  • jpdunc23 (20)
  • rushk014 (8)
  • ssaxena00 (5)
  • csinva (3)
  • kmichael08 (2)
  • matthewfeickert (2)
  • jbytecode (1)
  • danielskatz (1)
Top Labels
Issue Labels
question (1) bug (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 44 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 7
  • Total maintainers: 2
pypi.org: vflow

A framework for doing stability analysis with PCS.

  • Versions: 7
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 44 Last month
Rankings
Stargazers count: 9.5%
Dependent packages count: 10.1%
Forks count: 16.9%
Average: 20.7%
Dependent repos count: 21.6%
Downloads: 45.3%
Maintainers (2)
Last synced: 4 months ago

Dependencies

.github/workflows/python-package.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • codecov/codecov-action v2 composite
pyproject.toml pypi
requirements.txt pypi
  • joblib *
  • matplotlib *
  • mlflow *
  • networkx *
  • numpy *
  • pandas >=2.0.0
  • pytest *
  • ray *
  • scipy *