VeridicalFlow
VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS - Published in JOSS (2022)
Science Score: 100.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 6 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: Links to joss.theoj.org
- ✓ Committers with academic emails: 7 of 9 committers (77.8%) from academic institutions
- ✓ Institutional organization owner: Organization yu-group has institutional domain (www.stat.berkeley.edu)
- ✓ JOSS paper metadata: Published in Journal of Open Source Software
Repository
Making it easier to build stable, trustworthy data-science pipelines based on the PCS framework.
Basic Info
- Host: GitHub
- Owner: Yu-Group
- License: MIT
- Language: Jupyter Notebook
- Default Branch: master
- Homepage: https://vflow.csinva.io
- Size: 13.4 MB
Statistics
- Stars: 71
- Watchers: 5
- Forks: 7
- Open Issues: 7
- Releases: 5
Metadata Files
README.md
A library for making stability analysis simple. Easily evaluate the effect of judgment calls to your data-science pipeline (e.g. choice of imputation strategy)!
Why use vflow?
Using vflow's simple wrappers facilitates many best practices for data science,
as laid out in the predictability, computability, and stability (PCS) framework
for veridical data science. The goal of vflow is to make it easy to build
data science pipelines that follow PCS by providing intuitive low-code syntax,
efficient and flexible computational backends via Ray, and well-documented,
reproducible experimentation via MLflow.
| Computation | Reproducibility | Prediction | Stability |
| ----------- | --------------- | ---------- | --------- |
| Automatic parallelization and caching throughout the pipeline | Automatic experiment tracking and saving | Filter the pipeline by training and validation performance | Replace a single function (e.g. preprocessing) with a set of functions and easily assess the stability of downstream results |
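To make the stability column concrete, here is a minimal plain-Python sketch of replacing one preprocessing function with a set of functions. It does not use vflow's API: the `imputers` dictionary, the three imputation helpers, and the toy `column` values are all hypothetical, chosen only to illustrate how a judgment call (the imputation strategy) can change a downstream result.

```python
import statistics


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def impute_zero(values):
    """Replace missing entries (None) with zero."""
    return [0.0 if v is None else v for v in values]


# Hypothetical set of interchangeable functions, keyed by name.
imputers = {"mean": impute_mean, "median": impute_median, "zero": impute_zero}

# Toy column with two missing entries (illustrative values only).
column = [1.0, 2.0, None, 4.0, None, 13.0]

# Downstream result: here simply the mean of the imputed column.
downstream = {name: statistics.mean(f(column)) for name, f in imputers.items()}

for name, value in downstream.items():
    print(f"{name:>6}: {value:.2f}")

# A large spread means the result is sensitive to the imputation choice.
spread = max(downstream.values()) - min(downstream.values())
print(f"spread across imputers: {spread:.2f}")
```

The spread across the three variants is a crude stability measure: if it is large relative to the effect being studied, the conclusion depends on the imputation choice.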
Here we show a simple example of an entire data-science pipeline with several
perturbations (e.g. different data subsamples, models, and metrics) written
simply using vflow.
```python
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from vflow import Vset, init_args

# initialize data
X, y = make_classification()
X_train, X_test, y_train, y_test = init_args(
    train_test_split(X, y),
    names=["X_train", "X_test", "y_train", "y_test"],  # optionally name the args
)

# subsample data
subsampling_funcs = [sklearn.utils.resample for _ in range(3)]
subsampling_set = Vset(
    name="subsampling", vfuncs=subsampling_funcs, output_matching=True
)
X_trains, y_trains = subsampling_set(X_train, y_train)

# fit models
models = [LogisticRegression(), DecisionTreeClassifier()]
modeling_set = Vset(name="modeling", vfuncs=models, vfunc_keys=["LR", "DT"])
modeling_set.fit(X_trains, y_trains)
preds_test = modeling_set.predict(X_test)

# get metrics
binary_metrics_set = Vset(
    name="binary_metrics",
    vfuncs=[accuracy_score, balanced_accuracy_score],
    vfunc_keys=["Acc", "BalAcc"],
)
binary_metrics = binary_metrics_set.evaluate(preds_test, y_test)
```
Once we've written this pipeline, we can easily measure the stability of metrics (e.g. "Accuracy") to our choice of subsampling or model.
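As a rough illustration of what measuring stability to the choice of subsampling or model can mean, the sketch below computes one accuracy per (subsample, model) combination and then summarises, for each model, the mean and standard deviation across subsamples. It is standard-library Python only and is not vflow's own stability tooling: the toy data generator, the two toy models, and every helper name are assumptions made for this example.

```python
import random
import statistics
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def make_rows(n):
    """Toy 1-D data: label y in {0, 1}, feature x drawn around 2 * y."""
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]


def resample(rows):
    """Bootstrap subsample: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_majority(train):
    """Toy model: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max((0, 1), key=labels.count)
    return lambda x: majority


def fit_threshold(train):
    """Toy model: predict 1 when x exceeds the midpoint of the class means."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    cut = (mean0 + mean1) / 2
    return lambda x: int(x > cut)


def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(model(x) == y for x, y in rows) / len(rows)


train, test = make_rows(60), make_rows(40)
fitters = {"majority": fit_majority, "threshold": fit_threshold}

# One accuracy per (subsample, model) combination.
results = {}
for s in range(3):
    sample = resample(train)
    for name, fit in fitters.items():
        results[(s, name)] = accuracy(fit(sample), test)

# Stability with respect to subsampling: for each model, summarise the
# accuracies obtained across the different subsamples.
by_model = defaultdict(list)
for (s, name), acc in results.items():
    by_model[name].append(acc)

summary = {
    name: (statistics.mean(accs), statistics.stdev(accs))
    for name, accs in by_model.items()
}
for name, (mean_acc, sd_acc) in summary.items():
    print(f"{name:>9}: mean={mean_acc:.3f} sd={sd_acc:.3f}")
```

A small standard deviation for a model suggests its accuracy is stable under the subsampling perturbation. Grouping by subsample index instead of model name would give the analogous summary for the choice of model.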
Documentation
See the docs for an API reference.
Notebook examples
Note that some of these require more dependencies than just those required for
vflow. To install them all, run `pip install vflow[nb]`.
Installation
Stable version
`pip install vflow`
Development version (unstable)
`pip install vflow@git+https://github.com/Yu-Group/veridical-flow`
References
- interface: easily build on scikit-learn and dvc (data version control)
- computation: integration with ray and caching with joblib
- tracking: mlflow
- pull requests very welcome! (see contributing.md)
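The computation bullet above mentions parallel execution and caching. The sketch below illustrates the general idea with standard-library stand-ins only: `functools.lru_cache` takes the place of joblib-style caching and a thread pool takes the place of Ray. Neither is what vflow uses internally, and the `preprocess` step and its `scale` perturbation are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=None)
def preprocess(values, scale):
    """Stand-in for an expensive pipeline step; arguments must be hashable."""
    return tuple(v * scale for v in values)


data = (1.0, 2.0, 3.0)
scales = [1, 10, 100]  # each scale plays the role of one perturbation

# Run each perturbation in its own worker thread.
with ThreadPoolExecutor(max_workers=3) as pool:
    first = list(pool.map(lambda s: preprocess(data, s), scales))

# Repeating the same calls is served from the cache instead of recomputed.
second = [preprocess(data, s) for s in scales]

info = preprocess.cache_info()
print(f"computed {info.misses} times, reused {info.hits} times")
```

Because `lru_cache` keys on the arguments, the inputs are passed as tuples rather than lists. A disk-backed cache removes that restriction at the cost of serialising inputs and outputs.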
@software{duncan2020vflow,
author = {Duncan, James and Kapoor, Rush and Agarwal, Abhineet and Singh, Chandan and Yu, Bin},
doi = {10.21105/joss.03895},
month = {1},
title = {{VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS}},
url = {https://doi.org/10.21105/joss.03895},
year = {2022}
}
Owner
- Name: Yu-Group
- Login: Yu-Group
- Kind: organization
- Email: chandan_singh@berkeley.edu
- Location: Berkeley, CA
- Website: https://www.stat.berkeley.edu/~yugroup/
- Repositories: 19
- Profile: https://github.com/Yu-Group
Bin Yu Group at UC Berkeley
JOSS Publication
VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS
Authors
EECS Department, University of California, Berkeley
Physics Department, University of California, Berkeley
Statistics Department, University of California, Berkeley, EECS Department, University of California, Berkeley
Tags
python, stability, reproducibility, data science, caching
Citation (citation.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Duncan"
    given-names: "James"
  - family-names: "Kapoor"
    given-names: "Rush"
  - family-names: "Agarwal"
    given-names: "Abhineet"
  - family-names: "Singh"
    given-names: "Chandan"
  - family-names: "Yu"
    given-names: "Bin"
title: "VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS"
journal: "Journal of Open Source Software"
doi: 10.21105/joss.03895
date-released: 2022-01-12
url: "https://doi.org/10.21105/joss.03895"
GitHub Events
Total
- Watch event: 3
- Fork event: 1
Last Year
- Watch event: 3
- Fork event: 1
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| James Duncan | j****n@b****u | 93 |
| Chandan Singh | c****h@b****u | 82 |
| Rushk014 | r****r@b****u | 57 |
| Abhineet | a****7@b****u | 5 |
| Sahil Saxena | s****8@b****u | 3 |
| Michał Kuźba | k****8@g****m | 2 |
| Mehmet Hakan Satman | m****n@g****m | 2 |
| Daniel S. Katz | d****z@i****g | 2 |
| Matthew Feickert | m****t@c****h | 1 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 21
- Total pull requests: 37
- Average time to close issues: 17 days
- Average time to close pull requests: 5 days
- Total issue authors: 7
- Total pull request authors: 8
- Average comments per issue: 1.19
- Average comments per pull request: 0.86
- Merged pull requests: 32
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- kmichael08 (9)
- rushk014 (3)
- richrobe (3)
- GiannisPikoulis (2)
- jpdunc23 (2)
- csinva (1)
- ssaxena00 (1)
Pull Request Authors
- jpdunc23 (20)
- rushk014 (8)
- ssaxena00 (5)
- csinva (3)
- kmichael08 (2)
- matthewfeickert (2)
- jbytecode (1)
- danielskatz (1)
Packages
- Total packages: 1
- Total downloads: 44 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 7
- Total maintainers: 2
pypi.org: vflow
A framework for doing stability analysis with PCS.
- Homepage: https://vflow.csinva.io/
- Documentation: https://vflow.readthedocs.io/
- License: MIT
- Latest release: 0.1.4 (published almost 2 years ago)
Dependencies
- actions/checkout v2 composite
- actions/setup-python v2 composite
- codecov/codecov-action v2 composite
- joblib *
- matplotlib *
- mlflow *
- networkx *
- numpy *
- pandas >=2.0.0
- pytest *
- ray *
- scipy *
