mlinspect
Inspect ML Pipelines in Python in the form of a DAG
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (16.2%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 70
- Watchers: 5
- Forks: 17
- Open Issues: 19
- Releases: 0
Metadata Files
README.md
mlinspect
Inspect ML Pipelines in Python in the form of a DAG
Run mlinspect locally
Prerequisite: Python 3.10
- Clone this repository
- Set up the environment
    cd mlinspect
    python -m venv venv
    source venv/bin/activate
- If you want to use the visualisation functions we provide, install graphviz, which cannot be installed via pip
    Linux: apt-get install graphviz
    macOS: brew install graphviz
- Install pip dependencies
    SETUPTOOLS_USE_DISTUTILS=stdlib pip install -e .[dev]
- To ensure everything works, you can run the tests (without graphviz, the visualisation test will fail)
    python setup.py test
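Because the visualisation test depends on graphviz, it can help to check for it before running the tests. The snippet below is a minimal sketch, not part of mlinspect: it assumes that a graphviz installation puts the `dot` executable on your PATH, and the helper name `graphviz_available` is hypothetical. Finding `dot` does not guarantee that the Python bindings built correctly.

```python
import shutil


def graphviz_available(executable="dot"):
    """Return True if the given Graphviz command-line tool is found on PATH.

    Assumption: a working graphviz install exposes the `dot` executable.
    """
    return shutil.which(executable) is not None


if graphviz_available():
    print("graphviz found: the visualisation test should be able to run")
else:
    print("graphviz not found: expect the visualisation test to fail")
```

If the check reports that graphviz is missing, install it with the platform command listed above and re-run the tests.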
How to use mlinspect
mlinspect makes it easy to analyze your pipeline and automatically check for common issues.

```python
from mlinspect import PipelineInspector
from mlinspect.inspections import MaterializeFirstOutputRows
from mlinspect.checks import NoBiasIntroducedFor

IPYNB_PATH = ...

inspector_result = PipelineInspector\
    .on_pipeline_from_ipynb_file(IPYNB_PATH)\
    .add_required_inspection(MaterializeFirstOutputRows(5))\
    .add_check(NoBiasIntroducedFor(['race']))\
    .execute()

extracted_dag = inspector_result.dag
dag_node_to_inspection_results = inspector_result.dag_node_to_inspection_results
check_to_check_results = inspector_result.check_to_check_results
```
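For intuition about what a check such as `NoBiasIntroducedFor(['race'])` is looking for, here is a library-independent sketch. It is not mlinspect's implementation. It assumes the check compares each group's share of a sensitive column before and after a pipeline operator (for example a filter) and flags groups whose share drops too far. The function names and the -0.3 threshold are hypothetical and chosen only for illustration.

```python
from collections import Counter


def group_ratios(values):
    """Return the share of each distinct value in `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def relative_ratio_changes(before, after):
    """Relative change in each group's share between two snapshots of a column."""
    ratios_before = group_ratios(before)
    ratios_after = group_ratios(after)
    return {
        group: (ratios_after.get(group, 0.0) - ratio) / ratio
        for group, ratio in ratios_before.items()
    }


def groups_with_bias(before, after, min_allowed_change=-0.3):
    """Groups whose share dropped by more than the (hypothetical) threshold."""
    changes = relative_ratio_changes(before, after)
    return sorted(
        group for group, change in changes.items() if change < min_allowed_change
    )


# Hypothetical 'race' column before and after a filter in a pipeline.
race_before = ["a", "a", "b", "b"]
race_after = ["a", "a", "b"]
flagged = groups_with_bias(race_before, race_after)
print(flagged)
```

In this toy data, group "b" falls from one half to one third of the rows, a relative drop of about 33%, so it is the only group flagged. In mlinspect itself, the corresponding outcome is read from `check_to_check_results`.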
Detailed Example
We prepared a demo notebook to showcase mlinspect and its features.
Supported libraries and API functions
mlinspect already supports a selection of API functions from pandas and scikit-learn. Extending mlinspect to support more and more API functions and libraries will be an ongoing effort. However, mlinspect won't just crash when it encounters functions it doesn't recognize yet. For more information, please see here.
Notes
- For debugging in PyCharm, set the pytest flag --no-cov (Link)
Publications
- Stefan Grafberger, Paul Groth, Julia Stoyanovich, Sebastian Schelter (2022). Data Distribution Debugging in Machine Learning Pipelines. The VLDB Journal — The International Journal on Very Large Data Bases (Special Issue on Data Science for Responsible Data Management).
- Stefan Grafberger, Shubha Guha, Julia Stoyanovich, Sebastian Schelter (2021). mlinspect: a Data Distribution Debugger for Machine Learning Pipelines. ACM SIGMOD (demo).
- Stefan Grafberger, Julia Stoyanovich, Sebastian Schelter (2020). Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines. Conference on Innovative Data Systems Research (CIDR).
License
This library is licensed under the Apache 2.0 License.
Owner
- Name: Stefan Grafberger
- Login: stefan-grafberger
- Kind: user
- Location: Amsterdam
- Company: University of Amsterdam
- Website: https://stefan-grafberger.com
- Twitter: SGrafberger
- Repositories: 2
- Profile: https://github.com/stefan-grafberger
I am a Ph.D. student at the University of Amsterdam, conducting research at the intersection of data management and machine learning.
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Grafberger"
    given-names: "Stefan"
    orcid: "https://orcid.org/0000-0002-9884-9517"
  - family-names: "Groth"
    given-names: "Paul"
    orcid: "https://orcid.org/0000-0003-0183-6910"
  - family-names: "Stoyanovich"
    given-names: "Julia"
  - family-names: "Schelter"
    given-names: "Sebastian"
title: "Data Distribution Debugging in Machine Learning Pipelines"
doi: 10.1007/s00778-021-00726-w
url: "https://github.com/stefan-grafberger/mlinspect"
preferred-citation:
  type: article
  authors:
    - family-names: "Grafberger"
      given-names: "Stefan"
      orcid: "https://orcid.org/0000-0002-9884-9517"
    - family-names: "Groth"
      given-names: "Paul"
      orcid: "https://orcid.org/0000-0003-0183-6910"
    - family-names: "Stoyanovich"
      given-names: "Julia"
    - family-names: "Schelter"
      given-names: "Sebastian"
  title: "Data Distribution Debugging in Machine Learning Pipelines"
  doi: 10.1007/s00778-021-00726-w
  date-released: 2022-01-31
GitHub Events
Total
- Watch event: 3
- Fork event: 1
Last Year
- Watch event: 3
- Fork event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 46
- Total pull requests: 54
- Average time to close issues: about 2 months
- Average time to close pull requests: 24 days
- Total issue authors: 3
- Total pull request authors: 5
- Average comments per issue: 0.3
- Average comments per pull request: 1.19
- Merged pull requests: 35
- Bot issues: 0
- Bot pull requests: 15
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- stefan-grafberger (38)
- sscdotopen (7)
- adrianlut (1)
Pull Request Authors
- stefan-grafberger (33)
- dependabot[bot] (15)
- adrianlut (4)
- PiStefania (1)
- shubhaguha (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- gensim ==3.8.3 development
- importnb ==0.6.2 development
- jupyter ==1.0.0 development
- keras ==2.4.3 development
- pylint ==2.6.0 development
- pytest ==6.1.2 development
- pytest-cov ==2.10.1 development
- pytest-mock ==3.3.1 development
- pytest-pycharm ==0.7.0 development
- pytest-pylint ==0.17.0 development
- pytest-runner ==5.2 development
- seaborn ==0.11.0 development
- tensorflow ==2.5.0 development
- astmonkey ==0.3.6
- astpretty ==2.0.0
- astunparse ==1.6.3
- gorilla ==0.4.0
- ipython ==7.25.0
- matplotlib ==3.4.2
- more-itertools ==8.6.0
- nbconvert ==6.4.5
- nbformat ==5.0.8
- networkx ==2.5
- numpy ==1.19.5
- pandas ==1.2.3
- protobuf ==3.20.1
- pygraphviz ==1.7
- scikit-learn ==0.23.2
- scipy ==1.7.0
- setuptools ==57.0.0
- six ==1.15.0
- statsmodels ==0.12.2
- testfixtures ==6.17.1
- actions/cache v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- codecov/codecov-action v1 composite
- ts-graphviz/setup-graphviz v1 composite