mlflow2prov

Extract provenance graphs compliant with W3C PROV from ML experiment projects that use Git repositories and MLflow tracking

https://github.com/mariusschlegel/mlflow2prov

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.1%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: mariusschlegel
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 294 KB
Statistics
  • Stars: 5
  • Watchers: 1
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Created about 3 years ago · Last pushed over 2 years ago
Metadata Files
Readme Contributing License Citation

README.md

MLflow2PROV

MLflow2PROV is a Python library and command line tool for extracting provenance graphs from ML experiment projects that use Git repositories and MLflow tracking. The underlying data model is compliant with the W3C PROV specification.

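For a discussion of the ideas, please see the following paper: Marius Schlegel and Kai-Uwe Sattler, "MLflow2PROV: Extracting Provenance from Machine Learning Experiments", Proceedings of the 7th Workshop on Data Management for End-to-End Machine Learning (DEEM@SIGMOD '23), ACM, 2023, https://doi.org/10.1145/3595360.3595859.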

If you publish work that uses MLflow2PROV, please cite MLflow2PROV and use the corresponding BibTeX entry below.

Installation

MLflow2PROV can currently be installed via Poetry (availability on PyPI is planned). For instructions on installing Poetry, please see the Poetry documentation. MLflow2PROV currently requires Python 3.10 or 3.11, so you may need to install a suitable Python version (for example with Pyenv) and tell Poetry to use it. The following example uses Python 3.10:

```bash
sudo dnf install -y openssl-devel libffi-devel bzip2-devel readline-devel sqlite-devel xz-devel tk-devel # exemplary installation of Python dependencies in Fedora 38
pyenv install 3.10.11
poetry env use 3.10.11
```

MLflow2PROV uses Graphviz for exporting provenance graphs in the dot format. Since Graphviz itself is not available as a Python package, you may need to install it with your distribution's package manager, for example:

```bash
sudo dnf install graphviz # exemplary installation in Fedora 38
```

Then, install MLflow2PROV and its dependencies with Poetry:

```bash
poetry install
```

To use all features of MLflow2PROV, two minor patches currently have to be applied to the MLflow installation. You can apply them locally as follows:

```bash
patch .venv/lib/python3.10/site-packages/mlflow/utils/search_utils.py < patches/mlflow-2.5.0-search_utils.patch
patch .venv/lib/python3.10/site-packages/mlflow/store/model_registry/sqlalchemy_store.py < patches/mlflow-2.5.0-sqlalchemy_store.patch
```

Specifically, these patches adjust the FileStore and SQLAlchemyStore Model Registry backend implementations so that deleted ModelVersion objects can also be read. This is required in particular to create instances of the RegisteredModelVersionDeletion provenance model. The issue has already been reported to the MLflow project (see https://github.com/mlflow/mlflow/issues/8225).

The dependencies for development can be installed via Poetry's --with option:

```bash
poetry install --with dev
```

Getting Started

The directory examples/quickstart-example/ provides a ready-to-run ML project including a prepared MLflow instance that can be used to try out MLflow2PROV. Please read examples/quickstart-example/README.md for detailed instructions.

Usage

MLflow2PROV can currently be run from within the virtual environment created by Poetry inside the project's root directory via

```bash
poetry run mlflow2prov [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...
```

or

```bash
poetry shell
mlflow2prov [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...
```

If the project's MLflow Tracking Server uses HTTP authentication, the credentials can be set via environment variables as follows:

```bash
poetry shell
export MLFLOW_TRACKING_USERNAME="myusername"
export MLFLOW_TRACKING_PASSWORD="mypassword"
mlflow2prov [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...
```

Alternatively, poetry run can be used together with a shell script containing the commands listed above.

Further MLflow environment variables can be set analogously (see documentation).
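
The credential setup described above can also be scripted from Python. The following is a minimal sketch, not part of MLflow2PROV: it assumes that `poetry` is on the `PATH` and that the script is started from the project's root directory. The helper names `build_env` and `run_mlflow2prov` are hypothetical; only the two environment variables and the `poetry run mlflow2prov` invocation come from this README.

```python
from __future__ import annotations

import os
import subprocess


def build_env(username: str, password: str, base: dict | None = None) -> dict:
    """Return a copy of the environment with the MLflow credentials added.

    `base` defaults to the current process environment (hypothetical helper).
    """
    env = dict(os.environ if base is None else base)
    env["MLFLOW_TRACKING_USERNAME"] = username
    env["MLFLOW_TRACKING_PASSWORD"] = password
    return env


def run_mlflow2prov(args: list[str], env: dict) -> int:
    """Run `poetry run mlflow2prov <args>` and return its exit code."""
    cmd = ["poetry", "run", "mlflow2prov", *args]
    # check=False: the caller decides how to react to a non-zero exit code.
    return subprocess.run(cmd, env=env, check=False).returncode


# Example call (not executed here, since it needs a Poetry environment):
# env = build_env("myusername", "mypassword")
# run_mlflow2prov(["--config", "examples/config/example.yaml"], env)
```

Passing the credentials through a copied environment keeps them out of the parent shell's environment and out of the command line.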

The command line interface of MLflow2PROV can be used either with a chain of commands and options or, alternatively, by providing a configuration file in .yaml format.

Command Line Usage

The command line interface provides commands that can be chained together like a Unix pipeline.

```
Usage: mlflow2prov [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...

  Extract provenance information from ML experiment projects that use Git repositories and MLflow tracking.

Options:
  --version        Show the version and exit.
  --verbose        Enable logging to stdout.
  --config FILE    Read configuration from file.
  --validate FILE  Validate configuration file and exit.
  --help           Show this message and exit.

Commands:
  extract     Extract a provenance document from an ML experiment project...
  load        Load provenance documents from one or more file(s).
  merge       Merge one or more given provenance documents into a single...
  save        Save one or more provenance documents to file(s).
  statistics  Print statistics for one or more provenance documents.
  transform   Apply a set of transformations to one or more given...
```

MLflow2PROV can be invoked as follows:

```bash
mlflow2prov extract --repository_path "/home/user/dev/mlproject-foo" --mlflow_url "http://localhost-foo:5000" \
    extract --repository_path "/home/user/dev/mlproject-bar" --mlflow_url "http://localhost-bar:5000" \
    load --input example.rdf \
    transform --use_pseudonyms --eliminate_duplicates \
    merge \
    save --output result --format json --format rdf --format xml --format provn --format dot \
    statistics --resolution fine --format table
```
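
When the chain of commands is assembled by a script rather than typed by hand, it can help to build the argument list programmatically. The sketch below illustrates one possible way to do this. The helper `build_chain` is hypothetical and not part of MLflow2PROV; the commands and flags are taken from the invocation above, and nothing is executed.

```python
from __future__ import annotations

import shlex


def build_chain(steps: list[tuple[str, list[tuple[str, str | None]]]]) -> list[str]:
    """Flatten (command, options) steps into one argument list.

    Each option is a (flag, value) pair; a value of None marks a bare flag.
    A repeated flag (as with --format) is simply listed several times.
    """
    argv = ["mlflow2prov"]
    for command, options in steps:
        argv.append(command)
        for flag, value in options:
            argv.append(f"--{flag}")
            if value is not None:
                argv.append(value)
    return argv


steps = [
    ("extract", [("repository_path", "/home/user/dev/mlproject-foo"),
                 ("mlflow_url", "http://localhost-foo:5000")]),
    ("load", [("input", "example.rdf")]),
    ("transform", [("use_pseudonyms", None), ("eliminate_duplicates", None)]),
    ("merge", []),
    ("save", [("output", "result"), ("format", "json"), ("format", "dot")]),
]

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(build_chain(steps)))
```

Keeping the steps as an ordered list mirrors the pipeline-like chaining of the command line interface: the order of the list is the order in which the commands appear.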

Configuration File Usage

MLflow2PROV supports configuration files in .yaml format that are functionally equivalent to command line invocations. To read the configuration from a file instead of specifying it on the command line, use the --config option:

```bash
mlflow2prov --config examples/config/example.yaml
```

Before running it, you can validate your configuration file (e.g., to check for syntax errors) as follows:

```bash
mlflow2prov --validate examples/config/example.yaml
```

A configuration file functionally equivalent to the above command line invocation example is specified as follows (see also examples/config/example.yaml):

```yaml
- extract:
    repository_path: "/home/user/dev/mlproject-foo"
    mlflow_url: "http://localhost-foo:5000"
- extract:
    repository_path: "/home/user/dev/mlproject-bar"
    mlflow_url: "http://localhost-bar:5000"
- load:
    input: [example.rdf]
- transform:
    use_pseudonyms: true
    eliminate_duplicates: true
- merge:
- save:
    output: result
    format: [json, rdf, xml, provn, dot]
- statistics:
    fine: true
    format: table
```

Provenance Output Formats

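MLflow2PROV supports multiple output formats provided by the prov library. The `save` command in the invocation examples above selects them via `--format`: json, rdf, xml, provn, and dot.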

Integrations

For further processing and usage of the output files, some exemplary helpers are provided in the directory examples/integrations/, which demonstrate and simplify the integration of MLflow2PROV with other systems:

  • the graph DBMS Neo4J supporting Cypher queries,
  • the NoSQL DBMS MongoDB supporting MQL queries,
  • the RDF triple store Apache Jena Fuseki supporting SPARQL queries, and
  • the visualization software Graphviz for DOT file processing.
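
Independently of these integrations, a saved JSON document can be inspected with the Python standard library alone. The sketch below assumes that the JSON output follows the PROV-JSON layout used by the prov library, in which top-level keys such as `entity`, `activity`, and `agent` map record identifiers to their attributes. The helper `count_records` and the inline sample document are illustrative only and are not produced by MLflow2PROV.

```python
import json
from collections import Counter


def count_records(prov_json: dict) -> Counter:
    """Count records per top-level PROV-JSON section, skipping 'prefix'."""
    counts = Counter()
    for section, records in prov_json.items():
        if section == "prefix":  # namespace declarations, not records
            continue
        counts[section] = len(records)
    return counts


# Tiny hand-written PROV-JSON document used only for illustration.
sample = json.loads("""
{
  "prefix": {"ex": "http://example.org/"},
  "entity": {"ex:model": {}, "ex:dataset": {}},
  "activity": {"ex:training": {}},
  "agent": {"ex:alice": {}},
  "wasGeneratedBy": {
    "_:g1": {"prov:entity": "ex:model", "prov:activity": "ex:training"}
  }
}
""")

print(dict(count_records(sample)))
```

To apply this to real output, replace `sample` with the result of `json.load` on the file written by `save --format json`; the exact file name depends on the `--output` value and is not assumed here.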

Citing

If you publish work that uses MLflow2PROV, please cite MLflow2PROV as follows:

```bibtex
@inproceedings{Schlegel23,
  author    = {Schlegel, Marius and Sattler, Kai-Uwe},
  title     = {{MLflow2PROV: Extracting Provenance from Machine Learning Experiments}},
  booktitle = {{Proceedings of the 7th Workshop on Data Management for End-to-End Machine Learning (DEEM@SIGMOD '23)}},
  year      = {2023},
  publisher = {{ACM}},
  doi       = {10.1145/3595360.3595859},
  url       = {https://doi.org/10.1145/3595360.3595859},
}
```

Contributing

Contributions and pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.

Further information on contributing can be found in the document CONTRIBUTING.md.

License

This project is Apache 2.0 licensed. Copyright © 2023–2024 by Marius Schlegel.

Owner

  • Name: Marius Schlegel
  • Login: mariusschlegel
  • Kind: user
  • Location: Germany

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Schlegel"
  given-names: "Marius"
  orcid: "https://orcid.org/0000-0001-6596-2823"
title: "MLflow2PROV"
url: "https://github.com/mariusschlegel/mlflow2prov"
preferred-citation:
    type: conference-paper
    authors:
    - family-names: "Schlegel"
      given-names: "Marius"
      orcid: "https://orcid.org/0000-0001-6596-2823"
    - family-names: "Sattler"
      given-names: "Kai-Uwe"
      orcid: "https://orcid.org/0000-0003-1608-7721"
    title: "MLflow2PROV: Extracting Provenance from Machine Learning Experiments"
    collection-title: "Proceedings of the 7th Workshop on Data Management for End-to-End Machine Learning (DEEM@SIGMOD '23)"
    collection-type: proceedings
    doi: 10.1145/3595360.3595859
    date-published: 2023-06-18

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1