csc-mlops

Framework for building ML apps

https://github.com/gstt-csc/mlops

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.7%) to scientific vocabulary

Keywords

artificial-intelligence data-science machine-learning mlops
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: GSTT-CSC
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 8.44 MB
Statistics
  • Stars: 11
  • Watchers: 3
  • Forks: 6
  • Open Issues: 32
  • Releases: 64
Topics
artificial-intelligence data-science machine-learning mlops
Created about 5 years ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md



A continuous integration and deployment framework for healthcare AI projects

Overview

This project aims to build an effective MLOps framework for the development of AI models in a healthcare setting. The application development framework has three major components:

1. MLOps server

The MLOps server hosts the ML lifecycle management services. An MLFlow instance serves as the management platform, providing experiment tracking and model serving.

2. Project Template

The project template is the starting point for any project using this development framework. The template is flexible enough for any project and facilitates communication with other parts of the development framework. The figure below illustrates a high level overview of the template and supporting components provided by the MLOps server.

3. csc-mlops package

The csc-mlops python package is available on PyPI and installed by default by the project template. This package handles communication between the project and the server, performs automated tasks, and includes helper functions and classes to streamline development.

These components work together to simplify and automate many of the processes required for controlled app development. A high level schematic of the framework is illustrated below. In this case XNAT is used as the data archive platform, but the framework can be adapted to use other data stores.

This repository contains the source code for the server and csc-mlops components of the development framework. For further details on the project template component see the project template repository.

Guiding Principles

This is an open source project and all contributions are welcome. Please see the contribution guidelines.

The MLOps server

Server components

  • MLFlow: Open source platform to manage the ML lifecycle
  • MINIO: High performance object storage suite
  • NGINX: Reverse proxy server

It's not essential to have a complete understanding of all of these, but a high-level understanding of MLFlow in particular will be useful!

Getting Started

The production version of this project is intended to run on a dedicated remote machine on an isolated network. This documentation will often describe the MLOps server, development machine and runner as separate machines, but there is no reason these cannot be the same machine if the network locations point to the localhost.

Prerequisites

First follow the instructions to install Docker and docker-compose.

Check that docker and docker-compose are working by passing the help argument on the command line. If the help information is not returned, or an error is given, revisit the docker installation docs.

```shell
docker --help
docker-compose --help
```
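The same check can be scripted. The sketch below is a minimal example using only the Python standard library; the helper name `missing_tools` is hypothetical and not part of this repository. It only confirms that the executables are on the PATH, not that the Docker daemon is running.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Hypothetical usage: report which prerequisites still need installing.
    missing = missing_tools(["docker", "docker-compose"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("docker and docker-compose were found on PATH")
```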

Setting up the MLOps server

  1. Clone and enter the repository:

```shell
git clone https://github.com/GSTT-CSC/MLOps.git
cd MLOps
```

  2. The server should be configured by creating an environment file at /mlflow_server/.env. The environment variables shown below are given as an example and should not be used for a production deployment.

Setting these variables is required; the server will fail to start if they are undefined.

Do not use the values shown here. Choose your own usernames and passwords.

```shell
# Example env file - fill all required values before using
AWS_ACCESS_KEY_ID=minioUsername
AWS_SECRET_ACCESS_KEY=minioPassword
MLFLOW_S3_IGNORE_TLS=true
POSTGRES_USER=use
POSTGRES_PASSWORD=pass
POSTGRES_DB=db
```

  3. Navigate to the mlflow_server directory and start the service. Any docker images that are not present on your local system will be pulled from dockerhub (which might take a while).

```shell
cd mlflow_server
docker-compose up -d --build
```

  4. To enable access to the minio artifact storage, the host machine needs to be authenticated. Any of the methods supported by boto3 should be compatible. The recommended authentication method is to create an aws credentials file, e.g. for ubuntu/linux

```
[default]
AWS_ACCESS_KEY_ID=minioUsername
AWS_SECRET_ACCESS_KEY=minioPassword
```
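Because the env file in step 2 and the credentials file in step 4 carry the same MinIO keys, it can help to check one and derive the other. The following is a minimal sketch using only the standard library. The function names are hypothetical, the list of required variables is taken from the example env file above, and where the rendered text should be saved is left to the reader.

```python
import configparser
import io

# Variable names taken from the example env file in step 2.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "MLFLOW_S3_IGNORE_TLS",
    "POSTGRES_USER",
    "POSTGRES_PASSWORD",
    "POSTGRES_DB",
)


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_vars(values, required=REQUIRED_VARS):
    """Return the required variable names that are absent or empty."""
    return [name for name in required if not values.get(name)]


def render_credentials(values):
    """Render a [default] profile that reuses the MinIO keys from the env file."""
    # interpolation=None so that '%' characters in a password are kept literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the upper-case key names as written
    parser["default"] = {
        "AWS_ACCESS_KEY_ID": values["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": values["AWS_SECRET_ACCESS_KEY"],
    }
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()
```

A typical use would be to read the env file, stop if `missing_vars` returns anything, and only then write the output of `render_credentials` to the credentials file.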

Upon a successful build, the server should be up and running locally. By default, the mlflow user interface can be accessed at http://localhost:85 and minio can be accessed at https://localhost:8002.

To check that the server is running, run docker ps in a terminal to list the running containers. The output should look something like:

```
CONTAINER ID   IMAGE                                      COMMAND                  CREATED             STATUS                       PORTS                                        NAMES
3d51a7580b6f   mlflow_nginx                               "nginx -g 'daemon of…"   About an hour ago   Up About an hour             0.0.0.0:80->80/tcp, 0.0.0.0:8002->8002/tcp   mlflow_nginx
1baa8ff12814   mlflow_app                                 "mlflow server --bac…"   About an hour ago   Up About an hour             5000/tcp                                     mlflow_server
a397b4149c5f   minio/minio:RELEASE.2021-03-17T02-33-02Z   "/usr/bin/docker-ent…"   About an hour ago   Up About an hour (healthy)   9000/tcp, 9002/tcp                           mlflow_server_s3_1
65374369fe4d   postgres:13.1                              "/docker-entrypoint.…"   About an hour ago   Up About an hour (healthy)   5432/tcp,                                    mlflow_db
```
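Reading the table by eye works, but the check can also be automated. The sketch below assumes the container names match the NAMES column in the example output above; names can differ between docker-compose versions, so treat `EXPECTED_CONTAINERS` as an assumption to adjust. Both function names are hypothetical.

```python
import subprocess

# Container names as they appear in the NAMES column of the example output.
EXPECTED_CONTAINERS = (
    "mlflow_nginx",
    "mlflow_server",
    "mlflow_server_s3_1",
    "mlflow_db",
)


def missing_containers(names_output, expected=EXPECTED_CONTAINERS):
    """Return expected container names that are absent from `names_output`.

    `names_output` is text with one running container name per line.
    """
    running = {line.strip() for line in names_output.splitlines() if line.strip()}
    return [name for name in expected if name not in running]


def running_container_names():
    """Ask the Docker CLI for the names of running containers, one per line."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `missing_containers(running_container_names())` on the server host returns an empty list when all four containers are up.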

Server components overview

When we ran docker-compose up we started 4 networked containers, each of which serves a purpose within the MLOps framework.

  1. NGINX: The nginx container acts as a reverse proxy to control network traffic.
  2. MLflow: The MLflow container hosts our MLflow server instance. This server is responsible for tracking and logging the MLOps events sent to it.
  3. MINIO: The MINIO container hosts our MINIO server. Here we are using MINIO as a self hosted S3 storage location. The MLflow container interfaces well with S3 storage locations for logging artifacts (models, images, plots etc).
  4. postgres: The database server container is visible only to the MLflow container, which logs MLflow entities to the postgres database hosted on this container. MLFlow entities should not be confused with artifacts (stored on MINIO), and are simple values such as metrics, parameters and configuration options which can be efficiently stored in a database.

There are two bridge networks which connect these containers, named 'frontend' and 'backend'. The backend is used for communication between containers and is not accessible from the host (or remote). The frontend is accessible from the host (or remote) through the NGINX reverse proxy. NGINX acts as our gatekeeper and all requests pass through it. This enables us to take advantage of NGINX load balancing and authentication in production versions.

Experiment tracking with MLflow

MLflow is a framework for managing the full lifecycle of AI models. It contains tools to cover each stage of the AI model lifecycle, organised into 4 major components: Tracking, Projects, Models, and a Model Registry. The endpoint for these tools is an MLflow server that can run on local or remote hardware and handles all aspects of the lifecycle.

Currently, we will focus primarily on the tracking and projects components.

  • Tracking refers to tools used to track experiments in order to record and compare parameters and results. This is done by adding logging snippets to the ML code to record things like hyper-parameters, metrics and artifacts. These entities are then associated with a particular run, which is in turn tied to a specific git commit. This git commit points to a specific version of the project files. This means that by using MLflow tracking we are able to identify the code used to train an AI model and make comparisons following changes to code structure and hyperparameter choices.

  • MLflow uses projects to encapsulate AI tools in a reusable and reproducible way, based primarily on conventions. It also enables us to chain together project workflows meaning we are able to automate a great deal of the model development process.
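As an illustration of the "logging snippets" mentioned under Tracking, the sketch below logs one run's parameters and metrics with the public MLflow API. The function names are hypothetical, and the default tracking URI assumes the local address given in the setup section; adjust it for your deployment. Linking a run to a git commit is handled by MLflow and the framework, not by this snippet.

```python
def flatten_params(config, prefix=""):
    """Flatten a nested dict into dotted keys, as MLflow params are flat key/value pairs."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_params(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def log_training_run(config, metrics, tracking_uri="http://localhost:85"):
    """Log one run's parameters and metrics to an MLflow tracking server."""
    # Imported lazily so flatten_params can be used without mlflow installed.
    import mlflow

    mlflow.set_tracking_uri(tracking_uri)  # assumed address, see lead-in
    with mlflow.start_run():
        mlflow.log_params(flatten_params(config))
        for name, value in metrics.items():
            mlflow.log_metric(name, value)
```

For example, `log_training_run({"optimiser": {"lr": 0.001}}, {"val_loss": 0.42})` would record a parameter named `optimiser.lr` and a metric named `val_loss` against a new run.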

csc-mlops package

The csc-mlops package can be installed using pip:

```shell
pip install csc-mlops
```

Experiment

The Experiment class is the primary interface between the developer's project code and the MLOps processes. By using Experiment a number of important processes are automated:

  • Project configuration and registration
  • Communication with the MLOps server
  • Checking that all project code is committed and current with the repository
  • Building the Docker image if it can't be found locally
  • Configuring the project logger

To use the Experiment class the project must be run using a syntax such as:

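```python
from mlops.Experiment import Experiment

# Path to the project configuration file
config_path = 'config/config.cfg'

# Configure the experiment, then run the project entry point
exp = Experiment(config_path=config_path)
exp.run(docker_args={}, entry_point='main')
```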

When using the project template this process is performed when executing the run_project.py script.

For more information on how to define the project configuration using a config.cfg file see the project template documentation
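To give a feel for one of the automated steps listed above, the check that project code is committed, here is a minimal sketch. It is not how the Experiment class implements the check; it is a standalone illustration that shells out to `git status --porcelain`, and both function names are hypothetical. It does not cover the "current with repository" part, i.e. whether local commits have been pushed.

```python
import subprocess


def is_clean(porcelain_output):
    """True if `git status --porcelain` output lists no modified or untracked files."""
    return porcelain_output.strip() == ""


def working_tree_is_clean(repo_dir="."):
    """Run `git status --porcelain` in `repo_dir` and report whether it is clean."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return is_clean(result.stdout)
```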

Additional Tools

Additional tools designed to be used with MLOps are located in the tools folder.

  • Data toolkit
    • Tools for collecting information about large data stores.

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

  1. Fork or clone the Project
  2. Since all code changes are staged on the develop branch before releases you will need to checkout this branch first (git checkout -b develop)
  3. Create your Feature Branch off of develop (git checkout -b feature/AmazingFeature)
  4. Commit your Changes (git commit -m 'Add some AmazingFeature')
  5. Push to the remote (git push origin feature/AmazingFeature)
  6. Open a Pull Request and specify that you want to merge your feature branch into the develop branch

Testing

When contributing, you are strongly encouraged to write tests for any functions or classes you add. Please use pytest and add your tests to an appropriate location in the tests directory, which also contains some examples to get you started.
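If you have not written pytest tests before, the sketch below shows the general shape. The function under test, `clip_to_range`, and the file name are hypothetical and exist only for this example; in a real contribution the tests would import the function from the package instead of defining it alongside.

```python
# tests/test_example.py (hypothetical file name)


def clip_to_range(value, lower, upper):
    """Hypothetical function under test: clamp `value` to [lower, upper]."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return max(lower, min(value, upper))


def test_value_inside_range_is_unchanged():
    assert clip_to_range(5, 0, 10) == 5


def test_value_outside_range_is_clamped():
    assert clip_to_range(-3, 0, 10) == 0
    assert clip_to_range(42, 0, 10) == 10


def test_invalid_range_raises():
    # Written without importing pytest so the file runs anywhere.
    try:
        clip_to_range(1, 10, 0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

pytest collects any function whose name starts with `test_` in files named `test_*.py`. With pytest installed, `pytest.raises(ValueError)` is the more idiomatic way to write the last test.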

Warning!

Please be aware of unsafe deserialisation when using MLFlow. Do not download models from publicly hosted MLFlow instances and then load them locally, as this can allow potentially malicious code to be run on your machine.

https://github.com/advisories/GHSA-cwgg-w6mp-w9hg
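To see why loading an untrusted model is risky, consider the harmless sketch below. It assumes a pickle-based model format and is a generic illustration of unsafe deserialisation in Python, not a reproduction of the linked advisory. The `Payload` class is hypothetical.

```python
import pickle


class Payload:
    """Object whose pickled form asks the loader to call a function."""

    def __reduce__(self):
        # pickle stores the instruction "call eval('6 * 7')".
        # A real attack would substitute a harmful call here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading runs the embedded call; the loader never needs the Payload class.
result = pickle.loads(blob)
print(result)
```

The value returned by `pickle.loads` is whatever the embedded call produced, which shows that simply loading a file is enough to execute code chosen by its author.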

Acknowledgements

Owner

  • Name: Clinical Scientific Computing
  • Login: GSTT-CSC
  • Kind: organization
  • Location: United Kingdom

Clinical Scientific Computing @ Guy's & St. Thomas' NHS Foundation Trust

Citation (CITATION.cff)

# YAML 1.2
---
authors: 
  -
    family-names: Jackson
    given-names: Laurence
    orcid: "https://orcid.org/0000-0002-5904-8012"
  -
    family-names: Deng
    given-names: Alexander T
    orcid: "https://orcid.org/0000-0002-8121-817X"
  -
    family-names: Malashchuk
    given-names: Igor
  -
    family-names: Nayal
    given-names: Virender
  -
    family-names: Zou
    given-names: Jason
  -
    family-names: Shuaib
    given-names: Haris
    orcid: "https://orcid.org/0000-0001-6975-5960"
cff-version: "1.1.0"
date-released: 2021-09-17
message: "If you use MLOps in your work, please cite it using these metadata."
repository-code: "https://github.com/GSTT-CSC/MLOps"
title: MLOps
version: "0.9.18"
...

GitHub Events

Total
  • Create event: 3
  • Release event: 1
  • Issues event: 2
  • Watch event: 4
  • Delete event: 1
  • Issue comment event: 1
  • Push event: 8
  • Pull request review comment event: 3
  • Pull request review event: 3
  • Pull request event: 2
  • Fork event: 1
Last Year
  • Create event: 3
  • Release event: 1
  • Issues event: 2
  • Watch event: 4
  • Delete event: 1
  • Issue comment event: 1
  • Push event: 8
  • Pull request review comment event: 3
  • Pull request review event: 3
  • Pull request event: 2
  • Fork event: 1

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 61
  • Total pull requests: 67
  • Average time to close issues: 2 months
  • Average time to close pull requests: 4 days
  • Total issue authors: 9
  • Total pull request authors: 6
  • Average comments per issue: 0.44
  • Average comments per pull request: 0.87
  • Merged pull requests: 66
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: about 22 hours
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.5
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • laurencejackson (41)
  • mikewoodward94 (5)
  • sophie22 (4)
  • hshuaib90 (3)
  • AnilMistry (2)
  • dangerdika (2)
  • tomaroberts (2)
  • Alexiszcv (1)
  • helghast79 (1)
Pull Request Authors
  • laurencejackson (56)
  • mikewoodward94 (13)
  • sophie22 (2)
  • AnilMistry (2)
  • tomaroberts (2)
  • Alexiszcv (1)
Top Labels
Issue Labels
enhancement (14) bug (7) good first issue (5) documentation (4)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 426 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 2
  • Total versions: 52
  • Total maintainers: 3
pypi.org: csc-mlops

An MLOps framework for development of clinical applications

  • Versions: 52
  • Dependent Packages: 0
  • Dependent Repositories: 2
  • Downloads: 426 Last month
Rankings
Downloads: 7.6%
Dependent packages count: 10.0%
Dependent repos count: 11.6%
Average: 12.2%
Forks count: 13.3%
Stargazers count: 18.5%
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • GitPython *
  • boto3 *
  • colorlog *
  • docker *
  • fsspec *
  • itk *
  • matplotlib *
  • minio >=7.0.3
  • mlflow ==1.26.0
  • monai *
  • pandas *
  • pytest >=6.2
  • scikit-build *
  • tqdm *
  • xnat *
tests/data/requirements.txt pypi
  • numpy *
tools/datatoolkit/setup.py pypi
  • gitpython *
  • jinja2 >=2.7
  • prettytable *
  • pyyaml *
.github/workflows/master-develop-test.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • jwalton/gh-docker-logs v2 composite
  • rbialon/flake8-annotations v1 composite
  • schneegans/dynamic-badges-action v1.0.0 composite
.github/workflows/pull_request_tests.yml actions
  • MishaKav/pytest-coverage-comment main composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • jwalton/gh-docker-logs v2 composite
  • rbialon/flake8-annotations v1 composite
.github/workflows/python-publish.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
  • pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
.github/workflows/test_cli.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
mlflow_server/docker-compose.yml docker
  • minio/mc latest
  • minio/minio RELEASE.2022-11-08T05-27-07Z
  • mlflow_nginx latest
  • mlflow_server latest
  • postgres 13.1
mlflow_server/mlflow/Dockerfile docker
  • python 3.9-slim build
mlflow_server/nginx/Dockerfile docker
  • nginx 1.17.6 build
tests/data/Dockerfile docker
  • python 3.9-slim build
mlflow_server/mlflow/requirements_mlflow.txt pypi
  • boto3 *
  • mlflow ==2.0.1
  • psycopg2 *
  • psycopg2-binary *
  • pymysql *
  • sqlalchemy *