zaira-chem

Automated QSAR based on multiple small molecule descriptors

https://github.com/ersilia-os/zaira-chem

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: nature.com, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.0%) to scientific vocabulary

Keywords

automl machine-learning qsar
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: ersilia-os
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 1020 MB
Statistics
  • Stars: 40
  • Watchers: 5
  • Forks: 13
  • Open Issues: 12
  • Releases: 3
Created over 4 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Badges: Contributor Covenant · License: GPL v3 · DOI · documentation · Python 3.7 · Code style: black

ZairaChem: Automated ML-based (Q)SAR

ZairaChem is the first library of Ersilia's family of tools devoted to providing out-of-the-box machine learning solutions for biomedical problems. In this case, we have focused on (Q)SAR models. (Q)SAR models take chemical structures as input and give as output predicted properties, typically pharmacological properties such as bioactivity against a certain target.

Both Ersilia and Zaira are cities described in Italo Calvino's book 'Invisible Cities' (1972). Ersilia is a "trading city" where inhabitants stretch strings from the corners of the houses to establish the relationships that sustain the life of the city. When the strings become too numerous, they rebuild Ersilia elsewhere, and their network of relationships remains. Zaira is a "city of memories". It contains its own past written in every corner, scratched in every pole, window and bannister.

Installation

Clone the repository to your local system:

git clone https://github.com/ersilia-os/zaira-chem.git
cd zaira-chem

From the terminal, run the installation script:

bash install_linux.sh

By default, a Conda environment named zairachem will be created. Activate it:

conda activate zairachem

Usage

ZairaChem runs as a command line interface. To learn more about the ZairaChem commands, see the help command:

zairachem --help

Quick start

ZairaChem expects a comma- or tab-separated file containing two columns: a "smiles" column with the molecules in SMILES format and an "activity" column with the activity values.
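As an illustration of the expected layout, the following minimal sketch builds a comma-separated input with the two required columns using Python's standard `csv` module. The molecules and activity labels are made-up placeholders, and the helper name `to_csv_text` is hypothetical, not part of ZairaChem.

```python
import csv
import io

# Hypothetical example data: SMILES strings with made-up binary activity labels.
rows = [
    ("CCO", 0),
    ("c1ccccc1", 0),
    ("CC(=O)Oc1ccccc1C(=O)O", 1),
]


def to_csv_text(rows):
    """Serialize (smiles, activity) pairs to CSV text with the expected header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["smiles", "activity"])
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv_text(rows)
print(csv_text)

# To save the table for use with the command line:
# with open("input.csv", "w", newline="") as handle:
#     handle.write(csv_text)
```

The same content saved with tab separators should also be accepted, since the README states that both comma- and tab-separated files are supported.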

To get started, let's load an example classification task from Therapeutic Data Commons.

zairachem example --file_name input.csv

This file can be split into train and test sets.

zairachem split -i input.csv

The command above will generate two files in your working directory, named train.csv and test.csv. By default, the train:test ratio is 80:20.
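To make the 80:20 ratio concrete, here is a minimal sketch of a reproducible random split in plain Python. This is an assumption for illustration only: the README does not say which splitting strategy `zairachem split` uses internally, and the function name `split_rows`, the seed, and the placeholder records are all hypothetical.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_fraction))
    return shuffled[:n_train], shuffled[n_train:]


# Ten placeholder records stand in for (smiles, activity) rows.
records = [(f"molecule_{i}", i % 2) for i in range(10)]
train, test = split_rows(records)
print(len(train), len(test))
```

With ten records and the default fraction, eight rows land in the training list and two in the test list.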

Fit

You can train a model as follows:

zairachem fit -i train.csv -m model

This command runs the full ZairaChem pipeline and produces a model folder with processed data, model checkpoints, and reports. If no cut-off is specified for a classification task, ZairaChem establishes an internal cut-off to separate Category 0 from Category 1. The output always reports the probability of a molecule being Category 1.

Alternatively, you can set your preferred cut-off with the following command:

zairachem fit -i train.csv -c 0.1 -d low -m model

Here '-c' indicates the cut-off on the activity values and '-d' specifies the direction. If set to 'low', values <= c are considered 1; if set to 'high', values >= c are considered 1.
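The cut-off and direction rule can be sketched in a few lines of Python. This only mirrors the rule as stated in the text (low: values <= c become 1; high: values >= c become 1); the function name `binarize` and the sample values are hypothetical, and the sketch is not ZairaChem's internal code.

```python
def binarize(values, cutoff, direction):
    """Map continuous activity values to 0/1 labels.

    direction="low":  values <= cutoff become 1
    direction="high": values >= cutoff become 1
    """
    if direction == "low":
        return [1 if v <= cutoff else 0 for v in values]
    if direction == "high":
        return [1 if v >= cutoff else 0 for v in values]
    raise ValueError("direction must be 'low' or 'high'")


# Made-up activity values for illustration.
activities = [0.05, 0.1, 0.5, 2.0]
low_labels = binarize(activities, cutoff=0.1, direction="low")
high_labels = binarize(activities, cutoff=0.1, direction="high")
print(low_labels)   # [1, 1, 0, 0]
print(high_labels)  # [0, 1, 1, 1]
```

Note that a value exactly equal to the cut-off is labelled 1 in both directions, which follows from the inclusive comparisons in the rule.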

Predict

You can then run predictions on the test set:

zairachem predict -i test.csv -m model -o test

ZairaChem will run predictions using the checkpoints stored in model and store results in the test directory. Several performance plots will be generated alongside prediction outputs.
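For scripted runs, the quick-start steps up to this point could be chained from Python. The sketch below only assembles the command lines, using the subcommands and flags shown in this README; the helper name `quick_start_commands` and its default arguments are hypothetical. Executing the commands requires the activated zairachem environment, so that part is left commented out.

```python
def quick_start_commands(input_file="input.csv", model_dir="model", output_dir="test"):
    """Return the quick-start CLI calls as argument lists, in execution order."""
    return [
        ["zairachem", "example", "--file_name", input_file],
        ["zairachem", "split", "-i", input_file],
        ["zairachem", "fit", "-i", "train.csv", "-m", model_dir],
        ["zairachem", "predict", "-i", "test.csv", "-m", model_dir, "-o", output_dir],
    ]


commands = quick_start_commands()
for command in commands:
    print(" ".join(command))

# To actually execute the steps (requires the zairachem Conda environment):
# import subprocess
# for command in commands:
#     subprocess.run(command, check=True)
```

Passing each command as a list of arguments avoids shell quoting problems, and `check=True` would stop the chain at the first failing step.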

Distill

You can distill a more compact version of the model with the built-in Olinda (https://github.com/ersilia-os/olinda) pipeline:

zairachem distill -m path_to_zairachem_model -o model.onnx

You can then run predictions through the new Olinda ONNX model with the same ZairaChem CLI command:

zairachem predict -i test.csv -m model.onnx -o test

Additional Information

For further technical details, please read the ZairaChem page of the Ersilia gitbook, which describes each major step in the ZairaChem pipeline. The corresponding publication for the ZairaChem pipeline is available at https://doi.org/10.1038/s41467-023-41512-2.

Citation

If you use ZairaChem, please cite us:

@article{Turon2023,
  author = {Turon, G. and Hlozek, J. and Woodland, J.G. and others},
  title = {First fully-automated AI/ML virtual screening cascade implemented at a drug discovery centre in Africa},
  journal = {Nat Commun},
  volume = {14},
  pages = {5736},
  year = {2023},
  doi = {10.1038/s41467-023-41512-2},
  url = {https://doi.org/10.1038/s41467-023-41512-2}
}

About us

Learn about the Ersilia Open Source Initiative!

Owner

  • Name: Ersilia Open Source Initiative
  • Login: ersilia-os
  • Kind: organization
  • Email: hello@ersilia.io
  • Location: United Kingdom

Ersilia is a charity developing open source tools to facilitate global health drug discovery, with a focus on neglected diseases, for equal healthcare.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Turon"
  given-names: "Gemma"
  orcid: "https://orcid.org/0000-0001-6798-0275"
- family-names: "Duran-Frigola"
  given-names: "Miquel"
  orcid: "https://orcid.org/0000-0002-9906-6936"
title: "ZairaChem: automated ML modelling for chemistry datasets"
version: 1.0.0
doi: 10.5281/zenodo.7352287
date-released: 2022-11-23
url: "https://github.com/ersilia-os/zaira-chem"

GitHub Events

Total
  • Issues event: 11
  • Watch event: 10
  • Issue comment event: 28
  • Push event: 1
  • Pull request event: 2
  • Fork event: 2
Last Year
  • Issues event: 11
  • Watch event: 10
  • Issue comment event: 28
  • Push event: 1
  • Pull request event: 2
  • Fork event: 2

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 42
  • Total pull requests: 9
  • Average time to close issues: 3 months
  • Average time to close pull requests: 6 days
  • Total issue authors: 11
  • Total pull request authors: 4
  • Average comments per issue: 4.55
  • Average comments per pull request: 0.44
  • Merged pull requests: 9
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 5
  • Pull requests: 2
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 5 days
  • Issue authors: 3
  • Pull request authors: 1
  • Average comments per issue: 3.8
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • GemmaTuron (17)
  • miquelduranfrigola (8)
  • JHlozek (6)
  • marcostorrework (4)
  • Femme-js (1)
  • sooheon (1)
  • gdreiman-insitro (1)
  • sistar2020 (1)
  • leoank (1)
  • nataliyah123 (1)
  • paulinebanye (1)
Pull Request Authors
  • JHlozek (4)
  • GemmaTuron (2)
  • HellenNamulinda (2)
  • marcostorrework (1)
Top Labels
Issue Labels
bug (1) documentation (1) enhancement (1)
Pull Request Labels
enhancement (1)