zaira-chem
Automated QSAR based on multiple small molecule descriptors
Statistics
- Stars: 40
- Watchers: 5
- Forks: 13
- Open Issues: 12
- Releases: 3
README.md
ZairaChem: Automated ML-based (Q)SAR
ZairaChem is the first library of Ersilia's family of tools devoted to providing out-of-the-box machine learning solutions for biomedical problems. In this case, we have focused on (Q)SAR models. (Q)SAR models take chemical structures as input and give as output predicted properties, typically pharmacological properties such as bioactivity against a certain target.
Both Ersilia and Zaira are cities described in Italo Calvino's book 'Invisible Cities' (1972). Ersilia is a "trading city" where inhabitants stretch strings from the corners of the houses to establish the relationships that sustain the life of the city. When the strings become too numerous, they rebuild Ersilia elsewhere, and their network of relationships remains. Zaira is a "city of memories". It contains its own past written in every corner, scratched in every pole, window and bannister.
Installation
Clone the repository to your local system:
git clone https://github.com/ersilia-os/zaira-chem.git
cd zaira-chem
From the terminal, run the installation script:
bash install_linux.sh
By default, a Conda environment named zairachem will be created. Activate it:
conda activate zairachem
Usage
ZairaChem can be run as a command line interface. To learn more about the ZairaChem commands, see the help command:
zairachem --help
Quick start
ZairaChem expects a comma- or tab-separated file containing two columns: a "smiles" column with the molecules in SMILES format and an "activity" column with the activity values.
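As an illustration of the expected layout, the sketch below builds a minimal comma-separated input with Python's standard library. The molecules, the activity values, and the helper name `format_input` are made up for this example and are not part of ZairaChem.

```python
import csv
import io

# Hypothetical toy data: SMILES strings paired with made-up activity values.
rows = [
    {"smiles": "CCO", "activity": 12.5},
    {"smiles": "c1ccccc1", "activity": 0.05},
    {"smiles": "CC(=O)O", "activity": 3.2},
]

def format_input(rows):
    """Return CSV text with a 'smiles' column and an 'activity' column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["smiles", "activity"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

text = format_input(rows)
print(text)
```

Saving `text` to a file such as input.csv gives a file with the two columns described above.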
To get started, let's load an example classification task from Therapeutic Data Commons.
zairachem example --file_name input.csv
This file can be split into train and test sets.
zairachem split -i input.csv
The command above will generate two files in your working directory, named train.csv and test.csv. By default, the train:test ratio is 80:20.
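To make the 80:20 ratio concrete, here is a minimal sketch of a random split in Python. It assumes a plain shuffled split; this README does not say which strategy `zairachem split` uses internally, so the function name `split_rows` and the fixed seed are illustrative only.

```python
import random

def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle rows and split them into train and test lists."""
    shuffled = list(rows)  # copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_fraction))
    return shuffled[:n_train], shuffled[n_train:]

# Ten hypothetical (smiles, activity) rows.
rows = [("C" * i, float(i)) for i in range(1, 11)]
train, test = split_rows(rows)
print(len(train), len(test))  # 8 2
```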
Fit
You can train a model as follows:
zairachem fit -i train.csv -m model
This command will run the full ZairaChem pipeline and produce a model folder with processed data, model checkpoints, and reports. If no cut-off is specified for the classification, ZairaChem will establish an internal cut-off to separate Category 0 from Category 1. The output results will always provide the probability of a molecule being Category 1.
Alternatively, you can set your preferred cut-off with the following command:
zairachem fit -i train.csv -c 0.1 -d low -m model
Here, '-c' sets the cut-off on the activity values and '-d' specifies the direction. If set to 'low', values <= c will be considered 1; if set to 'high', values >= c will be considered 1.
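The cut-off and direction rule can be sketched in a few lines of Python. This only restates the rule described above; the function name `binarize` and the sample values are hypothetical, and ZairaChem's internal implementation may differ.

```python
def binarize(values, cutoff, direction):
    """Map activity values to 0/1 labels following the -c / -d rule."""
    if direction == "low":
        # Values at or below the cut-off are labelled 1.
        return [1 if v <= cutoff else 0 for v in values]
    if direction == "high":
        # Values at or above the cut-off are labelled 1.
        return [1 if v >= cutoff else 0 for v in values]
    raise ValueError("direction must be 'low' or 'high'")

activities = [0.05, 0.1, 0.5, 2.0]
print(binarize(activities, cutoff=0.1, direction="low"))   # [1, 1, 0, 0]
print(binarize(activities, cutoff=0.1, direction="high"))  # [0, 1, 1, 1]
```

Note that a value exactly equal to the cut-off is labelled 1 in both directions.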
Predict
You can then run predictions on the test set:
zairachem predict -i test.csv -m model -o test
ZairaChem will run predictions using the checkpoints stored in model and store results in the test directory. Several performance plots will be generated alongside prediction outputs.
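If you prefer to drive the split, fit and predict steps from a script, one possible sketch is shown below. It uses only the commands and flags shown in this README; the helper names `build_commands` and `run_pipeline` are hypothetical, and the script simply prints the planned commands when the zairachem executable is not found on the PATH.

```python
import shutil
import subprocess

def build_commands(input_csv="input.csv", model_dir="model", output_dir="test"):
    """Return the split, fit and predict commands as argument lists."""
    return [
        ["zairachem", "split", "-i", input_csv],
        ["zairachem", "fit", "-i", "train.csv", "-m", model_dir],
        ["zairachem", "predict", "-i", "test.csv", "-m", model_dir, "-o", output_dir],
    ]

def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)

commands = build_commands()
if shutil.which("zairachem") is None:
    # CLI not on PATH (for example, the Conda environment is not activated):
    # only show what would be run.
    for command in commands:
        print(" ".join(command))
else:
    run_pipeline(commands)
```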
Distill
You can distill a more compact version of the model with the built-in Olinda (https://github.com/ersilia-os/olinda) pipeline:
zairachem distill -m path_to_zairachem_model -o model.onnx
You can then run predictions through the new Olinda ONNX model with the same ZairaChem CLI command:
zairachem predict -i test.csv -m model.onnx -o test
Additional Information
For further technical details, please read the ZairaChem page of the Ersilia GitBook, which describes each major step in the ZairaChem pipeline. The corresponding publication for the ZairaChem pipeline is available at https://doi.org/10.1038/s41467-023-41512-2.
Citation
If you use ZairaChem, please cite us:
@article{Turon2023,
author = {Turon, G. and Hlozek, J. and Woodland, J.G. and others},
title = {First fully-automated AI/ML virtual screening cascade implemented at a drug discovery centre in Africa},
journal = {Nat Commun},
volume = {14},
pages = {5736},
year = {2023},
doi = {10.1038/s41467-023-41512-2},
url = {https://doi.org/10.1038/s41467-023-41512-2}
}
About us
Learn about the Ersilia Open Source Initiative!
Owner
- Name: Ersilia Open Source Initiative
- Login: ersilia-os
- Kind: organization
- Email: hello@ersilia.io
- Location: United Kingdom
- Website: ersilia.io
- Twitter: ersiliaio
- Repositories: 64
- Profile: https://github.com/ersilia-os
Ersilia is a charity developing open source tools to facilitate global health drug discovery, with a focus on neglected diseases, for equal healthcare.
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Turon"
  given-names: "Gemma"
  orcid: "https://orcid.org/0000-0001-6798-0275"
- family-names: "Duran-Frigola"
  given-names: "Miquel"
  orcid: "https://orcid.org/0000-0002-9906-6936"
title: "ZairaChem: automated ML modelling for chemistry datasets"
version: 1.0.0
doi: 10.5281/zenodo.7352287
date-released: 2022-11-23
url: "https://github.com/ersilia-os/zaira-chem"
GitHub Events
Total
- Issues event: 11
- Watch event: 10
- Issue comment event: 28
- Push event: 1
- Pull request event: 2
- Fork event: 2
Last Year
- Issues event: 11
- Watch event: 10
- Issue comment event: 28
- Push event: 1
- Pull request event: 2
- Fork event: 2
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 42
- Total pull requests: 9
- Average time to close issues: 3 months
- Average time to close pull requests: 6 days
- Total issue authors: 11
- Total pull request authors: 4
- Average comments per issue: 4.55
- Average comments per pull request: 0.44
- Merged pull requests: 9
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 5
- Pull requests: 2
- Average time to close issues: about 1 month
- Average time to close pull requests: 5 days
- Issue authors: 3
- Pull request authors: 1
- Average comments per issue: 3.8
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- GemmaTuron (17)
- miquelduranfrigola (8)
- JHlozek (6)
- marcostorrework (4)
- Femme-js (1)
- sooheon (1)
- gdreiman-insitro (1)
- sistar2020 (1)
- leoank (1)
- nataliyah123 (1)
- paulinebanye (1)
Pull Request Authors
- JHlozek (4)
- GemmaTuron (2)
- HellenNamulinda (2)
- marcostorrework (1)