chembl-binary-tasks
A repository to generate bioassay datasets from ChEMBL ready for downstream AI/ML modelling
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (15.9%) to scientific vocabulary
Repository
A repository to generate bioassay datasets from ChEMBL ready for downstream AI/ML modelling
Basic Info
- Host: GitHub
- Owner: ersilia-os
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 10.1 MB
Statistics
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 3
- Releases: 0
Metadata Files
README.md
Antimicrobial binary ML tasks from ChEMBL
IMPORTANT: THIS REPOSITORY IS THE OLD VERSION OF chembl-antimicrobial-tasks.
A repository to generate bioassay datasets from ChEMBL ready for downstream AI/ML modelling. This repository is based on the antimicrobial-ml-tasks and chembl-ml-tools repositories, which are now archived.
Installation
To install the package in a conda environment, please run:
conda create -n chemblml python=3.10
conda activate chemblml
pip install git+https://github.com/ersilia-os/chembl-binary-tasks.git
Requirements
These tools require access to a PostgreSQL database server containing the ChEMBL database. You may install ChEMBL on your own computer by following these instructions: How to install ChEMBL
You can use the following code to check that the package is working. This test assumes that there is a DB user called chembl_user with permission to read the database.
Before running, make sure that the PostgreSQL service with the ChEMBL database is up.
```
import pandas as pd
from chembl_ml_tools import chembl_activity_target

df1 = chembl_activity_target(
    db_user='chembl_user',
    db_password='aaa',
    organism_contains='enterobacter',
    max_heavy_atoms=100)

print(df1.head(5))
```
Create datasets
1. Make sure that the PostgreSQL server containing the ChEMBL database is running. In case of doubt, review the requirements section.
By default, the programs assume that PostgreSQL is running on the local computer, and that the database user chembl_user with
password aaa has read access to the ChEMBL tables. This can be changed in scripts/defaults.py.
2. Revise src/default.py. This file contains several default settings, including the path where data will be stored and the minimum number of assays to consider. Modify it according to your needs.
3. Edit the file config/pathogens.csv to select the pathogens for which we need models.
This file has two columns:
- pathogen_code: A short code to identify the pathogen, alphanumeric only, without spaces. Example: "efaecium".
- search_text: A case-insensitive search string used to match the pathogen name in the organism field of the ChEMBL database. Example: "Enterococcus Faecium".
4. Run the script pathogens.py
cd src
python pathogens.py
This will create folders under the specified DATAPATH folder with all the available data for the selected pathogens. In this example, we will refer to it as the /data folder by convention.
5. Create the datasets for each individual pathogen. This includes a binary classification of all assays pooled together as well as of independent assays. Run the script main.py, passing the pathogen code as an argument.
cd src
python main.py <pathogen_code>
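The pathogen selection in steps 3 and 4 boils down to a case-insensitive substring match against ChEMBL's organism field. A minimal sketch of that matching, using toy data (the column names and tables here are illustrative, not the repository's actual code):

```python
import pandas as pd

# Hypothetical configuration, mirroring config/pathogens.csv
pathogens = pd.DataFrame({
    "pathogen_code": ["efaecium", "saureus"],
    "search_text": ["Enterococcus Faecium", "Staphylococcus Aureus"],
})

# Toy stand-in for the organism column of a ChEMBL activity table
activities = pd.DataFrame({
    "organism": ["Enterococcus faecium", "Staphylococcus aureus", "Homo sapiens"],
    "molecule_chembl_id": ["CHEMBL1", "CHEMBL2", "CHEMBL3"],
})

for _, row in pathogens.iterrows():
    # Case-insensitive substring match, as described for search_text
    hits = activities[activities["organism"].str.contains(row["search_text"], case=False)]
    print(row["pathogen_code"], len(hits))
```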
Master files:
There are three master files in the /pathogen_name folder:
- pathogen_original.csv: the original file pulled from ChEMBL.
- pathogen_processed.csv: the original file plus the columns final_unit, transformer and final_value. The final_value column contains the end result in the standardised unit defined in config/ucum.csv.
- pathogen_binary.csv: the processed file including the cut-offs for each assay (and unit type). Assay-unit combinations not selected in `config/st_type_summary_manual.csv` are not included here. Two columns are created, activity_lc and activity_hc, corresponding to the binary activity for the row when using the Low cut-off or the High cut-off, respectively. If there is a comment from the author (Active, Non Active), it determines the activity in both LC and HC. This file is then processed to create the following:
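The binarization with a low and a high cut-off, and the author-comment override, can be sketched as follows (the cut-off values, column names and override logic here are illustrative assumptions, not the repository's exact implementation):

```python
import pandas as pd

# Toy processed data: a standardised measurement plus an optional author comment
df = pd.DataFrame({
    "final_value": [0.5, 50.0, 120.0],
    "activity_comment": [None, "Active", None],
})

low_cutoff, high_cutoff = 100.0, 10.0  # e.g. MIC in ug/mL: a lower cut-off is more stringent

# A measured value at or below the cut-off counts as active (1)
df["activity_lc"] = (df["final_value"] <= low_cutoff).astype(int)
df["activity_hc"] = (df["final_value"] <= high_cutoff).astype(int)

# An author comment (Active / Non Active) overrides both cut-offs
is_active = df["activity_comment"].eq("Active")
is_inactive = df["activity_comment"].eq("Non Active")
df.loc[is_active, ["activity_lc", "activity_hc"]] = 1
df.loc[is_inactive, ["activity_lc", "activity_hc"]] = 0
```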
Outputs:
- Two files containing all the molecules and their binary classification, regardless of assay or target (whole cell organism or protein), using a high-confidence threshold (pathogen_all_hc) and a low-confidence threshold (pathogen_all_lc) for the binarization of activities. At this stage, if a molecule is duplicated, its values are averaged; if the average is > 0.5 the molecule is considered active (1), else inactive (0).
- Two files containing all the molecules and their binary classification for whole cell assays, using the high-confidence (pathogen_org_all_hc) and low-confidence (pathogen_org_all_lc) thresholds.
- Two files containing all the molecules and their binary classification for protein assays, using the high-confidence (pathogen_prot_all_hc) and low-confidence (pathogen_prot_all_lc) thresholds.
- Files containing the top assays as determined by the thresholds in default.py (for example, an IC50 assay with over 250 molecules). These are named pathogen_org_hc_top{}, pathogen_org_lc_top{}, pathogen_prot_hc_top{} and pathogen_prot_lc_top{}. The assay id and target protein can be found in the summary file.
- Files containing all results for the selected assay types (specified in ST_TYPES in default.py). Currently these include MIC, IC50, IZ, Activity and Inhibition, all relating to whole cell assays. The files produced are named pathogen_st_type_hc.csv and pathogen_st_type_lc.csv.
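The duplicate handling described for the pooled files can be sketched with pandas (a minimal illustration of the averaging rule, not the actual script):

```python
import pandas as pd

# Toy binary activities with duplicated molecules
df = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL1", "CHEMBL1", "CHEMBL2", "CHEMBL2"],
    "activity_hc": [1, 1, 1, 0],
})

# Average duplicated molecules; a mean strictly above 0.5 counts as active
merged = df.groupby("molecule_chembl_id", as_index=False)["activity_hc"].mean()
merged["activity_hc"] = (merged["activity_hc"] > 0.5).astype(int)
print(merged)
```

Note that a molecule with an even split of active and inactive records (mean exactly 0.5) ends up inactive under this rule.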
A pathogen_summary.csv file is created containing a summary of the processing for each pathogen. Running the following creates a full summary for all pathogens:
cd src
python summary.py
Parameters
The following parameters are specified in default.py and can be modified according to the user's needs:
MIN_SIZE_ASSAY_TASK = 1000  # Top assays with at least this data size will get a specific task
MIN_SIZE_PROTEIN_TASK = 250  # Top proteins with at least this data size will get a specific task
MIN_COUNT_POSITIVE_CASES = 30  # Minimum number of positive hits per assay
TOP_ASSAYS = 3  # Max number of selected organism assays
TOP_PROTEINS = 3  # Max number of selected protein assays
DATASET_SIZE_LIMIT = 1e6  # Limit on the largest dataset
ST_TYPES = ["MIC", "IZ", "IC50", "Inhibition", "Activity"]  # Organism bioassays that are merged together
SPLIT_METHOD = 'random'  # Split mode for ZairaChem (see below)
Paths to several files, including the /data folder, can be specified as well.
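The size thresholds and top-N limits above drive the selection of top assays. A toy sketch of how such a selection could work (assay ids and counts are invented, and the actual selection code may differ):

```python
import pandas as pd

MIN_SIZE_ASSAY_TASK = 1000  # minimum data size for an assay to get its own task
TOP_ASSAYS = 3              # maximum number of selected organism assays

# Hypothetical per-assay molecule counts
counts = pd.DataFrame({
    "assay_id": ["A1", "A2", "A3", "A4"],
    "n_molecules": [2500, 1200, 800, 1500],
})

# Keep assays with enough data, then take the largest TOP_ASSAYS of them
eligible = counts[counts["n_molecules"] >= MIN_SIZE_ASSAY_TASK]
top = eligible.sort_values("n_molecules", ascending=False).head(TOP_ASSAYS)
print(top["assay_id"].tolist())
```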
Train models with ZairaChem
Optionally, we offer the possibility of preparing the data directly for model training with ZairaChem. The datasets will automatically be split into train/test sets (80/20) to perform model validation. To learn more about ZairaChem and to install it, please see its repository.
Please run:
conda activate zairachem
cd src
python split_datasets.py <pathogen>
This will create the following files and folders in the /data/<pathogen> folder:
- input: will contain, once the split is performed:
- input.csv: full input data
- train.csv: input data for training
- test.csv: input data for test
- input_rejected.csv: cases that ZairaChem has rejected (typically because the molecule's SMILES is not valid)
- model: Contains the model definition, in the format used by ZairaChem
- test: Predictions for the test data and assessment reports of the model
- log: The log files resulting from the split, test and predict runs of ZairaChem
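The 80/20 random split mentioned above can be sketched with pandas (a minimal illustration under the assumption of a plain random split; ZairaChem's own splitter is the authoritative implementation):

```python
import pandas as pd

# Toy input data: SMILES strings with binary labels
df = pd.DataFrame({"smiles": [f"C{'C' * i}" for i in range(10)],
                   "activity": [0, 1] * 5})

# Random 80/20 train/test split with a fixed seed for reproducibility
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
```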
Once you are ready to train the models, you can run the files `split_<pathogen>.sh` and `fit_predict_<pathogen>.sh` directly (be aware of the memory and time needed to build ZairaChem models), or copy the commands from the .sh files line by line to fit and predict one model at a time:
conda activate zairachem
cd src
bash split_<pathogen>.sh
bash fit_predict_<pathogen>.sh
Owner
- Name: Ersilia Open Source Initiative
- Login: ersilia-os
- Kind: organization
- Email: hello@ersilia.io
- Location: United Kingdom
- Website: ersilia.io
- Twitter: ersiliaio
- Repositories: 64
- Profile: https://github.com/ersilia-os
Ersilia is a charity developing open source tools to facilitate global health drug discovery, with a focus on neglected diseases and equitable healthcare.
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: ChEMBL Binary Tasks
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Marcos
    family-names: de la Torre
    email: marcostorrework@gmail.com
    affiliation: Ersilia Open Source Initiative
  - given-names: Gemma
    family-names: Turon
    email: gemma@ersilia.io
    affiliation: Ersilia Open Source Initiative
    orcid: 'https://orcid.org/0000-0001-6798-0275'
  - given-names: Miquel
    family-names: Duran-Frigola
    email: miquel@ersilia.io
    affiliation: Ersilia Open Source Initiative
    orcid: 'https://orcid.org/0000-0002-9906-6936'
repository-code: 'https://github.com/ersilia-os/chembl-binary-tasks/'
license: GPL-3.0+
GitHub Events
Total
- Issues event: 1
- Watch event: 1
- Member event: 1
- Push event: 11
- Fork event: 1
Last Year
- Issues event: 1
- Watch event: 1
- Member event: 1
- Push event: 11
- Fork event: 1
Dependencies
- pandas ==2.2.1
- psycopg2-binary ==2.9.9
- rdkit ==2023.9.5