chembl-antimicrobial-tasks
Get antimicrobial tasks from ChEMBL framed as binary classifications
Repository
Basic Info
- Host: GitHub
- Owner: ersilia-os
- License: gpl-3.0
- Language: Jupyter Notebook
- Default Branch: main
- Size: 12.7 MB
Statistics
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
🦠 Antimicrobial binary ML tasks from ChEMBL 💊
Get antimicrobial tasks from ChEMBL framed as binary classifications. This repository is the updated version of chembl-binary-tasks.
This repository is currently WORK IN PROGRESS. ⚠️🚧
Setup 🛠️
To get started, first clone this repository, avoiding large LFS-stored files:
```sh
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/ersilia-os/chembl-antimicrobial-tasks.git
cd chembl-antimicrobial-tasks
```
We recommend creating a Conda environment to run this code. Dependencies are minimal. 🐍
```sh
conda create -n camt python=3.10
conda activate camt
pip install -r requirements.txt
```
Installing ChEMBL 🗃️
Access to a PostgreSQL database server containing the ChEMBL database is required. You may install ChEMBL on your own computer by following these instructions. To check that the PostgreSQL service with the ChEMBL database is up and accessible, run the following commands with your username, password and database name:
```sh
sudo service postgresql start
PGPASSWORD=YOUR_PASSWORD psql -h localhost -p 5432 -U YOUR_USERNAME -d YOUR_DB_NAME -c "\dt"
```
✅ If a List of relations is displayed, the checks have been successful! ⚠️ Make sure to adapt the variables CHEMBLUSR, CHEMBLPWD and DATABASENAME in `src/default_parameters.py` with your username, password and database name, respectively.
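The same check can be scripted from Python, which may be handy before launching long jobs. This is a minimal sketch, not part of the repository: the function names `psql_check_command` and `chembl_db_reachable` are hypothetical, and it assumes the `psql` client is on your `PATH`. It reuses the `PGPASSWORD` environment variable and the `\dt` meta-command from the shell command above.

```python
import os
import subprocess


def psql_check_command(user, dbname, host="localhost", port=5432):
    """Build the same psql invocation as the shell check (hypothetical helper)."""
    return ["psql", "-h", host, "-p", str(port), "-U", user, "-d", dbname, "-c", r"\dt"]


def chembl_db_reachable(user, password, dbname, host="localhost", port=5432):
    """Return True if psql connects and runs \\dt without error, False otherwise.

    Note: an exit code of 0 only shows that the connection works; it does not
    prove that the ChEMBL tables are present, so inspect the output if in doubt.
    """
    env = dict(os.environ, PGPASSWORD=password)  # same variable as in the shell command
    try:
        result = subprocess.run(
            psql_check_command(user, dbname, host, port),
            env=env,
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # psql is not installed, or the server did not answer in time
        return False
    return result.returncode == 0
```

Calling `chembl_db_reachable("YOUR_USERNAME", "YOUR_PASSWORD", "YOUR_DB_NAME")` returns a boolean you can use to stop early when the database is not available.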
Downloading configuration data ⚙️
Several configuration data files are needed before gathering and binarizing ChEMBL data, all of them documented here. You can pull such data using Git LFS:
```bash
git lfs pull --include="data"
```
Alternatively, we provide the code to generate these data. To do it, simply execute:
```sh
bash scripts/00_prepare_config.sh
```
This bash script runs 4 Python scripts in sequence, all described in detail in our documentation.
Specifying parameters 🧾
We set many parameters to process and binarize ChEMBL bioactivity data, all of which are defined in src/default_parameters.py.
The following scripts assume that PostgreSQL is running locally, with the username, password, and database name configured in that same file. Parameters for binarization are specified there as well.
Creating datasets 🔍
The primary goal of this repository is to automatically get antimicrobial tasks from ChEMBL framed as binary classifications. To do so, for each pathogen of interest, execute:
```bash
bash scripts/01_fetch_pathogen_data_from_chembl.sh --pathogen_code YOUR_PATHOGEN_CODE --output_dir YOUR_OUTPUT_DIR
```
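Because the command is run once per pathogen, a small driver can help when several pathogens are of interest. The following is a sketch under stated assumptions: `build_fetch_commands` is a hypothetical helper, `CODE_A` and `CODE_B` are placeholders rather than real pathogen codes, and the README does not say whether several pathogens can share one output directory, so adjust `output_dir` as needed. Only the script path and the two flags come from the command above.

```python
import shlex

# Script path and flags are taken from the command shown above.
SCRIPT = "scripts/01_fetch_pathogen_data_from_chembl.sh"


def build_fetch_commands(pathogen_codes, output_dir):
    """Return one argument list per pathogen code (hypothetical helper)."""
    return [
        ["bash", SCRIPT, "--pathogen_code", code, "--output_dir", output_dir]
        for code in pathogen_codes
    ]


if __name__ == "__main__":
    # Placeholder codes: replace them with the codes you actually need.
    for cmd in build_fetch_commands(["CODE_A", "CODE_B"], "YOUR_OUTPUT_DIR"):
        print(shlex.join(cmd))  # dry run: only prints the command
        # To execute instead of printing, from the repository root:
        # import subprocess; subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to review them before committing to a long ChEMBL query.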
Note that available pathogen codes are listed in data/pathogens.csv, which can be edited manually. The bash script consecutively executes 6 Python scripts, briefly described as follows:
- 011_pathogen_getter.py: Retrieves pathogen-related bioactivity data from the ChEMBL database, processes and filters the data, and saves it into structured CSV files for further analysis.
- 012_clean_fetched_pathogen_data.py: Reads raw data, applies unit conversions, standardizes activity values, filters relevant information, computes pChEMBL values, and outputs a cleaned dataset in CSV format for further analysis.
- 013a_binarize_fetched_pathogen_data_ORG.py: Processes phenotypic-based pathogen assay data and organizes it into datasets that are binarized using different criteria for machine learning models (e.g. pChEMBL, %inhibition, etc). Datasets may correspond to specific assays or targets (i.e. the organism itself), global pChEMBL values, % of activity or comprehensive percentiles (sorted by priority). Datasets are created with six different strategies:
1. Compounds grouped by assays: fixed assay ID. If the assay has multiple activity types, it's split into several datasets.
2. Compounds grouped by targets: fixed target ID and activity type and units. Assays may differ.
3. Compounds grouped by pChEMBL: assumes the target ID is fixed (i.e. the organism) and integrates all pChEMBL data.
4. Compounds grouped by percentage: assumes the target ID is fixed (i.e. the organism) and integrates all percentage data.
5. Compounds grouped by percentiles: fixed target ID (i.e. the organism) - integrates percentile data taking all units into account.
6. Compounds grouped by activity labels: assumes the target ID is fixed (i.e. the organism) and integrates data using the corresponding activity flag.
Datasets are binarized following 4 different approaches:
1. pChEMBL cut-offs
2. pChEMBL percentiles
3. Percentage cut-offs
4. Percentage percentiles
Datasets not satisfying the requirements specified in `src/default_parameters.py` or having a proportion of positives > 0.5 are discarded and not reported.
- 013b_binarize_fetched_pathogen_data_SP.py: Processes single protein-based pathogen assay data (both "Binding" and "Functional", separately) and organizes it into datasets that are binarized using different criteria for machine learning models (e.g. pChEMBL, %inhibition, etc). Datasets may correspond to specific assays or targets (e.g. a given protein), global pChEMBL values against a specific protein, % of activity or comprehensive percentiles (sorted by priority). For further information on dataset creation and binarization, please see the previous point. IMPORTANT: in this step, strategies (1) and (2) are analogous to those in 013a_binarize_fetched_pathogen_data_ORG.py. However, strategies (3), (4), (5) and (6) have been adapted to report results in a target-centric manner (i.e. targets are no longer full organisms but single proteins).
- 014_datasets_modelability.py: Computes molecular fingerprints, trains a Random Forest classifier using stratified cross-validation, and evaluates dataset modelability by calculating AUROC scores for each task (i.e. discriminating active compounds from inactives). Additionally, stores a Random Forest classifier for each task, trained with all task data.
- 015_datasets_distinguishability.py: Analogous to dataset modelability, but negative compounds are randomly sampled from ChEMBL. Additionally, stores a Random Forest classifier for each task, trained with all task data.
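To make the binarization step more concrete, here is a minimal, self-contained sketch of cut-off and percentile binarization, plus the rule that discards tasks with more than 50% positives. It is an illustration and not the repository's code: all function names are hypothetical, the cut-off of 6.0 and the `min_positives` requirement are assumed values (the real ones live in `src/default_parameters.py`), and the sketch assumes that higher values mean more active compounds. The pChEMBL conversion uses the standard definition, the negative base-10 logarithm of the molar concentration.

```python
import math


def pchembl_from_nm(value_nm):
    """pChEMBL-style value: -log10 of the molar concentration (input in nM)."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    return 9.0 - math.log10(value_nm)  # 1 nM = 1e-9 M


def binarize_by_cutoff(values, cutoff):
    """Label 1 (active) when the value is at or above the cut-off, else 0."""
    return [1 if v >= cutoff else 0 for v in values]


def binarize_by_percentile(values, percentile):
    """Label 1 for values at or above the given percentile (nearest-rank)."""
    if not values:
        return []
    ranked = sorted(values)
    rank = max(1, math.ceil(percentile * len(ranked) / 100))
    return binarize_by_cutoff(values, ranked[rank - 1])


def keep_task(labels, min_positives=1, max_positive_ratio=0.5):
    """Discard tasks with too few positives (assumed requirement) or > 50% positives."""
    if not labels:
        return False
    positives = sum(labels)
    if positives < min_positives:
        return False
    return positives / len(labels) <= max_positive_ratio


# Usage with made-up activity values in nM and an assumed cut-off of 6.0
pchembl_values = [pchembl_from_nm(v) for v in (50.0, 800.0, 20000.0, 150000.0)]
labels = binarize_by_cutoff(pchembl_values, cutoff=6.0)
print(labels, keep_task(labels))
```

The same two binarization functions apply to percentage data (e.g. % inhibition), which is how the four approaches listed above reduce to two operations on two kinds of values.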
Output 📊
Many files will be generated when creating the ChEMBL tasks/datasets. Overall, the most important files are:
- 011_{YOUR_PATHOGEN_CODE}_original.csv: Compounds extracted from ChEMBL and associated to the pathogen of interest. Includes compound information, bioactivity data, assay details, and related metadata. Each row corresponds to a given bioactivity measurement.
- 011_{YOUR_PATHOGEN_CODE}_cleaned.csv: A cleaned and processed version of the original dataset. Includes pChEMBL values, %Inhibition, etc.
- 013a_raw_tasks_ORG_summary.csv: Raw list of phenotypic-based tasks (datasets) created for the pathogen of interest.
- 013a_raw_tasks_ORG directory: For each phenotypic-based task (dataset), list of compounds and associated binarized bioactivities.
- 013b_raw_tasks_SP_summary_B.csv: Raw list of target-based (binding) tasks (datasets) created for the pathogen of interest.
- 013b_raw_tasks_SP_summary_F.csv: Raw list of target-based (functional) tasks (datasets) created for the pathogen of interest.
- 013a_raw_tasks_SP directory: For each target-based (both binding and functional) task (dataset), list of compounds and associated binarized bioactivities.
- 014_modelability.csv: Modelability for each task. Includes AUROC scores to evaluate how well a binary classification model discriminates actives from inactives. Higher AUROCs indicate higher modelability.
- 014_models_MOD.csv: For each task, performance of a binary classification model trained and tested on the full task data.
- 014_models_MOD directory: For each task, joblib file including the binary classification model mentioned in the immediately preceding file.
- 015_distinguishability.csv: Distinguishability for each task. Includes AUROC scores to evaluate how well a binary classification model using randomly sampled ChEMBL compounds as inactives discriminates actives from inactives. Higher AUROCs indicate higher distinguishability.
- 015_models_DIS.csv: For each task, performance of a binary classification model trained and tested on the full task data (negatives are randomly sampled from ChEMBL compounds).
- 015_models_DIS directory: For each task, joblib file including the binary classification model mentioned in the immediately preceding file.
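For readers unfamiliar with the AUROC scores reported in the modelability and distinguishability files, the sketch below shows what the metric measures: the probability that a randomly chosen active is scored higher than a randomly chosen inactive, with ties counting as one half. It is a plain-Python illustration with a hypothetical function name `auroc`. It is not the repository's implementation, which trains Random Forest classifiers with stratified cross-validation, and it covers only the metric itself.

```python
def auroc(labels, scores):
    """Pairwise (rank-based) AUROC for binary labels (1 = active, 0 = inactive)."""
    actives = [s for y, s in zip(labels, scores) if y == 1]
    inactives = [s for y, s in zip(labels, scores) if y == 0]
    if not actives or not inactives:
        raise ValueError("AUROC needs at least one active and one inactive")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0  # active ranked above inactive
            elif a == i:
                wins += 0.5  # ties count as half a win
    return wins / (len(actives) * len(inactives))


# Made-up scores: perfect separation gives 1.0, uninformative scores give 0.5
print(auroc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # 1.0
print(auroc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))  # 0.5
```

An AUROC near 0.5 therefore suggests that the task is hard to model (or that actives are hard to tell apart from random ChEMBL compounds), while values close to 1.0 suggest clear separation.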
TL;DR 🚩
Bla bla
About the Ersilia Open Source Initiative 🌍🤝
This repository is developed by the Ersilia Open Source Initiative. Ersilia develops AI/ML tools to support drug discovery research in the Global South. To learn more about us, please visit our GitBook Documentation and our GitHub profile.
Owner
- Name: Ersilia Open Source Initiative
- Login: ersilia-os
- Kind: organization
- Email: hello@ersilia.io
- Location: United Kingdom
- Website: ersilia.io
- Twitter: ersiliaio
- Repositories: 64
- Profile: https://github.com/ersilia-os
Ersilia is a charity developing open source tools to facilitate global health drug discovery, with a focus on neglected diseases, for equal healthcare
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: ChEMBL Binary Tasks
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Marcos
    family-names: de la Torre
    email: marcostorrework@gmail.com
    affiliation: Ersilia Open Source Initiative
  - given-names: Gemma
    family-names: Turon
    email: gemma@ersilia.io
    affiliation: Ersilia Open Source Initiative
    orcid: 'https://orcid.org/0000-0001-6798-0275'
  - given-names: Miquel
    family-names: Duran-Frigola
    email: miquel@ersilia.io
    affiliation: Ersilia Open Source Initiative
    orcid: 'https://orcid.org/0000-0002-9906-6936'
repository-code: 'https://github.com/ersilia-os/chembl-binary-tasks/'
license: GPL-3.0+
GitHub Events
Total
- Issues event: 5
- Watch event: 2
- Issue comment event: 6
- Member event: 1
- Push event: 76
- Create event: 2
Last Year
- Issues event: 5
- Watch event: 2
- Issue comment event: 6
- Member event: 1
- Push event: 76
- Create event: 2