https://github.com/dptech-corp/uni-pka

The official repository of Uni-pKa


Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: acs.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.2%) to scientific vocabulary
Last synced: 4 months ago

Repository

The official repository of Uni-pKa

Basic Info
  • Host: GitHub
  • Owner: dptech-corp
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 3.5 MB
Statistics
  • Stars: 48
  • Watchers: 2
  • Forks: 8
  • Open Issues: 5
  • Releases: 0
Created over 2 years ago · Last pushed 11 months ago
Metadata Files
Readme License

README.md

Uni-pKa

The official implementation of the Uni-pKa model from the paper *Bridging Machine Learning and Thermodynamics for Accurate pKa Prediction*.

Interactive demo with available model weights at https://bohrium.dp.tech/notebooks/38543442597

Published paper at [JACS Au] | Relevant preprint at [ChemRxiv] | Small molecule protonation state ranking demo at [Bohrium App] | Full datasets at [AISSquare]

This machine-learning-based pKa prediction model achieves state-of-the-art accuracy on several drug-like small-molecule macro-pKa datasets.

(Figure: Uni-pKa's performance)

The two core components of the Uni-pKa framework are:

  • A microstate enumerator to systematically build the protonation ensemble from a single structure.

  • A molecular machine learning model to predict the free energy of each single structure.

The model reaches the expected accuracy at inference after comprehensive data preparation by the enumerator, pretraining on the ChEMBL dataset, and finetuning on our Dwar-iBond dataset.


Microstate Enumerator

Introduction

It uses an iterative template-matching algorithm to enumerate all microstates in the adjacent macrostates of a molecule's protonation ensemble, starting from at least one microstate stored as SMILES.

The protonation template `smarts_pattern.tsv` modifies and augments the one from the paper *MolGpKa: A Web Server for Small Molecule pKa Prediction Using a Graph-Convolutional Neural Network* and its open-source implementation (MIT license) in the GitHub repository MolGpKa.
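
The template-matching idea can be sketched as follows. This is a toy stand-in using regular expressions on SMILES strings; the real enumerator matches RDKit SMARTS patterns from `smarts_pattern.tsv`, and the two rules here are illustrative assumptions, not entries from that file.

```python
import re

# Toy stand-in for the SMARTS template table: each rule maps a protonated
# substructure pattern to its deprotonated form.
TEMPLATES = [
    (re.compile(r"C\(=O\)O(?!-)"), "C(=O)[O-]"),  # carboxylic acid -> carboxylate
    (re.compile(r"\[NH3\+\]"), "N"),              # ammonium -> amine
]

def deprotonated_microstates(smiles):
    # Apply every template at every match position, one site at a time,
    # collecting the distinct microstates of the adjacent macrostate.
    out = set()
    for patt, repl in TEMPLATES:
        for m in patt.finditer(smiles):
            out.add(smiles[:m.start()] + repl + smiles[m.end():])
    return sorted(out)

print(deprotonated_microstates("CC(=O)O"))  # ['CC(=O)[O-]']
```

Iterating this step (and its protonation counterpart) over each newly generated microstate is what walks the enumerator through adjacent macrostates.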

Usage

The recommended environment is:

```yaml
python = 3.8.13
rdkit = 2021.09.5
numpy = 1.20.3
pandas = 1.5.2
```

Reconstruct a plain pKa dataset to the Uni-pKa standard macro-pKa format with fully enumerated microstates

```shell
cd enumerator
python main.py reconstruct -i <input> -o <output> -m <mode>
```

The <input> dataset is assumed to be a CSV-like file with a column storing SMILES. Two cases are allowed for each entry in the dataset.

  1. It contains only one SMILES. The Enumerator helps to build the protonated/deprotonated macrostate and complete the original macrostate.
    • When <mode> is "A", it will be considered as an acid (thrown into A pool).
    • When <mode> is "B", it will be considered as a base (thrown into B pool).
  2. It contains a string like "A1,...,Am>>B1,...,Bn", where A1,...,Am are comma-separated SMILES of microstates in the acid macrostate (all thrown into the A pool), and B1,...,Bn are comma-separated SMILES of microstates in the base macrostate (all thrown into the B pool). The Enumerator helps to complete both.

A/B mode of the microstate enumerator

The <mode> "A" (default) or "B" determines which pool (A/B) is the reference structures and the starting point of the enumeration.

The <output> dataset is then constructed after the enumeration.
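
The two entry formats above can be parsed with a few lines of Python. This is a hypothetical helper for illustration, not code from the repository:

```python
# Parse one dataset entry: either a bare SMILES (assigned to the pool chosen
# by mode) or an "A1,...,Am>>B1,...,Bn" string with explicit acid/base pools.
def parse_entry(entry, mode="A"):
    if ">>" in entry:
        acids, bases = entry.split(">>")
        return acids.split(","), bases.split(",")
    # Single-SMILES case: mode "A" -> acid pool, mode "B" -> base pool.
    return ([entry], []) if mode == "A" else ([], [entry])

print(parse_entry("CC(=O)O>>CC(=O)[O-]"))  # (['CC(=O)O'], ['CC(=O)[O-]'])
```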

Build protonation ensembles from single molecules

Example:

```shell
cd enumerator
python main.py ensemble -i ../dataset/sampl6.tsv -o example_out.tsv -u 2 -l -2 -t simple_smarts_pattern.tsv
```

The example input is the SAMPL6 dataset. A reconstructed pKa dataset, or any molecular dataset with a "SMILES" column containing single-molecule SMILES, is supported as input. The output file (here example_out.tsv) contains the original SMILES and the macrostates with total charge between the upper bound set by -u (default +2) and the lower bound set by -l (default -2). A simpler template, simple_smarts_pattern.tsv, is provided for cleaner protonation ensembles; it discards some structural motifs that are rare in aqueous solution.
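
The charge-bounded ensemble construction can be sketched abstractly. Here a microstate is modeled as the set of currently protonated sites and each extra proton is assumed to add +1 charge; the real enumerator derives sites and charges from SMARTS templates, so this is a simplified illustration only:

```python
from itertools import combinations

# Build the protonation ensemble within the charge bounds set by -u/-l.
# sites: names of protonatable sites; n_protons_neutral: how many sites
# are protonated in the neutral microstate (a toy bookkeeping assumption).
def build_ensemble(sites, n_protons_neutral, lower=-2, upper=2):
    ensemble = {}
    for k in range(len(sites) + 1):
        charge = k - n_protons_neutral  # toy model: each extra proton adds +1
        if lower <= charge <= upper:
            # all ways to place k protons on the available sites
            ensemble[charge] = [frozenset(c) for c in combinations(sorted(sites), k)]
    return ensemble
```

For a molecule with sites `{"N1", "O1"}` and one proton at neutrality, this yields macrostates of charge -1, 0, and +1, with two microstates in the neutral macrostate.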

Machine Learning Model

Introduction

It is a Uni-Mol-based neural network. By embedding the neural network into the thermodynamic relationship between free energy and pKa throughout the training and inference stages, the framework preserves physical consistency and adapts to multiple tasks.
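
The free-energy-to-pKa relationship can be illustrated numerically. In this sketch, each macrostate's free energy is the Boltzmann aggregate of its microstate free energies, and the macro-pKa is proportional to the deprotonation free-energy difference; the unit convention and the absorption of the constant proton term into the model's energies are assumptions for illustration, not a statement of the paper's exact parameterization:

```python
import math

RT = 0.593          # kcal/mol at ~298 K (assumed unit convention)
LN10 = math.log(10)

def ensemble_free_energy(energies):
    # G = -RT * ln(sum_i exp(-G_i / RT)), computed with a stable shift.
    m = min(energies)
    return m - RT * math.log(sum(math.exp(-(g - m) / RT) for g in energies))

def macro_pka(acid_energies, base_energies):
    # Macro-pKa from the free-energy gap between base and acid ensembles;
    # any constant proton term is assumed absorbed into the model's energies.
    return (ensemble_free_energy(base_energies)
            - ensemble_free_energy(acid_energies)) / (RT * LN10)
```

Note that duplicating every microstate energy shifts both ensembles equally and leaves the pKa unchanged, which is the physical consistency the framework preserves.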


Usage

Dependencies

The dependencies of Uni-pKa are the same as those of Uni-Mol.

The recommended environment is the docker image:

```shell
docker pull dptechnology/unimol:latest-pytorch1.11.0-cuda11.3
```

See details in Uni-Mol repository.

Ready-to-run training workflow

Data

The raw data can be downloaded from [AISSquare].

Pretrain with ChEMBL

First, preprocess the ChEMBL training and validation sets, and then pretrain the model:

```bash
# Preprocess training set
python ./scripts/preprocess_pka.py --raw-csv-file Datasets/tsv/chembl_train.tsv --processed-lmdb-dir chembl --task-name train

# Preprocess validation set
python ./scripts/preprocess_pka.py --raw-csv-file Datasets/tsv/chembl_valid.tsv --processed-lmdb-dir chembl --task-name valid

# Copy the necessary dict file
cp -r unimol/examples/* chembl

# Pretrain the model
bash pretrain_pka.sh
```

Note: The `head_name` in the subsequent scripts must match the `task_name` in `pretrain_pka.sh`.

Finetune with dwar-iBond

Next, preprocess the dwar-iBond dataset and finetune the model:

```bash
# Preprocess
python ./scripts/preprocess_pka.py --raw-csv-file Datasets/tsv/dwar-iBond.tsv --processed-lmdb-dir dwar --task-name dwar-iBond

# Copy the necessary dict file
cp -r unimol/examples/* dwar

# Finetune the model
bash finetune_pka.sh
```

Infer pKa

Infer with the finetuned model, taking novartis_acid as an example:

```bash
# Preprocess
python ./scripts/preprocess_pka.py --raw-csv-file Datasets/tsv/novartis_acid.tsv --processed-lmdb-dir novartis_acid --task-name novartis_acid

# Copy the necessary examples from unimol
cp -r unimol/examples/* novartis_acid

# Run inference
bash infer_pka.sh
```

To test with other external test datasets, it may be necessary to modify `data_path`, `infer_task`, and `results_path` in `infer_pka.sh`.

Obtain the result files and calculate the metrics

After inference, extract the results to CSV files and calculate the performance metrics (e.g., MAE, RMSE) on the results:

```bash
python ./scripts/infer_mean_ensemble.py --task pka --nfolds 5 --results-path novartis_acid_results
```

The metrics are calculated using the average of the 5-fold model predictions.
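
The mean-ensemble computation amounts to averaging the per-fold predictions and scoring the average. This is a hypothetical illustration of that arithmetic, not the implementation in `infer_mean_ensemble.py`:

```python
import math

# Average predictions across folds, then compute MAE and RMSE vs. labels.
def mean_ensemble_metrics(fold_preds, labels):
    n = len(labels)
    mean_pred = [sum(f[i] for f in fold_preds) / len(fold_preds) for i in range(n)]
    mae = sum(abs(p - y) for p, y in zip(mean_pred, labels)) / n
    rmse = math.sqrt(sum((p - y) ** 2 for p, y in zip(mean_pred, labels)) / n)
    return mae, rmse
```

Averaging before scoring means individual fold errors can cancel, which is why the ensemble typically outperforms any single fold.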

Owner

  • Name: DP Technology
  • Login: dptech-corp
  • Kind: organization
  • Location: China

GitHub Events

Total
  • Issues event: 2
  • Watch event: 38
  • Issue comment event: 11
  • Push event: 3
  • Pull request event: 4
  • Fork event: 10
Last Year
  • Issues event: 2
  • Watch event: 38
  • Issue comment event: 11
  • Push event: 3
  • Pull request event: 4
  • Fork event: 10