rxnaamapper

Reaction SMILES-AA mapping via language modelling

https://github.com/rxn4chemistry/rxnaamapper

Science Score: 18.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.0%) to scientific vocabulary

Last synced: 10 months ago

Repository


Basic Info

Host: GitHub
Owner: rxn4chemistry
License: MIT
Language: Python
Default Branch: main
Size: 1.16 MB

Statistics

Stars: 29
Watchers: 5
Forks: 3
Open Issues: 0
Releases: 0

Created over 4 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License Citation

RXNAAMapper

RXNAAMapper is a tool designed to identify binding sites in protein sequences by leveraging language models trained on biochemical reactions. The tool can capture the signal characterizing amino acid (AA) binding sites using linguistic representations for proteins and their molecular substrates, performing unsupervised binding site prediction from protein sequences and reaction SMILES.

setup

To set up the environment, use the following commands:

```console
conda env create -f conda.yml
conda activate rxn_aa_mapper
```

The sections below use the provided examples to show how to use RXNAAMapper.

generate a vocabulary to be used with the `EnzymaticReactionBertTokenizer`

Create a vocabulary compatible with the enzymatic reaction tokenizer:

```console
create-enzymatic-reaction-vocabulary ./examples/data-samples/biochemical ./examples/token_75K_min_600_max_750_500K.json /tmp/vocabulary.txt "*.csv"
```

use the tokenizer

The example below shows how to use the `LMEnzymaticReactionTokenizer` with a vocabulary file and the AA sequence tokenizer file:

```python
from rxn_aa_mapper.tokenization import LMEnzymaticReactionTokenizer

tokenizer = LMEnzymaticReactionTokenizer(
    vocabulary_file="./examples/vocabulary_token_75K_min_600_max_750_500K.txt",
    aa_sequence_tokenizer_filepath="./examples/token_75K_min_600_max_750_500K.json",
    aa_sequence_tokenizer_type="generic",
)
# Input format: reactant SMILES, "|", the AA sequence, ">>", product SMILES.
tokenizer.tokenize("NC(=O)c1cccn+c1.O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDGIGPEISAAVMKIFDAAKAPIQANVRPCVSIEGYKFNEMYLDTVCLNIETACFATIKCSDFTEEICREVAENCKDIK>>O=C([O-])CCC(=O)C(=O)[O-]")
```
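The string passed to the tokenizer packs three parts into one line: reactant SMILES, the amino acid sequence after `|`, and product SMILES after `>>`. The sketch below splits such a string with plain Python. The helper `split_enzymatic_reaction` and the `EnzymaticReaction` type are hypothetical names introduced for illustration and are not part of the package; the toy reaction is made up.

```python
from typing import List, NamedTuple


class EnzymaticReaction(NamedTuple):
    reactants: List[str]  # reactant SMILES, one entry per molecule
    aa_sequence: str      # enzyme amino acid sequence
    products: List[str]   # product SMILES, one entry per molecule


def split_enzymatic_reaction(rxn: str) -> EnzymaticReaction:
    """Split 'reactants|aa_sequence>>products' into its three parts.

    Hypothetical helper: it only mirrors the layout of the example
    string and does not validate the SMILES or the sequence.
    """
    precursors, products = rxn.split(">>")
    reactants, aa_sequence = precursors.split("|")
    # "." separates individual molecules inside a SMILES string.
    return EnzymaticReaction(reactants.split("."), aa_sequence, products.split("."))


# Toy input, not taken from the example data.
example = split_enzymatic_reaction("CCO.O|MKTAYIAK>>CC=O")
print(example.aa_sequence)  # MKTAYIAK
```

Splitting the string this way can help when checking that every row of a dataset carries both a sequence and a product side before tokenization.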

train the model

We use WandB for logging. If you don't have a WandB mode configured, you can switch it to offline logging by setting:

```console
export WANDB_MODE=offline
```

The mlm-trainer script can be used to train a model via MTL:

```console
# The biochemical and organic folders below are just sample train and
# validation folders (train first, then validation).
# For a more realistic config see ./examples/config.json.
mlm-trainer \
    ./examples/data-samples/biochemical \
    ./examples/data-samples/biochemical \
    ./examples/vocabulary_token_75K_min_600_max_750_500K.txt \
    /tmp/mlm-trainer-log \
    ./examples/sample-config.json \
    "*.csv" \
    1 \
    ./examples/data-samples/organic \
    ./examples/data-samples/organic \
    ./examples/token_75K_min_600_max_750_500K.json \
    "generic"
```

Checkpoints are stored in /tmp/mlm-trainer-log and are used later to identify active sites.
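To pick a checkpoint from the log folder, something like the following may help. It is a minimal sketch that assumes checkpoints carry a `.ckpt` extension and may sit in subfolders of the log directory; `latest_checkpoint` is a hypothetical helper, not a function of the package.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(log_dir: str) -> Optional[Path]:
    """Return the most recently modified *.ckpt file under log_dir, or None.

    Assumption: checkpoints end in ".ckpt" and may be nested in subfolders.
    """
    checkpoints = list(Path(log_dir).rglob("*.ckpt"))
    if not checkpoints:
        return None
    # Modification time is used as a proxy for "latest".
    return max(checkpoints, key=lambda p: p.stat().st_mtime)


print(latest_checkpoint("/tmp/mlm-trainer-log"))
```

Modification time is only a proxy: the most recent checkpoint is not necessarily the best one by validation loss.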

These checkpoints can be converted into a HuggingFace model with:

```console
checkpoint-to-hf-model /path/to/model.ckpt /tmp/rxnaamapper-pretrained-model ./examples/vocabulary_token_75K_min_600_max_750_500K.txt ./examples/sample-config.json ./examples/token_75K_min_600_max_750_500K.json
```
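The prediction step expects the converted model folder to contain both the model binary file and a config.json, so a quick check after conversion can catch a failed export early. The sketch below is a hypothetical helper, not part of the package. The weight file names it looks for (`pytorch_model.bin`, `model.safetensors`) are common HuggingFace conventions and are an assumption here, since the source does not name the binary file.

```python
from pathlib import Path
from typing import List

# Assumed weight file names (common HuggingFace conventions, not confirmed by the source).
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")


def missing_model_files(model_dir: str) -> List[str]:
    """List what is missing from a converted model directory (empty list if complete)."""
    root = Path(model_dir)
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        missing.append("model binary")
    return missing


print(missing_model_files("/tmp/rxnaamapper-pretrained-model"))
```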

predict active site

Once trained, the RXNAAMapper model can predict reactant atoms and map them to AA sequence locations, indicating potential binding sites:

```python
from rxn_aa_mapper.aa_mapper import RXNAAMapper

config_mapper = {
    "vocabulary_file": "./examples/vocabulary_token_75K_min_600_max_750_500K.txt",
    "aa_sequence_tokenizer_filepath": "./examples/token_75K_min_600_max_750_500K.json",
    "aa_sequence_tokenizer_type": "generic",
    "model_path": "/tmp/rxnaamapper-pretrained-model",
    "head": 3,
    "layers": [11],
    "top_k": 1,
}
mapper = RXNAAMapper(config=config_mapper)
mapper.get_reactant_aa_sequence_attention_guided_maps(["NC(=O)c1cccn+c1.O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDGIGPEISAAVMKIFDAAKAPIQANVRPCVSIEGYKFNEMYLDTVCLNIETACFATIKCSDFTEEICREVAENCKDIK>>O=C([O-])CCC(=O)C(=O)[O-]"])
```

NOTE: The model path should contain both the model binary file and the config.json. These files are produced when a trained checkpoint is converted to a HuggingFace model with the script from the previous section.
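The configuration names an attention head, a list of layers, and a top-k value. One plausible reading, which the source does not spell out, is that each reactant token is linked to the AA sequence tokens receiving the highest attention from it. The sketch below illustrates only that top-k selection on a toy attention matrix in plain Python. It is not the library's implementation, and `top_k_aa_positions` is a hypothetical name.

```python
from typing import Dict, List


def top_k_aa_positions(attention: List[List[float]], top_k: int = 1) -> Dict[int, List[int]]:
    """For each reactant token (row), return the column indices of the
    top_k AA tokens with the highest attention weight.

    Illustration only: rows are reactant tokens, columns are AA tokens.
    """
    mapping = {}
    for reactant_idx, row in enumerate(attention):
        # Rank AA token indices by attention weight, highest first.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        mapping[reactant_idx] = ranked[:top_k]
    return mapping


# Toy attention weights: 2 reactant tokens x 3 AA tokens.
toy_attention = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
]
print(top_k_aa_positions(toy_attention, top_k=1))  # {0: [1], 1: [0]}
```

Under this reading, a larger top-k would return more candidate positions per reactant token at the cost of more false positives.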

citation

```bib
@article{teukam2024language,
  title={Language models can identify enzymatic binding sites in protein sequences},
  author={Teukam, Yves Gaetan Nana and Dassi, Lo{\"\i}c Kwate and Manica, Matteo and Probst, Daniel and Schwaller, Philippe and Laino, Teodoro},
  journal={Computational and Structural Biotechnology Journal},
  volume={23},
  pages={1929--1937},
  year={2024},
  publisher={Elsevier}
}
```

Owner

Name: rxn4chemistry
Login: rxn4chemistry
Kind: organization

Repositories: 14
Profile: https://github.com/rxn4chemistry

Dependencies

dev_requirements.txt pypi

black ==20.8b1 development
flake8 ==3.8.4 development
isort ==5.10.1 development
mypy ==0.800 development

requirements.txt pypi

biopython ==1.77
click ==8.0.1
loguru ==0.5.3
numpy >=1.19.1
pandas >=0.2.4
pytorch-lightning >=1.3
rdkit-pypi ==2021.9.2.1
scipy >=1.4.1
statsmodels >=0.12.2
torch >=1.0
transformers >=4.5.1
wandb >=0.10.30
xmltodict >=0.12.0
