https://github.com/bioinfomachinelearning/transpro

1D transformer for predicting protein structural features (secondary structure, solvent accessibility)

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

* ○ CITATION.cff file
* ○ codemeta.json file
* ○ .zenodo.json file
* ✓ DOI references: found 3 DOI reference(s) in README
* ✓ Academic publication links: links to zenodo.org
* ○ Academic email domains
* ○ Institutional organization owner
* ○ JOSS paper metadata
* ○ Scientific vocabulary similarity: low similarity (10.8%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: BioinfoMachineLearning
Language: Python
Default Branch: main
Size: 33.4 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 4 years ago · Last pushed almost 4 years ago

https://github.com/BioinfoMachineLearning/TransPro/blob/main/



# TransPross: 1D transformer for protein secondary structure prediction

![TransPross Architecture](https://github.com/BioinfoMachineLearning/TransPro/blob/main/img/TransPross_Architecture.png)



## Description
1D transformer for predicting protein structural features (secondary structure)


## Installation
```bash
git clone https://github.com/BioinfoMachineLearning/TransPro.git
cd TransPro
mkdir env
python3.6 -m venv env/ss_virenv
source env/ss_virenv/bin/activate
pip install --upgrade pip
pip install -r requirments.txt
```

## Training data
The training protein targets were extracted from structures deposited in the Protein Data Bank (PDB) before May 2019, with sequence identity < 90% and sequence length in the range [50, 500].
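
The length criterion above can be illustrated with a minimal Python sketch. It is not the authors' preprocessing code: the helper names `read_fasta` and `filter_by_length` are hypothetical, and the sequence-identity filter is not reproduced here because it needs an external clustering tool.

```python
from typing import Dict, Iterable

MIN_LEN, MAX_LEN = 50, 500  # length range stated in the README


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {target_id: sequence} (hypothetical helper)."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-separated token of the header as the id.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = ""
        elif current is not None:
            # Sequence data may be wrapped over several lines.
            records[current] += line
    return records


def filter_by_length(records: Dict[str, str]) -> Dict[str, str]:
    """Keep only sequences whose length lies in [MIN_LEN, MAX_LEN]."""
    return {k: v for k, v in records.items() if MIN_LEN <= len(v) <= MAX_LEN}
```

For example, `filter_by_length(read_fasta(open("targets.fasta")))` would keep only the in-range targets, where `targets.fasta` is a placeholder file name.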

All data required for training are listed below and available at [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6762376.svg)](https://doi.org/10.5281/zenodo.6762376):
* Protein sequences in fasta file (fasta.tar.gz)
* Target id list for training
* MSA in a3m file (a3m.tar.gz is too large, stored at /bml/TransPro/a3m.tar.gz)
* True ss labels in 3 states (ss_3.tar.gz)
* True 3D structures in pdb file (atom.tar.gz)
* 5 trained TransPross models (model.tar.gz)
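
For readers unfamiliar with the a3m files listed above, the following is a minimal sketch of reading one in Python. It relies on the a3m convention that lowercase letters mark insertions relative to the query, so dropping them leaves every sequence with the same number of columns. The function name `parse_a3m` is hypothetical, and this is not the loader used by TransPross.

```python
from typing import List, Tuple


def parse_a3m(text: str) -> List[Tuple[str, str]]:
    """Return (header, aligned_sequence) pairs from a3m-formatted text."""
    entries: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if any.
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Drop lowercase insertion states; keep match states and gaps.
            chunks.append("".join(c for c in line if not c.islower()))
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries
```

By convention the first entry of the file is the query sequence itself.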

## Testing data
All testing data for evaluation are listed below:
* CASP test sets (CASP13, CASP14)
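
As an illustration of how 3-state predictions could be scored on such a test set, here is a minimal sketch of per-residue accuracy (often called Q3). The README does not state which metric the authors report, so both the metric and the function name `q3_accuracy` are assumptions. The H/E/C label alphabet in the comments is assumed as well.

```python
def q3_accuracy(pred: str, true: str) -> float:
    """Fraction of residues whose predicted state matches the true state.

    Both arguments are strings over an assumed 3-state alphabet
    (for example H = helix, E = strand, C = coil).
    """
    if len(pred) != len(true):
        raise ValueError("prediction and label must have the same length")
    if not true:
        raise ValueError("empty sequences cannot be scored")
    matches = sum(p == t for p, t in zip(pred, true))
    return matches / len(true)
```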

## Training
```bash
python MSA_transformer2_train.py --model_num 1 --N 6 --max_positions 1500  --BATCH_SIZE 5 --data_dir  --dataset 

# model_num: training list model
# N: number of attention layers
# max_positions: maximum number of sequences allowed in the input MSA
# BATCH_SIZE: batch size
# data_dir: folder path for storing data
# dataset: training set name
```
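
The `max_positions` option described above caps the number of sequences in the input MSA. A minimal sketch of such a cap follows. It assumes that the first `max_positions` rows are kept, with the query first. The training script may select sequences differently, and `limit_msa_depth` is a hypothetical name.

```python
from typing import List, Sequence


def limit_msa_depth(msa: Sequence[str], max_positions: int) -> List[str]:
    """Keep at most max_positions sequences, preserving their order.

    Assumption: the query is the first row, so it always survives the cap.
    """
    if max_positions < 1:
        raise ValueError("max_positions must be at least 1")
    return list(msa[:max_positions])
```

With `--max_positions 1500` as in the command above, an MSA of 4000 sequences would be reduced to its first 1500 rows under this assumption.
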
## Inference
**Predicting with a single a3m file as input:**
```bash
python MSA_transformer2_predict_batch.py -i 
# Example:
python MSA_transformer2_predict_batch.py -i T1026.a3m
```
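
If several a3m files sit in one folder, the single-file command above can also be generated once per file. The sketch below only builds the command lines and uses nothing beyond the documented `-i` option. The helper name `build_predict_commands` is hypothetical, and the script is assumed to be in the current working directory.

```python
import sys
from pathlib import Path
from typing import List

SCRIPT = "MSA_transformer2_predict_batch.py"  # script name taken from the README


def build_predict_commands(a3m_dir: str) -> List[List[str]]:
    """Build one prediction command per .a3m file, sorted by file name."""
    return [
        [sys.executable, SCRIPT, "-i", str(path)]
        for path in sorted(Path(a3m_dir).glob("*.a3m"))
    ]
```

Each returned command can then be executed with `subprocess.run(cmd, check=True)` from the repository root.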

**Predicting multiple targets at once:**
```bash
python MSA_transformer2_predict_batch.py --data_dir  --dataset 

# To predict multiple targets, create a test.lst file at /data_dir/dataset/test.lst in the format:  length
# e.g. test/casp13/test.lst
#
# data_dir: folder path for storing data
# dataset: testing set name
```
