https://github.com/bzhanglab/autort

AutoRT: Peptide retention time prediction using deep learning

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
    Links to: springer.com, nature.com
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.8%) to scientific vocabulary

Keywords

deep-learning lc-msms peptide-retention-time proteomics
Last synced: 5 months ago

Repository

AutoRT: Peptide retention time prediction using deep learning

Basic Info
  • Host: GitHub
  • Owner: bzhanglab
  • License: GPL-3.0
  • Language: Jupyter Notebook
  • Default Branch: master
  • Homepage:
  • Size: 1.3 GB
Statistics
  • Stars: 17
  • Watchers: 5
  • Forks: 6
  • Open Issues: 2
  • Releases: 1
Topics
deep-learning lc-msms peptide-retention-time proteomics
Created over 6 years ago · Last pushed over 2 years ago
Metadata Files
Readme License

README.md

AutoRT

AutoRT is a peptide retention time prediction tool using deep learning. It supports retention time prediction for tryptic peptides (global proteome experiments), MHC-bound peptides (immunopeptidomics experiments), and PTM peptides (e.g., phosphoproteomics, ubiquitylome, or acetylome experiments).


Performance of AutoRT

  1. The performance of models trained using more than 100,000 peptides: Figure 1a; Supplementary Figure 1.
  2. The performance of models trained using 700-10,000 peptides with a transfer learning strategy: Figure 1b; Supplementary Figure 3; Supplementary Figures 8-10.
  3. The performance of models trained using label-free, iTRAQ, or TMT data: Figure 1a; Figure 1b; Supplementary Figure 3; Supplementary Figures 8-10.
  4. The performance of models trained using immunopeptidomics data: Supplementary Figure 6.
  5. The performance of models trained using public iRT data: Figure 1a; Supplementary Figure 1.

Installation

Download AutoRT

```shell
$ git clone https://github.com/bzhanglab/AutoRT
```

Installation

AutoRT is a Python 3 package. TensorFlow (>=2.6) is supported. Its dependencies can be installed via:

```shell
$ pip install -r requirements.txt
```

AutoRT has been tested on both Linux and Windows systems. It supports training and prediction on both CPU and GPU, but GPU is recommended for model training. Multiple GPUs are also supported.
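Before launching a long training run, it can be worth confirming that TensorFlow actually sees a GPU. Below is a minimal sketch using the standard TensorFlow device query:

```python
# Check that TensorFlow (>= 2.6, as required above) detects a GPU.
# An empty list means training would fall back to the CPU.
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))
```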

AutoRT can also be used through docker:

```shell
$ docker pull proteomics/autort
```

Usage

Training and prediction

AutoRT supports training models from scratch as well as transfer learning. We recommend training models on a GPU.

Training:

```shell
$ python autort.py train -h
optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input data for training
  -o OUT_DIR, --out_dir OUT_DIR
                        Output directory
  -e EPOCHS, --epochs EPOCHS
                        The number of epochs, default is 40.
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Batch size for training, default is 64.
  -r2 MAX_RT, --max_rt MAX_RT
                        The maximum retention time. If the value is 0
                        (default), the maximum retention time will be
                        automatically inferred from the input training data.
  -l MAX_LENGTH, --max_length MAX_LENGTH
                        The length of the longest peptide to consider for
                        modeling. If the value is 0 (default), it will be
                        automatically inferred from the input model file.
  -u UNIT, --unit UNIT  The unit of retention time in the training data,
                        s: second, m: minute (default).
  -sm SCALE_METHOD, --scale_method SCALE_METHOD
                        Scaling method for RT transformation: min_max
                        (default), mean_std or single_factor. The default
                        method works well in most cases and should not be
                        changed unless users know the meaning of these
                        methods well.
  -sf SCALE_FACTOR, --scale_factor SCALE_FACTOR
                        Only used when 'single_factor' is set for '-sm'.
  -m MODEL_FILE, --model_file MODEL_FILE
                        Trained model file. Only used for transfer learning
                        and RT prediction.
  -r, --add_reverse     Add the reversed peptide in peptide encoding. This
                        parameter will be removed in a future version.
  -n EARLY_STOP_PATIENCE, --early_stop_patience EARLY_STOP_PATIENCE
                        Number of epochs with no improvement after which
                        training will be stopped.
  -rlr, --add_ReduceLROnPlateau
                        Reduce the learning rate when a metric has stopped
                        improving.
  -g GPU, --gpu GPU     Set the GPU IDs that can be used, for example: 1,2,3.
  -d, --do_evaluation_after_each_epoch
                        Do evaluation after each epoch during model training.
  -f OUTLIER_RATIO, --outlier_ratio OUTLIER_RATIO
                        The percentage of data (outliers) to be removed from
                        the training data using a two-step training.
```

Prediction:

```shell
$ python autort.py predict -h
usage: autort.py [-h] [-t TEST] [-o OUT_DIR] [-p PREFIX] [-s ENSEMBLE] [-a]

AutoRT

optional arguments:
  -h, --help            show this help message and exit
  -t TEST, --test TEST  Input data for testing
  -o OUT_DIR, --out_dir OUT_DIR
                        Output directory
  -p PREFIX, --prefix PREFIX
  -s ENSEMBLE, --ensemble ENSEMBLE
  -a, --radam
```

Example training data is available in the example/data folder, as shown below. The training data should have at least two columns, x and y, in tab-delimited text format. The x column contains peptide sequences and the y column contains retention times. The unit of retention time can be minutes or seconds: set "-u m" if it is minutes and "-u s" if it is seconds.

```shell
$ head example/data/PXD006109_Cerebellum_rt_add_mox_all_rt_range_3_train.tsv
x	y
AAAAAAAAAAAAAAAGAAGK	58.056936170212765
AAAAAAAAAK	4.394
AAAAAAAAATEQQGSNGPVK	26.975
AAAAAAAAQ1HTK	8.2107
AAAAAAAAQMHTK	15.553
AAAAAASLLANGHDLAAAMAVDK	73.801
AAAAADLANR	18.85030769230769
AAAAAGAGLK	13.3316
AAAAALSGSPPQTEK	27.125375000000002
```
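Training tables in this format can be written with a few lines of Python. The following is a minimal sketch assuming pandas is available; the peptides, retention times, and the my_training_data.tsv path are illustrative placeholders:

```python
import pandas as pd

# Two-column, tab-delimited training table in the format shown above.
# Peptides and retention times here are illustrative placeholders.
data = pd.DataFrame({
    "x": ["AAAAAAAAAK", "AAAAADLANR"],  # peptide sequences
    "y": [4.394, 18.85030769230769],    # retention times in minutes (use '-u m')
})
data.to_csv("my_training_data.tsv", sep="\t", index=False)
```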

A peptide with modifications (a modified peptide) must be formatted as described below:

Each modified amino acid must be represented by a distinct integer (0-9) in the peptide sequence, and the encoding must be kept consistent between the training and testing (or prediction) data. For example, if there are four modifications in a dataset (oxidation on M and phosphorylation on S, T, and Y), then 1 can represent oxidation of M, 2 phosphorylation of S, 3 phosphorylation of T, and 4 phosphorylation of Y.

A peptide such as AS(phosphorylation)FDFADFST(phosphorylation)PY(phosphorylation)STM(oxidation)AAK must then be formatted as A2FDFADFS3P4ST1AAK.
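Because this substitution is purely mechanical, it can be scripted. Below is a minimal sketch using the four-modification example above; encode_modified_peptide and the MOD_CODES mapping are hypothetical helpers, not part of AutoRT:

```python
import re

# Hypothetical mapping from (residue, modification) to a digit, matching the
# four-modification example above; adjust it to your own dataset.
MOD_CODES = {
    ("M", "oxidation"): "1",
    ("S", "phosphorylation"): "2",
    ("T", "phosphorylation"): "3",
    ("Y", "phosphorylation"): "4",
}

def encode_modified_peptide(annotated):
    """Convert 'X(modification)' annotations to AutoRT's digit encoding."""
    def repl(match):
        return MOD_CODES[(match.group(1), match.group(2))]
    return re.sub(r"([A-Z])\(([^)]+)\)", repl, annotated)

print(encode_modified_peptide(
    "AS(phosphorylation)FDFADFST(phosphorylation)PY(phosphorylation)STM(oxidation)AAK"
))  # prints A2FDFADFS3P4ST1AAK
```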

Example testing data is available in the example/data folder, as shown below. The first column x is required; the second column y is optional.

```shell
$ head example/data/PXD006109_Cerebellum_rt_add_mox_all_rt_range_3_test.tsv
x	y
DSQFLAPDVTSTQVNTVVSGALDR	84.633
LLDLYASGER	51.07025
EMPQNVAK	16.787
GSPTGSSPNNASELSLASLTEK	62.60036363636364
LGLDEYLDK	57.118833333333335
FNGAQVNPK	20.01075
AVTISGTPDAIIQCVK	66.29124999999999
SSPVEYEFFWGPR	81.73833333333333
AIPSYSHLR	29.95825
```

An example showing how to train a highly accurate RT model using a small dataset from a global proteome experiment:

Please note that the length of the longest peptide in the training data must be <= 60, which is the maximum length supported by the base models in the GitHub folder models/general_base_model/.

```shell
$ cd example
$ cat transfer_learning.sh

# training
python ../autort.py train -i data/28CPTACCOprospectiveWVU2015081005CO037f01normaltrain.tsv -o tf_model/ -e 40 -b 64 -u m -m ../models/general_base_model/model.json -rlr -n 10

# prediction
python ../autort.py predict -t data/28CPTACCOprospectiveWVU2015081005CO037f01normaltest.tsv -s tf_model/model.json -o tf_prediction/ -p test
```

The training took less than 15 minutes using one Titan Xp GPU on a Linux server.

The parameter settings for -e, -b, -sm, -rlr, and -n in the example above work well for many datasets.

An example showing how to train models from scratch:

```shell
cd example

# training
python ../autort.py train -e 100 -b 64 -m ../models/base_model/model.json -u m -i data/PXD006109_Cerebellum_rt_add_mox_all_rt_range_3_train.tsv -sm min_max -rlr -n 20 -o PXD006109_models/
```

After the training is finished, the trained model files are saved in the folder `PXD006109_models/`. If users want to train a model that supports peptides longer than 60 amino acids (e.g., 100), set the parameter -l to a number larger than 60, for example -l 100.
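Before choosing a value for -l, it can help to check the longest peptide in the training data. The following is a minimal sketch assuming pandas; since modified residues are encoded as single digits, string length still counts one character per residue:

```python
import pandas as pd

# Find the longest peptide in the training data to guide the -l setting.
train = pd.read_csv(
    "data/PXD006109_Cerebellum_rt_add_mox_all_rt_range_3_train.tsv", sep="\t"
)
print("Longest peptide:", train["x"].str.len().max())  # choose -l >= this value
```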

Then we can use the trained models for prediction as shown below:

```shell
# prediction
python ../autort.py predict -t data/PXD006109_Cerebellum_rt_add_mox_all_rt_range_3_test.tsv -s PXD006109_models/model.json -o PXD006109_prediction/ -p test
```

The prediction result will be saved in the file `PXD006109_prediction/test.csv` and looks like below. The column `y_pred` contains the predicted retention times.

```shell
$ head PXD006109_prediction/test.csv
x	y	y_pred
DSQFLAPDVTSTQVNTVVSGALDR	84.633	87.11121
LLDLYASGER	51.07025	51.25605
EMPQNVAK	16.787	17.024113
GSPTGSSPNNASELSLASLTEK	62.60036363636364	61.83924
LGLDEYLDK	57.11883333333334	57.66608
FNGAQVNPK	20.01075	19.809322
AVTISGTPDAIIQCVK	66.29124999999999	68.5102
SSPVEYEFFWGPR	81.73833333333333	82.20954
AIPSYSHLR	29.95825	31.344095
```
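Since the test file also carries observed retention times in the y column, the predictions can be sanity-checked directly. The following is a minimal sketch assuming pandas and scipy are installed, and assuming the output file is tab-delimited as in the example above:

```python
import pandas as pd
from scipy.stats import pearsonr

# Compare predicted and observed retention times from the prediction output.
pred = pd.read_csv("PXD006109_prediction/test.csv", sep="\t")  # separator assumed
abs_error = (pred["y_pred"] - pred["y"]).abs()
r, _ = pearsonr(pred["y"], pred["y_pred"])
print(f"Median absolute error: {abs_error.median():.3f} minutes")
print(f"Pearson r: {r:.4f}")
```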

The training took less than 12 hours using one Titan Xp GPU on a Linux server.

Run AutoRT on Google Colab:

Example: example/Experiment_specific_RT_prediction_using_AutoRT.ipynb

Example: example/Experiment_specific_RT_prediction_using_MaxQuant_Ecoli.ipynb

Example: example/Phosphorylation_experiment_specific_RT_prediction_using_AutoRT_Colab.ipynb

Example: example/RT_prediction_using_AutoRT.ipynb

How to cite:

Wen, B., Li, K., Zhang, Y. et al. Cancer neoantigen prioritization through sensitive and reliable proteogenomics analysis. Nature Communications 11, 1759 (2020). https://doi.org/10.1038/s41467-020-15456-w

Applications:

A list of applications of AutoRT can be found at: AutoRT applications.

Owner

  • Name: Zhang Lab
  • Login: bzhanglab
  • Kind: organization
  • Location: Houston, TX

Translating omics data into biological insights.

GitHub Events

Total
  • Watch event: 1
  • Issue comment event: 1
Last Year
  • Watch event: 1
  • Issue comment event: 1