https://github.com/bioinfomachinelearning/cdpred

Deep transformer for predicting interchain residue-residue distances of protein complexes

Last synced: 10 months ago · JSON representation

Repository

Deep transformer for predicting interchain residue-residue distances of protein complexes

Basic Info

Host: GitHub
Owner: BioinfoMachineLearning
License: mit
Language: Python
Default Branch: main
Size: 666 MB

Statistics

Stars: 12
Watchers: 4
Forks: 3
Open Issues: 0
Releases: 3

Created over 4 years ago · Last pushed over 2 years ago

Metadata Files

Readme License

CDPred

Function of CDPred

CDPred is a deep transformer tool for predicting interchain residue-residue distances of protein dimers or any two interacting chains in protein complexes. For a homodimer consiting of two identical chains, it takes the tertiary structure and multiple sequence alignment (MSA) of a chain as input to predict the residue-residue distances between the two chains. For a heterodimer consisting of two different chains, it takes the tertiary structures and MSA of the two chains as input to predict residue-residue distances between the two chains. The tertiary structure input for a chain can be generated by any tertiary structure prediction tool such as AlphaFold. The MSA can also be prepared by a third-party tool. If necessary, users can also use a custom script in this package to generate a MSA as input. The input is converted into numerical features that are used by 2D attention-based transformer networks to predict residue-residue distances between two chains in a dimer.

aria2c -x 10 https://zenodo.org/record/7650566/files/uniref90_01_2020.tar.xz?download=1 xz -d -T 4 uniref90_01_2020.tar.xz tar -xvf uniref90_01_2020.tar Modify the Uniref90 path in script ./lib/constants.py as /DownloadPath/uniref9001_2020/uniref90 The installation and configuration of the virtual environment lasts about 10 minutes (minor difference on different devices).
And the the Uniref90 database download will take about 40 minutes to 70 minutes, dependent on your network speed. If you encounter any errors related to the shared library while generating features, please add the corresponding libraries to your system path.

Running CDPred

Parameter Description of the CDPred prediction script.

Command:

python CDPred_Installation_Path/lib/Model_predict.py -n [name] -p [pdb_file_list] -a [a3m_file] -m [model_option] -o [out_path]

Parameters:

-n The name of the protein complex, can be protein ID or custom name.
-p The predicted monomer tertiary structure file or files with ".pdb" suffix. For homodimer inter-chain distance prediction, one predicted monomer structure file is enough. For heterodimer inter-chain distance prediction, both chains' predicted monomer structure files are required and needed to seperate by one space (Check the detail in Demo section).
-a Multiple sequence alignment (MSA) file in ".a3m" format. You can use your own or any third-party tool to generate MSA file, or you can follow the instruction in ZComplexMSA to install our custom MSA generation tool (Require large disk space and long time for dataset downloading).
-m Model option for different type prediction. Use "homodimer" for homodimer inter-chain distance prediction. Use "heterodimer" for heterodimer inter-chain distance prediction.
-o The custom output folder. It will be automaticly created if not exist.

Demo

Examples to make predictions on prepared input data

Demo1: Run CDPred on a homodimer target.

python lib/Model_predict.py -n T1084A_T1084B -p ./example/T1084A_T1084B.pdb -a ./example/T1084A_T1084B.a3m -m homodimer -o ./output/T1084A_T1084B/ The location of the pre-generated output files is ./example/expectionoutput/T1084AT1084B/, and the location of the output file generated by your run is ./output/T1084A_T1084B/. The whole prediction process will last about 5 minutis

Demo2: Run CDPred on a heterodimer target.

python lib/Model_predict.py -n H1017A_H1017B -p ./example/H1017A.pdb ./example/H1017B.pdb -a ./example/H1017A_H1017B.a3m -m heterodimer -o ./output/H1017A_H1017B/ The location of the pre-generated output files is ./example/expectionoutput/H1017AH1017B/, and the location of the output file generated by your run is ./output/H1017A_H1017B/. The whole prediction process will last about 5 minutis

Output files

The outputs will be saved in directory provided via the-o flag of Model_predict.py . The outputs include multiple sequence alignment files, feature files, and prediction inter-chain distance/contact maps. The --output_dir directory will have the following structure, "name" is provided via the -n flag of Model_predict.py:

<custom_output_name>/ feature/ "name"_pssm.txt "name".npy "name".mat "name".fasta "name".dist "name".aln "name".a3m predmap/ "name"_dist.rr "name"_con.rr "name".htxt "name".dist

The contents of each output file are as follows: * name_pssm.txt Position-specific scoring matrix (PSSM) feature. * name.npy Row attention map generated by ESM and used as one main co-evolutionary feture. * name.mat Co-evolutinary score matrix generate by CCMpred. * name.fasta Fasta sequence file of the input homodimer/heterodimer. * name.dist A combination distance map in shape LxL (L:the length of dimer) of tow monomer's carbon alpha distance map that extract from input prediction monomer structure. * name.aln Multiple sequence alignment in 'aln' . * name.a3m Multiple sequence alignment. * name_dist.rr Residue-Residue distance prediction in format i, j, dist. * i and j indicate specifying pairs of residues. * dist indicates the prediction Euclidean heavy-atom distance between i and j. * name_con.rr Residue-Residue contact prediction in format i, j, 0, 8, prob. * i and j indicate specifying pairs of residues. * 0 and 8 indicate the distance limits defining a contact. Here a pair of residues is defined to be in contact when the minimum distance of heavy atoms is less then 8 Angstroms. * prob indicate the prediction contact probability under above distance limits. * name.htxt Prediction inter-chain contact map. * name.dist Prediction inter-chain distance map.

Evaluation on a Small Dataset

Command:

python CDPred_Installation_Path/lib/distmap_evaluate.py -p [pred_map] -t [true_map] -f1 [fasta_file1] -f2 [fasta_file2]

Parameters:

-p The prediction contact map with '.htxt' suffix.
-t The nativate distance/contact map with '.htxt' suffix.
-f1 The fasta sequence file of chain 1 of dimer.
-f2 The fasta sequence file of chain 2 of dimer.

Demo1: Evaluate the homodimer target.

python ./lib/distmap_evaluate.py -p ./example/expection_output/T1084A_T1084B/predmap/T1084A_T1084B.htxt -t ./example/ground_truth/T1084A_T1084B.htxt -f1 ./example/ground_truth/T1084A.fasta -f2 ./example/ground_truth/T1084B.fasta Expection output of Demo1: NAME LEN_A LEN_B TOP5 TOP10 TOPL/10 TOPL/5 TOPL/2 TOPL T1084A_T1084B 71 71 100.0000 100.0000 100.0000 100.0000 94.2857 91.5493

Demo2: Evaluate the heterodimer target. python ./lib/distmap_evaluate.py -p ./example/expection_output/H1017A_H1017B/predmap/H1017A_H1017B.htxt -t ./example/ground_truth/H1017A_H1017B.htxt -f1 ./example/ground_truth/H1017A.fasta -f2 ./example/ground_truth/H1017B.fasta Expection output of Demo2: NAME LEN_A LEN_B TOP5 TOP10 TOPL/10 TOPL/5 TOPL/2 TOPL H1017A_H1017B 110 125 60.0000 60.0000 54.5455 50.0000 41.8182 36.3636

License

This project is covered under the MIT License.

Reference

[Guo, Z., Liu, J., Skolnick, J., & Cheng, J. (2022). Guo, Z., Liu, J., Skolnick, J., & Cheng, J. (2022). Prediction of inter-chain distance maps of protein complexes with 2D attention-based deep neural networks. Nature Communications, 13(1), 6963]

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

https://github.com/bioinfomachinelearning/cdpred

Science Score: 10.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

CDPred

Function of CDPred

Contents

System Requirements

OS Requirements

Python Dependencies

Installation guide

Running CDPred

Parameter Description of the CDPred prediction script.

Demo

Examples to make predictions on prepared input data

Output files

Evaluation on a Small Dataset

License

Reference

Owner

GitHub Events

Total

Last Year