resnetppi
Predicting protein inter-chain residue distances from sequences irrespective of paired MSA.
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (4.5%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 4
- Watchers: 1
- Forks: 2
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
ResNetPPI: Predicting protein inter-chain residue distances from sequences irrespective of paired MSA
zhuzefeng@stu.pku.edu.cn (ORCID: 0000-0002-2761-3291)
A lab rotation project. Under development from 2021.12 to 2022.1.
Summary

Advantages
- accepts variable input size
- proteins of any length
- any number of homologous sequences
- does not rely on a paired MSA, so it can predict cross-species protein interactions
Training
Data Resources
- Sequence database, e.g. UniRef30_2020_06_hhsuite.tar.gz
- PDB structures
Input Features
- Protein Sequence * 2
- build MSA via HHblits if possible
- onehot encoding of amino acid type (20+gap+X)*2 + (hydrophobic+hydrophilic)*2
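The channel layout above can be sketched in a few lines of plain Python. This is only an illustration: the alphabet order, the hydrophobic residue set, and the function names `encode_sequence` and `encode_pair` are assumptions made for this example and are not taken from the repository.

```python
# Hypothetical sketch of the (20 + gap + X) + (hydrophobic + hydrophilic) encoding.
# Alphabet order and the hydrophobic set are assumptions, not the project's values.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"      # 20 standard amino acid types
ALPHABET = AMINO_ACIDS + "-X"             # + gap + unknown = 22 channels
HYDROPHOBIC = set("ACFILMVW")             # assumed grouping
N_CHANNELS = len(ALPHABET) + 2            # 22 + 2 = 24 channels per sequence


def encode_sequence(seq):
    """Return 24 channels, each a list of 0/1 values of length len(seq)."""
    channels = [[0] * len(seq) for _ in range(N_CHANNELS)]
    for pos, aa in enumerate(seq.upper()):
        if aa not in ALPHABET:
            aa = "X"                      # map anything unexpected to unknown
        channels[ALPHABET.index(aa)][pos] = 1
        if aa in AMINO_ACIDS:
            # channel 22: hydrophobic, channel 23: hydrophilic;
            # gaps and unknowns leave both at 0 (an assumption)
            channels[22 if aa in HYDROPHOBIC else 23][pos] = 1
    return channels


def encode_pair(reference, homolog):
    """Stack reference and homolog encodings: 48 channels x aligned length."""
    assert len(reference) == len(homolog), "sequences must be aligned"
    return encode_sequence(reference) + encode_sequence(homolog)


enc = encode_pair("MK-LV", "MKALX")
print(len(enc), len(enc[0]))
```

For a single-sequence input, calling `encode_pair(seq, seq)` fills the homolog channels with the reference values.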
Fitting Targets
Fitting Targets (i.e. Inter-chain Cβ-Cβ Distance Map) Examples
NOTE: Cα for GLY
(Images of each PDB structure and its inter-chain Cβ-Cβ distance map are not reproduced here.)

| pdb_id | human chain | virus chain | len(human chain) | len(virus chain) |
|---|---|---|---|---|
| 3wwt | A | B | 123 | 107 |
| 1im3 | E | H | 275 | 95 |
| 6bvv | A | B | 416 | 24 |
| 4rf1 | B | A | 75 | 321 |
| 6e5x | B | A | 13 | 127 |
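As a rough illustration of how such a target map can be built, the sketch below picks Cβ for every residue except GLY (Cα instead) and computes all inter-chain distances. The residue representation (a dict with `name` and `atoms`) and the function names are hypothetical; the repository may parse structures quite differently.

```python
import math


def representative_atom(residue):
    """Return the Cβ coordinates (Cα for GLY), or None if that atom is missing.

    `residue` is a hypothetical dict: {"name": "ALA", "atoms": {"CB": (x, y, z), ...}}.
    """
    wanted = "CA" if residue["name"] == "GLY" else "CB"
    return residue["atoms"].get(wanted)


def interchain_distance_map(chain_a, chain_b):
    """Return an L1 x L2 nested list of distances in Å; None marks a missing atom."""
    dmap = []
    for res_a in chain_a:
        pa = representative_atom(res_a)
        row = []
        for res_b in chain_b:
            pb = representative_atom(res_b)
            row.append(None if pa is None or pb is None else math.dist(pa, pb))
        dmap.append(row)
    return dmap


# Toy coordinates, invented for the example.
chain_a = [
    {"name": "GLY", "atoms": {"CA": (0.0, 0.0, 0.0)}},
    {"name": "ALA", "atoms": {"CA": (3.8, 0.0, 0.0), "CB": (3.8, 1.5, 0.0)}},
]
chain_b = [
    {"name": "SER", "atoms": {"CB": (0.0, 0.0, 5.0)}},
    {"name": "LEU", "atoms": {"CA": (1.0, 1.0, 1.0)}},   # Cβ not modeled
]
dmap = interchain_distance_map(chain_a, chain_b)
```

This sketch does not infer a missing Cβ from backbone atoms; entries without the representative atom are simply left as `None`.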
Real-valued distances are discretely binned:
- from 2Å to 20Å
- with bin size of 0.5Å
- 36 bins + 1 bin for $[20, +\infty)$
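A minimal sketch of this binning follows. The function name `distance_to_bin` is hypothetical, and the handling of distances below 2Å (placed in the first bin here) is an assumption, since the text does not specify it.

```python
def distance_to_bin(d, d_min=2.0, d_max=20.0, width=0.5):
    """Map a distance in Å to a class index in 0..36.

    Indices 0..35 cover [2, 20) in 0.5 Å steps; index 36 is the [20, +inf) bin.
    """
    n_bins = int(round((d_max - d_min) / width))   # 36 regular bins
    if d >= d_max:
        return n_bins                              # overflow bin
    if d < d_min:
        return 0                                   # assumption: clamp short distances
    return min(int((d - d_min) // width), n_bins - 1)


for d in (2.0, 2.5, 8.3, 19.99, 20.0, 35.0):
    print(d, distance_to_bin(d))
```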
Network Architecture
- ResNet1D
- 1d residual block: ((Conv1d + BatchNorm1d) + ELU + (Conv1d + BatchNorm1d)) (+) ELU
- 8 1d-blocks (64 channels)
- ResNet2D
- 2d residual block: ((Conv2d + BatchNorm2d) + ELU + (Conv2d + BatchNorm2d)) (+) ELU
- 16 2d-blocks (96 channels), cycling through dilations $(1,2,4,8)$
- (mini-)batch size: 1
- Cross-entropy Loss
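To make the dilation cycling concrete, the sketch below lists the per-block dilations of the ResNet2D and estimates the receptive field. The kernel size of 3 and the choice to apply the block's dilation to both convolutions are assumptions made for illustration; neither is stated above.

```python
from itertools import cycle, islice


def dilation_schedule(n_blocks=16, dilations=(1, 2, 4, 8)):
    """Dilation used by each residual block, cycling through `dilations`."""
    return list(islice(cycle(dilations), n_blocks))


def same_padding(kernel_size, dilation):
    """Padding that keeps the spatial size unchanged for an odd kernel at stride 1."""
    return dilation * (kernel_size - 1) // 2


def receptive_field(schedule, kernel_size=3, convs_per_block=2):
    """Receptive field (per axis) of stacked stride-1 dilated convolutions.

    Assumes both convolutions in a block share the block's dilation.
    """
    rf = 1
    for d in schedule:
        rf += convs_per_block * (kernel_size - 1) * d
    return rf


schedule = dilation_schedule()
print(schedule)
print(receptive_field(schedule))
```

Under these assumptions the receptive field is 241 positions per axis, which is wider than the 64 × 64 crops used during training.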
Model Design
- for each protein sequence (of length $L$) search homologous sequences and input the MSA (of size $K$) if possible, otherwise input the single sequence
- use the original MSA to calculate the weight $w_k$ for each homologous sequence, $w_k=\frac{1}{\text{count}(\text{sequences with identity}\ge 0.8)}$
- calculate $M_{\text{eff}}=\sum_{k}^{K}w_k$
- MSA Encoding: perform onehot-encoding for each pairwise alignment ($2\times L_k$, consider both insertion and deletion: $\rightarrow 48\times L_k$)
- onehot-encoding including 22+2 channels for the reference sequence, 22+2 channels for the homologous sequence
- 22: 20 amino acid types + 1 gap + 1 unknown type
- 2: 1 hydrophobic + 1 hydrophilic
- for a single-sequence input, all the homolog-related channels are filled with the reference sequence's corresponding values
- hence we get $\{48\times L_k, k\in K\}$
- MSA Embedding: feed each encoded pairwise alignment into the ResNet1D and get the embedded pairwise alignment ($64\times L_k$)
- hence we get $\{64\times L_k, k\in K\}$
- omit the insertion region of the homologous sequences, thus we can get a $K\times 64 \times L$ tensor
- $x_k\in R^{64\times L}$
- $x_k(i) \in R^{64}$
- Paired Evolution Aggregation
- calculate one body term
- $f_1(i)=\frac{1}{M_{\text{eff}_1}}\sum_{k}^{K_1}w_{1_k} x_{1_k}(i)$
- $f_2(j)=\frac{1}{M_{\text{eff}_2}}\sum_{k}^{K_2}w_{2_k} x_{2_k}(j)$
- apply max function
- $A(i,c) = \max\{x_k(i,c), k \in K\}$, c: channel; $A \in R^{64\times L}$
- calculate two body term
- $s(i,j) = \frac{1}{\sqrt{M_{\text{eff}_1}\cdot M_{\text{eff}_2}}}[A_1(i)\otimes A_2(j)], s(i,j)\in R^{64\times 64}$
- concatenate
- $h(i,j) = \text{concat}(f_1(i), f_2(j), s(i,j)), h(i,j)\in R^{4224}$
- Inter-chain Distance Estimation
- feed the $R^{4224 \times L_1\times L_2}$ tensor into the ResNet2D
- convert the ResNet2D outputs into the discrete distance distribution $37 \times L_1 \times L_2$ through a (Conv2d+BatchNorm2d+ELU) layer
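The sequence weighting and paired evolution aggregation steps can be sketched with nested lists as below. This is a toy illustration, not the project's code: the identity definition (fraction of identical aligned columns), all function names, and the use of 2 channels instead of 64 are assumptions chosen to keep the example small.

```python
import math


def identity(a, b):
    """Fraction of identical columns between two aligned sequences (assumed definition)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def sequence_weights(msa, threshold=0.8):
    """w_k = 1 / (number of sequences, itself included, with identity >= threshold)."""
    return [1.0 / sum(identity(s, t) >= threshold for t in msa) for s in msa]


def one_body(x, w):
    """x[k][c][i] is the K x C x L embedding; returns f[i][c], the weighted mean over k."""
    m_eff = sum(w)
    n_seq, n_ch, length = len(x), len(x[0]), len(x[0][0])
    return [[sum(w[k] * x[k][c][i] for k in range(n_seq)) / m_eff
             for c in range(n_ch)] for i in range(length)]


def max_over_msa(x):
    """A[i][c] = max over k of x[k][c][i]."""
    n_seq, n_ch, length = len(x), len(x[0]), len(x[0][0])
    return [[max(x[k][c][i] for k in range(n_seq))
             for c in range(n_ch)] for i in range(length)]


def pair_feature(f1_i, f2_j, a1_i, a2_j, m_eff1, m_eff2):
    """h(i, j) = concat(f1(i), f2(j), flattened scaled outer product of A1(i) and A2(j))."""
    scale = 1.0 / math.sqrt(m_eff1 * m_eff2)
    s = [scale * u * v for u in a1_i for v in a2_j]
    return f1_i + f2_j + s


# Toy inputs: chain 1 has K=3 sequences, chain 2 is a single sequence; C=2 channels.
msa1 = ["ACDEF", "ACDEY", "WWWWW"]
w1 = sequence_weights(msa1)
x1 = [[[float(k + c)] * 5 for c in range(2)] for k in range(3)]
msa2 = ["GG"]
w2 = sequence_weights(msa2)
x2 = [[[1.0, 1.0], [3.0, 3.0]]]

f1, f2 = one_body(x1, w1), one_body(x2, w2)
a1, a2 = max_over_msa(x1), max_over_msa(x2)
h = pair_feature(f1[0], f2[1], a1[0], a2[1], sum(w1), sum(w2))
print(len(h))
```

With C channels the pair feature has C + C + C·C entries, which for C = 64 gives 64 + 64 + 4096 = 4224, matching the dimension of $h(i,j)$.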
During Training
- randomly sample 1000 homologous sequences if $K > 1000$
- randomly crop the distance matrix into $64 \times 64$ shape
- for those input sequences of length $\le 64$, keep their original length
- the loss function ignores residues that are missing from the PDB structure
- definition of missing
- a residue with no modeled atoms at all
- GLY without a Cα atom
- non-GLY without a Cβ atom and lacking the anchoring atoms (Cα, N, C) needed to infer the Cβ position
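The cropping and the masked loss can be sketched as follows. The function names, the use of `None` to mark pairs involving a missing residue, and averaging the loss over the remaining pairs are assumptions for illustration; in a PyTorch training loop the same effect is typically obtained with an ignore index or a mask tensor.

```python
import math
import random


def sample_crop(len1, len2, crop=64, rng=random):
    """Pick a random window of at most crop x crop; shorter chains keep their full length."""
    h, w = min(len1, crop), min(len2, crop)
    i0 = rng.randint(0, len1 - h)      # randint is inclusive on both ends
    j0 = rng.randint(0, len2 - w)
    return i0, j0, h, w


def masked_cross_entropy(log_probs, targets):
    """Mean negative log-likelihood over residue pairs whose target is not None.

    log_probs[i][j] is a list of 37 log-probabilities; targets[i][j] is a bin
    index, or None when either residue is missing from the structure.
    """
    total, count = 0.0, 0
    for lp_row, t_row in zip(log_probs, targets):
        for lp, t in zip(lp_row, t_row):
            if t is None:
                continue               # missing residue: no contribution to the loss
            total -= lp[t]
            count += 1
    return total / count if count else 0.0


# Example: a 416 x 24 pair is cropped to a 64 x 24 window.
i0, j0, h, w = sample_crop(416, 24, rng=random.Random(0))

# Uniform predictions over 37 bins, with one masked pair.
uniform = [math.log(1.0 / 37)] * 37
log_probs = [[uniform, uniform], [uniform, uniform]]
targets = [[0, 36], [None, 12]]
loss = masked_cross_entropy(log_probs, targets)
```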
Current Training and Validation Results

Dataset Preparation
- training dataset (362)
- validation dataset (100)
NOTE: the datasets have not been carefully curated; they only serve as demo inputs for training and validation. The training and validation results are therefore just a demonstration of the model's learning ability.
TODO
- increase mini-batch size
- multiple GPU training
- improve the architecture
Hardware & Software Prerequisites
- Hardware (tested)
- GPU: NVIDIA Tesla T4 (16G)
- CUDA Version: 11.2
- Software
- Python 3.8 or later in a conda environment
- PyTorch and other Python packages: see requirements
Acknowledgment
I would like to thank Jinyuan Guo and Prof. Huaiqiu Zhu for their helpful discussions and for providing the devices used to develop this project.
Reference Papers
License
Owner
- Name: Zefeng Zhu
- Login: NatureGeorge
- Kind: user
- Company: PKU
- Website: https://naturegeorge.github.io
- Repositories: 5
- Profile: https://github.com/NatureGeorge
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Zhu
    given-names: Zefeng
    orcid: https://orcid.org/0000-0002-2761-3291
title: "`ResNetPPI`: Predicting protein inter-chain residue distances from sequences irrespective of paired MSA"
version: 0.1.0
date-released: 2022-01-05
license: Apache-2.0
repository-code: "https://github.com/naturegeorge/ResNetPPI"