deepfri-explainability
Explainability methods employed for the DeepFRI protein function algorithm
Basic Info
- Host: GitHub
- Owner: ScienceFair2018
- License: bsd-3-clause
- Language: Python
- Default Branch: main
- Size: 66.8 MB
Metadata Files
README.md
DeepFRI Explainer
Deep functional residue identification

Citing
```
@article{Gligorijevic2019,
  author = {Gligorijevic, Vladimir and Renfrew, P. Douglas and Kosciolek, Tomasz and Leman, Julia Koehler and Cho, Kyunghyun and Vatanen, Tommi and Berenberg, Daniel and Taylor, Bryn and Fisk, Ian M. and Xavier, Ramnik J. and Knight, Rob and Bonneau, Richard},
  title = {Structure-Based Function Prediction using Graph Convolutional Networks},
  year = {2019},
  doi = {10.1101/786236},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2019/10/04/786236},
  journal = {bioRxiv}
}
```
Dependencies
DeepFRI is tested to work under Python 3.7.
The required dependencies for DeepFRI are TensorFlow, Biopython and scikit-learn. To install all dependencies run:
pip install .
Protein Function Prediction: Running DeepFRI
To predict protein functions, use the predict.py script with the following options:
* seq (str): Protein sequence as a string
* cmap (str): Name of a file storing a protein contact map and sequence in *.npz file format (with the following numpy array variables: C_alpha, seqres. See examples/pdb_cmaps/)
* pdb (str): Name of a PDB file (cleaned)
* pdb_dir (str): Directory with cleaned PDB files (see examples/pdb_files/)
* cmap_csv (str): Filename of the catalogue (in *.csv file format) containing the mapping between protein names and the directory with *.npz files (see examples/catalogue_pdb_chains.csv)
* fasta_fn (str): Fasta filename (see examples/pdb_chains.fasta)
* model_config (str): JSON file with model filenames (see trained_models/)
* ont (str): Ontology (mf - Molecular Function, bp - Biological Process, cc - Cellular Component, ec - Enzyme Commission)
* output_fn_prefix (str): Output filename (the same prefix is used for the prediction and saliency files)
* verbose (bool): Whether or not to print function prediction results
* saliency (bool): Whether or not to compute GradCAM (outputs a *.json file)
* eb (bool): Whether or not to compute Excitation Backpropagation (outputs a *.json file)
* pgexplainer (bool): Whether or not to compute PGExplainer (outputs a *.json file)
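A contact map file passed with the cmap option can be inspected with NumPy before running a prediction. The sketch below writes and reads back a toy file that uses the two array names listed above (C_alpha and seqres). It assumes C_alpha holds a square matrix of pairwise C-alpha distances and seqres the one-letter sequence; the toy coordinates, the file name, and the 10 Å cutoff used to binarize the distances are illustrative assumptions, not values read from this repository.

```python
import os
import tempfile

import numpy as np

# Toy data: 5 residues placed on a line, 3.8 Å apart (illustrative only).
seq = "SMTDL"
coords = np.arange(len(seq), dtype=float)[:, None] * np.array([3.8, 0.0, 0.0])

# Pairwise Euclidean distances between the (assumed) C-alpha positions.
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# Save with the array names listed for *.npz inputs.
path = os.path.join(tempfile.mkdtemp(), "toy_chain.npz")
np.savez_compressed(path, C_alpha=dist, seqres=seq)

# Read the file back and derive a binary contact map.
with np.load(path) as data:
    c_alpha = data["C_alpha"]
    seqres = str(data["seqres"])

cutoff = 10.0  # assumed threshold in Å, not taken from the repository
contacts = (c_alpha < cutoff).astype(np.float32)
print(seqres, c_alpha.shape, int(contacts.sum()))
```

Checking that the matrix is square and that its size matches the sequence length is a cheap way to catch a malformed input file early.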
Generated files (see examples/outputs/):
* output_fn_prefix_MF_predictions.csv Predictions in the *.csv file format with columns: Protein, GO-term/EC-number, Score, GO-term/EC-number name
* output_fn_prefix_MF_pred_scores.json Predictions in the *.json file with keys: pdb_chains, Y_hat, goterms, gonames
* output_fn_prefix_MF_saliency_maps.json JSON file storing a dictionary of saliency maps for each predicted function of every protein
* output_fn_prefix_EB.json JSON file storing a dictionary of saliency maps for each predicted function of every protein
* output_fn_prefix_PGExplainer.json JSON file storing a dictionary of saliency maps for each predicted function of every protein
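A short sketch of how the pred_scores JSON file might be post-processed. It relies only on the four keys listed above (pdb_chains, Y_hat, goterms, gonames) and assumes Y_hat is a chains-by-terms matrix of scores aligned with the other lists. That layout, the function name top_predictions, the 0.5 threshold, and the toy values are assumptions made for illustration.

```python
import json


def top_predictions(scores, threshold=0.5):
    """Return {chain: [(go_term, go_name, score), ...]} sorted by score."""
    results = {}
    for chain, row in zip(scores["pdb_chains"], scores["Y_hat"]):
        hits = [
            (term, name, float(score))
            for term, name, score in zip(scores["goterms"], scores["gonames"], row)
            if score >= threshold
        ]
        results[chain] = sorted(hits, key=lambda hit: hit[2], reverse=True)
    return results


# Hand-written stand-in for the parsed JSON file (toy scores).
example = json.loads("""
{
  "pdb_chains": ["chain_1", "chain_2"],
  "goterms": ["GO:0005509", "GO:0003677"],
  "gonames": ["calcium ion binding", "DNA binding"],
  "Y_hat": [[0.9, 0.1], [0.2, 0.7]]
}
""")

for chain, hits in top_predictions(example).items():
    print(chain, hits)
```

For a real run, the example dictionary would be replaced by the result of json.load on the generated file.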
There are 6 options for input data for DeepFRI:
Option 1: predicting functions of a protein from its contact map
Example: predicting MF-GO terms for Parvalbumin alpha protein using its sequence and contact map (PDB: 1S3P):
```
python predict.py --cmap ./examples/pdb_cmaps/1S3P-A.npz -ont mf --verbose
```
Output:
Protein GO-term/EC-number Score GO-term/EC-number name
query_prot GO:0005509 0.99824 calcium ion binding
Option 2: predicting functions of a protein from its sequence
Example: predicting MF-GO terms for Parvalbumin alpha protein using its sequence (PDB: 1S3P):
```
python predict.py --seq 'SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKDGFIDEDELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES' -ont mf --verbose
```
Output:
Protein GO-term/EC-number Score GO-term/EC-number name
query_prot GO:0005509 0.99769 calcium ion binding
Option 3: predicting functions of proteins from a fasta file
```
python predict.py --fasta_fn examples/pdb_chains.fasta -ont mf -v
```
Output:
Protein GO-term/EC-number Score GO-term/EC-number name
1S3P-A GO:0005509 0.99769 calcium ion binding
2J9H-A GO:0004364 0.46937 glutathione transferase activity
2J9H-A GO:0016765 0.19910 transferase activity, transferring alkyl or aryl (other than methyl) groups
2J9H-A GO:0097367 0.10537 carbohydrate derivative binding
2PE5-B GO:0003677 0.53502 DNA binding
2W83-E GO:0032550 0.99260 purine ribonucleoside binding
2W83-E GO:0001883 0.99242 purine nucleoside binding
2W83-E GO:0005525 0.99231 GTP binding
2W83-E GO:0019001 0.99222 guanyl nucleotide binding
2W83-E GO:0032561 0.99194 guanyl ribonucleotide binding
2W83-E GO:0032549 0.99149 ribonucleoside binding
2W83-E GO:0001882 0.99135 nucleoside binding
2W83-E GO:0017076 0.98687 purine nucleotide binding
2W83-E GO:0032555 0.98641 purine ribonucleotide binding
2W83-E GO:0035639 0.98611 purine ribonucleoside triphosphate binding
2W83-E GO:0032553 0.98573 ribonucleotide binding
2W83-E GO:0097367 0.98168 carbohydrate derivative binding
2W83-E GO:0003924 0.52355 GTPase activity
2W83-E GO:0016817 0.36863 hydrolase activity, acting on acid anhydrides
2W83-E GO:0016818 0.36683 hydrolase activity, acting on acid anhydrides, in phosphorus-containing anhydrides
2W83-E GO:0017111 0.35465 nucleoside-triphosphatase activity
2W83-E GO:0016462 0.35303 pyrophosphatase activity
Option 4: predicting functions of proteins from contact map catalogue
```
python predict.py --cmap_csv examples/catalogue_pdb_chains.csv -ont mf -v
```
Output:
Protein GO-term/EC-number Score GO-term/EC-number name
1S3P-A GO:0005509 0.99824 calcium ion binding
2J9H-A GO:0004364 0.84826 glutathione transferase activity
2J9H-A GO:0016765 0.82014 transferase activity, transferring alkyl or aryl (other than methyl) groups
2PE5-B GO:0003677 0.89086 DNA binding
2PE5-B GO:0017111 0.12892 nucleoside-triphosphatase activity
2PE5-B GO:0004386 0.12847 helicase activity
2PE5-B GO:0032553 0.12091 ribonucleotide binding
2PE5-B GO:0097367 0.11961 carbohydrate derivative binding
2PE5-B GO:0016887 0.11331 ATPase activity
2W83-E GO:0097367 0.97069 carbohydrate derivative binding
2W83-E GO:0019001 0.96842 guanyl nucleotide binding
2W83-E GO:0017076 0.96737 purine nucleotide binding
2W83-E GO:0001882 0.96473 nucleoside binding
2W83-E GO:0035639 0.96439 purine ribonucleoside triphosphate binding
2W83-E GO:0032555 0.96294 purine ribonucleotide binding
2W83-E GO:0016818 0.96181 hydrolase activity, acting on acid anhydrides, in phosphorus-containing anhydrides
2W83-E GO:0032550 0.96142 purine ribonucleoside binding
2W83-E GO:0016817 0.96082 hydrolase activity, acting on acid anhydrides
2W83-E GO:0016462 0.95998 pyrophosphatase activity
2W83-E GO:0032553 0.95935 ribonucleotide binding
2W83-E GO:0032561 0.95930 guanyl ribonucleotide binding
2W83-E GO:0032549 0.95877 ribonucleoside binding
2W83-E GO:0003924 0.95453 GTPase activity
2W83-E GO:0001883 0.95271 purine nucleoside binding
2W83-E GO:0005525 0.94635 GTP binding
2W83-E GO:0017111 0.93942 nucleoside-triphosphatase activity
2W83-E GO:0044877 0.64519 protein-containing complex binding
2W83-E GO:0001664 0.31413 G protein-coupled receptor binding
2W83-E GO:0005102 0.20078 signaling receptor binding
Option 5: predicting functions of a protein from a PDB file
```
python predict.py -pdb ./examples/pdb_files/1S3P-A.pdb -ont mf -v
```
Output:
Protein GO-term/EC-number Score GO-term/EC-number name
query_prot GO:0005509 0.99824 calcium ion binding
Option 6: predicting functions of a protein from a directory with PDB files
```
python predict.py --pdb_dir ./examples/pdb_files -ont mf --saliency --use_backprop
```
Output:
See files in: examples/outputs/
Predict with Explainability
The commands I used were:
```
python predict.py --fasta_fn examples/pdb_chains.fasta --ont mf --saliency
python predict.py --fasta_fn examples/pdb_chains.fasta --ont mf --eb
python predict.py --fasta_fn examples/pdb_chains.fasta --ont mf --pgexplainer
```
Depending on the explainability method chosen, use one of the following 3 flags:
1. --saliency for GradCAM
2. --eb for Excitation Backpropagation
3. --pgexplainer for PGExplainer
These produce JSON files named DeepFRI_saliency_maps.json, DeepFRI_eb.json, or DeepFRI_PGExplainer.json respectively, depending on which explainability method was chosen.
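To run all three methods on the same input, the command lines can be generated in a loop. This is a minimal sketch: the helper name explain_commands is hypothetical, and it uses only the fasta_fn, ont, saliency, eb and pgexplainer options documented earlier. The subprocess call is left commented out so the snippet can be read without the repository checked out.

```python
# Flags taken from the list of explainability methods.
METHOD_FLAGS = {
    "GradCAM": "--saliency",
    "Excitation Backpropagation": "--eb",
    "PGExplainer": "--pgexplainer",
}


def explain_commands(fasta_fn, ont="mf"):
    """Build one predict.py command line per explainability method."""
    return {
        method: ["python", "predict.py", "--fasta_fn", fasta_fn, "--ont", ont, flag]
        for method, flag in METHOD_FLAGS.items()
    }


commands = explain_commands("examples/pdb_chains.fasta")
for method, cmd in commands.items():
    print(f"{method}: {' '.join(cmd)}")
    # import subprocess; subprocess.run(cmd, check=True)  # run from the repo root
```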
Explainability
GradCAM:
To visualize heatmaps, use the viz_gradCAM.py script with the following options:
* saliency_fn (str): JSON filename with saliency maps generated by the predict.py script
* list_all (bool): List all proteins and their predicted GO terms with corresponding class activation (saliency) maps
* protein_id (str): Protein (PDB chain) whose saliency maps are to be visualized for each predicted function
* go_id (str): GO term whose saliency maps are to be visualized
* go_name (str): GO name whose saliency maps are to be visualized
Generated files:
* saliency_fig_PDB-chain_GOterm.png class activation (saliency) map profile over sequence (see fig below, right)
* pymol_viz.py pymol script for mapping salient residues onto 3D structure (pymol output is shown in fig below, left)
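A saliency profile assigns one score to each residue, so the residues worth highlighting on the structure are the highest-scoring positions. A minimal sketch, assuming the scores for one protein and one GO term have already been read from the JSON file into a list aligned with the sequence. The function name top_salient_residues, the toy sequence and the toy scores are illustrative, and nothing is assumed here about the layout of the JSON file itself.

```python
import numpy as np


def top_salient_residues(sequence, scores, k=3):
    """Return the k highest-scoring residues as (1-based position, residue, score)."""
    scores = np.asarray(scores, dtype=float)
    if scores.shape != (len(sequence),):
        raise ValueError("need exactly one score per residue")
    # Sort descending; the stable sort keeps sequence order among ties.
    order = np.argsort(-scores, kind="stable")[:k]
    return [(int(i) + 1, sequence[i], float(scores[i])) for i in order]


toy_sequence = "SMTDLLSAED"
toy_scores = [0.1, 0.0, 0.3, 0.9, 0.2, 0.2, 0.8, 0.1, 0.05, 0.6]
for position, residue, score in top_salient_residues(toy_sequence, toy_scores):
    print(position, residue, score)
```

Positions here count from the start of the sequence string; residue numbering inside a PDB file may differ.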
Example:
```
python viz_gradCAM.py --saliency_fn DeepFRI_MF_saliency_maps.json --go_id GO:0097367 --protein_id 2PE5-B
```
Run pymol_viz.py in pymol to visualize the results.

Excitation Backpropagation:
To visualize heatmaps, use the viz_EB.py script with the following options:
* saliency_fn (str): JSON filename with saliency maps generated by the predict.py script
* list_all (bool): List all proteins and their predicted GO terms with corresponding class activation (saliency) maps
* protein_id (str): Protein (PDB chain) whose saliency maps are to be visualized for each predicted function
* go_id (str): GO term whose saliency maps are to be visualized
* go_name (str): GO name whose saliency maps are to be visualized
Example:
```
python viz_EB.py --saliency_fn DeepFRI_eb.json --go_id GO:0032553 --protein_id 2W83-E
```
Run pymol_viz.py in pymol to visualize the results.

PGExplainer:
To visualize heatmaps, use the viz_PGExplainer.py script with the following options:
* saliency_fn (str): JSON filename with saliency maps generated by the predict.py script
* list_all (bool): List all proteins and their predicted GO terms with corresponding class activation (saliency) maps
* protein_id (str): Protein (PDB chain) whose saliency maps are to be visualized for each predicted function
* go_id (str): GO term whose saliency maps are to be visualized
* go_name (str): GO name whose saliency maps are to be visualized
Example:
```
python viz_PGExplainer.py --saliency_fn DeepFRI_PGExplainer.json --go_id GO:0097367 --protein_id 2W83-E
```
Run pymol_viz.py in pymol to visualize the results.

The files that we (Valentina Simon, Ananya Krishna, Arjan Kohli) edited were:
- predict.py
- Predictor.py
- viz_gradCAM.py
- viz_EB.py
- viz_PGExplainer.py
Predictor.py can be found in the deepfrier folder.
We implemented the GradCAM, EB, and PGExplainer classes and their corresponding functions in Predictor.py, which is where the heatmaps are calculated.
Dependencies
- biopython ==1.76
- networkx ==2.4
- numpy ==1.18.5
- scikit-learn ==0.23.1
- tensorflow-gpu ==2.3.1