ecs129-project
A program that compares a protein structure prediction to a solved structure and evaluates the prediction's accuracy using RMSD.
Statistics
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
ECS 129 Winter Quarter 2022 Project - Protein Structure Comparison Program
What
This program uses the quaternion-based root-mean-square deviation algorithm from "Using Quaternions to Calculate RMSD" by Coutsias, Seok, and Dill (2004) to compare a set of protein structures to one another.
Coutsias, E.A., Seok, C. and Dill, K.A. (2004), Using quaternions to calculate RMSD. J. Comput. Chem., 25: 1849-1857. https://doi.org/10.1002/jcc.20110
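As a rough illustration of the method in the paper cited above, here is a minimal NumPy sketch of a quaternion-based RMSD. It is not the code from this repository: the function name `quaternion_rmsd` and its inputs (two equally sized arrays of matched coordinates) are assumptions made for the example. It centers both coordinate sets, builds the 4x4 symmetric matrix from their correlation matrix, and uses the largest eigenvalue to get the minimum RMSD over all proper rotations.

```python
import numpy as np


def quaternion_rmsd(x, y):
    """Minimum RMSD between two matched (N, 3) coordinate sets.

    Hypothetical helper for illustration; assumes row i of `x`
    corresponds to row i of `y` (e.g. the same alpha carbon).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 2 or x.shape[1] != 3 or x.shape != y.shape:
        raise ValueError("x and y must both have shape (N, 3)")

    # Remove the translation by moving both centroids to the origin.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)

    # 3x3 correlation matrix R, with R[i, j] = sum_k x[k, i] * y[k, j].
    r = x.T @ y
    (r11, r12, r13), (r21, r22, r23), (r31, r32, r33) = r

    # 4x4 symmetric matrix whose largest eigenvalue corresponds to
    # the optimal rotation expressed as a unit quaternion.
    f = np.array([
        [r11 + r22 + r33, r23 - r32, r31 - r13, r12 - r21],
        [r23 - r32, r11 - r22 - r33, r12 + r21, r13 + r31],
        [r31 - r13, r12 + r21, -r11 + r22 - r33, r23 + r32],
        [r12 - r21, r13 + r31, r23 + r32, -r11 - r22 + r33],
    ])

    # eigvalsh returns eigenvalues in ascending order.
    lam_max = np.linalg.eigvalsh(f)[-1]

    msd = (np.sum(x * x) + np.sum(y * y) - 2.0 * lam_max) / len(x)
    # Clamp tiny negative values caused by floating-point round-off.
    return float(np.sqrt(max(msd, 0.0)))
```

Only the largest eigenvalue is needed for the RMSD itself; the matching eigenvector would give the rotation, which this sketch does not compute.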
What, continued
This command-line utility takes five protein structure predictions (the default number of structures generated by AlphaFold) and one experimentally determined structure (solved by crystallography, cryo-EM, or other means) and evaluates how similar they are to one another using the root-mean-square deviations between the alpha carbons of each pair of structures.
The program displays a heatmap generated from the root-mean-square deviations of all pairwise comparisons between the structures, and a bar graph showing how the predictions stack up against one another when compared to the experimentally determined structure.
This program was written in Python 3 and relies on numpy, matplotlib, and biopython, among other dependencies. These are listed in requirements.txt.
Why
Protein structure prediction has been a huge area of research for decades, and as of late, it's been all over the news. For instance, AlphaFold recently went on display at CASP, the major biennial research competition dedicated to 3D protein structure work, and has been making headlines. We chose to make this topic the focus of our ECS 129 project and write a program that evaluates the validity of a protein structure prediction by comparing it to the empirically determined true structure.
Who
- Raymond Chen - programming, debugging, refactoring, report writing, additional improvements
- Emily Cheng - programming, report writing
- Daniel Cardenas - programming, report writing
Report
A report on this program was prepared for ECS 129 - Computational Structural Bioinformatics taught by Patrice Koehl. It can be found here.
2022-Jun-26 update: This report was written when the project was in a different state than it is now, before I merged my own branch containing my own individual work with the master branch, so some stuff (such as the presence of the CLI) might be different.
how to use
Open up your shell of choice and follow the instructions below.
the main program (command_line.py)
This is the entry point of the main program. Use the CLI (command-line interface) to interact with the program and calculate the root-mean-square deviation between alpha carbons in your structures of interest.
flags and arguments
- -a, the directory that contains AlphaFold's predictions in the form of PDB files. By default, it is ./structures/alphafold_predictions.
- -c, which protein chains of the predicted structures you want to examine. By default, it's just chain A. You can input AB if you want the program to read chains A and B, or ABCD for chains A, B, C, and D.
- -C, which protein chains of the solved structure you want to examine. Similar to -c.
- -h, required, the hash value assigned to your AlphaFold predictions by ColabFold.
- -p, required, the PDB ID of your solved structure.
- -s, the directory that contains the solved structures. These are the experimentally determined structures, solved using X-ray crystallography or cryo-EM or similar methods. By default, it is ./structures/solved_structures.
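For readers curious how flags like these could be wired up, here is a minimal sketch using Python's standard argparse module. Whether command_line.py actually uses argparse is an assumption, and the function name `build_parser` and the `dest` names are made up for the example. One detail worth noting: argparse normally reserves -h for its help message, so the sketch disables the built-in help and re-adds it as --help to free up -h for the ColabFold hash. Upper-casing the chain letters is also an assumption, based on the examples that pass lowercase letters.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags described above."""
    # add_help=False frees -h, which argparse would otherwise claim.
    parser = argparse.ArgumentParser(prog="command_line.py", add_help=False)
    parser.add_argument("--help", action="help",
                        help="show this help message and exit")
    parser.add_argument("-a", dest="predicted_dir",
                        default="./structures/alphafold_predictions",
                        help="directory containing AlphaFold predictions")
    parser.add_argument("-s", dest="solved_dir",
                        default="./structures/solved_structures",
                        help="directory containing solved structures")
    # Chain letters are upper-cased so that "ab" and "AB" mean the same thing
    # (an assumption for this sketch).
    parser.add_argument("-c", dest="predicted_chains", default="A",
                        type=str.upper,
                        help="chains of the predicted structures to examine")
    parser.add_argument("-C", dest="solved_chains", default="A",
                        type=str.upper,
                        help="chains of the solved structure to examine")
    parser.add_argument("-h", dest="colabfold_hash", required=True,
                        help="hash assigned to the predictions by ColabFold")
    parser.add_argument("-p", dest="pdb_id", required=True,
                        help="PDB ID of the solved structure")
    return parser
```

Calling `build_parser().parse_args(["-p", "1grq", "-h", "b2b87"])` would then fill in the two directories and both chain selections from their defaults.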
example usage
I type this command into my shell:
python .\command_line.py -p 1grq -h b2b87
The first word calls my python executable.
The second word is the entry point of the program, a python file called command_line.py.
Then, some flags and arguments.
1. -p 1grq tells the program that the PDB ID of the solved protein structure is 1GRQ. This is the PDB file for the solved structure of CHLORAMPHENICOL PHOSPHOTRANSFERASE IN COMPLEX WITH P-AMINO-CHLORAMPHENICOL FROM STREPTOMYCES VENEZUELAE.
2. -h b2b87 tells the program that the ColabFold hash is b2b87. When I used ColabFold to predict how this protein would fold, it gave me a jobname and attached a hash value to it. All the PDB files of AlphaFold's structure predictions now have this hash value, b2b87, in their filenames. The program will use this hash value to discern which structure predictions are of the same protein.
3. You'll notice there's no -s, -a, -c, or -C. The program will use the default values for the solved structures directory, AlphaFold predictions directory, chains of the predicted structure to examine, and chains of the solved structure to examine.
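To make the role of the hash concrete, here is a small sketch of how prediction files could be picked out of a directory by hash. This is not necessarily how the program does it: the function name `find_predictions` is hypothetical, and the sketch simply assumes the hash appears somewhere in each prediction's filename, as described above.

```python
from pathlib import Path


def find_predictions(directory, colabfold_hash):
    """Return the .pdb files in `directory` whose names contain the hash.

    Hypothetical helper; a plain substring match is an assumption and
    could over-match if the hash happens to appear in unrelated names.
    """
    return sorted(
        path
        for path in Path(directory).glob("*.pdb")
        if colabfold_hash in path.name
    )
```

With the defaults from the first example, `find_predictions("./structures/alphafold_predictions", "b2b87")` would return the prediction files for that job in filename order.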
The program will run and print some comparison stats to the shell, as well as display a heatmap and bar graph visualizing the data.
Here's the output.
```
The (New and Improved) ECS 129 Protein Structure Comparison Program.
© 2022-20XX rsrchen (github.com/rsrchen)

No argument provided for -a; default predicted structures directory (./structures/alphafold_predictions) will be used.
No argument provided for -s; default solved structures directory (./structures/solved_structures) will be used.
No argument provided for -c; default chain A will be used.
No argument provided for -C; default chain A will be used.
1grqrank1 and 1grqrank1 RMSD: 0.0099
1grqrank1 and 1grqrank2 RMSD: 0.188
1grqrank1 and 1grqrank3 RMSD: 0.2099
1grqrank1 and 1grqrank4 RMSD: 0.2562
1grqrank1 and 1grqrank5 RMSD: 0.1959
1grqrank1 and 1grqgoldstandard RMSD: 0.639
1grqrank2 and 1grqrank1 RMSD: 0.188
1grqrank2 and 1grqrank2 RMSD: 0.0047
1grqrank2 and 1grqrank3 RMSD: 0.1231
1grqrank2 and 1grqrank4 RMSD: 0.2149
1grqrank2 and 1grqrank5 RMSD: 0.206
1grqrank2 and 1grqgoldstandard RMSD: 0.5896
1grqrank3 and 1grqrank1 RMSD: 0.2099
1grqrank3 and 1grqrank2 RMSD: 0.1231
1grqrank3 and 1grqrank3 RMSD: 0.0047
1grqrank3 and 1grqrank4 RMSD: 0.2239
1grqrank3 and 1grqrank5 RMSD: 0.2126
1grqrank3 and 1grqgoldstandard RMSD: 0.58
1grqrank4 and 1grqrank1 RMSD: 0.2562
1grqrank4 and 1grqrank2 RMSD: 0.2149
1grqrank4 and 1grqrank3 RMSD: 0.2239
1grqrank4 and 1grqrank4 RMSD: 0.0081
1grqrank4 and 1grqrank5 RMSD: 0.1754
1grqrank4 and 1grqgoldstandard RMSD: 0.6273
1grqrank5 and 1grqrank1 RMSD: 0.1959
1grqrank5 and 1grqrank2 RMSD: 0.206
1grqrank5 and 1grqrank3 RMSD: 0.2126
1grqrank5 and 1grqrank4 RMSD: 0.1754
1grqrank5 and 1grqrank5 RMSD: 0.0166
1grqrank5 and 1grqgoldstandard RMSD: 0.6342
1grqgoldstandard and 1grqrank1 RMSD: 0.639
1grqgoldstandard and 1grqrank2 RMSD: 0.5896
1grqgoldstandard and 1grqrank3 RMSD: 0.58
1grqgoldstandard and 1grqrank4 RMSD: 0.6273
1grqgoldstandard and 1grqrank5 RMSD: 0.6342
1grqgoldstandard and 1grqgoldstandard RMSD: 0.011
```

One more example:
python .\command_line.py -p 1czd -h df4c0 -c b -C a -a new-directory/some-subdirectory -s new-directory
A few things are different from the previous example.
1. The PDB ID and hash values are different; I'm looking at CRYSTAL STRUCTURE OF THE PROCESSIVITY CLAMP GP45 FROM BACTERIOPHAGE T4 this time.
2. -c b is used to specify that the program should look at chain B of the predicted structure.
3. -C a is used to specify that the program should look at chain A of the solved structure.
4. -a and -s are used to specify that the program should search in the ./new-directory/some-subdirectory and ./new-directory paths to find the predicted structure and solved structure files, respectively.
Here's the output.
```
The (New and Improved) ECS 129 Protein Structure Comparison Program.
© 2022-20XX rsrchen (github.com/rsrchen)

1czdrank1 and 1czdrank1 RMSD: 0.0041
1czdrank1 and 1czdrank2 RMSD: 0.1409
1czdrank1 and 1czdrank3 RMSD: 0.1101
1czdrank1 and 1czdrank4 RMSD: 0.1227
1czdrank1 and 1czdrank5 RMSD: 0.212
1czdrank1 and 1czdgoldstandard RMSD: 0.6979
1czdrank2 and 1czdrank1 RMSD: 0.1409
1czdrank2 and 1czdrank2 RMSD: 0.019
1czdrank2 and 1czdrank3 RMSD: 0.1064
1czdrank2 and 1czdrank4 RMSD: 0.1256
1czdrank2 and 1czdrank5 RMSD: 0.2199
1czdrank2 and 1czdgoldstandard RMSD: 0.6843
1czdrank3 and 1czdrank1 RMSD: 0.1101
1czdrank3 and 1czdrank2 RMSD: 0.1064
1czdrank3 and 1czdrank3 RMSD: 0.016
1czdrank3 and 1czdrank4 RMSD: 0.1243
1czdrank3 and 1czdrank5 RMSD: 0.1768
1czdrank3 and 1czdgoldstandard RMSD: 0.6854
1czdrank4 and 1czdrank1 RMSD: 0.1227
1czdrank4 and 1czdrank2 RMSD: 0.1256
1czdrank4 and 1czdrank3 RMSD: 0.1243
1czdrank4 and 1czdrank4 RMSD: 0.0072
1czdrank4 and 1czdrank5 RMSD: 0.2341
1czdrank4 and 1czdgoldstandard RMSD: 0.7056
1czdrank5 and 1czdrank1 RMSD: 0.212
1czdrank5 and 1czdrank2 RMSD: 0.2199
1czdrank5 and 1czdrank3 RMSD: 0.1768
1czdrank5 and 1czdrank4 RMSD: 0.2341
1czdrank5 and 1czdrank5 RMSD: 0.0155
1czdrank5 and 1czdgoldstandard RMSD: 0.7256
1czdgoldstandard and 1czdrank1 RMSD: 0.6979
1czdgoldstandard and 1czdrank2 RMSD: 0.6843
1czdgoldstandard and 1czdrank3 RMSD: 0.6854
1czdgoldstandard and 1czdrank4 RMSD: 0.7056
1czdgoldstandard and 1czdrank5 RMSD: 0.7256
1czdgoldstandard and 1czdgoldstandard RMSD: 0.0106
```
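If you want to work with these numbers yourself, the printed lines are easy to parse. The sketch below is not part of the program: `rank_against_reference` is a hypothetical helper that pulls out each prediction's RMSD against the solved structure and sorts the predictions from closest to farthest. It assumes the solved structure's label ends in goldstandard, as it does in the output above.

```python
import re

# Matches lines of the form "<name> and <name> RMSD: <number>".
RMSD_LINE = re.compile(r"(\S+) and (\S+) RMSD: (\d+(?:\.\d+)?)")


def rank_against_reference(output_text, reference_suffix="goldstandard"):
    """Return (prediction, rmsd) pairs sorted by RMSD to the reference.

    Hypothetical helper; only comparisons of a prediction against the
    reference are kept, and reference-vs-reference is skipped.
    """
    scores = {}
    for first, second, value in RMSD_LINE.findall(output_text):
        is_prediction = not first.endswith(reference_suffix)
        if is_prediction and second.endswith(reference_suffix):
            scores[first] = float(value)
    return sorted(scores.items(), key=lambda item: item[1])


# The five prediction-vs-solved lines from the 1czd output above.
sample = """
1czdrank1 and 1czdgoldstandard RMSD: 0.6979
1czdrank2 and 1czdgoldstandard RMSD: 0.6843
1czdrank3 and 1czdgoldstandard RMSD: 0.6854
1czdrank4 and 1czdgoldstandard RMSD: 0.7056
1czdrank5 and 1czdgoldstandard RMSD: 0.7256
"""

for name, rmsd in rank_against_reference(sample):
    print(name, rmsd)
```

For this run, the prediction labeled rank2 comes out closest to the solved structure and rank5 the farthest.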

pdb length finder (pdb_length_finder.py)
This is an additional utility you can use to find the length of a particular protein sequence.
flags and arguments
- -n, required, the name of the file.
- -c, which protein chains you want to measure the length of.
example usage
I type this command into my shell:
python .\pdb_length_finder.py -n 1a2y.pdb -c ab
- I use -n to indicate the name of the file. That's 1a2y.pdb.
- I use -c to indicate the chains I want to measure the length of, chains A and B.
Here's the output.
```
PDB File Length Finder
Length of sequence specified: 223
```
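For a sense of what a length count like this involves, here is a dependency-free sketch that reads the fixed-column ATOM records of a PDB file directly. It is not the code in pdb_length_finder.py: the function name `chain_lengths` is hypothetical, and the sketch assumes that "length" means the number of residues that have an alpha carbon, summed over the chains you ask for, in the first model only.

```python
def chain_lengths(pdb_lines, chains):
    """Count residues with an alpha carbon in the requested chains.

    Hypothetical helper. `pdb_lines` is any iterable of PDB text lines
    and `chains` is a string of chain letters such as "ab".
    """
    wanted = {letter.upper() for letter in chains}
    seen = set()
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break  # stop after the first model of a multi-model file
        if not line.startswith("ATOM") or len(line) < 27:
            continue
        # Columns 13-16 hold the atom name; keep alpha carbons only.
        if line[12:16].strip() != "CA":
            continue
        chain_id = line[21]  # column 22
        if chain_id in wanted:
            # Columns 23-27: residue number plus insertion code. Using a
            # set means alternate locations are counted only once.
            seen.add((chain_id, line[22:27]))
    return len(seen)
```

With a file on disk, usage would look like `with open("1a2y.pdb") as handle: print(chain_lengths(handle, "ab"))`.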
Owner
- Login: rsrchen
- Kind: user
- Location: Northern California
- Website: landing.raidsrc.me
- Repositories: 1
- Profile: https://github.com/rsrchen
Student at UC Davis. @raidsrc is my personal account
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: ECS 129 Protein Structure Comparison Program
message: Please cite this software using these metadata.
type: software
authors:
  - email: rsrchen@ucdavis.edu
    given-names: Ray
    family-names: Chen
  - email: dancard@ucdavis.edu
    given-names: Daniel
    family-names: Cardenas
  - email: etcheng@ucdavis.edu
    given-names: Emily
    family-names: Cheng
repository-code: 'https://github.com/rsrchen/ecs129-project'
keywords:
  - protein
  - structure
  - sequence
  - comparison
  - bioinformatics
  - alphafold
  - quaternion
Dependencies
- bio ==1.7.1
- biopython ==1.83
- biothings-client ==0.3.1
- certifi ==2024.6.2
- charset-normalizer ==3.3.2
- colorama ==0.4.6
- contourpy ==1.2.1
- cycler ==0.12.1
- fonttools ==4.53.0
- gprofiler-official ==1.0.0
- idna ==3.7
- kiwisolver ==1.4.5
- matplotlib ==3.9.0
- mygene ==3.2.2
- numpy ==2.0.0
- packaging ==24.1
- pandas ==2.2.2
- pillow ==10.3.0
- platformdirs ==4.2.2
- pooch ==1.8.2
- pyparsing ==3.1.2
- python-dateutil ==2.9.0.post0
- pytz ==2024.1
- regex ==2024.5.15
- requests ==2.32.3
- six ==1.16.0
- tqdm ==4.66.4
- tzdata ==2024.1
- urllib3 ==2.2.2