ecs129-project

A program that compares a protein structure prediction to a solved structure and evaluates the prediction's accuracy using RMSD.

https://github.com/rsrchen/ecs129-project

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
✓ Academic publication links: Links to science.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.0%) to scientific vocabulary

Keywords

alphafold2 protein protein-structure python rmsd

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: rsrchen
Language: Python
Default Branch: master
Homepage:
Size: 10.6 MB

Statistics

Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created about 4 years ago · Last pushed over 1 year ago

ECS 129 Winter Quarter 2022 Project - Protein Structure Comparison Program

- What
- What, continued
- Why
- Who
- Report
- how to use

What

This program uses the quaternion-based root-mean-square deviation algorithm from "Using Quaternions to Calculate RMSD" by Coutsias, Seok, and Dill (2004) to compare a set of protein structures to one another.

Coutsias, E.A., Seok, C. and Dill, K.A. (2004), Using quaternions to calculate RMSD. J. Comput. Chem., 25: 1849-1857. https://doi.org/10.1002/jcc.20110
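
To make the idea concrete, here is a minimal sketch of the quaternion approach described in the paper, written with numpy. It is not the code from this repository: the function name `quaternion_rmsd` and its signature are made up for illustration, and it assumes both inputs are already matched atom for atom (same number of alpha carbons, same order). The minimum RMSD follows from the largest eigenvalue of a 4x4 symmetric matrix built from the correlation matrix of the two centered coordinate sets.

```python
import numpy as np


def quaternion_rmsd(coords_a, coords_b):
    """Minimum RMSD between two matched (N, 3) coordinate sets (illustrative helper)."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 3:
        raise ValueError("both inputs must have the same (N, 3) shape")

    # Move both barycenters to the origin so only a rotation remains to be found.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)

    # 3x3 correlation matrix: r[i, j] = sum over atoms of a[k, i] * b[k, j].
    r = a.T @ b
    (r11, r12, r13), (r21, r22, r23), (r31, r32, r33) = r

    # Symmetric 4x4 matrix; its largest eigenvalue belongs to the optimal rotation.
    f = np.array([
        [r11 + r22 + r33, r23 - r32, r31 - r13, r12 - r21],
        [r23 - r32, r11 - r22 - r33, r12 + r21, r13 + r31],
        [r31 - r13, r12 + r21, -r11 + r22 - r33, r23 + r32],
        [r12 - r21, r13 + r31, r23 + r32, -r11 - r22 + r33],
    ])
    largest = np.linalg.eigvalsh(f)[-1]  # eigvalsh returns eigenvalues in ascending order

    # Squared residual after optimal superposition; clamp tiny negative round-off.
    residual = (a ** 2).sum() + (b ** 2).sum() - 2.0 * largest
    return float(np.sqrt(max(residual, 0.0) / len(a)))
```

Because the residual is a difference of large, nearly equal numbers, comparing a structure to itself can give a small nonzero value rather than exactly zero.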

What, continued

This command-line utility takes five protein structure predictions (the default number of structures generated by AlphaFold) and one experimentally determined structure (derived from crystallography, cryo-EM, or other means) and evaluates their structural similarity to one another using the root-mean-square deviations between the alpha carbons of different structures.

The program displays a heatmap generated from the root-mean-square deviations of all comparisons of all protein structures and a bar graph illustrating how the protein structure predictions stack up against one another when compared to the experimentally-determined structure.

This program was written in Python3 and relies on numpy, matplotlib, and biopython among other dependencies. These are listed in requirements.txt.

Why

Protein structure prediction has been a huge area of research for decades, and as of late, it's been all over the news. For instance, AlphaFold recently competed at CASP, the major biennial research competition dedicated to 3D protein structure work, and has been making headlines. We chose to make this topic the focus of our ECS 129 project and write a program that evaluates the validity of a protein structure prediction by comparing it to the empirically determined true structure.

Who

Raymond Chen - programming, debugging, refactoring, report writing, additional improvements
Emily Cheng - programming, report writing
Daniel Cardenas - programming, report writing

Report

A report on this program was prepared for ECS 129 - Computational Structural Bioinformatics taught by Patrice Koehl. It can be found here.

2022-Jun-26 update: This report was written when the project was in a different state than it is now, before I merged my own branch containing my own individual work with the master branch, so some stuff (such as the presence of the CLI) might be different.

how to use

Open up your shell of choice and follow the instructions below.

the main program (`command_line.py`)

This is the entry point of the main program. Use the CLI (command-line interface) to interact with the program and calculate the root-mean-square deviation between alpha carbons in your structures of interest.

flags and arguments

-a, the directory that contains AlphaFold's predictions in the form of PDB files. By default, it is ./structures/alphafold_predictions.
-c, which protein chains of the predicted structures you want to examine. By default, it's just chain A. You can input AB if you want the program to read chains A and B, or ABCD for chains A, B, C, and D.
-C, which protein chains of the solved structure you want to examine. Similar to -c.
-h, required, the hash value assigned to your AlphaFold predictions by ColabFold.
-p, required, the PDB ID of your solved structure.
-s, the directory that contains the solved structures. These are the experimentally determined structures, solved using x-ray crystallography or cryo-EM or similar methods. By default, it is ./structures/solved_structures.
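
For readers who want to see how a flag set like this can be wired up, here is a rough sketch using Python's standard `argparse` module. It is an assumption, not the parser in `command_line.py`: the function name `build_parser` and the `dest` names are invented, and uppercasing the chain letters is a guess based on the examples in this README. One detail worth noting is that `argparse` normally reserves `-h` for help, so a parser that uses `-h` for the ColabFold hash has to turn the built-in help off.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags listed above (illustrative only)."""
    # add_help=False frees -h, which argparse would otherwise claim for --help.
    parser = argparse.ArgumentParser(prog="command_line.py", add_help=False)
    parser.add_argument("-a", dest="predictions_dir",
                        default="./structures/alphafold_predictions")
    parser.add_argument("-s", dest="solved_dir",
                        default="./structures/solved_structures")
    # Chains arrive as one string such as "AB"; uppercasing is an assumption.
    parser.add_argument("-c", dest="predicted_chains", default="A", type=str.upper)
    parser.add_argument("-C", dest="solved_chains", default="A", type=str.upper)
    parser.add_argument("-h", dest="colabfold_hash", required=True)
    parser.add_argument("-p", dest="pdb_id", required=True)
    return parser
```

Calling `build_parser().parse_args(["-p", "1grq", "-h", "b2b87"])` would then fill in the defaults for the four optional flags.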

example usage

I type this command into my shell: python .\command_line.py -p 1grq -h b2b87

The first word calls my python executable.

The second word is the entry point of the program, a python file called command_line.py.

Then, some flags and arguments.

1. -p 1grq tells the program that the PDB ID of the solved protein structure is 1GRQ. This is the PDB file for the solved structure of CHLORAMPHENICOL PHOSPHOTRANSFERASE IN COMPLEX WITH P-AMINO-CHLORAMPHENICOL FROM STREPTOMYCES VENEZUELAE.
2. -h b2b87 tells the program that the ColabFold hash is b2b87. When I used ColabFold to predict how this protein would fold, it gave me a jobname and attached a hash value to it. All the PDB files of AlphaFold's structure predictions now have this hash value, b2b87, in their filenames. The program will use this hash value to discern which structure predictions are of the same protein.
3. You'll notice there's no -s, -a, -c, or -C. The program will use the default values for the solved structures directory, AlphaFold predictions directory, chains of the predicted structure to examine, and chains of the solved structure to examine.
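
The hash-based lookup can be pictured with a short sketch like the one below. This is only a guess at the idea, not the program's actual file discovery: the helper name `find_predictions` is made up, and it simply assumes that every prediction is a `.pdb` file whose name contains the hash somewhere.

```python
from pathlib import Path


def find_predictions(directory, colabfold_hash):
    """Return the .pdb files in `directory` whose names contain the hash, sorted by name."""
    return sorted(
        path for path in Path(directory).glob("*.pdb")
        if colabfold_hash in path.name
    )
```

With the default directory, `find_predictions("./structures/alphafold_predictions", "b2b87")` would pick up only the predictions that belong to that ColabFold job and skip files from other jobs.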

The program will run and print some comparison stats to the shell, as well as display a heatmap and bar graph visualizing the data.

Here's the output.

```
No argument provided for -a; default predicted structures directory (./structures/alphafold_predictions) will be used.
No argument provided for -s; default solved structures directory (./structures/solved_structures) will be used.
No argument provided for -c; default chain A will be used.
No argument provided for -C; default chain A will be used.
1grqrank1 and 1grqrank1 RMSD: 0.0099
1grqrank1 and 1grqrank2 RMSD: 0.188
1grqrank1 and 1grqrank3 RMSD: 0.2099
1grqrank1 and 1grqrank4 RMSD: 0.2562
1grqrank1 and 1grqrank5 RMSD: 0.1959
1grqrank1 and 1grqgoldstandard RMSD: 0.639
1grqrank2 and 1grqrank1 RMSD: 0.188
1grqrank2 and 1grqrank2 RMSD: 0.0047
1grqrank2 and 1grqrank3 RMSD: 0.1231
1grqrank2 and 1grqrank4 RMSD: 0.2149
1grqrank2 and 1grqrank5 RMSD: 0.206
1grqrank2 and 1grqgoldstandard RMSD: 0.5896
1grqrank3 and 1grqrank1 RMSD: 0.2099
1grqrank3 and 1grqrank2 RMSD: 0.1231
1grqrank3 and 1grqrank3 RMSD: 0.0047
1grqrank3 and 1grqrank4 RMSD: 0.2239
1grqrank3 and 1grqrank5 RMSD: 0.2126
1grqrank3 and 1grqgoldstandard RMSD: 0.58
1grqrank4 and 1grqrank1 RMSD: 0.2562
1grqrank4 and 1grqrank2 RMSD: 0.2149
1grqrank4 and 1grqrank3 RMSD: 0.2239
1grqrank4 and 1grqrank4 RMSD: 0.0081
1grqrank4 and 1grqrank5 RMSD: 0.1754
1grqrank4 and 1grqgoldstandard RMSD: 0.6273
1grqrank5 and 1grqrank1 RMSD: 0.1959
1grqrank5 and 1grqrank2 RMSD: 0.206
1grqrank5 and 1grqrank3 RMSD: 0.2126
1grqrank5 and 1grqrank4 RMSD: 0.1754
1grqrank5 and 1grqrank5 RMSD: 0.0166
1grqrank5 and 1grqgoldstandard RMSD: 0.6342
1grqgoldstandard and 1grqrank1 RMSD: 0.639
1grqgoldstandard and 1grqrank2 RMSD: 0.5896
1grqgoldstandard and 1grqrank3 RMSD: 0.58
1grqgoldstandard and 1grqrank4 RMSD: 0.6273
1grqgoldstandard and 1grqrank5 RMSD: 0.6342
1grqgoldstandard and 1grqgoldstandard RMSD: 0.011
```
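
If you want to work with these numbers outside the program, a small script can read the printed lines back in and rank the predictions by their RMSD to the solved structure. The sketch below is not part of the repository; `parse_rmsd_lines` and `rank_against` are hypothetical helpers, and the regular expression assumes the `NAME and NAME RMSD: VALUE` line shape seen in the output above. The sample text is copied from that output.

```python
import re

# Matches lines shaped like "1grqrank1 and 1grqrank2 RMSD: 0.188".
LINE = re.compile(r"^(\S+) and (\S+) RMSD: ([0-9.]+)$")


def parse_rmsd_lines(text):
    """Map (name_a, name_b) to the RMSD for every matching line of output."""
    table = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            table[(match.group(1), match.group(2))] = float(match.group(3))
    return table


def rank_against(table, reference):
    """Sort the other structures by their RMSD to `reference`, lowest first."""
    scores = {b: v for (a, b), v in table.items() if a == reference and b != reference}
    return sorted(scores.items(), key=lambda item: item[1])


sample = """\
1grqgoldstandard and 1grqrank1 RMSD: 0.639
1grqgoldstandard and 1grqrank2 RMSD: 0.5896
1grqgoldstandard and 1grqrank3 RMSD: 0.58
1grqgoldstandard and 1grqrank4 RMSD: 0.6273
1grqgoldstandard and 1grqrank5 RMSD: 0.6342
"""
ranking = rank_against(parse_rmsd_lines(sample), "1grqgoldstandard")
print(ranking[0])  # ('1grqrank3', 0.58)
```

For this run, the prediction labeled rank 3 sits closest to the solved structure, although all five are within about 0.06 of one another.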

output plots

One more example:

python .\command_line.py -p 1czd -h df4c0 -c b -C a -a new-directory/some-subdirectory -s new-directory

A few things are different from the previous example.

1. The PDB ID and hash values are different; I'm looking at CRYSTAL STRUCTURE OF THE PROCESSIVITY CLAMP GP45 FROM BACTERIOPHAGE T4 this time.
2. -c b is used to specify that the program should look at chain B of the predicted structure.
3. -C a is used to specify that the program should look at chain A of the solved structure.
4. -a and -s are used to specify that the program should search in the ./new-directory/some-subdirectory and ./new-directory paths to find the predicted structure and solved structure files, respectively.

Here's the output.

```
1czdrank1 and 1czdrank1 RMSD: 0.0041
1czdrank1 and 1czdrank2 RMSD: 0.1409
1czdrank1 and 1czdrank3 RMSD: 0.1101
1czdrank1 and 1czdrank4 RMSD: 0.1227
1czdrank1 and 1czdrank5 RMSD: 0.212
1czdrank1 and 1czdgoldstandard RMSD: 0.6979
1czdrank2 and 1czdrank1 RMSD: 0.1409
1czdrank2 and 1czdrank2 RMSD: 0.019
1czdrank2 and 1czdrank3 RMSD: 0.1064
1czdrank2 and 1czdrank4 RMSD: 0.1256
1czdrank2 and 1czdrank5 RMSD: 0.2199
1czdrank2 and 1czdgoldstandard RMSD: 0.6843
1czdrank3 and 1czdrank1 RMSD: 0.1101
1czdrank3 and 1czdrank2 RMSD: 0.1064
1czdrank3 and 1czdrank3 RMSD: 0.016
1czdrank3 and 1czdrank4 RMSD: 0.1243
1czdrank3 and 1czdrank5 RMSD: 0.1768
1czdrank3 and 1czdgoldstandard RMSD: 0.6854
1czdrank4 and 1czdrank1 RMSD: 0.1227
1czdrank4 and 1czdrank2 RMSD: 0.1256
1czdrank4 and 1czdrank3 RMSD: 0.1243
1czdrank4 and 1czdrank4 RMSD: 0.0072
1czdrank4 and 1czdrank5 RMSD: 0.2341
1czdrank4 and 1czdgoldstandard RMSD: 0.7056
1czdrank5 and 1czdrank1 RMSD: 0.212
1czdrank5 and 1czdrank2 RMSD: 0.2199
1czdrank5 and 1czdrank3 RMSD: 0.1768
1czdrank5 and 1czdrank4 RMSD: 0.2341
1czdrank5 and 1czdrank5 RMSD: 0.0155
1czdrank5 and 1czdgoldstandard RMSD: 0.7256
1czdgoldstandard and 1czdrank1 RMSD: 0.6979
1czdgoldstandard and 1czdrank2 RMSD: 0.6843
1czdgoldstandard and 1czdrank3 RMSD: 0.6854
1czdgoldstandard and 1czdrank4 RMSD: 0.7056
1czdgoldstandard and 1czdrank5 RMSD: 0.7256
1czdgoldstandard and 1czdgoldstandard RMSD: 0.0106
```

output plots

pdb length finder (`pdb_length_finder.py`)

This is an additional utility you can use to find the length of a particular protein sequence.

flags and arguments

-n, required, the name of the file.
-c, which protein chains you want to measure the length of.

example usage

I type this command into my shell: python .\pdb_length_finder.py -n 1a2y.pdb -c ab

I use -n to indicate the name of the file. That's 1a2y.pdb.
I use -c to indicate the chains I want to measure the length of, chains A and B.

Here's the output.

```
PDB File Length Finder

Length of sequence specified: 223
```
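
One plausible way to measure such a length is to count the residues that have an alpha carbon in the chosen chains. The sketch below does that with plain string slicing on the fixed columns of PDB `ATOM` records. It is an assumption about the approach, not the code in `pdb_length_finder.py`, and `count_residues` is a made-up name; the real utility may count residues differently.

```python
def count_residues(pdb_text, chains):
    """Count residues with an alpha carbon in the given chains of PDB-format text."""
    wanted = {chain.upper() for chain in chains}
    seen = set()
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 27:
            continue
        # Fixed PDB columns (1-based): atom name 13-16, chain ID 22,
        # residue number 23-26, insertion code 27.
        if line[12:16].strip() == "CA" and line[21] in wanted:
            # A set keeps alternate locations of one residue from being counted twice.
            seen.add((line[21], line[22:27]))
    return len(seen)
```

Passing the contents of a PDB file and the string `"ab"` would count the alpha-carbon residues in chains A and B together.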

Owner

Login: rsrchen
Kind: user
Location: Northern California

Website: landing.raidsrc.me
Repositories: 1
Profile: https://github.com/rsrchen

Student at UC Davis. @raidsrc is my personal account

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: ECS 129 Protein Structure Comparison Program
message: Please cite this software using these metadata.
type: software
authors:
  - email: rsrchen@ucdavis.edu
    given-names: Ray
    family-names: Chen
  - email: dancard@ucdavis.edu
    given-names: Daniel
    family-names: Cardenas
  - email: etcheng@ucdavis.edu
    given-names: Emily
    family-names: Cheng
repository-code: 'https://github.com/rsrchen/ecs129-project'
keywords:
  - protein
  - structure
  - sequence
  - comparison
  - bioinformatics
  - alphafold
  - quaternion

Dependencies

requirements.txt pypi

bio ==1.7.1
biopython ==1.83
biothings-client ==0.3.1
certifi ==2024.6.2
charset-normalizer ==3.3.2
colorama ==0.4.6
contourpy ==1.2.1
cycler ==0.12.1
fonttools ==4.53.0
gprofiler-official ==1.0.0
idna ==3.7
kiwisolver ==1.4.5
matplotlib ==3.9.0
mygene ==3.2.2
numpy ==2.0.0
packaging ==24.1
pandas ==2.2.2
pillow ==10.3.0
platformdirs ==4.2.2
pooch ==1.8.2
pyparsing ==3.1.2
python-dateutil ==2.9.0.post0
pytz ==2024.1
regex ==2024.5.15
requests ==2.32.3
six ==1.16.0
tqdm ==4.66.4
tzdata ==2024.1
urllib3 ==2.2.2
