prime

protein structure prediction with precision

https://github.com/mqcomplab/prime

Keywords

algorithms molecular-dynamics proteins python

🪄 Predict Protein Structure with Precision 🪄

Overview

Protein structure prediction is important because the accuracy of a predicted structure shapes our understanding of the protein's function and its interactions with other molecules, which in turn can help in designing new drugs that target specific molecular interactions. Protein Retrieval via Integrative Molecular Ensembles (PRIME) is a novel algorithm that predicts the native structure of a protein from simulation or clustering data. This repo contains six different ways of determining the native structure of biomolecules from simulation or clustering data. In the systems studied, these methods mapped all the structural motifs and did so with linear scaling.

Fig 1. Superposition of the most representative structures found with extended indices (yellow) and experimental native structures (blue) of 2k2e.

Installation

PRIME requires Python 3.6+ and the following packages: MDAnalysis, numpy, and matplotlib.

git clone https://github.com/mqcomplab/PRIME.git
cd PRIME

Tutorial

The following tutorial will guide you through the process of determining the native structure of a biomolecule using the PRIME algorithm. If you already have clustered data, you can skip to Step 4.

1. Input Preparations

Preparation for Molecular Dynamics Trajectory

Prepare a valid topology file (e.g. .pdb, .prmtop), trajectory file (e.g. .dcd, .nc), and the atom selection. This step will convert a Molecular Dynamics trajectory to a numpy ndarray. Make sure the trajectory is already aligned and/or centered if needed!

A step-by-step tutorial can be found in scripts/inputs/preprocessing.ipynb.
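The conversion described above can be sketched as follows. This is a minimal illustration, not the notebook's code: the function names `flatten_frames` and `trajectory_to_numpy` are hypothetical, and storing each frame as one row of length 3 * N_atoms is an assumption about the array layout.

```python
import numpy as np


def flatten_frames(frames):
    """Stack per-frame (N_atoms, 3) coordinates into an (n_frames, 3 * N_atoms) array.

    The flattened row-per-frame layout is an assumption.
    """
    return np.array([np.asarray(f, dtype=float).reshape(-1) for f in frames])


def trajectory_to_numpy(top, traj, selection):
    """Hypothetical helper: read a trajectory with MDAnalysis and flatten it."""
    import MDAnalysis as mda  # imported lazily so flatten_frames works without it

    universe = mda.Universe(top, traj)
    atoms = universe.select_atoms(selection)
    # Copy the coordinates so every frame is stored independently
    frames = [atoms.positions.copy() for _ in universe.trajectory]
    return flatten_frames(frames)


# Tiny synthetic check: 4 frames of 5 atoms each
demo = flatten_frames(np.arange(60).reshape(4, 5, 3))
print(demo.shape)
```

With real data, the call would look like `trajectory_to_numpy('aligned_tau.pdb', 'aligned_tau.dcd', 'name CA')`, where the file names and the selection string are placeholders.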

2. Cluster Assignment

In this example, we will use k-means clustering to assign labels to the clusters and the number of clusters will be 20. Any clustering method can be used as long as the data is clustered (e.g. DBSCAN, Hierarchical Clustering). Please check out MDANCE for more clustering methods!

scripts/nani/assign_labels.py will assign labels to the clusters using k-means clustering

# System info - EDIT THESE
input_traj_numpy = '../../example/aligned_tau.npy'
N_atoms = 50
sieve = 1

# K-means params - EDIT THESE
n_clusters = 20
output_dir = 'outputs'

Inputs

System info

input_traj_numpy is the numpy array prepared from step 1.
N_atoms is the number of atoms used in the clustering; it should match the atom selection in step 1.
sieve takes every sieve-th frame from the trajectory for analysis.

k-means params

n_clusters is the number of clusters for labeling.
output_dir is the directory where the output files will be saved.
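As a rough illustration of what sieve does, the sketch below applies it to a synthetic array. The array is a stand-in for the output of step 1, and the assumption that it has 3 * N_atoms columns is mine, not something the script guarantees.

```python
import numpy as np

N_atoms = 50
sieve = 2

# Synthetic stand-in for the array from step 1: 10 frames, 3 * N_atoms columns (assumed layout)
traj = np.random.default_rng(0).random((10, 3 * N_atoms))

# Check that the array is consistent with the atom selection
if traj.shape[1] != 3 * N_atoms:
    raise ValueError("N_atoms does not match the trajectory array")

# Keep every sieve-th frame, and remember which original frames were kept
sieved = traj[::sieve]
kept_frames = np.arange(len(traj))[::sieve]
print(sieved.shape, kept_frames)
```

Keeping `kept_frames` around makes it possible to map cluster labels back to frame numbers in the original trajectory when sieve is greater than 1.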

Execution

python assign_labels.py

Outputs

  1. csv file containing the cluster labels for each frame.
  2. csv file containing the population of each cluster.
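The two outputs can be pictured with a small sketch that derives cluster populations from per-frame labels and writes them as CSV. The labels, column names, and helper name here are hypothetical and do not reflect the exact files the script writes.

```python
import csv
import io
from collections import Counter

labels = [0, 2, 1, 0, 0, 2]  # hypothetical cluster label for each frame


def population_rows(labels):
    """Return (cluster, n_frames, fraction) rows, most populated cluster first."""
    counts = Counter(labels)
    total = len(labels)
    return [(cluster, n, n / total) for cluster, n in counts.most_common()]


# Write the population table to an in-memory CSV (column names are assumptions)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["cluster", "n_frames", "fraction"])
writer.writerows(population_rows(labels))
print(buffer.getvalue())
```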

3. Cluster Trajectories

scripts/outputs/postprocessing.ipynb will use the indices from last step to extract the designated frames from the original trajectory for each cluster.
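The extraction step can be sketched with plain numpy indexing. This is an illustration under the assumption that the labels are aligned one-to-one with the rows of the trajectory array; `split_by_cluster` is a hypothetical name, not a function from the notebook.

```python
import numpy as np


def split_by_cluster(traj, labels):
    """Group trajectory rows by cluster label (one label per row is assumed)."""
    labels = np.asarray(labels)
    return {int(c): traj[labels == c] for c in np.unique(labels)}


# Synthetic data: 6 frames with 2 coordinates each
traj = np.arange(12.0).reshape(6, 2)
labels = [0, 1, 0, 2, 1, 0]

clusters = split_by_cluster(traj, labels)
for cluster_id, frames in clusters.items():
    print(cluster_id, frames.shape)
```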

4. Cluster Normalization

With already clustered data, scripts/normalization/normalize.py will normalize the trajectory data to the range $[0, 1]$ using min-max normalization.

# System info - EDIT THESE
input_top = '../../example/aligned_tau.pdb'
unnormed_cluster_dir = '../clusters/outputs/clusttraj_*'
output_base_name = 'normed_clusttraj'
atomSelection = 'resid 3 to 12 and name N CA C O H'
n_clusters = 10

Inputs

System info

input_top is the topology file used in the clustering.
unnormed_cluster_dir is the directory where the clustering files are located from step 3.
output_base_name is the base name for the output files.
atomSelection is the atom selection used in the clustering.
n_clusters is the number of clusters used in PRIME. If it is smaller than the total number of clusters, only the top n_clusters clusters are used.
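One way to picture the top-n selection is to rank clusters by population and keep the first n_clusters. Ranking by population is an assumption here, and the populations and the helper name `top_clusters` are hypothetical.

```python
def top_clusters(populations, n_clusters):
    """Return the ids of the n_clusters most populated clusters."""
    ranked = sorted(populations, key=populations.get, reverse=True)
    return ranked[:n_clusters]


populations = {0: 120, 1: 45, 2: 300, 3: 80}  # hypothetical frames per cluster
print(top_clusters(populations, 2))  # [2, 0]
```

Asking for more clusters than exist simply returns all of them in ranked order.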

Execution

python normalize.py

Outputs

  1. normed_clusttraj.c*.npy files, the normalized cluster files.
  2. normed_data.npy, all normalized files appended together.
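A minimal sketch of min-max normalization over several cluster arrays is shown below. It assumes a single global minimum and maximum shared by all clusters, which may differ from what normalize.py does, and `min_max_normalize` is a hypothetical name.

```python
import numpy as np


def min_max_normalize(cluster_arrays):
    """Scale all arrays to [0, 1] using one global min and max (an assumption)."""
    stacked = np.concatenate(cluster_arrays)
    lo, hi = stacked.min(), stacked.max()
    if hi == lo:
        raise ValueError("cannot normalize constant data")
    normed = [(a - lo) / (hi - lo) for a in cluster_arrays]
    # Second return value mirrors the idea of appending all normalized files together
    return normed, np.concatenate(normed)


clusters = [np.array([[1.0, 3.0], [5.0, 2.0]]), np.array([[9.0, 4.0]])]
normed, normed_data = min_max_normalize(clusters)
print(normed_data)
```

Using shared bounds keeps coordinates comparable across clusters; normalizing each cluster separately would put them on different scales.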

5. Similarity Calculations

scripts/prime/exec_similarity.py generates a similarity dictionary from running PRIME.

  • -h - for help with the argument options.
  • -m - method: pairwise, union, medoid, or outlier (required).
  • -n - number of clusters (required).
  • -i - similarity index, RR or SM (required).
  • -t - fraction of outliers to trim, as a decimal (default is None).
  • -w - weight clusters by the number of frames they contain (default is True).
  • -d - directory where the normed_clusttraj.c*.npy files are located (required).
  • -s - location of the summary file with the population of each cluster (required).
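For readers who want to see how such an interface could be declared, here is a sketch using argparse. It only mirrors the flags documented above and is not the script's actual source; the destination names and the way -w parses a boolean are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags (-h is added by argparse)."""
    parser = argparse.ArgumentParser(description="Similarity calculation (sketch)")
    parser.add_argument("-m", dest="method", required=True,
                        choices=["pairwise", "union", "medoid", "outlier"])
    parser.add_argument("-n", dest="n_clusters", type=int, required=True)
    parser.add_argument("-i", dest="index", required=True, choices=["RR", "SM"])
    parser.add_argument("-t", dest="trim_frac", type=float, default=None)
    # Assumption: -w takes the text "True" or "False"
    parser.add_argument("-w", dest="weighted", default=True,
                        type=lambda text: text.lower() != "false")
    parser.add_argument("-d", dest="normed_dir", required=True)
    parser.add_argument("-s", dest="summary_file", required=True)
    return parser


example = "-m union -n 10 -i SM -t 0.1 -d ../normalization -s ../clusters/outputs/summary_20.txt"
args = build_parser().parse_args(example.split())
print(args.method, args.n_clusters, args.index, args.trim_frac, args.weighted)
```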

Example

python ../../utils/similarity.py -m union -n 10 -i SM -t 0.1 -d ../normalization -s ../clusters/outputs/summary_20.txt

This command generates a similarity dictionary from the data in ../normalization (make sure you are in the prime directory) using the union method (2.2 in Fig 2) and the Sokal-Michener index, with 10% of the outliers trimmed. You can either run python exec_similarity.py or run the example above.
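As background on the two index names, the sketch below computes the classical pairwise Russell-Rao (RR) and Sokal-Michener (SM) indices for binary vectors. PRIME relies on extended similarity indices over continuous data, so this is only the textbook binary form and not the calculation the scripts perform.

```python
def pair_counts(x, y):
    """Count positions where both bits are 1 (a) and both are 0 (d), plus the length p."""
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)
    return a, d, len(x)


def russell_rao(x, y):
    """RR = a / p: only shared 1s count as agreement."""
    a, _, p = pair_counts(x, y)
    return a / p


def sokal_michener(x, y):
    """SM = (a + d) / p: shared 1s and shared 0s both count as agreement."""
    a, d, p = pair_counts(x, y)
    return (a + d) / p


x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
print(russell_rao(x, y), sokal_michener(x, y))
```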

Outputs

w_union_SM_t10.txt file with the similarity dictionary. The result is a dictionary organized as follows: keys are frame numbers, and values are [cluster 1 similarity, cluster 2 similarity, ..., average similarity of all clusters].
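The dictionary layout described above can be illustrated with a few made-up entries. Picking the frame with the highest average similarity is shown as one plausible way to read the data; it is an assumption, not a statement of how the representative frame is actually chosen.

```python
# Hypothetical similarity dictionary for two clusters:
# frame number -> [cluster 1 similarity, cluster 2 similarity, average]
similarity = {
    0: [0.91, 0.75, 0.83],
    1: [0.95, 0.80, 0.875],
    2: [0.60, 0.70, 0.65],
}


def best_frame(sim):
    """Return the frame whose last entry (the average similarity) is largest."""
    return max(sim, key=lambda frame: sim[frame][-1])


print(best_frame(similarity))  # 1
```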

6. Representative Frames

scripts/prime/execrepframes.py will determine the native structure of the protein using the similarity dictionary generated in step 5.

  • -h - for help with the argument options.
  • -m - method (name one method, or None for all methods).
  • -s - folder containing the w_union_SM_t10.txt file.
  • -i - similarity index (required).
  • -t - fraction of outliers to trim, as a decimal (default is None).
  • -d - directory where the normed_clusttraj.c* files are located (required if method is None).

Example

python ../../utils/rep_frames.py -m union -s outputs -d ../normalization -t 0.1 -i SM

Outputs

w_rep_SM_t10_union.txt file with the representative frames index.

Further Reading

For more information on the PRIME algorithm, please refer to the PRIME paper. Please cite using CITATION.bib.

Fig 2. Six techniques of protein refinement. Blue is top cluster.

Funding

Research contained in this package was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R35GM150620.

Owner

  • Name: mqcomplab
  • Login: mqcomplab
  • Kind: organization
  • Email: ramirandaq@gmail.com
  • Location: United States of America

MQ Lab software and projects

Citation (CITATION.bib)

@article{chen_protein_2024,
	title = {Protein retrieval via integrative molecular ensembles ({PRIME}) through extended similarity indices},
	url = {https://www.biorxiv.org/content/early/2024/03/21/2024.03.19.585783},
	doi = {10.1101/2024.03.19.585783},
	abstract = {Molecular dynamics (MD) simulations are ideally suited to describe conformational ensembles of biomolecules such as proteins and nucleic acids. Microsecond-long simulations are now routine, facilitated by the emergence of graphical processing units. Processing such ensembles on the basis of statistical mechanics can bring insights about different biologically relevant states, their representative structures, states, and even dynamics between states. Clustering, which groups objects based on structural similarity, is typically used to process ensembles, leading to different states, their populations, and the identification of representative structures. For some purposes, such as in protein structure prediction, we are interested in identifying the representative structure that is more similar to the native state of the protein. The traditional pipeline combines hierarchical clustering for clustering and selecting the cluster centroid as representative of the cluster. However, even when the first cluster represents the native basin, the centroid can be several angstroms away in RMSD from the native state and many other structures inside this cluster could be better choices of representative structures, reducing the need for protein structure refinement. In this study, we developed a module Protein Retrieval via Integrative Molecular Ensemble (PRIME), that consists of tools to determine the most prevalent states in an ensemble using extended continuous similarity. PRIME is integrated with our Molecular Dynamics Analysis with N ary Clustering Ensembles (MDANCE) package and can be used as a post-processing tool for arbitrary clustering algorithms, compatible with several MD suites. PRIME was validated with ensembles of different protein and protein complex systems for their ability to reliably identify the most native-like state, which we compare to their experimental structure, and to the traditional approach. 
Systems were chosen to represent different degrees of difficulty such as folding processes and binding which require large conformational changes. PRIME predictions produced structures that when aligned to the experimental structure were better superposed (lower RMSD). A further benefit of PRIME is its linear scaling rather than the traditional O($N^2$) traditionally associated to comparisons of elements in a set.},
	journal = {bioRxiv : the preprint server for biology},
	author = {Chen, Lexin and Mondal, Arup and Perez, Alberto and Miranda-Quintana, Ramon Alain},
	year = {2024},
}
