mini-proteins

Simple molecular dynamics simulations of mini-proteins in GROMACS. Designed to facilitate machine learning algorithm development and encourage greater dataset diversity.

https://github.com/hunter-heidenreich/mini-proteins

Mini-Proteins

A repo of simple molecular dynamics simulations of small proteins with GROMACS.

Overview

This repo contains scripts that use GROMACS to take a mini-protein through the following steps:

  • Perform energy minimization
  • Solvate the protein
  • Add ions to neutralize the system
  • Equilibrate the system (NVT)
  • Equilibrate the system (NPT)
  • Run a production simulation
  • Post-process the simulation

In this repo, "mini-protein" is a non-technical designation for a single amino acid residue (or a dipeptide) capped with an acetyl group (Ace) and an N-methylamide group (Nme).

Alanine dipeptide (Ace-Ala-Nme) is frequently used as a model system for protein folding studies. It is especially popular with machine learning researchers because it is small enough to simulate quickly but large enough to exhibit interesting conformational behavior.

This repo extends the typical alanine dipeptide data-generation workflow to other amino acids. Not all amino acids are included, but the scripts make it easy to generate several dipeptide "mini-proteins" for machine learning studies, adding some diversity to the systems considered.

For example, the thioether sulfur in methionine dipeptide could be used to study the effects of a sulfur-containing side chain on protein folding (methionine itself cannot form disulfide bonds; those require cysteine). Or the addition of a tryptophan residue could be used to study the effects of aromatic residues on protein folding. Furthermore, glycine dipeptide, whose side chain is a single hydrogen atom, could be used to study the extra flexibility of a residue with a minimal side chain.

The scripts build on the GROMACS tutorial by Luca Tubiana at the University of Trento. We make several key deviations:

  • Langevin dynamics is used instead of velocity rescaling
  • The production simulation is run for a longer time
  • The production simulation writes uncompressed trajectory files, which are much larger but allow for force extraction
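To make the Langevin and uncompressed-output deviations concrete, here is a minimal sketch of the kind of `.mdp` options they usually correspond to in GROMACS. The option names (`integrator = sd`, `nstxout`, `nstfout`, `nstxout-compressed`) are standard GROMACS settings, but every value below other than the 298 K temperature is an assumption for illustration; the repo's actual config/prod.mdp may differ.

```python
# Hypothetical production-run options; NOT copied from the repo's config/prod.mdp.
PROD_OPTIONS = {
    "integrator": "sd",       # stochastic dynamics, i.e. a Langevin integrator
    "tc-grps": "System",      # assumed: one temperature group for everything
    "tau-t": 1.0,             # assumed inverse friction constant, in ps
    "ref-t": 298,             # K, the temperature stated in this README
    "nstxout": 500,           # assumed interval: uncompressed positions (.trr)
    "nstfout": 500,           # assumed interval: uncompressed forces (.trr)
    "nstxout-compressed": 0,  # no .xtc output; that format stores positions only
}


def render_mdp(options):
    """Render a dict of options as aligned 'key = value' lines of an .mdp file."""
    width = max(len(key) for key in options)
    lines = [f"{key:<{width}} = {value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"
```

With the `sd` integrator the friction term itself acts as the thermostat, which is why no separate velocity-rescaling coupling appears in the sketch.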

Usage

0. Prepare the simulation structure

The first step is to prepare the simulation structure. This is done by running the 0_preprocess.sh script:

    ID=ala sh scripts/0_preprocess.sh

where ID is the three-letter amino acid code of the protein to simulate.

This script will:

  • Build the protein topology from the PDB file
  • Build the box
  • Solvate the protein in water
  • Add ions to neutralize the system

Additional parameters can be found at the top of the script.
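The command above passes ID to the script as an environment variable. When driving the workflow from Python, the same invocation can be sketched as follows. The helper names `build_invocation` and `run_preprocess` are hypothetical and not part of the repo, and the three-letter check is an assumption based on the ID convention described here.

```python
import os
import subprocess


def build_invocation(residue_id, script="scripts/0_preprocess.sh"):
    """Return (command, extra_env) equivalent to `ID=<id> sh <script>`."""
    # Assumption: IDs are three lowercase letters such as "ala" or "gly".
    if len(residue_id) != 3 or not residue_id.isalpha() or not residue_id.islower():
        raise ValueError(f"expected a three-letter lowercase code, got {residue_id!r}")
    return ["sh", script], {"ID": residue_id}


def run_preprocess(residue_id):
    """Run the preprocessing script with ID set, raising if the script fails."""
    command, extra_env = build_invocation(residue_id)
    # Merge ID into the current environment so that gmx stays on PATH.
    subprocess.run(command, env={**os.environ, **extra_env}, check=True)
```

Calling `run_preprocess("ala")` from the repository root would then mirror the shell command shown above.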

1. Energy Minimization & Equilibration

The next step is to perform energy minimization and equilibration. This is done by running the 1_equil.sh script:

    ID=ala sh scripts/1_equil.sh

where ID is the three-letter amino acid code of the protein to simulate.

This script will:

  • Perform energy minimization (using steepest descent, see config/minim.mdp for all parameters)
  • Equilibrate the system at constant volume (NVT, T = 298 K, see config/nvt.mdp for all parameters) for 100 ps
  • Equilibrate the system at constant pressure (NPT, T = 298 K, P = 1 bar, see config/npt.mdp for all parameters) for 200 ps

Additional parameters can be found at the top of the script.

2. Production Simulation

The next step is to run the production simulation. This is done by running the 2_prod.sh script:

    ID=ala sh scripts/2_prod.sh

where ID is the three-letter amino acid code of the protein to simulate.

This script will:

  • Run the production simulation (NVT, T = 298 K, see config/prod.mdp for all parameters) for 1 ns
      • A full simulation would be much longer, but this is sufficient for a demonstration

Additional parameters can be found at the top of the script.
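GROMACS takes run lengths as a number of integration steps rather than as a duration, so the 100 ps, 200 ps, and 1 ns figures above each translate into an `nsteps` value. The sketch below shows that arithmetic under an assumed 2 fs time step; the time step is not stated in this README, so check the `dt` entry in the relevant .mdp file before relying on these numbers.

```python
def n_steps(duration_ps, dt_fs=2):
    """Number of integration steps needed to cover duration_ps picoseconds.

    dt_fs is the time step in femtoseconds. The default of 2 fs is an
    assumption, not a value taken from the repo's .mdp files.
    """
    duration_fs = duration_ps * 1000  # 1 ps = 1000 fs
    if duration_fs % dt_fs != 0:
        raise ValueError("duration is not a whole number of time steps")
    return duration_fs // dt_fs


# Durations quoted in this README, in picoseconds.
RUN_LENGTHS_PS = {"nvt": 100, "npt": 200, "prod": 1000}
STEPS = {stage: n_steps(ps) for stage, ps in RUN_LENGTHS_PS.items()}
```

Working in integer femtoseconds avoids floating-point rounding when dividing a duration by the time step.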

3. Post-Process Simulation

The final step is to post-process the simulation.

This is done by running the 3_post.sh script:

    ID=ala sh scripts/3_post.sh

where ID is the three-letter amino acid code of the protein to simulate.

This script will:

  • Generate a plot of the potential energy over time
  • Generate a plot of the total energy over time
  • Generate a plot of the temperature over time
  • Extract the trajectory as a PDB file
  • Extract the forces as an XVG file

Additional parameters can be found at the top of the script.
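XVG files written by GROMACS are plain text: lines starting with `#` are comments, lines starting with `@` carry plot metadata, and the remaining lines are whitespace-separated numeric columns. A minimal reader for downstream machine learning code might look like the sketch below. `parse_xvg` is a hypothetical helper, not part of the repo, and the meaning of each column depends on which GROMACS tool wrote the file.

```python
def parse_xvg(text):
    """Split the contents of an XVG file into (legends, rows).

    legends: series names from '@ sN legend "..."' lines, in file order.
    rows: one list of floats per data line (the first column is usually time).
    """
    legends, rows = [], []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, '#' comments, and '&' dataset separators.
        if not line or line[0] in "#&":
            continue
        if line[0] == "@":
            tokens = line[1:].split()
            is_series_legend = (
                len(tokens) >= 3
                and tokens[0][:1] == "s"
                and tokens[0][1:].isdigit()
                and tokens[1] == "legend"
                and '"' in line
            )
            if is_series_legend:
                legends.append(line.split('"')[1])
            continue
        rows.append([float(value) for value in line.split()])
    return legends, rows
```

The rows can be passed to `numpy.array` if NumPy is installed; the sketch itself uses only the standard library.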

All-in-one

Alternatively, all of the above steps can be run at once by running the run.sh script:

    ID=ala sh scripts/run.sh

where ID is the three-letter amino acid code of the protein to simulate.

Included Proteins (And Provenance)

Alanine Dipeptide

  • data/ala.pdb
  • Alanine Dipeptide (Ace-Ala-Nme)
  • PubChem CID: 5484387 (URL)
  • ATB: URL

Glycine Dipeptide

  • data/gly.pdb
  • Glycine Dipeptide (Ace-Gly-Nme)
  • PubChem CID: 439506 (URL)
  • ATB: URL

Isoleucine Dipeptide

  • data/ile.pdb
  • Isoleucine Dipeptide (Ace-Ile-Nme)
  • PubChem CID: 7019852 (URL)
  • ATB: URL

Leucine Dipeptide

  • data/leu.pdb
  • Leucine Dipeptide (Ace-Leu-Nme)
  • PubChem CID: 6950977 (URL)
  • ATB: URL

Methionine Dipeptide

  • data/met.pdb
  • Methionine Dipeptide (Ace-Met-Nme)
  • PubChem CID: 13875186 (URL)
  • ATB: URL
  • Contains a thioether (C-S-C) group in its side chain, not a disulfide bond

Phenylalanine Dipeptide

  • data/phe.pdb
  • Phenylalanine Dipeptide (Ace-Phe-Nme)
  • PubChem CID: 7019860 (URL)
  • ATB: URL

Proline Dipeptide

  • data/pro.pdb
  • Proline Dipeptide (Ace-Pro-Nme)
  • PubChem CID: 5245806 (URL)
  • ATB: URL

Tryptophan Dipeptide

  • data/trp.pdb
  • Tryptophan Dipeptide (Ace-Trp-Nme)
  • PubChem CID: 151412 (URL)
  • ATB: URL

Valine Dipeptide

  • data/val.pdb
  • Valine Dipeptide (Ace-Val-Nme)
  • PubChem CID: 13875188 (URL)
  • ATB: URL
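Because every included structure follows the data/<ID>.pdb naming pattern, generating a dataset across all nine dipeptides is a loop over their IDs. The sketch below checks which structure files are present and then calls run.sh for each one. The helper names are hypothetical and not part of the repo, and the code assumes it is run from the repository root.

```python
import os
import subprocess
from pathlib import Path

# The three-letter IDs of the structures listed in this README.
INCLUDED_IDS = ["ala", "gly", "ile", "leu", "met", "phe", "pro", "trp", "val"]


def structure_path(residue_id, data_dir="data"):
    """Path of the input structure for one ID, following data/<ID>.pdb."""
    return Path(data_dir) / f"{residue_id}.pdb"


def missing_structures(ids=INCLUDED_IDS, data_dir="data"):
    """Return the IDs whose PDB file cannot be found under data_dir."""
    return [i for i in ids if not structure_path(i, data_dir).is_file()]


def run_all(ids=INCLUDED_IDS):
    """Run the all-in-one script once per ID, stopping at the first failure."""
    missing = missing_structures(ids)
    if missing:
        raise FileNotFoundError(f"no PDB file for: {', '.join(missing)}")
    for residue_id in ids:
        subprocess.run(
            ["sh", "scripts/run.sh"],
            env={**os.environ, "ID": residue_id},
            check=True,
        )
```

The runs are sequential on purpose: the systems are small, and a failure on one ID is easier to diagnose when it stops the loop.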

Citation

If you use this repo in your research, please cite:

@misc{Heidenreich_Mini-proteins_2023,
  author = {Heidenreich, Hunter},
  month = sep,
  title = {{Mini-proteins}},
  url = {https://github.com/hunter-heidenreich/mini-proteins},
  year = {2023}
}

Owner

  • Name: Hunter Heidenreich
  • Login: hunter-heidenreich
  • Kind: user
  • Location: Cambridge, MA
  • Company: Harvard University

Citation (CITATION.cff)

cff-version: 1.2.0
type: dataset
message: "If you use this software, please cite it as below."
authors:
- family-names: "Heidenreich"
  given-names: "Hunter"
title: "Mini-proteins"
version: 0.0.1
date-released: "2023-09-19"
url: "https://github.com/hunter-heidenreich/mini-proteins"

Dependencies

requirements.txt pypi
  • matplotlib *
  • numpy *