psfm

Position Specific Frequency Matrix

https://github.com/thp42/psfm

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org, ncbi.nlm.nih.gov
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary

Repository


Basic Info
  • Host: GitHub
  • Owner: thp42
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 19.5 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed 11 months ago
Metadata Files
Readme License Citation

README.md

PSFM - Position Specific Frequency Matrix and Heatmap Generator

This Python script calculates the frequency of each amino acid at each position from multiple sequence alignments (MSAs, as .fasta) and generates a heatmap visualization. It is designed to aid in the analysis of protein families by highlighting conservation and variability across sequences.


System Requirements

---

### Software Requirements

#### **OS Requirements**

This package should work on most operating systems, as it is a standalone Python script. It has been tested on the following systems:

- Linux: Ubuntu 20.04

#### **Dependencies**

No `environment.yml` file is provided. Please ensure the following Python packages are installed in your environment. Installation typically takes a few minutes.

```
python v.3.11.5
numpy v.1.26.2
matplotlib v.3.9.2
```

---

### Hardware Requirements

#### **Recommended System** and **Database Storage Requirements**

There are **no specific hardware requirements** for running this script. A standard laptop or desktop system is sufficient for typical use cases.
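Since no environment file is provided, a quick way to compare the current environment with the tested versions is a small check like the one below. This is only a sketch and not part of the repository. The helper name `installed_versions` and the `REQUIRED` mapping are made up for illustration, and the version numbers are the ones listed under Dependencies.

```python
import sys
from importlib import metadata

# Versions the README reports as tested (copied from the Dependencies list).
REQUIRED = {"numpy": "1.26.2", "matplotlib": "3.9.2"}


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing.

    Hypothetical helper for illustration; not part of the psfm script.
    """
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(REQUIRED)
print("python", sys.version.split()[0], "(tested with 3.11.5)")
for name, tested in REQUIRED.items():
    status = report[name] or "not installed"
    print(f"{name}: {status} (tested with {tested})")
```

Newer or older versions may well work. The check only reports differences from the tested setup and enforces nothing.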

Installation & Environment

This is a standalone Python script — no compilation or packaging is required. Simply clone the repository and make sure you have the required Python packages installed.


Function

- Parses sequences from `.fasta` files.
- Calculates amino acid frequencies at each position in the sequence alignment.
- Weights each family equally in the calculation of the average amino acid frequency, ensuring a balanced representation in the heatmap.
- Generates heatmaps to visualize the frequency of each amino acid.
- Implements a cut-off for underrepresentation of amino acids.
- Exports frequency data to a CSV file for further analysis.
- Allows for highlighting specific amino acids in the heatmap.
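The core calculation described above (per-position frequencies within each family, then an unweighted mean across families) can be sketched in plain Python. This is an illustrative reconstruction, not the script's actual code. The function names are hypothetical. Counting gaps in the denominator and zeroing values below the cut-off are assumptions, since how the script handles either is not documented here.

```python
from collections import Counter

# The 20 standard residues; gaps and other symbols are not counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_fasta(text):
    """Return the sequences found in FASTA-formatted text."""
    sequences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current:
                sequences.append("".join(current))
            current = []
        else:
            current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return sequences


def family_frequencies(sequences):
    """Per-column residue frequencies for one aligned family.

    Each column is a dict residue -> fraction of sequences with that residue.
    Assumption: the denominator is the number of sequences, gaps included.
    """
    if not sequences:
        raise ValueError("family contains no sequences")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must have equal length")
    n = len(sequences)
    matrix = []
    for column in zip(*sequences):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        matrix.append({aa: counts[aa] / n for aa in AMINO_ACIDS})
    return matrix


def average_across_families(family_matrices, cutoff=0.0):
    """Unweighted mean over families, so each family counts once.

    Assumption: averaged values below `cutoff` are set to 0.
    """
    length = len(family_matrices[0])
    if any(len(m) != length for m in family_matrices):
        raise ValueError("all families must share the same alignment length")
    k = len(family_matrices)
    averaged = []
    for pos in range(length):
        column = {}
        for aa in AMINO_ACIDS:
            mean = sum(m[pos][aa] for m in family_matrices) / k
            column[aa] = mean if mean >= cutoff else 0.0
        averaged.append(column)
    return averaged


# Toy input: one family with four sequences, one with a single sequence.
family_a = ">a1\nMKL\n>a2\nMKV\n>a3\nMRL\n>a4\nMKL\n"
family_b = ">b1\nMAL\n"
matrices = [family_frequencies(parse_fasta(f)) for f in (family_a, family_b)]
psfm = average_across_families(matrices, cutoff=0.2)
print(psfm[1]["K"], psfm[1]["A"], psfm[1]["R"])
```

In the toy input, the single-sequence family contributes as much to the average as the four-sequence family. That is the equal-weighting behaviour the list describes.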

Instructions

To use the script, prepare a list of file paths to your `.fasta` files containing the multiple sequence alignments. The MSAs can be generated with jackhmmer (see also https://github.com/thp42/SLiMFold) or retrieved by a BLAST search (https://blast.ncbi.nlm.nih.gov/Blast.cgi). Edit the `plt.savefig` call, the `output_csv_path` variable, and the `family_file_paths` list in the Python script accordingly. You can also add a highlighting sequence to see how an individual sequence fits the conserved pattern. The heatmap shows the average frequency across all provided MSAs. Each sequence family is weighted equally in the average, so every family contributes identically to the final visualization regardless of how many sequences it contains.
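To make the CSV export and the highlighting step more concrete, the sketch below writes a small frequency matrix to CSV and works out which heatmap cells a highlight sequence would mark. It is not the script's own code. The function names, the CSV layout (one row per amino acid, one column per position), and the example matrix are all assumptions made for illustration. In the real script the output location is set through `output_csv_path`.

```python
import csv
import io

# Row order assumed for the matrix and heatmap (20 standard residues).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def write_frequency_csv(matrix, handle):
    """Write one row per amino acid and one column per alignment position.

    `matrix` is a list of dicts (one per position) mapping residue -> frequency.
    Hypothetical layout; the script's real CSV format may differ.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["amino_acid"] + [str(i + 1) for i in range(len(matrix))])
    for aa in AMINO_ACIDS:
        writer.writerow([aa] + [f"{column.get(aa, 0.0):.3f}" for column in matrix])


def highlight_cells(highlight_sequence):
    """Return (row, column) heatmap cells matching a highlight sequence."""
    return [
        (AMINO_ACIDS.index(res), pos)
        for pos, res in enumerate(highlight_sequence.upper())
        if res in AMINO_ACIDS  # gaps and unknown symbols are skipped
    ]


# Made-up three-position matrix, for illustration only.
example_matrix = [
    {"M": 1.0},
    {"K": 0.375, "A": 0.5},
    {"L": 0.875, "V": 0.125},
]

buffer = io.StringIO()  # stands in for open(output_csv_path, "w", newline="")
write_frequency_csv(example_matrix, buffer)
csv_text = buffer.getvalue()
cells = highlight_cells("MKL")
print(csv_text.splitlines()[0])
print(cells)
```

The `(row, column)` pairs could then be used to outline cells on a Matplotlib heatmap, for example by drawing a rectangle around each one.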

Citation

If you use this script in published research, please cite:

- Manuscript: Discovery of a new evolutionarily conserved short linear F-actin binding motif. Themistoklis Paraschiakos, Biao Yuan, Kostiantyn Sopelniak, Michael Bucher, Lisa Simon, Ksenija Zonjic, Dominic Eggers, Franziska Selle, Jing Li, Stefan Linder, Thomas C. Marlovits, Sabine Windhorst. bioRxiv 2025.04.16.649135; doi: https://doi.org/10.1101/2025.04.16.649135


Owner

  • Login: thp42
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Paraschiakos"
  given-names: "Themistoklis"
  orcid: "https://orcid.org/0000-0003-1736-2561"
title: "PSFM"
version: 1.0.0
doi: 10.5281/zenodo.1234
date-released: 2024-02-25
url: "https://github.com/thp42/PSFM"

GitHub Events

Total
  • Push event: 9
Last Year
  • Push event: 9