pepgenome

Proteogenomics tool that maps peptides to genome coordinates

https://github.com/bigbio/pepgenome

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.6%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: bigbio
  • License: apache-2.0
  • Language: Java
  • Default Branch: master
  • Size: 129 MB
Statistics
  • Stars: 3
  • Watchers: 5
  • Forks: 0
  • Open Issues: 4
  • Releases: 1
Created almost 5 years ago · Last pushed almost 5 years ago
Metadata Files
Readme License Citation

README.md

PepGenome

PepGenome tool

In proteogenomic analyses it is essential to know the loci giving rise to peptides in order to improve genomic annotation and the functional characterization of protein products in their biological context. With next-generation sequencing of DNA and RNA available for each sample studied by proteomic mass spectrometry, integration and visualisation in a common coordinate system, i.e. the genome, is vital for systems biology. Advances in mass spectrometry technology now allow almost complete quantification of the sample proteome. As research moves towards protein quantitative trait loci (pQTL) to identify genomic alterations with functional effects on the proteome, and given the high complexity of combinations thereof, integration and visualisation of protein and peptide quantification on genomic loci is paramount for this type of analysis. Furthermore, as studies move towards more personal multi-omics, comparative visualisation of proteomic data on a genome has been lacking.

Not only has genomic variation affecting proteins come into the focus of functional integration studies; post-translational modifications (PTMs), the effect of single nucleotide variants and other alterations on PTMs and alternative modification loci, and the effects of alternative PTMs on protein abundance have also become a centre of attention for researchers. To facilitate this type of integration, not only the genomic locations of modified peptides but specifically the genomic loci associated with these modifications are required. Here, we provide a mapping tool, PepGenome, to quickly and efficiently identify genomic loci of peptides and post-translational modifications and couple these mappings with associated quantitative values over multiple samples.

Using a reference gene annotation and the associated transcript translations, our tool identifies the genomic loci of peptides given as input and generates output in different formats borrowed from genomics and transcriptomics, which can be loaded into various genome browsers such as the UCSC Genome Browser, the Ensembl Genome Browser, BioDalliance, and the Integrative Genomics Viewer.

Learn and Support

PepGenome uses transcript translations and reference gene annotations to identify the genomic loci of peptides and post-translational modifications. Multiple occurrences of peptides in the input data resulting in the same genomic loci will be collapsed as a single occurrence in the output.

Input format

The input format required by PepGenome is a tab-delimited file with four columns.

| Column | Column header | Description |
|---|---|---|
| 1 | Sample | Name of sample or experiment |
| 2 | Peptide | Peptide sequence with PSI-MS modification names in round brackets following the modified amino acid, e.g. PEPT(Phospho)IDE for a phosphorylated threonine |
| 3 | PSMs | Number of peptide-spectrum matches (PSMs) for the given peptide |
| 4 | Quant | Quantitative value for the given peptide in the given sample |
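
The bracket notation in the Peptide column can be illustrated with a minimal sketch. The class `PeptideNotation`, its methods, and the sample row are hypothetical and are not part of PepGenome; the sketch assumes modification names contain no nested brackets.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of PepGenome) for the Peptide column notation. */
final class PeptideNotation {

    /** A modification name and the 1-based position of the residue it follows. */
    static final class Mod {
        final int position;
        final String name;

        Mod(int position, String name) {
            this.position = position;
            this.name = name;
        }
    }

    /** Returns the peptide with every "(...)" annotation removed. */
    static String stripModifications(String peptide) {
        return peptide.replaceAll("\\([^)]*\\)", "");
    }

    /** Collects each "(name)" annotation with the position of the preceding residue. */
    static List<Mod> modifications(String peptide) {
        List<Mod> mods = new ArrayList<>();
        int residues = 0; // number of amino acids seen so far
        int i = 0;
        while (i < peptide.length()) {
            if (peptide.charAt(i) == '(') {
                int close = peptide.indexOf(')', i);
                if (close < 0) {
                    throw new IllegalArgumentException("Unclosed bracket in " + peptide);
                }
                mods.add(new Mod(residues, peptide.substring(i + 1, close)));
                i = close + 1;
            } else {
                residues++;
                i++;
            }
        }
        return mods;
    }

    public static void main(String[] args) {
        // Hypothetical input row: Sample, Peptide, PSMs, Quant (tab separated).
        String row = "sample1\tPEPT(Phospho)IDE\t3\t1250.5";
        String[] cols = row.split("\t");
        System.out.println(stripModifications(cols[1]));
        for (Mod m : modifications(cols[1])) {
            System.out.println(m.name + " at residue " + m.position);
        }
    }
}
```

For the example from the table, PEPT(Phospho)IDE, this yields the bare sequence PEPTIDE and one Phospho modification on residue 4.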

Additional Input Files:

In addition, the tool supports input in the mzTab file format.

How to run

    $ java -jar pepgenome-{version}-bin.jar

Output formats

BED

This format contains the genomic loci for peptides, the exon-structure, the peptide sequence, as well as a colour code for uniqueness of peptides within the genome.

Colour Description
Peptide is unique to single gene AND single transcript
Peptide is unique to single gene BUT shared between multiple transcripts
Peptide is shared between multiple genes
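
To read such a file programmatically, a minimal sketch could look as follows. It assumes that the BED output follows the standard 12-column UCSC BED layout (0-based, half-open coordinates, exon structure in the block columns); the class `BedLine` and the sample line, including its colour value, are hypothetical and not taken from PepGenome.

```java
/** Hypothetical reader for one line in the standard 12-column UCSC BED layout. */
final class BedLine {
    final String chrom;
    final long start;      // 0-based start of the feature
    final long end;        // end of the feature (exclusive)
    final String name;     // assumed here to hold the peptide sequence
    final String strand;
    final String itemRgb;  // colour as "R,G,B"
    final int[] blockSizes;
    final int[] blockStarts; // block starts relative to the feature start

    private BedLine(String[] f) {
        chrom = f[0];
        start = Long.parseLong(f[1]);
        end = Long.parseLong(f[2]);
        name = f[3];
        strand = f[5];
        itemRgb = f[8];
        int blockCount = Integer.parseInt(f[9]);
        blockSizes = toInts(f[10], blockCount);
        blockStarts = toInts(f[11], blockCount);
    }

    static BedLine parse(String line) {
        String[] f = line.split("\t");
        if (f.length < 12) {
            throw new IllegalArgumentException("Expected 12 BED columns, got " + f.length);
        }
        return new BedLine(f);
    }

    private static int[] toInts(String csv, int expected) {
        String[] parts = csv.split(","); // a trailing comma is dropped by split
        if (parts.length != expected) {
            throw new IllegalArgumentException("Expected " + expected + " values in " + csv);
        }
        int[] out = new int[expected];
        for (int i = 0; i < expected; i++) {
            out[i] = Integer.parseInt(parts[i].trim());
        }
        return out;
    }

    /** Genomic start of block i. */
    long blockStart(int i) { return start + blockStarts[i]; }

    /** Genomic end (exclusive) of block i. */
    long blockEnd(int i) { return blockStart(i) + blockSizes[i]; }

    public static void main(String[] args) {
        // Hypothetical peptide spanning two exons; the colour value is arbitrary.
        BedLine bed = parse("1\t1000\t1142\tPEPTIDE\t1000\t+\t1000\t1142\t128,128,128\t2\t9,12\t0,130");
        for (int i = 0; i < bed.blockSizes.length; i++) {
            System.out.println(bed.chrom + ":" + bed.blockStart(i) + "-" + bed.blockEnd(i));
        }
    }
}
```

In the sample line, the seven residues of the peptide correspond to 21 nucleotides, split into blocks of 9 and 12 nucleotides on either side of an intron.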

PTMBED

Like BED, but additionally containing the location of the post-translational modification on the genome. The thick part of a peptide block marks the modification: a short thick block indicates a modification on a single amino acid, while a longer thick block spans from the first to the last post-translational modification, including the residues in between. In PTMBED the colour code is changed to indicate the type of modification.

Colour Post-translational Modification
Phosphorylation (phospho)
Acetylation (acetyl)
Amidation (amidated)
Oxidation (oxidation)
Methylation (methyl)
Ubiquitinylation (glygly; gg)
Sulfation (sulfo)
Palmitoylation (palmitoyl)
Formylation (formyl)
Deamidation (deamidated)
Any other post-translational modification

GTF

Besides the genomic loci, this output format contains the annotated information for the genes giving rise to each peptide sequence, including status and biotype. For each mapped peptide, the sample, the number of peptide-spectrum matches and the associated quantitative value are added as tags.

GCT

In this format the peptide sequences are combined with the Ensembl gene identifier. It contains the genomic loci for each peptide as well as the quantitative values for each peptide in different samples as a matrix.
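
The peptide-by-sample matrix can be pictured with a small sketch that pivots (sample, peptide, quant) triples from the input file into one row per peptide and one column per sample. This is not the GCT writer used by PepGenome: the class `QuantMatrix`, the sample and peptide names, and the use of `NA` for missing values are assumptions made for illustration, and GCT-specific header lines are left out.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical pivot of input rows into a peptide-by-sample matrix. */
final class QuantMatrix {
    // Samples and peptides are kept in order of first appearance.
    private final Set<String> samples = new LinkedHashSet<>();
    private final Map<String, Map<String, Double>> rows = new LinkedHashMap<>();

    /** Stores the quantitative value of one peptide in one sample. */
    void add(String sample, String peptide, double quant) {
        samples.add(sample);
        rows.computeIfAbsent(peptide, k -> new HashMap<>()).put(sample, quant);
    }

    /** Renders a tab-separated table; "NA" marks peptides not seen in a sample. */
    String toTable() {
        StringBuilder sb = new StringBuilder("Peptide");
        for (String s : samples) {
            sb.append('\t').append(s);
        }
        sb.append('\n');
        for (Map.Entry<String, Map<String, Double>> row : rows.entrySet()) {
            sb.append(row.getKey());
            for (String s : samples) {
                Double v = row.getValue().get(s);
                sb.append('\t').append(v == null ? "NA" : v.toString());
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        QuantMatrix m = new QuantMatrix();
        m.add("s1", "PEPTIDE", 1250.5);
        m.add("s2", "PEPTIDE", 980.0);
        m.add("s1", "ANOTHER", 42.0);
        System.out.print(m.toTable());
    }
}
```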

Usage

Required arguments:

-fasta TRANSL
Filepath for file containing protein sequences in FASTA format
-gtf ANNO
Gene annotation with coding sequences (CDS) in GTF format
-in *.tsv
Path to single input file or comma separated list of paths to input files containing peptides to be mapped with associated number of peptide to spectrum matches, sample name and quantitative value (see input file format)

Optional arguments:

-format OUTF
Set output format GTF, GCT, BED, PTMBED or ALL. Comma separated combination possible. Default = ALL
-merge TRUE/FALSE
Set TRUE to merge output of multiple input files (output will be named after last input file *merged). Default = FALSE
-source SRC
Source name to be written to the second column of the output GTF file
-mm NUM
Number of mismatches allowed in mapping (0, 1 or 2). DEFAULT = 0
-mmmode TRUE/FALSE
Set TRUE to restrict the number of mismatches in a kmer to 1. DEFAULT = FALSE
-genome GENOME
Filepath for the file containing genome sequences in Ensembl FASTA format. Used to identify chromosome names and order and to differentiate between chromosomes and scaffolds. If not set, chromosome names are extracted from the GTF file without differentiation between chromosomes and scaffolds
-chr NUM
Export chr prefix. Allowed values: 0, 1. DEFAULT = 0
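
When PepGenome is called from another Java program, the arguments above can be assembled into a command line. The following is a minimal sketch: the class `PepGenomeCommand` and all file names are hypothetical, only flags documented above are used, and the `{version}` placeholder in the jar name has to be replaced with the release that is actually installed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that assembles a PepGenome command line. */
final class PepGenomeCommand {

    /**
     * Builds the argument list from the three required arguments.
     * The output format is optional; pass null to keep the tool default (ALL).
     */
    static List<String> build(String jar, String fasta, String gtf,
                              List<String> inputs, String format) {
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java", "-jar", jar,
                "-fasta", fasta,
                "-gtf", gtf,
                "-in", String.join(",", inputs))); // comma separated list of input files
        if (format != null) {
            cmd.add("-format");
            cmd.add(format);
        }
        return cmd;
    }

    public static void main(String[] args) {
        // All file names below are placeholders.
        List<String> cmd = build("pepgenome-{version}-bin.jar",
                "translations.fa", "annotation.gtf",
                Arrays.asList("sample1.tsv", "sample2.tsv"), "BED,GTF");
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list can be handed to `java.lang.ProcessBuilder` to start the tool as a separate process.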

Owner

  • Name: BigBio Stack
  • Login: bigbio
  • Kind: organization
  • Email: proteomicsstack@gmail.com
  • Location: Cambridge, UK

Provide big data solutions for Bioinformatics

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 4
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 3.75
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ypriverol (4)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

pom.xml maven
  • commons-cli:commons-cli 1.4
  • io.github.bigbio.external:pia 1.3.20
  • io.github.bigbio.pgatk:pgatk-io 1.0.1-SNAPSHOT
  • log4j:log4j 1.2.17
  • org.apache.commons:commons-lang3 3.8
  • org.ehcache:sizeof 0.4.0
  • org.mapdb:mapdb 3.0.7
  • org.projectlombok:lombok 1.18.2
  • uk.ac.ebi.jmzidml:jmzidentml 1.2.11
  • junit:junit 4.13.1 test