ortho2tree

Pipeline for the selection of canonical proteins for reference proteomes

https://github.com/g-insana/ortho2tree

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 19 DOI reference(s) in README
  • Academic publication links
    Links to: pubmed.ncbi, ncbi.nlm.nih.gov, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.5%) to scientific vocabulary

Keywords

canonical isoforms msa proteomes refseq uniprot
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: g-insana
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 18.1 MB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Topics
canonical isoforms msa proteomes refseq uniprot
Created about 2 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

ortho2tree


The UniProt Reference Proteomes dataset seeks to provide complete proteomes for an evolutionarily diverse, less redundant, set of organisms.

As higher eukaryotes often encode multiple isoforms of a protein from a single gene, the Reference Proteome pipeline selects a single representative (‘canonical’) sequence. UniProt identifies canonical isoforms using a ‘Gene-Centric’ approach: proteins are grouped by gene-identifier and for each gene a single protein sequence is chosen.

For unreviewed (UniProtKB/TrEMBL) protein sequences (and for some reviewed sequences), the longest sequence in the Gene-Centric group is usually chosen as canonical. This can create inconsistencies, selecting canonical sequences with dramatically different lengths for orthologous genes.
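To make the inconsistency concrete, here is a minimal sketch of a longest-sequence rule applied per gene. The accessions, gene identifiers and lengths are invented for illustration, and `longest_per_gene` is a hypothetical helper, not part of UniProt or ortho2tree.

```python
from collections import defaultdict

# Hypothetical records: (accession, gene identifier, sequence length).
RECORDS = [
    ("ISO_A1", "geneA_human", 512),
    ("ISO_A2", "geneA_human", 498),
    ("ISO_B1", "geneA_mouse", 497),
    ("ISO_B2", "geneA_mouse", 803),  # unusually long isoform
]


def longest_per_gene(records):
    """Group records by gene and keep the longest sequence of each group."""
    groups = defaultdict(list)
    for accession, gene, length in records:
        groups[gene].append((length, accession))
    # max() on (length, accession) tuples picks the longest sequence;
    # ties are broken by accession, an arbitrary choice for this sketch.
    return {gene: max(members)[1] for gene, members in groups.items()}


canonicals = longest_per_gene(RECORDS)
print(canonicals)
```

In this toy data the two orthologous genes end up with canonicals of 512 and 803 residues, even though isoforms of nearly identical length (498 and 497) are available.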

The Ortho2tree data pipeline examines Gene-Centric canonical and isoform sequences from sets of orthologous proteins (from PantherDB), builds multiple alignments, constructs gap-distance trees, and identifies low-cost clades of isoforms with similar lengths. Canonical choices can be either confirmed or a better one proposed.

The pipeline and the underlying analysis are described in the journal article "Improved selection of canonical proteins for reference proteomes".

An overview of the pipeline is shown in this figure: ortho2tree pipeline overview

The pipeline can retrieve protein sequences using direct access to the UniProt databases or using the UniProt web API.

Data processing is done via pandas DataFrames employing vectorized operations, and all the orthogroups can be processed in parallel if multithreading is available. For each orthogroup the pipeline:

  • builds a Multiple Sequence Alignment (via muscle)
  • calculates a gap-based Neighbour-Joining tree (via BioPython, using a modified pairwise distance function focused on gaps)
  • scans the tree to identify low-cost clades
  • ranks the best low-cost clades to confirm existing canonicals or suggest replacements
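As an illustration of what a gap-focused pairwise distance could look like, the sketch below counts alignment columns where exactly one of two sequences has a gap and ignores residue substitutions. This definition, the function names and the toy alignment are assumptions made for the example; the modified distance function actually used by ortho2tree may differ.

```python
from itertools import combinations


def gap_distance(seq_a, seq_b, gap="-"):
    """Fraction of compared columns where exactly one sequence has a gap.

    Columns where both sequences have a gap are skipped, and residue
    substitutions do not contribute to the distance.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatched_gaps = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            continue
        compared += 1
        if (a == gap) != (b == gap):
            mismatched_gaps += 1
    return mismatched_gaps / compared if compared else 0.0


def gap_distance_matrix(alignment):
    """Return {(name_a, name_b): distance} for every pair in the alignment."""
    return {
        (name_a, name_b): gap_distance(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(alignment, 2)
    }


# Toy alignment (invented): iso1 and iso3 share the same gap pattern.
ALIGNMENT = {
    "iso1": "MKT--LLVA",
    "iso2": "MKTAALLVA",
    "iso3": "MRT--LLVS",
}
distances = gap_distance_matrix(ALIGNMENT)
for pair, dist in distances.items():
    print(pair, round(dist, 3))
```

With a distance of this kind, isoforms that differ only by substitutions sit at distance zero, so a tree built from the matrix groups sequences by shared insertion and deletion patterns rather than by sequence identity.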

Contents of the repository

ortho2tree.py       # main script to use to run the pipeline on the command line
ortho2tree.ipynb    # jupyter notebook to run the pipeline interactively
ortho2tree/         # modules folder
requirements.txt    # list of needed packages
README.md           # this text
MS_figures_src/     # all of the datafiles and .R code to recreate the figures in the manuscript
test/               # folder containing data ready for a quick test run
test.cfg            # configuration file for the quick test run
qfomam.cfg          # configuration file for the qfomam2022_05 analysis described in the manuscript

INSTALLATION

  • git clone the repository:

git clone https://github.com/g-insana/ortho2tree.git

  • install requirements (virtual environment is optional but recommended) via pip or conda/mamba:

via pip:

cd ortho2tree && python3 -m venv venv_o2t
source venv_o2t/bin/activate
pip3 install -r requirements.txt

via conda or mamba:

cd ortho2tree && mamba create --name ortho2tree --file requirements.txt --channel conda-forge
mamba activate ortho2tree

Note that you also need to install muscle for multiple sequence alignments, either version v3.8.31 or the newer v5.1. Please check ortho2tree/config_muscle.py and update it to match your installation, so that the muscle executable can be found and the correct format is set (according to the muscle version used).

e.g. via conda or mamba:

```
# EITHER:
mamba install -c bioconda "muscle<=4.0" #3.8.31

# OR:
mamba install -c bioconda 'muscle>=5.0' #5.1
```
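One reason the configuration has to match the installed version is that the two releases take different command-line options: muscle 3.8.31 uses `-in`/`-out`, while muscle 5 uses `-align`/`-output`. The following sketch checks whether a `muscle` executable is on the PATH and builds the corresponding command line. The helper names and file names are hypothetical, and this does not reproduce the contents of ortho2tree/config_muscle.py.

```python
import shutil


def find_muscle(executable="muscle"):
    """Return the full path of the muscle executable, or None if not on PATH."""
    return shutil.which(executable)


def muscle_command(major_version, fasta_in, aligned_out, executable="muscle"):
    """Build the alignment command line for muscle 3.x or 5.x."""
    if major_version == 3:
        return [executable, "-in", fasta_in, "-out", aligned_out]
    if major_version == 5:
        return [executable, "-align", fasta_in, "-output", aligned_out]
    raise ValueError(f"unsupported muscle major version: {major_version}")


path = find_muscle()
print("muscle found at:", path if path else "not found on PATH")
print(muscle_command(3, "group.fasta", "group.aln"))
print(muscle_command(5, "group.fasta", "group.aln"))
```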

QUICK TEST TO CHECK INSTALLATION

  • test run of a single group

./ortho2tree.py -set test -id PTHR43715:SF1

  • example of full analysis run of a set

./ortho2tree.py -set test -no_stats

COMMAND LINE USAGE

```
usage: ortho2tree.py [-h] -set DATASET_NAME [-d] [-nocache] [-no_stats]
                     [-id SINGLE_GROUP [SINGLE_GROUP ...]]
                     [-file LIST_FILENAME] [-sugg SUGG_FILE]
                     [-prevgc PREVGC_FILE] [-outstamp OUTSTAMP]

optional arguments:
  -h, --help            show this help message and exit
  -set DATASET_NAME     set for the analysis. a file SET.cfg should be present
  -d                    print verbose/debug messages
  -nocache              do not use cache, re-create alignments/trees and do
                        not save them
  -no_stats             do not print any stats on the dataframe
  -id SINGLE_GROUP [SINGLE_GROUP ...]
                        to only work on one or few group(s)
  -file LIST_FILENAME   to work on a series of groups, from a file
  -sugg SUGG_FILE       to simulate integration of canonical suggestions
                        reading a previously generated changes file; note that
                        file should be placed in the set main dir
  -prevgc PREVGC_FILE   to integrate previously generated changes file; note
                        that file should be placed in the set main dir
  -outstamp OUTSTAMP    to name and timestamp the output files and the dumps;
                        this overrides the outstamp parameter from the config

Examples:
   -set=qfomam                                 #will do the analysis on the whole set
   -set=qfomam -id=PTHR19918:SF1               #only for one orthogroup
   -set=qfomam -id=PTHR19918:SF1 PTHR40139:SF1 #only for two orthogroups
   -set=qfomam -file=list_of_ids.txt           #for a series of groups listed in a file

```
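When the pipeline is driven from another script, the command line can be assembled programmatically. The sketch below only uses the `-set` and `-id` options shown in the usage text; `build_command` is a hypothetical helper, and the script path assumes the current directory is the repository root.

```python
import shlex


def build_command(dataset, group_ids=None, script="./ortho2tree.py"):
    """Assemble an ortho2tree command line as a list of arguments."""
    cmd = [script, "-set", dataset]
    if group_ids:
        # -id accepts one or more orthogroup identifiers
        cmd += ["-id", *group_ids]
    return cmd


cmd = build_command("qfomam", ["PTHR19918:SF1", "PTHR40139:SF1"])
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; keeping the arguments as a list avoids any shell quoting issues with the orthogroup identifiers.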

CONFIGURATION

Please check the example YAML configuration files provided for the list of parameters, e.g. the test YAML configuration file (test.cfg).

DOCUMENTATION

Please refer to the DOCS.md file for information on how to setup a new analysis and how to interpret the output produced.

Analysis of UP2022_05 QfO mammals

The manuscript "Improved selection of canonical proteins for reference proteomes" (preprint) describes the ortho2tree analysis of eight QfO (Quest for Orthologs) mammalian proteomes, based on UniProtKB data (release UP2022_05).

See the folder MS_figures_src for datafiles and .R code to recreate the figures in the manuscript.

To replicate the analysis from the paper:

wget -O qfomam.tar.gz 'https://zenodo.org/records/10778115/files/qfomam.tar.gz?download=1' #retrieve the archive
tar xfz qfomam.tar.gz #uncompress the archive
./ortho2tree.py -set qfomam -id PTHR43715:SF1 #run a single orthogroup
./ortho2tree.py -set qfomam -outstamp $(date +%y%m%d) #do the analysis

The Zenodo archive (saved above as qfomam.tar.gz) contains pre-computed alignments, trees and clades; it is alternatively available from Figshare.

A web interface for filtering and viewing the pdf files (with trees and alignments for each orthogroup) from the result of that analysis (and subsequent ones) is available at fasta.bioch.virginia.edu/ortho2tree

The pdf files, generated whenever canonicals were confirmed or changes were proposed, are available as a Zenodo archive: qfomampdfdata.tgz (or alternatively from Figshare).

A script to generate the pdf files is included under the folder pdfcreation/


CITATION

If you find this software useful, please consider citing our paper (PubMed 38130879): Insana, G., Martin, M.J. & Pearson, W.R. Improved selection of canonical proteins for reference proteomes. NAR Genomics and Bioinformatics (2024). https://doi.org/10.1093/nargab/lqae066

Bibtex:

@article{10.1093/nargab/lqae066,
    author = {Insana, Giuseppe and Martin, Maria J and Pearson, William R},
    title = "{Improved selection of canonical proteins for reference proteomes}",
    journal = {NAR Genomics and Bioinformatics},
    volume = {6},
    number = {2},
    pages = {lqae066},
    year = {2024},
    month = {06},
    abstract = "{The ‘canonical’ protein sets distributed by UniProt are widely used for similarity searching, and functional and structural annotation. For many investigators, canonical sequences are the only version of a protein examined. However, higher eukaryotes often encode multiple isoforms of a protein from a single gene. For unreviewed (UniProtKB/TrEMBL) protein sequences, the longest sequence in a Gene-Centric group is chosen as canonical. This choice can create inconsistencies, selecting >95\% identical orthologs with dramatically different lengths, which is biologically unlikely. We describe the ortho2tree pipeline, which examines Reference Proteome canonical and isoform sequences from sets of orthologous proteins, builds multiple alignments, constructs gap-distance trees, and identifies low-cost clades of isoforms with similar lengths. After examining 140 000 proteins from eight mammals in UniProtKB release 2022\_05, ortho2tree proposed 7804 canonical changes for release 2023\_01, while confirming 53 434 canonicals. Gap distributions for isoforms selected by ortho2tree are similar to those in bacterial and yeast alignments, organisms unaffected by isoform selection, suggesting ortho2tree canonicals more accurately reflect genuine biological variation. 82\% of ortho2tree proposed-changes agreed with MANE; for confirmed canonicals, 92\% agreed with MANE. Ortho2tree can improve canonical assignment among orthologous sequences that are >60\% identical, a group that includes vertebrates and higher plants.}",
    issn = {2631-9268},
    doi = {10.1093/nargab/lqae066},
    url = {https://doi.org/10.1093/nargab/lqae066},
}

Owner

  • Name: Giuseppe Insana
  • Login: g-insana
  • Kind: user
  • Location: Cambridge, UK
  • Company: EMBL-EBI @embl-ebi

Data scientist, biology and linguistic expert, business consultant.

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use this software, please cite it as below"
authors:
- family-names: "Insana"
  given-names: "Giuseppe"
  orcid: "https://orcid.org/0000-0002-8186-1026"
- family-names: "Martin"
  given-names: "Maria J."
  orcid: "https://orcid.org/0000-0001-5454-2815"
- family-names: "Pearson"
  given-names: "William R."
  orcid: "https://orcid.org/0000-0002-0727-3680"
title: "Pipeline for the selection of canonical proteins for reference proteomes"

publication:
  authors:
  - family-names: "Insana"
    given-names: "Giuseppe"
    orcid: "https://orcid.org/0000-0002-8186-1026"
  - family-names: "Martin"
    given-names: "Maria J."
    orcid: "https://orcid.org/0000-0001-5454-2815"
  - family-names: "Pearson"
    given-names: "William R."
    orcid: "https://orcid.org/0000-0002-0727-3680"
  doi: "10.1093/nargab/lqae066"
  title: "Improved selection of canonical proteins for reference proteomes"
  journal: "NAR genomics and bioinformatics"
  year: 2024
  
related_resources:
  - type: "code"
    url: "https://github.com/g-insana/ortho2tree"

version: 2024.04.16
date-released: 2024.05.04

GitHub Events

Total
  • Push event: 2
Last Year
  • Push event: 2

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

requirements.txt pypi
  • biopython >=1.78
  • numpy >=1.23.5
  • pandas >=2.0.3
  • pyyaml *
  • requests >=2.25.1
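A quick way to see which of these packages are present in the active environment is to query the installed distribution metadata. This is a minimal sketch using only the standard library; it reports versions but does not compare them against the minimums listed above, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Distribution names as listed in requirements.txt.
REQUIRED = ["biopython", "numpy", "pandas", "pyyaml", "requests"]


def installed_versions(packages):
    """Map each package name to its installed version string, or None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```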