vespasian

Genome-scale detection of selective pressure variation

https://github.com/bede/vespasian

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.2%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: bede
  • License: gpl-3.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 8.48 MB
Statistics
  • Stars: 3
  • Watchers: 3
  • Forks: 1
  • Open Issues: 10
  • Releases: 8
Created almost 8 years ago · Last pushed over 2 years ago
README.md

Vespasian

Vespasian performs genome-scale detection of site and branch-site signatures of positive selection by orchestrating evolutionary hypothesis tests with PAML. Given a collection of alignments of protein-coding orthologous gene families and labelled trees, Vespasian infers gene trees from a species tree and evaluates site- and lineage-specific models of evolution. Model testing is CPU-intensive but embarrassingly parallel, and can be executed on one or many machines with Snakemake. Vespasian is the pure Python successor to VESPA by Webb et al. (2017).

Installation

Installing Miniconda

If the conda package manager is already installed, skip this step. Otherwise:

Linux

  • Install Miniconda, following instructions and accepting default options:

```bash
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```

macOS

An x86_64 Miniconda installation is required in order to install Vespasian.

  • If using a Mac with an Intel processor, skip this step. Otherwise:

```bash
arch -x86_64 zsh
```

  • Install Miniconda, following instructions and accepting default options:

```bash
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh
```

Installing Vespasian

  • If using a Mac with an Intel processor, skip this step. Otherwise:

```bash
arch -x86_64 zsh
```

  • Install Vespasian:

```bash
curl -OJ https://raw.githubusercontent.com/bede/vespasian/master/environment.yml
conda env create -f environment.yml
conda activate vespasian
```

  • Test Vespasian:

```bash
vespasian version
```

Development install

```bash
conda create -y -n vespasian-dev python=3.11 paml==4.10.6 -c conda-forge -c bioconda
conda activate vespasian-dev
git clone https://github.com/bede/vespasian
pip install --editable './vespasian[dev]'
```

Usage

Step 1: gene tree inference from a species tree

e.g. vespasian infer-gene-trees --warnings --progress input tree

  • Required input (please read carefully):
    • input Path to directory containing orthologous gene families as individual nucleotide alignments in fasta format with a .fasta or .fa extension. These should be in frame and free from stop codons. Fasta headers should contain a taxonomic identifier (mirroring tip labels in the tree file), optionally followed by a separator character ('|' by default). A minimum of seven taxa must be present.
    • tree Path to species tree in Newick format. Tip labels must correspond to fasta headers before the separator character.
  • Output:
    • Directory (default name gene-trees) containing minimal gene trees for each family.
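
The input requirements above can be checked before running infer-gene-trees. The following is a minimal sketch and not part of Vespasian: `parse_fasta` and `check_family` are hypothetical helper names, the stop codon set assumes the standard genetic code, and the seven-taxon minimum and '|' separator are taken from the description above.

```python
# Hypothetical pre-flight checks for one gene family; not Vespasian's own code.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def parse_fasta(text):
    """Parse FASTA text into a {header: sequence} dict."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def check_family(records, tip_labels, separator="|", min_taxa=7):
    """Return a list of human-readable problems found in one alignment."""
    problems = []
    taxa = set()
    for header, seq in records.items():
        # The taxon name is whatever precedes the separator character.
        taxon = header.split(separator)[0]
        taxa.add(taxon)
        if taxon not in tip_labels:
            problems.append(f"{header}: taxon '{taxon}' is not a tip in the tree")
        if len(seq) % 3 != 0:
            problems.append(f"{header}: length {len(seq)} is not a multiple of 3")
        # Only complete codons are inspected for stops.
        codons = (seq[i:i + 3].upper() for i in range(0, len(seq) - 2, 3))
        if any(codon in STOP_CODONS for codon in codons):
            problems.append(f"{header}: contains a stop codon")
    if len(taxa) < min_taxa:
        problems.append(f"only {len(taxa)} taxa present, {min_taxa} required")
    return problems


if __name__ == "__main__":
    family = parse_fasta(">cat|g1\nATGGCC\n>dog|g1\nATGTAAGC\n")
    for problem in check_family(family, tip_labels={"cat", "dog"}):
        print(problem)
```

An empty list means the family passed these particular checks; Vespasian may apply further or different checks of its own.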

```
$ vespasian infer-gene-trees -h
usage: vespasian infer-gene-trees [-h] [-o OUTPUT] [-s SEPARATOR] [-w] [-p] input tree

Create gene trees by pruning a given species tree

positional arguments:
  input                 path to directory containing gene families
  tree                  path to newick formatted species tree

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        path to output directory (default: 'gene-trees')
  -s SEPARATOR, --separator SEPARATOR
                        character separating taxon name and identifier(s) (default: '|')
  -w, --warnings        show warnings (default: False)
  -p, --progress        show progress bar (default: False)
```
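
To illustrate what pruning a species tree down to the taxa present in one gene family involves, here is a minimal standard-library sketch. It is not Vespasian's implementation (the project lists treeswift as a dependency), and it makes simplifying assumptions: labels are unquoted, and branch lengths and internal node labels are discarded. The function names are hypothetical.

```python
import re

STRUCTURAL = {"(", ")", ",", ";"}


def parse_newick(newick):
    """Parse Newick text into nested lists (internal nodes) and strings (tips)."""
    text = re.sub(r"\s+", "", newick)
    tokens = re.findall(r"[(),;]|[^(),;]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1  # consume "("
            children = [node()]
            while tokens[pos] == ",":
                pos += 1  # consume ","
                children.append(node())
            pos += 1  # consume ")"
            # Skip an optional internal label and/or branch length.
            if pos < len(tokens) and tokens[pos] not in STRUCTURAL:
                pos += 1
            return children
        label = tokens[pos].split(":")[0]  # drop any branch length
        pos += 1
        return label

    return node()


def prune(tree, keep):
    """Drop tips not in `keep` and collapse nodes left with a single child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return kept


def to_newick(tree):
    """Serialise the nested structure back to a topology-only Newick string."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


def prune_newick(newick, keep):
    pruned = prune(parse_newick(newick), set(keep))
    if pruned is None:
        raise ValueError("none of the requested taxa are in the tree")
    return to_newick(pruned) + ";"


if __name__ == "__main__":
    species = "((cat:0.1,dog:0.2):0.3,(human:0.1,mouse:0.2):0.1,cow:0.4);"
    print(prune_newick(species, {"cat", "human", "mouse"}))  # (cat,(human,mouse));
```

Collapsing single-child nodes keeps the pruned tree free of redundant internal nodes, which is one plausible reading of "minimal" gene trees.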

Step 2: Configure model test environments

e.g. vespasian codeml-setup --progress --warnings --branches branches.yml input gene-trees

  • Required input:

    • input Path to directory containing aligned orthologous gene families as individual fasta files.
    • gene-trees Path to directory containing minimal gene trees.
  • Optional input:

    • --branches BRANCHES Path to a YAML file containing a mapping of lineages to be labelled for evaluation of lineage-specific evolutionary signal using branch-site tests. To label an individual leaf node taxon, specify its name followed by a colon. To label an internal node, choose a suitable name (e.g. carnivora) followed by a colon and its corresponding leaf nodes inside square brackets (a sequence in YAML), separated by commas. For internal nodes, all child nodes present in the species tree must be specified, even if they are not present in all of the gene families.

```yaml
cat:
carnivora: [cat, dog]
```
  • Output:

    • Directory (default name codeml) containing nested directory structure of models and starting parameters for each gene family.
    • File codeml-commands.sh containing list of commands to execute the model tests
    • File Snakefile for running the contents of codeml-commands.sh locally or using a cluster

N.B. By default, at least two taxa must be present within a given family for a named internal node to be labelled. Use --strict to skip named internal nodes unless all child leaf nodes are present.
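
The labelling rule in the note above can be sketched as follows. This is an illustration under stated assumptions rather than Vespasian's code: `labelled_branches` is a hypothetical name, the `branches` dict mirrors what loading a branches file with PyYAML's `safe_load` would produce (a leaf entry such as `cat:` maps to `None`), skipping absent leaf taxa is assumed, and `bear` is an invented third taxon added to make the two modes differ.

```python
def labelled_branches(branches, family_taxa, strict=False):
    """Return the branch labels that would apply to one gene family.

    branches:    mapping of label -> list of leaf taxa, or None for a leaf label
    family_taxa: taxa present in the gene family
    strict:      require every listed leaf to be present (as with --strict)
    """
    family_taxa = set(family_taxa)
    labels = []
    for name, children in branches.items():
        if children is None:
            # Leaf label: assumed to apply only if the taxon itself is present.
            if name in family_taxa:
                labels.append(name)
            continue
        present = family_taxa.intersection(children)
        # Default: at least two listed taxa present; strict: all of them.
        required = len(set(children)) if strict else 2
        if len(present) >= required:
            labels.append(name)
    return labels


if __name__ == "__main__":
    branches = {"cat": None, "carnivora": ["cat", "dog", "bear"]}
    family = {"cat", "dog", "human"}
    print(labelled_branches(branches, family))               # ['cat', 'carnivora']
    print(labelled_branches(branches, family, strict=True))  # ['cat']
```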

```
$ vespasian codeml-setup -h
usage: vespasian codeml-setup [-h] [-b BRANCHES] [-o OUTPUT] [--separator SEPARATOR] [--strict] [-t THREADS] [-w] [-p] input gene-trees

Create suite of branch and branch-site codeml environments

positional arguments:
  input                 path to directory containing aligned gene families
  gene-trees            path to directory containing gene trees

optional arguments:
  -h, --help            show this help message and exit
  -b BRANCHES, --branches BRANCHES
                        path to yaml file containing branches to be labelled (default: -)
  -o OUTPUT, --output OUTPUT
                        path to output directory (default: 'codeml')
  --separator SEPARATOR
                        character separating taxon name and identifier(s) (default: '|')
  --strict              label only branches with all taxa present in tree (default is >= 2) (default: False)
  -t THREADS, --threads THREADS
                        number of parallel workers (default: 6)
  -w, --warnings        show warnings (default: False)
  -p, --progress        show progress bar (default: False)
```

Step 3: Run models

e.g. cd codeml && snakemake --cores 8

  • Ensure the codeml binary is on your $PATH
  • Using PAML version 4.9=h01d97ff_5 from Conda is recommended
  • cd codeml (the directory created by codeml-setup in step 2)
  • Local execution (for small jobs)
    • snakemake -k --cores 8 (recommended)
    • Or, using GNU parallel (not recommended – doesn't catch errors!)
    • parallel --bar :::: codeml-commands.sh
  • Cluster execution

    • snakemake -k --cores MAXJOBS --cluster OPTIONS
    • SGE example:

Step 4: Report model tests and positively selected sites

e.g. vespasian report --progress input

  • Required input (please read carefully):
    • input path to directory (default codeml) containing models configured in step 2 and executed in step 3
  • Output:
    • Directory containing per-gene tables of likelihood ratio test results, model parameters, and positively selected sites from the highest scoring models.

```
$ vespasian report -h
usage: vespasian report [-h] [-o OUTPUT] [--hide] [-p] input

Perform likelihood ratio tests and report positively selected sites

positional arguments:
  input                 path to codeml-setup output directory

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        path to output directory (default: 'report-codeml')
  --hide                hide gratuitous emperor portrait (default: False)
  -p, --progress        show progress bar (default: False)
```
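
For readers unfamiliar with the likelihood ratio tests that the report step performs, the arithmetic can be sketched with the standard library alone. This is an illustration, not Vespasian's code: `lrt` is a hypothetical name, the log-likelihoods are made-up numbers, and the degrees of freedom are an assumption based on common codeml practice (typically 2 for site-model pairs such as M1a vs M2a or M7 vs M8, and 1 for branch-site model A against its null).

```python
import math


def lrt(lnl_null, lnl_alt, df):
    """Likelihood ratio test between nested models.

    Returns (statistic, p_value), where statistic = 2 * (lnL_alt - lnL_null)
    and the p-value is the chi-square upper tail for df = 1 or df = 2,
    both of which have closed forms.
    """
    # Clamp at zero: numerical noise can make the alternative fit slightly worse.
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    if df == 1:
        p_value = math.erfc(math.sqrt(stat / 2.0))
    elif df == 2:
        p_value = math.exp(-stat / 2.0)
    else:
        raise ValueError("this sketch only handles df = 1 or df = 2")
    return stat, p_value


if __name__ == "__main__":
    # Made-up log-likelihoods for a null and an alternative model.
    stat, p_value = lrt(lnl_null=-1000.0, lnl_alt=-995.0, df=2)
    print(f"2*dlnL = {stat:.2f}, p = {p_value:.4f}")
```

When many gene families are tested, the resulting p-values are unadjusted; the B-H correction listed under Todo would be one way to account for multiple testing.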

Todo

  • [ ] Positively selected site visualisation
  • [ ] Python API
  • [ ] Specify site and/or branch-site models only
  • [ ] Renaming:
    • [ ] infer-gene-trees -> infer-trees
  • [ ] Consider B-H correction

Owner

  • Name: Bede Constantinides
  • Login: bede
  • Kind: user
  • Company: Oxford Nanopore Technologies

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite it using these metadata.
title: "Vespasian: genome scale detection of selective pressure variation"
abstract: "Vespasian performs genome scale detection of site and branch-site signatures of positive selection by orchestrating evolutionary hypothesis tests with PAML. Given a collection of alignments of protein-coding orthologous gene families and labelled trees, Vespasian infers gene trees from a species tree and evaluates site and lineage-specific models of evolution."
authors:
  - family-names: Constantinides
    given-names: Bede
    orcid: "https://orcid.org/0000-0002-3480-3819"
  - family-names: Orr
    given-names: David
  - family-names: Ovchinnikov
    given-names: Vladimir
    orcid: "https://orcid.org/0000-0003-0350-8478"
  - family-names: Mulhair
    given-names: Peter
    orcid: "https://orcid.org/0000-0003-3311-4883"
  - family-names: Webb
    given-names: Andrew E
  - family-names: O'Connell
    given-names: Mary J
    orcid: "https://orcid.org/0000-0002-1877-1001"
version: 0.5.3
date-released: "2021-12-14"
identifiers:
  - description: Collection of archived Vespasian versions
    type: doi
    value: "10.5281/zenodo.5779868"
doi: 10.5281/zenodo.5779868
license: GPL-3.0
repository-code: "https://github.com/bede/vespasian"

Dependencies

.github/workflows/test.yml actions
  • actions/checkout v2 composite
  • s-weigand/setup-conda v1 composite
environment.yml pypi
  • vespasian *
pyproject.toml pypi
  • argh *
  • biopython >=1.78
  • numpy >=1.20.2
  • pandas >=1.2.4
  • parmap *
  • pyyaml *
  • scipy >=1.6.2
  • snakemake *
  • tqdm *
  • treeswift ==1.1.14