Repository
Genome-scale detection of selective pressure variation
Basic Info
Statistics
- Stars: 3
- Watchers: 3
- Forks: 1
- Open Issues: 10
- Releases: 8
Metadata Files
README.md
Vespasian
Vespasian performs genome scale detection of site and branch-site signatures of positive selection by orchestrating evolutionary hypothesis tests with PAML. Given a collection of alignments of protein-coding orthologous gene families and labelled trees, Vespasian infers gene trees from a species tree and evaluates site and lineage-specific models of evolution. Model testing is CPU-intensive but embarrassingly parallel, and can be executed on one or many machines with snakemake. Vespasian is the pure Python successor to VESPA by Webb et al. (2017).
Installation
Installing Miniconda
If the conda package manager is already installed, skip this step. Otherwise:
Linux
- Install Miniconda, following instructions and accepting default options:
bash
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
MacOS
An x86_64 Miniconda installation is required in order to install Vespasian.
- If using a Mac with an Intel processor, skip this step. Otherwise:
bash
arch -x86_64 zsh
- Install Miniconda, following instructions and accepting default options:
bash
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh
Installing Vespasian
- If using a Mac with an Intel processor, skip this step. Otherwise:
bash
arch -x86_64 zsh
- Install Vespasian:
bash
curl -OJ https://raw.githubusercontent.com/bede/vespasian/master/environment.yml
conda env create -f environment.yml
conda activate vespasian
- Test Vespasian:
bash
vespasian version
Development install
bash
conda create -y -n vespasian-dev python=3.11 paml==4.10.6 -c conda-forge -c bioconda
conda activate vespasian-dev
git clone https://github.com/bede/vespasian
pip install --editable './vespasian[dev]'
Usage
Step 1: Gene tree inference from a species tree
e.g. vespasian infer-gene-trees --warnings --progress input tree
- Required input (please read carefully):
  - `input`: Path to directory containing orthologous gene families as individual nucleotide alignments in fasta format with a `.fasta` or `.fa` extension. These should be in frame and free from stop codons. Fasta headers should contain a taxonomic identifier (mirroring tip labels in the tree file), optionally followed by a separator character ('|' by default). A minimum of seven taxa must be present.
  - `tree`: Path to species tree in Newick format. Tip labels must correspond to fasta headers before the separator character.
- Output:
  - Directory (default name `gene-trees`) containing minimal gene trees for each family.
```
$ vespasian infer-gene-trees -h
usage: vespasian infer-gene-trees [-h] [-o OUTPUT] [-s SEPARATOR] [-w] [-p] input tree

Create gene trees by pruning a given species tree

positional arguments:
  input                 path to directory containing gene families
  tree                  path to newick formatted species tree

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        path to output directory (default: 'gene-trees')
  -s SEPARATOR, --separator SEPARATOR
                        character separating taxon name and identifier(s) (default: '|')
  -w, --warnings        show warnings (default: False)
  -p, --progress        show progress bar (default: False)
```
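The input requirements for step 1 can be checked before running `infer-gene-trees`. The following is a minimal sketch, not part of Vespasian: `read_fasta` and `check_family` are hypothetical helpers, the stop codon set assumes the standard genetic code, and every codon (including the last) is treated as subject to the "free from stop codons" rule.

```python
# Hypothetical pre-flight check for one gene family alignment (not Vespasian code).
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def read_fasta(text):
    """Parse fasta text into a dict of header -> concatenated sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def check_family(text, tree_tips, separator="|", min_taxa=7):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    taxa = set()
    for header, seq in read_fasta(text).items():
        # The taxon name is the part of the header before the separator
        taxon = header.split(separator)[0].strip()
        taxa.add(taxon)
        if taxon not in tree_tips:
            problems.append(f"{header}: taxon '{taxon}' not found among tree tip labels")
        seq = seq.upper()
        if len(seq) % 3 != 0:
            problems.append(f"{header}: length {len(seq)} is not a multiple of three")
        # Split into complete codons, ignoring any trailing partial codon
        codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
        if any(codon in STOP_CODONS for codon in codons):
            problems.append(f"{header}: contains a stop codon")
    if len(taxa) < min_taxa:
        problems.append(f"only {len(taxa)} taxa present; at least {min_taxa} required")
    return problems
```

Calling `check_family(open(path).read(), tip_labels)` for each `.fasta` or `.fa` file, with `tip_labels` taken from the species tree, gives a quick list of families that need attention.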
Step 2: Configure model test environments
e.g. vespasian codeml-setup --progress --warnings --branches branches.yml input gene-trees
Required input:
- `input`: Path to directory containing aligned orthologous gene families as individual fasta files.
- `gene-trees`: Path to directory containing minimal gene trees.
Optional input:
- `--branches BRANCHES`: Path to a YAML file mapping lineages to be labelled for evaluation of lineage-specific evolutionary signal using branch-site tests. To label an individual leaf node taxon, specify its name followed by a colon. To label an internal node, choose a suitable name (e.g. `carnivora`) followed by a colon and its corresponding leaf nodes inside square brackets (a sequence in YAML), separated by commas. For internal nodes, all child nodes present in the species tree must be specified, even if they are not present in all of the gene families.
yaml
cat:
carnivora: [cat, dog]
Output:
- Directory (default name `codeml`) containing nested directory structure of models and starting parameters for each gene family.
- File `codeml-commands.sh` containing list of commands to execute the model tests
- File `Snakefile` for running the contents of `codeml-commands.sh` locally or using a cluster
N.B. By default, at least two taxa must be present within a given family for a named internal node to be labelled. Use --strict to skip named internal nodes unless all child leaf nodes are present.
```
$ vespasian codeml-setup -h
usage: vespasian codeml-setup [-h] [-b BRANCHES] [-o OUTPUT] [--separator SEPARATOR] [--strict] [-t THREADS] [-w] [-p] input gene-trees

Create suite of branch and branch-site codeml environments

positional arguments:
  input                 path to directory containing aligned gene families
  gene-trees            path to directory containing gene trees

optional arguments:
  -h, --help            show this help message and exit
  -b BRANCHES, --branches BRANCHES
                        path to yaml file containing branches to be labelled (default: -)
  -o OUTPUT, --output OUTPUT
                        path to output directory (default: 'codeml')
  --separator SEPARATOR
                        character separating taxon name and identifier(s) (default: '|')
  --strict              label only branches with all taxa present in tree (default is >= 2) (default: False)
  -t THREADS, --threads THREADS
                        number of parallel workers (default: 6)
  -w, --warnings        show warnings (default: False)
  -p, --progress        show progress bar (default: False)
```
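The labelling rule described above (leaf nodes labelled when present, named internal nodes labelled when at least two child taxa are present, or all of them with `--strict`) can be illustrated with a short sketch. This is an interpretation of the documented behaviour, not Vespasian's implementation: `labellable_branches` is a hypothetical function, and the `branches` dictionary stands in for the parsed YAML file, where a bare `cat:` entry is assumed to parse to `None`.

```python
# Hypothetical illustration of the branch labelling rule (not Vespasian code).

def labellable_branches(branches, family_taxa, strict=False):
    """Return the branches that could be labelled for one gene family.

    branches: dict of name -> list of child leaf taxa, or None for a leaf node
    family_taxa: taxa present in the gene family
    strict: require all child leaf taxa of an internal node to be present
    """
    present = set(family_taxa)
    selected = {}
    for name, children in branches.items():
        if children is None:
            # Leaf node: labelled only if the taxon itself is in the family
            if name in present:
                selected[name] = [name]
            continue
        found = [child for child in children if child in present]
        # Default rule: at least two children present; strict rule: all of them
        required = len(children) if strict else 2
        if len(found) >= required:
            selected[name] = found
    return selected


# Stand-in for a parsed branches.yml with one leaf and two internal nodes
branches = {
    "cat": None,
    "carnivora": ["cat", "dog"],
    "primates": ["human", "chimp", "gorilla"],
}
family = ["cat", "human", "chimp", "mouse"]
print(labellable_branches(branches, family))
print(labellable_branches(branches, family, strict=True))
```

In this example family, `carnivora` is skipped in both modes because only one of its child taxa is present, while `primates` is labelled by default but skipped under the strict rule because `gorilla` is missing.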
Step 3: Run models
e.g. cd codeml && snakemake --cores 8
- Ensure `codeml` binary is present inside `$PATH`
- Using PAML version `4.9=h01d97ff_5` from Conda is recommended
- `cd codeml` (the directory created by `codeml-setup` in step 2)
- Local execution (for small jobs)
  - `snakemake -k --cores 8` (recommended)
  - Or, using GNU parallel (not recommended: doesn't catch errors!) `parallel --bar :::: codeml-commands.sh`
- Cluster execution
  - `snakemake -k --cores MAXJOBS --cluster OPTIONS`
  - SGE example: `snakemake -k --jobs 100 --cluster "qsub -cwd -V" --max-status-checks-per-second 0.1`
  - Oxford Rescomp: `qsub -cwd -V -P bag.prjc -q short.qc`
  - Profiles are available for other cluster platforms
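Where neither snakemake nor a cluster is available, the error-catching concern with GNU parallel can be addressed with a small script. The sketch below is an alternative under stated assumptions, not something shipped with Vespasian: it assumes each non-empty, non-comment line of `codeml-commands.sh` is an independent shell command, and `run_command` and `run_commands` are hypothetical names.

```python
# Hypothetical runner for codeml-commands.sh that records failures (not Vespasian code).
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_command(command):
    """Run one shell command and return it with its exit code."""
    completed = subprocess.run(command, shell=True, capture_output=True, text=True)
    return command, completed.returncode


def run_commands(commands_path, workers=8):
    """Run every command in the file in parallel; return (command, exit code) for failures."""
    commands = [
        line.strip()
        for line in Path(commands_path).read_text().splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    # Threads suffice here because the work happens in child processes
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_command, commands))
    return [(command, code) for command, code in results if code != 0]
```

Running `run_commands("codeml-commands.sh", workers=8)` from inside the `codeml` directory returns the commands that exited with a non-zero status, which can then be inspected or rerun. Unlike snakemake, this sketch does not track which outputs already exist, so it reruns everything each time.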
Step 4: Report model tests and positively selected sites
e.g. vespasian report --progress input
- Required input (please read carefully):
  - `input`: Path to directory (default `codeml`) containing models configured in step 2 and executed in step 3
- Output:
- Directory containing per-gene tables of likelihood ratio test results, model parameters, and positively selected sites from the highest scoring models.
```
$ vespasian report -h
usage: vespasian report [-h] [-o OUTPUT] [--hide] [-p] input

Perform likelihood ratio tests and and report positively selected sites

positional arguments:
  input                 path to codeml-setup output directory

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        path to output directory (default: 'report-codeml')
  --hide                hide gratuitous emperor portrait (default: False)
  -p, --progress        show progress bar (default: False)
```
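For readers unfamiliar with the likelihood ratio tests that the report step summarises, the general calculation can be sketched as follows. This illustrates the standard test for nested models only; which model pairs Vespasian compares, the degrees of freedom it uses, and its significance thresholds are not stated here and should not be inferred from this example. The function name and the log-likelihood values are made up for illustration.

```python
# Generic likelihood ratio test for nested models (illustration only, not Vespasian code).
import math


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (test statistic, p-value) using a chi-squared approximation.

    lnl_null: log-likelihood of the simpler (null) model
    lnl_alt: log-likelihood of the more complex (alternative) model
    df: difference in the number of free parameters (1 or 2 supported here)
    """
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    if df == 1:
        # Chi-squared survival function with one degree of freedom
        p_value = math.erfc(math.sqrt(statistic / 2.0))
    elif df == 2:
        # Chi-squared survival function with two degrees of freedom
        p_value = math.exp(-statistic / 2.0)
    else:
        raise ValueError("this sketch only supports df of 1 or 2")
    return statistic, p_value


# Made-up log-likelihoods for a null and an alternative model
statistic, p_value = likelihood_ratio_test(-1000.0, -995.0, df=2)
print(f"2*deltaLnL = {statistic:.2f}, p = {p_value:.4f}")
```

The closed-form survival functions keep the sketch dependency-free; `scipy.stats.chi2.sf` handles arbitrary degrees of freedom if SciPy is available.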
Todo
- [ ] Positively selected site visualisation
- [ ] Python API
- [ ] Specify site and/or branch-site models only
- [ ] Renaming:
  - [ ] `infer-gene-trees` -> `infer-trees`
- [ ] Consider B-H correction
Owner
- Name: Bede Constantinides
- Login: bede
- Kind: user
- Company: Oxford Nanopore Technologies
- Website: bede.im
- Twitter: beconsta
- Repositories: 76
- Profile: https://github.com/bede
Citation (CITATION.cff)
cff-version: 1.2.0
message: If you use this software, please cite it using these metadata.
title: "Vespasian: genome scale detection of selective pressure variation"
abstract: "Vespasian performs genome scale detection of site and branch-site signatures of positive selection by orchestrating evolutionary hypothesis tests with PAML. Given a collection of alignments of protein-coding orthologous gene families and labelled trees, Vespasian infers gene trees from a species tree and evaluates site and lineage-specific models of evolution."
authors:
- family-names: Constantinides
given-names: Bede
orcid: "https://orcid.org/0000-0002-3480-3819"
- family-names: Orr
given-names: David
- family-names: Ovchinnikov
given-names: Vladimir
orcid: "https://orcid.org/0000-0003-0350-8478"
- family-names: Mulhair
given-names: Peter
orcid: "https://orcid.org/0000-0003-3311-4883"
- family-names: Webb
given-names: Andrew E
- family-names: O'Connell
given-names: Mary J
orcid: "https://orcid.org/0000-0002-1877-1001"
version: 0.5.3
date-released: "2021-12-14"
identifiers:
- description: Collection of archived Vespasian versions
type: doi
value: "10.5281/zenodo.5779868"
doi: 10.5281/zenodo.5779868
license: GPL-3.0
repository-code: "https://github.com/bede/vespasian"
Dependencies
- actions/checkout v2 composite
- s-weigand/setup-conda v1 composite
- vespasian *
- argh *
- biopython >=1.78
- numpy >=1.20.2
- pandas >=1.2.4
- parmap *
- pyyaml *
- scipy >=1.6.2
- snakemake *
- tqdm *
- treeswift ==1.1.14