pairsnp
A set of scripts for calculating pairwise SNP distance
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 1 committers (100.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.6%) to scientific vocabulary
Repository
Statistics
- Stars: 42
- Watchers: 5
- Forks: 7
- Open Issues: 0
- Releases: 0
README.md
pairsnp
A set of scripts for very quickly obtaining pairwise SNP distance matrices from multiple sequence alignments, using sparse matrix libraries to improve performance.
For larger alignments such as the Maela pneumococcal data set (3e5 x 3e3), the c++ version is approximately an order of magnitude faster than approaches based on pairwise comparison of every site, such as snp-dists, from which the skeleton code for the c++ version was taken.
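The idea behind the speed-up can be illustrated with a minimal pure-Python sketch. It is not the code used by pairsnp: it replaces the sparse matrix products with plain dictionaries, and the function names (`consensus_sequence`, `sparse_differences`, `snp_distances`) are hypothetical. It assumes an alignment of equal-length sequences and treats every character, including gaps and Ns, as an ordinary symbol. Each sequence is stored only as its differences from a consensus, so the work per pair scales with the number of variant sites rather than the alignment length.

```python
from collections import Counter
from itertools import combinations


def consensus_sequence(seqs):
    """Most common character in each alignment column (ties: first seen)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*seqs)]


def sparse_differences(seqs, consensus):
    """For each sequence, keep only {site: base} where it differs from the consensus."""
    return [
        {site: base
         for site, (base, ref) in enumerate(zip(seq, consensus))
         if base != ref}
        for seq in seqs
    ]


def snp_distances(seqs):
    """Pairwise SNP distances computed from the sparse difference maps."""
    sparse = sparse_differences(seqs, consensus_sequence(seqs))
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        a, b = sparse[i], sparse[j]
        shared = a.keys() & b.keys()  # sites where both differ from the consensus
        shared_same = sum(1 for site in shared if a[site] == b[site])
        # Sites where either differs from the consensus, minus those where both
        # carry the same non-consensus base.
        dist[i][j] = dist[j][i] = len(a) + len(b) - len(shared) - shared_same
    return dist
```

Calling `snp_distances(["ACGT", "ACGA", "TCGA"])` returns a symmetric matrix of site-wise differences; the result does not depend on which base is chosen as the consensus, only the amount of work does.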
To be as widely useful as possible, implementations in R, python and c++ are available.
| Implementation | Travis |
| ------------- |:-------------:|
| R | |
| python | |
| c++ | |
Installation
R
The R version can be installed using devtools or downloaded from its repository.
```
install.packages("devtools")
devtools::install_github("gtonkinhill/pairsnp-r")
```
python
The python version can be installed using pip or by downloading the repository and running setup.py.
python -m pip install pairsnp
or alternatively download the repository and run
cd ./pairsnp-python/
python ./setup.py install
c++
The c++ version can be installed manually, by downloading the binaries in this repository, or with conda as
conda install -c bioconda pairsnp
The c++ code relies on a recent version of Armadillo (currently tested on v8.6) and after downloading the repository can be built by running
cd ./pairsnp-cpp/
./configure
make
make install
The majority of the time is spent doing sparse matrix multiplications, so linking to a parallelised library for this is likely to improve performance further.
At the moment you may need to run `touch ./*` before compiling to avoid some issues with timestamps.
Quick Start
R
library(pairsnp)
fasta.file.name <- system.file("extdata", "seqs.fa", package = "pairsnp")
sparse.data <- import_fasta_sparse(fasta.file.name)
d <- snp_dist(sparse.data)
python
The python version can be run from the python interpreter as
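```
from pairsnp import calculate_snp_matrix, calculate_distance_matrix

fasta_file_name = "/path/to/msa.fasta"  # path to a multiple sequence alignment
sparse_matrix, consensus, seq_names = calculate_snp_matrix(fasta_file_name)
# "dist" requests distances rather than similarities
d = calculate_distance_matrix(sparse_matrix, consensus, "dist", False)
```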
alternatively if installed using pip it can be used at the command line as
pairsnp -f /path/to/msa.fasta -o /path/to/output.csv
additional options include
```
Program to calculate pairwise SNP distance and similarity matrices.

optional arguments:
  -h, --help            show this help message and exit
  -t {sim,dist}, --type {sim,dist}
                        either sim (similarity) or dist (distance) (default).
  -n, --inc_n           flag to indicate differences to gaps should be counted.
  -f FILENAME, --file FILENAME
                        location of a multiple sequence alignment. Currently
                        only DNA alignments are supported.
  -z, --zipped          Alignment is gzipped.
  -c, --csv             Output csv-delimited table (default tsv).
  -o OUTPUT, --out OUTPUT
                        location of output file.
```
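For readers who want to inspect an alignment before passing it to the tool, the following is a minimal sketch of reading a plain or gzipped FASTA alignment using only the Python standard library. The function `read_fasta` is hypothetical and is not part of the pairsnp package; it assumes that gzipped files end in `.gz` and that sequence names are the first whitespace-delimited token of each header line.

```python
import gzip


def read_fasta(path):
    """Return (names, sequences) from a FASTA file; '.gz' paths are read via gzip."""
    opener = gzip.open if str(path).endswith(".gz") else open
    names, chunks = [], []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:].split()
                names.append(header[0] if header else "")
                chunks.append([])
            elif chunks:
                # Sequence lines may be wrapped, so collect them per record.
                chunks[-1].append(line.upper())
    seqs = ["".join(parts) for parts in chunks]
    # A multiple sequence alignment must have one common length.
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences in an alignment must all have the same length")
    return names, seqs
```

The length check is a cheap way to catch files that are not true alignments before computing any distances.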
c++
The c++ version can be run from the command line as
pairsnp -c msa.fasta > output.csv
additional options include
SYNOPSIS
Pairwise SNP distance matrices using fast matrix algebra libraries
USAGE
pairsnp [options] alignment.fasta[.gz] > matrix.csv
OPTIONS
-h Show this help
-v Print version and exit
-s Output in sparse matrix form (i,j,distance).
-d Distance threshold for sparse output. Only distances <= d will be returned.
-k Will only return the k nearest neighbours for each sample in sparse output.
-c Output CSV instead of TSV
-n Count comparisons with Ns (off by default)
-t Number of threads to use (default=1)
-b Blank top left corner cell instead of 'pairsnp 0.1.0'
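To make the sparse-output options more concrete, here is a minimal Python sketch of what an (i, j, distance) listing with a distance threshold could look like. It is illustrative only: the function name `to_sparse_triplets`, the zero-based indices, and the choice to list each pair once (i < j) are assumptions, not a description of the exact output pairsnp writes.

```python
def to_sparse_triplets(dist, threshold=None):
    """Convert a square distance matrix to (i, j, distance) triplets.

    Only pairs with i < j are listed; if a threshold is given, only
    distances <= threshold are kept.
    """
    triplets = []
    for i, row in enumerate(dist):
        for j in range(i + 1, len(row)):
            if threshold is None or row[j] <= threshold:
                triplets.append((i, j, row[j]))
    return triplets
```

With a threshold, the listing grows with the number of close pairs rather than with the square of the number of samples, which is the main reason to prefer a sparse form for large data sets.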
Owner
- Name: Gerry Tonkin-Hill
- Login: gtonkinhill
- Kind: user
- Location: Australia
- Website: https://gtonkinhill.github.io
- Twitter: gerrythill
- Repositories: 14
- Profile: https://github.com/gtonkinhill
Postdoctoral Fellow at the University of Oslo | previously at the Sanger Institute | pathogen genomics, statistics, machine learning
GitHub Events
Total
- Watch event: 2
Last Year
- Watch event: 2
Committers
Last synced: over 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Gerry Tonkin-Hill | g****0@c****k | 59 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 6
- Total pull requests: 1
- Average time to close issues: 18 days
- Average time to close pull requests: less than a minute
- Total issue authors: 2
- Total pull request authors: 1
- Average comments per issue: 1.17
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- sdwfrost (5)
- johnlees (1)
Pull Request Authors
- gtonkinhill (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 19 last month
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 5
- Total maintainers: 1
pypi.org: pairsnp
A simple package for calculating pairwise SNP distances
- Homepage: https://github.com/gtonkinhill/pairsnp/
- Documentation: https://pairsnp.readthedocs.io/
- License: MIT License
- Latest release: 0.0.7 (published almost 7 years ago)