ReferenceSeeker

ReferenceSeeker: rapid determination of appropriate reference genomes - Published in JOSS (2020)

https://github.com/oschwengers/referenceseeker

Science Score: 100.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 15 DOI reference(s) in README and JOSS metadata
✓
Academic publication links
Links to: ncbi.nlm.nih.gov, joss.theoj.org, zenodo.org
✓
Committers with academic emails
1 of 1 committers (100.0%) from academic institutions
○
Institutional organization owner
✓
JOSS paper metadata
Published in Journal of Open Source Software

Keywords

ani bioinformatics mash microbiology reference-genomes refseq wgs

Scientific Fields

Biology Life Sciences - 41% confidence

Last synced: 6 months ago

Repository

Rapid determination of appropriate reference genomes.

Basic Info

Host: GitHub
Owner: oschwengers
License: gpl-3.0
Language: Python
Default Branch: main
Homepage: https://doi.org/10.21105/joss.01994
Size: 19.7 MB

Statistics

Stars: 98
Watchers: 5
Forks: 6
Open Issues: 4
Releases: 14

Topics

ani bioinformatics mash microbiology reference-genomes refseq wgs

Created over 7 years ago · Last pushed about 1 year ago

Metadata Files

Readme Contributing License Code of conduct Citation

README.md

ReferenceSeeker: rapid determination of appropriate reference genomes

Description
Input & Output
Installation
- BioConda
- GitHub
Usage
Examples
Databases
- RefSeq
- Custom
Dependencies
Citation
Feedback

Description

ReferenceSeeker determines closely related reference genomes following a scalable hierarchical approach that combines a fast kmer profile-based database lookup of candidate reference genomes with the subsequent computation of specific average nucleotide identity (ANI) values.

ReferenceSeeker computes kmer-based genome distances between a query genome and potential reference genome candidates via Mash (Ondov et al. 2016). For the resulting candidates, ReferenceSeeker subsequently computes (optionally bidirectional) ANI values and, by default, picks genomes meeting community standard thresholds (ANI >= 95 % and conserved DNA >= 69 %) (Goris, Konstantinidis et al. 2007). The selected genomes are ranked by the product of ANI and conserved DNA values to take both genome coverage and identity into account.
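The default filtering and ranking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ReferenceSeeker's actual implementation: the `Candidate` record and the `filter_and_rank` function are hypothetical names, values are expressed as fractions, and one entry is invented to show a genome that fails the conserved DNA threshold.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one candidate reference genome."""
    genome_id: str
    ani: float      # average nucleotide identity as a fraction (0..1)
    con_dna: float  # conserved DNA as a fraction (0..1)


def filter_and_rank(candidates, min_ani=0.95, min_con_dna=0.69):
    """Keep candidates meeting both thresholds, highest ANI * conserved DNA first."""
    passed = [c for c in candidates
              if c.ani >= min_ani and c.con_dna >= min_con_dna]
    return sorted(passed, key=lambda c: c.ani * c.con_dna, reverse=True)


candidates = [
    Candidate("GCF_003595465.1", 0.9986, 0.9681),
    Candidate("GCF_000013425.1", 1.0000, 1.0000),
    Candidate("invented_low_coverage", 0.9990, 0.5000),  # fails conserved DNA
    Candidate("GCF_001900185.1", 1.0000, 0.9989),
]

for c in filter_and_rank(candidates):
    print(f"{c.genome_id}\t{c.ani * c.con_dna:.4f}")
```

Ranking by the product means that a genome with near-perfect identity but low coverage cannot outrank one that is slightly less identical but covers far more of the query.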

Custom databases can be built with local genomes. For further convenience, we provide pre-built databases with sequences from RefSeq (https://www.ncbi.nlm.nih.gov/refseq), GTDB and PLSDB comprising the following taxa:

bacteria
archaea
fungi
protozoa
viruses

as well as plasmids.

The reason for subsequently calculating both ANI and conserved DNA values is that Mash distance values correlate well with ANI values for closely related genomes, but the same is not true for conserved DNA values. A kmer fingerprint-based comparison alone cannot distinguish whether a kmer is missing due to a SNP, for instance, or due to the absence of the subsequence comprising the kmer. As DNA conservation (next to DNA identity) is very important for many kinds of analyses, e.g. reference-based SNP detection, ranking potential reference genomes based on a Mash distance alone is often not sufficient to select the most appropriate reference genomes. If desired, ANI and conserved DNA values can be computed bidirectionally.
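The ambiguity described above can be illustrated with a toy example. The sequences below are made up and the helper `kmers` is a hypothetical name; real tools such as Mash work on hashed kmer sketches of whole genomes, so this sketch only shows the principle that a SNP and a missing subsequence both make reference kmers disappear from the query.

```python
def kmers(seq, k=5):
    """Return the set of all kmers of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Made-up toy reference sequence (33 bp).
reference = "ACGTTGCATGGACCTAGGCTTAACGGATCCATG"

# Variant 1: a single SNP (A -> C at 0-based position 15).
snp_query = reference[:15] + "C" + reference[16:]

# Variant 2: a 5 bp subsequence (positions 13..17) is absent.
deletion_query = reference[:13] + reference[18:]

ref_kmers = kmers(reference)
missing_after_snp = ref_kmers - kmers(snp_query)
missing_after_deletion = ref_kmers - kmers(deletion_query)

# In both cases some reference kmers are simply "not found" in the query.
# The kmer sets alone do not reveal whether the region is still present
# (with one mismatch) or gone entirely.
print("missing after SNP:     ", len(missing_after_snp))
print("missing after deletion:", len(missing_after_deletion))
```

An alignment-based comparison, in contrast, still aligns the SNP-containing region (high conserved DNA, slightly reduced identity) but cannot align the deleted one (reduced conserved DNA).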

Figure: Mash distance vs. ANI / conserved DNA

Input & Output

Input

Path to a taxon database and a draft or finished genome in (zipped) fasta format:

```bash
$ referenceseeker ~/bacteria GCF_000013425.1.fna
```

Output

Tab separated lines to STDOUT comprising the following columns:

Unidirectionally (query -> references):

RefSeq Assembly ID
Mash Distance
ANI
Conserved DNA
NCBI Taxonomy ID
Assembly Status
Organism

```bash
ID	Mash Distance	ANI	Con. DNA	Taxonomy ID	Assembly Status	Organism
GCF_000013425.1	0.00000	100.00	100.00	93061	complete	Staphylococcus aureus subsp. aureus NCTC 8325
GCF_001900185.1	0.00002	100.00	99.89	46170	complete	Staphylococcus aureus subsp. aureus HG001
GCF_900475245.1	0.00004	100.00	99.57	93061	complete	Staphylococcus aureus subsp. aureus NCTC 8325 NCTC8325
GCF_001018725.2	0.00016	100.00	99.28	1280	complete	Staphylococcus aureus FDAARGOS_10
GCF_003595465.1	0.00185	99.86	96.81	1280	complete	Staphylococcus aureus USA300-SUR6
GCF_003595385.1	0.00180	99.87	96.80	1280	complete	Staphylococcus aureus USA300-SUR2
GCF_003595365.1	0.00180	99.87	96.80	1280	complete	Staphylococcus aureus USA300-SUR1
GCF_001956815.1	0.00180	99.87	96.80	46170	complete	Staphylococcus aureus subsp. aureus USA300_SUR1
...
```
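For downstream scripting, the unidirectional output can be read with a few lines of Python. This is a minimal sketch: the function name `parse_results` and the dictionary keys are hypothetical, and it assumes that the first line is a header starting with `ID` (optionally preceded by `#`) and that the columns appear in the order listed above.

```python
COLUMNS = ["id", "mash_distance", "ani", "con_dna",
           "taxonomy_id", "assembly_status", "organism"]


def parse_results(text):
    """Parse unidirectional, tab-separated ReferenceSeeker output into dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines, truncated lines and the header line.
        if len(fields) != len(COLUMNS) or fields[0].lstrip("#") == "ID":
            continue
        row = dict(zip(COLUMNS, fields))
        for key in ("mash_distance", "ani", "con_dna"):
            row[key] = float(row[key])
        rows.append(row)
    return rows


example = (
    "ID\tMash Distance\tANI\tCon. DNA\tTaxonomy ID\tAssembly Status\tOrganism\n"
    "GCF_000013425.1\t0.00000\t100.00\t100.00\t93061\tcomplete\t"
    "Staphylococcus aureus subsp. aureus NCTC 8325\n"
    "GCF_001900185.1\t0.00002\t100.00\t99.89\t46170\tcomplete\t"
    "Staphylococcus aureus subsp. aureus HG001\n"
)

results = parse_results(example)
for row in results:
    print(row["id"], row["ani"], row["con_dna"], row["organism"])
```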

Bidirectionally (query -> references [QR] & references -> query [RQ]):

RefSeq Assembly ID
Mash Distance
QR ANI
QR Conserved DNA
RQ ANI
RQ Conserved DNA
NCBI Taxonomy ID
Assembly Status
Organism

```bash
ID	Mash Distance	QR ANI	QR Con. DNA	RQ ANI	RQ Con. DNA	Taxonomy ID	Assembly Status	Organism
GCF_000013425.1	0.00000	100.00	100.00	100.00	100.00	93061	complete	Staphylococcus aureus subsp. aureus NCTC 8325
GCF_001900185.1	0.00002	100.00	99.89	100.00	99.89	46170	complete	Staphylococcus aureus subsp. aureus HG001
GCF_900475245.1	0.00004	100.00	99.57	99.99	99.67	93061	complete	Staphylococcus aureus subsp. aureus NCTC 8325 NCTC8325
GCF_001018725.2	0.00016	100.00	99.28	99.95	98.88	1280	complete	Staphylococcus aureus FDAARGOS_10
GCF_001018915.2	0.00056	99.99	96.35	99.98	99.55	1280	complete	Staphylococcus aureus NRS133
GCF_001019415.2	0.00081	99.99	94.47	99.98	99.36	1280	complete	Staphylococcus aureus NRS146
GCF_001018735.2	0.00096	100.00	94.76	99.98	98.58	1280	complete	Staphylococcus aureus NRS137
GCF_003354885.1	0.00103	99.93	96.63	99.93	96.66	1280	complete	Staphylococcus aureus 164
...
```

Installation

ReferenceSeeker can be installed via Conda or from GitHub. In either case, a taxon database must be downloaded; we provide pre-built databases for download at Zenodo. For more information, have a look at Databases.

BioConda

The preferred way to install and run ReferenceSeeker is Conda using the Bioconda channel:

```bash
$ conda install -c bioconda referenceseeker
$ referenceseeker --help
```

GitHub

Alternatively, you can use this raw GitHub repository:

install necessary Python dependencies (if necessary)
clone the latest version of the repository
install necessary 3rd party executables (Mash, MUMmer4)

```bash
$ pip3 install --user biopython xopen
$ git clone https://github.com/oschwengers/referenceseeker.git
$ # install Mash & MUMmer
$ ./referenceseeker/bin/referenceseeker --help
```

Test

To test your installation, we prepared a tiny mock database comprising 4 Salmonella spp. genomes and a query assembly (SRA: SRR498276) in the test directory:

```bash
$ git clone https://github.com/oschwengers/referenceseeker.git

# GitHub installation
$ ./referenceseeker/bin/referenceseeker referenceseeker/test/db referenceseeker/test/data/Salmonella_enterica_CFSAN000189.fasta

# BioConda installation
$ referenceseeker referenceseeker/test/db referenceseeker/test/data/Salmonella_enterica_CFSAN000189.fasta
```

Expected output:

```bash
ID	Mash Distance	ANI	Con. DNA	Taxonomy ID	Assembly Status	Organism
GCF_000439415.1	0.00003	100.00	99.55	1173427	complete	Salmonella enterica subsp. enterica serovar Bareilly str. CFSAN000189
GCF_900205275.1	0.01522	98.61	83.13	90370	complete	Salmonella enterica subsp. enterica serovar Typhi
```

Usage

Usage:

```bash
usage: referenceseeker [--crg CRG] [--ani ANI] [--conserved-dna CONSERVED_DNA]
                       [--unfiltered] [--bidirectional] [--help] [--version]
                       [--verbose] [--threads THREADS]

Rapid determination of appropriate reference genomes.

positional arguments:
  ReferenceSeeker database path
  target draft genome in fasta format

Filter options / thresholds:
  These options control the filtering and alignment workflow.

  --crg CRG, -r CRG     Max number of candidate reference genomes to pass kmer
                        prefilter (default = 100)
  --ani ANI, -a ANI     ANI threshold (default = 0.95)
  --conserved-dna CONSERVED_DNA, -c CONSERVED_DNA
                        Conserved DNA threshold (default = 0.69)
  --unfiltered, -u      Set kmer prefilter to extremely conservative values
                        and skip species level ANI cutoffs (ANI >= 0.95 and
                        conserved DNA >= 0.69)
  --bidirectional, -b   Compute bidirectional ANI/conserved DNA values
                        (default = False)

Runtime & auxiliary options:
  --help, -h            Show this help message and exit
  --version, -V         show program's version number and exit
  --verbose, -v         Print verbose information
  --threads THREADS, -t THREADS
                        Number of used threads (default = number of available
                        CPU cores)
```
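When ReferenceSeeker is called from a pipeline, the command line can be assembled programmatically. The following is a minimal sketch, assuming `referenceseeker` is available on the `PATH`; the helper names `build_command` and `run_referenceseeker` are hypothetical, and only options from the help text above are used.

```python
import subprocess


def build_command(db_path, genome_path, crg=100, threads=None, bidirectional=False):
    """Assemble a referenceseeker call using only options from the help text."""
    cmd = ["referenceseeker", "--crg", str(crg)]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if bidirectional:
        cmd.append("--bidirectional")
    # Positional arguments: database path first, then the query genome.
    return cmd + [str(db_path), str(genome_path)]


def run_referenceseeker(db_path, genome_path, **options):
    """Run ReferenceSeeker and return its tab-separated STDOUT as text."""
    result = subprocess.run(
        build_command(db_path, genome_path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Only print the command here; call run_referenceseeker() to execute it.
print(" ".join(build_command("bacteria-refseq/", "genome.fasta", crg=500, threads=8)))
```

Passing the arguments as a list avoids shell quoting issues with unusual file names, and `check=True` raises an exception if the tool exits with a non-zero status.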

Examples

Installation:

```bash
$ conda install -c bioconda referenceseeker
$ wget https://zenodo.org/record/4415843/files/bacteria-refseq.tar.gz
$ tar -xzf bacteria-refseq.tar.gz
$ rm bacteria-refseq.tar.gz
```

Simple:

```bash
$ # referenceseeker <REFERENCE_SEEKER_DB> <GENOME>
$ referenceseeker bacteria-refseq/ genome.fasta
```

Expert: verbose output and increased output of candidate reference genomes using a defined number of threads:

```bash
$ # referenceseeker --crg 500 --verbose --threads 8 <REFERENCE_SEEKER_DB> <GENOME>
$ referenceseeker --crg 500 --verbose --threads 8 bacteria-refseq/ genome.fasta
```

Databases

ReferenceSeeker depends on databases comprising taxonomic genome information as well as kmer hash profiles for each entry.

Pre-built

We provide pre-built databases based on public genome data hosted at Zenodo:

RefSeq

release: 221 (2024-01-09)

| Taxon | URL | # Genomes | Size |
| :---: | --- | ---: | :---: |
| bacteria | https://zenodo.org/record/4415843/files/bacteria-refseq.tar.gz | 50,226 | 59.6 Gb |
| archaea | https://zenodo.org/record/4415843/files/archaea-refseq.tar.gz | 905 | 897 Mb |
| fungi | https://zenodo.org/record/4415843/files/fungi-refseq.tar.gz | 557 | 5.9 Gb |
| protozoa | https://zenodo.org/record/4415843/files/protozoa-refseq.tar.gz | 90 | 1.1 Gb |
| viruses | https://zenodo.org/record/4415843/files/viral-refseq.tar.gz | 14,012 | 1 Mb |

GTDB

release: v214 (2024-01-11)

| Taxon | URL | # Genomes | Size |
| :---: | --- | ---: | :---: |
| bacteria | n.a. due to storage quota restrictions | 80,789 | 82 Gb |
| archaea | https://zenodo.org/record/4415843/files/archaea-gtdb.tar.gz | 4,416 | 2.8 Gb |

Plasmids

In addition to the genome based databases, we provide the following plasmid databases based on RefSeq and PLSDB:

| DB | URL | # Plasmids | Size |
| :---: | --- | ---: | :---: |
| RefSeq | https://zenodo.org/record/4415843/files/plasmids-refseq.tar.gz | 81,674 | 2.6 Gb |
| PLSDB | https://zenodo.org/record/4415843/files/plasmids-plsdb.tar.gz | 59,882 | 2.3 Gb |
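Downloading and unpacking one of the pre-built archives can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `database_url` and `fetch_database` are hypothetical, and the base URL is taken from the tables above. Note that the archives are large, so the actual download call is left commented out.

```python
import tarfile
import urllib.request
from pathlib import Path

# Base URL of the Zenodo record used in the tables above.
ZENODO_FILES = "https://zenodo.org/record/4415843/files"


def database_url(archive_name):
    """Build the download URL for one archive listed in the tables above."""
    return f"{ZENODO_FILES}/{archive_name}"


def fetch_database(archive_name, target_dir="."):
    """Download an archive, unpack it into target_dir and delete the archive."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    archive_path = target / archive_name
    urllib.request.urlretrieve(database_url(archive_name), archive_path)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    archive_path.unlink()
    return target


# Example call (commented out because it triggers a large download):
# fetch_database("archaea-refseq.tar.gz", "databases")
print(database_url("archaea-refseq.tar.gz"))
```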

Custom database

If the above-mentioned RefSeq-based databases do not contain sufficiently closely related genomes or are just too large, ReferenceSeeker provides auxiliary commands to either create databases from scratch or expand existing ones. For this purpose, a second executable, referenceseeker_db, accepts the init and import subcommands:

Usage:

```bash
usage: referenceseeker_db [--help] [--version] {init,import} ...

Rapid determination of appropriate reference genomes.

positional arguments:
  {init,import}  sub-command help
    init         Initialize a new database
    import       Add a new genome to database

Runtime & auxiliary options:
  --help, -h     Show this help message and exit
  --version, -V  show program's version number and exit
```

If a new database should be created, use referenceseeker_db init:

```bash
usage: referenceseeker_db init [-h] [--output OUTPUT] --db DB

optional arguments:
  -h, --help            show this help message and exit
  --output OUTPUT, -o OUTPUT
                        output directory (default = current working directory)
  --db DB, -d DB        Name of the new ReferenceSeeker database
```

This new database or an existing one can be used to import genomes in Fasta, GenBank or EMBL format:

```bash
usage: referenceseeker_db import [-h] --db DB --genome GENOME [--id ID]
                                 [--taxonomy TAXONOMY]
                                 [--status {complete,chromosome,scaffold,contig}]
                                 [--organism ORGANISM]

optional arguments:
  -h, --help            show this help message and exit
  --db DB, -d DB        ReferenceSeeker database path
  --genome GENOME, -g GENOME
                        Genome path [Fasta, GenBank, EMBL]
  --id ID, -i ID        Unique genome identifier (default sequence id of first
                        record)
  --taxonomy TAXONOMY, -t TAXONOMY
                        Taxonomy ID (default = 12908 [unclassified sequences])
  --status {complete,chromosome,scaffold,contig}, -s {complete,chromosome,scaffold,contig}
                        Assembly level (default = contig)
  --organism ORGANISM, -o ORGANISM
                        Organism name (default = "NA")
```

Example:

If ReferenceSeeker is properly installed, clone this repository and change into its root directory.

```bash
$ git clone https://github.com/oschwengers/referenceseeker.git
$ cd referenceseeker
$ referenceseeker_db init --db test-db --output ./
$ referenceseeker_db import --db ./test-db --genome test/db/GCF_000439415.1.fna.gz --id GCF_000439415.1 --taxonomy 28901 --status complete --organism "Salmonella enterica subsp. enterica serovar Bareilly str. CFSAN000189"
$ referenceseeker_db import --db ./test-db --genome test/db/GCF_002211925.1.fna.gz --id GCF_002211925.1 --organism "Salmonella bongori str. SA19983605"
$ referenceseeker -v ./test-db ./test/data/Salmonella_enterica_CFSAN000189.fasta
```
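Importing many local genomes one by one quickly becomes tedious, so the import step can be wrapped in a small loop. The following is a minimal sketch, assuming `referenceseeker_db` is on the `PATH`; the helper names `import_command` and `import_all` are hypothetical, as is the convention of deriving the genome identifier from the file name. Only options from the help text above are used.

```python
import subprocess
from pathlib import Path


def import_command(db_path, genome_path, status="contig", organism=None):
    """Assemble one `referenceseeker_db import` call for a single genome file."""
    genome = Path(genome_path)
    # Hypothetical convention: derive the identifier from the file name,
    # e.g. "GCF_000439415.1.fna.gz" -> "GCF_000439415.1".
    genome_id = genome.name.split(".fna")[0]
    cmd = ["referenceseeker_db", "import", "--db", str(db_path),
           "--genome", str(genome), "--id", genome_id, "--status", status]
    if organism:
        cmd += ["--organism", organism]
    return cmd


def import_all(db_path, genome_dir, pattern="*.fna.gz"):
    """Import every matching genome file from genome_dir into the database."""
    for genome in sorted(Path(genome_dir).glob(pattern)):
        subprocess.run(import_command(db_path, genome), check=True)


# Only print one command here; call import_all("./test-db", "test/db") to run.
print(" ".join(import_command("./test-db", "test/db/GCF_002211925.1.fna.gz")))
```

Genomes imported this way get the default taxonomy ID and organism name unless `organism` is passed explicitly, so curated metadata should be supplied per genome where it is available.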

Dependencies

ReferenceSeeker needs the following dependencies:

Python (3.8, 3.9), Biopython (>=1.78), xopen (>=1.1.0)
Mash (2.3) https://github.com/marbl/Mash
MUMmer (4.0.0-beta2) https://github.com/gmarcais/mummer

ReferenceSeeker has been tested against the aforementioned versions.

Citation

Schwengers et al., (2020). ReferenceSeeker: rapid determination of appropriate reference genomes. Journal of Open Source Software, 5(46), 1994, https://doi.org/10.21105/joss.01994

Feedback

We highly welcome and appreciate feedback of all kinds!

So, if you run into any issues with ReferenceSeeker, we'd be happy to hear about it! Please, start the pipeline with -v (verbose) and do not hesitate to file an issue here on GitHub including as much of the following as possible:

a detailed description of the issue
the ReferenceSeeker cmd line output
a reproducible example of the issue with a small dataset that you can share (helps us identify whether the issue is specific to a particular computer, operating system, and/or dataset).

The maintenance of ReferenceSeeker is supported by deNBI. If you would like to provide (non-technical) feedback, please find a service monitoring survey here.

Owner

Name: Oliver Schwengers
Login: oschwengers
Kind: user
Location: Giessen, Germany
Company: @ag-computational-bio - JLU Giessen

Twitter: oschwengers1
Repositories: 6
Profile: https://github.com/oschwengers

Microbial bioinformatics, WGS bacteria, plasmids, PostDoc, father of 2, husband, astrophotographer

JOSS Publication

ReferenceSeeker: rapid determination of appropriate reference genomes

Published

February 04, 2020

DOI

10.21105/joss.01994

Volume 5, Issue 46, Page 1994

Authors

Oliver Schwengers

Bioinformatics and Systems Biology, Justus Liebig University Giessen, Giessen, 35392, Germany, Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen, 35392, Germany, German Centre for Infection Research (DZIF), partner site Giessen-Marburg-Langen, Giessen, Germany

Torsten Hain
Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen, 35392, Germany, German Centre for Infection Research (DZIF), partner site Giessen-Marburg-Langen, Giessen, Germany

Trinad Chakraborty
Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen, 35392, Germany, German Centre for Infection Research (DZIF), partner site Giessen-Marburg-Langen, Giessen, Germany

Alexander Goesmann

Bioinformatics and Systems Biology, Justus Liebig University Giessen, Giessen, 35392, Germany, German Centre for Infection Research (DZIF), partner site Giessen-Marburg-Langen, Giessen, Germany

Editor

Will Rowe

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use ReferenceSeeker, please cite it using these metadata."
authors: 
  -
    family-names: Schwengers
    given-names: Oliver
    orcid: https://orcid.org/0000-0003-4216-2721
    affiliation: Bioinformatics and Systems Biology, Justus Liebig University Giessen, Giessen, 35392, Germany
  -
    family-names: Hain
    given-names: Torsten
    affiliation: Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen, 35392, Germany
  -
    family-names: Chakraborty
    given-names: Trinad
    affiliation: Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen, 35392, Germany
  -
    family-names: Goesmann
    given-names: Alexander
    orcid: https://orcid.org/0000-0002-7086-2568
    affiliation: Bioinformatics and Systems Biology, Justus Liebig University Giessen, Giessen, 35392, Germany
title: "ReferenceSeeker: rapid determination of appropriate reference genome"
doi: "10.21105/joss.01994"
version: 1.7.3
keywords: 
  - bioinformatics
  - WGS
  - NGS
  - Microbiology
license: GPL-3.0

GitHub Events

Total

Issues event: 3
Watch event: 9
Issue comment event: 2
Push event: 2

Last Year

Issues event: 3
Watch event: 9
Issue comment event: 2
Push event: 2

Committers

Last synced: 7 months ago

All Time

Total Commits: 222
Total Committers: 1
Avg Commits per committer: 222.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 5
Committers: 1
Avg Commits per committer: 5.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Oliver Schwengers	o**s@c**e	222

Committer Domains (Top 20 + Academic)

computational.bio.uni-giessen.de: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 27
Total pull requests: 4
Average time to close issues: 8 days
Average time to close pull requests: 1 day
Total issue authors: 15
Total pull request authors: 2
Average comments per issue: 2.67
Average comments per pull request: 1.75
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 0
Average time to close issues: 2 days
Average time to close pull requests: N/A
Issue authors: 2
Pull request authors: 0
Average comments per issue: 0.5
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

standage (4)
oschwengers (4)
Rob-murphys (4)
MostafaYA (3)
natir (2)
drelo (1)
Jwebster89 (1)
erelior (1)
samlhao (1)
augustkx (1)
nick-youngblut (1)
GuilhemRoyer (1)
theo-llewellyn (1)
padbc (1)
Benjamin-Lee (1)

Pull Request Authors

oschwengers (3)
standage (1)

Top Labels

Issue Labels

enhancement (8) bug (6) help wanted (3) question (1)

Pull Request Labels

enhancement (1)

Packages

Total packages: 1
Total downloads:
- pypi 54 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 12
Total maintainers: 1

pypi.org: referenceseeker

ReferenceSeeker: rapid determination of appropriate reference genomes.

Homepage: https://github.com/oschwengers/referenceseeker
Documentation: https://referenceseeker.readthedocs.io/
License: GPLv3
Latest release: 1.8.0
published about 4 years ago

Versions: 12
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 54 Last month

Rankings

Stargazers count: 7.6%

Dependent packages count: 10.0%

Forks count: 14.2%

Average: 17.9%

Dependent repos count: 21.7%

Downloads: 36.2%

Maintainers (1)

oschwengers

Last synced: 6 months ago

Dependencies

setup.py pypi

biopython *
xopen *

.github/workflows/python-package-conda.yml actions

actions/checkout v2 composite
conda-incubator/setup-miniconda v2 composite

.github/workflows/pythonpackage.yml actions

actions/checkout v2 composite
actions/setup-python v1 composite

.github/workflows/pythonpublish.yml actions

actions/checkout v2 composite
actions/setup-python v1 composite
pypa/gh-action-pypi-publish master composite

environment.yml conda

biopython >=1.78
mash >=2.3.0
mummer4 >=4.0.0beta2
xopen >=1.1.0
