skder

skDER & CiDDER: efficient & high-resolution dereplication of microbial genomes.

https://github.com/raufs/skder

Keywords

average-nucleotide-identity genome-dereplication genomics representative-genome strain-clustering

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: raufs
License: bsd-3-clause
Language: Python
Default Branch: main
Homepage:
Size: 164 MB

Statistics

Stars: 48
Watchers: 1
Forks: 3
Open Issues: 0
Releases: 26

Created almost 3 years ago · Last pushed 10 months ago

Metadata Files

Readme License Citation

skDER (& CiDDER)

skDER (& CiDDER): efficient & high-resolution dereplication of microbial genomes to select representatives.

Warning: Please make sure to use version 1.0.7 or greater to avoid a bug in previous versions!

[!IMPORTANT] In v1.3.3, we introduced a `low_mem_greedy` option for low-memory dereplication of the top 20 or so taxa that are particularly well sequenced (e.g. those with >10k or >20k genomes available). As we showed in the manuscript, dereplication by skDER/CiDDER or other methods is typically not very memory-intensive when applied to an input set of <5,000 genomes, but memory needs can grow beyond that. The `low_mem_greedy` mode was not included in the manuscript; for stats on how it compares, please check out this [wiki page](https://github.com/raufs/skDER/wiki/Benchmarking-skder-low_mem_greedy). The representatives it selects are of slightly lower quality than those from the standard `greedy` approach; for instance, fewer distinct ortholog groups are sampled per representative genome selected. This is because it does not account for "connectivity" (aka "centrality") when prioritizing genomes for selection. It is, however, considerably faster and more computationally efficient, because it leverages skani's `search` function through a greedy/iterative approach that prioritizes genomes based only on N50. As an example, we were able to dereplicate >20,000 Staphylococcus genomes from GTDB R220 in around 2.25 hours using 20 threads and ~1 GB of memory with the command: `skder -t Staphylococcus -c 20 -r R220 -o Staph_R220_skDER_LMG_Results/ -auto -d low_mem_greedy`. For those interested in running this on a laptop: genomes can still add up in size, so make sure you have enough disk space available for the number of genomes you plan to dereplicate.

Installation

Bioconda

Note: for at least some setups, it is critical to specify the conda-forge channel before the bioconda channel so that channel priority is configured properly and the installation succeeds.

Recommended: For a significantly faster installation process, use mamba in place of conda in the below commands, by installing mamba in your base conda environment.

```bash
conda create -n skder_env -c conda-forge -c bioconda skder
conda activate skder_env
```

[!NOTE] For Mac users with Apple Silicon chips, you might need to specify CONDA_SUBDIR=osx-64 prior to conda create as described here. So you would issue: CONDA_SUBDIR=osx-64 conda create -n skder_env -c conda-forge -c bioconda skder.

Installation with mgecut (for removing MGEs prior to dereplication assessment)

To also use the option to prune out positions corresponding to MGEs using either PhiSpy or geNomad:

```bash
conda create -n skder_env -c conda-forge -c bioconda skder genomad=1.8.0 phispy "keras>=2.7,<3.0" "tensorflow>=2.7,<2.16"
conda activate skder_env
```

[!TIP] geNomad is under active development, and new databases and versions are released intermittently. In the above code, we suggest installing v1.8.0 because it has been tested to work with database version 1.5, which we suggest downloading below. These are the versions we used for the skDER/CiDDER manuscript. For the most up-to-date installation of geNomad and its database, please consult its git repo. If there are issues with installation, please first try to install geNomad by itself in a separate conda environment. If that succeeds, the issue is likely due to conflicts that have arisen between geNomad and skDER/CiDDER dependencies; in that case, please let us know via a GitHub issue and we will attempt to resolve it.

Docker

Download the bash wrapper script to simplify usage for skDER or CiDDER:

```bash
# download the skDER wrapper script and make it executable
wget https://raw.githubusercontent.com/raufs/skDER/refs/heads/main/Docker/skDER/runskder.sh
chmod a+x runskder.sh

# or download the CiDDER wrapper script and make it executable
wget https://raw.githubusercontent.com/raufs/skDER/refs/heads/main/Docker/CiDDER/runcidder.sh
chmod a+x runcidder.sh

# test it out!
./runskder.sh -h
./runcidder.sh -h
```

Optionally, if you are interested in filtering MGEs using geNomad, download the relevant databases:

```bash
wget https://zenodo.org/records/8339387/files/genomad_db_v1.5.tar.gz?download=1
mv genomad_db_v1.5* genomad_db_v1.5.tar.gz
tar -zxvf genomad_db_v1.5.tar.gz
```

Conda Manual

```bash
# 1. clone Git repo and change directories into it!
git clone https://github.com/raufs/skDER/
cd skDER/

# 2. create conda environment using yaml file and activate it!
conda env create -f skDER_env.yml -n skDER_env
conda activate skDER_env

# 3. complete python installation with the following command:
pip install -e .
```

Overview

skDER

skDER will perform dereplication of genomes using skani average nucleotide identity (ANI) and aligned fraction (AF) estimates and either a dynamic programming or greedy-based approach. It assesses such pairwise ANI & AF estimates to determine whether two genomes are similar to each other and then chooses which genome is better suited to serve as a representative based on assembly N50 (favoring the more contiguous assembly) and connectedness (favoring genomes deemed similar to a greater number of alternate genomes).
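For readers unfamiliar with the N50 metric mentioned above, here is a minimal sketch using the standard definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name `n50` is hypothetical, and this is not necessarily how skDER computes the value internally.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths (standard definition)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half the assembly size.
        if running >= half_total:
            return length
    return 0  # empty assembly


# Four contigs totalling 1000 bp: 400 + 300 = 700 >= 500, so N50 is 300.
print(n50([100, 200, 300, 400]))
```

A more contiguous assembly of the same total size yields a larger N50, which is why the metric is used to favor one genome over another.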

Compared to dRep by Olm et al. 2017 and galah, skDER does not use a divide-and-conquer approach based on primary clustering with a less-accurate ANI estimator (e.g. MASH or dashing) followed by greedy clustering/dereplication based on more precise ANI estimates (for instance computed using FastANI) in a secondary round. skDER instead leverages advances in accurate yet speedy ANI calculations by skani by Shaw and Yu to simply take a "one-round" approach (albeit skani triangle itself uses a preliminary 80% ANI cutoff based on k-mer sketches, which we by default increase to 90% in skDER). skDER is also primarily designed for selecting distinct genomes for a taxonomic group for comparative genomics rather than for metagenomic application.

skDER, specifically the "dynamic programming" based approach, can still be used for metagenomic applications if users are cautious and filter out MAGs or individual contigs which have high levels of contamination, which can be assessed using CheckM or charcoal. To support this application with the realization that most MAGs likely suffer from incompleteness, we have introduced a parameter/cutoff for the max alignment fraction difference for each pair of genomes. For example, if the AF for genome 1 to genome 2 is 95% (95% of genome 1 is contained in genome 2) and the AF for genome 2 to genome 1 is 80%, then the difference is 15%. Because the default value for the difference cutoff is 10%, in that example the genome with the larger value will automatically be regarded as redundant and become disqualified as a potential representative genome.
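The aligned-fraction check in the example above can be sketched as follows. The function name is hypothetical, and taking the absolute difference is an assumption made for illustration; only the 10% default and the rule "the genome with the larger AF is disqualified" come from the text.

```python
def disqualified_by_af_difference(af_1_to_2, af_2_to_1, max_af_distance_cutoff=10.0):
    """Return 1 or 2 for the genome disqualified by the AF-difference rule, else None.

    af_1_to_2: percent of genome 1 contained in genome 2.
    af_2_to_1: percent of genome 2 contained in genome 1.
    """
    # Assumption: the difference is taken as an absolute value.
    if abs(af_1_to_2 - af_2_to_1) >= max_af_distance_cutoff:
        # The genome with the larger AF is largely contained in the other one,
        # so it is treated as redundant.
        return 1 if af_1_to_2 > af_2_to_1 else 2
    return None


# The example from the text: 95% vs 80% is a 15% difference, so genome 1 is disqualified.
print(disqualified_by_af_difference(95.0, 80.0))
```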

skDER features three distinct algorithms for dereplication (details can be found below):

- dynamic approach: approximates selection of a single representative genome per transitive cluster, resulting in a concise listing of representative genomes; well suited for metagenomic applications.
- greedy approach: performs selection based on a greedy set-cover type approach; better suited to more comprehensively select representative genomes and sample more of a taxon's pangenome [current default].
- greedy low-memory approach: performs selection iteratively using a greedy set-cover type approach where genomes chosen as representatives are prioritized solely based on N50. Should result in lower-quality representative selections compared to the standard greedy mode, which also prioritizes genomes based on connectivity, but should be more memory-efficient.

[!NOTE] The skDER "greedy" algorithm refers only to the selection algorithm; the all-vs-all assessment is still performed using skani triangle. This is in contrast to dRep or galah, where "greedy" refers to their iterative process of selecting a representative genome and then, to avoid all-vs-all comparison, running a targeted ANI assessment of that genome against related genomes determined from primary, coarser clustering. The info needed for selecting the next representative genome is then known, and the process is repeated as many times as needed. While this strategy can speed things up when using FastANI, with skani it does not make much of a difference (in application we found it can be more memory-efficient, significantly so when dealing with >10k genomes, but it is also slower than just using skani triangle directly). For memory efficiency on very large datasets (which should only be relevant for a select set of genera), we thus recently implemented the low_mem_greedy approach.

CiDDER

In v1.2.0, we also introduced a second program called CiDDER (CD-hit based DEReplication) - which allows for optimizing selection of a minimal number of genomes that achieve some level of saturation of the pan-genome of the full set of genomes (see below for details). Note, CD-HIT determines protein clusters, not proper ortholog groups, and as such an approximation is made of the pan-genome space being sampled by representative genomes.

If providing genomes in FASTA format, this method will use pyrodigal for gene calling - which is specific to bacteria/archaea; however, more recently CiDDER can also accept proteome files which should allow it to work on eukaryotes and viruses as well.

Details on Dereplication Algorithms

Using Pan-Genome Saturation (CiDDER)

Starting in v1.2.0, CiDDER was introduced to allow representative genome selection based on pan-genome saturation estimates using CD-HIT. After inferring ORFs using pyrodigal, predicted protein sequences are concatenated into one large FASTA file and clustered using CD-HIT (the clustering parameters can be adjusted). Each genome is thus treated as the set of distinct protein clusters it features.
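As an illustration of treating each genome as a set of protein clusters, the sketch below parses lines in CD-HIT's `.clstr` format into per-genome sets of cluster IDs. It assumes every protein ID begins with its genome name followed by `|`; that naming scheme and the function name `genome_cluster_sets` are hypothetical and not necessarily what CiDDER uses.

```python
import re
from collections import defaultdict


def genome_cluster_sets(clstr_lines, sep="|"):
    """Map each genome to the set of CD-HIT cluster IDs its proteins fall into."""
    genome_clusters = defaultdict(set)
    cluster_id = None
    for line in clstr_lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 12"
            cluster_id = int(line.split()[1])
        elif line and cluster_id is not None:
            # Member line, e.g. "0\t350aa, >G1|p1... *"
            match = re.search(r">(.+?)\.\.\.", line)
            if match:
                genome = match.group(1).split(sep, 1)[0]
                genome_clusters[genome].add(cluster_id)
    return dict(genome_clusters)


example = [
    ">Cluster 0",
    "0\t350aa, >G1|p1... *",
    "1\t348aa, >G2|p7... at 98.00%",
    ">Cluster 1",
    "0\t120aa, >G1|p2... *",
]
print(genome_cluster_sets(example))
```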

Here is an overview of the algorithm:

1. Download or process input genomes.
2. Predict proteins using pyrodigal.
3. Comprehensively cluster all proteins using CD-HIT.
4. Select the genome with the largest number of distinct protein clusters as the initial representative.
5. Iteratively add representative genomes one at a time, each time selecting the genome that adds the most novel protein clusters to the current representative set.
6. Stop adding representative genomes once one of three criteria is met: (i) the next genome adds fewer than X distinct protein clusters (X is 0 by default), (ii) over Y% of the total distinct protein clusters across all genomes are found in the representative genomes selected so far (Y is 90% by default), or (iii) over Z% of the total distinct multi-genome protein clusters across all genomes are found in the representative genomes selected so far (Z is 100% by default). Thus, by default, only Y is used for representative genome selection.
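The selection loop described above can be sketched as follows, assuming each genome has already been reduced to a set of protein cluster IDs. The function and parameter names are hypothetical, and reading "over Y%" as a strict inequality and breaking ties by genome name are assumptions; CiDDER's actual implementation may differ.

```python
def select_representatives(genome_clusters, new_needed=0,
                           total_saturation=90.0, multi_genome_saturation=100.0):
    """Greedy, saturation-based selection over {genome: set_of_cluster_ids}."""
    all_clusters = set().union(*genome_clusters.values())
    if not all_clusters:
        return []
    # Clusters found in more than one genome ("multi-genome" clusters).
    seen_in = {}
    for clusters in genome_clusters.values():
        for cluster in clusters:
            seen_in[cluster] = seen_in.get(cluster, 0) + 1
    multi = {c for c, n in seen_in.items() if n > 1}

    remaining = dict(genome_clusters)
    representatives, covered = [], set()
    while remaining:
        # Pick the genome adding the most novel clusters (ties: first by name).
        best = max(sorted(remaining), key=lambda g: len(remaining[g] - covered))
        gain = len(remaining[best] - covered)
        if representatives and gain < new_needed:  # criterion (i)
            break
        representatives.append(best)
        covered |= remaining.pop(best)
        total_pct = 100.0 * len(covered) / len(all_clusters)
        multi_pct = 100.0 * len(covered & multi) / len(multi) if multi else 0.0
        # Criteria (ii) and (iii), read as strict "over" thresholds.
        if total_pct > total_saturation or multi_pct > multi_genome_saturation:
            break
    return representatives


genomes = {"A": {1, 2, 3, 4, 5, 6}, "B": {5, 6, 7, 8}, "C": {1, 2}, "D": {9, 10}}
print(select_representatives(genomes))
```

In this toy input, genome C is never selected because everything it carries is already covered by A.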

Using the Dynamic Programming Dereplication Approach (skDER)

The dynamic dereplication method in skDER approximates selection of a single representative for coarser clusters of genomes using a dynamic programming approach in which a set of genomes deemed redundant is tracked, avoiding the need to actually cluster genomes.

Here is an overview of the algorithm:

1. Download or process input genomes.
2. Compute and create a tsv linking each genome to its N50 assembly quality metric (N50[g]).
3. Compute ANI and AF using skani triangle to get a tsv "edge listing" between pairs of genomes (with filters applied based on ANI and AF cutoffs).
4. Run through "edge listing" tsv on first pass and compute connectivity (C[g]) for each genome: how many other genomes it is similar to at a certain threshold.
5. Run through "N50" tsv and store information.
6. Second pass through "edge listing" tsv, assessing each pair one at a time while keeping track of a single set of genomes regarded as redundant:
    - if (AF[g_1] - AF[g_2]) >= parameter `--max-af-distance-cutoff` (default of 10%), then automatically regard the genome corresponding to max(AF[g_1], AF[g_2]) as redundant.
    - else calculate the following score for each genome: N50[g]*C[g] = S[g], and regard the genome corresponding to min(S[g_1], S[g_2]) as redundant.
7. Second pass through "N50" tsv file, recording each genome identifier that was never deemed redundant.
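The two passes over the edge listing can be sketched in memory as follows. The input layout (tuples of genome 1, genome 2, AF of genome 1, AF of genome 2, already filtered by the ANI and AF cutoffs), the function name, the use of an absolute AF difference, and the tie handling are all assumptions made for illustration; skDER itself works from tsv files.

```python
def dynamic_dereplicate(n50, edges, max_af_distance_cutoff=10.0):
    """Return genomes never marked redundant, following the steps listed above."""
    # First pass: connectivity C[g] = number of similar genomes.
    connectivity = {g: 0 for g in n50}
    for g1, g2, _af1, _af2 in edges:
        connectivity[g1] += 1
        connectivity[g2] += 1

    # Second pass: decide which genome of each pair is redundant.
    redundant = set()
    for g1, g2, af1, af2 in edges:
        if abs(af1 - af2) >= max_af_distance_cutoff:
            # The genome with the larger AF is regarded as redundant.
            redundant.add(g1 if af1 > af2 else g2)
        else:
            s1 = n50[g1] * connectivity[g1]
            s2 = n50[g2] * connectivity[g2]
            # The lower-scoring genome is redundant (ties go against g2 here).
            redundant.add(g1 if s1 < s2 else g2)

    return [g for g in n50 if g not in redundant]


n50 = {"A": 100, "B": 40, "C": 80, "D": 10}
edges = [("A", "B", 95, 94), ("B", "C", 93, 92), ("C", "D", 99, 80)]
print(dynamic_dereplicate(n50, edges))
```

In this toy input, C has the highest score but is still marked redundant because its AF exceeds D's by more than the cutoff.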

Using the Greedy Dereplication Approach (skDER)

skDER greedy clustering generally leads to a larger, more-comprehensive selection of representative genomes that covers more of the pan-genome.

Here is an overview of this alternate approach:

1. Download or process input genomes.
2. Compute and create a tsv linking each genome to its N50 assembly quality metric (N50[g]).
3. Compute ANI and AF using skani triangle to get a tsv "edge listing" between pairs of genomes (with filters applied based on ANI and AF cutoffs).
4. Run through "edge listing" tsv on first pass and compute connectivity (C[g]) for each genome: how many other genomes it is similar to at a certain threshold.
    - Only consider a genome as connected to a focal genome if they share an ANI greater than the `--percent-identity-cutoff` (default of 99.5%) and the comparing genome exhibits an AF greater than the `--aligned-fraction-cutoff` (default of 50%) to the focal genome (i.e. it is sufficiently representative of both the core and auxiliary content of the focal genome).
5. Run through "N50" tsv and compute the score for each genome: N50[g]*C[g] = S[g]. Write the results to a new tsv where each line corresponds to a single genome, the second column holds the computed S[g], and the third column lists the genomes connected to the focal genome.
6. Sort the resulting tsv file by S[g] in descending order and use a greedy approach to select representative genomes, selecting a genome only if it has not already been accounted for as a connected genome of a previously selected, higher-scoring representative.
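The greedy selection above can be sketched in memory as follows. The edge layout (genome 1, genome 2, ANI, AF of genome 1, AF of genome 2), the function name, and the reading that a genome counts as connected to a focal genome when its own AF passes the cutoff are assumptions made for illustration. The cutoffs are passed explicitly, and the values in the example are illustrative only.

```python
def greedy_dereplicate(n50, edges, ani_cutoff, af_cutoff):
    """Greedy set-cover style selection of representatives, ranked by N50 * connectivity."""
    # Directed connectivity: connected[focal] holds genomes the focal genome can represent.
    connected = {g: set() for g in n50}
    for g1, g2, ani, af1, af2 in edges:
        if ani <= ani_cutoff:
            continue
        if af2 > af_cutoff:
            connected[g1].add(g2)
        if af1 > af_cutoff:
            connected[g2].add(g1)

    # Score S[g] = N50[g] * C[g]; sort descending (ties broken by name).
    score = {g: n50[g] * len(connected[g]) for g in n50}
    ranked = sorted(n50, key=lambda g: (-score[g], g))

    representatives, accounted = [], set()
    for genome in ranked:
        if genome in accounted:
            continue  # already covered by a higher-scoring representative
        representatives.append(genome)
        accounted.add(genome)
        accounted |= connected[genome]
    return representatives


n50 = {"A": 100, "B": 40, "C": 80, "D": 10}
edges = [("A", "B", 99.8, 95, 96), ("B", "C", 99.7, 93, 60), ("C", "D", 99.9, 99, 97)]
print(greedy_dereplicate(n50, edges, ani_cutoff=99.5, af_cutoff=90.0))
```

Genomes with no connections score zero, sort last, and are still selected, so singletons always end up as their own representatives.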

Using the Low-Memory Greedy Dereplication Approach (skDER)

Starting from v1.3.3, skDER also allows users to request low_mem_greedy clustering instead. This generally leads to a larger, more-comprehensive selection of representative genomes that covers more of the pan-genome. Note that if secondary clustering is invoked, skDER will end up running skani dist between representative and non-representative genomes, which requires more memory than simply using skani search.

Here is an overview of this alternate approach:

1. Download or process input genomes.
2. Compute and create a tsv linking each genome to its N50 assembly quality metric (N50[g]).
3. Create a sketch of all the genomes using skani sketch.
4. Select the genome with the largest N50[g] as the first representative and use skani search to query the selected genome against all others (in the sketch database).
5. For each matching genome, only consider it as connected to the focal genome if they share an ANI greater than the `--percent-identity-cutoff` (default of 99.5%) and the comparing genome exhibits an AF greater than the `--aligned-fraction-cutoff` (default of 50%) to the focal genome (i.e. it is sufficiently representative of both the core and auxiliary content of the focal genome). Add such genomes to a set of redundant genomes that are already accounted for by the selected representative genome.
6. Repeat this process with the next representative genome, i.e. the genome with the highest N50 that has not been regarded as redundant, and continue until all genomes are either representatives or deemed redundant.
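The iteration above can be sketched as follows. The `search` argument is a hypothetical callable standing in for a skani search query against the sketch database; it is assumed to return (hit genome, ANI, AF) tuples. It is not a real skani API, and the cutoff values in the example are illustrative only.

```python
def low_mem_greedy(n50, search, ani_cutoff, af_cutoff):
    """Iteratively pick the highest-N50 genome not yet deemed redundant."""
    representatives, redundant = [], set()
    # Visit genomes from largest to smallest N50 (ties broken by name).
    for genome in sorted(n50, key=lambda g: (-n50[g], g)):
        if genome in redundant:
            continue
        representatives.append(genome)
        # Only the current representative is queried, so no all-vs-all
        # comparison ever needs to be held in memory.
        for hit, ani, af in search(genome):
            if hit != genome and ani > ani_cutoff and af > af_cutoff:
                redundant.add(hit)
    return representatives


# Toy stand-in for search results, keyed by query genome.
toy_hits = {
    "A": [("B", 99.8, 96.0)],
    "C": [("D", 99.9, 97.0), ("B", 99.7, 60.0)],
}
n50 = {"A": 100, "B": 40, "C": 80, "D": 10}
print(low_mem_greedy(n50, lambda g: toy_hits.get(g, []), ani_cutoff=99.5, af_cutoff=90.0))
```

Because the order depends only on N50, a well-connected genome with a modest N50 has no advantage here, which is the trade-off described for this mode.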

Test Case

We provide a simple test case to dereplicate the six genomes available for Cutibacterium granulosum in GTDB release 214 using skDER, together with expected results.

To run this test case:

```bash
# Download test data
wget https://github.com/raufs/skDER/raw/main/test_case.tar.gz

# Download bash script to run skder
wget https://raw.githubusercontent.com/raufs/skDER/main/run_tests.sh

# Run the wrapper script to perform testing
bash ./run_tests.sh
```

Usage

If experiencing issues related to "Argument list too long", consider issuing `ulimit -S -s unlimited` prior to running skDER.

```bash
# the skder executable should be in the path after installation and can be referenced as such:
skder -h
```

The help function should return the following:

```
usage: skder [-h] [-g GENOMES [GENOMES ...]] [-t TAXA_NAME] -o OUTPUT_DIRECTORY
             [-d DEREPLICATION_MODE] [-i PERCENT_IDENTITY_CUTOFF]
             [-f ALIGNED_FRACTION_CUTOFF] [-a MAX_AF_DISTANCE_CUTOFF] [-tc]
             [-p SKANI_TRIANGLE_PARAMETERS] [-s] [-fm] [-gd GENOMAD_DATABASE]
             [-n] [-mn MINIMAL_N50] [-l] [-r GTDB_RELEASE] [-auto]
             [-mm MAX_MEMORY] [-c THREADS] [-v]

Program: skder
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

skDER: efficient & high-resolution dereplication of microbial genomes to select
       representative genomes.

skDER will perform dereplication of genomes using skani average nucleotide identity
(ANI) and aligned fraction (AF) estimates and either a dynamic programming or
greedy-based based approach. It assesses such pairwise ANI & AF estimates to determine
whether two genomes are similar to each other and then chooses which genome is better
suited to serve as a representative based on assembly N50 (favoring the more contiguous
assembly) and connectedness (favoring genomes deemed similar to a greater number of
alternate genomes).

Note, if --filter-mge is requested, the original paths to genomes are reported but
the statistics reported in the clustering reports (e.g. ANI, AF) will all be based
on processed (MGE filtered) genomes. Importantly, computation of N50 is performed
before MGE filtering to not penalize genomes of high quality that simply have many
MGEs and enable them to still be selected as representatives.

If you use skDER for your research, please kindly cite both:

Fast and robust metagenomic sequence comparison through sparse chaining with skani.
Nature Methods. Shaw and Yu, 2023.

and

skDER & CiDDER: two scalable approaches for microbial dereplication. Microbial
Genomics. Salamzade, Kottapalli, and Kalan, 2025.

options:
  -h, --help            show this help message and exit
  -g GENOMES [GENOMES ...], --genomes GENOMES [GENOMES ...]
                        Genome assembly file paths or paths to containing
                        directories. Files should be in FASTA format and can
                        be gzipped (accepted suffices are: *.fasta, *.fa,
                        *.fas, or *.fna) [Optional].
  -t TAXA_NAME, --taxa-name TAXA_NAME
                        Genus or species identifier from GTDB for which to
                        download genomes for and include in dereplication
                        analysis [Optional].
  -o OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY
                        Output directory.
  -d DEREPLICATION_MODE, --dereplication-mode DEREPLICATION_MODE
                        Whether to use a "dynamic" (more concise), "greedy"
                        (more comprehensive), or "low_mem_greedy" (currently
                        experimental) approach to selecting representative
                        genomes. [Default is "greedy"]
  -i PERCENT_IDENTITY_CUTOFF, --percent-identity-cutoff PERCENT_IDENTITY_CUTOFF
                        ANI cutoff for dereplication [Default is 99.5].
  -f ALIGNED_FRACTION_CUTOFF, --aligned-fraction-cutoff ALIGNED_FRACTION_CUTOFF
                        Aligned cutoff threshold for dereplication - only
                        needed by one genome [Default is 50.0].
  -a MAX_AF_DISTANCE_CUTOFF, --max-af-distance-cutoff MAX_AF_DISTANCE_CUTOFF
                        Maximum difference for aligned fraction between a
                        pair to automatically disqualify the genome with a
                        higher AF from being a representative [Default is
                        10.0].
  -tc, --test-cutoffs   Assess clustering using various pre-selected cutoffs.
  -p SKANI_TRIANGLE_PARAMETERS, --skani-triangle-parameters SKANI_TRIANGLE_PARAMETERS
                        Options for skani triangle. Note ANI and AF cutoffs
                        are specified separately and the -E parameter is
                        always requested. [Default is "-s X", where X is 10
                        below the ANI cutoff].
  -s, --sanity-check    Confirm each FASTA file provided or downloaded is
                        actually a FASTA file. Makes it slower, but generally
                        good practice.
  -fm, --filter-mge     Filter predicted MGE coordinates along genomes before
                        dereplication assessment but after N50 computation.
  -gd GENOMAD_DATABASE, --genomad-database GENOMAD_DATABASE
                        If filter-mge is specified, it will by default use
                        PhiSpy; however, if a database directory for geNomad
                        is provided - it will use that instead to predict
                        MGEs.
  -n, --determine-clusters
                        Perform secondary clustering to assign
                        non-representative genomes to their closest
                        representative genomes.
  -mn MINIMAL_N50, --minimal_n50 MINIMAL_N50
                        Minimal N50 of genomes to be included in
                        dereplication [Default is 0].
  -l, --symlink         Symlink representative genomes in results
                        subdirectory instead of performing a copy of the
                        files.
  -r GTDB_RELEASE, --gtdb-release GTDB_RELEASE
                        Which GTDB release to use if -t argument issued
                        [Default is R226].
  -auto, --automate     Automatically skip warnings and download genomes from
                        NCBI if -t argument issued. Automatation off by
                        default to prevent unexpected downloading of large
                        genomes [Default is False].
  -mm MAX_MEMORY, --max-memory MAX_MEMORY
                        Max memory in Gigabytes [Default is 0 = unlimited].
  -c THREADS, --threads THREADS
                        Number of threads/processes to use [Default is 1].
  -v, --version         Report version of skDER.
```

Usage for CiDDER

```bash
# the cidder executable should be in the path after installation and can be referenced as such:
cidder -h
```

The help function should return the following:

```
usage: cidder [-h] [-g GENOMES [GENOMES ...]] [-p PROTEOMES [PROTEOMES ...]]
              [-t TAXA_NAME] [-a NEW_PROTEINS_NEEDED] [-ts TOTAL_SATURATION]
              [-mgs MULTI_GENOME_SATURATION] [-rs REQUIRE_SIMILARITY]
              -o OUTPUT_DIRECTORY [-cdp CD_HIT_PARAMS] [-mg] [-e] [-s] [-fm]
              [-gd GENOMAD_DATABASE] [-n] [-ns] [-mn MINIMAL_N50] [-l]
              [-r GTDB_RELEASE] [-auto] [-c THREADS] [-mm MAX_MEMORY] [-v]

Program: cidder
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

CiDDER: Performs genome dereplication based on CD-HIT clustering of proteins to
        select a representative set of genomes which adequately samples the
        pangenome space. Because gene prediction is performed using pyrodigal,
        geneder only works for bacterial genomes at the moment.

The general algorithm is to first select the genome with the most number of distinct
open-reading-frames (ORFS; predicted genes) and then iteratively add genomes based on
which maximizes the number of new ORFs. This iterative addition of selected genomes
is performed until: (i) the next genome to add does not have a minimum of X new disintct
ORFs to add to the set of ORFs belonging, (ii) some percentage Y of the total distinct
ORFs are found to have been sampled, or (iii) some percentage Z of the total multi-genome
distinct ORFs are found to have been sampled. The "added-on" genomes from the iterative
procedure are listed as representative genomes.

For information on how to alter CD-HIT parameters, please see:
https://github.com/weizhongli/cdhit/blob/master/doc/cdhit-user-guide.wiki#cd-hit

Note, if --filter-mge is requested, the statistics reported in clustering reports (number
of proteins overlapping, ANI) in the clustering reports will all be based on processed
(MGE filtered) genomes. However, the final representative genomes in the
Dereplicated_Representative_Genomes/ folder will be the original unprocesed genomes.

If you use CiDDER for your research, please kindly cite both:

CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics.
Fu et al. 2012.

and

skDER & CiDDER: microbial genome dereplication approaches for comparative genomic and
metagenomic applications. Salamzade, Kottapalli, and Kalan, 2025.

options:
  -h, --help            show this help message and exit
  -g GENOMES [GENOMES ...], --genomes GENOMES [GENOMES ...]
                        Genome assembly file paths or paths to containing
                        directories. Files should be in FASTA format
                        (accepted suffices are: *.fasta, *.fa, *.fas, or
                        *.fna) [Optional].
  -p PROTEOMES [PROTEOMES ...], --proteomes PROTEOMES [PROTEOMES ...]
                        Proteome file paths or paths to containing
                        directories. Files should be in FASTA format
                        (accepted suffices are: *.fasta, *.fa, or *.faa)
                        [Optional].
  -t TAXA_NAME, --taxa-name TAXA_NAME
                        Genus or species identifier from GTDB for which to
                        download genomes for and include in dereplication
                        analysis [Optional].
  -a NEW_PROTEINS_NEEDED, --new-proteins-needed NEW_PROTEINS_NEEDED
                        The number of new protein clusters needed to add
                        [Default is 0].
  -ts TOTAL_SATURATION, --total-saturation TOTAL_SATURATION
                        The percentage of total proteins clusters needed to
                        stop representative genome selection [Default is
                        90.0].
  -mgs MULTI_GENOME_SATURATION, --multi-genome-saturation MULTI_GENOME_SATURATION
                        The percentage of total multi-genome protein clusters
                        needed to stop representative genome selection
                        [Default is 100.0].
  -rs REQUIRE_SIMILARITY, --require-similarity REQUIRE_SIMILARITY
                        Require non-representative genomes to have X% of
                        their protein clusters represented by an individual
                        representative genome [Default is 0.0].
  -o OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY
                        Output directory.
  -cdp CD_HIT_PARAMS, --cd-hit-params CD_HIT_PARAMS
                        CD-HIT parameters to use for clustering proteins -
                        select carefully (don't set threads or memory - those
                        are done by default in cidder) and surround by quotes
                        [Default is: "-n 5 -c 0.95 -aL 0.75 -aS 0.90"]
  -mg, --metagenome-mode
                        Run pyrodigal using metagenome mode.
  -e, --include-edge-orfs
                        Include proteins from ORFs that hang off the edge of
                        a contig/scaffold.
  -s, --sanity-check    Confirm each FASTA file provided or downloaded is
                        actually a FASTA file. Makes it slower, but generally
                        good practice.
  -fm, --filter-mge     Filter predicted MGE coordinates along genomes before
                        dereplication assessment but after N50 computation.
  -gd GENOMAD_DATABASE, --genomad-database GENOMAD_DATABASE
                        If filter-mge is specified, it will by default use
                        PhiSpy; however, if a database directory for geNomad
                        is provided - it will use that instead to predict
                        MGEs.
  -n, --determine-clusters
                        Perform secondary clustering to assign
                        non-representative genomes to their closest
                        representative genomes based on shared protein
                        clusters.
  -ns, --determine-clusters-skani
                        Perform secondary clustering to assign
                        non-representative genomes to their closest
                        representative genomes based on skani-computed ANI.
  -mn MINIMAL_N50, --minimal_n50 MINIMAL_N50
                        Minimal N50 of genomes to be included in
                        dereplication [Default is 0].
  -l, --symlink         Symlink representative genomes in results
                        subdirectory instead of performing a copy of the
                        files.
  -r GTDB_RELEASE, --gtdb-release GTDB_RELEASE
                        Which GTDB release to use if -t argument issued
                        [Default is R226].
  -auto, --automate     Automatically skip warnings and download genomes from
                        NCBI if -t argument issued. Automatation off by
                        default to prevent unexpected downloading of large
                        genomes [Default is False].
  -c THREADS, --threads THREADS
                        Number of threads/processes to use [Default is 1].
  -mm MAX_MEMORY, --max-memory MAX_MEMORY
                        Max memory in Gigabytes [Default is 0 = unlimited].
  -v, --version         Report version of CiDDER.
```

Citation notice

skDER relies heavily on advances made by skani for fast ANI estimation while retaining accuracy - thus if you use skDER for your research please cite skani:

Fast and robust metagenomic sequence comparison through sparse chaining with skani

as well as the skDER manuscript:

skDER and CiDDER: two scalable approaches for microbial genome dereplication

If you use the option to download genomes for a taxonomy based on GTDB classifications, please also cite:

GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy

If you use CiDDER, please also consider citing pyrodigal (for gene-calling) and CD-HIT (for protein clustering):

CD-HIT: accelerated for clustering the next-generation sequencing data

Pyrodigal: Python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes

If you use mgecut (for removal of predicted MGEs) then please cite either PhiSpy (default) or geNomad for their annotation:

Identification of mobile genetic elements with geNomad

PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies

Acknowledgments

We thank Titus Brown, Tessa Pierce-Ward, and Karthik Anantharaman for helpful discussions on the development of skDER/CiDDER, in particular the idea to directly assess the pan-genome space sampled by representative genomes. We also thank users on GitHub issues for suggesting ideas for new features.

LICENSE

``` BSD 3-Clause License

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ```

Owner

Name: Rauf Salamzade
Login: raufs
Kind: user
Location: Madison, WI

Twitter: salamzader
Repositories: 2
Profile: https://github.com/raufs

Bioinformatician interested in studying eco/evo principles underlying microbiome dynamics/stability.

GitHub Events

Total

Create event: 4
Release event: 4
Issues event: 4
Watch event: 6
Delete event: 1
Issue comment event: 13
Push event: 35
Gollum event: 6
Pull request event: 1
Fork event: 1

Last Year

Create event: 4
Release event: 4
Issues event: 4
Watch event: 6
Delete event: 1
Issue comment event: 13
Push event: 35
Gollum event: 6
Pull request event: 1
Fork event: 1

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 2
Total pull requests: 1
Average time to close issues: 4 days
Average time to close pull requests: N/A
Total issue authors: 2
Total pull request authors: 1
Average comments per issue: 6.5
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 1
Average time to close issues: 4 days
Average time to close pull requests: N/A
Issue authors: 2
Pull request authors: 1
Average comments per issue: 6.5
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

skder

Science Score: 49.0%
