kraken2_classification

Snakemake workflow for metagenomic classification with Kraken2

https://github.com/bhattlab/kraken2_classification

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: ncbi.nlm.nih.gov, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.6%) to scientific vocabulary
Last synced: 7 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: bhattlab
  • Language: Python
  • Default Branch: master
  • Size: 13.8 MB
Statistics
  • Stars: 65
  • Watchers: 3
  • Forks: 16
  • Open Issues: 4
  • Releases: 2
Created over 7 years ago · Last pushed almost 3 years ago
Metadata Files
  • Readme
  • Citation

README.md

Kraken2 classification

A Snakemake pipeline wrapper of the Kraken2 short-read metagenomic classification software, with additional tools for analysis and plotting, compositional data analysis, and differential abundance calculation. Designed and maintained by Ben Siranosian in Ami Bhatt's lab at Stanford University.

Introduction

Kraken2 is a short-read classification system that is fast and memory efficient. It assigns a taxonomic identification to each sequencing read by using the lowest common ancestor (LCA) of the matching genomes in the database. Using Bracken on top of Kraken2 provides accurate estimates of the proportions of different species. This guide covers some of the basics, but the full Kraken2 manual has much more detail.
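The LCA rule can be illustrated with a toy example. The sketch below is a simplification (Kraken2 actually scores k-mer matches along root-to-leaf paths in a compact taxonomy; the miniature taxonomy here is hypothetical): a read whose k-mers match several genomes is assigned to the lowest taxon that is an ancestor of all of them.

```python
def lineage(taxon, parent):
    """Return the path from a taxon up to the root in a child->parent tree."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path

def lca(taxa, parent):
    """Lowest common ancestor of a set of taxa."""
    paths = [lineage(t, parent) for t in taxa]
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    # Walking down from any taxon, the first node shared by all
    # lineages is the lowest common ancestor.
    return next(t for t in paths[0] if t in common)

# Tiny hypothetical taxonomy: root -> Bacteria -> Enterobacteriaceae -> species
parent = {
    "E. coli": "Enterobacteriaceae",
    "S. enterica": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
    "Bacteria": "root",
}
print(lca({"E. coli", "S. enterica"}, parent))  # Enterobacteriaceae
print(lca({"E. coli"}, parent))                 # E. coli
```

A read matching only E. coli stays at the species level; a read matching both species is pushed up to their shared family.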

Table of contents

  1. Installation
  2. Usage
  3. Available databases
  4. Downstream processing and plotting
  5. Additional considerations
  6. Expanded database construction
  7. Using metagenome assembled genomes as a database
  8. GCTx data parsing

Quickstart

Install

If you're in the Bhatt lab, most of this work will take place on the SCG cluster. External users should set this pipeline up on their infrastructure of choice: cloud, HPC, or even a laptop will work for processing small datasets. You will have to download or build a database, set the database options, and create the sample input files. All steps of this pipeline are containerized, meaning only snakemake and singularity are required to run all tools.

If you're in the Bhatt lab, use these instructions to set up snakemake and set up a profile to submit jobs to the SCG cluster. External users should follow these instructions:

  1. Install mambaforge.
  2. Create a fresh environment with snakemake, or add it to an existing environment. Activate this environment for any step using this pipeline:

```
mamba create --name snakemake --channel conda-forge --channel bioconda snakemake
conda activate snakemake
```
  3. Install singularity.

Then, clone this repo in a convenient location:

```
git clone https://github.com/bhattlab/kraken2_classification.git
cd kraken2_classification
```

Run with test data

A Kraken2 database is required to use this pipeline. Pre-built databases can be downloaded from Ben Langmead's site. As an example, we download the standard database limited to 8 GB of memory use and unpack it into a folder to use with the tests:

```
cd kraken2_classification/tests
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20230605.tar.gz
mkdir db
tar -C db -xvf k2_standard_08gb_20230605.tar.gz
```

A small test dataset from Yassour et al. (2018) is included in this repo: 10,000 reads from several timepoints of a mother-infant pair. Even with such low coverage, the differences in microbiome composition are apparent in clustering and taxonomic barplots. Launch an end-to-end test run with a command like so:

```
# Launch this from the kraken2_classification directory
snakemake -s Snakefile --configfile tests/test_config/config_pe.yaml -j1 --use-singularity
```

The script tests/run_tests.sh ensures basic functionality of the pipeline executes as expected.

Run with real-world data

Copy the config.yaml file into the working directory for your samples. Change the options to suit your project. The main input is the sample_reads_file which defines the mapping from sample names to sequencing reads. See Usage for more detail.
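For illustration, a sample_reads_file might look like the sketch below: tab-delimited, one sample per line, with the sample name followed by the forward and (optional) reverse read files. The sample names and paths here are made up, and the comments in config.yaml are the authoritative reference for the exact column layout:

```
mother_t1	/path/to/mother_t1_R1.fq.gz	/path/to/mother_t1_R2.fq.gz
infant_t1	/path/to/infant_t1_R1.fq.gz	/path/to/infant_t1_R2.fq.gz
```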

On the Bhatt lab SCG cluster, you can then launch the workflow with a snakemake command like so:

```
# Snakemake workflow - change options in config.yaml first
snakemake -s path/to/Snakefile --configfile config.yaml --use-singularity \
    --singularity-args '--bind /oak/,/labs/,/home' --profile scg --jobs 99
```

If you're not in the Bhatt lab, a more general command should be sufficient, but you might need to add singularity bind arguments or a profile for SLURM job submission, depending on your configuration. This example uses 8 cores, but that can be changed to reflect available resources.

```
snakemake -s path/to/Snakefile --configfile config.yaml --use-singularity --jobs 8 --cores 8
```

After running the workflow, once you're satisfied with the results, run the cleanup rule to remove temporary files that are no longer needed:

```
snakemake cleanup -s path/to/Snakefile --configfile config.yaml
```

Run analysis on existing data

If you have a collection of kraken/bracken reports and just want to run the downstream analysis in this pipeline, you can provide the sample_reports_file in the config, which is a map from sample names to kraken and bracken report files. See tests/test_config/config_downstream_only_bracken.yaml as an example. Then, launch the pipeline with Snakefile_downstream_only. Tune the filtering and job submission parameters to meet your needs.

```
snakemake -s Snakefile_downstream_only --configfile tests/test_config/config_downstream_only_bracken.yaml -j1 --use-singularity
```
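A sample_reports_file might be sketched as follows: one sample per line, mapping the sample name to its existing kraken and bracken reports. The names and paths are hypothetical, and the column layout is an assumption; the referenced test config is the authoritative example:

```
mother_t1	/path/to/mother_t1.krak.report	/path/to/mother_t1.krak_bracken.report
infant_t1	/path/to/infant_t1.krak.report	/path/to/infant_t1.krak_bracken.report
```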

Parsing output reports

The Kraken reports (classification/sample.krak.report), the Bracken reports (sample.krak_bracken.report), and the data matrices or GCTx objects in the processed_results folder are the most useful for downstream analysis. See Downstream processing and plotting for details on using the data in R.
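Because the report files are plain tab-separated text, they are easy to load without special tooling. A minimal sketch, assuming the standard Kraken2 report columns (percentage, clade reads, direct reads, rank code, NCBI taxid, indented name) and using a synthetic two-line excerpt:

```python
import csv
import io

# Synthetic excerpt in the standard Kraken2 report layout (made-up numbers).
report_text = (
    " 95.00\t9500\t120\tD\t2\t  Bacteria\n"
    " 40.00\t4000\t4000\tS\t562\t    Escherichia coli\n"
)

def parse_kraken_report(handle):
    """Return one dict per taxon line of a Kraken2-style report."""
    rows = []
    for pct, clade_reads, direct_reads, rank, taxid, name in csv.reader(handle, delimiter="\t"):
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade_reads),
            "direct_reads": int(direct_reads),
            "rank": rank,
            "taxid": int(taxid),
            # Leading spaces (two per level) encode depth in the taxonomy tree.
            "depth": (len(name) - len(name.lstrip())) // 2,
            "name": name.strip(),
        })
    return rows

rows = parse_kraken_report(io.StringIO(report_text))
species = [r["name"] for r in rows if r["rank"] == "S"]
print(species)  # ['Escherichia coli']
```

The same function works on a real report by passing an open file handle instead of the StringIO.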

Example output

The pipeline outputs data, results and figures in the structure below.

```
classification/
    sample.krak                 Kraken results - classification of each read. These files can get
                                very large and are unnecessary if you only want the reports.
    sample.krak.report          Kraken report - reads and percentages at each taxonomic level.
    sample.krak.report.bracken  Standard Bracken report at the species level. Not useful; use the one below.
    sample.krak_bracken.report  Most useful format of the Bracken results.

processed_results/
    diversity.txt               Diversity of each sample at each taxonomic level.

    ALDEx2_differential_abundance/    Compositional data analysis done with the ALDEx2 package.
                                      Only done if you have 2 groups in the sample_groups file.
        aldex_result_[].tsv               Differential abundance at the given taxonomic level.
        aldex_scatter_[].pdf              Scatterplot of effect vs dispersion, with significant hits highlighted.
        aldex_significant_boxplots_[].pdf Boxplot of any significant hits.

    braycurtis_matrices/
        braycurtis_distance_[].tsv  Matrix of Bray-Curtis distances between samples at each taxonomic level.

    plots/                      Lots of plots!
        classified_taxonomy_barplot_[].pdf  Barplot at each taxonomic level.
        compositional_PCA_plot.pdf          PCA done on clr values.
        diversity_allsamples.pdf            Diversity barplot.
        diversity_by_group.pdf              Diversity barplot, stratified by sample group.
        PCoA_2D_plot_labels.pdf             Principal coordinates analysis, calculated on Bray-Curtis distances.
        PCoA_2D_plot_nolabels.pdf           Same as above, without the point labels.
        rarefaction_curve.pdf               Rarefaction "collectors" curve plot.

    taxonomy_gctx_classified_only/
        bracken_[]_reads.gctx       GCTx file with a matrix of reads classified at each level.
        bracken_[]_percentage.gctx  Same, but as a percentage of classified reads.

    taxonomy_matrices_classified_only/
        bracken_[]_reads.txt        Matrix of taxa by sample classified reads.
        bracken_[]_percentage.txt   Same, but as a percentage of classified reads.
        clr_values_[].txt           Centered log-ratio values at each level, from compositional data analysis.

processed_results_krakenonly/
    Same as above, but using the results without Bracken. Also includes taxonomy matrices
    that contain unclassified reads (as Bracken no longer reports unclassified reads).

unmapped_reads/
    sample_unmapped_1.fq        Only present if selected in the config file; reads that are not
    sample_unmapped_2.fq        classified, as paired-end fastq.
```
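As background for the Bray-Curtis matrices above: the dissimilarity between two samples is the sum of absolute differences in taxon counts, divided by the total counts. A minimal sketch (the pipeline computes this in R; the count vectors below are made up):

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two equal-length count vectors."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    # 0.0 means identical composition; 1.0 means no shared taxa.
    return num / den if den else 0.0

# Two samples' read counts over the same three taxa (made-up numbers)
sample_a = [10, 0, 5]
sample_b = [6, 4, 5]
print(round(bray_curtis(sample_a, sample_b), 3))  # 0.267
```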

Taxonomic barplot (figure: default_barplot)

PCoA plot example (figure: pcoa_plot)

Changelog

2023-06-07

v2.0 (breaking changes introduced to configuration files and the way parameters are used). This set of changes did a bit to modernize the pipeline:
  • All steps are now available with containerized execution
  • Created a separate pipeline, Snakefile_downstream_only, which works from a list of report files to run only the downstream analysis steps
  • Included a small test dataset and better test execution
  • Various code and README/manual changes
  • License file added

2019-09-01

The outputs of this pipeline have been vastly improved! Both internal and saved data now use the GCTx data format, from the CMapR package. Basically, a GCT object is a data matrix with associated row and column metadata. This allows consistent metadata to live with the classification data, for both the rows (taxonomy information) and the columns (sample metadata). See section 8, GCTx data parsing, for more information and tools for working with the new implementation.
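The GCT data model can be sketched in a few lines of Python. This is an illustration of the idea only, not the cmapR/GCTx API or file format, and all names and numbers are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class GCTLike:
    """Toy stand-in for a GCT object: a matrix plus aligned row/column metadata."""
    matrix: list    # rows = taxa, columns = samples
    row_meta: list  # one dict per row, e.g. taxonomy information
    col_meta: list  # one dict per column, e.g. sample metadata

    def column(self, sample):
        """Read counts for one sample, located via the column metadata."""
        j = next(i for i, m in enumerate(self.col_meta) if m["sample"] == sample)
        return [row[j] for row in self.matrix]

gct = GCTLike(
    matrix=[[9500, 120], [4000, 8000]],
    row_meta=[{"name": "Bacteroides"}, {"name": "Escherichia"}],
    col_meta=[{"sample": "mother_t1"}, {"sample": "infant_t1"}],
)
print(gct.column("infant_t1"))  # [120, 8000]
```

The point of the design is that subsetting or reordering the matrix can carry the taxonomy and sample annotations along with it, rather than tracking them in separate files.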

Also as of this update, the NCBI taxonomy information used by Kraken is filtered and cleaned up before any data or figures are saved. For example, many taxonomy levels that were previously labeled simply "environmental samples" are now named with their parent taxon name to remove ambiguity. Also, levels without a proper rank designation (listed with an abbreviation and a number in the Kraken report) have been forced into a specific rank when nothing was below them. This makes the taxonomy technically incorrect in these cases, but much more practically useful. Contact me with any questions. The full list of changes is described in Additional considerations.

Owner

  • Name: bhattlab
  • Login: bhattlab
  • Kind: organization

Citation (citation.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Siranosian
    given-names: Benjamin
    orcid: https://orcid.org/0000-0002-4935-8172
  - family-names: Bhatt
    given-names: Ami
    orcid: https://orcid.org/0000-0001-8099-2975
title: "bhattlab/Kraken2_classification"
version: 1.0.0
doi: 10.5281/zenodo.5219057
date-released: 2021-08-18

GitHub Events

Total
  • Issues event: 1
  • Watch event: 1
Last Year
  • Issues event: 1
  • Watch event: 1

Dependencies

envs/anytree/Dockerfile docker
  • mambaorg/micromamba 1.4.3 build