metagenomics_snakemake

Snakemake-based pipeline for metagenomics classification. Currently supports short-read Illumina sequences and long-read ONT sequences.

https://github.com/pablorr24/metagenomics_snakemake

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 13 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (5.4%) to scientific vocabulary

Keywords

metagenomics pipeline
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: pablorr24
  • Language: HTML
  • Default Branch: main
  • Homepage:
  • Size: 319 KB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Topics
metagenomics pipeline
Created almost 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme Citation

README.html

Snakemake Metagenomics Workflow

Summary:

This Snakemake-based program classifies metagenomic samples from short-read NGS and long-read ONT sequencing data.

Brief Description

The program classifies and analyzes short-read NGS sequences and long-read ONT samples. It consists of three workflows: short-read classification, long-read classification, and post-classification. The short-read and long-read workflows each have a QC-only mode, which runs FastQC and NanoPlot respectively and is useful for evaluating sequence quality before classification. The post-classification workflow operates on the results of the classification workflows and provides additional information using a metadata file and an additional target variable.

Prerequisites

This installation requires git and conda/miniconda. If they are already installed, skip these steps; otherwise, install them as follows:

Git Installation:
sudo apt install git

Miniconda Installation:

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh

After installing, initialize your newly-installed Miniconda. The following commands initialize for bash and zsh shells:

~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
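
To decide whether the prerequisite steps above can be skipped, a quick check for `git` and `conda` on the PATH may help. This is a minimal sketch, not part of the pipeline; the function name `missing_tools` is hypothetical.

```python
import shutil


def missing_tools(tools=("git", "conda")):
    """Return the names from `tools` that cannot be found on the PATH."""
    # shutil.which returns None when no matching executable is found.
    return [name for name in tools if shutil.which(name) is None]
```

An empty list from `missing_tools()` suggests both prerequisites are already installed; any name it returns still needs to be installed with the steps above.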

Snakemake Installation

git clone https://github.com/pablorr24/metagenomics_snakemake/
cd metagenomics_snakemake
conda env create -f environment.yml -n snakemake_meta
conda activate snakemake_meta

Database Installation

If you already have a database such as Silva, Greengenes, RefSeq, Kraken2, or a similar classification database, you can skip this step. Otherwise, install one first. The following commands download and build the Silva and Greengenes databases with kraken2-build:

kraken2-build --special silva --db SilvaDB
kraken2-build --special greengenes --db greengenes
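
After a build finishes, it can be worth confirming that the database directory is complete before pointing the workflow at it. The sketch below assumes a Kraken 2 database, which consists of the three files `hash.k2d`, `opts.k2d`, and `taxo.k2d`; the function name `missing_db_files` is hypothetical.

```python
from pathlib import Path

# Files that make up a finished Kraken 2 database directory.
REQUIRED_FILES = ("hash.k2d", "opts.k2d", "taxo.k2d")


def missing_db_files(db_dir):
    """Return the required Kraken 2 files that are absent from db_dir."""
    db = Path(db_dir)
    return [name for name in REQUIRED_FILES if not (db / name).is_file()]
```

For example, `missing_db_files("SilvaDB")` returning an empty list suggests the Silva build above completed.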

Running a workflow

To run a workflow, first edit the configuration file to match your parameters, then run Snakemake with the corresponding Snakefile:

Short-reads

snakemake -s Snakefile_fastqc --cores all
snakemake -s Snakefile_full_workflow --cores all

Long-reads

snakemake -s Snakefile_nanoplot --cores all
snakemake -s Snakefile_long_read --cores all

Post Classification Workflow

snakemake -s Snakefile_post_analysis --cores all
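
The commands above can also be assembled from a small script, for instance to run QC, classification, and post-analysis in sequence. This is a sketch under assumptions: the dictionary keys are hypothetical labels, and the mapping of `Snakefile_fastqc` and `Snakefile_nanoplot` to the QC-only modes is inferred from the description rather than confirmed. The `-n` flag is Snakemake's dry-run option.

```python
import subprocess

# Snakefile names as they appear in the commands above; keys are hypothetical labels.
WORKFLOWS = {
    "short_qc": "Snakefile_fastqc",
    "short_full": "Snakefile_full_workflow",
    "long_qc": "Snakefile_nanoplot",
    "long_full": "Snakefile_long_read",
    "post": "Snakefile_post_analysis",
}


def build_command(workflow, cores="all", dry_run=False):
    """Build the snakemake argument list for one workflow label."""
    cmd = ["snakemake", "-s", WORKFLOWS[workflow], "--cores", str(cores)]
    if dry_run:
        cmd.append("-n")  # dry run: list the planned jobs without executing them
    return cmd


def run_in_order(workflows, **kwargs):
    """Run the given workflows one after another, stopping at the first failure."""
    for name in workflows:
        subprocess.run(build_command(name, **kwargs), check=True)
```

Calling `run_in_order(["short_qc", "short_full", "post"], dry_run=True)` from the repository directory would preview a short-read run end to end without executing any jobs.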

Metadata File
The post-classification workflow requires a metadata file with one row per sample and one column per sample variable (sample location, species, etc.).
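
The one-row-per-sample requirement can be checked before running the post-classification workflow. The sketch below assumes a tab-separated file with a header row and an ID column named `sample`; the file format, the column name, and the function `check_metadata` are all assumptions, not taken from the pipeline.

```python
import csv


def check_metadata(path, sample_ids, id_column="sample", delimiter="\t"):
    """Return a list of problems found in the metadata file (empty if none)."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        if id_column not in (reader.fieldnames or []):
            return [f"missing column: {id_column}"]
        seen = [row[id_column] for row in reader]

    problems = []
    # A sample listed more than once violates the one-row-per-sample rule.
    duplicates = sorted({s for s in seen if seen.count(s) > 1})
    problems += [f"duplicate row for sample: {s}" for s in duplicates]
    # Every classified sample should have a metadata row.
    problems += [f"no row for sample: {s}" for s in sample_ids if s not in seen]
    return problems
```

Pass the sample names used in the classification run as `sample_ids`; an empty result means every sample has exactly one metadata row.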

Output

After running the workflow, a timestamped folder is created in the output folder. All your results will be inside this folder.
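
Because each run creates its own timestamped folder, a helper for locating the newest one can be convenient. The exact folder naming scheme is not described here, so this sketch picks the most recently modified subfolder instead of parsing the name; `latest_run_dir` is a hypothetical helper.

```python
from pathlib import Path


def latest_run_dir(output_dir):
    """Return the most recently modified subfolder of output_dir, or None."""
    subdirs = [p for p in Path(output_dir).iterdir() if p.is_dir()]
    # Compare by modification time rather than by name, since the
    # timestamp format of the folder names is an unknown here.
    return max(subdirs, key=lambda p: p.stat().st_mtime, default=None)
```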

References

Rules

The rule ‘create_otu_table’ uses a modified version of the ‘kraken2OTU.py’ script created by GitHub user sipost1, available at https://github.com/sipost1/kraken2OTUtable/blob/main/kraken2otu.py

The rules ‘calculate_alpha_diversity’ and ‘calculate_beta_diversity’ use modified versions of the ‘alpha_diversity.py’ and ‘beta_diversity.py’ scripts created by GitHub user jenniferlu717 in the DiversityTools directory of the KrakenTools repository, available at https://github.com/jenniferlu717/KrakenTools/blob/master/DiversityTools/alpha_diversity.py

The rules ‘create_dendogram’ and ‘create_pcoa_plot’ use modified versions of the ‘dendro.R’ and ‘pca.R’ scripts created by GitHub user GATB in the simka repository, both available at https://github.com/GATB/simka/tree/master/scripts/visualization
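
For readers unfamiliar with the diversity rules listed above, the sketch below shows the textbook definitions of one alpha-diversity measure (Shannon index) and one beta-diversity measure (Bray-Curtis dissimilarity) on per-taxon read counts. Which metrics the modified scripts actually report is not stated here, so treat these two as illustrative assumptions rather than the pipeline's implementation.

```python
import math


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two samples over the same taxa.

    Both lists must be in the same taxon order; 0 means identical
    composition and 1 means no shared taxa.
    """
    shared = sum(min(a, b) for a, b in zip(counts_a, counts_b))
    return 1 - 2 * shared / (sum(counts_a) + sum(counts_b))
```

Two equally abundant taxa give a Shannon index of ln 2, and two samples with no taxa in common give a Bray-Curtis dissimilarity of 1.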

External Software

Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.

Wood, D.E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257 (2019). https://doi.org/10.1186/s13059-019-1891-0

Ondov, B.D., Bergman, N.H. & Phillippy, A.M. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics 12, 385 (2011). https://doi.org/10.1186/1471-2105-12-385

Wouter De Coster, Svenn D’Hert, Darrin T Schultz, Marc Cruts, Christine Van Broeckhoven, NanoPack: visualizing and processing long-read sequencing data, Bioinformatics, Volume 34, Issue 15, August 2018, Pages 2666–2669, https://doi.org/10.1093/bioinformatics/bty149

Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016 Dec;26(12):1721-1729. doi: 10.1101/gr.210641.116. Epub 2016 Oct 17. PMID: 27852649; PMCID: PMC5131823.

Owner

  • Login: pablorr24
  • Kind: user

Citation (citations/fastqc_citation.txt)

Short-reads FastQC workflow

FastQC
Andrews, S. (2010). FastQC:  A Quality Control Tool for High Throughput Sequence Data [Online]. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/


Dependencies

environment.yml conda
  • _libgcc_mutex 0.1
  • _openmp_mutex 4.5
  • aioeasywebdav 2.4.0
  • aiohttp 3.9.0
  • aiosignal 1.2.0
  • amply 0.1.6
  • appdirs 1.4.4
  • attmap 0.13.2
  • attrs 23.1.0
  • bcrypt 3.2.0
  • boto3 1.29.1
  • botocore 1.32.1
  • brotli-python 1.0.9
  • bzip2 1.0.8
  • c-ares 1.21.0
  • ca-certificates 2023.08.22
  • cachetools 4.2.2
  • certifi 2023.11.17
  • cffi 1.16.0
  • charset-normalizer 2.0.4
  • coin-or-cbc 2.10.10
  • coin-or-cgl 0.60.7
  • coin-or-clp 1.17.8
  • coin-or-osi 0.108.8
  • coin-or-utils 2.11.9
  • coincbc 2.10.10
  • configargparse 1.4
  • connection_pool 0.0.3
  • cryptography 41.0.7
  • datrie 0.8.2
  • defusedxml 0.7.1
  • docutils 0.18.1
  • dpath 2.1.6
  • dropbox 11.36.1
  • eido 0.2.2
  • fastqc 0.11.8
  • filechunkio 1.8
  • frozenlist 1.4.0
  • ftputil 5.0.4
  • gdbm 1.18
  • gitdb 4.0.7
  • gitpython 3.1.37
  • google-api-core 2.10.1
  • google-api-python-client 2.108.0
  • google-auth 2.22.0
  • google-auth-httplib2 0.1.1
  • google-cloud-core 2.3.2
  • google-cloud-storage 2.6.0
  • google-crc32c 1.5.0
  • google-resumable-media 2.4.0
  • googleapis-common-protos 1.56.4
  • grpcio 1.59.3
  • httplib2 0.22.0
  • humanfriendly 10.0
  • idna 3.4
  • iniconfig 1.1.1
  • jinja2 3.1.2
  • jmespath 1.0.1
  • jsonschema 4.19.2
  • jsonschema-specifications 2023.7.1
  • jupyter_core 5.5.0
  • krona 2.7
  • ld_impl_linux-64 2.38
  • libabseil 20230802.1
  • libblas 3.9.0
  • libcblas 3.9.0
  • libcrc32c 1.1.2
  • libexpat 2.5.0
  • libffi 3.4.4
  • libgcc-ng 13.2.0
  • libgfortran-ng 13.2.0
  • libgfortran5 13.2.0
  • libgomp 13.2.0
  • libgrpc 1.59.3
  • liblapack 3.9.0
  • liblapacke 3.9.0
  • libnsl 2.0.0
  • libopenblas 0.3.24
  • libprotobuf 4.24.4
  • libre2-11 2023.06.02
  • libsodium 1.0.18
  • libsqlite 3.44.0
  • libstdcxx-ng 13.2.0
  • libuuid 2.38.1
  • libzlib 1.2.13
  • logmuse 0.2.6
  • markdown-it-py 2.2.0
  • markupsafe 2.1.1
  • mdurl 0.1.0
  • multidict 6.0.4
  • nbformat 5.9.2
  • ncurses 6.4
  • numpy 1.26.0
  • oauth2client 4.1.3
  • openjdk 8.0.152
  • openssl 3.1.4
  • packaging 23.1
  • pandas 2.1.3
  • paramiko 2.8.1
  • peppy 0.35.7
  • perl 5.34.0
  • perl-threaded 5.32.1
  • pip 23.3.1
  • plac 1.3.4
  • platformdirs 3.10.0
  • pluggy 1.0.0
  • ply 3.11
  • prettytable 3.5.0
  • protobuf 4.24.4
  • psutil 5.9.0
  • pulp 2.7.0
  • pyasn1 0.4.8
  • pyasn1-modules 0.2.8
  • pycparser 2.21
  • pygments 2.15.1
  • pynacl 1.5.0
  • pyopenssl 23.2.0
  • pyparsing 3.0.9
  • pysftp 0.2.9
  • pysocks 1.7.1
  • pytest 7.4.0
  • python 3.11.6
  • python-dateutil 2.8.2
  • python-fastjsonschema 2.16.2
  • python-irodsclient 1.1.9
  • python-tzdata 2023.3
  • python_abi 3.11
  • pytz 2023.3.post1
  • pyyaml 6.0.1
  • re2 2023.06.02
  • readline 8.2
  • referencing 0.30.2
  • requests 2.31.0
  • reretry 0.11.8
  • rich 13.3.5
  • rpds-py 0.10.6
  • rsa 4.7.2
  • s3transfer 0.7.0
  • setuptools 68.0.0
  • setuptools-scm 7.1.0
  • six 1.16.0
  • slacker 0.14.0
  • smart_open 5.2.1
  • smmap 4.0.0
  • snakemake 7.32.4
  • snakemake-minimal 7.32.4
  • stone 3.3.1
  • stopit 1.1.2
  • tabulate 0.9.0
  • throttler 1.2.2
  • tk 8.6.13
  • toposort 1.10
  • traitlets 5.7.1
  • trimmomatic 0.39
  • typing-extensions 4.7.1
  • typing_extensions 4.7.1
  • tzdata 2023c
  • ubiquerg 0.6.3
  • uritemplate 4.1.1
  • urllib3 1.26.18
  • veracitools 0.1.3
  • wcwidth 0.2.5
  • wheel 0.41.2
  • wrapt 1.14.1
  • xz 5.4.5
  • yaml 0.2.5
  • yarl 1.9.3
  • yte 1.5.1