snakemake_rnaseq

A Snakemake pipeline to go from fastq mRNA sequencing files to raw and normalised counts (usable for downstream EDA and differential analysis)

https://github.com/bleekerlab/snakemake_rnaseq

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 2 DOI reference(s) in README
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (11.5%) to scientific vocabulary

Keywords

bioinformatics conda docker genomics rna-seq-analysis rna-seq-pipeline snakemake

Repository

Basic Info

Host: GitHub
Owner: BleekerLab
License: mit
Language: Python
Default Branch: master
Homepage: https://snakemake-rnaseq-pipeline.readthedocs.io/en/latest/
Size: 85.4 MB

Statistics

Stars: 14
Watchers: 2
Forks: 8
Open Issues: 5
Releases: 8

Created about 6 years ago · Last pushed over 3 years ago

Metadata Files

Readme License Citation

RNA-seq analysis pipeline

Description
Installation and usage (local machine)
Installation and usage (HPC cluster)
- Installation
- Usage
Directed Acyclic Graph of jobs
References :green_book:
Citation

Description

A Snakemake pipeline for the analysis of messenger RNA-seq data. It processes mRNA-seq fastq files and delivers both raw and normalised/scaled count tables. This pipeline also outputs a QC report per fastq file and a .bam mapping file to use with a genome browser for instance.
This pipeline can process single or paired-end data and is mostly suited for Illumina sequencing data.

Description

This pipeline analyses the raw RNA-seq data and produces two files containing the raw and normalized counts.

The raw fastq files will be trimmed for adaptors and quality checked with fastp.
The genome sequence FASTA file will be used for the mapping step of the trimmed reads using STAR.
A GTF annotation file will be used to obtain the raw counts using subread featureCounts.
The raw counts will be scaled by a custom R function that implements the DESeq2 median of ratios method to generate the scaled ("normalized") counts.
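
As a rough illustration of the median of ratios scaling mentioned above, here is a minimal Python sketch. It is not the pipeline's own R function: the function names (`size_factors`, `scale_counts`) and the toy counts are hypothetical, and genes with a zero count in any sample are assumed to be left out when the size factors are estimated.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one size factor per sample from {gene: [count per sample]}."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        # Assumption: genes with a zero in any sample are left out, because
        # their geometric mean is zero and the ratio is undefined.
        if any(c <= 0 for c in gene_counts):
            continue
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Median of the per-gene ratios, computed on the log scale.
    return [math.exp(median(ratios)) for ratios in log_ratios]


def scale_counts(counts):
    """Divide every raw count by the size factor of its sample."""
    factors = size_factors(counts)
    return {gene: [c / f for c, f in zip(row, factors)]
            for gene, row in counts.items()}


# Toy data (hypothetical): Sample_B was sequenced twice as deeply as Sample_A.
raw = {"gene1": [10, 20], "gene2": [100, 200], "gene3": [0, 5]}
factors = size_factors(raw)
scaled = scale_counts(raw)
```

In this toy case, gene1 and gene2 end up with identical scaled counts in both samples, which is the intended effect of correcting for sequencing depth.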

Input files

RNA-seq fastq files as listed in the config/samples.tsv file. Specify a sample name (e.g. "Sample_A") in the sample column and the paths to the forward read (fq1) and to the reverse read (fq2). If you have single-end reads, leave the fq2 column empty.
A genomic reference in FASTA format. For instance, a fasta file containing the 12 chromosomes of tomato (Solanum lycopersicum).
A genome annotation file in the GTF format. You can convert a GFF annotation file format into GTF with the gffread program from Cufflinks: gffread my.gff3 -T -o my.gtf. :warning: for featureCounts to work, the feature in the GTF file should be exon while the meta-feature has to be transcript_id.

Below is an example of a GTF file format. :warning: a real GTF file does not have column names (seqname, source, etc.). Remove all non-data rows.

| seqname | source | feature | start | end | score | strand | frame | attributes |
|-----------|------------|------|------|------|---|---|---|---|
| SL4.0ch01 | maker_ITAG | CDS | 279 | 743 | . | + | 0 | transcript_id "Solyc01g004000.1.1"; gene_id "gene:Solyc01g004000.1"; gene_name "Solyc01g004000.1"; |
| SL4.0ch01 | maker_ITAG | exon | 1173 | 1616 | . | + | . | transcript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1"; |
| SL4.0ch01 | maker_ITAG | exon | 3793 | 3971 | . | + | . | transcript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1"; |
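
Before launching the pipeline, it may help to check that the annotation meets the featureCounts requirement described above. The sketch below is one possible sanity check, not part of the pipeline: `parse_gtf_line` and `check_gtf` are hypothetical helper names, and the demo rows are shortened versions of the example rows.

```python
import re
from collections import Counter

# Matches attribute pairs such as: transcript_id "Solyc01g004002.1.1";
ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_line(line):
    """Split one GTF data row into its feature type and an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    return fields[2], dict(ATTRIBUTE.findall(fields[8]))


def check_gtf(lines):
    """Count feature types and exon rows that lack a transcript_id attribute."""
    features = Counter()
    exons_without_transcript = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        feature, attributes = parse_gtf_line(line)
        features[feature] += 1
        if feature == "exon" and "transcript_id" not in attributes:
            exons_without_transcript += 1
    return features, exons_without_transcript


# Demo rows (the last one deliberately lacks transcript_id).
rows = [
    'SL4.0ch01\tmaker_ITAG\tCDS\t279\t743\t.\t+\t0\ttranscript_id "Solyc01g004000.1.1"; gene_id "gene:Solyc01g004000.1";\n',
    'SL4.0ch01\tmaker_ITAG\texon\t1173\t1616\t.\t+\t.\ttranscript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1";\n',
    'SL4.0ch01\tmaker_ITAG\texon\t3793\t3971\t.\t+\t.\tgene_id "gene:Solyc01g004002.1";\n',
]
features, missing = check_gtf(rows)
```

On a real file, the same function can be fed an open file handle, for instance `check_gtf(open("my.gtf"))`. An annotation with no `exon` rows, or with a non-zero `missing` count, would likely need fixing first.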

Output files

A table of raw counts called raw_counts.txt: this table can be used to perform a differential gene expression analysis with DESeq2.
A table of DESeq2-normalised counts called scaled_counts.tsv: this table can be used to perform an Exploratory Data Analysis with a PCA, heatmaps, sample clustering, etc.
fastp QC reports: one per fastq file.
bam files: one per fastq file (or pair of fastq files).

Prerequisites: what you should know before using this pipeline

Some command of the Unix Shell to connect to a remote server where you will execute the pipeline. You can find a good tutorial from the Software Carpentry Foundation here and another one from Berlin Bioinformatics here.
Some command of the Unix Shell to transfer datasets to and from a remote server (to transfer sequencing files and retrieve the results/). The Berlin Bioinformatics Unix beginner guide (available here) should be sufficient for that (check the wget and scp commands).
An understanding of the steps of a canonical RNA-Seq analysis (trimming, alignment, etc.). You can find some info here.

Content of this GitHub repository

Snakefile: a master file that contains the desired outputs and the rules to generate them from the input files.
config/samples.tsv: a file containing sample names and the paths to the forward and, for paired-end data, reverse reads. This file has to be adapted to your sample names before running the pipeline.
config/config.yaml: the configuration file that makes the Snakefile adaptable to any input files, genome and rule parameters.
config/refs/: a folder containing
- a genomic reference in fasta format. The S_lycopersicum_chromosomes.4.00.chrom1.fa is placed for testing purposes.
- a GTF annotation file. The ITAG4.0_gene_models.sub.gtf for testing purposes.
.fastq/: a (hidden) folder containing subsetted paired-end fastq files used to test the pipeline locally. They were generated using Seqtk: seqtk sample -s100 <inputfile> 250000 > <output file>. This folder should contain the fastq files of the paired-end RNA-seq data you want to process.
envs/: a folder containing the environments needed for the pipeline:
- The environment.yaml is used by the conda package manager to create a working environment (see below).
- The Dockerfile is used to build the Docker image by referring to the environment.yaml (see below).
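
The test files in .fastq/ were subsampled with Seqtk, where using the same seed (`-s100`) on both mate files keeps the read pairs in sync. The Python sketch below illustrates that idea only. It is not Seqtk's algorithm, it loads all reads in memory, and the helper names (`read_fastq`, `subsample_pair`) and the toy reads are hypothetical.

```python
import io
import random


def read_fastq(handle):
    """Yield FASTQ records as lists of four lines (header, sequence, '+', qualities)."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            return  # end of file
        yield record


def subsample_pair(fq1_handle, fq2_handle, n_reads, seed=100):
    """Pick the same random subset of read pairs from two mate files."""
    pairs = list(zip(read_fastq(fq1_handle), read_fastq(fq2_handle)))
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    keep = sorted(rng.sample(range(len(pairs)), min(n_reads, len(pairs))))
    return [pairs[i] for i in keep]


# Toy mate files with ten reads each (hypothetical content).
mate1 = io.StringIO("".join(f"@read{i}/1\nACGT\n+\nIIII\n" for i in range(10)))
mate2 = io.StringIO("".join(f"@read{i}/2\nTGCA\n+\nIIII\n" for i in range(10)))
picked = subsample_pair(mate1, mate2, n_reads=4)
```

Because the subset is chosen once and applied to both files, every forward read that is kept still has its reverse mate.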

Installation and usage (local machine)

Installation

You will need a local copy of the GitHub snakemake_rnaseq repository on your machine. You can either:
- use git in the shell: git clone git@github.com:BleekerLab/snakemake_rnaseq.git.
- click on "Clone or download" and select download.

Then navigate inside the snakemake_rnaseq folder using Shell commands.

Usage

Configuration :pencil2:

You'll need to change a few things to adapt this pipeline to your needs. Make sure you have changed the parameters in the config/config.yaml file, which specifies where to find the sample data file, which genomic and transcriptomic reference fasta files to use, and the parameters for certain rules.
This file is used so the Snakefile does not need to be changed when locations or parameters need to be changed.
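
A quick way to catch configuration mistakes is to validate the sample sheet before a run. The sketch below assumes the `sample`, `fq1` and `fq2` columns described in the input files section. The helper names (`read_samples`, `layout`) and the fastq paths are hypothetical, and this is not how the Snakefile itself reads the table.

```python
import csv
import io


def read_samples(handle):
    """Return {sample: (fq1, fq2 or None)} from a tab-separated sample sheet."""
    samples = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        name = row["sample"].strip()
        if name in samples:
            raise ValueError(f"duplicated sample name: {name}")
        # An empty or missing fq2 column means single-end reads.
        fq2 = (row.get("fq2") or "").strip()
        samples[name] = (row["fq1"].strip(), fq2 or None)
    return samples


def layout(samples):
    """Label each sample as paired-end or single-end."""
    return {name: "paired-end" if fq2 else "single-end"
            for name, (fq1, fq2) in samples.items()}


# Toy sample sheet (hypothetical paths); Sample_B has no reverse reads.
sheet = io.StringIO(
    "sample\tfq1\tfq2\n"
    "Sample_A\treads/A_R1.fastq.gz\treads/A_R2.fastq.gz\n"
    "Sample_B\treads/B_R1.fastq.gz\t\n"
)
samples = read_samples(sheet)
layouts = layout(samples)
```

With a real file, `read_samples(open("config/samples.tsv"))` would return the same kind of dictionary, and a mix of paired-end and single-end labels is an easy thing to spot before starting a long run.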

:round_pushpin: Option 1: conda (easiest)

Using the conda package manager, you need to create an environment where core software such as Snakemake will be installed.
1. Install the Miniconda3 distribution (>= Python 3.7 version) for your OS (Windows, Linux or Mac OS X).
2. Inside a Shell window (command line interface), create a virtual environment named rnaseq using the envs/environment.yaml file with the following command: conda env create --name rnaseq --file envs/environment.yaml
3. Then, before you run the Snakemake pipeline, activate this virtual environment with source activate rnaseq.

While a conda environment will in most cases work just fine, Docker is the recommended solution as it increases pipeline execution reproducibility.

:whale: Option 2: Docker (recommended)

1. Install Docker desktop for your operating system.
2. Open a Shell window and type `docker pull bleekerlab/snakemake_rnaseq:4.7.12` to retrieve a Docker image that includes the software required by the pipeline (Snakemake, conda and many others).
3. Run the pipeline on your system with `docker run --rm -v $PWD:/home/snakemake/ bleekerlab/snakemake_rnaseq:4.7.12` and add any options for snakemake (`-n`, `--cores 10`, etc.). The image was built using a Dockerfile based on the 4.7.12 Miniconda3 official Docker image.

:whale: Option 3: Singularity

1. Install singularity.
2. Open a Shell window and type: singularity run docker://bleekerlab/snakemake_rnaseq:4.7.12 to retrieve a Docker image that includes the software required by the pipeline (Snakemake, conda and many others).
3. Run the pipeline on your system with singularity run snakemake_rnaseq_4.7.12.sif and add any options for snakemake (-n, --cores 10, etc.). The directory where the sif file is stored will automatically be mapped to /home/snakemake. Results will be written to a folder named $PWD/results/ (you can change results to something you like in the result_dir parameter of the config.yaml).

Dry run

With conda: use snakemake -np to perform a dry run that prints the rules and commands without executing them.
With Docker: use the docker run

Real run

With conda: snakemake --cores 10
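
For scripted runs, the dry-run and real-run commands above can be assembled programmatically. The sketch below only builds the argument lists and does not launch anything. `snakemake_command` is a hypothetical helper, and the Docker variant assumes that Snakemake options are simply appended after the image name, as described in the Docker option.

```python
import os
import shlex

IMAGE = "bleekerlab/snakemake_rnaseq:4.7.12"  # image tag quoted in this README


def snakemake_command(backend, options):
    """Build the argument list for a conda or Docker run (hypothetical helper)."""
    if backend == "conda":
        return ["snakemake", *options]
    if backend == "docker":
        # Equivalent of -v $PWD:/home/snakemake/ in the shell command.
        mount = f"{os.getcwd()}:/home/snakemake/"
        return ["docker", "run", "--rm", "-v", mount, IMAGE, *options]
    raise ValueError(f"unknown backend: {backend}")


dry_run = snakemake_command("conda", ["-np"])
real_run = snakemake_command("conda", ["--cores", "10"])
docker_dry_run = snakemake_command("docker", ["-n"])

# Print the commands so they can be reviewed before being executed.
for command in (dry_run, real_run, docker_dry_run):
    print(shlex.join(command))
```

Once the printed commands look right, each list could be passed to `subprocess.run(command, check=True)` to actually start the run.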

Installation and usage (HPC cluster)

Installation

You will need a local copy of the GitHub snakemake_rnaseq repository on your machine. On an HPC system, you will have to clone it using the Shell command line: git clone git@github.com:BleekerLab/snakemake_rnaseq.git. Then navigate inside the snakemake_rnaseq folder using Shell commands.

Usage

See the detailed protocol here.

Directed Acyclic Graph of jobs

dag

References :green_book:

Authors

Marc Galland, m.galland@uva.nl
Tijs Bliek, m.bliek@uva.nl
Frans van der Kloet, f.m.vanderkloet@uva.nl

Pipeline dependencies

Acknowledgments :clap:

Johannes Köster; creator of Snakemake.

Citation

If you use this software, please use the following citation:

Bliek T., Chouaref J., van der Kloet F., Galland M. (2021). RNA-seq analysis pipeline (version 0.3.7). DOI: https://doi.org/10.5281/zenodo.4707140

Owner

Name: Petra Bleeker laboratory
Login: BleekerLab
Kind: organization
Email: P.M.Bleeker@uva.nl
Location: University of Amsterdam

Repositories: 6
Profile: https://github.com/BleekerLab

Laboratory of Petra Bleeker at University of Amsterdam

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Bliek
    given-names: Tijs
    orcid: https://orcid.org/0000-0002-0488-4873
  - family-names: Chouaref
    given-names: Jihed
    orcid: https://orcid.org/0000-0003-3865-896X
  - family-names: van der Kloet
    given-names: Frans
    orcid: https://orcid.org/0000-0002-8573-2651
  - family-names: Galland
    given-names: Marc
    orcid: https://orcid.org/0000-0003-2161-8689
title: "RNA-seq analysis pipeline"
version: 0.3.7
doi: https://doi.org/10.5281/zenodo.4707140
date-released: 2021-04-27

GitHub Events

Total

Watch event: 6
Fork event: 3

Last Year

Watch event: 6
Fork event: 3

Dependencies

Dockerfile docker

continuumio/miniconda 4.7.12 build

environment.yaml pypi
