snakemake_rnaseq
A Snakemake pipeline to go from fastq mRNA sequencing files to raw and normalised counts (usable for downstream EDA and differential analysis)
Repository
Basic Info
- Host: GitHub
- Owner: BleekerLab
- License: mit
- Language: Python
- Default Branch: master
- Homepage: https://snakemake-rnaseq-pipeline.readthedocs.io/en/latest/
- Size: 85.4 MB
Statistics
- Stars: 14
- Watchers: 2
- Forks: 8
- Open Issues: 5
- Releases: 8
Topics
Metadata Files
README.md
RNA-seq analysis pipeline
- Description
- Installation and usage (local machine)
- Installation and usage (HPC cluster)
- Directed Acyclic Graph of jobs
- References :green_book:
- Citation
Description
A Snakemake pipeline for the analysis of messenger RNA-seq data. It processes mRNA-seq fastq files and delivers both raw and normalised/scaled count tables. This pipeline also outputs a QC report per fastq file and a .bam mapping file to use with a genome browser for instance.
This pipeline can process single or paired-end data and is mostly suited for Illumina sequencing data.
Description
This pipeline analyses the raw RNA-seq data and produces two files containing the raw and normalized counts.
- The raw fastq files will be trimmed for adaptors and quality checked with fastp.
- The genome sequence FASTA file will be used for the mapping step of the trimmed reads using STAR.
- A GTF annotation file will be used to obtain the raw counts using subread featureCounts.
- The raw counts will be scaled by a custom R function that implements the DESeq2 median of ratios method to generate the scaled ("normalized") counts.
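The pipeline does this scaling with its own R function, which is not reproduced here. To illustrate the median of ratios idea only, here is a minimal Python sketch. The function names `size_factors` and `scale_counts` and the toy counts are hypothetical, and skipping genes that have a zero count in any sample is an assumption taken from the usual description of the method.

```python
import math
from statistics import median


def size_factors(counts):
    """Compute one size factor per sample.

    counts: dict mapping sample name -> list of raw counts (same gene order in every sample).
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log of the geometric mean of each gene across samples.
    # Genes with a zero count in any sample are skipped (assumption, see lead-in).
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    # Size factor of a sample = median ratio of its counts to the gene-wise geometric means.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


def scale_counts(counts):
    """Divide every raw count by the size factor of its sample."""
    factors = size_factors(counts)
    return {s: [c / factors[s] for c in counts[s]] for s in counts}


# Toy example: sample B was sequenced exactly twice as deep as sample A.
example = {"A": [10, 20, 30], "B": [20, 40, 60]}
factors = size_factors(example)
scaled = scale_counts(example)
```

In this toy case the two samples differ only by sequencing depth, so their scaled counts become identical. Real data would call for the pipeline's R function or DESeq2 itself.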
Input files
- RNA-seq fastq files as listed in the `config/samples.tsv` file. Specify a sample name (e.g. "Sample_A") in the `sample` column and the paths to the forward read (`fq1`) and to the reverse read (`fq2`). If you have single-end reads, leave the `fq2` column empty.
- A genomic reference in FASTA format. For instance, a fasta file containing the 12 chromosomes of tomato (Solanum lycopersicum).
- A genome annotation file in the GTF format. You can convert a GFF annotation file format into GTF with the gffread program from Cufflinks: `gffread my.gff3 -T -o my.gtf`. :warning: for featureCounts to work, the feature in the GTF file should be `exon` while the meta-feature has to be `transcript_id`.
Below is an example of a GTF file format. :warning: a real GTF file does not have column names (seqname, source, etc.). Remove all non-data rows.
| seqname | source | feature | start | end | score | strand | frame | attributes |
|-----------|------------|------|------|------|---|---|---|----------------------------------------------------------------------------------------------------|
| SL4.0ch01 | makerITAG | CDS | 279 | 743 | . | + | 0 | transcript_id "Solyc01g004000.1.1"; gene_id "gene:Solyc01g004000.1"; gene_name "Solyc01g004000.1"; |
| SL4.0ch01 | makerITAG | exon | 1173 | 1616 | . | + | . | transcript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1"; |
| SL4.0ch01 | makerITAG | exon | 3793 | 3971 | . | + | . | transcript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1"; |
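The two GTF requirements above (no header row, and `exon` features carrying a `transcript_id` attribute) can be checked before launching the pipeline. The sketch below is not part of the repository. The helper name `check_gtf_lines` is hypothetical, and it assumes a tab-separated, nine-column GTF in which lines starting with `#` are comments.

```python
import re


def check_gtf_lines(lines, feature="exon", attribute="transcript_id"):
    """Return a list of problems found in GTF lines (an empty list means none were found)."""
    problems = []
    n_features = 0
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Empty lines and '#' comment lines are ignored here (assumption).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            problems.append(f"line {number}: expected 9 tab-separated columns, found {len(fields)}")
            continue
        # A header row such as 'seqname source feature start end ...' fails this test.
        if not (fields[3].isdigit() and fields[4].isdigit()):
            problems.append(f"line {number}: start/end are not integers (header row?)")
            continue
        if fields[2] == feature:
            n_features += 1
            # The attribute must appear as: transcript_id "some value"
            if not re.search(rf'\b{re.escape(attribute)} "[^"]+"', fields[8]):
                problems.append(f"line {number}: {feature} without a {attribute} attribute")
    if n_features == 0:
        problems.append(f"no '{feature}' feature found")
    return problems


# One valid exon line, taken from the example table above.
good = [
    'SL4.0ch01\tmakerITAG\texon\t1173\t1616\t.\t+\t.\t'
    'transcript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1";'
]
# A column-name row that should not be present in a real GTF file.
header = ["seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tattributes"]
# An exon line that lacks the transcript_id attribute.
missing = ['SL4.0ch01\tmakerITAG\texon\t1173\t1616\t.\t+\t.\tgene_id "gene:Solyc01g004002.1";']

good_problems = check_gtf_lines(good)
header_problems = check_gtf_lines(header)
missing_problems = check_gtf_lines(missing)
```

To check a real file, pass an open file handle (for instance the test annotation in `config/refs/`) instead of a list of strings.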
Output files
- A table of raw counts called `raw_counts.txt`: this table can be used to perform a differential gene expression analysis with DESeq2.
- A table of DESeq2-normalised counts called `scaled_counts.tsv`: this table can be used to perform an Exploratory Data Analysis with a PCA, heatmaps, sample clustering, etc.
- fastp QC reports: one per fastq file.
- bam files: one per fastq file (or pair of fastq files).
Prerequisites: what you should know before using this pipeline
- Some command of the Unix Shell to connect to a remote server where you will execute the pipeline. You can find a good tutorial from the Software Carpentry Foundation here and another one from Berlin Bioinformatics here.
- Some command of the Unix Shell to transfer datasets to and from a remote server (to transfer sequencing files and retrieve the `results/` folder). The Berlin Bioinformatics Unix beginner guide available here should be sufficient for that (check the `wget` and `scp` commands).
- An understanding of the steps of a canonical RNA-Seq analysis (trimming, alignment, etc.). You can find some info here.
Content of this GitHub repository
- `Snakefile`: a master file that contains the desired outputs and the rules to generate them from the input files.
- `config/samples.tsv`: a file containing sample names and the paths to the forward reads and, for paired-end data, the reverse reads. This file has to be adapted to your sample names before running the pipeline.
- `config/config.yaml`: the configuration file that makes the Snakefile adaptable to any input files, genome and rule parameters.
- `config/refs/`: a folder containing
  - a genomic reference in fasta format. The `S_lycopersicum_chromosomes.4.00.chrom1.fa` file is placed there for testing purposes.
  - a GTF annotation file. The `ITAG4.0_gene_models.sub.gtf` file is placed there for testing purposes.
- `.fastq/`: a (hidden) folder containing subsetted paired-end fastq files used to test the pipeline locally. Generated using Seqtk: `seqtk sample -s100 <inputfile> 250000 > <output file>`. This folder should contain the fastq files of the paired-end RNA-seq data you want to run.
- `envs/`: a folder containing the environments needed for the pipeline:
  - The `environment.yaml` is used by the conda package manager to create a working environment (see below).
  - The `Dockerfile` is a Docker file used to build the docker image by referring to the `environment.yaml` (see below).
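To show how the `config/samples.tsv` layout described above (columns `sample`, `fq1`, `fq2`, with an empty `fq2` for single-end reads) can be interpreted, here is a small Python sketch. It is not the code used by the Snakefile. The function name `read_samples` and the fastq file names are hypothetical, and the file is assumed to be tab-separated with a header row.

```python
import csv
import io


def read_samples(handle):
    """Map each sample name to its fastq paths and sequencing layout."""
    samples = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        name = (row.get("sample") or "").strip()
        fq1 = (row.get("fq1") or "").strip()
        fq2 = (row.get("fq2") or "").strip()
        if not name or not fq1:
            raise ValueError(f"each row needs a sample name and an fq1 path: {row}")
        if name in samples:
            raise ValueError(f"duplicated sample name: {name}")
        # An empty fq2 column means single-end reads.
        samples[name] = {"fq1": fq1, "fq2": fq2 or None, "paired": bool(fq2)}
    return samples


# Hypothetical sample sheet: one paired-end and one single-end sample.
example = io.StringIO(
    "sample\tfq1\tfq2\n"
    "Sample_A\tA_R1.fastq.gz\tA_R2.fastq.gz\n"
    "Sample_B\tB.fastq.gz\t\n"
)
samples = read_samples(example)
```

Running the same function on an opened `config/samples.tsv` is a quick way to catch a missing path or a duplicated sample name before starting a run.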
Installation and usage (local machine)
Installation
You will need a local copy of the GitHub snakemake_rnaseq repository on your machine. You can either:
- use git in the shell: `git clone git@github.com:BleekerLab/snakemake_rnaseq.git`.
- click on "Clone or download" and select download.

Then navigate inside the snakemake_rnaseq folder using Shell commands.
Usage
Configuration :pencil2:
You'll need to change a few things to adapt this pipeline to your needs. Make sure you have changed the parameters in the `config/config.yaml` file, which specifies where to find the sample data file, the genomic and transcriptomic reference fasta files to use, and the parameters for certain rules.
This file is used so the Snakefile does not need to be changed when locations or parameters need to be changed.
:round_pushpin: Option 1: conda (easiest)
Using the conda package manager, you need to create an environment where core software such as Snakemake will be installed.
1. Install the Miniconda3 distribution (>= Python 3.7 version) for your OS (Windows, Linux or Mac OS X).
2. Inside a Shell window (command line interface), create a virtual environment named rnaseq using the `envs/environment.yaml` file with the following command: `conda env create --name rnaseq --file envs/environment.yaml`
3. Then, before you run the Snakemake pipeline, activate this virtual environment with `source activate rnaseq`.
While a conda environment will in most cases work just fine, Docker is the recommended solution as it increases pipeline execution reproducibility.
:whale: Option 2: Docker (recommended)
:round_pushpin: Option 2: using a Docker container
1. Install Docker desktop for your operating system.
2. Open a Shell window and type: `docker pull bleekerlab/snakemake_rnaseq:4.7.12` to retrieve a Docker image that includes the software required by the pipeline (Snakemake, conda and many others).
3. Run the pipeline on your system with: `docker run --rm -v $PWD:/home/snakemake/ bleekerlab/snakemake_rnaseq:4.7.12` and add any options for snakemake (`-n`, `--cores 10`, etc.).
The image was built using a Dockerfile based on the 4.7.12 Miniconda3 official Docker image.
:whale: Option 3: Singularity
- Install singularity
- Open a Shell window and type: `singularity run docker://bleekerlab/snakemake_rnaseq:4.7.12` to retrieve a Docker image that includes the software required by the pipeline (Snakemake, conda and many others).
- Run the pipeline on your system with `singularity run snakemake_rnaseq_4.7.12.sif` and add any options for snakemake (`-n`, `--cores 10`, etc.). The directory where the sif file is stored will automatically be mapped to `/home/snakemake`. Results will be written to a folder named `$PWD/results/` (you can change `results` to something you like in the `result_dir` parameter of the `config.yaml`).
Dry run
- With conda: use `snakemake -np` to perform a dry run that prints out the rules and commands.
- With Docker: use the docker run
Real run
With conda: snakemake --cores 10
Installation and usage (HPC cluster)
Installation
You will need a local copy of the GitHub snakemake_rnaseq repository on your machine. On an HPC system, you will have to clone it using the Shell command-line: `git clone git@github.com:BleekerLab/snakemake_rnaseq.git`.
- click on "Clone or download" and select download.
- Then navigate inside the snakemake_rnaseq folder using Shell commands.
Usage
See the detailed protocol here.
Directed Acyclic Graph of jobs

References :green_book:
Authors
- Marc Galland, m.galland@uva.nl
- Tijs Bliek, m.bliek@uva.nl
- Frans van der Kloet f.m.vanderkloet@uva.nl
Pipeline dependencies
Acknowledgments :clap:
Johannes Köster; creator of Snakemake.
Citation
If you use this software, please use the following citation:
Bliek T., Chouaref J., van der Kloet F., Galland M. (2021). RNA-seq analysis pipeline (version 0.3.7). DOI: https://doi.org/10.5281/zenodo.4707140
Owner
- Name: Petra Bleeker laboratory
- Login: BleekerLab
- Kind: organization
- Email: P.M.Bleeker@uva.nl
- Location: University of Amsterdam
- Repositories: 6
- Profile: https://github.com/BleekerLab
Laboratory of Petra Bleeker at University of Amsterdam
Citation (CITATION.cff)
cff-version: 1.1.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Bliek
    given-names: Tijs
    orcid: https://orcid.org/0000-0002-0488-4873
  - family-names: Chouaref
    given-names: Jihed
    orcid: https://orcid.org/0000-0003-3865-896X
  - family-names: van der Kloet
    given-names: Frans
    orcid: https://orcid.org/0000-0002-8573-2651
  - family-names: Galland
    given-names: Marc
    orcid: https://orcid.org/0000-0003-2161-8689
title: "RNA-seq analysis pipeline"
version: 0.3.7
doi: https://doi.org/10.5281/zenodo.4707140
date-released: 2021-04-27
GitHub Events
Total
- Watch event: 6
- Fork event: 3
Last Year
- Watch event: 6
- Fork event: 3
Dependencies
- continuumio/miniconda 4.7.12 build