nf-rnaseqcount
Assembly and differential expression analysis
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (11.4%) to scientific vocabulary
Repository
Assembly and differential expression analysis
Basic Info
- Host: GitHub
- Owner: phelelani
- License: mit
- Language: Nextflow
- Default Branch: master
- Size: 650 KB
Statistics
- Stars: 4
- Watchers: 1
- Forks: 4
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
nf-rnaSeqCount
nf-rnaSeqCount is a Nextflow pipeline for obtaining raw read counts for RNA-seq data using a given reference genome and annotation. To use the nf-rnaSeqCount pipeline, the following dependencies are required:
1. Installed softwares:
- Nextflow
- Singularity
2. Singularity containers with the required applications/programs for executing the workflow:
- nf-rnaSeqCount-fastqc.sif
- nf-rnaSeqCount-featurecounts.sif
- nf-rnaSeqCount-htseqcount.sif
- nf-rnaSeqCount-multiqc.sif
- nf-rnaSeqCount-star.sif
- nf-rnaSeqCount-trimmomatic.sif
- nf-rnaSeqCount-bowtie2.sif
3. Reference genome, annotation and indexes
- Reference genome (.fa/.fasta) and genome annotation (.gtf) files.
- Reference genome indexes (bowtie2 & STAR - see 1.3. below on how to generate the indexes).
1. Obtaining the nf-rnaSeqCount pipeline and preparing data
First, you need to clone the nf-rnaSeqCount repository onto you machine. You can eisther use git or nextflow (see the two methods below). I recommend using nextflow and creating you own config file (will explain later) for executing the workflow in the directory of your choosing. The rest of this documentation assumes that you have used nextflow to clone this workflow - If your're an expert and have used git to clone the workflow - you know what to do :)
```bash
Using nextflow
nextflow pull https://github.com/phelelani/nf-rnaSeqCount
Content of the repository:
bash
nf-rnaSeqCount
|--containers ## Folder for Singularity images and recipes (in case you want to build yourself). All downloaded images go here!
| |--Singularity.fastqc ## Singularity recipe file for
| |--Singularity.featureCounts ## Singularity recipe file for
| |--Singularity.htseqCount ## Singularity recipe file for
| |--Singularity.multiQC ## Singularity recipe file for
| |--Singularity.star ## Singularity recipe file for
| |--Singularity.trimmomatic ## Singularity recipe file for
| |--Singularity.trinity ## Singularity recipe file for
|--templates ## Folder for extra scripts for the pipeline.
| |--cleanfeatureCounts.sh ## Script for
| |--cleanhtseqCounts.sh ## Script for
|--LICENSE ## Duh!
|--main.config ## User configuration file! All inputs, outputs and options GO HERE!! ONLY file that SHOULD be modified by user!
|--main.nf ## Main nf-rnaSeqCount nextflow scripts.
|--nextflow.config ## Pipeline configuration file! DO NOT EDIT!!!
|--nf-rnaSeqCount.png ## Pipeline flow diagram
|--README.md ## Duh!
To get the `help menu` for the workflow, execute the following from anywherre on your system aftercloning the repository:
nextflow run nf-rnaSeqCount --help
``
The command above will give you the following usage information and options for running thenf-rnaSeqCount` workflow:
```
################################ nf-rnaSeqCount v0.2
====================================================================================================
USAGE: nextflow run nf-rnaSeqCount -profile "slurm" --data "/path/to/data" --genome "/path/to/genome.fa" --genes "/path/to/genes.gtf"
HELP: nextflow run nf-rnaSeqCount --help
MANDATORY ARGUEMENTS: -profile STRING Executor to be used. Available options: "standard" : Local execution (no job scheduler). "slurm" : SLURM scheduler. --mode STRING To specify which step of the workflow you are running (see https://github.com/phelelani/nf-rnaSeqCount). Available options: "prep.Containers" : For downloading Singularity containers used in this workflow. "prep.Indexes" : For indexing your reference genome using STAR and Bowtie2. "run.ReadQC" : For performing general QC on your reads using FastQC. "run.ReadTrimming" : For trimming low quality bases and removing adapters from your reads using Trimmmomatic. "run.ReadAlignment" : For aligning your reads to your reference genome using STAR. "run.ReadCounting" : For counting features in your reads using HTSeq-count and featureCounts. "run.MultiQC" : For getting a summary of QC through the analysis using MultiQC. --data FOLDER Path to where the input data (FASTQ files) is located. Supported FASTQ files: [ fastq | fastq.gz | fastq.bz2 | fq | fq.gz | fq.bz2 ] --genome FILE The whole genome FASTA sequence. Supported FASTA files: [ fasta | fa | fna ] --genes FILE The genome annotation GFT file. Supported GTF file: [ gtf ]
OPTIONAL ARGUEMENTS: --help To show this menu. --out FOLDER Path to where the output should be directed. Default: $PWD/resultsnf-rnaSeqCount. --from STRING Specify to resume workflow from the QC or trimming step. Options: "run.ReadQC" : To resume from the QC step (default). "run.ReadTrimming" : To resume from the trimming step. --pairedEnd If working with paired-end FASTQ files (default). --singleEnd If working with single-end FASTQ files. --trim STRING Parameters for Trimmomatic. See http://www.usadellab.org/cms/index.php?page=trimmomatic for a more detailed use. The default parameters for Trimmomatic I have given you here (for both paird- and single-end sequences) are: For paired-end: "ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:8:true TRAILING:28 MINLEN:40" For single-end: "ILLUMINACLIP:TruSeq3-SE.fa:2:30:10:8:true TRAILING:28 MINLEN:40" --maxmemory STRING Maximum memory you have access to. Default: "200.GB" --maxcpus STRING Maximum CPUs you have access to. Default: "24" --maxtime STRING Maximum time you have access to.
Default: "24.h"
```
1.1. Download test datasets (optional)
We will now download the reference genome (along with its annotation file) from Ensembl. We will also download the FASTQ files from the H3ABioNet site, which we will analyse using the nf-rnaSeqCount workflow. NB: Skip this section if you have your own data to analyse using this workflow! This section is only for getting data to practice using the nf-rnaSeqCount workflow!
- [x] Download and decompress the mouse reference genome along with its annotation: ``` ## Make a directory for the reference genome: mkdir reference
Download the reference genome (FASTA) and annotation file (GTF) files and put them into the newlly created directory:
wget -c -O reference/genome.fa.gz ftp://ftp.ensembl.org/pub/release-68/fasta/musmusculus/dna/Musmusculus.GRCm38.68.dna.toplevel.fa.gz wget -c -O reference/genes.gtf.gz ftp://ftp.ensembl.org/pub/release-68/gtf/musmusculus/Musmusculus.GRCm38.68.gtf.gz gunzip reference/genome.fa.gz gunzip reference/genes.gtf.gz ```
- [x] Download RNA-seq test dataset from H3ABioNet: ``` ## Make a directory for the data: mkdir data
Download the data:
for sample in sample{37..42}_R{1,2}.fastq.gz; do wget -c -O data/$sample http://h3data.cbio.uct.ac.za/assessments/RNASeq/practice/dataset/$sample; done ```
1.2. Download the Singularity containers (required to execute the pipeline):
bash
nextflow run nf-rnaSeqCount -profile slurm --mode prep.Containers
1.3. Generating genome indexes.
To generate the STAR and Bowtie2 genome indexes, run the following commands:
```
Generate STAR and Bowtie2 indexes
nextflow run nf-rnaSeqCount -profile slurm --mode prep.Indexes --genome "$PWD/reference/genome.fa" --genes "$PWD/reference/genes.gtf" ``` We are now ready to execute the workflow!
2. Executing the main nf-rnaSeqCount pipeline
As seen on the help menu above, there are a couple of options that you can use with this workflow. It can become a bit tedious and confusing having to specify these commands everytime you have to execute the each section for the analysis. To make your life easier, we will create a configuration script that we will use in this tutorial (we will pass this using the -c option of nextflow). You can name it whatever you want, but for now, lets call it myparams.config. We will add the mandatory arguements for now, but as you become more farmiliar with the workflow - you can experiment with other options. You can use your favourite text editor to create the myparams.config file. Copy and paste the the parameters below:
params {
data = "$PWD/data"
genome = "$PWD/reference/genome.fa"
genes = "$PWD/reference/genes.fa"
}
Obviously - the above myparams.config assumes that you have been following this tutorial. If you have your data lying around somewhere in your system, you need to put the full path to where your the data, genome and genes files are. Since the --mode will keep changing, we will add this on the command as we do the analysis. Now that we have the mandatory arguements in our myparams.config, lets do some analysis
2.1. Read QC (optional):
To perform the QC of your fastq files, use this command:
bash
nextflow run nf-rnaSeqCount -profile slurm --mode run.ReadQC -c myparams.config
2.2. Read Trimming (optional):
To run the trimming step of the nf-rnaSeqCount pipeline, use this command:
bash
nextflow run nf-rnaSeqCount -profile slurm --mode run.ReadTrimming -c myparams.config
2.3. Read Alignment:
To run the read alignment step of the nf-rnaSeqCount pipeline, use this comman (NB: can be run with --from run.ReadTrimming if you would like to use your trimmed reads):
bash
nextflow run nf-rnaSeqCount -profile slurm --mode run.ReadAlignment -c myparams.config
2.4. Read Counting:
This step uses the BAM file outputs generated by the read alignment step! You MUST run STEP 2.3 (--mode run.ReadAlignment) before running this step:
bash
nextflow run nf-rnaSeqCount -profile slurm --mode run.ReadCounting -c myparams.config
2.6. Workflow QC (optional):
This step performs a Quality Check of the different pipeline steps that have been ran. You need to run at least ONE step of the pipeline to be able to run this MultiQC step!
bash
nextflow run nf-rnaSeqCount -profile slurm --mode run.MultiQC -c myparams.config
CONGRATULATIONS for getting this far!! :) You can now explore the results and use the read counts to perform differential expression analysis!
3. Explore nf-rnaSeqCount results
- [1] Read QC (optional) => `<output_directory>/1_RQC`
- [2] Read Trimming (optional) => `<output_directory>/2_Read_Trimming`
- [3] Read Alignment => `<output_directory>/3_Read_Alignment`
- [4] Read Counting => `<output_directory>/4_Read_Counts`
- [5] MultiQC => `<output_directory>/5_MultiQC
- [6] Workflow tracing => `<output_directory>/workflow-tracing
In addition to the 5 directories created for each step in the results directory, a directory workflow-tracing is created to monitor the resources used in each step. This directory will contain 3 files for each step (--mode) of the workflow:
- nf-rnaSeqCount_<mode>_report.html
- nf-rnaSeqCount_<mode>_timeline.html
- nf-rnaSeqCount_<mode>_trace.txt
These files contain detailed information on the resources (CPU, MEMORY and TIME) usage of each of the process in the different pipeline steps. The <output_directory> directory structure is summarized below:
bash
<output_directory>
|--1_Read_QC
| |--<sample_1>_R1.fastqc.html .. <sample_N>_R1.fastqc.html
| |--<sample_1>_R2.fastqc.html .. <sample_N>_R1.fastqc.html
|--2_Read_Trimming
| |--<sample_1>.1P.fastq.gz .. <sample_N>.1P.fastq.gz
| |--<sample_1>.2P.fastq.gz .. <sample_N>.2P.fastq.gz
|--3_Read_Alignment
| |--<sample_1>_Aligned.out.bam .. <sample_N>_Aligned.out.bam
| |--<sample_1>_Log.final.out .. <sample_N>_Log.final.out
| |--<sample_1>_Log.out .. <sample_N>_Log.out
| |--<sample_1>_Log.progress.out .. <sample_N>_Log.progress.out
| |--<samplle_1>_SJ.out.tab .. <sample>_SJ.out.tab
|--4_Read_Counts
| |--featureCounts
| | |--gene_counts_final.txt
| | |--gene_counts.txt
| | |--gene_counts.txt.jcounts
| | |--gene_counts.txt.summary
| |--htseqCounts
| | |--gene_counts_final.txt
| | |--<sample>.txt .. <sample>.txt
|--5_MultiQC
| |--multiqc_data
| |--multiqc_report.html
|--workflow-tracing
| |--nf-rnaSeqCount_run.MultiQC_{report.html,timeline.html,trace.txt}
| |--nf-rnaSeqCount_run.ReadAlignment_{report.html,timeline.html,trace.txt}
| |--nf-rnaSeqCount_run.ReadCounting_{report.html,timeline.html,trace.txt}
| |--nf-rnaSeqCount_run.ReadTrimming_{report.html,timeline.html,trace.txt}
| |--nf-rnaSeqCount_run.ReadQC_{report.html,timeline.html,trace.txt}
NB: I am working on further improving the pipleine and the associated documentation, feel free to share comments and suggestions!
Owner
- Name: Phelelani Mpangase
- Login: phelelani
- Kind: user
- Location: Johannesburg, South Africa
- Company: Sydney Brenner Institute for Molecular Bioscience
- Repositories: 3
- Profile: https://github.com/phelelani
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it using these metadata."
title: "nf-rnaSeqCount: A Nextflow pipeline for obtaining raw read counts from RNA-seq data"
authors:
- family-names: Mpangase
given-names: "Phelelani Thokozani"
orcid: "https://orcid.org/0000-0001-8280-8940"
- family-names: Frost
given-names: Jacqueline
orcid: "https://orcid.org/0000-0001-7627-011X"
- family-names: Mohammed
given-names: Tikly
orcid: "https://orcid.org/0000-0001-7850-3538"
- family-names: Ramsay
given-names: "Michèle"
orcid: "https://orcid.org/0000-0002-4156-4801"
- family-names: Hazelhurst
given-names: Scott
orcid: "https://orcid.org/0000-0002-0581-149X"
repository-code: "https://github.com/phelelani/nf-rnaSeqCount"
license: MIT
version: "0.2"
GitHub Events
Total
Last Year
Dependencies
- ubuntu 18.04 build
- ubuntu 18.04 build
- ubuntu 18.04 build
- ubuntu 18.04 build
- ubuntu 18.04 build
- ubuntu 18.04 build
- ubuntu 18.04 build