rnaflow

A simple RNA-Seq differential gene expression pipeline using Nextflow

https://github.com/hoelzer-lab/rnaflow

Repository

Basic Info

Host: GitHub
Owner: hoelzer-lab
License: gpl-3.0
Language: HTML
Default Branch: master
Homepage:
Size: 183 MB

Statistics

Stars: 101
Watchers: 3
Forks: 22
Open Issues: 50
Releases: 23

Created over 6 years ago · Last pushed 12 months ago

RNAflow - An effective and simple RNA-Seq differential gene expression pipeline using Nextflow

Figure 1. Workflow. After preprocessing, the user can choose to run either a differential gene expression (DEG) analysis or a transcriptome assembly. Circles symbolize input data, and download icons symbolize the automated download of resources. Steps marked with asterisks are currently only available for some species. See the list of references for the tools used (RNAflow Citations) and please consider citing them as well.

Table of Contents

- [Quick installation](#quick-installation)
- [Quick start](#quick-start)
  - [Start a test run](#start-a-test-run)
  - [Call help](#call-help)
  - [Update the pipeline](#update-the-pipeline)
  - [Use a certain release](#use-a-certain-release)
- [Usage](#usage)
  - [Input files](#input-files)
    - [Read files (required)](#read-files-required)
    - [Genomes and annotation](#genomes-and-annotation)
    - [Built-in species](#build-in-species)
    - [Comparisons for DEG analysis](#comparisons-for-deg-analysis)
  - [Resume your run](#resume-your-run)
- [Workflow control](#workflow-control)
  - [Preprocessing](#preprocessing)
  - [DEG analysis](#deg-analysis)
  - [Transcriptome assembly](#transcriptome-assembly)
- [Profiles/configuration options](#profilesconfiguration-options)
  - [Executor options...](#executor-options)
  - [Engine options...](#engine-options)
- [Monitoring](#monitoring)
- [Output](#output)
  - [DESeq2 results](#deseq2-results)
- [Working offline](#working-offline)
- [Help message](#help-message)
- [Known bugs and issues](#known-bugs-and-issues)
  - [Reference file name](#reference-file-name)
  - [Problems with `SortMeRNA`/ `HISAT2` error](#problems-with-sortmerna-hisat2-error-141-116)
    - [Description](#description)
    - [Workaround](#workaround)
  - [Latency problems on HPCs, issue (#79)](#latency-problems-on-hpcs-issue-79)
    - [Description](#description-1)
    - [Workaround](#workaround-1)
- [Citation](#citation)

Quick installation

The pipeline is written in Nextflow, which can be used on any POSIX-compatible system (Linux, OS X, etc.). Windows is supported through WSL. You need Nextflow installed and either conda, Docker, or Singularity to run the steps of the pipeline:

Install Nextflow (bash one-liner):

```bash
wget -qO- https://get.nextflow.io | bash

# In case you don't have wget
curl -s https://get.nextflow.io | bash
```
Install conda (bash two-liner for Miniconda3, Linux 64-bit):

    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh

OR

Install conda with the same Miniconda3 two-liner, then install Nextflow via conda:

    conda create -n nextflow -c bioconda nextflow
    conda activate nextflow

For transcriptome assembly, you also have to install Docker or Singularity.

You can try to install Singularity via conda as well:

    conda create -n singularity -c conda-forge singularity
    conda activate singularity

or, if you already have a conda environment for Nextflow:

    conda activate nextflow
    conda install -c conda-forge singularity

A Singularity installation configured by a system administrator should be preferred over your own local conda installation. Please ask your sysadmin!

All other dependencies and tools will be installed within the pipeline via conda, Docker or Singularity depending on the profile you run (see below).

Quick start

Start a test run

```bash
# activate the conda environment first, if Nextflow was installed via conda
conda activate nextflow

nextflow run hoelzer-lab/rnaflow -profile test,conda,local
```

... performs

- a differential gene expression analysis with sub-sampled human read data,
- on a reduced human genome and annotation (chromosomes 1, 10 and 11),
- comparing two conditions (MAQCA, MAQCB),
- with a local execution (uses max. 4 cores in total and 8 GB) and
- conda dependency management.

Resource usage

For a local test run (with 30 cores in total at maximum):

```bash
nextflow run hoelzer-lab/rnaflow -profile test,conda,local -w work \
  --max_cores 30 --cores 10 --softlink_results -r master
```

we observed the following resource usage, including downloads and `conda` environment creation for each process:

Total runtime: 25m 56s
Physical memory (RAM), max.: 3.015 GB at process hisat2index
Virtual memory (RAM + Disk swap), max.: 10.16 GB at process hisat2

A detailed HTML report automatically produced by the pipeline can be found [here](test-data/execution_report.html).

Call help

    nextflow run hoelzer-lab/rnaflow --help

Update the pipeline

    nextflow pull hoelzer-lab/rnaflow

Use a certain release

We recommend using a stable release of the pipeline:

    nextflow pull hoelzer-lab/rnaflow -r <RELEASE>

Usage

    nextflow run hoelzer-lab/rnaflow --reads input.csv --autodownload hsa --pathway hsa --max_cores 6 --cores 2

with --autodownload <hsa|mmu|ssc|mau|eco> for built-in species, or define your own genome reference and annotation files in CSV files:

    nextflow run hoelzer-lab/rnaflow --reads input.csv --genome fastas.csv --annotation gtfs.csv --max_cores 6 --cores 2

Genomes and annotations from --autodownload, --genome and --annotation are concatenated.

By default, all possible comparisons are performed. Use --deg to change this.

--pathway <hsa|mmu|mau|ssc> performs downstream pathway analysis. Available are WebGestalt gene set enrichment analysis (GSEA) for hsa, mmu and ssc, and piano GSEA with different settings and consensus scoring for hsa, mmu, mau and ssc.

Input files

Read files (required)

Specify your read files in FASTQ format with --reads input.csv. The file input.csv has to look like this for single-end reads:

    Sample,R1,R2,Condition,Source,Strandedness
    mock_rep1,/path/to/reads/mock1.fastq.gz,,mock,,0
    mock_rep2,/path/to/reads/mock2.fastq.gz,,mock,,0
    mock_rep3,/path/to/reads/mock3.fastq.gz,,mock,,0
    treated_rep1,/path/to/reads/treat1.fastq.gz,,treated,,0
    treated_rep2,/path/to/reads/treat2.fastq.gz,,treated,,0
    treated_rep3,/path/to/reads/treat3.fastq.gz,,treated,,0

and for paired-end reads, like this:

    Sample,R1,R2,Condition,Source,Strandedness
    mock_rep1,/path/to/reads/mock1_1.fastq,/path/to/reads/mock1_2.fastq,mock,A,0
    mock_rep2,/path/to/reads/mock2_1.fastq,/path/to/reads/mock2_2.fastq,mock,B,0
    mock_rep3,/path/to/reads/mock3_1.fastq,/path/to/reads/mock3_2.fastq,mock,C,0
    treated_rep1,/path/to/reads/treat1_1.fastq,/path/to/reads/treat1_2.fastq,treated,A,0
    treated_rep2,/path/to/reads/treat2_1.fastq,/path/to/reads/treat2_2.fastq,treated,B,0
    treated_rep3,/path/to/reads/treat3_1.fastq,/path/to/reads/treat3_2.fastq,treated,C,0

The first line is a required header. Read files can be compressed (.gz). You need at least two replicates for each condition to run the pipeline.

Source labels are optional: the header is still required, but the value can be empty, as in the single-end example above. Source labels can be used to define the experiment more precisely for improved differential expression testing, e.g. if RNA-Seq samples come from different Conditions (e.g. tissues) but the same Sources (e.g. patients). The comparison is still performed between the Conditions, but the Source information is additionally used in designing the DESeq2 experiment. Source labels also extend the heatmap sample annotation.

Strandedness can optionally be defined per sample directly in the CSV or via the command-line parameter --strand. The strandedness column accepts one of the following values: 0 = unstranded, 1 = stranded, 2 = reversely stranded [default: 0]. Note that if strandedness is provided via both the input CSV and the command-line parameter, the value from the command line is used for the run.
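The rules above (required header, at least two replicates per condition) can be checked before starting a run. The following is a minimal sketch, not part of RNAflow: the function name `check_samplesheet` is hypothetical, and it only inspects the header line and the Condition column.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of RNAflow): sanity-check a --reads sample sheet.
check_samplesheet() {
  awk -F',' '
    NR == 1 {
      # the header line is required and must list the documented columns
      if ($0 != "Sample,R1,R2,Condition,Source,Strandedness") {
        print "unexpected header: " $0
        bad = 1
        exit
      }
      next
    }
    NF >= 4 { replicates[$4]++ }   # column 4 holds the Condition label
    END {
      if (!bad) {
        for (c in replicates) {
          if (replicates[c] < 2) {
            print "condition \"" c "\" has fewer than two replicates"
            bad = 1
          }
        }
      }
      exit (bad ? 1 : 0)
    }
  ' "$1"
}

# usage: check_samplesheet input.csv && echo "sample sheet looks fine"
```

The function returns a non-zero status if the header differs or any condition has fewer than two rows, so it can be chained in front of the `nextflow run` command.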

Genomes and annotation

If you don't use one of the built-in species, specify your genomes via --genome fastas.csv, with fastas.csv looking like this:

    /path/to/reference_genome1.fasta
    /path/to/reference_genome2.fasta

and --annotation gtfs.csv with gtfs.csv looking like this:

    /path/to/reference_annotation_1.gtf
    /path/to/reference_annotation_2.gtf

You can add a built-in species to your defined genomes and annotation with --autodownload xxx.
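Since both files simply contain one path per line, they can be generated from a directory listing. This is a small sketch under assumptions: the helper name `list_reference_files` and the example directories are hypothetical, and `find -maxdepth` is assumed to be available (GNU and BSD `find` both provide it).

```bash
# Hypothetical helper: print all files in DIR matching PATTERN, one path per line.
list_reference_files() {
  local dir=$1 pattern=$2
  find "$dir" -maxdepth 1 -type f -name "$pattern" | LC_ALL=C sort
}

# usage (placeholder paths; pass absolute directories to get absolute paths):
#   list_reference_files /path/to/genomes '*.fasta' > fastas.csv
#   list_reference_files /path/to/annotations '*.gtf' > gtfs.csv
```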

Built-in species

We provide a small set of built-in species for which the genome and annotation files are automatically downloaded from Ensembl with --autodownload xxx. Please let us know if you need another species; we can easily add more.

| Species | three-letter shortcut | Annotation | Genome |
| ------- | --------------------- | ---------- | ------ |
| Homo sapiens | hsa | Homo_sapiens.GRCh38.98 | Homo_sapiens.GRCh38.dna.primary_assembly |
| Mus musculus | mmu | Mus_musculus.GRCm38.99 | Mus_musculus.GRCm38.dna.primary_assembly |
| Sus scrofa | ssc \* | Sus_scrofa.Sscrofa11.1.111 | Sus_scrofa.Sscrofa11.1.dna.toplevel |
| Mesocricetus auratus | mau \* | Mesocricetus_auratus.MesAur1.0.100 | Mesocricetus_auratus.MesAur1.0.dna.toplevel |
| Escherichia coli | eco | Escherichia_coli_k_12.ASM80076v1.45 | Escherichia_coli_k_12.ASM80076v1.dna.toplevel |

\* Downstream pathway analysis available via --pathway xxx.

Multiple-mapped reads

To adjust the handling of multiple-mapped reads during the feature counting process you can use: --featurecounts_additional_params '-t exon -g gene_id -M' The default handling is to only count uniquely mapped reads via featureCounts. With the above flag set featureCounts will also count multi-mapped reads.

Comparisons for DEG analysis

By default, all possible pairwise comparisons are performed in one direction only. Thus, when A is compared against B, the pipeline will not automatically compare B vs. A, which would only reverse the direction of the resulting fold changes. To change this, define the needed comparisons with --deg comparisons.csv, where each line contains a pairwise comparison:

    Condition1,Condition2
    conditionX,conditionY
    conditionA,conditionB
    conditionB,conditionA

The first line is a required header.
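To see which comparisons "all pairs in one direction" amounts to for a given sample sheet, or to get a starting point for a custom --deg file, the condition labels can be enumerated as follows. This is an illustrative sketch: `list_comparisons` is a hypothetical helper, and the alphabetical ordering within each pair is an assumption of this example, not necessarily the order RNAflow uses internally.

```bash
# Hypothetical helper: print a --deg style table with every pair of
# conditions from a --reads sample sheet exactly once.
list_comparisons() {
  echo "Condition1,Condition2"
  # column 4 of the sample sheet is the Condition label; skip the header
  awk -F',' 'NR > 1 && NF >= 4 { print $4 }' "$1" | LC_ALL=C sort -u | awk '
    { cond[NR] = $0 }
    END {
      for (i = 1; i <= NR; i++)
        for (j = i + 1; j <= NR; j++)
          print cond[i] "," cond[j]
    }'
}

# usage: list_comparisons input.csv > comparisons.csv
```

Edit the resulting file (swap or delete lines) before passing it to --deg if only some comparisons or the opposite direction are needed.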

Resume your run

You can easily resume your run in case of changes to the parameters or inputs. Nextflow will try to not recalculate steps that are already done:

nextflow run hoelzer-lab/rnaflow -profile test,conda,local -resume

Nextflow will need access to the working directory where temporary calculations are stored. Per default, this is set to work but can be adjusted via -w /path/to/any/workdir. In addition, the .nextflow.log file is needed to resume a run, thus, this will only work if you resume the run from the same folder where you started it.

Workflow control

Preprocessing

    --skip_sortmerna                     # skip rRNA removal via SortMeRNA [default: false]
    --skip_read_preprocessing            # skip preprocessing with fastp [default: false]
    --fastp_additional_params            # additional parameters for fastp [default: '-5 -3 -W 4 -M 20 -l 15 -x -n 5 -z 6']
    --hisat2_additional_params           # additional parameters for HISAT2
    --featurecounts_additional_params    # additional parameters for FeatureCounts [default: -t gene -g gene_id]

DEG analysis

    --strand             # strandedness for counting with featureCounts: 0 (unstranded), 1 (stranded) and 2 (reversely stranded) [default: 0]
    --tpm                # threshold for TPM (transcripts per million) filter [default: 1]
    --deg                # a CSV file following the pattern: conditionX,conditionY
    --pathway            # perform different downstream pathway analysis for the species hsa|mmu|mau|ssc
    --feature_id_type    # ID type for downstream analysis [default: ensembl_gene_id]
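According to the help message, the --tpm filter discards a feature if, for all conditions, the mean TPM of the samples in that condition is below the threshold. The sketch below illustrates that rule only; it is not the pipeline's implementation. The helper name `tpm_filter` and the long `gene,condition,tpm` input format are assumptions made for this example.

```bash
# Hypothetical illustration of the --tpm rule.
# Reads "gene,condition,tpm" rows on stdin and prints the genes that are kept,
# i.e. genes whose mean TPM reaches the threshold in at least one condition.
tpm_filter() {
  LC_ALL=C awk -F',' -v thr="$1" '
    { sum[$1 SUBSEP $2] += $3; n[$1 SUBSEP $2]++ }
    END {
      for (k in sum) {
        split(k, part, SUBSEP)           # part[1] = gene, part[2] = condition
        if (sum[k] / n[k] >= thr) keep[part[1]] = 1
      }
      for (g in keep) print g
    }' | LC_ALL=C sort
}

# geneA has a mean TPM of 4 in "treated" and is kept;
# geneB stays below 1 in both conditions and is discarded.
printf '%s\n' \
  'geneA,mock,0.2' 'geneA,mock,0.4' 'geneA,treated,3' 'geneA,treated,5' \
  'geneB,mock,0.1' 'geneB,mock,0.3' 'geneB,treated,0.5' 'geneB,treated,0.9' \
  | tpm_filter 1   # prints: geneA
```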

Transcriptome assembly

    --assembly           # switch to transcriptome assembly
    --busco_db           # BUSCO database ['euarchontoglires' or path to existing DB]
    --dammit_uniref90    # add UniRef90 to dammit databases, takes long [false]
    --rna                # activate directRNA mode for ONT transcriptome assembly [default: false (cDNA)]

Profiles/configuration options

By default, the pipeline is executed locally with conda dependency management (corresponds to -profile local,conda). Adjust this setting by combining an executor option with an engine option, e.g. -profile local,conda or -profile slurm,conda. We also provide container support, see below.

Executor options...

... or how to schedule your workload.

Currently implemented are local, slurm and lsf executions.

You can customize local with these parameters:

    --cores        # cores for one process [default 1]
    --max_cores    # max. cores used in total [default allAvailable]
    --memory       # max. memory in GB for local use [default 8 GB]
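As a rule of thumb for choosing these two values: assuming the local executor only starts another process while the summed --cores of the running processes stays within --max_cores (an assumption about scheduling, not something this README states explicitly), the number of processes running side by side is the integer quotient of the two. The function name below is hypothetical.

```bash
# Hypothetical helper: estimate how many processes can run side by side
# for a given --max_cores / --cores combination (integer division).
max_parallel_processes() {
  local max_cores=$1 cores=$2
  echo $(( max_cores / cores ))
}

max_parallel_processes 6 2     # prints 3
max_parallel_processes 24 24   # prints 1, i.e. no parallel executions
```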

Engine options...

... or in which environment to run the tools.

Currently implemented are conda, Docker and Singularity. For transcriptome assembly some tools need to be run with Docker or Singularity.

You can switch between different engines via -profile, for example:

    nextflow run hoelzer-lab/rnaflow -profile test,local,conda
    nextflow run hoelzer-lab/rnaflow -profile test,local,docker
    nextflow run hoelzer-lab/rnaflow -profile test,slurm,singularity

As a best practice for a local execution, we recommend running the pipeline with --cores 1 --max_cores 1 the first time you use Singularity, because we experienced issues when the Singularity images were generated in parallel during the first execution with this engine option. It is also possible to run the pipeline once with --setup set. In setup mode, all necessary files (DBs, reference files and images) are downloaded and set up.

You can customize where conda environments are stored using

    --condaCacheDir /path/to/dir

and where Singularity images are stored via

    --singularityCacheDir /path/to/dir

Docker images are stored based on your system configuration.

Monitoring

Monitoring with Nextflow Tower

To monitor your computations, the pipeline can be connected to Nextflow Tower. You need a user access token to connect your Tower account with the pipeline. Simply generate a login using your email address and then click the link sent to this address.

"Nextflow Tower does not require a password or registration procedure. Just provide your email address and we'll send you an authentication link to login. That's all!"

Once logged in, click on your avatar on the top right corner and select "Your tokens". Generate a token or copy the default one and set the environment variable:

    export TOWER_ACCESS_TOKEN=<YOUR_COPIED_TOKEN>
    export NXF_VER=20.10.0

You can save these commands in your .bashrc or .profile so you don't need to enter them again.

Now run:

    nextflow run hoelzer-lab/rnaflow -profile test,local,conda -with-tower

Alternatively, you can also activate the Tower connection within the nextflow.config file located in the root GitHub directory:

    tower {
        accessToken = ''
        enabled = true
    }

You can also directly enter your access token here instead of generating the above environment variable.

Output

The result folder is structured by each step and tool (results/step/tool) as follows:

    results/
    ├── 01-Trimming
    │   └── fastp                   trimmed reads
    ├── 02-rRNARemoval
    │   └── SortMeRNA               rRNA-free (and trimmed) reads
    ├── 03-Mapping
    │   └── HISAT2                  mapping results in BAM format with index files (BAI)
    ├── 04-Counting
    │   └── featureCounts           counting table
    ├── 05-CountingFilter
    │   └── TPM                     counting table with additional TPM value; formatted counting table filtered by TPM
    ├── 06-Annotation               filtered annotation; gene id, name and bio type mapping
    ├── 07-DifferentialExpression
    │   └── DESeq2                  see below
    ├── 08-Assembly
    │   └── de_novo
    │       └── Trinity             Trinity assembly (with --assembly)
    ├── 09-RNA-Seq_Annotation       BUSCO, dammit and StringTie2 results (with --assembly)
    ├── Logs                        Nextflow execution timeline and workflow report
    └── Summary                     MultiQC report

Please note, that 08-Assembly and 09-RNA-Seq_Annotation are part of the transcriptome assembly branch (--assembly). Here, steps 04 to 07 are currently not applicable.

DESeq2 results

The DESeq2 result is structured as follows:

    07-DifferentialExpression/
    └── DESeq2
        ├── data
        │   ├── counts                  normalized, transformed counts; size factors table
        │   └── input                   DESeq2 input summary
        ├── deseq2.Rout                 R log file
        ├── MAQCA_vs_MAQCB              results for pairwise comparison (here exemplarily for the -profile test data set)
        │   ├── downstream_analysis
        │   │   ├── piano               piano results
        │   │   └── WebGestalt          WebGestalt results
        │   ├── input                   DESeq2 input summary
        │   ├── plots
        │   │   ├── heatmaps
        │   │   ├── MA
        │   │   ├── PCA
        │   │   ├── sample2sample
        │   │   └── volcano
        │   ├── reports                 DESeq2 result HTML table; summary report
        │   └── results                 raw and filtered DESeq2 result in CSV and XLSX format; DEG analysis summary
        └── plots                       heatmaps and PCA of all samples

We provide DESeq2 normalized, regularized log (rlog), variance stabilized (vsd) and log2(n+1) (ntd) transformed count tables (DESeq2/data/counts).

For each comparison (specified with --deg or, by default, all possible pairwise comparisons in one direction), a new folder X_vs_Y is created. This also describes the direction of the comparison, e.g., the log2FoldChange describes the change of a gene A under condition Y with respect to the gene under condition X. For example, a log2FoldChange of +2 for gene A corresponds to a 4-fold (2^2) upregulation when we compare condition X vs. condition Y. The gene A is higher expressed in samples belonging to condition X.
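Because the log2FoldChange is on a log2 scale, the linear fold change is 2 raised to that value. A quick way to convert values from the result tables, using a hypothetical helper name and plain `awk`:

```bash
# Hypothetical helper: convert a log2 fold change into a linear fold change.
log2fc_to_fold() {
  LC_ALL=C awk -v lfc="$1" 'BEGIN { printf "%.2f\n", 2 ^ lfc }'
}

log2fc_to_fold 1    # prints 2.00 (2-fold)
log2fc_to_fold 2    # prints 4.00 (4-fold)
log2fc_to_fold -1   # prints 0.50 (halved)
```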

Downstream analyses (--pathway xxx) are currently provided for some species: GSEA consensus scoring with piano for Homo sapiens (hsa), Mus musculus (mmu), Mesocricetus auratus (mau), and Sus scrofa (ssc); and WebGestalt GSEA for Homo sapiens, Mus musculus, and Sus scrofa.

Working offline

In case you don't have an internet connection, here is a workaround: download the external resources manually and copy them into place.

- Genomes and annotation can also be specified via --genome and --annotation, see Genomes and annotation.
- For BUSCO it is a simple download, see here, with busco_db = 'euarchontoglires_odb9' as default.
- For SortMeRNA and dammit the tools must be installed. Version specifications can be found here and there, the code to create the databases here and there, with busco_db = 'euarchontoglires_odb9' and dammit_uniref90 = false as default.
- Downstream analysis with piano and WebGestalt currently needs an internet connection in any case. If no connection is available, piano and WebGestalt are skipped.

RNAflow looks up the files here:

```
nextflow-autodownload-databases      # default: `permanentCacheDir = 'nextflow-autodownload-databases'`
└── databases
    └── busco
        └── .tar.gz
    └── dammit
        └── .tar.gz
        └── uniref90                 # in case of `dammit_uniref90 = true`
            └── .tar.gz
    └── sortmerna
        └── data
            └── rRNA_databases
```

Help message

```
Usage examples:
nextflow run hoelzer-lab/rnaflow -profile test,local,conda
nextflow run hoelzer-lab/rnaflow --cores 4 --reads input.csv --autodownload mmu --pathway mmu
nextflow run hoelzer-lab/rnaflow --cores 4 --reads input.csv --autodownload eco --assembly
nextflow run hoelzer-lab/rnaflow --cores 4 --reads input.csv --genome fasta_virus.csv --annotation gtf_virus.csv --autodownload hsa --pathway hsa
Genomes and annotations from --autodownload, --genome and --annotation are concatenated.

Input:
--reads                  A CSV file following the pattern: Sample,R1,R2,Condition,Source,Strandedness
                         - read mode is detected automatically (check terminal output if correctly assigned)
                         Per default, all possible comparisons of conditions in one direction are made. Use --deg to change.
--autodownload           Specifies the species identifier for automated download [default: ]
                         Currently supported are:
                         - hsa [Ensembl: Homo_sapiens.GRCh38.dna.primary_assembly | Homo_sapiens.GRCh38.98]
                         - eco [Ensembl: Escherichia_coli_k_12.ASM80076v1.dna.toplevel | Escherichia_coli_k_12.ASM80076v1.45]
                         - mmu [Ensembl: Mus_musculus.GRCm38.dna.primary_assembly | Mus_musculus.GRCm38.99.gtf]
                         - ssc [Ensembl: Sus_scrofa.Sscrofa11.1.dna.toplevel | Sus_scrofa.Sscrofa11.1.111 ]
                         - mau [Ensembl: Mesocricetus_auratus.MesAur1.0.dna.toplevel | Mesocricetus_auratus.MesAur1.0.100]
--species                Specifies the species identifier for downstream path analysis. (DEPRECATED)
                         If `--include_species` is set, reference genome and annotation are added and automatically downloaded. [default: ]
                         Currently supported are:
                         - hsa [Ensembl: Homo_sapiens.GRCh38.dna.primary_assembly | Homo_sapiens.GRCh38.98]
                         - eco [Ensembl: Escherichia_coli_k_12.ASM80076v1.dna.toplevel | Escherichia_coli_k_12.ASM80076v1.45]
                         - mmu [Ensembl: Mus_musculus.GRCm38.dna.primary_assembly | Mus_musculus.GRCm38.99.gtf]
                         - mau [Ensembl: Mesocricetus_auratus.MesAur1.0.dna.toplevel | Mesocricetus_auratus.MesAur1.0.100]
--genome                 CSV file with genome reference FASTA files (one path in each line)
                         If set, --annotation must also be set.
--annotation             CSV file with genome annotation GTF files (one path in each line)
--include_species        Either --species or --genome/--annotation need to be used. Both input seetings can be also combined to use genome and annotation of supported species in addition to --genome and --annotation [default: false]

Preprocessing options:
--fastp_additional_params          additional parameters for fastp [default: -5 -3 -W 4 -M 20 -l 15 -x -n 5 -z 6]
--skip_sortmerna                   skip rRNA removal via SortMeRNA [default: false]
--skip_read_preprocessing          skip preprocessing with fastp [default: false]
--hisat2_additional_params         additional parameters for HISAT2 [default: ]
--featurecounts_additional_params  additional parameters for FeatureCounts [default: -t gene -g gene_id]

DEG analysis options:
--strand                 0 (unstranded), 1 (stranded) and 2 (reversely stranded) [default: 0]
                         This will overwrite the optional strandedness defined in the input CSV file.
--tpm                    Threshold for TPM (transcripts per million) filter. A feature is discared, if for all conditions the mean TPM value of all corresponding samples in this condition is below the threshold. [default: 1]
--deg                    A CSV file following the pattern: conditionX,conditionY
                         Each line stands for one differential gene expression comparison.
                         Must match the 'Condition' labels defined in the CSV file provided via --reads.
--pathway                Perform different downstream pathway analysis for the species. [default: ]
                         Currently supported are:
                         - hsa | Homo sapiens
                         - mmu | Mus musculus
                         - mau | Mesocricetus auratus
                         - ssc | Sus scrofa
--feature_id_type        ID type for downstream analysis [default: ensembl_gene_id]

Transcriptome assembly options:
--assembly               Perform de novo and reference-based transcriptome assembly instead of DEG analysis [default: false]
--busco_db               The database used with BUSCO [default: euarchontoglires_odb9]
                         Full list of available data sets at https://busco-data.ezlab.org/v5/data/lineages/
--dammit_uniref90        Add UniRef90 to the dammit databases (time consuming!) [default: false]
--rna                    Activate directRNA mode for ONT transcriptome assembly [default: false (cDNA)]

Computing options:
--cores                  Max cores per process for local use [default: 1]
--max_cores              Max cores used on the machine for local use [default: 4]
--memory                 Max memory in GB for local use [default: 8 GB]
--output                 Name of the result folder [default: results]

Caching:
--permanentCacheDir      Location for auto-download data like databases [default: nextflow-autodownload-databases]
--condaCacheDir          Location for storing the conda environments [default: conda]
--singularityCacheDir    Location for storing the singularity images [default: singularity]
--workdir                Working directory for all intermediate results [default: null] (DEPRECATED: use `-w your/workdir` instead)
--softlink_results       Softlink result files instead of copying.
--setup                  Download all necessary DB, reference and image files without running the pipeline. [default: false]

Nextflow options:
-with-tower              Activate monitoring via Nextflow Tower (needs TOWER_ACCESS_TOKEN set).
-with-report rep.html    CPU / RAM usage (may cause errors).
-with-dag chart.html     Generates a flowchart for the process tree.
-with-timeline time.html Timeline (may cause errors).

Execution/Engine profiles:
The pipeline supports profiles to run via different Executers and Engines e.g.: -profile local,conda

Executer (choose one):
  local
  slurm
  lsf
  latency
Engines (choose one):
  conda
  mamba
  docker
  singularity

Per default: -profile local,conda is executed.

For a test run (~ 15 min), add "test" to the profile, e.g. -profile test,local,conda.
The command will create all conda environments and download and run test data.

We also provide some pre-configured profiles for certain HPC environments:
  ara (slurm, conda and parameter customization)
```

Known bugs and issues

Don't name your reference genome file `reference.fa`

Internally, this will cause problems because of how RNAflow renames the file for processing. Please use another name (anyway, it's a good idea to be more descriptive in your file names).

Problems with `SortMeRNA`/ `HISAT2` error (#141, #116)

Description

The pipeline fails with something like this:

```
Error executing process > 'preprocess:hisat2 (2)'

Caused by:
  Missing output file(s) `22_rep4_summary.log` expected by process `preprocess:hisat2 (2)`

Command executed:

  hisat2 -x reference -1 22_rep4.R1.other.fastq.gz -2 22_rep4.R2.other.fastq.gz -p 60 --new-summary --summary-file 22_rep4_summary.log | samtools view -bS | samtools sort -o 22_rep4.sorted.bam -T tmp --threads 60

Command exit status:
  0

Command output:
  (empty)

Command error:
  Error: Read AFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ has more quality values than read characters.
  terminate called after throwing an instance of 'int'
  Aborted (core dumped)
  (ERR): hisat2-align exited with value 134
  [bam_sort_core] merging from 0 files and 60 in-memory blocks...
  grep: warning: GREP_OPTIONS is deprecated; please use an alias or script

Work dir:
  /tmp/nextflow-work-as11798/2f/4a5b7060530705c2697bdf3eec73a4

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
```

- Often encountered when running in screen or tmux
- Nextflow's -bg option does not help

Workaround

- Skip SortMeRNA with --skip_sortmerna
- Reads can be cleaned beforehand, e.g. with CLEAN

Description

The pipeline fails with something like this:

```
ERROR ~ Error executing process > 'preprocess_illumina:hisat2 (mock_rep3)'

Caused by:
  Missing output file(s) `mock_rep3_summary.log` expected by process `preprocess_illumina:hisat2 (mock_rep3)`

Command executed:

  hisat2 -x reference -U mock_rep3.other.fastq.gz -p 24 --new-summary --summary-file mock_rep3_summary.log | samtools view -bS | samtools sort -o mock_rep3.sorted.bam -T tmp --threads 24

Command exit status:
  0

Command output:
  (empty)

Command error:
  (ERR): mkfifo(/tmp/42.unp) failed.
  Exiting now ...
```

- This is very likely caused by HISAT2 creating tmp directories with the same name (https://github.com/DaehwanKimLab/hisat2/issues/438), which causes trouble when HISAT2 processes are executed in parallel.
- To avoid this, a --temp-directory parameter was added to HISAT2, but only in a non-release version of the tool.
- We might fix this in future versions of the pipeline by using a HISAT2 container built from a git tag or similar instead of a proper release number.
- However, this will still be a potential problem when using the conda profile, where we rely on released tool versions on bioconda.

Workaround

- If you encounter this problem, try running the pipeline without parallel executions.
- Use the local profile instead of a cluster profile such as slurm.
- Specify something like --max_cores 24 --cores 24 to prevent parallel executions.

Latency problems on HPCs, issue (#79)

Description

Latency-related problems with Nextflow might occur when running on HPC systems, where Nextflow expects files to be available before they are fully written to the file system. In these cases, Nextflow might get stuck or report missing output or input files for some processes:

```
ERROR ~ Error executing process > 'some_process'

Caused by:
  Missing output file(s) some_process.out expected by process some_process
```

Often encountered when running on HPC systems

Workaround

Please try running the pipeline with the latency profile activated, just add it to the profiles you already defined:

-profile slurm,conda,latency

Citation

If you use RNAflow please cite:

> Marie Lataretu and Martin Hölzer. "RNAflow: An effective and simple RNA-Seq differential gene expression pipeline using Nextflow". Genes 2020, 11(12), 1487; https://doi.org/10.3390/genes11121487

Owner

Name: hoelzer-lab
Login: hoelzer-lab
Kind: organization

Repositories: 4
Profile: https://github.com/hoelzer-lab

Citation (citations.md)

# RNAflow Citations

## [Nextflow](https://www.ncbi.nlm.nih.gov/pubmed/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

* [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

* [fastp](https://pubmed.ncbi.nlm.nih.gov/30423086/)
  > Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sep 1;34(17):i884-i890. doi: 10.1093/bioinformatics/bty560. PMID: 30423086; PMCID: PMC6129281.

* [SortMeRNA](https://www.ncbi.nlm.nih.gov/pubmed/23071270/)
  > Kopylova E, Noé L, Touzet H. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics. 2012 Dec 15;28(24):3211-7. doi: 10.1093/bioinformatics/bts611. Epub 2012 Oct 15. PubMed PMID: 23071270.

* [featureCounts](https://www.ncbi.nlm.nih.gov/pubmed/24227677/)
  > Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014 Apr 1;30(7):923-30. doi: 10.1093/bioinformatics/btt656. Epub 2013 Nov 13. PubMed PMID: 24227677.

* [HISAT2](https://www.ncbi.nlm.nih.gov/pubmed/31375807/)
  > Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019 Aug;37(8):907-915. doi: 10.1038/s41587-019-0201-4. Epub 2019 Aug 2. PubMed PMID: 31375807.

* [SAMtools](https://www.ncbi.nlm.nih.gov/pubmed/19505943/)
  > Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.

* [MultiQC](https://www.ncbi.nlm.nih.gov/pubmed/27312411/)
  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

* [Trinity](https://pubmed.ncbi.nlm.nih.gov/21572440/)
  > Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011 May 15;29(7):644-52. doi: 10.1038/nbt.1883. PMID: 21572440; PMCID: PMC3571712.

* [BUSCO](https://pubmed.ncbi.nlm.nih.gov/29220515/)
  > Waterhouse RM, Seppey M, Simão FA, Manni M, Ioannidis P, Klioutchnikov G, Kriventseva EV, Zdobnov EM. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Mol Biol Evol. 2018 Mar 1;35(3):543-548. doi: 10.1093/molbev/msx319. PMID: 29220515; PMCID: PMC5850278.

* [dammit](http://www.camillescott.org/dammit)

* [StringTie2](https://pubmed.ncbi.nlm.nih.gov/31842956/)
  > Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 2019 Dec 16;20(1):278. doi: 10.1186/s13059-019-1910-1. PMID: 31842956; PMCID: PMC6912988.

* [GffRead](https://pubmed.ncbi.nlm.nih.gov/32489650/)
  > Pertea G, Pertea M. GFF Utilities: GffRead and GffCompare. F1000Res. 2020 Apr 28;9:ISCB Comm J-304. doi: 10.12688/f1000research.23297.2. PMID: 32489650; PMCID: PMC7222033.

## R packages

* [R](https://www.R-project.org/)
  > R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

* [DESeq2](https://www.ncbi.nlm.nih.gov/pubmed/25516281/)
  > Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. PubMed PMID: 25516281; PubMed Central PMCID: PMC4302049.

* [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html)
  > H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

* [pheatmap](https://CRAN.R-project.org/package=pheatmap)
  > Raivo Kolde (2018). pheatmap: Pretty Heatmaps.

* [RColorBrewer](https://CRAN.R-project.org/package=RColorBrewer)
  > Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes.

* [piano](https://pubmed.ncbi.nlm.nih.gov/23444143/)
  > Väremo L, Nielsen J, Nookaew I. Enriching the gene set analysis of genome-wide data by incorporating directionality of gene expression and combining statistical hypotheses and methods. Nucleic Acids Res. 2013 Apr;41(8):4378-91. doi: 10.1093/nar/gkt111. Epub 2013 Feb 26. PMID: 23444143; PMCID: PMC3632109.

* [apeglm](https://pubmed.ncbi.nlm.nih.gov/30395178/)
  > Zhu A, Ibrahim JG, Love MI. Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics. 2019 Jun 1;35(12):2084-2092. doi: 10.1093/bioinformatics/bty895. PMID: 30395178; PMCID: PMC6581436.

* [EnhancedVolcano](https://github.com/kevinblighe/EnhancedVolcano)

* [regionReport](https://pubmed.ncbi.nlm.nih.gov/27429738/)
  > Collado-Torres L, Jaffe AE, Leek JT. regionReport: Interactive reports for region-level and feature-level genomic analyses. F1000Res. 2015 May 1;4:105. doi: 10.12688/f1000research.6379.2. PMID: 27429738; PMCID: PMC4934510.

* [ReportingTools](https://pubmed.ncbi.nlm.nih.gov/24078713/)
  > Huntley MA, Larson JL, Chaivorapol C, Becker G, Lawrence M, Hackney JA, Kaminker JS. ReportingTools: an automated result processing and presentation toolkit for high-throughput genomic analyses. Bioinformatics. 2013 Dec 15;29(24):3220-1. doi: 10.1093/bioinformatics/btt551. Epub 2013 Sep 29. PMID: 24078713; PMCID: PMC5994940.

* [WebGestaltR](https://pubmed.ncbi.nlm.nih.gov/31114916/)
  > Liao Y, Wang J, Jaehnig EJ, Shi Z, Zhang B. WebGestalt 2019: gene set analysis toolkit with revamped UIs and APIs. Nucleic Acids Res. 2019 Jul 2;47(W1):W199-W205. doi: 10.1093/nar/gkz401. PMID: 31114916; PMCID: PMC6602449.

* [gplots](https://cran.r-project.org/package=gplots)

* [biomaRt](https://pubmed.ncbi.nlm.nih.gov/19617889/)
  > Durinck S, Spellman PT, Birney E, Huber W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc. 2009;4(8):1184-91. doi: 10.1038/nprot.2009.97. Epub 2009 Jul 23. PMID: 19617889; PMCID: PMC3159387.

* [openxlsx](https://cran.r-project.org/package=openxlsx)

## Ruby

* [Ruby](https://www.ruby-lang.org/en/)

## Python packages

* [Python](https://www.python.org/)

* [pandas](https://pandas.pydata.org/)

* [NumPy](https://numpy.org/)

## Software packaging/containerisation tools

* [Anaconda](https://anaconda.com)
  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

* [Bioconda](https://www.ncbi.nlm.nih.gov/pubmed/29967506/)
  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

* [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)
  > Boettiger, Carl. "An introduction to Docker for reproducible research." ACM SIGOPS Operating Systems Review 49.1 (2015): 71-79.

* [Singularity](https://www.ncbi.nlm.nih.gov/pubmed/28494014/)
  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total

Create event: 9
Release event: 6
Issues event: 13
Watch event: 10
Issue comment event: 54
Push event: 14
Pull request event: 12
Fork event: 2

Last Year

Create event: 9
Release event: 6
Issues event: 13
Watch event: 10
Issue comment event: 54
Push event: 14
Pull request event: 12
Fork event: 2

Committers

Last synced: over 2 years ago

All Time

Total Commits: 665
Total Committers: 10
Avg Commits per committer: 66.5
Development Distribution Score (DDS): 0.595

Past Year

Commits: 23
Committers: 3
Avg Commits per committer: 7.667
Development Distribution Score (DDS): 0.174

Top Committers

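| Name | Email | Commits |
|------|-------|---------|
| MarieLataretu | `m**u@u**e` | 269 |
| David Fischer | `d**9@g**e` | 197 |
| hoelzer | `h**n@g**m` | 143 |
| David Fischer | `8****b` | 24 |
| MarieLataretu | `5****u` | 19 |
| Florian Mock | `f**1@w**e` | 4 |
| Martin Hölzer | `m**n@M**x` | 3 |
| David | `y**u@e**m` | 3 |
| martin | `m**n@m**x` | 2 |
| Florian Mock | `f**k@u**e` | 1 |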
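
The all-time committer figures can be cross-checked from the per-committer commit counts listed in the Top Committers table. The sketch below is only an illustration: the page does not state how the Development Distribution Score (DDS) is computed, so the formula used here (one minus the top committer's share of all commits) is an assumption, chosen because it reproduces the reported value. The names `commits` and `development_distribution_score` are hypothetical.

```python
# Commit counts copied from the Top Committers table,
# one entry per committer identity (name plus email), in table order.
commits = [269, 197, 143, 24, 19, 4, 3, 3, 2, 1]


def development_distribution_score(counts):
    """Assumed definition: 1 - (commits of the top committer / all commits)."""
    return 1 - max(counts) / sum(counts)


total_commits = sum(commits)
committers = len(commits)
avg_commits = total_commits / committers
dds = development_distribution_score(commits)

print(f"Total commits: {total_commits}")
print(f"Total committers: {committers}")
print(f"Avg commits per committer: {avg_commits:.1f}")
print(f"DDS: {dds:.3f}")
```

Under this assumed definition, a score near 0 means one identity authored almost all commits, while higher values indicate that work is spread across more committers. Note that the same person can appear under several identities (for example, different email addresses), which raises the score.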

Committer Domains (Top 20 + Academic)

martins-air.fritz.box: 2, uni-jena.de: 2, gmx.de: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 8
Total pull requests: 8
Average time to close issues: about 1 month
Average time to close pull requests: about 2 months
Total issue authors: 5
Total pull request authors: 3
Average comments per issue: 6.75
Average comments per pull request: 0.38
Merged pull requests: 7
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 8
Pull requests: 7
Average time to close issues: about 1 month
Average time to close pull requests: 3 days
Issue authors: 5
Pull request authors: 3
Average comments per issue: 6.75
Average comments per pull request: 0.29
Merged pull requests: 6
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

hoelzer (4)
Artifice120 (4)
DAWells (1)
m-jahani (1)
zhangwenda0518 (1)
kheinze (1)
chrissikath (1)
AnneBoshove (1)
mglgc (1)
brennovmh (1)
uludag (1)

rnaflow

Science Score: 67.0%
