rnaseq_preprocess

Preprocessing of RNA-seq data using salmon and tximport

https://github.com/atpoint/rnaseq_preprocess

Basic Info
  • Host: GitHub
  • Owner: ATpoint
  • Language: Nextflow
  • Default Branch: main
  • Homepage:
  • Size: 1.69 MB
Created almost 5 years ago · Last pushed about 1 year ago

README.md

rnaseq_preprocess

Introduction

rnaseq_preprocess is a Nextflow pipeline for RNA-seq quantification with salmon. The pipeline first runs fastqc, optionally trims reads to a fixed length with seqtk, quantifies with salmon, aggregates to gene level with tximport and produces a small summary report with MultiQC. Multiple fastq files per sample are supported; these technical replicates are merged prior to quantification. The input consists of fastq files, provided via a samplesheet:

    sample,r1,r2,libtype
    sampleA,/path/to/r1.fq.gz,/path/to/r2.fq.gz,A
    (...and...so...on)

Sample is a user-chosen name for this set of fastq files. Fastq files with the same sample entry are concatenated before quantification. Libtype is the library type argument from salmon. "A" means automatic detection.
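
Before launching a run it can help to check the samplesheet for obvious formatting problems. The following is a minimal sketch, not part of the pipeline: `check_samplesheet` is a hypothetical helper that assumes the four-column layout shown above, checks the header and field count, and reports how many rows share each sample name (rows with the same name are the ones that would be merged). The `r2` column is deliberately not checked, since this README does not state how it must be filled.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not shipped with rnaseq_preprocess.
# Usage: check_samplesheet samplesheet.csv
check_samplesheet() {
  local sheet="$1"
  awk -F',' '
    NR == 1 {
      # The header must match the documented column names.
      if ($0 != "sample,r1,r2,libtype") { print "unexpected header: " $0; bad = 1 }
      next
    }
    $0 == "" { next }   # ignore blank lines
    NF != 4 { print "line " NR ": expected 4 fields, found " NF; bad = 1; next }
    $1 == "" || $2 == "" { print "line " NR ": empty sample or r1"; bad = 1; next }
    { rows[$1]++ }      # count rows per sample name
    END {
      for (s in rows) print s ": " rows[s] " row(s)"
      exit (bad ? 1 : 0)
    }
  ' "$sheet"
}
```

The function returns a non-zero exit status if any check fails, so it can be chained in front of the `nextflow run` command with `&&`.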

Details

Indexing

De novo indexing is supported and assumes that a gentrome (genome-decoyed transcriptome) is to be created. This requires, at minimum, a genome fasta file, a transcriptome fasta file and a GTF:

    NXF_VER=21.10.6 nextflow run atpoint/rnaseq_preprocess -r main -profile singularity,slurm --only_idx \
        --genome path/to/genome.fa.gz --txtome path/to/txtome.fa.gz --gtf path/to/foo.gtf.gz \
        -with-report indexing_report.html -with-trace indexing_report.trace -bg > indexing_report.log

The indexing step must be run first and separately using the --only_idx flag.

--only_idx: trigger the indexing process
--idx_name: name of the produced index, default idx
--idx_dir: name of the directory inside rnaseq_preprocess_results/ storing the index, default salmon_idx
--idx_additional: additional arguments to salmon index beyond the defaults which are --no-version-check -t -d -i -p --gencode
--txtome: path to the gzipped transcriptome fasta
--genome: path to the gzipped genome fasta
--gtf: path to the gzipped GTF file
--transcript_id: name of GTF column storing transcript ID, default transcript_id
--transcript_name: name of GTF column storing transcript name, default transcript_name
--gene_id: name of GTF column storing gene ID, default gene_id
--gene_name: name of GTF column storing gene name, default gene_name
--gene_type: name of GTF column storing gene biotype, default gene_type

The indexing process is hardcoded to request 30GB of RAM and 6 CPUs.
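
To illustrate what a gentrome is, the sketch below follows the general recipe from the salmon documentation: the decoy list holds the names of the genome sequences, and the gentrome is the transcriptome followed by the genome. It is an illustration under that assumption and not the pipeline's own code; the function name `make_gentrome` and the output file names are made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the pipeline's implementation.
# Usage: make_gentrome genome.fa.gz txtome.fa.gz outdir
make_gentrome() {
  local genome="$1" txtome="$2" outdir="$3"
  mkdir -p "$outdir"
  # Decoys: genome sequence names, i.e. header lines without the ">"
  # and cut at the first whitespace.
  gzip -dc "$genome" | awk '/^>/ { sub(/^>/, ""); print $1 }' > "$outdir/decoys.txt"
  # Gentrome: transcripts first, genome sequences after them.
  # Concatenated gzip files are themselves a valid gzip file.
  cat "$txtome" "$genome" > "$outdir/gentrome.fa.gz"
}
# The two files correspond to the -t and -d arguments of salmon index
# listed above (salmon itself is not called in this sketch).
```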

Quantification/tximport

Quantification command line:

    NXF_VER=21.10.6 nextflow run atpoint/rnaseq_preprocess -r main -profile singularity,slurm \
        --idx path/to/idx/folder/ --tx2gene path/to/tx2gene.txt --samplesheet path/to/samplesheet.csv \
        -with-report quant_report.html -with-trace quant_report.trace -bg > quant_report.log

--idx: path to the salmon index folder
--tx2gene: path to the tx2gene map matching transcripts to genes
--samplesheet: path to the input samplesheet
--trim_reads: logical, whether to trim reads to a fixed length
--trim_length: numeric, length for trimming
--quant_additional: additional options to salmon quant beyond --gcBias --seqBias --posBias
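
For readers unfamiliar with tx2gene maps: tximport expects a table whose first column is the transcript ID and whose second column is the gene ID. The sketch below shows how such a two-column table could be derived from a gzipped GTF using the default attribute names `transcript_id` and `gene_id`. It is only an illustration; the exact layout of the tx2gene file used by this pipeline is not described here and may contain further columns, and `gtf_to_tx2gene` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the pipeline's implementation.
# Usage: gtf_to_tx2gene annotation.gtf.gz > tx2gene.tsv
gtf_to_tx2gene() {
  gzip -dc "$1" | awk -F'\t' '
    # Return the quoted value of attribute "key" from a GTF attribute string.
    function attr(s, key,    m) {
      # A space is prepended so that the key must follow a separator.
      if (match(" " s, "[; ]" key " \"[^\"]+\"")) {
        m = substr(" " s, RSTART, RLENGTH)
        sub(/^[^"]*"/, "", m)   # drop everything up to the opening quote
        sub(/"$/, "", m)        # drop the closing quote
        return m
      }
      return ""
    }
    # Use one line per transcript; skip comment lines.
    $0 !~ /^#/ && $3 == "transcript" {
      tx = attr($9, "transcript_id")
      gene = attr($9, "gene_id")
      if (tx != "" && gene != "") print tx "\t" gene
    }
  '
}
```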

We hardcoded 30GB RAM and 6 CPUs for the quantification. On our HPC we use:

Other available options

--merge_keep: logical, whether to keep the merged fastq files
--merge_dir: folder inside the output directory to store the merged fastq files
--trim_keep: logical, whether to keep the trimmed fastq files
--trim_dir: folder inside the output directory to store the trimmed fastq files
--skip_fastqc: logical, whether to skip fastqc
--only_fastqc: logical, whether to only run fastqc and skip quantification
--skip_multiqc: logical, whether to skip multiqc
--skip_tximport: logical, whether to skip the tximport process downstream of the quantification
--fastqc_dir: folder inside the output directory to store the fastqc results
--multiqc_dir: folder inside the output directory to store the multiqc results
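
To make the effect of --trim_reads and --trim_length concrete, the sketch below trims every read of an uncompressed, four-lines-per-record fastq stream to a fixed maximum length. The pipeline itself uses seqtk for this step; the awk function here is a plain stand-in for illustration, and `trim_fastq_to_length` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for fixed-length trimming, not the pipeline's seqtk call.
# Usage: trim_fastq_to_length 50 < in.fastq > out.fastq
trim_fastq_to_length() {
  awk -v len="$1" '
    # Lines 2 and 4 of each 4-line record hold the sequence and the qualities;
    # both are cut to the same length so the record stays valid.
    NR % 4 == 2 || NR % 4 == 0 { print substr($0, 1, len); next }
    { print }
  '
}
```

Reads shorter than the requested length pass through unchanged.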

Output is a folder "rnaseq_preprocess_results" with self-explanatory content. See the misc folder, which contains the software versions used in the pipeline and the exact command lines. When the pipeline is run, this information is written to the pipeline_info folder of the output directory.

Owner

  • Name: Alexander Bender (né Toenges)
  • Login: ATpoint
  • Kind: user
  • Location: Germany

Postdoc, working in the context of inflammation and cardiovascular disease. Wannabe cyclist and salsa dancer. Dad. Not in that order.

Citation (CITATIONS.md)

# Citations

## Nextflow

- [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

- [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

- [Salmon](https://pubmed.ncbi.nlm.nih.gov/28263959/)

- [seqtk](https://github.com/lh3/seqtk)

## R packages

- [R](https://www.R-project.org/)

- [rtracklayer](https://bioconductor.org/packages/release/bioc/html/rtracklayer.html)

- [tximport](https://f1000research.com/articles/4-1521/)

## Software packaging/container frameworks

- [Anaconda](https://anaconda.com)

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

- [Mambaforge](https://github.com/mamba-org/mamba)

- [Singularity / Apptainer](https://pubmed.ncbi.nlm.nih.gov/28494014/)

Dependencies

.github/workflows/CI.yml actions
  • actions/checkout v2 composite
Dockerfile docker
  • condaforge/mambaforge 4.14.0-0 build
environment.yml pypi