rnaseq_preprocess

Preprocessing of RNA-seq data using salmon and tximport

https://github.com/atpoint/rnaseq_preprocess

Science Score: 31.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (1.7%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: ATpoint
Language: Nextflow
Default Branch: main
Homepage:
Size: 1.69 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 10

Created almost 5 years ago · Last pushed about 1 year ago

Metadata Files

Readme Changelog Citation

rnaseq_preprocess

Introduction

rnaseq_preprocess is a Nextflow pipeline for RNA-seq quantification with salmon. The processing steps are: quality control with fastqc, optional trimming with seqtk, quantification with salmon, aggregation to gene level with tximport, and a small summary report with MultiQC. Multiple fastq files per sample are supported; these technical replicates are merged prior to quantification. The input is a set of fastq files, provided via a samplesheet:

    sample,r1,r2,libtype
    sampleA,/path/to/r1.fq.gz,/path/to/r2.fq.gz,A
    (...and...so...on)

The sample column is a user-chosen name for this set of fastq files. Fastq files with the same sample entry are concatenated before quantification. The r1 and r2 columns hold the paths to the fastq files. The libtype column is the library type argument passed to salmon; "A" means automatic detection.
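To illustrate how technical replicates are expressed, here is a minimal sketch of a samplesheet in which one sample spans two fastq sets. The file name, sample names, and paths are hypothetical; only the four column names come from the example above. The awk call simply counts rows per sample, so samples that would be merged show a count above 1.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file name and paths, for illustration only.
samplesheet="samplesheet.csv"

# sampleA has two rows (two fastq sets), so its files would be merged
# before quantification; sampleB has a single row.
cat > "$samplesheet" <<'EOF'
sample,r1,r2,libtype
sampleA,/path/to/A_run1_r1.fq.gz,/path/to/A_run1_r2.fq.gz,A
sampleA,/path/to/A_run2_r1.fq.gz,/path/to/A_run2_r2.fq.gz,A
sampleB,/path/to/B_r1.fq.gz,/path/to/B_r2.fq.gz,A
EOF

# Count rows per sample name, skipping the header line.
rows_per_sample=$(awk -F',' 'NR > 1 { n[$1]++ } END { for (s in n) print s "," n[s] }' "$samplesheet" | sort)
echo "$rows_per_sample"
```

A quick check like this can help catch typos in sample names, since a misspelled name would silently create a separate sample instead of a merged one.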

Details

Indexing

De novo indexing is supported and assumes that a gentrome (genome-decoyed transcriptome) is to be created. This requires at minimum a genome fasta file, a transcriptome fasta file, and a GTF:

    NXF_VER=21.10.6 nextflow run atpoint/rnaseq_preprocess -r main -profile singularity,slurm --only_idx \
        --genome path/to/genome.fa.gz --txtome path/to/txtome.fa.gz --gtf path/to/foo.gtf.gz \
        -with-report indexing_report.html -with-trace indexing_report.trace -bg > indexing_report.log

The indexing step must be run first and separately using the --only_idx flag.

--only_idx: trigger the indexing process
--idx_name: name of the produced index, default idx
--idx_dir: name of the directory inside rnaseq_preprocess_results/ storing the index, default salmon_idx
--idx_additional: additional arguments to salmon index beyond the defaults which are --no-version-check -t -d -i -p --gencode
--txtome: path to the gzipped transcriptome fasta
--genome: path to the gzipped genome fasta
--gtf: path to the gzipped GTF file
--transcript_id: name of GTF column storing transcript ID, default transcript_id
--transcript_name: name of GTF column storing transcript name, default transcript_name
--gene_id: name of GTF column storing gene ID, default gene_id
--gene_name: name of GTF column storing gene name, default gene_name
--gene_type: name of GTF column storing gene biotype, default gene_type

For the indexing process, 30GB of RAM and 6 CPUs are required; these values are hardcoded.
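The GTF column options matter when the annotation does not follow the default naming. Ensembl GTFs, for instance, store the gene biotype under gene_biotype rather than gene_type. The sketch below assembles an indexing command for such a file using only the options listed above; the index name is hypothetical and the input paths are the same placeholders as in the indexing command. It prints the command instead of running it, so it can be inspected first.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder input paths, as in the indexing command above.
genome="path/to/genome.fa.gz"
txtome="path/to/txtome.fa.gz"
gtf="path/to/foo.gtf.gz"

# Assemble the command as an array so every option stays a separate word.
cmd=(
  nextflow run atpoint/rnaseq_preprocess -r main
  -profile singularity,slurm
  --only_idx
  --genome "$genome" --txtome "$txtome" --gtf "$gtf"
  --idx_name ensembl_idx      # hypothetical index name
  --gene_type gene_biotype    # assumption: the GTF uses gene_biotype
)

# Dry run: print the assembled command. To launch it, run instead:
#   NXF_VER=21.10.6 "${cmd[@]}"
cmd_line="${cmd[*]}"
echo "$cmd_line"
```

Whether the other defaults (transcript_id, transcript_name, gene_id, gene_name) fit a given GTF should be checked against the file itself before indexing.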

Quantification/tximport

Quantification command line:

    NXF_VER=21.10.6 nextflow run atpoint/rnaseq_preprocess -r main -profile singularity,slurm \
        --idx path/to/idx/folder/ --tx2gene path/to/tx2gene.txt --samplesheet path/to/samplesheet.csv \
        -with-report quant_report.html -with-trace quant_report.trace -bg > quant_report.log

--idx: path to the salmon index folder
--tx2gene: path to the tx2gene map matching transcripts to genes
--samplesheet: path to the input samplesheet
--trim_reads: logical, whether to trim reads to a fixed length
--trim_length: numeric, length for trimming
--quant_additional: additional options to salmon quant beyond --gcBias --seqBias --posBias
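As a rough sketch of how the trimming options combine with the required inputs, the following assembles a quantification command that trims reads to a fixed length. The length of 75 is a hypothetical value, and passing --trim_reads as a bare flag relies on Nextflow's general behavior of setting a valueless parameter to true; the paths are the placeholders from the command above. The command is only printed, not executed.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Assemble the command as an array so every option stays a separate word.
cmd=(
  nextflow run atpoint/rnaseq_preprocess -r main
  -profile singularity,slurm
  --idx path/to/idx/folder/
  --tx2gene path/to/tx2gene.txt
  --samplesheet path/to/samplesheet.csv
  --trim_reads        # bare flag: Nextflow sets the parameter to true
  --trim_length 75    # hypothetical fixed read length
)

# Dry run: print the assembled command. To launch it, run instead:
#   NXF_VER=21.10.6 "${cmd[@]}"
cmd_line="${cmd[*]}"
echo "$cmd_line"
```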

We hardcoded 30GB RAM and 6 CPUs for the quantification. On our HPC we use:

Other available options

--merge_keep: logical, whether to keep the merged fastq files
--merge_dir: folder inside the output directory to store the merged fastq files
--trim_keep: logical, whether to keep the trimmed fastq files
--trim_dir: folder inside the output directory to store the trimmed fastq files
--skip_fastqc: logical, whether to skip fastqc
--only_fastqc: logical, whether to only run fastqc and skip quantification
--skip_multiqc: logical, whether to skip multiqc
--skip_tximport: logical, whether to skip the tximport process downstream of the quantification
--fastqc_dir: folder inside the output directory to store the fastqc results
--multiqc_dir: folder inside the output directory to store the multiqc results

Output is a folder rnaseq_preprocess_results with self-explanatory content. See the misc folder, which contains the software versions used in the pipeline and the exact command lines. When the pipeline is run, this information is written to the pipeline_info folder of the output directory.

Owner

Name: Alexander Bender (né Toenges)
Login: ATpoint
Kind: user
Location: Germany

Website: https://www.biostars.org/u/25721/
Repositories: 6
Profile: https://github.com/ATpoint

Postdoc, working in the context of inflammation and cardiovascular disease. Wannabe cyclist and salsa dancer. Dad. Not in that order.

Citation (CITATIONS.md)

# Citations

## Nextflow

- [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

- [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

- [Salmon](https://pubmed.ncbi.nlm.nih.gov/28263959/)

- [seqtk](https://github.com/lh3/seqtk)

## R packages

- [R](https://www.R-project.org/)

- [rtracklayer](https://bioconductor.org/packages/release/bioc/html/rtracklayer.html)

- [tximport](https://f1000research.com/articles/4-1521/)

## Software packaging/container frameworks

- [Anaconda](https://anaconda.com)

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

- [Mambaforge](https://github.com/mamba-org/mamba)

- [Singularity / Apptainer](https://pubmed.ncbi.nlm.nih.gov/28494014/)

GitHub Events

Total

Issues event: 1
Delete event: 2
Push event: 6
Pull request event: 3
Create event: 2

Last Year

Issues event: 1
Delete event: 2
Push event: 6
Pull request event: 3
Create event: 2

Dependencies

.github/workflows/CI.yml actions

actions/checkout v2 composite

Dockerfile docker

condaforge/mambaforge 4.14.0-0 build

environment.yml pypi
