nextflow-cager

Nextflow pipeline for CAGE data processing. It produces an R CAGEr object for further analysis.

https://github.com/nikitin-p/nextflow-cager

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 8 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.1%) to scientific vocabulary

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: nikitin-p
License: mit
Language: Nextflow
Default Branch: main
Size: 5.85 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed about 2 years ago

Metadata Files

Readme Changelog License Citation

Introduction

ComputationalRegulatoryGenomicsICL/customcageq is a Nextflow pipeline to process CAGE sequencing data from raw reads to the creation of a CAGEexp (CAGEr) object containing called TSSs. The pipeline is specifically designed to be used upstream of CAGEr.

Input

Either single-end or paired-end raw CAGE reads. Only one type of reads (either single- or paired-end) can be used in one run of the pipeline.

Output

A CAGEexp (CAGEr) object with called TSSs, ready for a downstream analysis with CAGEr. The intermediate and final results are stored in the results directory. The final CAGEexp object is stored in an RDS file in the results/cager directory.
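After a run, the final object can be located with a small shell check. This is only a sketch: it assumes the `results` directory named above and an `.rds` file extension, and `find_cageexp_rds` is a hypothetical helper, not part of the pipeline. The exact file name is not stated here, so the function lists whatever RDS files it finds.

```shell
# List RDS files under <results_dir>/cager (hypothetical helper; the
# exact file name produced by the pipeline is not assumed).
find_cageexp_rds() {
    results_dir="${1:-results}"
    if [ ! -d "$results_dir/cager" ]; then
        echo "No cager directory under $results_dir" >&2
        return 1
    fi
    # Case-insensitive match so that both .rds and .RDS are found.
    find "$results_dir/cager" -type f -iname '*.rds' | sort
}
```

A file found this way can then be loaded in R with `readRDS()` for downstream analysis with CAGEr.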

Steps

Merge per-lane FASTQ files with the nf-core/cat_fastq module.
Report raw read quality with FastQC.
Trim adapters with TrimGalore and run FastQC on trimmed reads.
Build the Bowtie2 index of the reference genome FASTA file with bowtie2-build, if the index is not provided.
Map the trimmed reads onto the Bowtie2 index using bowtie2, then filter out unmapped reads and select only uniquely mapped reads using samtools view with options -b -F 4 -q 20.
Optionally, remove PCR and optical duplicate reads with samtools markdup.
Sort the obtained BAM files with uniquely mapped reads using samtools sort.
Index the sorted BAM files with samtools index.
Assess mapping quality using samtools stats, samtools flagstat and samtools idxstats.
Create a CAGEexp object and call TSSs with CAGEr using a BSgenome package for the respective genome.
Create a MultiQC report.
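To make the read filter in the mapping step concrete, the sketch below emulates what `samtools view -F 4 -q 20` keeps, using awk on plain SAM text. It is an illustration only: the pipeline itself runs samtools on the aligner output, and `filter_sam` and the three demo reads are made up for this example.

```shell
# Emulate the selection done by `samtools view -F 4 -q 20` on SAM text.
# Header lines pass through; an alignment is kept only if FLAG bit 0x4
# (read unmapped) is unset and MAPQ (column 5) is at least 20.
filter_sam() {
    awk -F'\t' '
        /^@/ { print; next }                     # keep header lines
        (int($2 / 4) % 2 == 0) && ($5 >= 20)     # bit 0x4 unset, MAPQ >= 20
    '
}

# Hypothetical reads: r1 mapped with MAPQ 42, r2 unmapped (FLAG 4),
# r3 mapped on the reverse strand (FLAG 16) with MAPQ 1.
printf 'r1\t0\tchrI\t100\t42\t4M\t*\t0\t0\tACGT\tIIII\nr2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\nr3\t16\tchrI\t200\t1\t4M\t*\t0\t0\tACGT\tIIII\n' |
    filter_sam | cut -f1    # prints: r1
```

The MAPQ threshold is what stands in for "uniquely mapped" here: reads that align equally well to several places receive a low mapping quality and are dropped.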

Usage

Prepare for your first run

Currently, the pipeline works with Nextflow v23.04. Make sure that you have the latest version of Docker (if running the pipeline on a laptop / PC) or Singularity (if running on a high-performance cluster).
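A pre-flight check along these lines may help catch a version mismatch early. It is a sketch under assumptions: `is_supported_nextflow` is a hypothetical helper, and the parsing assumes that `nextflow -v` prints a line of the form `nextflow version 23.04.1.5866`.

```shell
# Return 0 if a Nextflow version string belongs to the 23.04 series.
is_supported_nextflow() {
    case "$1" in
        23.04|23.04.*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Query Nextflow only if it is installed. The third word of the
# `nextflow -v` output is assumed to be the version string.
if command -v nextflow >/dev/null 2>&1; then
    installed="$(nextflow -v | awk '{print $3}')"
    if is_supported_nextflow "$installed"; then
        echo "Nextflow $installed is in the 23.04 series"
    else
        echo "Warning: Nextflow $installed is outside the 23.04 series" >&2
    fi
fi
```

If a different release is installed, Nextflow's `NXF_VER` environment variable can be used to pin the version for a single run.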

Prepare your input data

Prepare the sample sheet with the description of input samples. In case of single-end reads, it should look like this:

```csv
sample,fastq_1,fastq_2,single_end
S1,/path/to/fastq/S1_S1_L001_R1_001.fastq.gz,,True
S1,/path/to/fastq/S1_S1_L002_R1_001.fastq.gz,,True
S2,/path/to/fastq/S2_S2_L001_R1_001.fastq.gz,,True
S2,/path/to/fastq/S2_S2_L002_R1_001.fastq.gz,,True
```

where

* sample is a unique identifier of a sample;
* fastq_1 (and fastq_2 in the case of paired-end reads) is a full path to the read libraries. In case of paired-end reads, fastq_1 contains the full path to forward reads, while fastq_2 contains the full path to reverse reads. One sample can be represented by more than one library if lanes are stored separately;
* single_end should be set to True for single-end reads and to False for paired-end reads.

For paired-end reads, fastq_2 should contain the full path to reverse reads, while single_end should be set to False.
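The column rules above can be checked mechanically before a run. The following is a minimal sketch, not the pipeline's own validation: `validate_samplesheet` is a hypothetical helper, and the paired-end rows in the demo use placeholder paths in the same style as the single-end example.

```shell
# Check a samplesheet against the layout described above.
# Prints one message per problem and returns non-zero if any were found.
validate_samplesheet() {
    awk -F',' '
        NR == 1 {
            if ($0 != "sample,fastq_1,fastq_2,single_end") {
                print "line 1: unexpected header"; bad = 1
            }
            next
        }
        NF != 4 { print "line " NR ": expected 4 columns"; bad = 1; next }
        $1 == "" || $2 == "" { print "line " NR ": sample and fastq_1 are required"; bad = 1 }
        $4 != "True" && $4 != "False" { print "line " NR ": single_end must be True or False"; bad = 1 }
        $4 == "True" && $3 != "" { print "line " NR ": fastq_2 must be empty for single-end reads"; bad = 1 }
        $4 == "False" && $3 == "" { print "line " NR ": fastq_2 is required for paired-end reads"; bad = 1 }
        {
            # One run may use only one read type, so single_end must not vary.
            if (seen != "" && $4 != seen) {
                print "line " NR ": read types are mixed"; bad = 1
            }
            seen = $4
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Hypothetical paired-end samplesheet (paths are placeholders).
sheet="$(mktemp)"
cat > "$sheet" <<'EOF'
sample,fastq_1,fastq_2,single_end
S1,/path/to/fastq/S1_S1_L001_R1_001.fastq.gz,/path/to/fastq/S1_S1_L001_R2_001.fastq.gz,False
S1,/path/to/fastq/S1_S1_L002_R1_001.fastq.gz,/path/to/fastq/S1_S1_L002_R2_001.fastq.gz,False
EOF
validate_samplesheet "$sheet" && echo "samplesheet looks consistent"
rm -f "$sheet"
```

The last rule reflects the restriction that a single run accepts either single-end or paired-end reads, but not both.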

You can generate the input CSV table automatically using the input_reads.sh script. It takes two positional arguments:

```bash
bash input_reads.sh /path/to/fastq_dir /path/to/samplesheet.csv
```

where

* /path/to/fastq_dir is a full path to a directory with raw FASTQ files;
* /path/to/samplesheet.csv is a file name, with a full path, of a CSV file to create.

To run the script as a standalone executable (that is, without the need to write bash before its name), add execution permissions to the script after cloning the repository:

```bash
chmod +x input_reads.sh
```
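For readers curious what such a generator involves, here is a rough stand-in for the single-end case. It is not the contents of input_reads.sh: `make_se_samplesheet` is a hypothetical function, and it assumes Illumina-style file names of the form `<sample>_S<n>_L<lane>_R1_001.fastq.gz`, as in the example samplesheet.

```shell
# Hypothetical stand-in for a samplesheet generator (single-end only).
# Assumes file names like <sample>_S<n>_L<lane>_R1_001.fastq.gz.
make_se_samplesheet() {
    fastq_dir="$1"
    out_csv="$2"
    echo "sample,fastq_1,fastq_2,single_end" > "$out_csv"
    for fq in "$fastq_dir"/*_R1_001.fastq.gz; do
        [ -e "$fq" ] || continue    # skip the literal pattern if nothing matched
        # Strip the _S<n>_L<lane>_R1_001.fastq.gz suffix to get the sample name.
        sample="$(basename "$fq" | sed -E 's/_S[0-9]+_L[0-9]+_R1_001\.fastq\.gz$//')"
        echo "$sample,$fq,,True" >> "$out_csv"
    done
}
```

Called as `make_se_samplesheet /path/to/fastq_dir /path/to/samplesheet.csv`, it writes one row per lane file, so a sample sequenced on two lanes appears twice under the same identifier.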

Toy input data for testing

The pipeline includes toy S. cerevisiae CAGE data stored in assets/sacer_fq for testing purposes (single-end reads in the se subfolder and paired-end reads in the pe subfolder). The single-end reads were obtained by subsampling the ERR2495152 dataset published by Börlin et al. (2018), while the paired-end reads were obtained by subsampling the SRR1631657 dataset published by Chabbert et al. (2015).

The corresponding input spreadsheets can be found in assets: samplesheet_se.csv for single-end reads and samplesheet_pe.csv for paired-end reads. However, you will need to use the input_reads.sh script to regenerate these spreadsheets with your paths to the test FASTQ files.

On these test data, CAGEr is able to call 52 TSSs with the single-end reads and 1,245 TSSs with the paired-end reads.

How to run the pipeline

Clone the repository to your machine and use the following syntax to run the pipeline:

```bash
nextflow run customcageq/main.nf \
    --bsgenome [/path/to/]bsgenome.package[.tar.gz] \
    (--fasta /path/to/fasta/genome.fa | --index /path/to/index/bowtie2) \
    [--dedup [--dist N]] \
    --input samplesheet.csv \
    -profile <institution/docker/singularity>
```

where

* --bsgenome specifies the BSgenome R package to use. If it is a file name (which should have a full path and the .tar.gz extension), then the package will be taken from the specified location; otherwise, the pipeline will try to install a BSgenome R package with the name bsgenome.package on the fly (see examples below);
* --fasta specifies a full path to a FASTA file containing a reference genome. This option is mandatory, unless --index is set. Remark: This option is mutually exclusive with --index.
* --index specifies a directory bowtie2 with a Bowtie2 reference genome index. This is a mandatory option, unless --fasta is set. Remark: This option is mutually exclusive with --fasta.
* --dedup switches on PCR duplicate removal.
* --dist N sets an optical duplicate distance N to remove optical duplicates, in addition to PCR duplicates (see samtools markdup, option -d). Remark: The argument is optional and requires --dedup.
* --input specifies the input CSV samplesheet.
* -profile is a Nextflow option that specifies a config file to use with Nextflow on a given machine. See nf-core/configs for ready-to-use institutional configs, including the one for Jex (the high-performance computing cluster of the Laboratory of Medical Sciences). Also, see the Jex wiki on how to run Nextflow on Jex. Alternatively, this option can be used to specify the containerization technology to use.
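The constraints between these options (exactly one of --fasta and --index, and --dist only together with --dedup) can be verified in a wrapper script before Nextflow is launched. The sketch below is one way to do that; `check_options` is a hypothetical helper and is not how the pipeline itself parses its parameters.

```shell
# Check a set of pipeline options against the rules described above.
# Usage: check_options [--fasta F] [--index D] [--dedup] [--dist N] ...
check_options() {
    fasta="" index="" dedup=0 dist=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --fasta|--index|--dist)
                [ "$#" -ge 2 ] || { echo "$1 needs a value" >&2; return 1; }
                case "$1" in
                    --fasta) fasta="$2" ;;
                    --index) index="$2" ;;
                    --dist)  dist="$2" ;;
                esac
                shift 2 ;;
            --dedup) dedup=1; shift ;;
            *)       shift ;;    # options not checked here are ignored
        esac
    done
    if [ -n "$fasta" ] && [ -n "$index" ]; then
        echo "--fasta and --index are mutually exclusive" >&2; return 1
    fi
    if [ -z "$fasta" ] && [ -z "$index" ]; then
        echo "one of --fasta or --index is required" >&2; return 1
    fi
    if [ -n "$dist" ] && [ "$dedup" -eq 0 ]; then
        echo "--dist requires --dedup" >&2; return 1
    fi
    return 0
}
```

For example, `check_options --fasta /path/to/fasta/genome.fa --dedup --dist 100` succeeds, while passing both --fasta and --index, or --dist without --dedup, returns a non-zero status with a message on standard error.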

Examples

Call TSSs from the test yeast single-end CAGE reads using a locally stored reference FASTA file and the BSgenome.Scerevisiae.UCSC.sacCer1 R package. The package is automatically installed within the CAGEr container on the fly and used there with CAGEr:

```bash
nextflow run customcageq/main.nf \
    --bsgenome BSgenome.Scerevisiae.UCSC.sacCer1 \
    --fasta /path/to/fasta/sacCer1.fasta \
    --input customcageq/assets/samplesheet_se.csv \
    -profile docker
```

Call TSSs from the test yeast paired-end CAGE reads using a locally stored Bowtie2 index and the locally stored BSgenome.Scerevisiae.UCSC.sacCer1 R package. The package is automatically installed from the .tar.gz archive within the CAGEr container and used with CAGEr:

```bash
nextflow run customcageq/main.nf \
    --bsgenome /path/to/bsgenome/BSgenome.Scerevisiae.UCSC.sacCer1_1.4.0.tar.gz \
    --index /path/to/index/bowtie2 \
    --input customcageq/assets/samplesheet_pe.csv \
    -profile docker
```

Same as example 1, but remove PCR duplicates before mapping QC and TSS calling:

```bash
nextflow run customcageq/main.nf \
    --bsgenome BSgenome.Scerevisiae.UCSC.sacCer1 \
    --fasta /path/to/fasta/sacCer1.fasta \
    --dedup \
    --input customcageq/assets/samplesheet_se.csv \
    -profile docker
```

Same as above, but remove both PCR and optical duplicates (at a maximum distance 100, see samtools markdup) before mapping QC and TSS calling:

```bash
nextflow run customcageq/main.nf \
    --bsgenome BSgenome.Scerevisiae.UCSC.sacCer1 \
    --fasta /path/to/fasta/sacCer1.fasta \
    --dedup \
    --dist 100 \
    --input customcageq/assets/samplesheet_se.csv \
    -profile docker
```

To-do for version 2

Implement Damir's strategy using cutadapt to remove the first G and STAR for splice-aware read mapping.
Implement the CAGEr pipeline as a module.
Add plotting motifs around TSSs on both strands to check if a pyrimidine-purine (initiator-like) motif is present.
Check if the nf-validation Nextflow plugin or any other nf-core tools could help the user to create the input CSV.
Make it possible to run the pipeline by providing the GitHub repository name (and, possibly, a version name / commit hash), instead of making the user clone the repository first.
Rename input_reads.sh to make_input_csv.sh for clarity.
Make a "metromap" schematic of the pipeline. See, for example, the metromap for nf-core/cutandrun.
Cite in CITATIONS.md all the tools that we used.

Credits

ComputationalRegulatoryGenomicsICL/customcageq was originally written by Pavel Nikitin (@nikitin-p), Sviatoslav Sidorov (@sidorov-si) and Damir Baranasic (@da-bar).

Citations

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

Login: nikitin-p
Kind: user
Location: Moscow

Repositories: 2
Profile: https://github.com/nikitin-p

Student, Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University

Citation (CITATIONS.md)

# ComputationalRegulatoryGenomicsICL/customcageq: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

Dependencies

modules/nf-core/bowtie2/align/meta.yml cpan

modules/nf-core/bowtie2/build/meta.yml cpan

modules/nf-core/cat/fastq/meta.yml cpan

modules/nf-core/custom/dumpsoftwareversions/meta.yml cpan

modules/nf-core/fastqc/meta.yml cpan

modules/nf-core/multiqc/meta.yml cpan

modules/nf-core/samtools/index/meta.yml cpan

modules/nf-core/samtools/sort/meta.yml cpan

modules/nf-core/trimgalore/meta.yml cpan

dockerfiles/cager/Dockerfile docker

r-base latest build

modules/nf-core/bowtie2/align/environment.yml pypi

modules/nf-core/bowtie2/build/environment.yml pypi

modules/nf-core/cat/fastq/environment.yml pypi

modules/nf-core/samtools/index/environment.yml pypi

modules/nf-core/samtools/sort/environment.yml pypi

pyproject.toml pypi
