nextflow-cager
A Nextflow pipeline for CAGE data processing. It produces an R CAGEr object for further analysis.
Repository
Basic Info
- Host: GitHub
- Owner: nikitin-p
- License: mit
- Language: Nextflow
- Default Branch: main
- Size: 5.85 MB
Introduction
ComputationalRegulatoryGenomicsICL/customcageq is a Nextflow pipeline to process CAGE sequencing data from raw reads to the creation of a CAGEexp (CAGEr) object containing called TSSs. The pipeline is specifically designed to be used upstream of CAGEr.
Input
Either single-end or paired-end raw CAGE reads. Only one type of reads (either single- or paired-end) can be used in one run of the pipeline.
Output
A CAGEexp (CAGEr) object with called TSSs, ready for downstream analysis with CAGEr. Intermediate and final results are stored in the results directory. The final CAGEexp object is stored as an RDS file in the results/cager directory.
Steps
- Merge per-lane FASTQ files with the `nf-core/cat_fastq` module.
- Report raw read quality with `FastQC`.
- Trim adapters with `TrimGalore` and run `FastQC` on trimmed reads.
- Build the Bowtie2 index of the reference genome FASTA file with `bowtie2-build`, if the index is not provided.
- Map the trimmed reads onto the Bowtie2 index using `bowtie2`, then filter out unmapped reads and select only uniquely mapped reads using `samtools view` with options `-b -F 4 -q 20`.
- Optionally, remove PCR and optical duplicate reads with `samtools markdup`.
- Sort the obtained BAM files with uniquely mapped reads using `samtools sort`.
- Index the sorted BAM files with `samtools index`.
- Assess mapping quality using `samtools stats`, `samtools flagstat` and `samtools idxstats`.
- Create a CAGEexp object and call TSSs with `CAGEr` using a BSgenome package for the respective genome.
- Create a MultiQC report.
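To make the read-filtering step more concrete, here is a minimal Python sketch of the rule that `samtools view -F 4 -q 20` applies to each alignment: `-F 4` drops records whose FLAG has the unmapped bit (0x4) set, and `-q 20` drops records with MAPQ below 20. The function name `passes_filter` and the example records are hypothetical and are not part of the pipeline.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "read is unmapped"
MIN_MAPQ = 20        # threshold given to `samtools view -q`


def passes_filter(flag: int, mapq: int) -> bool:
    """Return True if a record would survive `samtools view -F 4 -q 20`."""
    is_mapped = (flag & UNMAPPED_FLAG) == 0
    return is_mapped and mapq >= MIN_MAPQ


# Hypothetical (FLAG, MAPQ) pairs, for illustration only.
records = [(0, 42), (4, 0), (16, 19), (16, 20)]
kept = [record for record in records if passes_filter(*record)]
print(kept)
```

In this sketch only the mapped records with MAPQ of at least 20 remain; the MAPQ threshold is what the pipeline uses as its criterion for uniquely mapped reads.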
Usage
Prepare for your first run
Currently, the pipeline works with Nextflow v23.04. Make sure that you have the latest version of Docker (if running the pipeline on a laptop / PC) or Singularity (if running on a high-performance cluster).
Prepare your input data
Prepare the sample sheet with the description of input samples. In case of single-end reads, it should look like this:
```csv
sample,fastq_1,fastq_2,single_end
S1,/path/to/fastq/S1_S1_L001_R1_001.fastq.gz,,True
S1,/path/to/fastq/S1_S1_L002_R1_001.fastq.gz,,True
S2,/path/to/fastq/S2_S2_L001_R1_001.fastq.gz,,True
S2,/path/to/fastq/S2_S2_L002_R1_001.fastq.gz,,True
```
where
* sample is a unique identifier of a sample;
* fastq_1 is the full path to a read library. For paired-end reads, fastq_1 contains the full path to the forward reads and fastq_2 contains the full path to the reverse reads. One sample can be represented by more than one row if its lanes are stored separately;
* single_end should be set to True for single-end reads and to False for paired-end reads.
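The rules above can be checked before launching a run. The following is a minimal Python sketch, not part of the pipeline: the function `validate_samplesheet` is hypothetical, and it assumes that `single_end` is spelled exactly `True` or `False` and that `fastq_2` is left empty for single-end rows, as in the example table.

```python
import csv
import io

EXPECTED_HEADER = ["sample", "fastq_1", "fastq_2", "single_end"]


def validate_samplesheet(text):
    """Parse samplesheet text and check the rules described above."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = list(reader)
    if not rows:
        raise ValueError("samplesheet contains no samples")
    for row in rows:
        if row["single_end"] not in ("True", "False"):
            raise ValueError(f"single_end must be True or False: {row}")
        single = row["single_end"] == "True"
        if not row["fastq_1"]:
            raise ValueError(f"fastq_1 is required: {row}")
        if single and row["fastq_2"]:
            raise ValueError(f"fastq_2 must be empty for single-end reads: {row}")
        if not single and not row["fastq_2"]:
            raise ValueError(f"fastq_2 is required for paired-end reads: {row}")
    # Only one read type (single- or paired-end) is allowed per run.
    if len({row["single_end"] for row in rows}) > 1:
        raise ValueError("single-end and paired-end rows are mixed")
    return rows


example = (
    "sample,fastq_1,fastq_2,single_end\n"
    "S1,/path/to/fastq/S1_S1_L001_R1_001.fastq.gz,,True\n"
    "S1,/path/to/fastq/S1_S1_L002_R1_001.fastq.gz,,True\n"
)
print(len(validate_samplesheet(example)), "rows are valid")
```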
You can generate the input CSV table automatically using the input_reads.sh script. It takes two positional arguments:
```bash
input_reads.sh /path/to/fastq_dir /path/to/samplesheet.csv
```
where
* /path/to/fastq_dir is a full path to a directory with raw FASTQ files;
* /path/to/samplesheet.csv is a file name, with a full path, of a CSV file to create.
To run the script as a standalone executable (that is, without the need to write bash before its name), add execution permissions to the script after cloning the repository:
```bash
chmod +x input_reads.sh
```
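For readers who want to see what building such a table involves, here is a rough Python sketch. It is not a port of input_reads.sh, whose internals are not described here. It assumes Illumina-style file names of the form `<sample>_S<n>_L<lane>_R<1|2>_001.fastq.gz`, as in the example table, and the function names `build_rows` and `write_samplesheet` are hypothetical.

```python
import csv
import re
from pathlib import Path

# Assumed naming scheme: <sample>_S<n>_L<lane>_R<1|2>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S\d+_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq\.gz$"
)


def build_rows(fastq_dir):
    """Return one samplesheet row per sample and lane found in fastq_dir."""
    lanes = {}  # (sample, lane) -> {"1": forward path, "2": reverse path}
    for path in sorted(Path(fastq_dir).glob("*.fastq.gz")):
        match = FASTQ_NAME.match(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed scheme
        reads = lanes.setdefault((match["sample"], match["lane"]), {})
        reads[match["read"]] = str(path.resolve())
    rows = []
    for sample, lane in sorted(lanes):
        reads = lanes[(sample, lane)]
        if "1" not in reads:
            continue  # a lane without forward reads cannot form a row
        single_end = "2" not in reads
        rows.append([sample, reads["1"], reads.get("2", ""), str(single_end)])
    return rows


def write_samplesheet(fastq_dir, out_csv):
    """Write the samplesheet header and rows to out_csv."""
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "fastq_1", "fastq_2", "single_end"])
        writer.writerows(build_rows(fastq_dir))
```

A call such as `write_samplesheet("/path/to/fastq_dir", "/path/to/samplesheet.csv")` would then mirror the two positional arguments of the script.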
Toy input data for testing
The pipeline includes toy S. cerevisiae CAGE data stored in assets/sacer_fq for testing purposes (single-end reads in the se subfolder and paired-end reads in the pe subfolder). The single-end reads were obtained by subsampling the ERR2495152 dataset published by Börlin et al. (2018), while the paired-end reads were obtained by subsampling the SRR1631657 dataset published by Chabbert et al. (2015).
The corresponding input spreadsheets can be found in assets: samplesheet_se.csv for single-end reads and samplesheet_pe.csv for paired-end reads. However, you will need to use the input_reads.sh script to regenerate these spreadsheets with your paths to the test FASTQ files.
On these test data, CAGEr is able to call 52 TSSs with the single-end reads and 1,245 TSSs with the paired-end reads.
How to run the pipeline
Clone the repository to your machine and use the following syntax to run the pipeline:
```bash
nextflow run customcageq/main.nf \
--bsgenome [/path/to/]bsgenome.package[.tar.gz] \
(--fasta /path/to/fasta/genome.fa | --index /path/to/index/bowtie2) \
[--dedup [--dist N]] \
--input samplesheet.csv \
-profile <institution/docker/singularity>
```
where
* --bsgenome specifies the BSgenome R package to use. If it is a file name (which should have a full path and the .tar.gz extension), then the package will be taken from the specified location; otherwise, the pipeline will try to install a BSgenome R package with the name bsgenome.package on the fly (see examples below);
* --fasta specifies a full path to a FASTA file containing a reference genome. This option is mandatory, unless --index is set. Remark: This option is mutually exclusive with --index.
* --index specifies the directory (named bowtie2 in the syntax above) that contains a Bowtie2 reference genome index. This option is mandatory, unless --fasta is set. Remark: This option is mutually exclusive with --fasta.
* --dedup switches on PCR duplicate removal.
* --dist N sets an optical duplicate distance N to remove optical duplicates, in addition to PCR duplicates (see samtools markdup, option -d). Remark: The argument is optional and requires --dedup.
* --input specifies the input CSV samplesheet.
* -profile is a Nextflow option that specifies a config file to use with Nextflow on a given machine. See nf-core/configs for ready-to-use institutional configs, including the one for Jex (the high-performance computing cluster of the Laboratory of Medical Sciences). Also, see the Jex wiki on how to run Nextflow on Jex. Alternatively, this option can be used to specify the containerization technology to use.
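The constraints between these options can be summarised in a short Python sketch that assembles the command line and rejects invalid combinations. The helper `build_command` is hypothetical. It only encodes the rules stated above (exactly one of `--fasta` and `--index`, and `--dist` only together with `--dedup`) and does not claim to reproduce how the pipeline itself validates its parameters.

```python
def build_command(bsgenome, samplesheet, profile,
                  fasta=None, index=None, dedup=False, dist=None):
    """Assemble a `nextflow run` argument list from the options above."""
    if (fasta is None) == (index is None):
        raise ValueError("set exactly one of --fasta and --index")
    if dist is not None and not dedup:
        raise ValueError("--dist requires --dedup")
    cmd = ["nextflow", "run", "customcageq/main.nf", "--bsgenome", bsgenome]
    cmd += ["--fasta", fasta] if fasta is not None else ["--index", index]
    if dedup:
        cmd.append("--dedup")
    if dist is not None:
        cmd += ["--dist", str(dist)]
    cmd += ["--input", samplesheet, "-profile", profile]
    return cmd


command = build_command(
    "BSgenome.Scerevisiae.UCSC.sacCer1",
    "samplesheet.csv",
    "docker",
    fasta="/path/to/fasta/sacCer1.fasta",
    dedup=True,
    dist=100,
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` if launching from Python is preferred over typing the command in a shell.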
Examples
- Call TSSs from the test yeast single-end CAGE reads using a locally stored reference FASTA file and the `BSgenome.Scerevisiae.UCSC.sacCer1` R package. The package is automatically installed within the CAGEr container on the fly and used there with CAGEr:

```bash
nextflow run customcageq/main.nf \
--bsgenome BSgenome.Scerevisiae.UCSC.sacCer1 \
--fasta /path/to/fasta/sacCer1.fasta \
--input customcageq/assets/samplesheet_se.csv \
-profile docker
```
- Call TSSs from the test yeast paired-end CAGE reads using a locally stored Bowtie2 index and the locally stored `BSgenome.Scerevisiae.UCSC.sacCer1` R package. The package is automatically installed from the `.tar.gz` archive within the CAGEr container and used with CAGEr:

```bash
nextflow run customcageq/main.nf \
--bsgenome /path/to/bsgenome/BSgenome.Scerevisiae.UCSC.sacCer1_1.4.0.tar.gz \
--index /path/to/index/bowtie2 \
--input customcageq/assets/samplesheet_pe.csv \
-profile docker
```
- Same as example 1, but remove PCR duplicates before mapping QC and TSS calling:

```bash
nextflow run customcageq/main.nf \
--bsgenome BSgenome.Scerevisiae.UCSC.sacCer1 \
--fasta /path/to/fasta/sacCer1.fasta \
--dedup \
--input customcageq/assets/samplesheet_se.csv \
-profile docker
```
- Same as above, but remove both PCR and optical duplicates (at a maximum distance of 100, see `samtools markdup`) before mapping QC and TSS calling:

```bash
nextflow run customcageq/main.nf \
--bsgenome BSgenome.Scerevisiae.UCSC.sacCer1 \
--fasta /path/to/fasta/sacCer1.fasta \
--dedup \
--dist 100 \
--input customcageq/assets/samplesheet_se.csv \
-profile docker
```
To-do for version 2
- Implement Damir's strategy using cutadapt to remove the first `G` and STAR for splice-aware read mapping.
- Implement the CAGEr pipeline as a module.
- Add plotting motifs around TSSs on both strands to check if a pyrimidine-purine (initiator-like) motif is present.
- Check if the `nf-validation` Nextflow plugin or any other nf-core tools could help the user to create the input CSV.
- Make it possible to run the pipeline by providing the GitHub repository name (and, possibly, a version name / commit hash), instead of making the user clone the repository first.
- Rename `input_reads.sh` into `make_input_csv.sh` for clarity.
- Make a "metromap" schematic of the pipeline. See, for example, the metromap for nf-core/cutandrun.
- Cite in `CITATIONS.md` all the tools that we used.
Credits
ComputationalRegulatoryGenomicsICL/customcageq was originally written by Pavel Nikitin (@nikitin-p), Sviatoslav Sidorov (@sidorov-si) and Damir Baranasic (@da-bar).
Citations
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
Owner
- Login: nikitin-p
- Kind: user
- Location: Moscow
- Repositories: 2
- Profile: https://github.com/nikitin-p
Student, Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University
Citation (CITATIONS.md)
# ComputationalRegulatoryGenomicsICL/customcage: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
Dependencies
- r-base latest build