nf-rnachrom

https://github.com/ilnitsky/nf-rnachrom

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: ilnitsky
License: MIT
Language: Nextflow
Default Branch: latest
Size: 15.5 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed 7 months ago

Metadata Files

Readme Changelog Contributing License Code of conduct Citation

Introduction

Full documentation is available on ReadTheDocs.

nf-core/rnachrom is a comprehensive and flexible bioinformatics pipeline designed to process RNA-DNA interactome sequencing data. It efficiently handles large-scale data from various experimental methods including all-to-all approaches (GRID-seq, RADICL-seq, iMARGI, Red-C) and one-to-all approaches (ChART, RAP, CHIRP). The pipeline provides a streamlined workflow for analyzing RNA-DNA interactions, from raw sequencing data to annotated contacts and statistical analyses.

The pipeline supports:
- Both all-to-all (ATA) and one-to-all (OTA) RNA-chromatin interaction protocols
- Bridge sequence processing for ATA methods like Red-C, ChAR-seq, GRID-seq and RADICL-seq
- Multiple alignment strategies for both RNA and DNA components
- Comprehensive annotation of interaction sites with genomic features
- Statistical analysis and normalization of interaction data
- Integration with matching RNA-seq data for enhanced analysis (termed "chromatin potential")

Usage

Note: If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

Nextflow Windows installation: this page

Running the pipeline

First, prepare a samplesheet with your input data according to your experiment type. The pipeline supports different types of RNA-chromatin interaction experiments:

All-to-all (ATA) methods (e.g., MARGI, RADICL-seq)

A simple ATA samplesheet should include the RNA and DNA parts of the experiment:

    sample,rna,dna
    HFFc6_imargi_1,SRR8206679_1.fastq.gz,SRR8206679_2.fastq.gz
    HFFc6_imargi_2,SRR8206680_1.fastq.gz,SRR8206680_2.fastq.gz

ATA with RNA-seq control

For ATA experiments with matching RNA-seq data, use the following format:

    sample,rna,dna,description
    HFFc6_imargi_1,SRR8206679_1.fastq.gz,SRR8206679_2.fastq.gz,tissue:"HFFc6";rnaseq:"rnaseq_HFFc6"
    HFFc6_imargi_2,SRR8206680_1.fastq.gz,SRR8206680_2.fastq.gz,tissue:"HFFc6";rnaseq:"rnaseq_HFFc6"
    rnaseq_HFFc6,SRR8206681_1.fastq.gz,SRR8206681_2.fastq.gz,tissue:"HFFc6"

Note that RNA-seq sample names must always start with the rnaseq_ prefix.
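A quick consistency check before launching can catch a mistyped RNA-seq reference. The sketch below is not part of the pipeline: `check_rnaseq_refs` is a hypothetical helper, and it assumes the comma-separated layout shown above, with the sample name in the first column and no commas inside the description field.

```shell
# Hypothetical helper (not shipped with nf-rnachrom): verify that every
# rnaseq:"NAME" tag in a samplesheet points at an existing sample row
# whose name starts with rnaseq_.
check_rnaseq_refs() {
    awk -F',' '
        NR == 1 { next }                     # skip the header row
        { samples[$1] = 1 }                  # remember every sample name
        match($0, /rnaseq:"[^"]+"/) {        # locate rnaseq:"NAME" in the row
            # rnaseq:" is 8 characters long; drop it and the closing quote
            refs[substr($0, RSTART + 8, RLENGTH - 9)] = 1
        }
        END {
            bad = 0
            for (r in refs) {
                if (!(r in samples) || r !~ /^rnaseq_/) {
                    print "missing or misnamed RNA-seq sample: " r
                    bad = 1
                }
            }
            exit bad                         # non-zero status on any problem
        }
    ' "$1"
}
```

Calling `check_rnaseq_refs samplesheet.csv` returns a zero exit status when all references resolve, and prints each offending name otherwise.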

One-to-all (OTA) methods (e.g., ChART, RAP, CHIRP)

For one-to-all methods, use the format:

    sample,fastq_1,fastq_2,description
    SCIRT_chart_39,SRR10044362.fastq.gz,,tissue:"K562"

If you have single-end data, leave fastq_2 empty. For paired-end data, include both files.

OTA with input controls

To specify input controls for OTA methods:

    sample,fastq_1,fastq_2,control,description
    SCIRT_chart_39,SRR10044362.fastq.gz,,SCIRT_INPUT,tissue:"K562"
    SCIRT_INPUT,SRR10044359.fastq.gz,,

Use the control column to specify which sample serves as the input control.
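Because the control column refers to another row by name, a typo there is easy to miss. As a rough sketch, the hypothetical `check_controls` function below confirms that each non-empty control value matches a sample name. It assumes the column order from the example above (sample first, control fourth) and is not part of the pipeline.

```shell
# Hypothetical helper (not shipped with nf-rnachrom): confirm that every
# value in the control column (assumed to be column 4) names a sample row.
check_controls() {
    awk -F',' '
        NR == 1 { next }                     # skip the header row
        {
            samples[$1] = 1                  # column 1: sample name
            if ($4 != "") controls[$4] = 1   # column 4: control, may be empty
        }
        END {
            bad = 0
            for (c in controls) {
                if (!(c in samples)) {
                    print "control not found among samples: " c
                    bad = 1
                }
            }
            exit bad                         # non-zero status on any problem
        }
    ' "$1"
}
```

Rows with fewer fields than the header, such as the input control row above, are treated as having an empty control value.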

Running the pipeline

Check that you have Apptainer, Conda, or Docker installed:
- Apptainer (formerly Singularity): Installation Guide
- Conda: Installation Guide
- Docker: Installation Guide

Clone the repository containing the Nextflow project:

    git clone https://github.com/ilnitsky/nf-rnachrom.git

Now, you can run the pipeline using:

    nextflow run ./nf-rnachrom \
        -profile <docker/apptainer/conda/...> \
        --input samplesheet.csv \
        --exp_type <redc/imargi/radicl/chart/rap/chirp/...> \
        --outdir <OUTDIR>

Warning: Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see the docs.
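As an illustration of the -params-file route, the sketch below writes the three parameters from the run command into a YAML file. The function name `write_params_file`, the file name, and the values (`redc`, `results`) are placeholders chosen for this example, not pipeline defaults.

```shell
# Hypothetical helper: write the parameters used in the run command above
# into a YAML params file. All values here are placeholders.
write_params_file() {
    cat > "$1" <<'EOF'
input: samplesheet.csv
exp_type: redc
outdir: results
EOF
}

# Example use (commented out so the snippet has no side effects):
#   write_params_file params.yaml
#   nextflow run ./nf-rnachrom -profile docker -params-file params.yaml
```

Keeping parameters in a file like this makes a run easier to repeat than a long command line, while -c remains reserved for non-parameter configuration.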

Test Run with External Data

To test the pipeline using pre-configured test data, download the test dataset:

    wget http://bioinf.fbb.msu.ru/ken/nextflow/test_data_nf-rnachrom.tar.gz && \
        tar -xzvf test_data_nf-rnachrom.tar.gz

And run:

    nextflow run ./nf-rnachrom \
        -profile test_full,apptainer \
        --outdir test_results

This test run downloads example GRID-seq data from bioinf.fbb.msu.ru/ken and processes it through all pipeline stages. It also builds a GRCh38 genome index.

The test_full profile automatically configures all necessary parameters including reference genomes, annotation files, and processing options optimized for this test dataset.

Note: The full test requires approximately 8 GB of RAM and 4 CPU cores. At least 10 GB of free disk space is required: the download is around 5000 MB and the Apptainer image is around 4000 MB. The test should complete in about 30-45 minutes on a standard workstation.
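The disk requirement can be checked up front. The following is a minimal sketch, assuming a POSIX `df` that supports the -P and -k options; `has_free_gb` is a hypothetical helper, not something the pipeline provides.

```shell
# Hypothetical helper: succeed if the file system holding directory $1
# has at least $2 GB available.
has_free_gb() {
    # df -Pk prints one header row, then a row whose 4th column is the
    # available space in 1024-byte blocks.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}
```

For example, `has_free_gb . 10 || echo "less than 10 GB free" >&2` warns before the test dataset is downloaded into the current directory.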

Warning: Nextflow automatically manages the file system mounts whenever a container is launched depending on the process input files. However, when a process input is a symbolic link, the linked file must be stored in the same folder where the symlink is located, or a sub-folder of it. Otherwise the process execution will fail because the launched container won’t be able to access the linked file.
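One way to spot inputs affected by this restriction is to list symlinks whose resolved target lies outside the folder that holds the link. The sketch below assumes GNU `readlink -f` is available (it is not present on every system), and `find_escaping_symlinks` is a hypothetical name, not a pipeline tool.

```shell
# Hypothetical helper: print every symlink under directory $1 whose
# resolved target is not inside the folder containing the symlink.
find_escaping_symlinks() {
    root=$(cd "$1" && pwd -P)
    find "$root" -type l | while IFS= read -r link; do
        target=$(readlink -f "$link")                  # fully resolved target
        linkdir=$(cd "$(dirname "$link")" && pwd -P)   # folder holding the link
        case "$target" in
            "$linkdir"/*) ;;                           # same folder or a sub-folder
            *) printf '%s -> %s\n' "$link" "$target" ;;
        esac
    done
}
```

An empty result for the input directory suggests that no input symlink points somewhere a container would fail to reach under the rule described above.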

Pipeline Workflow

The pipeline implements a multi-stage workflow that handles different types of RNA-chromatin interaction data:

Data Preprocessing
- Quality control with FastQC
- Adapter trimming with FastP
- Optional deduplication for PCR duplicates
Protocol-Specific Processing
- One-to-all methods (ChART, RAP, CHIRP): Direct trimming and alignment
- All-to-all methods:
  - Separate RNA/DNA reads: Sorting and processing RNA and DNA parts separately
  - Chimera reads with linker: Bridge sequence identification and processing
  - iMARGI: BWA alignment and specialized processing
Read Processing and Alignment
- Short read trimming and filtering
- PEAR merging for paired-end reads
- Bridge splitting with custom tools
- Alignment using appropriate tools (STAR, BWA, Bowtie2, HISAT2)
Contact Extraction and Processing
- BAM to contacts conversion
- Edit distance filtering and CIGAR filtering
- Integration of RNA-DNA parts into contact tables
- Contact normalization and deduplication
Annotation and Analysis
- RNA-DNA contact annotation with genomic features
- RNA part annotation with transcriptome data
- Strand detection for RNA components
- Optional chromatin potential calculation with RNA-seq data
Downstream Analysis
- Peak calling with MACS2 or BARDIC
- Statistical analysis and visualization
- Final contact table generation with header information
- Comprehensive reporting and data visualization

The workflow adaptively handles different experimental protocols and data formats, providing a complete solution from raw sequencing data to biologically interpretable results.

Nextflow Modules implemented in this pipeline

This pipeline includes modules from both local and nf-core sources:

Local modules (./modules/local): Custom implementations specific to RNA-chromatin interaction analysis
nf-core modules (./modules/nf-core): Standard bioinformatics modules from the nf-core library

Local Modules

Quality Control
- fastqc: Quality control of FASTQ files
- multiqc: Aggregates bioinformatics analyses into a single report
- fastp: All-in-one FASTQ preprocessor
Deduplication
- fastq-dupaway:
- fastuniq:
- clumpify: Removes duplicate reads from FASTQ files
- seqkit_rmdup:
Trimming
- trimgalore: Trim adapter sequences and low quality regions
- trimmomatic: Flexible read trimming tool
- bbduk: BBMap suite
Managing bridge sequence
- bitap:
- chartools:
- tagdust:
Preprocessing
- dedup: Removes duplicates from sequencing data
- debridge: Processes bridge sequences in MARGI/RADICL-seq data
- rsites: Analyzes restriction enzyme sites
- bam_to_contacts: Converts aligned BAM files to RNA-DNA contact files
- filter_contacts: Filters RNA-DNA contact files based on various criteria
Alignment
- star: RNA-seq read alignment
- hisat2: Hierarchical indexing for spliced alignment of transcripts
- bwa: Burrows-Wheeler Aligner for DNA sequences
- bowtie2: Ultrafast short read aligner
Analysis
- annotation: Annotates RNA-DNA contacts with genomic features
- normalisation: Performs normalization of contact data
- chromatin_potential: Computes chromatin interaction potential
- bardic: Implements BARDIC algorithm for interaction analysis
- xrna_assembly: Assembles novel RNA transcripts from RNA parts of contacts
- detect_strand: Determines strand orientation for RNA-DNA contacts
Peak Calling
- macs2: Model-based Analysis of ChIP-Seq data
- bardic: Implements BARDIC algorithm for peak calling

For more details and further functionality, please refer to the usage documentation and the parameter documentation.

Pipeline output

To see the results of an example test run with a full size dataset refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.

The pipeline produces the following outputs:

Analysis Results

Raw RNA-DNA contact tables in TSV format
Annotated and normalized interaction data (contacts) with genomic features
Peak calls
Chromatin potential calculations (if RNA-seq provided)
MultiQC reports aggregating pipeline statistics
Statistical summaries and visualization data
Log files for troubleshooting

Credits

nf-core/rnachrom was originally written by Ivan Ilnitskiy.

We thank the following people for their extensive assistance in the development of this pipeline: Ivan Markov (Moscow State University), Arina Nikolskaya (Moscow State University), Anastasia Zharikova

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #rnachrom channel (you can join with this invite).

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

Name: Ivan Ilnitskiy
Login: ilnitsky
Kind: user
Location: Nowhere

Repositories: 1
Profile: https://github.com/ilnitsky

Citation (CITATIONS.md)

# nf-core/rnachrom: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

Science Score: 57.0%