nf-rnachrom
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 7 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: ilnitsky
- License: mit
- Language: Nextflow
- Default Branch: latest
- Size: 15.5 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Introduction
Full documentation is available on ReadTheDocs.
nf-core/rnachrom is a flexible bioinformatics pipeline for processing RNA-DNA interactome sequencing data. It handles large-scale data from all-to-all approaches (GRID-seq, RADICL-seq, iMARGI, Red-C) and one-to-all approaches (ChART, RAP, ChIRP). The pipeline provides a streamlined workflow for analyzing RNA-DNA interactions, from raw sequencing data to annotated contacts and statistical analyses.
The pipeline supports:
- Both all-to-all (ATA) and one-to-all (OTA) RNA-chromatin interaction protocols
- Bridge sequence processing for ATA methods like Red-C, ChAR-seq, GRID-seq and RADICL-seq
- Multiple alignment strategies for both RNA and DNA components
- Comprehensive annotation of interaction sites with genomic features
- Statistical analysis and normalization of interaction data
- Integration with matching RNA-seq data for enhanced analysis (termed "chromatin potential")
Usage
Note: If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
Nextflow Windows installation: this page
Running the pipeline
First, prepare a samplesheet with your input data according to your experiment type. The pipeline supports different types of RNA-chromatin interaction experiments:
All-to-all (ATA) methods (e.g., MARGI, RADICL-seq)
A simple ATA samplesheet should include the RNA and DNA parts of the experiment:
csv
sample,rna,dna
HFFc6_imargi_1,SRR8206679_1.fastq.gz,SRR8206679_2.fastq.gz
HFFc6_imargi_2,SRR8206680_1.fastq.gz,SRR8206680_2.fastq.gz
ATA with RNA-seq control
For ATA experiments with matching RNA-seq data, use the following format:
csv
sample,rna,dna,description
HFFc6_imargi_1,SRR8206679_1.fastq.gz,SRR8206679_2.fastq.gz,tissue:"HFFc6";rnaseq:"rnaseq_HFFc6"
HFFc6_imargi_2,SRR8206680_1.fastq.gz,SRR8206680_2.fastq.gz,tissue:"HFFc6";rnaseq:"rnaseq_HFFc6"
rnaseq_HFFc6,SRR8206681_1.fastq.gz,SRR8206681_2.fastq.gz,tissue:"HFFc6"
Note that RNA-seq samples should always start with rnaseq_ prefix.
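The link between ATA samples and their RNA-seq control can be checked before launching a run. The following is a minimal sketch, not part of the pipeline: the function names `parse_description` and `check_rnaseq_links` are hypothetical, and the `key:"value";key:"value"` reading of the description column is inferred only from the example rows above.

```python
import csv
import io
import re

# Example samplesheet copied from the section above.
SAMPLESHEET = '''sample,rna,dna,description
HFFc6_imargi_1,SRR8206679_1.fastq.gz,SRR8206679_2.fastq.gz,tissue:"HFFc6";rnaseq:"rnaseq_HFFc6"
HFFc6_imargi_2,SRR8206680_1.fastq.gz,SRR8206680_2.fastq.gz,tissue:"HFFc6";rnaseq:"rnaseq_HFFc6"
rnaseq_HFFc6,SRR8206681_1.fastq.gz,SRR8206681_2.fastq.gz,tissue:"HFFc6"
'''

# Assumed description syntax: key:"value" pairs separated by semicolons.
KEY_VALUE = re.compile(r'(\w+):"([^"]*)"')


def parse_description(text):
    """Turn 'tissue:"HFFc6";rnaseq:"rnaseq_HFFc6"' into a dict."""
    return dict(KEY_VALUE.findall(text or ""))


def check_rnaseq_links(samplesheet_text):
    """Return a list of human-readable problems with rnaseq references."""
    # QUOTE_NONE keeps the double quotes inside the description field literal.
    reader = csv.DictReader(io.StringIO(samplesheet_text), quoting=csv.QUOTE_NONE)
    rows = list(reader)
    names = {row["sample"] for row in rows}
    problems = []
    for row in rows:
        link = parse_description(row.get("description")).get("rnaseq")
        if link is None:
            continue  # this sample has no RNA-seq control
        if not link.startswith("rnaseq_"):
            problems.append(f"{row['sample']}: '{link}' lacks the rnaseq_ prefix")
        elif link not in names:
            problems.append(f"{row['sample']}: '{link}' is not a sample in the sheet")
    return problems


if __name__ == "__main__":
    for problem in check_rnaseq_links(SAMPLESHEET) or ["no problems found"]:
        print(problem)
```

Running the check on the example sheet reports no problems; renaming the RNA-seq row without updating the references would be reported once per referencing sample.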
One-to-all (OTA) methods (e.g., ChART, RAP, ChIRP)
For one-to-all methods, use the format:
csv
sample,fastq_1,fastq_2,description
SCIRT_chart_39,SRR10044362.fastq.gz,,tissue:"K562"
If you have single-end data, leave fastq_2 empty. For paired-end data, include both files.
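As an illustration of the single-end versus paired-end convention, the sketch below classifies each row by whether `fastq_2` is filled in. The helper name `read_layout` is hypothetical and is not part of the pipeline.

```python
import csv
import io

# Example OTA samplesheet copied from the section above.
OTA_SHEET = '''sample,fastq_1,fastq_2,description
SCIRT_chart_39,SRR10044362.fastq.gz,,tissue:"K562"
'''


def read_layout(samplesheet_text):
    """Map each sample name to 'single-end' or 'paired-end'."""
    # QUOTE_NONE keeps the quotes in the description column literal.
    reader = csv.DictReader(io.StringIO(samplesheet_text), quoting=csv.QUOTE_NONE)
    layout = {}
    for row in reader:
        if not row["fastq_1"]:
            raise ValueError(f"{row['sample']}: fastq_1 must not be empty")
        # An empty fastq_2 column marks single-end data.
        layout[row["sample"]] = "paired-end" if row["fastq_2"] else "single-end"
    return layout


if __name__ == "__main__":
    print(read_layout(OTA_SHEET))
```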
OTA with input controls
To specify input controls for OTA methods:
csv
sample,fastq_1,fastq_2,control,description
SCIRT_chart_39,SRR10044362.fastq.gz,,SCIRT_INPUT,tissue:"K562"
SCIRT_INPUT,SRR10044359.fastq.gz,,,
Use the control column to specify which sample serves as the input control.
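A quick consistency check on the control column can catch typos before a run. This is a minimal sketch under the assumption that every non-empty `control` value must name another row in the same sheet; `check_controls` is a hypothetical helper, not a pipeline function.

```python
import csv
import io

# Example OTA samplesheet with an input control, as in the section above.
OTA_WITH_CONTROLS = '''sample,fastq_1,fastq_2,control,description
SCIRT_chart_39,SRR10044362.fastq.gz,,SCIRT_INPUT,tissue:"K562"
SCIRT_INPUT,SRR10044359.fastq.gz,,,
'''


def check_controls(samplesheet_text):
    """Return problems found in the control column (empty list if none)."""
    reader = csv.DictReader(io.StringIO(samplesheet_text), quoting=csv.QUOTE_NONE)
    rows = list(reader)
    names = {row["sample"] for row in rows}
    problems = []
    for row in rows:
        # Short rows yield None for missing columns, so normalise to "".
        control = (row.get("control") or "").strip()
        if not control:
            continue
        if control == row["sample"]:
            problems.append(f"{row['sample']}: a sample cannot be its own control")
        elif control not in names:
            problems.append(f"{row['sample']}: control '{control}' is not in the sheet")
    return problems


if __name__ == "__main__":
    print(check_controls(OTA_WITH_CONTROLS) or "controls look consistent")
```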
Running the pipeline
Check that you have Apptainer, Conda, or Docker installed:
- Apptainer (formerly Singularity): Installation Guide
- Conda: Installation Guide
- Docker: Installation Guide
Clone the repository containing the Nextflow project:
bash
git clone https://github.com/ilnitsky/nf-rnachrom.git
Now, you can run the pipeline using:
bash
nextflow run ./nf-rnachrom \
-profile <docker/apptainer/conda/...> \
--input samplesheet.csv \
--exp_type <redc/imargi/radicl/chart/rap/chirp/...> \
--outdir <OUTDIR>
:::warning
Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those
provided by the -c Nextflow option can be used to provide any configuration except for parameters;
see docs.
:::
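One way to follow this advice is to keep the parameters in a JSON file and pass it with Nextflow's `-params-file` option. The sketch below writes such a file and assembles the matching command line. Only the parameter names `input`, `exp_type` and `outdir` come from the command shown earlier; the helper names, the file name `params.json` and the example values are placeholders.

```python
import json
import shlex
from pathlib import Path


def write_params_file(path, input_csv, exp_type, outdir):
    """Write pipeline parameters to a JSON file and return them as a dict."""
    params = {"input": input_csv, "exp_type": exp_type, "outdir": outdir}
    Path(path).write_text(json.dumps(params, indent=2) + "\n")
    return params


def build_command(params_file, profile):
    """Assemble the nextflow invocation as an argument list."""
    return [
        "nextflow", "run", "./nf-rnachrom",
        "-profile", profile,
        "-params-file", str(params_file),
    ]


if __name__ == "__main__":
    # Placeholder values; adjust to your own data and container engine.
    write_params_file("params.json", "samplesheet.csv", "imargi", "results")
    print(shlex.join(build_command("params.json", "docker")))
```

The printed command can be pasted into a shell, or the argument list can be handed to `subprocess.run` if the launch should be scripted.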
Test Run with External Data
To test the pipeline functionality using pre-configured test data, download the test dataset:
bash
wget http://bioinf.fbb.msu.ru/ken/nextflow/test_data_nf-rnachrom.tar.gz && \
tar -xzvf test_data_nf-rnachrom.tar.gz
And run:
bash
nextflow run ./nf-rnachrom \
-profile test_full,apptainer \
--outdir test_results
This test run will download example GRID-seq data from bioinf.fbb.msu.ru/ken and process it through all pipeline stages. It will also prepare a genome index for GRCh38.
The test_full profile automatically configures all necessary parameters including reference genomes, annotation files, and processing options optimized for this test dataset.
Note: The full test requires approximately 8 GB of RAM and 4 CPU cores. At least 10 GB of free disk space is required: the download is around 5000 MB and the Apptainer image is around 4000 MB. The test should complete in about 30-45 minutes on a standard workstation.
Warning: Nextflow automatically manages the file system mounts whenever a container is launched depending on the process input files. However, when a process input is a symbolic link, the linked file must be stored in the same folder where the symlink is located, or a sub-folder of it. Otherwise the process execution will fail because the launched container won’t be able to access the linked file.
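To spot symlinks that would violate this rule before launching a containerised run, a small check like the following can be used. It is only a sketch: `find_escaping_symlinks` is a hypothetical helper, and it applies the rule exactly as stated above (the target must resolve inside the symlink's own folder or one of its sub-folders).

```python
import os
from pathlib import Path


def find_escaping_symlinks(folder):
    """Return symlinks under `folder` whose targets lie outside their own folder."""
    root = Path(folder).resolve()
    escaping = []
    for path in root.rglob("*"):
        if not path.is_symlink():
            continue
        link_dir = path.parent.resolve()
        target = path.resolve()  # follows the link to the real file
        # The target is acceptable only if link_dir is a prefix of its path.
        if os.path.commonpath([link_dir, target]) != str(link_dir):
            escaping.append(path)
    return sorted(escaping)


if __name__ == "__main__":
    for link in find_escaping_symlinks("."):
        print(f"{link} -> {link.resolve()}")
```

Any path it prints is a candidate for copying the real file next to the link, or for re-creating the link inside the data folder.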
Pipeline Workflow
The pipeline implements a multi-stage workflow that handles different types of RNA-chromatin interaction data:
Data Preprocessing
- Quality control with FastQC
- Adapter trimming with fastp
- Optional deduplication for PCR duplicates
Protocol-Specific Processing
- One-to-all methods (ChART, RAP, ChIRP): Direct trimming and alignment
- All-to-all methods:
  - Separate RNA/DNA reads: Sorting and processing RNA and DNA parts separately
  - Chimera reads with linker: Bridge sequence identification and processing
  - iMARGI: BWA alignment and specialized processing
Read Processing and Alignment
- Short read trimming and filtering
- PEAR merging for paired-end reads
- Bridge splitting with custom tools
- Alignment using appropriate tools (STAR, BWA, Bowtie2, HISAT2)
Contact Extraction and Processing
- BAM to contacts conversion
- Edit distance filtering and CIGAR filtering
- Integration of RNA-DNA parts into contact tables
- Contact normalization and deduplication
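To make the idea of edit distance and CIGAR filtering concrete, here is a generic sketch that works on plain SAM text lines. It is not the pipeline's implementation: the function name `passes_filters`, the threshold of 2 and the set of allowed CIGAR operations are all assumptions chosen for illustration. It relies only on the SAM conventions that column 6 holds the CIGAR string and that the optional `NM:i:` tag stores the edit distance.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def passes_filters(sam_line, max_edit_distance=2, allowed_ops="M=X"):
    """Keep an alignment only if its CIGAR and NM tag meet the thresholds."""
    fields = sam_line.rstrip("\n").split("\t")
    cigar = fields[5]
    if cigar == "*":
        return False  # no alignment information available
    # Reject reads with clipping, indels or skipped regions (assumed policy).
    if any(op not in allowed_ops for _, op in CIGAR_OP.findall(cigar)):
        return False
    # Optional tags start at column 12; NM:i:<n> is the edit distance.
    nm = next((int(tag[5:]) for tag in fields[11:] if tag.startswith("NM:i:")), None)
    return nm is not None and nm <= max_edit_distance


# Tiny hand-made records (not real data): QNAME FLAG RNAME POS MAPQ CIGAR ...
EXAMPLES = {
    "clean": "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:1",
    "clipped": "r2\t0\tchr1\t200\t60\t2M2S\t*\t0\t0\tACGT\tIIII\tNM:i:0",
    "divergent": "r3\t0\tchr1\t300\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:3",
}

if __name__ == "__main__":
    for name, line in EXAMPLES.items():
        print(name, passes_filters(line))
```

With the default thresholds, only the record labelled `clean` is kept; the other two fail on soft clipping and on edit distance respectively.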
Annotation and Analysis
- RNA-DNA contact annotation with genomic features
- RNA part annotation with transcriptome data
- Strand detection for RNA components
- Optional chromatin potential calculation with RNA-seq data
Downstream Analysis
- Peak calling with MACS2 or BARDIC
- Statistical analysis and visualization
- Final contact table generation with header information
- Comprehensive reporting and data visualization
The workflow adaptively handles different experimental protocols and data formats, providing a complete solution from raw sequencing data to biologically interpretable results.
Nextflow Modules implemented in this pipeline
This pipeline includes modules from both local and nf-core sources:
- Local modules (./modules/local): Custom implementations specific to RNA-chromatin interaction analysis
- nf-core modules (./modules/nf-core): Standard bioinformatics modules from the nf-core library
Local Modules
Quality Control
- fastqc: Quality control of FASTQ files
- multiqc: Aggregates bioinformatics analyses into a single report
- fastp: All-in-one FASTQ preprocessor
Deduplication
- fastq-dupaway
- fastuniq
- clumpify: Removes duplicate reads from FASTQ files
- seqkit_rmdup
Trimming
- trimgalore: Trim adapter sequences and low quality regions
- trimmomatic: Flexible read trimming tool
- bbduk: BBMap suite
Managing bridge sequence
- bitap
- chartools
- tagdust
Preprocessing
- dedup: Removes duplicates from sequencing data
- debridge: Processes bridge sequences in MARGI/RADICL-seq data
- rsites: Analyzes restriction enzyme sites
- bam_to_contacts: Converts aligned BAM files to RNA-DNA contact files
- filter_contacts: Filters RNA-DNA contact files based on various criteria
Alignment
- star: RNA-seq read alignment
- hisat2: Hierarchical indexing for spliced alignment of transcripts
- bwa: Burrows-Wheeler Aligner for DNA sequences
- bowtie2: Ultrafast short read aligner
Analysis
- annotation: Annotates RNA-DNA contacts with genomic features
- normalisation: Performs normalization of contact data
- chromatin_potential: Computes chromatin interaction potential
- bardic: Implements BARDIC algorithm for interaction analysis
- xrna_assembly: Assembles novel RNA transcripts from RNA parts of contacts
- detect_strand: Determines strand orientation for RNA-DNA contacts
Peak Calling
- macs2: Model-based Analysis of ChIP-Seq data
- bardic: Implements BARDIC algorithm for peak calling
For more details and further functionality, please refer to the usage documentation and the parameter documentation.
Pipeline output
To see the results of an example test run with a full-size dataset, refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.
The pipeline produces the following outputs:
Analysis Results
- Raw RNA-DNA contact tables in TSV format
- Annotated and normalized interaction data (contacts) with genomic features
- Peak calls
- Chromatin potential calculations (if RNA-seq provided)
- MultiQC reports aggregating pipeline statistics
- Statistical summaries and visualization data
- Log files for troubleshooting
Credits
nf-core/rnachrom was originally written by Ivan Ilnitskiy.
We thank the following people for their extensive assistance in the development of this pipeline: Ivan Markov (Moscow State University), Arina Nikolskaya (Moscow State University), Anastasia Zharikova.
Contributions and Support
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don't hesitate to get in touch on the Slack #rnachrom channel (you can join with this invite).
Citations
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
Owner
- Name: Ivan Ilnitskiy
- Login: ilnitsky
- Kind: user
- Location: Nowhere
- Repositories: 1
- Profile: https://github.com/ilnitsky
Citation (CITATIONS.md)
# nf-core/rnachrom: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
GitHub Events
Total
- Push event: 24
- Create event: 1
Last Year
- Push event: 24
- Create event: 1
