nf-site-seq

Nextflow bioinformatics pipeline for processing data generated by the SITE-Seq® assay

https://github.com/caribou-biosciences/nf-site-seq


Repository

Basic Info

Host: GitHub
Owner: Caribou-Biosciences
License: agpl-3.0
Language: Python
Default Branch: main
Size: 6.28 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 2

Created over 1 year ago · Last pushed 11 months ago

Metadata Files

Readme Changelog Contributing License Citation

SITE-Seq® Assay Analysis Pipeline

Introduction

The SITE-Seq® assay is a multistep biochemical assay that identifies on- and off-target CRISPR-Cas cleavage sites. The method involves tagging, enriching, sequencing, and mapping genomic DNA to a reference genome to identify cleavage sites that could lead to off-target editing in cells. The SITE-Seq assay analysis pipeline is a Nextflow pipeline that processes and analyzes data generated by the SITE-Seq assay to identify CRISPR-mediated cleavage sites. The pipeline takes a samplesheet and FASTQ files as input and performs quality control (QC), UMI extraction, genome alignment, read deduplication, alignment processing, and site calling to produce a list of on-target and candidate off-target sites, along with various reports.

SITE-Seq® assay pipeline workflow diagram

Read QC (FastP)
Quality-filter UMIs
Extract UMIs (UMI-tools)
Genome alignment (Bowtie2)
Deduplicate alignments (UMI-tools)
Count read starts
Merge read starts
Calculate raw signals
Calculate raw background signals
Calculate corrected signals
Identify signal peaks
Call sites
Aggregate sites
Create the SITE-Seq® assay HTML report
Create IGV session
Present QC for raw read, alignment, and deduplication (MultiQC)
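To make the "Count read starts" step more concrete, here is a minimal sketch of the general idea: tallying the 5' end of each aligned read per chromosome, position, and strand. It is an illustration only. The `Alignment` type, the function names, and the 0-based half-open coordinate convention are assumptions, not the pipeline's actual implementation.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical minimal alignment record (not the pipeline's data model)."""
    chrom: str
    start: int   # 0-based leftmost aligned position
    end: int     # 0-based exclusive end of the alignment
    strand: str  # "+" or "-"


def read_start(aln: Alignment) -> int:
    """Return the 5' end of the read: leftmost base on "+", rightmost base on "-"."""
    return aln.start if aln.strand == "+" else aln.end - 1


def count_read_starts(alignments: Iterable[Alignment]) -> Counter:
    """Count how many reads start at each (chrom, position, strand)."""
    return Counter((a.chrom, read_start(a), a.strand) for a in alignments)


# Made-up alignments for illustration only.
alignments = [
    Alignment("chr19", 100, 150, "+"),
    Alignment("chr19", 100, 140, "+"),
    Alignment("chr19", 60, 100, "-"),
]
counts = count_read_starts(alignments)
for key, n in sorted(counts.items()):
    print(key, n)
```

Positions where many reads start on both strands are the kind of signal that later steps could turn into candidate sites; see the site-calling documentation for how the pipeline actually does this.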

Documentation

Pages containing detailed documentation are linked below.

- Usage: An overview of how the pipeline works, how to run it, and a description of the available command-line options.
- Output: An overview of the results generated by the pipeline and how to interpret them.
- SITE-Seq® Assay Report: A detailed breakdown of the core output of the pipeline, the SITE-Seq assay report.
- Site-Calling Algorithm: A description of the site-calling algorithm.
- Recommendations & Troubleshooting: Troubleshooting tips and recommendations for running the assay and pipeline.

Dependencies

Nextflow version ≥24.04.2
Docker version ≥20
Additional software packages are installed automatically during build. See CITATIONS.md for a complete list.
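As a convenience, the minimum versions above can be checked programmatically. The sketch below compares dotted version strings numerically. The helper names are hypothetical, and the sample strings only approximate what `nextflow -version` and `docker --version` print, so adjust the parsing if your tools report versions differently.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the first dotted number run (e.g. '24.04.2') as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(found: str, minimum: str) -> bool:
    """True if the version in `found` is at least `minimum` (numeric comparison)."""
    return parse_version(found) >= parse_version(minimum)


# Example strings are illustrative, not captured output.
print(meets_minimum("nextflow version 24.04.2", "24.04.2"))
print(meets_minimum("Docker version 27.1.1, build abc1234", "20"))
print(meets_minimum("nextflow version 23.10.1", "24.04.2"))
```

Comparing integer tuples rather than raw strings avoids mistakes such as treating `24.10` as older than `24.4`.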

Usage

Note: The SITE-Seq assay analysis pipeline runs on Nextflow, a workflow tool that runs tasks across different compute infrastructures in a portable manner. It uses Docker containers, which makes installation straightforward and results highly reproducible. If you are new to Nextflow, please refer to the Nextflow documentation on how to get started. We recommend testing your installation with -profile test before running the pipeline on real data.

First, clone this repository, build the necessary Docker image, and run the test profile to test your installation:

```bash
git clone https://github.com/Caribou-Biosciences/nf-SITE-Seq.git
cd nf-SITE-Seq
bash toolkits/build_image.sh
nextflow run main.nf -profile test,docker --outdir example_results
```

You only need to build the Docker image once upon installation and once before running a new release of the pipeline. The test profile runs the pipeline on a subset of sequencing reads from a SITE-Seq assay run used to detect on- and off-target cleavage using Cas9 with an AAVS1 targeting guide. Please note that the majority of reads were removed from these FASTQ files to minimize download size, but the resulting report is similar to what would be seen in a normal experiment.

To run the pipeline on your own data, prepare a samplesheet with your input data that looks as follows:

```csv
sample_group,control_group,concentration,replicate,on_target_motif,on_target_location,fastq_1,fastq_2
CONTROL,,0,1,,,/data/CONTROL_Rep1_R1.fastq.gz,/data/CONTROL_Rep1_R2.fastq.gz
CONTROL,,0,2,,,/data/CONTROL_Rep2_R1.fastq.gz,/data/CONTROL_Rep2_R2.fastq.gz
CONTROL,,0,3,,,/data/CONTROL_Rep3_R1.fastq.gz,/data/CONTROL_Rep3_R2.fastq.gz
AAVS1,CONTROL,16,1,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_16_Rep1_R1.fastq.gz,/data/AAVS1_16_Rep1_R2.fastq.gz
AAVS1,CONTROL,16,2,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_16_Rep2_R1.fastq.gz,/data/AAVS1_16_Rep2_R2.fastq.gz
AAVS1,CONTROL,16,3,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_16_Rep3_R1.fastq.gz,/data/AAVS1_16_Rep3_R2.fastq.gz
AAVS1,CONTROL,128,1,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_128_Rep1_R1.fastq.gz,/data/AAVS1_128_Rep1_R2.fastq.gz
AAVS1,CONTROL,128,2,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_128_Rep2_R1.fastq.gz,/data/AAVS1_128_Rep2_R2.fastq.gz
AAVS1,CONTROL,128,3,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_128_Rep3_R1.fastq.gz,/data/AAVS1_128_Rep3_R2.fastq.gz
FANCF,CONTROL,16,1,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_16_Rep1_R1.fastq.gz,/data/FANCF_16_Rep1_R2.fastq.gz
FANCF,CONTROL,16,2,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_16_Rep2_R1.fastq.gz,/data/FANCF_16_Rep2_R2.fastq.gz
FANCF,CONTROL,16,3,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_16_Rep3_R1.fastq.gz,/data/FANCF_16_Rep3_R2.fastq.gz
FANCF,CONTROL,128,1,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_128_Rep1_R1.fastq.gz,/data/FANCF_128_Rep1_R2.fastq.gz
FANCF,CONTROL,128,2,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_128_Rep2_R1.fastq.gz,/data/FANCF_128_Rep2_R2.fastq.gz
FANCF,CONTROL,128,3,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_128_Rep3_R1.fastq.gz,/data/FANCF_128_Rep3_R2.fastq.gz
```

Each row represents a FASTQ file (single-end) or a pair of FASTQ files (paired-end). Rows with the same sample_group, concentration, and replicate are considered technical replicates and merged automatically. An example samplesheet has been provided with the pipeline. See the usage documentation for an explanation of each samplesheet column.
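The merging rule above can be previewed before a run. The following sketch groups samplesheet rows by `sample_group`, `concentration`, and `replicate` to show which FASTQ files would be treated as technical replicates. The function name and the lane-split file paths (`L001`, `L002`) are hypothetical, and the sketch assumes that `fastq_2` is left empty for single-end rows. It does not reproduce the pipeline's own samplesheet validation.

```python
import csv
import io
from collections import defaultdict

# Illustrative samplesheet; the two CONTROL rows share the same grouping key.
SAMPLESHEET = """\
sample_group,control_group,concentration,replicate,on_target_motif,on_target_location,fastq_1,fastq_2
CONTROL,,0,1,,,/data/CONTROL_Rep1_L001_R1.fastq.gz,/data/CONTROL_Rep1_L001_R2.fastq.gz
CONTROL,,0,1,,,/data/CONTROL_Rep1_L002_R1.fastq.gz,/data/CONTROL_Rep1_L002_R2.fastq.gz
AAVS1,CONTROL,16,1,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_16_Rep1_R1.fastq.gz,/data/AAVS1_16_Rep1_R2.fastq.gz
"""


def group_technical_replicates(handle):
    """Map (sample_group, concentration, replicate) to the list of FASTQ tuples."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        key = (row["sample_group"], row["concentration"], row["replicate"])
        # Keep only non-empty paths so single-end rows yield a 1-tuple (assumption).
        fastqs = tuple(p for p in (row["fastq_1"], row["fastq_2"]) if p)
        groups[key].append(fastqs)
    return dict(groups)


groups = group_technical_replicates(io.StringIO(SAMPLESHEET))
for key, files in groups.items():
    status = "merged" if len(files) > 1 else "single"
    print(key, status, len(files))
```

To check a real samplesheet, pass an open file handle instead of the `io.StringIO` wrapper.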

Now, you can run the pipeline using:

```bash
nextflow run main.nf \
    --input samplesheet.csv \
    --outdir <OUTDIR> \
    --fasta /path/to/genome.fasta \
    -profile docker
```

Warning: Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
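As one way to follow this advice, the sketch below writes the three parameters from the command above to a JSON file and builds the matching `nextflow run` invocation with `-params-file`. Nextflow accepts JSON or YAML parameter files. The helper names, the `params.json` file name, and the `results` output directory are placeholders chosen for illustration.

```python
import json
import shlex
import tempfile
from pathlib import Path


def write_params_file(params: dict, path: Path) -> Path:
    """Serialize pipeline parameters to a JSON file readable by -params-file."""
    path.write_text(json.dumps(params, indent=2) + "\n")
    return path


def build_command(params_file: Path, profile: str = "docker") -> list:
    """Build the nextflow command as an argument list (no shell quoting needed)."""
    return [
        "nextflow", "run", "main.nf",
        "-params-file", str(params_file),
        "-profile", profile,
    ]


# Parameter names match the CLI options shown above; the values are placeholders.
params = {
    "input": "samplesheet.csv",
    "outdir": "results",
    "fasta": "/path/to/genome.fasta",
}

with tempfile.TemporaryDirectory() as tmp:
    params_file = write_params_file(params, Path(tmp) / "params.json")
    command = build_command(params_file)
    loaded = json.loads(params_file.read_text())

# Print the command for inspection; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Keeping parameters in a file also leaves a record of the exact settings used for each run.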

For more details and further functionality, including definitions of each samplesheet column and all pipeline parameters, please refer to the usage documentation.

Pipeline output

For more details about the output files and reports, please refer to the output documentation. There is also a dedicated page explaining the core output of the pipeline, the SITE-Seq® assay report. Below is a screenshot of an example report for an AAVS1 guide.

Screenshot of the SITE-Seq® assay report for an AAVS1 guide

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

Name: Caribou Biosciences
Login: Caribou-Biosciences
Kind: organization

Repositories: 1
Profile: https://github.com/Caribou-Biosciences

Citation (CITATIONS.md)

# cariboubio/siteseq: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastP](https://github.com/OpenGene/fastp)

> Shifu Chen. 2023. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2: e107. https://doi.org/10.1002/imt2.107

- [UMI-tools](https://github.com/CGATOxford/UMI-tools)

> Smith T, Heger A, Sudbery I. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Res. 2017 Mar;27(3):491-499. doi: 10.1101/gr.209601.116. Epub 2017 Jan 18. PMID: 28100584; PMCID: PMC5340976.

- [Bowtie2](https://bowtie-bio.sourceforge.net/bowtie2/index.shtml)

> Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9:357-359.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [attrs](https://github.com/python-attrs/attrs)

> Schlawack, H. attrs [Computer software]. https://doi.org/10.5281/zenodo.6925130

- [pandas](https://pandas.pydata.org/)

> The pandas development team. pandas-dev/pandas: Pandas [Computer software]. https://doi.org/10.5281/zenodo.3509134

- [NumPy](https://numpy.org/)

> Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2.

- [Parasail](https://github.com/jeffdaily/parasail)

> Daily, Jeff. (2016). Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinformatics, 17(1), 1-11. doi:10.1186/s12859-016-0930-z

- [plotly.py](https://github.com/plotly/plotly.py)

> Kruchten, N., Seier, A., & Parmer, C. (2024). An interactive, open-source, and browser-based graphing library for Python (Version 5.24.1) [Computer software]. https://doi.org/10.5281/zenodo.14503524

- [pysam](https://github.com/pysam-developers/pysam)

- [samtools](https://github.com/samtools/samtools)

> Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li, Twelve years of SAMtools and BCFtools, GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008

- [HTSlib](https://github.com/samtools/htslib)

> James K Bonfield, John Marshall, Petr Danecek, Heng Li, Valeriu Ohan, Andrew Whitwham, Thomas Keane, Robert M Davies, HTSlib: C library for reading/writing high-throughput sequencing data, GigaScience, Volume 10, Issue 2, February 2021, giab007, https://doi.org/10.1093/gigascience/giab007

- [nlohmann/json](https://github.com/nlohmann/json)

> Lohmann, N. (2023). JSON for Modern C++ (Version 3.11.3) [Computer software]. https://github.com/nlohmann

## Software packaging/containerisation tools

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

## Related tools

- [IGV](https://igv.org/)

> Robinson, J. et al. Integrative genomics viewer. Nature Biotechnology. 29, 24–26 (2011).

GitHub Events

Total

Release event: 2
Delete event: 2
Push event: 3
Pull request event: 2
Create event: 5

Last Year

Release event: 2
Delete event: 2
Push event: 3
Pull request event: 2
Create event: 5
