nf-site-seq

Nextflow bioinformatics pipeline for processing data generated by the SITE-Seq® assay

https://github.com/caribou-biosciences/nf-site-seq

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 10 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.6%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: Caribou-Biosciences
  • License: agpl-3.0
  • Language: Python
  • Default Branch: main
  • Size: 6.28 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created about 1 year ago · Last pushed 7 months ago
Metadata Files
Readme Changelog Contributing License Citation

README.md

SITE-Seq® Assay Analysis Pipeline

Introduction

The SITE-Seq® assay is a multistep biochemical assay that identifies on- and off-target CRISPR-Cas cleavage sites. The method involves tagging, enriching, sequencing, and mapping genomic DNA to a reference genome to identify cleavage sites that could lead to off-target editing in cells. The SITE-Seq assay analysis pipeline is a Nextflow pipeline that processes and analyzes data generated by the SITE-Seq assay to identify CRISPR-mediated cleavage sites. The pipeline takes a samplesheet and FASTQ files as input and performs quality control (QC), UMI extraction, genome alignment, read deduplication, alignment processing, and site calling to produce a list of on-target and candidate off-target sites, along with several reports.

SITE-Seq® assay pipeline workflow diagram

  1. Read QC (FastP)
  2. Quality-filter UMIs
  3. Extract UMIs (UMI-tools)
  4. Genome alignment (Bowtie2)
  5. Deduplicate alignments (UMI-tools)
  6. Count read starts
  7. Merge read starts
  8. Calculate raw signals
  9. Calculate raw background signals
  10. Calculate corrected signals
  11. Identify signal peaks
  12. Call sites
  13. Aggregate sites
  14. Create the SITE-Seq® assay HTML report
  15. Create IGV session
  16. Present QC for raw read, alignment, and deduplication (MultiQC)
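
To make step 6 more concrete, here is a minimal Python sketch of what counting read starts can look like. It is an illustration only and not the pipeline's implementation: the tuple layout, the `count_read_starts` name, and the 0-based half-open coordinates are all assumptions made for this example. It treats the 5' end of each alignment as the read start, which is the leftmost base for forward-strand reads and the rightmost base for reverse-strand reads.

```python
from collections import Counter

# Hypothetical alignment record: (chrom, start, end, strand), using 0-based,
# half-open coordinates. A real run would read alignments from BAM files.
def count_read_starts(alignments):
    """Count alignments by the genomic position of their 5' end."""
    counts = Counter()
    for chrom, start, end, strand in alignments:
        # Forward reads begin at `start`; reverse reads begin at the last
        # aligned base, which is `end - 1` in half-open coordinates.
        five_prime = start if strand == "+" else end - 1
        counts[(chrom, five_prime, strand)] += 1
    return counts

# Toy data: two forward reads sharing a start, one reverse read.
alignments = [
    ("chr19", 100, 150, "+"),
    ("chr19", 100, 140, "+"),
    ("chr19", 60, 100, "-"),
]
read_starts = count_read_starts(alignments)
```

Pile-ups of read starts at a single position are the kind of signal that the later signal and peak steps would operate on; how the pipeline actually scores them is described in the site-calling documentation.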

Documentation

Pages containing detailed documentation are linked below.

  • Usage
    • An overview of how the pipeline works, how to run it, and the available command-line options.
  • Output
    • An overview of results generated by the pipeline and how to interpret them.
  • SITE-Seq® Assay Report
    • A detailed breakdown of the core output of the pipeline, the SITE-Seq assay report.
  • Site-Calling Algorithm
    • A description of the site-calling algorithm.
  • Recommendations & Troubleshooting
    • Troubleshooting tips and recommendations for running the assay and pipeline.

Dependencies

  • Nextflow version ≥24.04.2
  • Docker version ≥20
  • Additional software packages are installed automatically during build. See CITATIONS.md for a complete list.
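
The minimum versions above can be checked with a small helper before a first run. The following is a sketch, not part of the pipeline: `parse_version` and `meets_minimum` are hypothetical names, and the example strings only approximate what `nextflow -version` and `docker --version` print.

```python
import re

def parse_version(text):
    """Return the first dotted version number in `text` as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))

def meets_minimum(found_text, required):
    """True if the version found in `found_text` is at least `required`."""
    return parse_version(found_text) >= parse_version(required)

# Illustrative strings (assumed, not captured from a real installation).
nextflow_ok = meets_minimum("version 24.04.2 build 5914", "24.04.2")
docker_ok = meets_minimum("Docker version 20.10.7, build f0df350", "20")
```

Versions are compared as integer tuples, so `24.04.2` is read as `(24, 4, 2)` and a longer version such as `20.10.7` still satisfies a bare minimum of `20`.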

Usage

Note: The SITE-Seq assay analysis pipeline runs on Nextflow, a workflow tool that runs tasks across different compute infrastructures in a portable manner. It uses Docker containers, which makes installation trivial and results highly reproducible. If you are new to Nextflow, please refer to this page on how to get started. We recommend testing your installation using -profile test before running the pipeline with real data.

First, clone this repository, build the necessary Docker image, and run the test profile to test your installation:

```bash
git clone https://github.com/Caribou-Biosciences/nf-SITE-Seq.git
cd nf-SITE-Seq
bash toolkits/build_image.sh
nextflow run main.nf -profile test,docker --outdir example_results
```

You only need to build the Docker image once at installation and again before running each new release of the pipeline. The test profile runs the pipeline on a subset of sequencing reads from a SITE-Seq assay run used to detect on- and off-target cleavage by Cas9 with an AAVS1-targeting guide. Most reads were removed from these FASTQ files to minimize download size, but the resulting report is similar to what would be seen in a normal experiment.

To run the pipeline on your own data, prepare a samplesheet with your input data that looks as follows:

```csv
sample_group,control_group,concentration,replicate,on_target_motif,on_target_location,fastq_1,fastq_2
CONTROL,,0,1,,,/data/CONTROL_Rep1_R1.fastq.gz,/data/CONTROL_Rep1_R2.fastq.gz
CONTROL,,0,2,,,/data/CONTROL_Rep2_R1.fastq.gz,/data/CONTROL_Rep2_R2.fastq.gz
CONTROL,,0,3,,,/data/CONTROL_Rep3_R1.fastq.gz,/data/CONTROL_Rep3_R2.fastq.gz
AAVS1,CONTROL,16,1,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_16_Rep1_R1.fastq.gz,/data/AAVS1_16_Rep1_R2.fastq.gz
AAVS1,CONTROL,16,2,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_16_Rep2_R1.fastq.gz,/data/AAVS1_16_Rep2_R2.fastq.gz
AAVS1,CONTROL,16,3,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_16_Rep3_R1.fastq.gz,/data/AAVS1_16_Rep3_R2.fastq.gz
AAVS1,CONTROL,128,1,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_128_Rep1_R1.fastq.gz,/data/AAVS1_128_Rep1_R2.fastq.gz
AAVS1,CONTROL,128,2,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_128_Rep2_R1.fastq.gz,/data/AAVS1_128_Rep2_R2.fastq.gz
AAVS1,CONTROL,128,3,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_128_Rep3_R1.fastq.gz,/data/AAVS1_128_Rep3_R2.fastq.gz
FANCF,CONTROL,16,1,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_16_Rep1_R1.fastq.gz,/data/FANCF_16_Rep1_R2.fastq.gz
FANCF,CONTROL,16,2,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_16_Rep2_R1.fastq.gz,/data/FANCF_16_Rep2_R2.fastq.gz
FANCF,CONTROL,16,3,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_16_Rep3_R1.fastq.gz,/data/FANCF_16_Rep3_R2.fastq.gz
FANCF,CONTROL,128,1,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_128_Rep1_R1.fastq.gz,/data/FANCF_128_Rep1_R2.fastq.gz
FANCF,CONTROL,128,2,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_128_Rep2_R1.fastq.gz,/data/FANCF_128_Rep2_R2.fastq.gz
FANCF,CONTROL,128,3,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_128_Rep3_R1.fastq.gz,/data/FANCF_128_Rep3_R2.fastq.gz
```

Each row represents a FASTQ file (single-end) or a pair of FASTQ files (paired-end). Rows with the same sample_group, concentration, and replicate are considered technical replicates and merged automatically. An example samplesheet has been provided with the pipeline. See the usage documentation for an explanation of each samplesheet column.
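
The replicate-merging rule can be previewed before a run with a short Python sketch that groups samplesheet rows by sample_group, concentration, and replicate. This is an illustration of the grouping rule as described here, not the pipeline's own code; the `group_technical_replicates` function and the lane-split file names (`L001`, `L002`) are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical samplesheet: the first two rows share the same key, so they
# would be treated as technical replicates of one sample.
SAMPLESHEET = """\
sample_group,control_group,concentration,replicate,on_target_motif,on_target_location,fastq_1,fastq_2
CONTROL,,0,1,,,/data/CONTROL_Rep1_L001_R1.fastq.gz,/data/CONTROL_Rep1_L001_R2.fastq.gz
CONTROL,,0,1,,,/data/CONTROL_Rep1_L002_R1.fastq.gz,/data/CONTROL_Rep1_L002_R2.fastq.gz
CONTROL,,0,2,,,/data/CONTROL_Rep2_R1.fastq.gz,/data/CONTROL_Rep2_R2.fastq.gz
"""

def group_technical_replicates(handle):
    """Map (sample_group, concentration, replicate) to its FASTQ pairs."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        key = (row["sample_group"], row["concentration"], row["replicate"])
        groups[key].append((row["fastq_1"], row["fastq_2"]))
    return dict(groups)

groups = group_technical_replicates(io.StringIO(SAMPLESHEET))
for key, fastq_pairs in groups.items():
    print(key, "->", len(fastq_pairs), "FASTQ pair(s)")
```

Any key that maps to more than one FASTQ pair corresponds to rows that would be merged, which makes accidental duplicates in a samplesheet easy to spot.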

Now, you can run the pipeline using:

```bash
nextflow run main.nf \
    --input samplesheet.csv \
    --outdir <OUTDIR> \
    --fasta /path/to/genome.fasta \
    -profile docker
```

> [!WARNING]
> Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
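
As one way to use the -params-file option, the sketch below writes the three parameters from the command above to a JSON file, a format Nextflow accepts for parameter files. The file name `params.json` and the `results` output directory are placeholders chosen for this example, not defaults of the pipeline.

```python
import json
from pathlib import Path

# Parameter names mirror the CLI flags shown above (--input, --outdir, --fasta).
params = {
    "input": "samplesheet.csv",
    "outdir": "results",  # placeholder standing in for <OUTDIR>
    "fasta": "/path/to/genome.fasta",
}

# Hypothetical file name; any path can be passed to -params-file.
params_path = Path("params.json")
params_path.write_text(json.dumps(params, indent=2) + "\n")
```

The pipeline could then be started with `nextflow run main.nf -params-file params.json -profile docker`, which keeps parameters out of custom config files.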

For more details and further functionality, including definitions of each samplesheet column and all pipeline parameters, please refer to the usage documentation.

Pipeline output

For more details about the output files and reports, please refer to the output documentation. There is also a dedicated page explaining the core output of the pipeline, the SITE-Seq® assay report. Below is a screenshot of an example report for an AAVS1 guide.

Screenshot of the SITE-Seq® assay report for an AAVS1 guide

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

  • Name: Caribou Biosciences
  • Login: Caribou-Biosciences
  • Kind: organization

Citation (CITATIONS.md)

# cariboubio/siteseq: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastP](https://github.com/OpenGene/fastp)

> Shifu Chen. 2023. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2: e107. https://doi.org/10.1002/imt2.107

- [UMI-tools](https://github.com/CGATOxford/UMI-tools)

> Smith T, Heger A, Sudbery I. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Res. 2017 Mar;27(3):491-499. doi: 10.1101/gr.209601.116. Epub 2017 Jan 18. PMID: 28100584; PMCID: PMC5340976.

- [Bowtie2](https://bowtie-bio.sourceforge.net/bowtie2/index.shtml)

> Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9:357-359.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [attrs](https://github.com/python-attrs/attrs)

> Schlawack, H. attrs [Computer software]. https://doi.org/10.5281/zenodo.6925130

- [pandas](https://pandas.pydata.org/)

> The pandas development team. pandas-dev/pandas: Pandas [Computer software]. https://doi.org/10.5281/zenodo.3509134

- [NumPy](https://numpy.org/)

> Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2.

- [Parasail](https://github.com/jeffdaily/parasail)

> Daily, Jeff. (2016). Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinformatics, 17(1), 1-11. doi:10.1186/s12859-016-0930-z

- [plotly.py](https://github.com/plotly/plotly.py)

> Kruchten, N., Seier, A., & Parmer, C. (2024). An interactive, open-source, and browser-based graphing library for Python (Version 5.24.1) [Computer software]. https://doi.org/10.5281/zenodo.14503524

- [pysam](https://github.com/pysam-developers/pysam)

- [samtools](https://github.com/samtools/samtools)

> Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li, Twelve years of SAMtools and BCFtools, GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008

- [HTSlib](https://github.com/samtools/htslib)

> James K Bonfield, John Marshall, Petr Danecek, Heng Li, Valeriu Ohan, Andrew Whitwham, Thomas Keane, Robert M Davies, HTSlib: C library for reading/writing high-throughput sequencing data, GigaScience, Volume 10, Issue 2, February 2021, giab007, https://doi.org/10.1093/gigascience/giab007

- [nlohmann/json](https://github.com/nlohmann/json)

> Lohmann, N. (2023). JSON for Modern C++ (Version 3.11.3) [Computer software]. https://github.com/nlohmann

## Software packaging/containerisation tools

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

> Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

## Related tools

- [IGV](https://igv.org/)

> Robinson, J. et al. Integrative genomics viewer. Nature Biotechnology. 29, 24–26 (2011).

GitHub Events

Total
  • Release event: 2
  • Delete event: 2
  • Push event: 3
  • Pull request event: 2
  • Create event: 5
Last Year
  • Release event: 2
  • Delete event: 2
  • Push event: 3
  • Pull request event: 2
  • Create event: 5