nf-site-seq
Nextflow bioinformatics pipeline for processing data generated by the SITE-Seq® assay
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 10 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Caribou-Biosciences
- License: agpl-3.0
- Language: Python
- Default Branch: main
- Size: 6.28 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
SITE-Seq® Assay Analysis Pipeline
Introduction
The SITE-Seq® assay is a multistep biochemical assay that identifies on- and off-target CRISPR-Cas cleavage sites. The method involves tagging, enriching, sequencing, and mapping genomic DNA to a reference genome to identify cleavage sites that could lead to off-target editing in cells. The SITE-Seq assay analysis pipeline is a Nextflow pipeline that processes and analyzes data generated by the SITE-Seq assay to identify CRISPR-mediated cleavage sites. The pipeline takes a samplesheet and FASTQ files as input and performs quality control (QC), UMI extraction, genome alignment, read deduplication, alignment processing, and site calling. It produces a list of on-target and candidate off-target sites, along with various reports.

- Read QC (FastP)
- Quality-filter UMIs
- Extract UMIs (UMI-tools)
- Genome alignment (Bowtie2)
- Deduplicate alignments (UMI-tools)
- Count read starts
- Merge read starts
- Calculate raw signals
- Calculate raw background signals
- Calculate corrected signals
- Identify signal peaks
- Call sites
- Aggregate sites
- Create the SITE-Seq® assay HTML report
- Create IGV session
- Present QC for raw read, alignment, and deduplication (MultiQC)
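To make the deduplication and read-start counting steps more concrete, here is a minimal Python sketch. It is an illustration only, not the pipeline's code: the pipeline delegates deduplication to UMI-tools, whereas this sketch uses plain exact-match deduplication as a simplified stand-in. The record layout, the function names, and the toy alignments are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy alignments: (chromosome, read start, strand, UMI).
alignments = [
    ("chr19", 55115765, "+", "ACGT"),
    ("chr19", 55115765, "+", "ACGT"),  # same position and UMI: treated as a duplicate
    ("chr19", 55115765, "+", "TTGA"),  # same position, different UMI: kept
    ("chr19", 55115766, "-", "ACGT"),
    ("chr11", 22625802, "+", "CCAT"),
]


def deduplicate(records):
    """Keep the first record for each (chrom, start, strand, UMI) combination."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


def count_read_starts(records):
    """Count how many deduplicated reads start at each (chrom, start, strand)."""
    return Counter((chrom, start, strand) for chrom, start, strand, _umi in records)


unique = deduplicate(alignments)
read_starts = count_read_starts(unique)

for position, count in sorted(read_starts.items()):
    print(position, count)
```

The later steps (signal calculation, background correction, peak identification, and site calling) are described in the site-calling algorithm documentation and are deliberately not sketched here.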
Documentation
Pages containing detailed documentation are linked below.
- Usage
- An overview of how the pipeline works, running it, and a description of available command-line options.
- Output
- An overview of results generated by the pipeline and how to interpret them.
- SITE-Seq® Assay Report
- A detailed breakdown of the core output of the pipeline, the SITE-Seq assay report.
- Site-Calling Algorithm
- A description of the site-calling algorithm.
- Recommendations & Troubleshooting
- Troubleshooting tips and recommendations for running the assay and pipeline.
Dependencies
- Nextflow version ≥24.04.2
- Docker version ≥20
- Additional software packages are installed automatically during build. See CITATIONS.md for a complete list.
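As a rough aid for checking the minimum versions listed above, the following Python sketch compares dotted version strings numerically. It is only a sketch under stated assumptions: the helper names are hypothetical, and the installed version strings have to be supplied by you (for example, copied from the tools' own version output); the code does not query Nextflow or Docker itself.

```python
import re

# Minimum versions taken from the dependency list above.
MINIMUM_VERSIONS = {"nextflow": "24.04.2", "docker": "20"}


def parse_version(text):
    """Turn a leading dotted version such as '24.04.2' into a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so that '20' compares like '20.0.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Hypothetical installed versions, used only for illustration.
print(meets_minimum("24.10.0", MINIMUM_VERSIONS["nextflow"]))
print(meets_minimum("19.03.5", MINIMUM_VERSIONS["docker"]))
```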
Usage
Note: The SITE-Seq assay analysis pipeline runs on Nextflow, a workflow tool for running tasks across different compute infrastructures in a portable manner. It uses Docker containers, which makes installation trivial and results highly reproducible. If you are new to Nextflow, please refer to this page on how to get started. We recommend testing your installation using -profile test before running the pipeline with real data.
First, clone this repository, build the necessary Docker image, and run the test profile to test your installation:
bash
git clone https://github.com/Caribou-Biosciences/nf-SITE-Seq.git
cd nf-SITE-Seq
bash toolkits/build_image.sh
nextflow run main.nf -profile test,docker --outdir example_results
You only need to build the Docker image once upon installation and once before running a new release of the pipeline. The test profile runs the pipeline on a subset of sequencing reads from a SITE-Seq assay run used to detect on- and off-target cleavage using Cas9 with an AAVS1 targeting guide. Please note that the majority of reads were removed from these FASTQ files to minimize download size, but the resulting report is similar to what would be seen in a normal experiment.
To run the pipeline on your own data, prepare a samplesheet with your input data that looks as follows:
csv
sample_group,control_group,concentration,replicate,on_target_motif,on_target_location,fastq_1,fastq_2
CONTROL,,0,1,,,/data/CONTROL_Rep1_R1.fastq.gz,/data/CONTROL_Rep1_R2.fastq.gz
CONTROL,,0,2,,,/data/CONTROL_Rep2_R1.fastq.gz,/data/CONTROL_Rep2_R2.fastq.gz
CONTROL,,0,3,,,/data/CONTROL_Rep3_R1.fastq.gz,/data/CONTROL_Rep3_R2.fastq.gz
AAVS1,CONTROL,16,1,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_16_Rep1_R1.fastq.gz,/data/AAVS1_16_Rep1_R2.fastq.gz
AAVS1,CONTROL,16,2,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_16_Rep2_R1.fastq.gz,/data/AAVS1_16_Rep2_R2.fastq.gz
AAVS1,CONTROL,16,3,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_16_Rep3_R1.fastq.gz,/data/AAVS1_16_Rep3_R2.fastq.gz
AAVS1,CONTROL,128,1,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_128_Rep1_R1.fastq.gz,/data/AAVS1_128_Rep1_R2.fastq.gz
AAVS1,CONTROL,128,2,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_128_Rep2_R1.fastq.gz,/data/AAVS1_128_Rep2_R2.fastq.gz
AAVS1,CONTROL,128,3,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_128_Rep3_R1.fastq.gz,/data/AAVS1_128_Rep3_R2.fastq.gz
FANCF,CONTROL,16,1,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_16_Rep1_R1.fastq.gz,/data/FANCF_16_Rep1_R2.fastq.gz
FANCF,CONTROL,16,2,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_16_Rep2_R1.fastq.gz,/data/FANCF_16_Rep2_R2.fastq.gz
FANCF,CONTROL,16,3,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_16_Rep3_R1.fastq.gz,/data/FANCF_16_Rep3_R2.fastq.gz
FANCF,CONTROL,128,1,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_128_Rep1_R1.fastq.gz,/data/FANCF_128_Rep1_R2.fastq.gz
FANCF,CONTROL,128,2,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_128_Rep2_R1.fastq.gz,/data/FANCF_128_Rep2_R2.fastq.gz
FANCF,CONTROL,128,3,GGAATCCCTTCTGCAGCACCNGG,chr11:22625786-22625808,/data/FANCF_128_Rep3_R1.fastq.gz,/data/FANCF_128_Rep3_R2.fastq.gz
Each row represents a FASTQ file (single-end) or a pair of FASTQ files (paired-end). Rows with the same sample_group, concentration, and replicate are considered technical replicates and merged automatically. An example samplesheet has been provided with the pipeline. See the usage documentation for an explanation of each samplesheet column.
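The grouping rule for technical replicates can be sketched in Python as follows. This is a hedged illustration of the rule as stated above, not the pipeline's implementation: the function name is hypothetical, and the toy samplesheet uses invented L001/L002 file names so that two rows share the same sample_group, concentration, and replicate.

```python
import csv
import io
from collections import defaultdict

# Toy samplesheet; the L001/L002 FASTQ file names are hypothetical.
SAMPLESHEET = """\
sample_group,control_group,concentration,replicate,on_target_motif,on_target_location,fastq_1,fastq_2
CONTROL,,0,1,,,/data/CONTROL_Rep1_L001_R1.fastq.gz,/data/CONTROL_Rep1_L001_R2.fastq.gz
CONTROL,,0,1,,,/data/CONTROL_Rep1_L002_R1.fastq.gz,/data/CONTROL_Rep1_L002_R2.fastq.gz
AAVS1,CONTROL,16,1,GGGGCCACTAGGGACAGGATNGG,chr19:55115749-55115771,/data/AAVS1_16_Rep1_R1.fastq.gz,/data/AAVS1_16_Rep1_R2.fastq.gz
"""


def group_technical_replicates(samplesheet_text):
    """Group FASTQ pairs by (sample_group, concentration, replicate)."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        key = (row["sample_group"], row["concentration"], row["replicate"])
        groups[key].append((row["fastq_1"], row["fastq_2"]))
    return dict(groups)


groups = group_technical_replicates(SAMPLESHEET)
for key, fastq_pairs in groups.items():
    # Keys with more than one FASTQ pair are the rows that would be merged.
    print(key, "->", len(fastq_pairs), "FASTQ pair(s)")
```

Running a check like this before launching the pipeline can help catch rows that would be merged unintentionally, for example because a replicate number was repeated by mistake.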
Now, you can run the pipeline using:
bash
nextflow run main.nf \
--input samplesheet.csv \
--outdir <OUTDIR> \
--fasta /path/to/genome.fasta \
-profile docker
[!WARNING] Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
For more details and further functionality, including definitions of each samplesheet column and all pipeline parameters, please refer to the usage documentation.
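As an alternative to passing each parameter on the command line, the parameters can be collected in a file for the Nextflow -params-file option, which accepts JSON or YAML. The Python sketch below renders such a file and the matching command. It is an assumption-laden example: the helper names and the params.json file name are hypothetical, "results" stands in for your own output directory, and only the three parameters shown in the command above are included.

```python
import json
import shlex

# Parameter names mirror the CLI flags shown above; the values are placeholders.
params = {
    "input": "samplesheet.csv",
    "outdir": "results",  # hypothetical output directory
    "fasta": "/path/to/genome.fasta",
}


def render_params(params):
    """Serialize pipeline parameters as JSON text for -params-file."""
    return json.dumps(params, indent=2) + "\n"


def build_command(params_file):
    """Build the nextflow command that reads parameters from a file."""
    return ["nextflow", "run", "main.nf", "-params-file", params_file, "-profile", "docker"]


if __name__ == "__main__":
    # Print the file contents and the command instead of writing or running anything.
    print(render_params(params))
    print(shlex.join(build_command("params.json")))
```

Saving the rendered text as params.json and running the printed command should be equivalent to passing --input, --outdir, and --fasta directly; see the usage documentation for the full list of parameters.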
Pipeline output
For more details about the output files and reports, please refer to the output documentation. There is also a dedicated page explaining the core output of the pipeline, the SITE-Seq® assay report. Below is a screenshot of an example report for an AAVS1 guide.

Citations
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
Owner
- Name: Caribou Biosciences
- Login: Caribou-Biosciences
- Kind: organization
- Repositories: 1
- Profile: https://github.com/Caribou-Biosciences
Citation (CITATIONS.md)
# cariboubio/siteseq: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastP](https://github.com/OpenGene/fastp)

  > Shifu Chen. 2023. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2: e107. https://doi.org/10.1002/imt2.107

- [UMI-tools](https://github.com/CGATOxford/UMI-tools)

  > Smith T, Heger A, Sudbery I. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Res. 2017 Mar;27(3):491-499. doi: 10.1101/gr.209601.116. Epub 2017 Jan 18. PMID: 28100584; PMCID: PMC5340976.

- [Bowtie2](https://bowtie-bio.sourceforge.net/bowtie2/index.shtml)

  > Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9:357-359.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [attrs](https://github.com/python-attrs/attrs)

  > Schlawack, H. attrs [Computer software]. https://doi.org/10.5281/zenodo.6925130

- [pandas](https://pandas.pydata.org/)

  > The pandas development team. pandas-dev/pandas: Pandas [Computer software]. https://doi.org/10.5281/zenodo.3509134

- [NumPy](https://numpy.org/)

  > Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2.

- [Parasail](https://github.com/jeffdaily/parasail)

  > Daily, Jeff. (2016). Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinformatics, 17(1), 1-11. doi:10.1186/s12859-016-0930-z

- [plotly.py](https://github.com/plotly/plotly.py)

  > Kruchten, N., Seier, A., & Parmer, C. (2024). An interactive, open-source, and browser-based graphing library for Python (Version 5.24.1) [Computer software]. https://doi.org/10.5281/zenodo.14503524

- [pysam](https://github.com/pysam-developers/pysam)

- [samtools](https://github.com/samtools/samtools)

  > Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li, Twelve years of SAMtools and BCFtools, GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008

- [HTSlib](https://github.com/samtools/htslib)

  > James K Bonfield, John Marshall, Petr Danecek, Heng Li, Valeriu Ohan, Andrew Whitwham, Thomas Keane, Robert M Davies, HTSlib: C library for reading/writing high-throughput sequencing data, GigaScience, Volume 10, Issue 2, February 2021, giab007, https://doi.org/10.1093/gigascience/giab007

- [nlohmann/json](https://github.com/nlohmann/json)

  > Lohmann, N. (2023). JSON for Modern C++ (Version 3.11.3) [Computer software]. https://github.com/nlohmann

## Software packaging/containerisation tools

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

## Related tools

- [IGV](https://igv.org/)

  > Robinson, J. et al. Integrative genomics viewer. Nature Biotechnology. 29, 24–26 (2011).
GitHub Events
Total
- Release event: 2
- Delete event: 2
- Push event: 3
- Pull request event: 2
- Create event: 5
Last Year
- Release event: 2
- Delete event: 2
- Push event: 3
- Pull request event: 2
- Create event: 5