crispr_pipeline
Science Score: 67.0%
This score indicates how likely this project is to be science-related, based on the following indicators:

- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ✓ DOI references: found 7 DOI reference(s) in README
- ✓ Academic publication links: links to ncbi.nlm.nih.gov
- ○ Academic email domains: not detected
- ○ Institutional organization owner: not detected
- ○ JOSS paper metadata: not detected
- ○ Scientific vocabulary similarity: low similarity (12.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: pinellolab
- License: MIT
- Language: Python
- Default Branch: main
- Size: 27.3 MB
Statistics
- Stars: 9
- Watchers: 5
- Forks: 5
- Open Issues: 8
- Releases: 1
Metadata Files
README.md
CRISPR Pipeline
A comprehensive pipeline for single-cell Perturb-Seq analysis that enables robust processing and analysis of CRISPR screening data at single-cell resolution.
Prerequisites
Nextflow and Singularity must be installed before running the pipeline:
Nextflow (version > 24)
Workflow manager for executing the pipeline:

```bash
conda install bioconda::nextflow
```
Singularity
Container platform that must be available in your execution environment.
Nextflow Tower Integration
Nextflow Tower provides a web-based interface for monitoring and managing pipeline executions.
To enable Nextflow Tower, you must set a TOWER_ACCESS_TOKEN.
To obtain your token:
1. Create/login to your account at cloud.tower.nf
2. Navigate to Settings > Your tokens
3. Click "Add token" and generate a new token
4. Set as environment variable: export TOWER_ACCESS_TOKEN=your_token_here
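As a pre-launch sanity check, a short Python sketch (a hypothetical helper, not part of the pipeline) can confirm the token variable is actually set before you start a run:

```python
import os


def tower_token_configured() -> bool:
    """Return True if TOWER_ACCESS_TOKEN is set to a non-empty value."""
    return bool(os.environ.get("TOWER_ACCESS_TOKEN", "").strip())


if __name__ == "__main__":
    if tower_token_configured():
        print("Tower token found; run monitoring will be enabled.")
    else:
        print("TOWER_ACCESS_TOKEN is not set; Tower monitoring will be unavailable.")
```

If the token is missing, Nextflow itself will still run; only the Tower monitoring integration is affected.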
Pipeline Installation
To install the pipeline:
```bash
git clone https://github.com/pinellolab/CRISPR_Pipeline.git
```
Input Requirements
File Descriptions
FASTQ Files
- {sample}_R1.fastq.gz: Contains cell barcode and UMI sequences
- {sample}_R2.fastq.gz: Contains transcript sequences
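Before a run, it can help to verify that every sample has both mates following this naming convention. A minimal sketch (hypothetical helper, not part of the pipeline):

```python
from pathlib import Path


def missing_fastq_mates(fastq_dir: str, samples: list[str]) -> list[str]:
    """Return the expected {sample}_R1/_R2.fastq.gz files that are absent.

    Illustrates the naming convention only; the pipeline performs its own
    input validation.
    """
    root = Path(fastq_dir)
    missing = []
    for sample in samples:
        for read in ("R1", "R2"):
            path = root / f"{sample}_{read}.fastq.gz"
            if not path.is_file():
                missing.append(path.name)
    return missing
```

An empty return value means all expected pairs are present.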
YAML Configuration Files (see example_data/)
- rna_seqspec.yml: Defines RNA sequencing structure and parameters
- guide_seqspec.yml: Specifies guide RNA detection parameters
- hash_seqspec.yml: Defines cell hashing structure (required if using cell hashing)
- barcode_onlist.txt: List of valid cell barcodes
Metadata Files (see example_data/)
- guide_metadata.tsv: Contains guide RNA information and annotations
- hash_metadata.tsv: Cell hashing sample information (required if using cell hashing)
- pairs_to_test.csv: Defines perturbation pairs for comparison analysis (required if testing predefined pairs)
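Since some inputs are only required under certain run configurations, the rules above can be summarized in a small sketch (a hypothetical helper for illustration; file names come from the lists above):

```python
# File names taken from the input requirements above; the helper itself
# is illustrative and not part of the pipeline.
ALWAYS_REQUIRED = [
    "rna_seqspec.yml",
    "guide_seqspec.yml",
    "barcode_onlist.txt",
    "guide_metadata.tsv",
]


def required_inputs(use_hashing: bool, test_predefined_pairs: bool) -> list[str]:
    """List the input files required for a given run configuration."""
    files = list(ALWAYS_REQUIRED)
    if use_hashing:
        files += ["hash_seqspec.yml", "hash_metadata.tsv"]
    if test_predefined_pairs:
        files.append("pairs_to_test.csv")
    return files
```

For example, a run with cell hashing but no predefined pairs needs the four core files plus the two hashing files.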
For detailed specifications, see our documentation.
Running the Pipeline
Pipeline Configuration
Before running the pipeline, customize the configuration files for your environment:
1. Data and Analysis Parameters (nextflow.config)
Update the pipeline-specific parameters in the params section, for example:
```groovy
// Input data paths
input = "/path/to/your/samplesheet.csv"

// Analysis parameters (adjust for your experiment)
QC_min_genes_per_cell = 500
QC_min_cells_per_gene = 3
QC_pct_mito = 20
GUIDE_ASSIGNMENT_method = 'sceptre'  // or 'cleanser'
INFERENCE_method = 'perturbo'        // or 'sceptre'
```
2. Compute Environment Configuration
Choose and configure your compute profile by updating the relevant sections:
🖥️ Local
```groovy
// Resource limits (adjust based on your machine)
max_cpus = 8          // Number of CPU cores available
max_memory = '32.GB'  // RAM available for the pipeline

// Run with: nextflow run main.nf -profile local
```
🏢 SLURM Cluster
```groovy
// Resource limits (adjust based on cluster specs)
max_cpus = 128
max_memory = '512.GB'

// Update SLURM partitions in profiles section:
slurm {
    process {
        queue = 'short,normal,long'  // Replace with your partition names
    }
}

// Run with: nextflow run main.nf -profile slurm
```
☁️ Google Cloud Platform
```groovy
// Update GCP settings
google_bucket = 'gs://your-bucket-name'
google_project = 'your-gcp-project-id'
google_region = 'us-central1'  // Choose your preferred region

// Resource limits
max_cpus = 128
max_memory = '512.GB'

// Run with (see more in GCP_user_notebook.ipynb):
// export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/pipeline-service-key.json"
// nextflow run main.nf -profile google
```
3. Container Configuration
The pipeline uses pre-built containers. Update if you have custom versions:
```groovy
containers {
    base     = 'sjiang9/conda-docker:0.2'     // Main analysis tools
    cleanser = 'sjiang9/cleanser:0.3'         // Guide assignment
    sceptre  = 'sjiang9/sceptre-igvf:0.1'     // Guide assignment / perturbation inference
    perturbo = 'loganblaine/perturbo:latest'  // Perturbation inference
}
```
🎯 Resource Sizing Guidelines
Recommended Starting Values:
| Environment | max_cpus | max_memory | Notes |
|-------------|----------|------------|-------|
| Local (development) | 4-8 | 16-32 GB | For testing small datasets |
| Local (full analysis) | 8-16 | 64-128 GB | For complete runs |
| SLURM cluster | 64-128 | 256-512 GB | Adjust based on node specs |
| Google Cloud | 128+ | 512 GB+ | Can scale dynamically |
🔧 Testing Your Configuration
Validate syntax:
```bash
nextflow config -profile local   # Test local profile
nextflow config -profile slurm   # Test SLURM profile
```

Test with a small dataset:

```bash
# Start with a subset of your data

# Make all scripts executable (required for pipeline execution)
chmod +x bin/*

# Run the pipeline
nextflow run main.nf -profile local --input small_test.tsv --outdir ./Outputs
```
💡 Pro Tips
- Start conservative: Begin with lower resource limits and increase as needed
- Profile-specific limits: The pipeline automatically scales resources based on retry attempts
- Development workflow: Use local profile for code testing, cluster/cloud for production runs
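The retry-based scaling mentioned above typically follows the common Nextflow idiom `memory { base * task.attempt }`, capped at the profile's limit. A sketch of that arithmetic (illustrative only; this pipeline's actual config may scale differently):

```python
def scaled_memory_gb(base_gb: int, attempt: int, max_gb: int) -> int:
    """Memory request for a given retry attempt, capped at the profile limit.

    Mirrors the common Nextflow pattern `memory { base * task.attempt }`;
    the exact scaling used by this pipeline's config may differ.
    """
    return min(base_gb * attempt, max_gb)
```

With a 32 GB base and a 128 GB cap, attempts 1, 3, and 5 would request 32, 96, and 128 GB respectively, which is why a process that fails with an out-of-memory error on the first attempt can often succeed on retry without config changes.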
🚨 Common Issues
- Memory errors: Increase max_memory if you see out-of-memory failures
- Queue timeouts: Adjust SLURM partition names to match your cluster
- Permission errors: Ensure your Google Cloud service account has proper permissions
- Container issues: Verify Singularity is available on your system
- Missing files: Double-check paths in nextflow.config and actual files in example_data
Output Description
The output files will be generated in the pipeline_outputs and pipeline_dashboard directories.
Generated Files
Within the pipeline_outputs directory, you will find:
- inference_mudata.h5mu - MuData format output
- per_element_output.tsv - Per-element analysis
- per_guide_output.tsv - Per-guide analysis
Structure:
📁 pipeline_outputs/
├── 📄 inference_mudata.h5mu
├── 📄 per_element_output.tsv
└── 📄 per_guide_output.tsv
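For a quick post-run check of the TSV outputs, a stdlib-only sketch can count result rows without assuming a column schema (the exact columns depend on the inference method, so this hypothetical helper deliberately avoids naming them):

```python
import csv


def count_rows(tsv_path: str) -> int:
    """Count data rows in a pipeline TSV output (per-element or per-guide).

    Column names vary with the chosen inference method, so this sketch
    only counts rows after the header rather than assuming a schema.
    """
    with open(tsv_path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader, None)  # skip the header line
        return sum(1 for _ in reader)
```

A row count of zero in per_guide_output.tsv usually signals a problem upstream (e.g., no guides assigned) rather than a successful run.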
For details, see our documentation.
Generated Figures
The pipeline produces several figures:
Within the pipeline_dashboard directory, you will find:
Evaluation Output:
- network_plot.png: Gene interaction network visualization.
- volcano_plot.png: gRNA-gene pair analysis.
- IGV files (.bedgraph and .bedpe): Genome browser visualization files.
Analysis Figures:
- knee_plot_scRNA.png: Knee plot of UMI counts vs. barcode index.
- scatterplot_scrna.png: Scatterplot of total counts vs. genes detected, colored by mitochondrial content.
- violin_plot.png: Distribution of gene counts, total counts, and mitochondrial content.
- scRNA_barcodes_UMI_thresholds.png: Number of scRNA barcodes at different total UMI thresholds.
- guides_per_cell_histogram.png: Histogram of guides per cell.
- cells_per_guide_histogram.png: Histogram of cells per guide.
- guides_UMI_thresholds.png: Simulates the final number of cells with assigned guides at different minimum UMI thresholds (at least one guide > threshold value). Use it to check how many cells would have assigned guides and whether that matches your expected cell count.
- guides_UMI_thresholds.png: Histogram of the number of sgRNAs represented per cell.
- cells_per_htp_barplot.png: Number of cells across different HTOs.
- umap_hto.png: UMAP clustering of cells based on HTOs (the dimensions reflect the distribution of HTOs in each cell).
- umap_hto_singlets.png: UMAP clustering of cells based on HTOs (multiplets removed).
seqSpec Plots:
seqSpec_check_plots.png: The frequency of each nucleotide along Read 1 and Read 2 (use to inspect the expected read parts with their expected signatures).
Structure:
📁 pipeline_dashboard/
├── 📄 dashboard.html
│
├── 📁 evaluation_output/
│ ├── 🖼️ network_plot.png
│ ├── 🖼️ volcano_plot.png
│ ├── 📄 igv.bedgraph
│ └── 📄 igv.bedpe
│
├── 📁 figures/
│ ├── 🖼️ knee_plot_scRNA.png
│ ├── 🖼️ scatterplot_scrna.png
│ ├── 🖼️ violin_plot.png
│ ├── 🖼️ scRNA_barcodes_UMI_thresholds.png
│ ├── 🖼️ guides_per_cell_histogram.png
│ ├── 🖼️ cells_per_guide_histogram.png
│ ├── 🖼️ guides_UMI_thresholds.png
│ ├── 🖼️ cells_per_htp_barplot.png
│ ├── 🖼️ umap_hto.png
│ └── 🖼️ umap_hto_singlets.png
│
├── 📁 guide_seqSpec_plots/
│ └── 🖼️ seqSpec_check_plots.png
│
└── 📁 hashing_seqSpec_plots/
└── 🖼️ seqSpec_check_plots.png
Pipeline Testing Guide
To ensure proper pipeline functionality, we provide two extensively validated datasets for testing purposes.
Available Test Datasets
1. TFPerturbSeq_Pilot Dataset (Gary-Hon Lab)
The TFPerturbSeq_Pilot dataset was generated by the Gary-Hon Lab and is available through the IGVF Data Portal under Analysis Set ID: IGVFDS4389OUWU. To access the fastq files, you need to:
First, register for an account on the IGVF Data Portal to obtain your access credentials.
Once you have your credentials, you can use our provided Python script to download all necessary FASTQ files:
```bash
cd example_data
python download_fastq.py \
    --sample per-sample_file.tsv \
    --access-key YOUR_ACCESS_KEY \
    --secret-key YOUR_SECRET_KEY
```
💡 Note: You'll need to replace YOUR_ACCESS_KEY and YOUR_SECRET_KEY with the credentials from your IGVF portal account. These credentials can be found in your IGVF portal profile settings.
All other required input files for running the pipeline with this dataset are already included in the repository under the example_data directory.
2. Gasperini et al. Dataset
This dataset comes from a large-scale CRISPR screen study published in Cell (Gasperini et al., 2019: "A Genome-wide Framework for Mapping Gene Regulation via Cellular Genetic Screens") and provides an excellent resource for testing the pipeline. The full dataset, including raw sequencing data and processed files, is publicly available through GEO under accession number GSE120861.
Step-by-Step Testing Instructions
Environment Setup:

```bash
# Clone and enter the repository
git clone https://github.com/pinellolab/CRISPR_Pipeline.git
cd CRISPR_Pipeline
```
Choose Your Dataset and Follow the Corresponding Instructions:
#### Option A: TFPerturbSeq_Pilot Dataset

```bash
# Run with LOCAL
nextflow run main.nf \
    -profile local \
    --input samplesheet.tsv \
    --outdir ./outputs/

# Run with SLURM
nextflow run main.nf \
    -profile slurm \
    --input samplesheet.tsv \
    --outdir ./outputs/

# Run with GCP
nextflow run main.nf \
    -profile google \
    --input samplesheet.tsv \
    --outdir gs://igvf-pertub-seq-pipeline-data/scratch/  # Path to your GCP bucket
```
#### Option B: Gasperini Dataset
Set up the configuration files:

```bash
# Copy configuration files and example data
cp example_gasperini/nextflow.config nextflow.config
cp -r example_gasperini/example_data/* example_data/
```

Obtain sequencing data:

- Download a subset of the Gasperini dataset to your own server.
- Place the files in the example_data/fastq_files directory:

```bash
NTHREADS=16
wget https://github.com/10XGenomics/bamtofastq/releases/download/v1.4.1/bamtofastq_linux
chmod +x bamtofastq_linux

wget https://sra-pub-src-1.s3.amazonaws.com/SRR7967488/pilot_highmoi_screen.1_CGTTACCG.grna.bam.1
mv pilot_highmoi_screen.1_CGTTACCG.grna.bam.1 pilot_highmoi_screen.1_CGTTACCG.grna.bam
./bamtofastq_linux --nthreads="$NTHREADS" pilot_highmoi_screen.1_CGTTACCG.grna.bam bam_pilot_guide_1

wget https://sra-pub-src-1.s3.amazonaws.com/SRR7967482/pilot_highmoi_screen.1_SI_GA_G1.bam.1
mv pilot_highmoi_screen.1_SI_GA_G1.bam.1 pilot_highmoi_screen.1_SI_GA_G1.bam
./bamtofastq_linux --nthreads="$NTHREADS" pilot_highmoi_screen.1_SI_GA_G1.bam bam_pilot_scrna_1
```

Now you should see the bam_pilot_guide_1 and bam_pilot_scrna_1 directories inside the example_data/fastq_files directory. Inside bam_pilot_guide_1 and bam_pilot_scrna_1, there are multiple sets of FASTQ files.

Prepare the whitelist:

```bash
# Extract the compressed whitelist file
unzip example_data/yaml_files/3M-february-2018.txt.zip
```

Now you should see 3M-february-2018.txt inside the example_data/yaml_files/ directory.

Launch the pipeline:

```bash
# Run with LOCAL
nextflow run main.nf \
    -profile local \
    --input samplesheet.tsv \
    --outdir ./outputs/

# Run with SLURM
nextflow run main.nf \
    -profile slurm \
    --input samplesheet.tsv \
    --outdir ./outputs/

# Run with GCP
nextflow run main.nf \
    -profile google \
    --input samplesheet.tsv \
    --outdir gs://igvf-pertub-seq-pipeline-data/scratch/  # Path to your GCP bucket
```
Expected Outputs
The pipeline generates two directories upon completion:
- pipeline_outputs: Contains all analysis results
- pipeline_dashboard: Houses interactive visualization reports
Troubleshooting
If you encounter any issues during testing:
1. Review log files and intermediate results in the work/ directory
2. Verify that all input files meet the required format specifications
For additional support or questions, please open an issue on our GitHub repository.
Credits
We thank the following people for their extensive assistance in the development of this pipeline:
Contributions and Support
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don't hesitate to get in touch on the [Slack #fg-crispr channel]
Citations
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
Owner
- Name: Pinello Lab
- Login: pinellolab
- Kind: organization
- Email: lpinello@mgh.harvard.edu
- Location: Boston
- Website: pinellolab.org
- Repositories: 22
- Profile: https://github.com/pinellolab
Massachusetts General Hospital / Harvard Medical School
Citation (CITATIONS.md)
# nf-core/crispr: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
GitHub Events
Total
- Create event: 4
- Release event: 1
- Issues event: 5
- Watch event: 7
- Issue comment event: 12
- Push event: 136
- Public event: 1
- Pull request event: 32
- Fork event: 4
Last Year
- Create event: 4
- Release event: 1
- Issues event: 5
- Watch event: 7
- Issue comment event: 12
- Push event: 136
- Public event: 1
- Pull request event: 32
- Fork event: 4
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 6
- Total pull requests: 19
- Average time to close issues: 2 days
- Average time to close pull requests: 4 days
- Total issue authors: 5
- Total pull request authors: 6
- Average comments per issue: 0.83
- Average comments per pull request: 0.32
- Merged pull requests: 14
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 6
- Pull requests: 19
- Average time to close issues: 2 days
- Average time to close pull requests: 4 days
- Issue authors: 5
- Pull request authors: 6
- Average comments per issue: 0.83
- Average comments per pull request: 0.32
- Merged pull requests: 14
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- LucasSilvaFerreira (2)
- Patrick-McKeever (1)
- zhaoshuoxp (1)
- RaiRuhi (1)
Pull Request Authors
- ian-whaling (6)
- logan-blaine (5)
- LucasSilvaFerreira (3)
- ekatsevi (2)
- alexbarrera (2)
- zhaoshuoxp (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- condaforge/miniforge3 24.11.3-2 build
- PyYAML ==6.0.2 development
- cachetools ==5.5.2 development
- certifi ==2025.8.3 development
- charset-normalizer ==3.4.2 development
- google-api-core ==2.25.1 development
- google-auth ==2.40.3 development
- google-cloud-core ==2.4.3 development
- google-cloud-storage ==3.2.0 development
- google-crc32c ==1.7.1 development
- google-resumable-media ==2.7.2 development
- googleapis-common-protos ==1.70.0 development
- idna ==3.10 development
- numpy ==2.2.6 development
- pandas ==2.3.1 development
- proto-plus ==1.26.1 development
- protobuf ==6.31.1 development
- pyasn1 ==0.6.1 development
- pyasn1_modules ==0.4.2 development
- python-dateutil ==2.9.0.post0 development
- pytz ==2025.2 development
- requests ==2.32.4 development
- rsa ==4.9.1 development
- six ==1.17.0 development
- tqdm ==4.67.1 development
- tzdata ==2025.2 development
- urllib3 ==2.5.0 development