crispr_analysis
A pipeline that facilitates the CRISPR analysis
Repository
Basic Info
- Host: GitHub
- Owner: Leran10
- License: mit
- Language: Python
- Default Branch: main
- Size: 27.3 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
CRISPR Screen Analysis Pipeline
A Snakemake-based pipeline for CRISPR screen analysis, integrating BBDuk.sh for trimming and MAGeCK for statistical analysis.
Overview
This pipeline processes CRISPR screening data through the following steps:
1. Read trimming and filtering using BBDuk.sh (from BBTools)
2. sgRNA counting and analysis using MAGeCK
Requirements
- Conda or Mamba
- Snakemake
- 16+ GB RAM recommended
Installation
Clone this repository:
```bash
git clone https://github.com/username/crispr_analysis.git
cd crispr_analysis
```
Create and activate the conda environment:
```bash
conda env create -f environment.yaml
conda activate crispr_analysis
```
Input Data
FASTQ Files
Place your FASTQ files in the main directory. The sample names in config.yaml should match the prefixes of your FASTQ files.
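As a quick sanity check before running the pipeline, the prefix rule above can be tested with a few lines of Python. This is only a sketch: `find_missing_samples` is a hypothetical helper that is not part of the pipeline, and it assumes uncompressed files ending in `.fastq`.

```python
from pathlib import Path


def find_missing_samples(samples, fastq_dir="."):
    """Return the sample names that no .fastq file in fastq_dir starts with."""
    fastq_names = [
        p.name
        for p in Path(fastq_dir).iterdir()
        if p.is_file() and p.name.endswith(".fastq")
    ]
    return [s for s in samples if not any(n.startswith(s) for n in fastq_names)]


# Hypothetical sample names; in practice these would come from config.yaml.
print("Samples without a FASTQ file:", find_missing_samples(["SampleA", "SampleB"]))
```

An empty list means every configured sample has at least one matching FASTQ file.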
CRISPR Library Files
The pipeline provides two ways to specify the sgRNA library:
1. Use a predefined library by name:
   - Predefined libraries are stored in the `libraries/` directory
   - Each library includes optimal trimming parameters for that specific library design
   - In `config.yaml`, specify the library by name:
     ```yaml
     # Use a predefined library
     library_name: "GeCKOv2_Human"
     ```
2. Provide a custom library file:
   - Create a library file in MAGeCK format (tab-delimited):
     - Column 1: sgRNA ID
     - Column 2: sgRNA sequence
     - Column 3: Gene ID
   - In `config.yaml`, specify the path to your file:
     ```yaml
     # Use a custom library file
     library_file: "path/to/your/library.txt"
     ```

Available predefined libraries include GeCKOv2 (human/mouse) and Brunello. See libraries/README.md for a complete list.
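
A custom library file can be checked against the three-column layout described above before it is handed to the pipeline. The sketch below is not part of the pipeline; `check_library` is a hypothetical helper, and it assumes the file has no header row and that sgRNA sequences contain only A, C, G and T.

```python
import csv
import io


def check_library(handle):
    """Check rows of sgRNA ID, sgRNA sequence, gene ID; return a list of problems."""
    problems = []
    seen_ids = set()
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            problems.append(f"line {line_no}: expected 3 columns, found {len(row)}")
            continue
        sgrna_id, sequence, _gene = row
        if sgrna_id in seen_ids:
            problems.append(f"line {line_no}: duplicate sgRNA ID {sgrna_id!r}")
        seen_ids.add(sgrna_id)
        if not sequence or set(sequence.upper()) - set("ACGT"):
            problems.append(f"line {line_no}: invalid sequence {sequence!r}")
    return problems


# Made-up rows: the second has an ambiguous base, the third lacks a gene column.
example = "s1\tACGTACGTACGTACGTACGT\tGENE1\ns2\tACGTNNGT\tGENE1\ns3\tTTTT\n"
for problem in check_library(io.StringIO(example)):
    print(problem)
```

For a real file, pass an open file handle instead of the `io.StringIO` example.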
Configuration
Edit the config.yaml file to customize the pipeline:
- Samples: List all sample file names (without the `.fastq` extension)
- Sample Labels: Map full sample names to shorter labels for reporting
- Comparisons: Define test vs. control sample comparisons for MAGeCK analysis:
  ```yaml
  comparisons:
    SampleAvsB:           # Name for this comparison (will be used in output filenames)
      test: "SampleB"     # Treatment sample(s) - passed to MAGeCK with -t flag
      control: "SampleA"  # Control sample(s) - passed to MAGeCK with -c flag
  ```
  You can define multiple comparisons, and the pipeline will run separate MAGeCK analyses for each one. For each comparison, specify which sample is the test (experimental) group and which is the control (reference) group.
- Library Selection (choose one option):
  - Predefined library:
    ```yaml
    library_name: "GeCKOv2_Human"  # Uses optimal parameters for this library design
    ```
  - Custom library file:
    ```yaml
    library_file: "path/to/your/library.txt"
    ```
- BBDuk Trimming Parameters (critical to customize for your CRISPR construct - automatically set if using a predefined library):
  - `vector_sequence`: The 5' vector sequence preceding your sgRNA (for left trimming)
  - `backbone_sequence`: The 3' backbone sequence following your sgRNA (for right trimming)
  - Additional parameters for fine-tuning:
    - `bbduk_kmer_length`: Length of k-mers for matching (default: 8)
    - `bbduk_reverse_complement`: Whether to consider reverse complement (f=false, t=true)
    - `bbduk_mismatch`: Allow mismatches? (f=false, t=true)
    - `bbduk_hamming_distance`: Number of mismatches allowed (0=exact match)
    - `bbduk_min_length` / `bbduk_max_length`: Expected length of sgRNA after trimming
- Resources: Configure computational resources (threads, memory)
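
The rules above (one library option, flanking sequences for custom libraries, a consistent length window) can be expressed as a small pre-flight check. This is a sketch under assumptions, not the pipeline's own validation: `validate_config` is a hypothetical helper operating on a dict as it might be loaded from `config.yaml`, and treating the flanking sequences as mandatory for custom libraries is an interpretation of the text above.

```python
def validate_config(config):
    """Return a list of problems found in a dict loaded from config.yaml."""
    problems = []
    has_name = bool(config.get("library_name"))
    has_file = bool(config.get("library_file"))
    if has_name == has_file:
        problems.append("set exactly one of library_name or library_file")
    # Assumption: a custom library file brings no trimming defaults,
    # so the flanking sequences have to be given explicitly.
    if has_file and not has_name:
        for key in ("vector_sequence", "backbone_sequence"):
            if not config.get(key):
                problems.append(f"{key} is required with a custom library file")
    min_len = config.get("bbduk_min_length")
    max_len = config.get("bbduk_max_length")
    if min_len is not None and max_len is not None and min_len > max_len:
        problems.append("bbduk_min_length must not exceed bbduk_max_length")
    return problems


# Example: a custom library without flanking sequences yields two problems.
for problem in validate_config({"library_file": "path/to/your/library.txt"}):
    print(problem)
```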
Running the Pipeline
Dry run to check the workflow without executing it:
```bash
snakemake -n
```
Run the pipeline:
```bash
snakemake --cores <number_of_cores>
```
Run with conda environments (recommended):
```bash
snakemake --cores <number_of_cores> --use-conda
```
For cluster execution (SLURM example; the `--cluster` flag applies to Snakemake 7 and earlier, since Snakemake 8 replaced it with executor plugins):
```bash
snakemake --cluster "sbatch --ntasks={threads} --mem={resources.mem_mb}M" --jobs 20
```
Pipeline Steps
BBDuk Trimming
The pipeline trims reads in 4 steps:
1. Left filter: Filter reads containing the vector sequence
2. Left trim: Trim off the vector sequence
3. Right filter: Filter reads containing the backbone sequence
4. Right trim: Trim off the backbone sequence and keep only reads of the expected length
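
To make the effect of the four steps concrete, here is a pure-Python illustration on a single read. It is a conceptual sketch only: it uses exact string matching rather than BBDuk's k-mer matching, it reads "filter" as "keep reads that contain the sequence", and `extract_sgrna` is a hypothetical name that does not appear in the pipeline.

```python
def extract_sgrna(read, vector, backbone, min_length=20, max_length=20):
    """Return the sgRNA between vector and backbone, or None if the read is dropped."""
    # Steps 1-2: keep reads containing the vector, then trim it and everything 5' of it.
    start = read.find(vector)
    if start == -1:
        return None
    read = read[start + len(vector):]
    # Steps 3-4: keep reads containing the backbone, then trim it and everything 3' of it.
    end = read.find(backbone)
    if end == -1:
        return None
    read = read[:end]
    # Final length filter: only inserts of the expected sgRNA length survive.
    return read if min_length <= len(read) <= max_length else None


# Made-up read: 4 nt of upstream sequence, vector, 20 nt sgRNA, backbone, 4 nt downstream.
read = "TTTT" + "CACCG" + "ACGTACGTACGTACGTACGT" + "GTTTAAGAGC" + "AAAA"
print(extract_sgrna(read, vector="CACCG", backbone="GTTTAAGAGC"))
```

Reads that lack either flanking sequence, or whose insert falls outside the length window, return `None`, which mirrors why incorrect flanking sequences lead to low retention.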
MAGeCK Analysis
- Count: Count sgRNAs in each sample using the processed FASTQ files
- Test: Perform statistical analysis of sgRNA enrichment/depletion
For the MAGeCK test step, the pipeline determines which samples are test (treatment) and which are control samples from your config.yaml file. In the configuration file, you define comparisons like this:
```yaml
comparisons:
  VirusVsControl:     # Name of the comparison
    test: "Virus"     # Treatment sample (will be passed to MAGeCK with -t flag)
    control: "Mock"   # Control sample (will be passed to MAGeCK with -c flag)
  DrugVsVirus:
    test: "Drug"
    control: "Virus"
```
You can also specify multiple samples for each group by using a list:
```yaml
comparisons:
  MultipleReplicates:
    test: ["VirusRep1", "VirusRep2", "VirusRep3"]     # Multiple treatment replicates
    control: ["MockRep1", "MockRep2", "MockRep3"]     # Multiple control replicates
```
The pipeline will automatically run separate MAGeCK test analyses for each comparison you define. This allows you to perform multiple comparisons in a single pipeline run.
Additional MAGeCK parameters can be configured in the config.yaml file:
```yaml
# MAGeCK parameters
mageck_fdr: 0.05                       # False discovery rate threshold
mageck_normalization_method: "median"  # Options: median, total, control
```
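
The way a comparison entry turns into a `mageck test` call can be sketched as follows. This is an illustration rather than the pipeline's actual rule: `as_list` and `mageck_test_command` are hypothetical helpers, and the count-table path and the use of the comparison name as output prefix are assumptions. The flags themselves (`-k`, `-t`, `-c`, `-n`, `--norm-method`) are standard `mageck test` options.

```python
def as_list(value):
    """Accept either a single sample name or a list of names."""
    return [value] if isinstance(value, str) else list(value)


def mageck_test_command(name, comparison, count_table, norm_method="median"):
    """Build the argument list for one `mageck test` call."""
    return [
        "mageck", "test",
        "-k", count_table,                                # sgRNA count table
        "-t", ",".join(as_list(comparison["test"])),      # treatment samples
        "-c", ",".join(as_list(comparison["control"])),   # control samples
        "-n", name,                                       # output prefix (assumed)
        "--norm-method", norm_method,
    ]


comparisons = {
    "VirusVsControl": {"test": "Virus", "control": "Mock"},
    "MultipleReplicates": {
        "test": ["VirusRep1", "VirusRep2", "VirusRep3"],
        "control": ["MockRep1", "MockRep2", "MockRep3"],
    },
}
for name, comparison in comparisons.items():
    # Assumed count-table location, taken from the Output section of this README.
    cmd = mageck_test_command(name, comparison, "results/mageck_output/mageck_counts.count.txt")
    print(" ".join(cmd))
```

Replicates are joined with commas, which is how MAGeCK expects multiple samples per group.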
Output
Results are saved in the results directory (or as configured in config.yaml):
BBDuk Results:
- `results/bbduk_output/`
  - `*_rtrim.fastq`: Final trimmed FASTQ files
  - `logs/`: BBDuk log files
  - `bbduk_summary_report.html`: Interactive HTML report with statistics and visualizations
  - `bbduk_summary_stats.tsv`: Tab-separated file with detailed statistics
  - `figures/`: Directory containing plots from the summary report

MAGeCK Results:
- `results/mageck_output/`
  - `mageck_counts.count.txt`: sgRNA count table
  - `*.gene_summary.txt`: Gene-level statistics for each comparison
  - `*.sgrna_summary.txt`: sgRNA-level statistics for each comparison
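
Once a run has finished, the gene-level tables can be filtered with the standard library alone. The sketch below assumes the usual MAGeCK gene summary columns `id`, `neg|fdr` and `pos|fdr`; `significant_genes` is a hypothetical helper, and the example table is made up for illustration.

```python
import csv
import io


def significant_genes(handle, fdr_column="neg|fdr", threshold=0.05):
    """Return the gene IDs whose value in fdr_column is below threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["id"] for row in reader if float(row[fdr_column]) < threshold]


# Made-up three-column excerpt; a real gene summary has many more columns.
example = "id\tneg|fdr\tpos|fdr\nGENE1\t0.001\t0.98\nGENE2\t0.40\t0.02\n"
print(significant_genes(io.StringIO(example)))                        # depleted genes
print(significant_genes(io.StringIO(example), fdr_column="pos|fdr"))  # enriched genes
```

For real output, open the `*.gene_summary.txt` file of a comparison and pass the file handle in place of the `io.StringIO` example.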
BBDuk Summary Report
The pipeline automatically generates a comprehensive summary report after the BBDuk processing steps. This HTML report includes:
Processing Statistics:
- Initial read counts for each sample
- Reads retained at each processing step
- Final retention rates
- Detailed statistics from BBDuk logs
Visualizations:
- Bar plots showing read counts at each stage
- Retention percentage for each sample
- Reference line at 50% to identify potential issues
Troubleshooting Guidance:
- Suggestions for improving retention rates
- Links to diagnostic information
This report helps you diagnose any potential issues with the trimming process and verify that the library-specific trimming parameters are correct for your data.
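
The retention percentage and the 50% reference line reduce to simple arithmetic. The following sketch shows the calculation under the assumption that retention is final reads divided by initial reads; `retention_report` is a hypothetical helper and the read counts are invented, not taken from the report code.

```python
def retention_report(read_counts, warn_below=50.0):
    """Map sample -> (retention percentage, flagged) from (initial, final) read counts."""
    report = {}
    for sample, (initial, final) in read_counts.items():
        pct = 100.0 * final / initial if initial else 0.0
        report[sample] = (round(pct, 1), pct < warn_below)
    return report


# Invented read counts: (reads before trimming, reads after the right-trim step).
counts = {"SampleA": (1000, 800), "SampleB": (1000, 300)}
for sample, (pct, flagged) in retention_report(counts).items():
    print(f"{sample}: {pct}% retained" + (" (below reference line)" if flagged else ""))
```

A sample flagged here would be a candidate for re-checking `vector_sequence` and `backbone_sequence`.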
Customizing for Different CRISPR Libraries
The most important parameters to customize are the vector and backbone sequences that surround your sgRNA sequence. Different CRISPR libraries use different vector designs. Here are examples for common CRISPR libraries:
Example: Customizing for GeCKO v2 Library
```yaml
vector_sequence: "CACCG"           # 5' vector sequence
backbone_sequence: "GTTTAAGAGC"    # 3' backbone sequence
bbduk_min_length: 20               # sgRNA length
bbduk_max_length: 20
```
Example: Customizing for Brunello Library
```yaml
vector_sequence: "AAACACCG"
backbone_sequence: "GTTTAAGAGC"
bbduk_min_length: 20
bbduk_max_length: 20
```
Example: Allowing 1 Mismatch
If your data has sequencing errors, you might want to allow 1 mismatch:
```yaml
bbduk_mismatch: "t"
bbduk_hamming_distance: 1
```
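
What a Hamming distance of 1 means for matching can be shown with a short, self-contained example. This is a conceptual sketch only: BBDuk applies the distance to k-mers internally, whereas the hypothetical helpers below compare the whole flanking sequence against each window of the read.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def find_with_mismatches(read, pattern, max_mismatches=0):
    """Return the first offset where pattern matches read within max_mismatches, or -1."""
    for offset in range(len(read) - len(pattern) + 1):
        if hamming(read[offset:offset + len(pattern)], pattern) <= max_mismatches:
            return offset
    return -1


# Made-up read whose backbone copy carries one sequencing error (A read as T).
read = "TTTT" + "GTTTAAGTGC" + "AAAA"
print(find_with_mismatches(read, "GTTTAAGAGC", max_mismatches=0))  # exact matching
print(find_with_mismatches(read, "GTTTAAGAGC", max_mismatches=1))  # one mismatch allowed
```

With exact matching the read would be discarded, while allowing one mismatch recovers it. The trade-off is a higher chance of matching the wrong position when the flanking sequence is short.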
Troubleshooting
- BBDuk errors: Check logs in `results/bbduk_output/logs/`
- Missing library file: Ensure the CRISPR library file exists and is formatted correctly
- Memory issues: Increase memory in `config.yaml` for bbduk or mageck as needed
- Low sgRNA counts: Verify that the vector and backbone sequences match your library design
Citation
If you use this pipeline, please cite:
- BBTools/BBDuk: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/
- MAGeCK: Li W, et al. "MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens." Genome Biology 15.12 (2014): 554.
- Snakemake: Köster, J. and Rahmann, S. "Snakemake - A scalable bioinformatics workflow engine." Bioinformatics 28.19 (2012): 2520-2522.
Owner
- Login: Leran10
- Kind: user
- Repositories: 1
- Profile: https://github.com/Leran10
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Wang"
    given-names: "Leran"
    orcid: "https://orcid.org/YOUR-ORCID-ID"  # Replace with your ORCID if you have one
title: "TE-Analysis-Pipeline"
version: 1.0.0
date-released: 2025-03-05
url: "https://github.com/yourusername/te-analysis-pipeline"
repository-code: "https://github.com/yourusername/te-analysis-pipeline"
license: "MIT"
description: "A Snakemake pipeline for transposable element analysis in RNA-seq data using STAR, RepeatMasker, and TEtranscripts"
keywords:
  - transposable-elements
  - rna-seq
  - bioinformatics
  - differential-expression
  - repeatmasker
  - tetranscripts
  - snakemake
references:
  - authors:
      - family-names: "Jin"
        given-names: "Y"
      - family-names: "Tam"
        given-names: "OH"
      - family-names: "Paniagua"
        given-names: "E"
      - family-names: "Hammell"
        given-names: "M"
    title: "TEtranscripts: a package for including transposable elements in differential expression analysis of RNA-seq datasets"
    type: article
    doi: "10.1093/bioinformatics/btv422"
    year: 2015
    journal: "Bioinformatics"
  - authors:
      - family-names: "Köster"
        given-names: "Johannes"
      - family-names: "Rahmann"
        given-names: "Sven"
    title: "Snakemake—a scalable bioinformatics workflow engine"
    type: article
    doi: "10.1093/bioinformatics/bts480"
    year: 2012
    journal: "Bioinformatics"
GitHub Events
Total
- Push event: 1
- Create event: 2
Last Year
- Push event: 1
- Create event: 2