Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: hukai916
- License: mit
- Language: Python
- Default Branch: main
- Size: 6.5 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Table of Contents
Introduction
Pipeline summary
Quick start
Documentation
Credits
Bug report/Support
Citations
Release notes
Introduction
Gene editing via CRISPR/Cas9 technologies has emerged as a promising strategy for treating certain Repeat Expansion Diseases (REDs), including Huntington's Disease (HD), by permanently reducing the length of pathogenic expansions in DNA regions. One crucial step in this process is evaluating the editing results, which involves checking for undesired Insertions and Deletions (INDELs) near the targeted editing sites and determining the edited length of the DNA repeat expansion through sequencing. However, accurately determining the editing outcomes remains a challenge due to PCR artifacts caused by polymerase slippage in repetitive DNA regions, decreased efficiency when amplifying larger fragments, and sequencing errors. URLpipe (UMI-based Repeat Length analysis pipeline) tackles this problem by leveraging Unique Molecular Identifiers (UMIs) to improve the accuracy of inferring gene editing outcomes.
Powered by Nextflow, URLpipe is designed to be user-friendly and portable, enabling execution across various compute infrastructures through Docker/Singularity technologies. URLpipe takes raw fastq files as input and generates statistical tables and plots that summarize the editing outcomes. Below is an overview of the design and the implemented sub-workflows/modules in URLpipe.
The development of the pipeline is guided by nf-core TEMPLATE.
Pipeline Summary
URLpipe supports sequencing reads from both Illumina and Nanopore platforms. The relevant sub-workflows/modules are illustrated in the diagram below. For detailed instructions on configuring your analysis and examples, refer to the usage documentation.
Illumina reads
The Illumina branch of the pipeline is structured into the following sub-workflows, each with a specific role in processing data:
- INPUT_CHECK:
- Validate the input files and configurations to ensure they meet the requirements for analysis.
- PREPROCESS_QC:
- Perform preprocessing and quality control on the raw data.
- Result folder: 1_preprocess and 2_qc_and_umi
- CLASSIFY_READ:
- Categorize reads into different classes to facilitate downstream analysis.
- Result folder: 3_read_category
- REPEAT_STAT_DEFAULT and REPEAT_STAT_MERGE:
- Determine repeat lengths by leveraging UMI.
- Result folder: 4_repeat_statistics
- INDEL_STAT:
- Analyze patterns of insertions and deletions around the repeat region.
- Result folder: 5_indel_statistics
- GET_SUMMARY:
- Generate tables and plots summarizing the editing outcome.
- Result folder: 6_summary
Selected sub-workflows and their functionalities are summarized below. Refer to output - Results for more details.
PREPROCESS_QC:
- Merge fastq files from different lanes (if any) that belong to the same library (1a_lane_merge).
- Extract UMI from each read and append it to the read name (1b_umi_extract).
- Trim adapter sequences (1c_cutadapt).
- Quality control using FastQC (2a_fastqc).
- Quality control by plotting read count per UMI (2b_read_per_umi_cutadapt).
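The UMI extraction and read-count-per-UMI steps above can be pictured with a minimal Python sketch. It is an illustration only, not URLpipe's actual implementation: the UMI position (the first bases of the read), the UMI length of 10, the "_" separator, and the function names `extract_umi` and `reads_per_umi` are all assumptions made for this example.

```python
from collections import Counter

UMI_LENGTH = 10  # assumed UMI length; the real value depends on the library design


def extract_umi(read_name, sequence, quality, umi_length=UMI_LENGTH):
    """Move the first `umi_length` bases of a read into its name (hypothetical layout)."""
    umi = sequence[:umi_length]
    new_name = f"{read_name}_{umi}"  # the "_" separator is an assumption
    return new_name, sequence[umi_length:], quality[umi_length:]


def reads_per_umi(read_names):
    """Count how many reads share each UMI, given names produced by extract_umi."""
    return Counter(name.rsplit("_", 1)[1] for name in read_names)


# Toy reads: (name, sequence); qualities are filled with a constant placeholder.
toy_reads = [
    ("read1", "AAAACCCCGGCAGCAGCAG"),
    ("read2", "AAAACCCCGGCAGCAGCAGCAG"),
    ("read3", "TTTTGGGGAACAGCAG"),
]
tagged = [extract_umi(name, seq, "I" * len(seq)) for name, seq in toy_reads]
counts = reads_per_umi(name for name, _, _ in tagged)

for umi, n in counts.items():
    print(umi, n)
```

A table of reads per UMI like `counts` is the kind of input a read-count-per-UMI QC plot would be drawn from.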
CLASSIFY_READ:
- Determine if a read is mapped to the predefined target region (on-locus) (3a_classify_locus).
- Classify on-locus reads based on the presence of INDELs around the repeat region (non-indel) (3b_classify_indel).
- For each non-indel read, determine if it covers the entire repeat region (readthrough) (3c_classify_readthrough).
The readthrough reads will be used towards determining the repeat lengths.
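The three-step classification can be sketched as a single decision function. This is a simplified illustration under stated assumptions, not the pipeline's code: the category labels, the function name `classify_read`, and the inputs (a precomputed on-locus flag, a precomputed INDEL flag, and 1-based inclusive alignment coordinates on the reference) are hypothetical. The repeat coordinates 69 and 218 are taken from the sample configuration shown in the Quick start section.

```python
def classify_read(on_locus, has_indel, aln_start, aln_end, repeat_start, repeat_end):
    """Assign a read to one category, following the locus -> indel -> readthrough order.

    Coordinates are 1-based and inclusive on the reference (an assumption
    that mirrors how ref_repeat_start/ref_repeat_end are described).
    """
    if not on_locus:
        return "off_locus"
    if has_indel:
        return "indel"
    # A readthrough read must span the whole repeat region.
    if aln_start <= repeat_start and aln_end >= repeat_end:
        return "readthrough"
    return "non_readthrough"


REPEAT_START, REPEAT_END = 69, 218  # values from the sample configuration

examples = {
    "spans_repeat": classify_read(True, False, 1, 250, REPEAT_START, REPEAT_END),
    "stops_inside": classify_read(True, False, 1, 150, REPEAT_START, REPEAT_END),
    "has_indel": classify_read(True, True, 1, 250, REPEAT_START, REPEAT_END),
    "elsewhere": classify_read(False, False, 1, 250, REPEAT_START, REPEAT_END),
}
for label, category in examples.items():
    print(label, category)
```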
REPEAT_STAT_DEFAULT/MERGE:
In URLpipe, repeat length determination can be performed in two modes: DEFAULT mode, which uses only R1 reads, and MERGE mode, which merges R1 and R2 reads. For UMI correction, four methods are currently available: "mode", "mean", "least distance", and "square distance".
- Determine the repeat length distribution (4a_repeat_length_distribution).
- Perform UMI correction to refine repeat length measurements (4a_repeat_length_distribution).
- Plot the repeat length distribution per UMI (4b_4a_repeat_length_distribution_per_umi).
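To make the idea of UMI correction concrete, here is a minimal Python sketch that collapses the reads sharing a UMI into one repeat length, using the "mode" and "mean" methods. It is a sketch under stated assumptions, not URLpipe's implementation: the function name `correct_by_umi`, the tie-breaking rule for "mode", the rounding for "mean", and the `min_reads` filter (loosely analogous to a UMI cutoff) are hypothetical. The "least distance" and "square distance" methods are not defined in this section, so they are left out.

```python
from collections import Counter, defaultdict
from statistics import mean


def correct_by_umi(observations, method="mode", min_reads=1):
    """Collapse (umi, repeat_length) pairs into one length per UMI.

    UMIs supported by fewer than `min_reads` reads are dropped.
    """
    by_umi = defaultdict(list)
    for umi, length in observations:
        by_umi[umi].append(length)

    corrected = {}
    for umi, lengths in by_umi.items():
        if len(lengths) < min_reads:
            continue
        if method == "mode":
            # Most frequent length; ties go to the smaller length (an assumption).
            counts = Counter(lengths)
            top = max(counts.values())
            corrected[umi] = min(l for l, c in counts.items() if c == top)
        elif method == "mean":
            # Rounded to the nearest integer length (an assumption).
            corrected[umi] = round(mean(lengths))
        else:
            raise ValueError(f"method not covered by this sketch: {method}")
    return corrected


# Toy data: each UMI tags one original molecule; PCR slippage adds noise.
observations = [
    ("AAAA", 48), ("AAAA", 48), ("AAAA", 47), ("AAAA", 48),
    ("CCCC", 18), ("CCCC", 17), ("CCCC", 18),
    ("GGGG", 30),
]
by_mode = correct_by_umi(observations, method="mode", min_reads=3)
by_mean = correct_by_umi(observations, method="mean", min_reads=1)
print(by_mode)
print(by_mean)
```

Because all reads with the same UMI derive from one template molecule, disagreement among them reflects amplification or sequencing error, which is why a per-UMI consensus can be closer to the true length than any single read.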
INDEL_STAT:
Gather statistics for reads containing INDELs.
GET_SUMMARY:
Obtain summary statistics from CLASSIFY_READ, REPEAT_STAT_DEFAULT/MERGE, and INDEL_STAT results.
- Generate master statistic tables (6a_master_table).
- Generate summary plots (6b_bin_plot).
Nanopore reads
The setup for the Nanopore branch is quite similar to that of the Illumina branch, with the main difference being the inclusion of an optional PREPROCESS_NANOPORE sub-workflow specifically designed for pre-processing Nanopore data.
Quick start
- Install nextflow (>=23.10.0).
- To avoid potential issues with dependency installation, all URLpipe dependencies are built into images. Therefore, you should install either Docker or Singularity (Apptainer) and specify -profile docker or -profile singularity, respectively, when running URLpipe. Otherwise, you will need to manually install all dependencies and ensure they are available on your local PATH, which is unlikely to be the case!
- Download the pipeline:
    git clone https://github.com/hukai916/URLpipe.git
    cd URLpipe
- Download a minimal test dataset:
- The dataset1 comprises a subset of samples from a CRISPR editing experiment using the HQ50 (Human) cell line. HQ50 cells contain two HTT alleles with differing CAG-repeat lengths: one with approximately 18CAG/20Q and the other with approximately 48CAG/50Q. The objective of the experiment is to examine the editing outcomes when treated with various DNA damage repair inhibitors. For demonstration purposes, six samples (three conditions, each with two replicates) have been selected:
- Two samples with no electroporation (no_E), serving as the unedited control.
- Two samples with no inhibitor (noINH_DMSO), serving as the edited control.
- Two samples with the D103 inhibitor (D103_10uM) to assess its effect on the editing outcome.
    wget https://www.dropbox.com/scl/fi/b4xspm0ydq4y1p8s9u55g/sample_dataset1.zip
    unzip sample_dataset1.zip
- Edit the replace_with_full_path in the "assets/samplesheet_dataset1.csv" file to use the actual full path.
- Test the pipeline with this minimal dataset1:
- At least 8GB memory is recommended for dataset1.
- By default, the local executor (-profile local) will be used, meaning that all jobs will be executed on your local computer. Nextflow supports many other executors, including SLURM and LSF. You can create a profile file to configure which executor to use. Multiple profiles can be supplied separated by commas, e.g. -profile docker,lsf. Please check nf-core/configs to see what other custom configurations can be supplied.
- Example command for running URLpipe with Docker and local executor:
    nextflow run main.nf -c conf/sample_dataset1.config -profile docker,local
By executing the above command:
- The "local executor" (-profile local) will be used.
- The "docker" (-profile docker) will be leveraged.
- The configurations specified via -c conf/sample_dataset1.config will be applied, which includes:
- input = "./assets/samplesheet_dataset1.csv": input samplesheet file path
- outdir = "./results_dataset1": output directory path
- ref = "./assets/IlluminaHsQ50FibTrim_Ref.fa": reference file path
- ref_repeat_start = 69: 1-based repeat start coordinate in reference
- ref_repeat_end = 218: 1-based repeat end coordinate in reference
- ref_repeat_unit = "CAG": repeat unit in reference
- length_mode = "reference_align": repeat length determination method
- umi_cutoffs = "1,3,5,7,10,30,100": UMI cutoffs for correction
- umi_correction_method = "least_distance": UMI correction method
- repeat_bins = "[(0,50), (51,60), (61,137), (138,154), (155,1000)]": number and range of bins to plot
- allele_number = 2: number of alleles in reference
- max_memory = "16.GB": maximum memory to use, do not exceed what your system has
- max_cpus = 16: maximum number of cpu to use, do not exceed what your system has
- max_time = "240.h": maximum running time
- other module-specific configurations
For detailed explanations, refer to usage.
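As a worked check of the sample configuration, the following Python sketch derives the reference repeat size from the 1-based inclusive coordinates and parses the `repeat_bins` and `umi_cutoffs` strings. It only illustrates how the values relate to each other: the helper `assign_bin` and the inclusive bin bounds are assumptions, and this section does not state whether bins are expressed in base pairs or in repeat units.

```python
import ast

# Values copied from the sample configuration above.
ref_repeat_start = 69
ref_repeat_end = 218
ref_repeat_unit = "CAG"
repeat_bins = "[(0,50), (51,60), (61,137), (138,154), (155,1000)]"
umi_cutoffs = "1,3,5,7,10,30,100"

# With 1-based inclusive coordinates, the span is end - start + 1.
span_bp = ref_repeat_end - ref_repeat_start + 1
repeat_units = span_bp // len(ref_repeat_unit)

# The bin string is a Python-style list of tuples, so literal_eval can parse it safely.
bins = ast.literal_eval(repeat_bins)
cutoffs = [int(value) for value in umi_cutoffs.split(",")]


def assign_bin(length, bins):
    """Return the (low, high) bin containing `length`, bounds inclusive (assumed)."""
    for low, high in bins:
        if low <= length <= high:
            return (low, high)
    return None


print(span_bp, repeat_units)
print(assign_bin(54, bins), assign_bin(144, bins))
```

The reference repeat region therefore spans 150 bp, or 50 CAG-sized units, consistent with the roughly 50Q allele described for the HQ50 cell line.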
- Example command for running URLpipe with Singularity and LSF executor:
    nextflow run main.nf -c conf/sample_dataset1.config -profile singularity,lsf
As in the first example, the configuration file is applied, but -profile singularity,lsf directs the pipeline to use Singularity and the LSF executor rather than Docker and the local executor.
- Note, Singularity images will be downloaded and saved to work/singularity directory by default. It is recommended to configure the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir settings to store the images in a central location.
- Run your own analysis:
- Typical commands:
```nextflow
# Supply configurations through command flags
nextflow run main.nf -profile --input --outdir --allele_number <1/2> --length_mode --ref ...
# Or include configurations into a single file, e.g. test.config
nextflow run main.nf -profile
```
- For help: # todo
    nextflow run main.nf --help
See documentation usage for all of the available options.
Documentation
The URLpipe workflow includes comprehensive documentation, covering both usage and output.
Credits
URLpipe was originally designed and written by Kai Hu, Michael Brodsky, and Lihua Julie Zhu. We also extend our gratitude to Rui Li, Haibo Liu, and Junhui Li for their extensive assistance in the development of this tool.
Bug report/Support
For help, bug reports, or feature requests, please create a GitHub issue by clicking here. If you would like to extend URLpipe for your own use, feel free to fork the repository.
Citations: todo
Please cite URLpipe if you use it for your research.
A Template of Method can be found here.
A complete list of references for the tools used by URLpipe can be found here.
Release notes
v0.1.0
* initial release
Owner
- Login: hukai916
- Kind: user
- Repositories: 6
- Profile: https://github.com/hukai916
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use `nf-core tools` in your work, please cite the `nf-core` publication"
authors:
- family-names: Ewels
given-names: Philip
- family-names: Peltzer
given-names: Alexander
- family-names: Fillinger
given-names: Sven
- family-names: Patel
given-names: Harshil
- family-names: Alneberg
given-names: Johannes
- family-names: Wilm
given-names: Andreas
- family-names: Ulysse Garcia
given-names: Maxime
- family-names: Di Tommaso
given-names: Paolo
- family-names: Nahnsen
given-names: Sven
title: "The nf-core framework for community-curated bioinformatics pipelines."
version: 2.4.1
doi: 10.1038/s41587-020-0439-x
date-released: 2022-05-16
url: https://github.com/nf-core/tools
prefered-citation:
type: article
authors:
- family-names: Ewels
given-names: Philip
- family-names: Peltzer
given-names: Alexander
- family-names: Fillinger
given-names: Sven
- family-names: Patel
given-names: Harshil
- family-names: Alneberg
given-names: Johannes
- family-names: Wilm
given-names: Andreas
- family-names: Ulysse Garcia
given-names: Maxime
- family-names: Di Tommaso
given-names: Paolo
- family-names: Nahnsen
given-names: Sven
doi: 10.1038/s41587-020-0439-x
journal: nature biotechnology
start: 276
end: 278
title: "The nf-core framework for community-curated bioinformatics pipelines."
issue: 3
volume: 38
year: 2020
url: https://dx.doi.org/10.1038/s41587-020-0439-x