urlpipe

https://github.com/hukai916/urlpipe

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (13.8%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: hukai916
License: mit
Language: Python
Default Branch: main
Size: 6.5 MB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created almost 4 years ago · Last pushed almost 2 years ago

Metadata Files

Readme Changelog License Citation

README.md

Introduction
Pipeline summary
Quick start
Documentation
Credits
Bug report/Support
Citations
Release notes

Introduction

Gene editing via CRISPR/Cas9 technologies has emerged as a promising strategy for treating certain Repeat Expansion Diseases (REDs), including Huntington's Disease (HD), by permanently reducing the length of pathogenic expansions in DNA regions. One crucial step in this process is evaluating the editing results, which involves checking for undesired Insertions and Deletions (INDELs) near the targeted editing sites and determining the edited length of the DNA repeat expansion through sequencing. However, accurately determining the editing outcomes remains a challenge due to PCR artifacts caused by polymerase slippage in repetitive DNA regions, decreased efficiency when amplifying larger fragments, and sequencing errors. URLpipe (UMI-based Repeat Length analysis pipeline) tackles this problem by leveraging Unique Molecular Identifiers (UMIs) to improve the accuracy of inferring gene editing outcomes.

Powered by Nextflow, URLpipe is designed to be user-friendly and portable, enabling execution across various compute infrastructures through Docker/Singularity technologies. URLpipe takes raw fastq files as input and generates statistical tables and plots that summarize the editing outcomes. Below is an overview of the design and the implemented sub-workflows/modules in URLpipe.

The development of the pipeline is guided by nf-core TEMPLATE.

Pipeline Summary

URLpipe supports sequencing reads from both Illumina and Nanopore platforms. The relevant sub-workflows/modules are illustrated in the diagram below. For detailed instructions on configuring your analysis and examples, refer to the usage documentation.

Illumina reads

The Illumina branch of the pipeline is structured into distinct sub-workflows, each with a specific role in processing data:
- INPUT_CHECK:
  - Validate the input files and configurations to ensure they meet the requirements for analysis.
- PREPROCESS_QC:
  - Perform preprocessing and quality control on the raw data.
  - Result folder: 1_preprocess and 2_qc_and_umi
- CLASSIFY_READ:
  - Categorize reads into different classes to facilitate downstream analysis.
  - Result folder: 3_read_category
- REPEAT_STAT_DEFAULT and REPEAT_STAT_MERGE:
  - Determine repeat lengths by leveraging UMIs.
  - Result folder: 4_repeat_statistics
- INDEL_STAT:
  - Analyze patterns of insertions and deletions around the repeat region.
  - Result folder: 5_indel_statistics
- GET_SUMMARY:
  - Generate tables and plots summarizing the editing outcome.
  - Result folder: 6_summary

Selected sub-workflows and their functionalities are summarized below. Refer to output - Results for more details.

PREPROCESS_QC:

Merge fastq files from different lanes (if any) that belong to the same library (1a_lane_merge).
Extract UMI from each read and append it to the read name (1b_umi_extract).
Trim adapter sequences (1c_cutadapt).
Quality control using FastQC (2a_fastqc).
Quality control by plotting read count per UMI (2b_read_per_umi_cutadapt).
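The UMI extraction and read-per-UMI steps above can be illustrated with a minimal Python sketch. It is not URLpipe's implementation: the UMI position (the first bases of the read), the UMI length of 10, and the `_` separator used to append the UMI to the read name are all assumptions made for illustration, as are the function names `extract_umi` and `reads_per_umi`.

```python
from collections import Counter


def extract_umi(name, seq, qual, umi_len=10):
    """Move the first `umi_len` bases of a read into its name.

    Assumption: the UMI sits at the 5' end of the read and is appended
    to the read identifier with an underscore.
    """
    umi = seq[:umi_len]
    # Only the first whitespace-delimited token is the read identifier.
    head, _, rest = name.partition(" ")
    new_name = f"{head}_{umi}" + (f" {rest}" if rest else "")
    return new_name, seq[umi_len:], qual[umi_len:]


def reads_per_umi(names):
    """Count reads per UMI, reading the UMI back from the read name."""
    return Counter(n.split(" ")[0].rsplit("_", 1)[1] for n in names)


# Toy records: (name, sequence, quality). The first two share a UMI.
records = [
    ("@read1 1:N:0", "ACGTACGTAACAGCAGCAG", "I" * 19),
    ("@read2 1:N:0", "ACGTACGTAACAGCAGCAGCAG", "I" * 22),
    ("@read3 1:N:0", "TTTTGGGGCCCAGCAGCAG", "I" * 19),
]
tagged = [extract_umi(*record) for record in records]
counts = reads_per_umi(name for name, _, _ in tagged)
```

A histogram of the values in `counts` is the kind of read-count-per-UMI view that the 2b_read_per_umi_cutadapt step plots.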

CLASSIFY_READ:

Determine if read is mapped to the predefined target region (on-locus) (3a_classify_locus).
Classify on-locus reads based on the presence of INDELs around the repeat region (non-indel) (3b_classify_indel).
For each non-indel read, determine if it covers the entire repeat region (readthrough) (3c_classify_readthrough).

The readthrough reads are then used to determine the repeat lengths.
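The three classification steps form a cascade: each step only looks at reads that passed the previous one. The sketch below shows that decision order only. It assumes the per-read flags have already been computed (for example from an alignment), and the labels `off_locus` and `non_readthrough` are hypothetical names chosen for this example rather than URLpipe's own category names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ReadFlags:
    """Per-read properties, assumed to be computed upstream."""
    on_locus: bool       # read maps to the predefined target region
    has_indel: bool      # INDEL detected around the repeat region
    covers_repeat: bool  # read spans the entire repeat region


def classify(read):
    """Apply the locus -> indel -> readthrough cascade to one read."""
    if not read.on_locus:
        return "off_locus"
    if read.has_indel:
        return "indel"
    if read.covers_repeat:
        return "readthrough"
    return "non_readthrough"


reads = [
    ReadFlags(True, False, True),
    ReadFlags(True, False, False),
    ReadFlags(True, True, True),
    ReadFlags(False, False, False),
    ReadFlags(True, False, True),
]
summary = Counter(classify(r) for r in reads)
```

Only the reads labelled `readthrough` in `summary` would be carried forward to repeat length determination.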

REPEAT_STAT_DEFAULT/MERGE:

In URLpipe, repeat length determination can be performed in two modes: DEFAULT mode, which uses only R1 reads, and MERGE mode, which merges R1 and R2 reads. For UMI correction, four methods are currently available: "mode", "mean", "least distance", and "square distance".

Determine the repeat length distribution (4a_repeat_length_distribution).
Perform UMI correction to refine repeat length measurements (4a_repeat_length_distribution).
Plot the repeat length distribution per UMI (4b_4a_repeat_length_distribution_per_umi).
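To make the idea of UMI correction concrete, here is a minimal Python sketch that collapses the repeat lengths observed for one UMI into a single value. The README names the four methods but does not define them, so the definitions below are this example's interpretation, not URLpipe's code: "mode" takes the most frequent length, "mean" the rounded average, "least distance" the observed length with the smallest sum of absolute differences to all reads, and "square distance" the observed length with the smallest sum of squared differences. Treating a UMI cutoff as a minimum number of reads per UMI is likewise an assumption.

```python
from collections import Counter, defaultdict


def correct_umi(lengths, method="mode"):
    """Collapse the repeat lengths of one UMI group into one value."""
    candidates = sorted(set(lengths))
    if method == "mode":
        counts = Counter(lengths)
        top = max(counts.values())
        # Ties are broken in favour of the shorter length (arbitrary choice).
        return min(n for n, c in counts.items() if c == top)
    if method == "mean":
        return round(sum(lengths) / len(lengths))
    if method == "least_distance":
        return min(candidates, key=lambda c: sum(abs(c - n) for n in lengths))
    if method == "square_distance":
        return min(candidates, key=lambda c: sum((c - n) ** 2 for n in lengths))
    raise ValueError(f"unknown method: {method}")


def correct_by_umi(reads, cutoff=1, method="mode"):
    """Group (umi, length) pairs by UMI and correct each group.

    UMIs supported by fewer than `cutoff` reads are dropped.
    """
    groups = defaultdict(list)
    for umi, length in reads:
        groups[umi].append(length)
    return {
        umi: correct_umi(lengths, method)
        for umi, lengths in groups.items()
        if len(lengths) >= cutoff
    }


reads = [
    ("AAAC", 48), ("AAAC", 48), ("AAAC", 47), ("AAAC", 48), ("AAAC", 30),
    ("GGTT", 18), ("GGTT", 18), ("GGTT", 18),
    ("CCCA", 20),
]
corrected = correct_by_umi(reads, cutoff=3, method="mode")
```

With the `AAAC` group above, the four interpretations give different answers (48, 44, 48, and 47 respectively), which shows why the choice of method matters when a UMI group contains an outlier such as the length-30 read.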

INDEL_STAT:

Gather statistics for reads containing INDELs.

GET_SUMMARY:

Obtain summary statistics from CLASSIFY_READ, REPEAT_STAT_DEFAULT/MERGE, and INDEL_STAT results.

Generate master statistic tables (6a_master_table).
Generate summary plots (6b_bin_plot).
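The bin plots summarize how many repeat length measurements fall into each user-defined range. The sketch below parses a bin specification in the same textual form as the `repeat_bins` value in the sample configuration and counts lengths per bin. Treating both bounds as inclusive is an assumption (suggested by adjacent bins such as (0,50) and (51,60)), the input lengths are invented, and the functions `parse_bins` and `bin_counts` are illustrative names, not part of URLpipe.

```python
import ast
from collections import Counter


def parse_bins(spec):
    """Parse a string such as "[(0,50), (51,60)]" into integer pairs."""
    return [(int(lo), int(hi)) for lo, hi in ast.literal_eval(spec)]


def bin_counts(lengths, bins):
    """Count lengths per bin, assuming inclusive lower and upper bounds."""
    counts = Counter()
    for n in lengths:
        for lo, hi in bins:
            if lo <= n <= hi:
                counts[(lo, hi)] += 1
                break  # bins are assumed not to overlap
    return [(b, counts[b]) for b in bins]


bins = parse_bins("[(0,50), (51,60), (61,137), (138,154), (155,1000)]")
# Invented lengths, in whatever unit the pipeline reports.
lengths = [54, 54, 144, 147, 120, 30, 500]
table = bin_counts(lengths, bins)
```

Each entry of `table` pairs a bin with its count, in the order the bins were given, which is a convenient shape for a bar plot.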

Nanopore reads

The setup for the Nanopore branch is quite similar to that of the Illumina branch, with the main difference being the inclusion of an optional PREPROCESS_NANOPORE sub-workflow specifically designed for pre-processing Nanopore data.

Quick start

Install Nextflow (>=23.10.0).
To avoid potential issues with dependency installation, all URLpipe dependencies are built into images. Therefore, you should install either Docker or Singularity (Apptainer) and specify -profile docker or -profile singularity, respectively, when running URLpipe. Otherwise, you will need to manually install all dependencies and ensure they are available on your local PATH, which is unlikely to be the case!
Download the pipeline:
```bash
git clone https://github.com/hukai916/URLpipe.git
cd URLpipe
```
Download a minimal test dataset:
- The dataset1 comprises a subset of samples from a CRISPR editing experiment using the HQ50 (Human) cell line. HQ50 cells contain two HTT alleles with differing CAG-repeat lengths: one with approximately 18CAG/20Q and the other with approximately 48CAG/50Q. The objective of the experiment is to examine the editing outcomes when treated with various DNA damage repair inhibitors. For demonstration purposes, six samples (three conditions, each with two replicates) have been selected:
- Two samples with no electroporation (no_E), serving as the unedited control.
- Two samples with no inhibitor (noINH_DMSO), serving as the edited control.
- Two samples with the D103 inhibitor (D103_10uM) to assess its effect on the editing outcome.

```bash
wget https://www.dropbox.com/scl/fi/b4xspm0ydq4y1p8s9u55g/sample_dataset1.zip
unzip sample_dataset1.zip
```

Replace the replace_with_full_path placeholder in the "assets/samplesheet_dataset1.csv" file with the actual full path to the downloaded data.
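If you prefer to script this step rather than edit the file by hand, a plain text substitution is enough. The sketch below is one way to do it in Python. It treats the samplesheet as text and does not depend on its columns; the column names and file names in the demonstration are invented, and `fill_samplesheet_paths` is a helper written for this example, not something shipped with URLpipe.

```python
import tempfile
from pathlib import Path


def fill_samplesheet_paths(samplesheet, data_dir,
                           placeholder="replace_with_full_path"):
    """Replace every placeholder in the samplesheet with an absolute path.

    Returns the number of placeholders that were replaced.
    """
    sheet = Path(samplesheet)
    full_path = str(Path(data_dir).resolve())
    text = sheet.read_text()
    n_replaced = text.count(placeholder)
    sheet.write_text(text.replace(placeholder, full_path))
    return n_replaced


# Demonstration on a throwaway file with made-up columns.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "samplesheet_demo.csv"
    demo.write_text(
        "sample,fastq_1\n"
        "no_E_rep1,replace_with_full_path/no_E_rep1_R1.fastq.gz\n"
    )
    replaced = fill_samplesheet_paths(demo, tmp)
    demo_text = demo.read_text()
```

For the real file, the call would look like `fill_samplesheet_paths("assets/samplesheet_dataset1.csv", data_dir)`, where `data_dir` is wherever you unzipped the dataset.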
Test the pipeline with this minimal dataset1:

At least 8 GB of memory is recommended for dataset1.
By default, the local executor (-profile local) will be used, meaning that all jobs will be executed on your local computer. Nextflow supports many other executors, including SLURM and LSF. You can create a profile file to configure which executor to use. Multiple profiles can be supplied separated by commas, e.g. -profile docker,lsf.
Please check nf-core/configs to see what other custom configurations can be supplied.
Example command for running URLpipe with Docker and the local executor:
```nextflow
nextflow run main.nf -c conf/sample_dataset1.config -profile docker,local
```

By executing the above command:
- The local executor (-profile local) will be used.
- Docker (-profile docker) will be used to provide the dependencies.
- The configurations specified via -c conf/sample_dataset1.config will be applied, which include:
  - input = "./assets/samplesheet_dataset1.csv": input samplesheet file path
  - outdir = "./results_dataset1": output directory path
  - ref = "./assets/IlluminaHsQ50FibTrim_Ref.fa": reference file path
  - ref_repeat_start = 69: 1-based repeat start coordinate in reference
  - ref_repeat_end = 218: 1-based repeat end coordinate in reference
  - ref_repeat_unit = "CAG": repeat unit in reference
  - length_mode = "reference_align": repeat length determination method
  - umi_cutoffs = "1,3,5,7,10,30,100": UMI cutoffs for correction
  - umi_correction_method = "least_distance": UMI correction method
  - repeat_bins = "[(0,50), (51,60), (61,137), (138,154), (155,1000)]": number and range of bins to plot
  - allele_number = 2: number of alleles in reference
  - max_memory = "16.GB": maximum memory to use; do not exceed what your system has
  - max_cpus = 16: maximum number of CPUs to use; do not exceed what your system has
  - max_time = "240.h": maximum running time
  - other module-specific configurations

For detailed explanations, refer to usage.
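The `ref_repeat_start`, `ref_repeat_end`, and `ref_repeat_unit` settings are easy to get wrong by one base, because the coordinates are 1-based while most programming languages index from 0. The sketch below works through the arithmetic for the sample values, assuming the end coordinate is inclusive (the README states only that both are 1-based). The reference sequence here is a made-up stand-in, not the contents of IlluminaHsQ50FibTrim_Ref.fa, and both helper functions are hypothetical.

```python
def reference_repeat_units(start, end, unit):
    """Number of repeat units spanned by 1-based coordinates.

    Assumption: `end` is inclusive, so the span is end - start + 1.
    """
    span = end - start + 1
    if span % len(unit) != 0:
        raise ValueError("repeat span is not a multiple of the unit length")
    return span // len(unit)


def extract_repeat(ref_seq, start, end):
    """Slice the repeat tract out of a reference using 1-based coordinates."""
    # 1-based inclusive [start, end] maps to the 0-based slice [start-1, end).
    return ref_seq[start - 1:end]


# Values from conf/sample_dataset1.config as listed above.
units = reference_repeat_units(69, 218, "CAG")

# Toy reference: 68 flanking bases, 50 CAG units, 10 flanking bases.
toy_ref = "A" * 68 + "CAG" * 50 + "T" * 10
tract = extract_repeat(toy_ref, 69, 218)
```

With the sample values the span is 218 - 69 + 1 = 150 bases, i.e. 50 units of a 3-base repeat. A check like this on your own reference can catch a shifted coordinate before a long pipeline run.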

Example command for running URLpipe with Singularity and the LSF executor:
```nextflow
nextflow run main.nf -c conf/sample_dataset1.config -profile singularity,lsf
```

The above command works like the first example, except that -profile singularity,lsf directs the pipeline to use Singularity and the LSF executor rather than Docker and the local executor.
- Note: Singularity images will be downloaded and saved to the work/singularity directory by default. It is recommended to configure the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir settings to store the images in a central location.

Run your own analysis:
- Typical commands:
```nextflow
# Supply configurations through command flags
nextflow run main.nf -profile --input --outdir --allele_number <1/2> --length_mode --ref ...

# Or include configurations into a single file, e.g. test.config
nextflow run main.nf -profile -c test.config
```

For help:
```nextflow
# todo
nextflow run main.nf --help
```

See the usage documentation for all of the available options.

Documentation

The URLpipe workflow includes comprehensive documentation, covering both usage and output.

Credits

URLpipe was originally designed and written by Kai Hu, Michael Brodsky, and Lihua Julie Zhu. We also extend our gratitude to Rui Li, Haibo Liu, and Junhui Li for their extensive assistance in the development of this tool.

Bug report/Support

For help, bug reports, or feature requests, please create a GitHub issue by clicking here. If you would like to extend URLpipe for your own use, feel free to fork the repository.

Citations: todo

Please cite URLpipe if you use it for your research.

A Template of Method can be found here.

A complete list of references for the tools used by URLpipe can be found here.

Release notes

v0.1.0

* initial release

Owner

Login: hukai916
Kind: user

Repositories: 6
Profile: https://github.com/hukai916

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use `nf-core tools` in your work, please cite the `nf-core` publication"
authors:
  - family-names: Ewels
    given-names: Philip
  - family-names: Peltzer
    given-names: Alexander
  - family-names: Fillinger
    given-names: Sven
  - family-names: Patel
    given-names: Harshil
  - family-names: Alneberg
    given-names: Johannes
  - family-names: Wilm
    given-names: Andreas
  - family-names: Ulysse Garcia
    given-names: Maxime
  - family-names: Di Tommaso
    given-names: Paolo
  - family-names: Nahnsen
    given-names: Sven
title: "The nf-core framework for community-curated bioinformatics pipelines."
version: 2.4.1
doi: 10.1038/s41587-020-0439-x
date-released: 2022-05-16
url: https://github.com/nf-core/tools
preferred-citation:
  type: article
  authors:
    - family-names: Ewels
      given-names: Philip
    - family-names: Peltzer
      given-names: Alexander
    - family-names: Fillinger
      given-names: Sven
    - family-names: Patel
      given-names: Harshil
    - family-names: Alneberg
      given-names: Johannes
    - family-names: Wilm
      given-names: Andreas
    - family-names: Ulysse Garcia
      given-names: Maxime
    - family-names: Di Tommaso
      given-names: Paolo
    - family-names: Nahnsen
      given-names: Sven
  doi: 10.1038/s41587-020-0439-x
  journal: nature biotechnology
  start: 276
  end: 278
  title: "The nf-core framework for community-curated bioinformatics pipelines."
  issue: 3
  volume: 38
  year: 2020
  url: https://dx.doi.org/10.1038/s41587-020-0439-x

Dependencies

modules/nf-core/modules/cat/fastq/meta.yml cpan

modules/nf-core/modules/custom/dumpsoftwareversions/meta.yml cpan

modules/nf-core/modules/cutadapt/meta.yml cpan

modules/nf-core/modules/cutadapt_fastqs/meta.yml cpan

modules/nf-core/modules/fastqc/meta.yml cpan

modules/nf-core/modules/multiqc/meta.yml cpan

pyproject.toml pypi
