Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 7 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: hukai916
  • License: mit
  • Language: Nextflow
  • Default Branch: main
  • Size: 9.41 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License Citation

README.md

Introduction

sikiclass is a bioinformatics pipeline designed for downstream classification analysis of UMI-collapsed reads produced by sikipipe.

The classification process involves several steps:

- classify_tag (Step 1): This step categorizes reads based on the occurrence of tag fragments.
- classify_single_tag (Step 2): This step further subdivides reads containing a single tag into specific categories such as "precise_tag," "5' INDEL," "3' INDEL," etc.
- classify_no_tag (Step 3): This step further divides reads without tags into categories like "Deletion," "Insertion," etc.

Additionally, within the "precise_tag" category, reads are separated and counted based on different SNPs (Step 4). Finally, statistical tables are generated to summarize the proportions of each read category and to detail the size and distribution of INDELs.
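The SNP counting in Step 4 can be pictured as a simple tally. The sketch below is not the pipeline's own code: it assumes the base observed at the SNP site has already been extracted from each aligned "precise_tag" read, and the function name `snp_fractions` and the `other` bucket are hypothetical.

```python
from collections import Counter


def snp_fractions(bases, snp_wt, snp_mut):
    """Return the fraction of reads carrying the WT, mutant, or another base.

    `bases` holds one observed base per read (hypothetical input; a real
    implementation would derive it from the alignment at the SNP position).
    """
    counts = Counter()
    for base in bases:
        base = base.upper()
        if base == snp_wt.upper():
            counts["wt"] += 1
        elif base == snp_mut.upper():
            counts["mut"] += 1
        else:
            counts["other"] += 1
    total = sum(counts.values())
    if total == 0:
        # No reads: report zero fractions instead of dividing by zero.
        return {"wt": 0.0, "mut": 0.0, "other": 0.0}
    return {key: counts[key] / total for key in ("wt", "mut", "other")}


print(snp_fractions(["A", "A", "G", "T"], snp_wt="A", snp_mut="G"))
```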

The diagram below illustrates the method in detail. Specifically:

- classify_tag (Step 1): To determine whether a tag appears in each read sequence, we use the short-read aligner BWA, treating the tag as the query and each read as the reference. By parsing the resulting BAM file, reads are classified as follows:
  - "Without tag": No tag appears in the read.
  - "With single tag": Each tag segment appears at most once in the read.
  - "With multiple tags": Any tag segment appears more than once in the read.
- classify_single_tag (Step 2) and classify_no_tag (Step 3): These steps further classify "with single tag" reads and "without tag" reads, respectively.
  - "With single tag" reads are mapped against a reference with a precise tag inserted using the long-read aligner minimap2. Reads are classified as "with precise tag" if there are no INDELs in the tag and a predefined flanking region; otherwise, they are considered to have INDELs.
  - "Without tag" reads are mapped against a wild-type reference that lacks the tag insertion. These reads are classified into sub-categories based on the occurrence of INDELs.
- stat_snp (Step 4): "Precise tag" reads are split and counted according to the base species at a given SNP site.
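The three Step 1 categories follow directly from how often each tag segment is found in a read. As a minimal sketch, assuming the BAM records have already been reduced to a per-read count of hits for each tag segment (the dictionary layout and the function name `classify_by_tag_hits` are hypothetical, not the pipeline's code):

```python
def classify_by_tag_hits(segment_hits):
    """Classify one read from the number of hits per tag segment.

    `segment_hits` maps a tag-segment name to how many times that segment
    aligned to the read (hypothetical summary of the parsed BAM records).
    """
    counts = list(segment_hits.values())
    if not any(count > 0 for count in counts):
        return "no_tag"          # no tag appears in the read
    if any(count > 1 for count in counts):
        return "multiple_tag"    # some segment appears more than once
    return "single_tag"          # every segment appears at most once


print(classify_by_tag_hits({"seg1": 1, "seg2": 0}))
```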

In addition to the split FASTQ files, sikiclass generates tables summarizing the proportions of each sub-category and the distributions of INDELs.

Refer to the Usage section for instructions on how to run sikiclass, and the Output section for a detailed description of the result folders and files.

Usage

[!NOTE] If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

Now, you can run the pipeline using:

```bash
git clone git@github.com:hukai916/sikiclass.git
cd sikiclass

# update conf/test_local.config by providing the correct samplesheet path, etc.

nextflow run main.nf -profile docker,arm -c conf/test_local.config
```

[!WARNING] Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
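One way to follow this advice is to keep the parameters in a JSON file and pass it with `-params-file`. The sketch below writes such a file using the example values from the parameter descriptions in this README. The key spellings with underscores (for example `ref_with_tag`) are an assumption, so confirm them against `nextflow.config` before use; the file name `params.json` is arbitrary.

```python
import json

# Example values taken from the parameter descriptions in this README.
# Key spellings are an assumption; check nextflow.config for the exact names.
params = {
    "input": "assets/samplesheet_local.csv",
    "ref_with_tag": "assets/test/h3f3d_preciseInsert.fa",
    "ref_wt": "assets/test/h3f3d_wt.fa",
    "tag_fq": "./assets/test/tag.fastq",
    "tag_start_ref_with_tag": 1003,
    "tag_end_ref_with_tag": 1089,
    "tag_flanking": 60,
    "pam_start_ref_wt": 1113,
    "snp_pos": None,  # serialized as JSON null, i.e. skip the SNP step
}

with open("params.json", "w") as handle:
    json.dump(params, handle, indent=2)
```

The resulting file could then be supplied with, for example, `nextflow run main.nf -profile docker,arm -params-file params.json`.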

Core sikiclass parameters are described below.

  • input [null] String. Path to input sample sheet in CSV format. E.g., 'assets/samplesheet_local.csv'.
  • ref_with_tag [null] String. Path to reference fasta file containing precise tag insert. E.g., 'assets/test/h3f3d_preciseInsert.fa'.
  • ref_wt [null] String. Path to reference fasta file without precise tag insert. E.g., 'assets/test/h3f3d_wt.fa'.
  • tag_fq [null] Path to an artificial fastq file containing the tag sequence. E.g., './assets/test/tag.fastq'.
  • tag_start_ref_with_tag [int] Integer. 1-based coordinate of tag start pos in ref_with_tag. E.g., 1003.
  • tag_end_ref_with_tag [int] Integer. 1-based coordinate of tag end pos in ref_with_tag. E.g., 1089.
  • tag_flanking [int] Integer. Number of tag-flanking bases to check to determine "precise tag" reads. E.g., 60.
  • pam_start_ref_wt [int] Integer. 1-based coordinate of pam pos in ref_wt. E.g., 1113.
  • snp_pos [null] Integer or null. Specifies whether to count ratio of reads with different SNPs for reads with precise tag. Supply 1-based coordinate on ref_with_tag or null to skip this step. If set, the "snp_wt" and "snp_mut" arguments below must also be supplied.
  • snp_wt [str] String. The base species at the "snp_pos" site in the wild type reference. E.g., 'A'.
  • snp_mut [str] String. The base species at the "snp_pos" site in the mutant reference. E.g., 'G'.
  • indel_range_to_scan_single_tag [str] String. Specifies whether to restrict indels within a given range on ref_with_tag for classifying "single tag" reads. Define the range using the pattern 'start:end' with 1-based coordinates. Defaults to ':', which means the full range will be used. E.g., '950:1116'.
  • indel_range_to_scan_no_tag [str] String. Specifies whether to restrict indels within a given range on ref_wt for classifying "no tag" reads. Define the range using the pattern 'start:end' with 1-based coordinates. Defaults to ':', which means the full range will be used. E.g., '980:1003'.
  • classify_no_tag_filter_control_indel [null] String or null. Specifies the names of control (uninjected) samples separated by comma. When provided, indels occurring in these control samples will be filtered out during the classification of "no tag" reads. Set to null to skip this filtering. E.g., 'SpCas9_uninjected_SIKI2_1_10, SpCas9_uninjected_SIKI2_1_5, SpCas9_uninjected_SIKI2_2_10, SpCas9_uninjected_SIKI2_2_5, SpCas9_uninjected_SIKI2_3_10, SpCas9_uninjected_SIKI2_3_5'. The sample names must match those specified in the sample sheet.
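The 'start:end' pattern used by the two indel-range parameters can be interpreted as in the sketch below. This is an illustration rather than the pipeline's parser: the function name `parse_scan_range` is hypothetical, and treating a one-sided value such as '950:' as open-ended is an assumption that the README does not confirm.

```python
def parse_scan_range(spec, ref_length):
    """Turn a 'start:end' string into an inclusive 1-based (start, end) pair.

    A missing side falls back to the reference boundary, so ':' selects the
    full range. Support for one-sided specs such as '950:' is an assumption.
    """
    start_text, sep, end_text = spec.partition(":")
    if sep != ":":
        raise ValueError(f"expected 'start:end', got {spec!r}")
    start = int(start_text) if start_text.strip() else 1
    end = int(end_text) if end_text.strip() else ref_length
    if not 1 <= start <= end <= ref_length:
        raise ValueError(f"range {start}:{end} does not fit 1..{ref_length}")
    return start, end


print(parse_scan_range("950:1116", ref_length=2000))
print(parse_scan_range(":", ref_length=2000))
```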

Output

By default, all results are saved in the "./results" folder, as specified by the outdir = "./results" parameter in the master configuration file "nextflow.config". Results from different steps are organized into corresponding subdirectories (e.g., "./results/01_classify_tag", "./results/02_classify_single_tag", etc.), and the "./results/00_stat" folder contains summary statistical tables.

Nextflow implements a caching mechanism that stores all intermediate and final results in the "./work/" directory. By default, files in "./results/" are symbolic links to those in "./work/". To switch from symlink to copy, set publish_dir_mode = copy. The main contents of each result folder are summarized below.
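Because symlinked results become dangling links if "./work/" is deleted, it can help to check which published files are links before cleaning up. A minimal sketch using only the Python standard library (the function name `list_symlinked_results` is hypothetical, and the "./results" path is the default mentioned above):

```python
from pathlib import Path


def list_symlinked_results(results_dir):
    """Return (link, target) pairs for every symlink below `results_dir`."""
    pairs = []
    for path in sorted(Path(results_dir).rglob("*")):
        if path.is_symlink():
            # resolve() follows the link to the file it points at
            pairs.append((path, path.resolve()))
    return pairs


for link, target in list_symlinked_results("./results"):
    print(f"{link} -> {target}")
```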

00_stat

This directory contains summary tables generated using the results of steps 1-4:

  • fq_class_ratios.tsv: Provides the count fractions of each read class for each sample.
  • fq_class_ratios_indel_filter_control.tsv: Relevant when the classify_no_tag_filter_control_indel argument is set. Provides the count fractions of each read class for each sample, where "no tag" reads are classified using only the indels that remain after excluding those that appear in control samples.
  • precise_tag_snp_fraction.tsv: Provides the count fractions of different base species at the given SNP site for "precise tag" reads for each sample.
  • no_tag_reads_indel_info/: Provides indel information for "no tag" reads.
    • no_tag_indel_size_distribution.tsv: Shows the distribution of INDEL sizes for "no tag" reads.
    • no_tag_indel_size_location_to_pam: Lists INDEL sizes and their locations relative to the PAM site for "no tag" reads for each sample.
  • no_tag_reads_indel_info_filter_control/: Relevant when the classify_no_tag_filter_control_indel argument is set. Provides indel information for "no tag" reads classified using indels after excluding those that appear in control samples.
    • no_tag_indel_size_distribution.tsv: Shows the distribution of INDEL sizes for "no tag" reads.
    • no_tag_indel_size_location_to_pam: Lists INDEL sizes and their locations relative to the PAM site for "no tag" reads for each sample.
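The "count fractions" in these tables are per-sample proportions. The sketch below shows the arithmetic only; it does not read the pipeline's files, whose column layout is not described here. The class names and counts are made up for illustration, and the function name `class_fractions` is hypothetical.

```python
def class_fractions(class_counts):
    """Convert per-class read counts for one sample into fractions."""
    total = sum(class_counts.values())
    if total == 0:
        # Empty sample: report zeros instead of dividing by zero.
        return {name: 0.0 for name in class_counts}
    return {name: count / total for name, count in class_counts.items()}


# Illustrative counts only, not real results.
example_counts = {"no_tag": 50, "single_tag": 30, "multiple_tag": 20}
for name, fraction in class_fractions(example_counts).items():
    print(f"{name}\t{fraction:.3f}")
```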

01_classify_tag

This directory contains fastq files categorized based on tag occurrence:

  • 01a_no_tag/: Fastq file for reads classified as having "no tag" for each sample.
  • 01b_single_tag/: Fastq file for reads classified as having "single tag" for each sample.
  • 01c_multiple_tag/: Fastq file for reads classified as having "multiple tag" for each sample.
  • 01d_any_tag/: Fastq file for reads classified as having "any tag" for each sample (a union of 01b and 01c).
  • 01e_tmp_fasta/: Intermediate fasta file.
  • 01f_tmp_bam/: Intermediate BAM file (from BWA).

02_classify_single_tag

This directory contains fastq files categorized based on INDEL occurrence for "single tag" reads.

  • 02a_precise_tag/: Fastq file for reads with "precise tag" for each sample.
  • 02a_precise_tag_snp_wt/: Fastq file for reads classified as having "precise tag" that also contain the WT base at the given SNP site for each sample.
  • 02a_precise_tag_snp_mut/: Fastq file for reads classified as having "precise tag" that also contain the mutant base at the given SNP site for each sample.
  • 02b_5indel/: Fastq file for reads classified as having "INDELs in 5' region" for each sample.
  • 02c_3indel/: Fastq file for reads classified as having "INDELs in 3' region" for each sample.
  • 02d_any_indel/: Fastq file for reads classified as having "INDELs in any region" for each sample.
  • 02e_tmp_bam/: Intermediate BAM file (from minimap2).
  • 02f_tmp_indel_pos/: Intermediate file containing INDEL size and location information for each sample.
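The split between "precise tag" reads and reads with INDELs depends on whether any INDEL touches the tag or its flanking bases, as set by the tag start, tag end, and tag_flanking parameters. A minimal sketch, assuming each INDEL is already available as a 1-based inclusive interval on the tag-containing reference (this representation and both function names are hypothetical). It does not attempt the 5'/3' split, since the exact rule for that is not described here.

```python
def has_indel_near_tag(indels, tag_start, tag_end, tag_flanking):
    """Return True if any INDEL overlaps the tag or its flanking bases.

    `indels` is a list of (ref_start, ref_end) pairs, 1-based and inclusive
    (hypothetical representation of INDELs parsed from the alignment).
    """
    window_start = max(1, tag_start - tag_flanking)
    window_end = tag_end + tag_flanking
    return any(start <= window_end and end >= window_start
               for start, end in indels)


def classify_single_tag_read(indels, tag_start, tag_end, tag_flanking):
    """Label a "single tag" read as precise_tag or any_indel."""
    if has_indel_near_tag(indels, tag_start, tag_end, tag_flanking):
        return "any_indel"
    return "precise_tag"


# Example values from the parameter list: tag at 1003..1089, 60 flanking bases,
# so the checked window is 943..1149.
print(classify_single_tag_read([(900, 910)], 1003, 1089, 60))
print(classify_single_tag_read([(940, 945)], 1003, 1089, 60))
```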

03_classify_no_tag

This directory contains fastq files categorized based on INDEL occurrence for "no tag" reads.

  • control_indels.tsv: Relevant when classify_no_tag_filter_control_indel argument is set. Stores the indel information retrieved from control (uninjected) samples.
  • 03a_indel/: Fastq file for "no tag" reads that contain indels for each sample.
  • 03a_indel_filter_control/: Relevant when the classify_no_tag_filter_control_indel argument is set. Fastq file for "no tag" reads that contain indels after excluding those that appear in control samples for each sample.
  • 03b_deletion_only/: Fastq file for reads classified as having "deletions in any region" for each sample. Reads that contain insertions are excluded.
  • 03b_deletion_only_filter_control/: Relevant when the classify_no_tag_filter_control_indel argument is set. Fastq file for reads classified as having "deletions in any region" after excluding those that appear in control samples for each sample. Reads that contain insertions are excluded.
  • 03c_insertion_only/: Fastq file for reads classified as having "insertions in any region" for each sample. Reads that contain deletions are excluded.
  • 03c_insertion_only_filter_control/: Relevant when the classify_no_tag_filter_control_indel argument is set. Fastq file for reads classified as having "insertions in any region" after excluding those that appear in control samples for each sample. Reads that contain deletions are excluded.
  • 03d_complex/: Fastq file for reads classified as having "both deletions and insertions in any region" for each sample.
  • 03d_complex_filter_control/: Relevant when the classify_no_tag_filter_control_indel argument is set. Fastq file for reads classified as having "both deletions and insertions in any region" after excluding those that appear in control samples for each sample.
  • 03e_tmp_bam/: Intermediate BAM file (from minimap2).
  • 03f_tmp_indel_pos/: Intermediate file containing INDEL size and location information for each sample.
  • 03f_tmp_indel_pos_control/: Relevant when the classify_no_tag_filter_control_indel argument is set. Intermediate file containing INDEL size and location information (after excluding INDELs that appear in control samples) for each sample.
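The relationship between the deletion-only, insertion-only, and complex folders, and the effect of control filtering, can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: INDELs are represented as (kind, position, size) tuples, a control INDEL is matched only when all three fields are identical, and the `no_indel` label for reads with nothing left after filtering is hypothetical.

```python
def classify_no_tag_read(indels, control_indels=frozenset()):
    """Classify a "no tag" read from its INDELs.

    `indels` is an iterable of (kind, position, size) tuples where kind is
    "D" (deletion) or "I" (insertion); this layout is hypothetical.
    INDELs that also occur in `control_indels` are ignored.
    """
    kept = [indel for indel in indels if indel not in control_indels]
    kinds = {kind for kind, _position, _size in kept}
    if kinds == {"D"}:
        return "deletion_only"
    if kinds == {"I"}:
        return "insertion_only"
    if kinds == {"D", "I"}:
        return "complex"
    return "no_indel"  # hypothetical label: no INDEL left after filtering


read_indels = [("D", 1100, 3), ("I", 1110, 1)]
control = {("I", 1110, 1)}  # INDEL also seen in an uninjected control sample
print(classify_no_tag_read(read_indels))
print(classify_no_tag_read(read_indels, control))
```

In this example the read counts as complex without filtering and as deletion-only once the control insertion is excluded.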

Credits

sikiclass was originally designed and written by Kai Hu, Nathan Lawson, and Julie Zhu.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

  • Login: hukai916
  • Kind: user

Citation (CITATIONS.md)

# hukai916/sikiclass: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total
  • Push event: 2
Last Year
  • Push event: 2

Dependencies

modules/nf-core/fastqc/meta.yml cpan
modules/nf-core/multiqc/meta.yml cpan
subworkflows/nf-core/utils_nextflow_pipeline/meta.yml cpan
subworkflows/nf-core/utils_nfcore_pipeline/meta.yml cpan
subworkflows/nf-core/utils_nfvalidation_plugin/meta.yml cpan
pyproject.toml pypi
modules/nf-core/fastqc/environment.yml conda
  • fastqc 0.12.1.*
modules/nf-core/multiqc/environment.yml conda
  • multiqc 1.21.*