microsec

MicroSEC filtering pipeline for FFPE artifacts

https://github.com/mano-b/microsec

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.0%) to scientific vocabulary
Last synced: 9 months ago · JSON representation

Repository

MicroSEC filtering pipeline for FFPE artifacts

Basic Info
  • Host: GitHub
  • Owner: MANO-B
  • License: other
  • Language: R
  • Default Branch: main
  • Size: 12.9 MB
Statistics
  • Stars: 7
  • Watchers: 0
  • Forks: 1
  • Open Issues: 2
  • Releases: 42
Created over 5 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License

README.md

MicroSEC logo

Version history

2.1.6: Bug with simple repeat annotation fixed. Downloadable from CRAN.
2.1.5: Bug fixed.
2.1.3: Example files renewed. Package in CRAN available.
2.1.1: Bug fixed.
2.1.0: Compatible with large indels in very short leads.
2.0.0: Processing time reduced to 30% due to changes in search algorithm.

Docker and Apptainer

MicroSEC docker file can be download via Docker-hub.
https://hub.docker.com/r/ikegamitky/microsec/tags docker pull ikegamitky/microsec:v2.1.6

Apptainer cantainer can be built via Docker-hub. apptainer pull docker://ikegamitky/microsec:v2.1.6

Apptainer container can be built with a definition file (takes 30 min).
wget https://raw.githubusercontent.com/MANO-B/MicroSEC/main/MicroSEC.def sudo apptainer build MicroSEC.sif MicroSEC.def

Run docker container or apptainer container.
```

mount host directory to container directory

docker run -it --rm -v [hostdir]:[containerdir] ikegamitky/microsec:v2.1.6 bash

download script file

wget https://raw.githubusercontent.com/MANO-B/MicroSEC/main/MicroSEC.R

run the script

Rscript MicroSEC.R [working/output directory] [sample information tsv file] [progress bar Y/N] ```

Analysis script file

MicroSEC has been improved to dramatically reduce memory usage.
The speed of analysis is also improved by deleting parts of the BAM file that are not relevant to the mutations prior to analysis by MicroSEC.
Samtools is now mandatory.
Please download and use the new version of MicroSEC.R.
wget https://raw.githubusercontent.com/MANO-B/MicroSEC/main/MicroSEC.R

Known limitation

PIK3CA E545A (chr3:179218304A>C, NM_006218.4:c.1634A>C) pathogenic mutation might be called as an artifact by MicroSEC, which may be a false positive error. A PIK3CA pseudogene (LOC100422375) in chromosome 22 harbors a base substitution which was identical with the mutation in PIK3CA c.1634A>C.
Please check the reads manually with IGV.

MicroSEC pipeline for FFPE artifacts

This pipeline is designed for filtering sequence errors found in formalin-fixed and paraffin-embedded (FFPE) samples.
This repository contains all the codes to regenerate results from our paper:
"MicroSEC: Sequence error filtering pipeline for formalin-fixed and paraffin-embedded samples"

Supplementary Code

M. Ikegami et al., "MicroSEC filters sequence errors for formalin-fixed and paraffin-embedded samples", Commun Biol 4, 1396 (2021). https://doi.org/10.1038/s42003-021-02930-4

Contents

Overview

This pipeline is designed for filtering mutations found in formalin-fixed and paraffin-embedded (FFPE) samples.
The MicroSEC filter utilizes a statistical analysis, and the results for mutations with less than 10 supporting reads are not reliable.
Two files are necessary for the analysis: mutation information file, BAM file.
A mutation supporting read ID information file is desirable but not necessary.
Prepare a sample information tsv file.

File 1: mutation information tsv file (required)

This tsv file should contain at least these contents (any number of other columns are allowed):
Sample Mut_type Chr Pos Ref Alt SimpleRepeat_TRF Neighborhood_sequence SL_1010-N6-B 1-snv chr1 108130741 C T N CTACCTGGAGAATGGGCCCATGTGTCCAGGTAGCAGTAAGC
- Notation:
- Muttype: [altered bases]-[mutation type (snv, ins, or del)]. 1-snv, 3-snv, 3-del, 1-ins, etc.
- Chr: chr1-22, chrX, and chrY for hg19 and hg38. 1-20, X, and Y for mm10.
MicroSEC handle both "chr1" and "1" format chromosomes.
- Pos/Ref/Alt: There are some difference from HGVS nomenclature as follows:
- 131C>T -> Pos/Ref/Alt 131/C/T.
- 514
515insC and 514A -> Pos/Ref/Alt 514/A/AC
- 244delC and 243A -> Pos/Ref/Alt 243/AC/A
- SimpleRepeatTRF: Whether the mutation locates at a simple repeat sequence or not (Y or N).
- Neighborhood
sequence: [5'-20 bases] + [Alt sequence] + [3'-20 bases].
- Sample, Chr, Pos, Ref, and Alt should be set exactly.
- If you do not know Muttype, SimpleRepeatTRF, or Neighborhood_sequence, enter "-".

File 2: BAM file (required)

This file should contain at least these contents (always included in standard BAM files):
- QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, ISIZE, and QUAL.

I only have experience processing BAM files of up to 20 gigabytes.
It is recommended that huge BAM files be split up for processing.

For example,
samtools view -b sort.bam chr1:100-1000 > output_chr1.bam

It is better to split the file by chromosome.
If the file size is still too large, split it further at a distance from the mutation.
Deleting regions where there are no mutations will lighten the process.
Please download and use MicroSEC.R script file.

File 3: sample information tsv file (required)

From seven to ten columns are necessary. Three optional columns can be omitted.
The file contains no header. [sample name] [mutation information tsv file] [BAM file] [read length] [adapter sequence read 1] [optional: adapter sequence read 2] [sample type: Human or Mouse] [panel name] [optional: reference genome fasta file] [optional: simple repeat region bed file] PC9 ./source/CCLE.tsv ./source/Cell_line/PC9.bam 127 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT Human TOP A375 ./source/CCLE.tsv.gz ./source/Cell_line/A375.bam 127 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT Hg38 TOP ./reference/hg38.fa ./reference/simpleRepeat.bed.gz

  • Reference genome: Human (hg38), Mouse (mm10), hg19, hg38, or mm10. Support for other organisms is easy, and will be done as needed upon request.

File 4: Reference genome

Reference fastq file is necessary when you using CRAM files.

File 5: simple repeat region bed file (optional, but mandatory to detect simple repeat derived artifacts)

simpleRepeat.bed.gz file is necessary when you check simple repeat derived errors.
Bed file can be obtained from https://genome.ucsc.edu/cgi-bin/hgTables

Filtering detail

This pipeline contains 8 filtering processes.

  • Filter 1 : Shorter-supporting lengths distribute too short to occur (1-1 and 1-2).
    - Filter 1-1: P-values are less than the threshold_p(default: 10^(-6)).
    - Filter 1-2: The shorter-supporting lengths distributed over less than 75% of the read length.
  • Filter 2 : Hairpin-structure induced error detection (2-1 and 2-2).
    - Filter 2-1: Palindromic sequences exist within 200 bases.
    - Filter 2-2: >=50% mutation-supporting reads contains a reverse complementary sequence of the opposite strand consisting >= 15 bases.
  • Filter 3 : 3'-/5'-supporting lengths are too densely distributed to occur (3-1 and 3-2).
    - Filter 3-1: P-values are less than the threshold_p(default: 10^(-6)).
    - Filter 3-2: The distributions of 3'-/5'-supporting lengths are within 75% of the read length.
  • Filter 4 : >=15% mutations were called by chimeric reads comprising two distant regions.
  • Filter 5 : >=50% mutations were called by soft-clipped reads.
  • Filter 6 : Mutations locating at simple repeat sequences.
  • Filter 7 : Indel mutations locating at a >=15 homopolymer.
  • Filter 8 : >=10% of bases are low quality (Quality score <18) in the mutation supporting reads.

Filter 1, 2, 3, and 4 detect possible FFPE artifacts.
Filter 5 may also be FFPE artifacts or mapping errors.
Filter 6, 7, and 8 detect frequent errors caused by the next generation sequencing platform.
Supporting lengths are adjusted considering small repeat sequences around the mutations.

Results are saved in a tsv file.

github url: https://github.com/MANO-B/MicroSEC

System Requirements

Hardware Requirements

The scripts requires only a standard computer with enough RAM to support the operations defined by a user. For minimal performance, this will be a computer with about 4 GB of RAM (depending on the size of BAM file and the number of mutations). For optimal performance, we recommend a computer with the following specs:

RAM: 4+ GB
CPU: 2+ cores, 2.6+ GHz/core

The runtimes below are generated using a computer with the recommended specs (16 GB RAM, M1 Macbook air) and internet of speed 40 Mbps.

Software Requirements

Samtools

Samtools is used for pre-processing to remove reads that are not related to mutations. Version 1.12 is what I am using, but I think older versions will work if they support multi-core processing.

R language

This script files runs on R for Windows, Mac, or Linux, which requires the R version 3.5.0 or later. If you use version 3.4 or lower of R, you will have some difficulty installing the packages, but it is not impossible.

Package dependencies

Users should install the following packages prior to use the scripts, from an R terminal:

``` if (!requireNamespace("BiocManager", quietly = TRUE)){ install.packages("BiocManager") } install.packages(c('stringr', 'tidyverse', 'remotes'), dependencies = TRUE) BiocManager::install(c("Rsamtools", "Biostrings", "GenomicAlignments", "GenomeInfoDb"), update=FALSE)

install necessary genomes

BiocManager::install("BSgenome.Hsapiens.UCSC.hg38", update=FALSE)

BiocManager::install("BSgenome.Hsapiens.UCSC.hg19", update=FALSE)

BiocManager::install("BSgenome.Mmusculus.UCSC.mm10", update=FALSE)

```

which will install in about 30 minutes on a recommended machine.

Package Versions

The program does not use any of the functions specific to the following version of the packages, so there is no problem if you use the latest version of the package.

```

packageVersion("stringr") [1] '1.4.0' packageVersion("dplyr") [1] '1.0.2' packageVersion("Biostrings") [1] '2.54.0' packageVersion("BSgenome.Hsapiens.UCSC.hg38") [1] '1.4.1' packageVersion("GenomicAlignments") [1] '1.22.1' packageVersion("Rsamtools") [1] '2.0.3' packageVersion("remotes") [1] '2.5.0' packageVersion("GenomeInfoDb") [1] '1.22.1' ```

Instructions for Use

See also https://rdrr.io/cran/MicroSEC/ - How to install ```

Installing a package from github requires a github account.

1. Sign in github.

2. Access Settings - Developer Settings in the Dashboard.

3. Generate a Personal access token (classic) without any checkboxes.

4. Copy the generated token.

5. Run following command in R console.

Sys.setenv(GITHUBPAT = "mygithubpathere")

Stable version (v2.1.6) from github (recommended)

remotes::install_github("MANO-B/MicroSEC", ref = 'v2.1.6')

Developmental version from github

remotes::install_github("MANO-B/MicroSEC")

Stable version (v2.1.6) from CRAN (recommended)

install.packages('MicroSEC', dependencies = FALSE) - How to use in command line

download only once

wget https://raw.githubusercontent.com/MANO-B/MicroSEC/main/MicroSEC.R

run the script

Rscript MicroSEC.R [working/output directory] [sample information tsv file] [progress bar Y/N] - Example Rscript MicroSEC.R /mnt/HDD8TB/MicroSEC /mnt/HDD8TB/MicroSEC/source/Sample_list.txt Y
- How to use in R Console

See also the vignette file.

Necessary packages

library(MicroSEC)

options(show.error.messages = FALSE, warn = -1) wd <- "mnt/HDD8TB/MicroSEC" # set your working/output directory samplelist <- "mnt/HDD8TB/MicroSEC/source/Samplelist.txt" progress_bar <- "Y"

setwd(wd)

load sample information tsv file

sampleinfo <- read.csv(samplelist, header = FALSE, stringsAsFactors = FALSE, sep = "\t")

initialize

referencegenome <- NULL simplerepeatlist <- "" msec <- NULL homologysearch <- NULL mut_depth <- NULL

for (sample in seqlen(dim(sampleinfo)[1])) { samplename <- sampleinfo[sample, 1] mutationfile <- sampleinfo[sample, 2] bamfile <- sampleinfo[sample, 3] readlength <- as.integer(sampleinfo[sample, 4]) adapter1 <- sampleinfo[sample, 5] if (sampleinfo[sample, 6] %in% c("Human", "Mouse", "hg19", "hg38", "mm10")) { adapter2 <- adapter1 organism <- sampleinfo[sample, 6] panel <- sampleinfo[sample, 7] if (dim(sampleinfo)[2] == 8) { referencegenome <- sampleinfo[sample, 8] } if (dim(sampleinfo)[2] == 9) { referencegenome <- sampleinfo[sample, 8] simplerepeatlist <- sampleinfo[sample, 9] } } else{ adapter2 <- sampleinfo[sample, 6] organism <- sampleinfo[sample, 7] panel <- sampleinfo[sample, 8] if (dim(sampleinfo)[2] == 9) { referencegenome <- sampleinfo[sample, 9] } if (dim(sampleinfo)[2] == 10) { referencegenome <- sampleinfo[sample, 9] simplerepeatlist <- sampleinfo[sample, 10] } } bamfileslim <- paste0(bamfile, ".SLIM") bamfiletmp = paste0(bam_file, ".tmp.bam")

# load genomic sequence refgenome <- funloadgenome(organism) chrno <- funloadchrno(organism) if (refgenome@userseqnames[[1]] == "chr1") { chromosomes <- paste0("chr", c(seqlen(chrno - 2),"X", "Y")) } if (refgenome@userseqnames[[1]] == "1") { chromosomes <- paste0("", c(seqlen(chr_no - 2),"X", "Y")) }

# load mutation information dfmutation <- funloadmutation(mutationfile, samplename, refgenome, 24, simplerepeatlist)

if (tools::fileext(bamfile) == "bam") { bamfilebai <- paste0(bamfile, ".bai") } else if (tools::fileext(bamfile) == "cram") { bamfilebai <- paste0(bamfile, ".crai") } if (!file.exists(bamfilebai) & !file.exists(bamfileslim)) { print("Sorting a BAM/CRAM file...") if (tools::fileext(bamfile) == "bam") { bamfilesort <- paste0(bamfile, "sort.bam") syscom <- paste0("samtools sort -@ 4 -o ", bamfilesort, " ", bamfile) } else if (tools::fileext(bamfile) == "cram") { bamfilesort <- paste0(bamfile, "sort.cram") syscom <- paste0("samtools sort -@ 4 -O cram -o ", bamfilesort," ", bamfile) }
system(syscom) syscom <- paste0("samtools index ", bamfilesort) system(syscom) bamfile <- bamfile_sort }

if (!file.exists(paste0(bamfileslim, ".bai"))) { print("Trimming a BAM/CRAM file...") dfmutationsave <- dfmutation downloadregion <- data.frame(matrix(rep(NA,3),nrow=1))[numeric(0),] colnames(downloadregion) <- c("chrom", "chromStart", "chromEnd") print(paste0(samplename, ": ", dim(dfmutationsave)[1], " mutations")) for (i in chromosomes) { dfmutation <- dfmutationsave[dfmutationsave$Chr == i,] continuous = FALSE for (mutno in seqlen(dim(dfmutation)[1])) { if (mutno == 1 & mutno != dim(dfmutation)[1]) { if (dfmutation$Chr[mutno + 1] != dfmutation$Chr[mutno] | dfmutation$Pos[mutno + 1] > dfmutation$Pos[mutno] + 400) { downloadregion <- rbind( downloadregion, c(dfmutation$Chr[mutno], max(1, dfmutation$Pos[mutno] - 200) - 1, dfmutation$Pos[mutno] + 200)) colnames(downloadregion) <- c("chrom", "chromStart", "chromEnd") continuous <- FALSE } else { continuous <- TRUE poslast <- max(1, dfmutation$Pos[mutno] - 200) } } else if (mutno == 1 & mutno == dim(dfmutation)[1]) { downloadregion <- rbind( downloadregion, c(dfmutation$Chr[mutno], max(1, dfmutation$Pos[mutno] - 200) - 1, dfmutation$Pos[mutno] + 200)) colnames(downloadregion) <- c("chrom", "chromStart", "chromEnd") } else if (mutno == dim(dfmutation)[1]) { if (continuous) { downloadregion <- rbind( downloadregion, c(dfmutation$Chr[mutno], poslast - 1, dfmutation$Pos[mutno] + 200)) colnames(downloadregion) <- c("chrom", "chromStart", "chromEnd") } else { downloadregion = rbind( downloadregion, c(dfmutation$Chr[mutno], max(1, dfmutation$Pos[mutno] - 200) - 1, dfmutation$Pos[mutno] + 200)) colnames(downloadregion) <- c("chrom", "chromStart", "chromEnd") } } else { if (continuous) { if (dfmutation$Chr[mutno + 1] != dfmutation$Chr[mutno] | dfmutation$Pos[mutno + 1] > dfmutation$Pos[mutno] + 400) { downloadregion = rbind( downloadregion, c(dfmutation$Chr[mutno], poslast - 1, dfmutation$Pos[mutno] + 200)) colnames(downloadregion) <- c("chrom", "chromStart", "chromEnd") continuous = FALSE } } else { if (dfmutation$Chr[mutno + 1] != dfmutation$Chr[mutno] | dfmutation$Pos[mutno + 1] > dfmutation$Pos[mutno] + 400) { downloadregion <- rbind( downloadregion, c(dfmutation$Chr[mutno], max(1, dfmutation$Pos[mutno] - 200) - 1, dfmutation$Pos[mutno] + 200)) colnames(downloadregion) <- c("chrom", "chromStart", "chromEnd") } else { continuous <- TRUE poslast <- max(1, dfmutation$Pos[mutno] - 200) } } } } } writetsv(x = downloadregion, file = paste0(bamfile,".bed"), progress = F, colnames = F) rm(downloadregion) systemout <- 1 if (tools::fileext(bamfile) == "bam") { syscom <- paste0("samtools view -h --no-PG ", bamfile, " -L ", paste0(bamfile,".bed"), " > ", bamfileslim) } else if (tools::fileext(bamfile) == "cram") { filecrai <- paste0(bamfile, ".crai") syscom <- paste0("samtools view -h --no-PG ", bamfile, " -T ", referencegenome, " ", "-X ", filecrai, " -L ", paste0(bamfile,".bed"), " > ", bamfileslim) } systemout = (system(syscom)) if(systemout == 0){ systemout <- 1 syscom <- paste0("samtools sort -o ", bamfiletmp, " ", bamfileslim) print("Sorting BAM file...") systemout <- (system(syscom)) if(systemout == 0){ systemout <- 1 syscom = paste0("samtools view -bS ", bamfiletmp, " > ", bamfileslim) print("Compressing BAM file...") systemout <- (system(syscom)) if(systemout == 0){ systemout <- 1 syscom = paste0("samtools index ", bamfileslim) print("Indexing BAM file...") systemout <- (system(syscom)) if(systemout == 0){ print(paste0("Slimmed BAM files were saved as ", bamfileslim)) #file.remove(bamfile) file.remove(bamfiletmp) } } } } dfmutation <- dfmutationsave rm(dfmutation_save) }

bamfile <- bamfileslim dfbam <- funloadbam(bam_file)

# analysis result <- funreadcheck(dfmutation = exampleMutation, dfbam = exampleBAM, refgenome = refgenome, samplename = samplename, readlength = readlength, adapter1 = adapter1, adapter2 = adapter2, shorthomologysearchlength = 4, minhomologysearch = 40, progressbar = progressbar) msec <- rbind(msec, result[[1]]) homologysearch <- rbind(homologysearch, result[[2]]) mutdepth <- list(rbind(mutdepth[[1]], result[[3]][[1]]), rbind(mutdepth[[2]], result[[3]][[2]]), rbind(mut_depth[[3]], result[[3]][[3]])) }

search homologous sequences

msec = funhomology(msec, homologysearch, minhomologysearch = 40, refgenome, chrno, progressbar = progressbar)

statistical analysis

msec = funsummary(msec) msec = funanalysis(msec, mutdepth, shorthomologysearchlength = 4, minhomologysearch = 40, thresholdp = 10 ^ (-6), thresholdhairpinratio = 0.50, thresholdshortlength = 0.75, thresholddistanthomology = 0.15, thresholdsoftclipratio = 0.50, thresholdlowqualityrate = 0.1, homopolymerlength = 15)

save the results

funsave(msec, paste0(wd, "/", sampleinfo[1,1], ".tsv")) ```

  • Output files
    A tsv file is saved in the working/output directory. [Sample name].tsv

  • Sample name is set to the sample of interest in the Sample column of the mutation information file.

  • Confirm the read length in the platform.

  • Confirm the adapter sequence; Illumina sequencer ("AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" and "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT").

  • If you input only one adapter sequence, the sequence will be used for read 1 and read 2.

  • If you want to know the progress visually, [progress bar Y/N] should be Y.

Reproducibility

The source code is available in MicroSEC.R. Source data will be available at the Japanese Genotype-Phenotype Archive (http://trace.ddbj.nig.ac.jp/jga), which is hosted by the DNA Data Bank of Japan (the accession number is written in our paper). Each filtering process takes about 5–120 minutes per sample on a recommended machine, according to the depth and mutation amount of the sample.

Owner

  • Login: MANO-B
  • Kind: user

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1

Packages

  • Total packages: 1
  • Total downloads:
    • cran 203 last-month
  • Total docker downloads: 5,547
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 3
  • Total maintainers: 1
cran.r-project.org: MicroSEC

Sequence Error Filter for Formalin-Fixed and Paraffin-Embedded Samples

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 203 Last month
  • Docker Downloads: 5,547
Rankings
Stargazers count: 20.3%
Forks count: 21.8%
Dependent packages count: 28.5%
Dependent repos count: 35.1%
Average: 38.4%
Downloads: 86.3%
Maintainers (1)
Last synced: about 1 year ago

Dependencies

DESCRIPTION cran
  • R >= 3.4.0 depends
  • BiocGenerics * imports
  • Biostrings * imports
  • GenomeInfoDb * imports
  • GenomicAlignments * imports
  • R.utils * imports
  • Rsamtools * imports
  • data.table * imports
  • dplyr * imports
  • gtools * imports
  • magrittr * imports
  • openxlsx * imports
  • stringr * imports
  • tidyr * imports
  • BSgenome.Hsapiens.UCSC.hg19 * suggests
  • BSgenome.Hsapiens.UCSC.hg38 * suggests
  • BSgenome.Mmusculus.UCSC.mm10 * suggests
  • knitr * suggests
  • rmarkdown * suggests
.github/workflows/rhub.yaml actions
  • r-hub/actions/checkout v1 composite
  • r-hub/actions/platform-info v1 composite
  • r-hub/actions/run-check v1 composite
  • r-hub/actions/setup v1 composite
  • r-hub/actions/setup-deps v1 composite
  • r-hub/actions/setup-r v1 composite