MitoHEAR

MitoHEAR: an R package for the estimation and downstream statistical analysis of the mitochondrial DNA heteroplasmy calculated from single-cell datasets - Published in JOSS (2022)

https://github.com/scialdonelab/mitohear

Keywords

bioinformatics-analysis heteroplasmy open-source rstats-package

Last synced: 6 months ago · JSON representation

Repository

MitoHEAR (Mitochondrial HEteroplasmy AnalyzeR) is an R package that allows the estimation as well as downstream statistical analysis of the mtDNA heteroplasmy calculated from single-cell datasets.

Basic Info

Host: GitHub
Owner: ScialdoneLab
License: gpl-3.0
Language: R
Default Branch: master
Homepage: https://scialdonelab.github.io/MitoHEAR/
Size: 95.6 MB

Statistics

Stars: 1
Watchers: 1
Forks: 3
Open Issues: 0
Releases: 2

Topics

bioinformatics-analysis heteroplasmy open-source rstats-package

Created over 4 years ago · Last pushed over 3 years ago

Metadata Files

Readme License

MitoHEAR

MitoHEAR (Mitochondrial HEteroplasmy AnalyzeR) is an R package that allows the estimation as well as downstream statistical analysis of the mtDNA heteroplasmy calculated from single-cell datasets.

The package has been used in a recently published paper (Lima et al., 2021, Nature Metabolism), where we revealed that cells with higher levels of heteroplasmy are eliminated by cell competition in mouse embryos and are characterized by specific gene expression patterns.

Installation

You can install the released version of MitoHEAR from CRAN with:

r if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("MitoHEAR")

And the development version from GitHub with: r if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools") devtools::install_github("https://github.com/ScialdoneLab/MitoHEAR/tree/master") library(MitoHEAR) For installing also the vignettes provided within the package, please run the following: r if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools") devtools::install_github("https://github.com/ScialdoneLab/MitoHEAR/tree/master", build_vignettes = TRUE) library(MitoHEAR)

Getting started

The package has two main functions: getrawcounts_allele and get_heteroplasmy.

getrawcounts_allele

r get_raw_counts_allele(bam_input, path_fasta, cell_names, cores_number = 1)

bam_input: character vector of sorted bam files (one for each sample) with full path.
path_fasta: fasta file of the genomic region of interest with full path.
cell_names: character vector with the names of the samples.
cores_number: Number of cores to use.

In the same location of the sorted bam file, also the corresponding index bam file (.bai) should be present

Below an example of input using the development version of MitoHEAR from GitHub. The example is based on single cell RNA seq mouse embryo data from Lima et al., Nature Metabolism, 2021 :

First we download the input bam files (5 samples in the example):

```r currentwd <- getwd() url <- "https://hmgubox2.helmholtz-muenchen.de/index.php/s/7P9C57RxfKnH5Qx/download/inputbamfiles.tar.gz" destfile <- paste0(currentwd, "inputbamfiles.tar.gz") download.file(url, destfile, quiet = FALSE) untar(destfile, exdir=current_wd)

load(system.file("extdata", "afterqc.Rda", package = "MitoHEAR")) cellnames <- as.vector(afterqc$newname) cellnames <- cellnames[1:5] cellnames[1:5] [1] "24538814" "24538823" "24538839" "24538840" "24538847" ``` where afterqc is a dataframe with number of rows equal to the number of samples and with columns related to meta data information (i.e. cluster and batch).

r path_to_bam <- paste0(current_wd, "input_bam_files/") bam_input <- paste(path_to_bam, cell_names, ".unique.bam", sep = "") path_fasta <- system.file("extdata", "Mus_musculus.GRCm38.dna.chromosome.MT.fa", package = "MitoHEAR") output_SNP_mt <- get_raw_counts_allele(bam_input, path_fasta, cell_names)

The example of input bam files (with 5 samples) is available also here.

The output of getrawcounts_allele is a list with three elements: 1. matrixallelecounts: matrix with rows equal to cell_names and with columns equal to the bases in the path_fasta file with the four possible alleles. For each pair sample-base there is the information about the counts on the alleles A,C,G and T 2. namepositionallele: character vectors with length equal to the number of columns in matrixallelecounts with information about the name of the bases and the corresponding allele 3. name_position: character vectors with information about the name of the bases.

```r

development version from GitHub

load(system.file("extdata", "outputSNPmt.Rda", package = "MitoHEAR")) matrixallelecounts <- outputSNPmt[[1]]

In this example we have 723 cells and 65196 columns (4 possible alleles for the 16299 bases in the mouse MT genome)

dim(matrixallelecounts) [1] 723 65196 head(matrixallelecounts[1:5,1:5]) 1AGMT 1CGMT 1GGMT 1TGMT 2ATMT 24538814 0 0 0 0 0 24538823 0 0 0 0 0 24538839 0 0 0 0 0 24538840 0 0 0 0 0 245388_47 0 0 0 0 0

namepositionallele <- outputSNPmt[[2]] namepositionallele[1:8] [1] "1AGMT" "1CGMT" "1GGMT" "1TGMT" "2ATMT" "2CTMT" "2GTMT" "2TTMT"

nameposition <- outputSNPmt[[3]] nameposition[1:8] [1] "1MT" "1MT" "1MT" "1MT" "2MT" "2MT" "2MT" "2MT" ```

get_heteroplasmy

r get_heteroplasmy(raw_counts_allele, name_position_allele, name_position, number_reads, number_positions, filtering = 1, my.clusters = NULL) starts from the output of getrawcounts and performs a two step filtering procedure, the first on the cells and the second on the bases. The aim is to keep only the cells that have more than number_reads counts in more than number_positions bases and to keep only the bases that are covered by more than number_reads counts in all the cells (filtering=1) or in at least 50% of cells in each cluster (filtering=2, with cluster specified by my.clusters).

An example of input could be (using the development version from GitHub):

```r load(system.file("extdata", "outputSNPmt.Rda", package = "MitoHEAR")) load(system.file("extdata", "after_qc.Rda", package = "MitoHEAR"))

We compute heteroplasmy only for cells that are in the condition "Cell competition OFF" and belong to cluster 1, 3 or 4

row.names(afterqc) <- afterqc$newname cellsfmkepi <- afterqc[(afterqc$condition == "Cell competition OFF") & (afterqc$cluster == 1 | afterqc$cluster == 3 | afterqc$cluster == 4), "newname"] afterqcfmkepi <- afterqc[cellsfmkepi, ] my.clusters <- afterqcfmkepi$cluster

matrixallelecounts <- outputSNPmt[[1]] namepositionallele <- outputSNPmt[[2]] nameposition <- outputSNP_mt[[3]]

epiblastcellcompetition <- getheteroplasmy(matrixallelecounts[cellsfmkepi, ], namepositionallele, nameposition, numberreads=50, numberpositions=2000, filtering = 2, my.clusters) ``` The output of get_heteroplasmy is a list with five elements. The most relevant elements are the matrix with heteroplasmy values (heteroplasmy_matrix) and the matrix with allele frequencies (allele_matrix), for all the cells and bases that pass the two step filtering procedures. The heteroplasmy is computed as 1-max(f), where f are the frequencies of the four alleles for every sample-base pair. For more info about the output see ?get_heteroplasmy.

```r heteroplasmymatrix <- epiblastcell_competition[[3]]

261 cells and 5736 bases pass the two step filtering procedures

dim(heteroplasmymatrix) [1] 261 5376 head(heteroplasmymatrix[1:5,1:5]) 136MT 138MT 140MT 141MT 142MT 24538814 0 0.0000000 0 0.000000000 0 24538839 0 0.0000000 0 0.000000000 0 24538840 0 0.0000000 0 0.008695652 0 24538847 0 0.0952381 0 0.000000000 0 245388_69 0 0.0000000 0 0.000000000 0

allelematrix <- epiblastcell_competition[[4]]

261 cells and 21504 allele-base (4 possible alleles for the 5736 bases).

dim(allelematrix) [1] 261 21504 head(allelematrix[1:4,1:4]) 136AGMT 136CGMT 136GGMT 136TGMT 24538814 0 0 1 0 24538839 0 0 1 0 24538840 0 0 1 0 24538847 0 0 1 0

```

Down-stream analysis

MitoHEAR offers several ways to extrapolate relevant information from heteroplasmy measurement. For the identification of most different bases according to heteroplasmy between two group of cells (i.e. two clusters), an unpaired two-samples Wilcoxon test is performed with the function getwilcoxtest. The heteroplasmy and the corresponding allele frequencies for a specific base can be plotted with plot_heteroplasmy and plotallelefrequency. The example is based again on single cell RNA seq mouse embryo data from Lima et al., Nature Metabolism, 2021 and is possible to run it using the development version of MitoHEAR.

Pre-processing of the dataset following the same steps described in the section Getting started: ```r load(system.file("extdata", "outputSNPmt.Rda", package = "MitoHEAR")) load(system.file("extdata", "afterqc.Rda", package = "MitoHEAR")) row.names(afterqc) <- afterqc$newname cellsfmkepi <- afterqc[(afterqc$condition == "Cell competition OFF") & (afterqc$cluster == 1 | afterqc$cluster == 3 | afterqc$cluster == 4), "newname"] afterqcfmkepi <- afterqc[cellsfmkepi, ] my.clusters <- afterqcfmk_epi$cluster

matrixallelecounts <- outputSNPmt[[1]] namepositionallele <- outputSNPmt[[2]] nameposition <- outputSNP_mt[[3]]

epiblastcellcompetition <- getheteroplasmy(matrixallelecounts[cellsfmkepi, ], namepositionallele, nameposition, numberreads=50, numberpositions=2000, filtering = 2, my.clusters) ```

```r summatrix <- epiblastcellcompetition[[1]] summatrixqc <- epiblastcellcompetition[[2]] heteroplasmymatrixci <- epiblastcellcompetition[[3]] allelematrixci <- epiblastcellcompetition[[4]] clusterci <- as.character(afterqc[row.names(heteroplasmymatrixci), ]$cluster) clusterci[clusterci == "1"] <- "Winner Epiblast" clusterci[clusterci == "3"] <- "Intermediate" clusterci[clusterci == "4"] <- "Loser Epiblast" indexci <- epiblastcellcompetition[[5]]

namepositionalleleqc <- namepositionallele[nameposition %in% colnames(summatrixqc)] namepositionqc <- nameposition[nameposition %in% colnames(summatrixqc)] ```

It is possible to perform an additional filtering step on the bases keeping only the ones with an heteroplasmy value above min_heteroplasmy in more than min_cells. r relevant_bases <- filter_bases(heteroplasmy_matrix_ci, min_heteroplasmy = 0.01, min_cells = 10, index_ci) For detecting the difference in heteroplasmy values between two group of cells (i.e. two clusters), an unpaired two-samples Wilcoxon test is performed. In this case we run the test between the clusters Winner Epiblast and Loser Epiblast. As output, for each base, there is the adjusted p valued (FDR). r p_value_wilcox_test <- get_wilcox_test(heteroplasmy_matrix_ci[, relevant_bases], cluster_ci, "Winner Epiblast", "Loser Epiblast" , index_ci) p_value_wilcox_test_sort <- sort(p_value_wilcox_test, decreasing = F) The heteroplasmy and the corresponding allele frequencies for the most relevant base (according to Wilcoxon test) is shown. r plot_heteroplasmy(names(p_value_wilcox_test_sort)[1], heteroplasmy_matrix_ci, cluster_ci, index_ci) plot_allele_frequency(names(p_value_wilcox_test_sort)[1], heteroplasmy_matrix_ci, allele_matrix_ci, cluster_ci, name_position_qc, name_position_allele_qc, 5, index_ci)

If for each sample a diffusion pseudo time information is available, then it is possible to detect the bases whose heteroplasmy changes in a significant way along pseudo-time with dpt_test and to plot the trend with plot_dpt. To run the function dpt_test with parameter method equal to GAM, the package gam need to be installed. ```r time <- afterqc[row.names(heteroplasmymatrixci), ]$pseudotime

install.packages('gam')

dptanalysis <- dpttest(heteroplasmymatrixci[, relevantbases], time, indexci, method = "GAM") plotdpt(dptanalysis$Position[1], heteroplasmymatrixci, clusterci, time, dptanalysis, index_ci) ```

It is also possible to perform a cluster analysis on the samples based on distance matrix obtained from allele frequencies with clusteringangulardistance and to visualize an heatmap of the distance matrix with samples sorted according to the cluster result with plotdistancematrix. This approach could be useful for lineage tracing analysis. The data shown in the example below (reproducible with the development version of MitoHEAR) is bulk RNA seq mouse data from two mtDNA cell lines labelled Loser and Winner (Lima et al., Nature Metabolism, 2021 ).

Pre-processing of the dataset following the same steps described in the section Getting started: ```r load(system.file("extdata", "metadataanafinalbig.Rda", package = "MitoHEAR")) load(system.file("extdata", "metadataanafinalsmall.Rda", package = "MitoHEAR")) cellnames <- as.vector(metadataanafinalbig$namecell) cellnamesbulk <- cellnames load(system.file("extdata", "outputSNPanamt.Rda", package = "MitoHEAR")) matrixallelecounts <- outputSNPanamt[[1]] namepositionallele <- outputSNPanamt[[2]] nameposition <- outputSNPanamt[[3]] row.names(metadataanafinalbig) <- metadataanafinalbig$namecell metadataanafinalbig <- metadataanafinalbig[row.names(matrixallelecounts),] deleteduplicate <- metadataanafinalbig$namecell[seq(1,48,3)] metadataanafinalbigfilter <- metadataanafinalbig[deleteduplicate,] bulksample <- metadataanafinalbigfilter$namecelloriginal matrixallelecounts <- matrixallelecounts[deleteduplicate,] row.names(matrixallelecounts) <- bulksample row.names(metadataanafinalsmall) <- metadataanafinalsmall$namefastq

bulkdatacompetition <- getheteroplasmy(matrixallelecounts[bulksample,],namepositionallele,nameposition,50,2000,filtering = 1) heteroplasmymatrixbulk <- bulkdatacompetition[[3]] allelematrixbulk <- bulkdatacompetition[[4]] clusterbulk <- as.character(metadataanafinalsmall[row.names(heteroplasmymatrixbulk),]$status) indexbulk <- bulkdata_competition[[5]] ```

Unsupervised hierarchical clustering on the samples based on a distance matrix with the function clusteringdistang.

```r resultclusteringsc <- clusteringangulardistance(heteroplasmymatrixbulk, allelematrixbulk, clusterbulk, length(colnames(heteroplasmymatrixbulk)), deepSplitparam = 1, minClusterSizeparam = 8, 0.2, minvalue = 0.001, index = indexbulk, relevantbases = NULL)

oldnewclassification <- resultclusteringsc[[1]] distmatrixsc <- resultclusteringsc[[2]] topdist <- resultclusteringsc[[3]] oldclassification <- as.vector(oldnewclassification[, 1])

plotdistancematrix(distmatrixsc, old_classification) ```

For more exhaustive information about the functions offered by MitoHEAR see Vignettes section below and the help page of the single functions. (?function_name).

Vignettes

The following vignettes are provided within the package MitoHEAR (development version from GitHub) and are accessible within R.

For a complete overview of the vignettes without the need of using R, please click here

cellcompetitionmtexamplenotebook.Rmd:

r utils::vignette("cell_competition_mt_example_notebook") This tutorial uses single cell RNA seq mouse embryo data (Lima et al., Nature Metabolism, 2021)(Smart-Seq2 protocol). The heteroplasmy is computed for the mouse mitochondrial genome. Identification and plotting of most different bases according to heteroplasmy between clusters (with getwilcoxtest, plot_heteroplasmy and plotallelefrequency) and along pseudo time (with dpt_test and plot_dpt) are shown. The top 10 bases with highest variation in heteroplasmy belong to the genes mt-Rnr1 and mt-Rnr2 and in these positions the heteroplasmy always increases with the diffusion pseudo time.

cellcompetitionbulkdatamtexamplenotebook.Rmd:

r utils::vignette("cell_competition_bulk_data_mt_example_notebook") This tutorial uses bulk RNA seq data from data from two mtDNA cell lines( Lima et al., Nature Metabolism, 2021 ). Cluster analysis among samples based on allele frequency values (done with clusteringangulardistance) reveals that we can perfectly distinguish between the two cell lines only by looking at the heteroplasmy values of the mitochondrial bases.

lineagetracingexample_notebook.Rmd:

r utils::vignette("lineage_tracing_example_notebook") This tutorial uses single cell RNA seq mouse embryo data from Goolam et al., Cell, 2016 (Smart-Seq2 protocol). There are embryos at different stages from 2-cells to 8-cells stage. At each stage, for every cell it is known the embryo of origin. We illustrate how unsupervised cluster analysis based on allele frequencies information (performed with clusteringangulardistance) could be used in order to perform a lineage tracing analysis, by grouping together cells which are from the same embryo.

Ludwigetalexamplenotebook.Rmd:

r utils::vignette("Ludwig_et_al_example_notebook") This tutorial uses two single cell RNA seq human cells dataset from Ludwig et al., Cell, 2019 . We illustrate how unsupervised cluster analysis based on allele frequencies information (performed with clusteringangulardistance) can be used in order to aggregate cells. The result from unsupervised cluster analysis are consistent with previously available information (colonies of cells, donors).

Citation

Lubatti et al., (2022). MitoHEAR: an R package for the estimation and downstream statistical analysis of the mitochondrial DNA heteroplasmy calculated from single-cell datasets. Journal of Open Source Software, 7(74), 4265, https://doi.org/10.21105/joss.04265

Contributions and Support

Contributions in the form of feedback, comments, code and bug report are welcome. * For any contributions, feel free to fork the source code and submit a pull requests. * Please report any issues or bugs here: https://github.com/ScialdoneLab/MitoHEAR/issues. Any questions and requests for support can also be directed to the package maintainer (gabriele[dot]lubatti[at]helmholtz-muenchen[dot]de).

Owner

Name: Scialdone Lab
Login: ScialdoneLab
Kind: organization
Email: antonio.scialdone@helmholtz-muenchen.de

Repositories: 7
Profile: https://github.com/ScialdoneLab

JOSS Publication

MitoHEAR: an R package for the estimation and downstream statistical analysis of the mitochondrial DNA heteroplasmy calculated from single-cell datasets

Published

June 01, 2022

DOI

10.21105/joss.04265

Volume 7, Issue 74, Page 4265

Authors

Gabriele Lubatti
Institute of Epigenetics and Stem Cells, Helmholtz Zentrum München, Munich, Germany, Institute of Functional Epigenetics, Helmholtz Zentrum München, Neuherberg, Germany, Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany

Elmir Mahammadov
Institute of Epigenetics and Stem Cells, Helmholtz Zentrum München, Munich, Germany, Institute of Functional Epigenetics, Helmholtz Zentrum München, Neuherberg, Germany, Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany

Antonio Scialdone
Institute of Epigenetics and Stem Cells, Helmholtz Zentrum München, Munich, Germany, Institute of Functional Epigenetics, Helmholtz Zentrum München, Neuherberg, Germany, Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany

Editor

Charlotte Soneson

GitHub Events

Total

Issues event: 1

Last Year

Issues event: 1

Committers

Last synced: 7 months ago

All Time

Total Commits: 272
Total Committers: 4
Avg Commits per committer: 68.0
Development Distribution Score (DDS): 0.033

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
gabrielelubatti	6****i	263
Batool Almarzouq	5****M	7
Daniel S. Katz	d**z@i**g	1
Charlotte Soneson	c**n@g**m	1

Committer Domains (Top 20 + Academic)

ieee.org: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 4
Total pull requests: 3
Average time to close issues: 1 day
Average time to close pull requests: about 15 hours
Total issue authors: 3
Total pull request authors: 3
Average comments per issue: 0.75
Average comments per pull request: 0.0
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

MitoHEAR

Science Score: 95.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

MitoHEAR

Installation

Getting started

getrawcounts_allele

development version from GitHub

In this example we have 723 cells and 65196 columns (4 possible alleles for the 16299 bases in the mouse MT genome)

get_heteroplasmy

We compute heteroplasmy only for cells that are in the condition "Cell competition OFF" and belong to cluster 1, 3 or 4

261 cells and 5736 bases pass the two step filtering procedures

261 cells and 21504 allele-base (4 possible alleles for the 5736 bases).

Down-stream analysis

install.packages('gam')

Vignettes

cellcompetitionmtexamplenotebook.Rmd:

cellcompetitionbulkdatamtexamplenotebook.Rmd:

lineagetracingexample_notebook.Rmd:

Ludwigetalexamplenotebook.Rmd:

Citation

Contributions and Support

Owner

JOSS Publication

MitoHEAR: an R package for the estimation and downstream statistical analysis of the mitochondrial DNA heteroplasmy calculated from single-cell datasets

Authors

Editor

Tags

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Committer Domains (Top 20 + Academic)

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Dependencies