MitoHEAR
MitoHEAR: an R package for the estimation and downstream statistical analysis of the mitochondrial DNA heteroplasmy calculated from single-cell datasets - Published in JOSS (2022)
Science Score: 95.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 8 DOI reference(s) in README and JOSS metadata -
✓Academic publication links
Links to: nature.com, joss.theoj.org -
✓Committers with academic emails
1 of 4 committers (25.0%) from academic institutions -
○Institutional organization owner
-
✓JOSS paper metadata
Published in Journal of Open Source Software
Keywords
Repository
MitoHEAR (Mitochondrial HEteroplasmy AnalyzeR) is an R package that allows the estimation as well as downstream statistical analysis of the mtDNA heteroplasmy calculated from single-cell datasets.
Basic Info
- Host: GitHub
- Owner: ScialdoneLab
- License: gpl-3.0
- Language: R
- Default Branch: master
- Homepage: https://scialdonelab.github.io/MitoHEAR/
- Size: 95.6 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 3
- Open Issues: 0
- Releases: 2
Topics
Metadata Files
README.md
MitoHEAR
MitoHEAR (Mitochondrial HEteroplasmy AnalyzeR) is an R package that allows the estimation as well as downstream statistical analysis of the mtDNA heteroplasmy calculated from single-cell datasets.
The package has been used in a recently published paper (Lima et al., 2021, Nature Metabolism), where we revealed that cells with higher levels of heteroplasmy are eliminated by cell competition in mouse embryos and are characterized by specific gene expression patterns.
Installation
You can install the released version of MitoHEAR from CRAN with:
r
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("MitoHEAR")
And the development version from GitHub with:
r
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("https://github.com/ScialdoneLab/MitoHEAR/tree/master")
library(MitoHEAR)
For installing also the vignettes provided within the package, please run the following:
r
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("https://github.com/ScialdoneLab/MitoHEAR/tree/master", build_vignettes = TRUE)
library(MitoHEAR)
Getting started
The package has two main functions: getrawcounts_allele and get_heteroplasmy.
getrawcounts_allele
r
get_raw_counts_allele(bam_input, path_fasta, cell_names, cores_number = 1)
- bam_input: character vector of sorted bam files (one for each sample) with full path.
- path_fasta: fasta file of the genomic region of interest with full path.
- cell_names: character vector with the names of the samples.
- cores_number: Number of cores to use.
In the same location of the sorted bam file, also the corresponding index bam file (.bai) should be present
Below an example of input using the development version of MitoHEAR from GitHub. The example is based on single cell RNA seq mouse embryo data from Lima et al., Nature Metabolism, 2021 :
First we download the input bam files (5 samples in the example):
```r currentwd <- getwd() url <- "https://hmgubox2.helmholtz-muenchen.de/index.php/s/7P9C57RxfKnH5Qx/download/inputbamfiles.tar.gz" destfile <- paste0(currentwd, "inputbamfiles.tar.gz") download.file(url, destfile, quiet = FALSE) untar(destfile, exdir=current_wd)
load(system.file("extdata", "afterqc.Rda", package = "MitoHEAR")) cellnames <- as.vector(afterqc$newname) cellnames <- cellnames[1:5] cellnames[1:5] [1] "24538814" "24538823" "24538839" "24538840" "24538847" ``` where afterqc is a dataframe with number of rows equal to the number of samples and with columns related to meta data information (i.e. cluster and batch).
r
path_to_bam <- paste0(current_wd, "input_bam_files/")
bam_input <- paste(path_to_bam, cell_names, ".unique.bam", sep = "")
path_fasta <- system.file("extdata", "Mus_musculus.GRCm38.dna.chromosome.MT.fa", package = "MitoHEAR")
output_SNP_mt <- get_raw_counts_allele(bam_input, path_fasta, cell_names)
The example of input bam files (with 5 samples) is available also here.
The output of getrawcounts_allele is a list with three elements: 1. matrixallelecounts: matrix with rows equal to cell_names and with columns equal to the bases in the path_fasta file with the four possible alleles. For each pair sample-base there is the information about the counts on the alleles A,C,G and T 2. namepositionallele: character vectors with length equal to the number of columns in matrixallelecounts with information about the name of the bases and the corresponding allele 3. name_position: character vectors with information about the name of the bases.
```r
development version from GitHub
load(system.file("extdata", "outputSNPmt.Rda", package = "MitoHEAR")) matrixallelecounts <- outputSNPmt[[1]]
In this example we have 723 cells and 65196 columns (4 possible alleles for the 16299 bases in the mouse MT genome)
dim(matrixallelecounts) [1] 723 65196 head(matrixallelecounts[1:5,1:5]) 1AGMT 1CGMT 1GGMT 1TGMT 2ATMT 24538814 0 0 0 0 0 24538823 0 0 0 0 0 24538839 0 0 0 0 0 24538840 0 0 0 0 0 245388_47 0 0 0 0 0
namepositionallele <- outputSNPmt[[2]] namepositionallele[1:8] [1] "1AGMT" "1CGMT" "1GGMT" "1TGMT" "2ATMT" "2CTMT" "2GTMT" "2TTMT"
nameposition <- outputSNPmt[[3]] nameposition[1:8] [1] "1MT" "1MT" "1MT" "1MT" "2MT" "2MT" "2MT" "2MT" ```
get_heteroplasmy
r
get_heteroplasmy(raw_counts_allele, name_position_allele, name_position, number_reads, number_positions, filtering = 1, my.clusters = NULL)
starts from the output of getrawcounts and performs a two step filtering procedure, the first on the cells and the second on the bases. The aim is to keep only the cells that have more than number_reads counts in more than number_positions bases and to keep only the bases that are covered by more than number_reads counts in all the cells (filtering=1) or in at least 50% of cells in each cluster (filtering=2, with cluster specified by my.clusters).
An example of input could be (using the development version from GitHub):
```r load(system.file("extdata", "outputSNPmt.Rda", package = "MitoHEAR")) load(system.file("extdata", "after_qc.Rda", package = "MitoHEAR"))
We compute heteroplasmy only for cells that are in the condition "Cell competition OFF" and belong to cluster 1, 3 or 4
row.names(afterqc) <- afterqc$newname cellsfmkepi <- afterqc[(afterqc$condition == "Cell competition OFF") & (afterqc$cluster == 1 | afterqc$cluster == 3 | afterqc$cluster == 4), "newname"] afterqcfmkepi <- afterqc[cellsfmkepi, ] my.clusters <- afterqcfmkepi$cluster
matrixallelecounts <- outputSNPmt[[1]] namepositionallele <- outputSNPmt[[2]] nameposition <- outputSNP_mt[[3]]
epiblastcellcompetition <- getheteroplasmy(matrixallelecounts[cellsfmkepi, ], namepositionallele, nameposition, numberreads=50, numberpositions=2000, filtering = 2, my.clusters) ``` The output of get_heteroplasmy is a list with five elements. The most relevant elements are the matrix with heteroplasmy values (heteroplasmy_matrix) and the matrix with allele frequencies (allele_matrix), for all the cells and bases that pass the two step filtering procedures. The heteroplasmy is computed as 1-max(f), where f are the frequencies of the four alleles for every sample-base pair. For more info about the output see ?get_heteroplasmy.
```r heteroplasmymatrix <- epiblastcell_competition[[3]]
261 cells and 5736 bases pass the two step filtering procedures
dim(heteroplasmymatrix) [1] 261 5376 head(heteroplasmymatrix[1:5,1:5]) 136MT 138MT 140MT 141MT 142MT 24538814 0 0.0000000 0 0.000000000 0 24538839 0 0.0000000 0 0.000000000 0 24538840 0 0.0000000 0 0.008695652 0 24538847 0 0.0952381 0 0.000000000 0 245388_69 0 0.0000000 0 0.000000000 0
allelematrix <- epiblastcell_competition[[4]]
261 cells and 21504 allele-base (4 possible alleles for the 5736 bases).
dim(allelematrix) [1] 261 21504 head(allelematrix[1:4,1:4]) 136AGMT 136CGMT 136GGMT 136TGMT 24538814 0 0 1 0 24538839 0 0 1 0 24538840 0 0 1 0 24538847 0 0 1 0
```
Down-stream analysis
MitoHEAR offers several ways to extrapolate relevant information from heteroplasmy measurement. For the identification of most different bases according to heteroplasmy between two group of cells (i.e. two clusters), an unpaired two-samples Wilcoxon test is performed with the function getwilcoxtest. The heteroplasmy and the corresponding allele frequencies for a specific base can be plotted with plot_heteroplasmy and plotallelefrequency. The example is based again on single cell RNA seq mouse embryo data from Lima et al., Nature Metabolism, 2021 and is possible to run it using the development version of MitoHEAR.
Pre-processing of the dataset following the same steps described in the section Getting started: ```r load(system.file("extdata", "outputSNPmt.Rda", package = "MitoHEAR")) load(system.file("extdata", "afterqc.Rda", package = "MitoHEAR")) row.names(afterqc) <- afterqc$newname cellsfmkepi <- afterqc[(afterqc$condition == "Cell competition OFF") & (afterqc$cluster == 1 | afterqc$cluster == 3 | afterqc$cluster == 4), "newname"] afterqcfmkepi <- afterqc[cellsfmkepi, ] my.clusters <- afterqcfmk_epi$cluster
matrixallelecounts <- outputSNPmt[[1]] namepositionallele <- outputSNPmt[[2]] nameposition <- outputSNP_mt[[3]]
epiblastcellcompetition <- getheteroplasmy(matrixallelecounts[cellsfmkepi, ], namepositionallele, nameposition, numberreads=50, numberpositions=2000, filtering = 2, my.clusters) ```
```r summatrix <- epiblastcellcompetition[[1]] summatrixqc <- epiblastcellcompetition[[2]] heteroplasmymatrixci <- epiblastcellcompetition[[3]] allelematrixci <- epiblastcellcompetition[[4]] clusterci <- as.character(afterqc[row.names(heteroplasmymatrixci), ]$cluster) clusterci[clusterci == "1"] <- "Winner Epiblast" clusterci[clusterci == "3"] <- "Intermediate" clusterci[clusterci == "4"] <- "Loser Epiblast" indexci <- epiblastcellcompetition[[5]]
namepositionalleleqc <- namepositionallele[nameposition %in% colnames(summatrixqc)] namepositionqc <- nameposition[nameposition %in% colnames(summatrixqc)] ```
It is possible to perform an additional filtering step on the bases keeping only the ones with an heteroplasmy value above min_heteroplasmy in more than min_cells.
r
relevant_bases <- filter_bases(heteroplasmy_matrix_ci, min_heteroplasmy = 0.01, min_cells = 10, index_ci)
For detecting the difference in heteroplasmy values between two group of cells (i.e. two clusters), an unpaired two-samples Wilcoxon test is performed. In this case we run the test between the clusters Winner Epiblast and Loser Epiblast. As output, for each base, there is the adjusted p valued (FDR).
r
p_value_wilcox_test <- get_wilcox_test(heteroplasmy_matrix_ci[, relevant_bases], cluster_ci, "Winner Epiblast", "Loser Epiblast" , index_ci)
p_value_wilcox_test_sort <- sort(p_value_wilcox_test, decreasing = F)
The heteroplasmy and the corresponding allele frequencies for the most relevant base (according to Wilcoxon test) is shown.
r
plot_heteroplasmy(names(p_value_wilcox_test_sort)[1], heteroplasmy_matrix_ci, cluster_ci, index_ci)
plot_allele_frequency(names(p_value_wilcox_test_sort)[1], heteroplasmy_matrix_ci, allele_matrix_ci, cluster_ci, name_position_qc, name_position_allele_qc, 5, index_ci)

If for each sample a diffusion pseudo time information is available, then it is possible to detect the bases whose heteroplasmy changes in a significant way along pseudo-time with dpt_test and to plot the trend with plot_dpt. To run the function dpt_test with parameter method equal to GAM, the package gam need to be installed. ```r time <- afterqc[row.names(heteroplasmymatrixci), ]$pseudotime
install.packages('gam')
dptanalysis <- dpttest(heteroplasmymatrixci[, relevantbases], time, indexci, method = "GAM")
plotdpt(dptanalysis$Position[1], heteroplasmymatrixci, clusterci, time, dptanalysis, index_ci)
```

It is also possible to perform a cluster analysis on the samples based on distance matrix obtained from allele frequencies with clusteringangulardistance and to visualize an heatmap of the distance matrix with samples sorted according to the cluster result with plotdistancematrix. This approach could be useful for lineage tracing analysis. The data shown in the example below (reproducible with the development version of MitoHEAR) is bulk RNA seq mouse data from two mtDNA cell lines labelled Loser and Winner (Lima et al., Nature Metabolism, 2021 ).
Pre-processing of the dataset following the same steps described in the section Getting started: ```r load(system.file("extdata", "metadataanafinalbig.Rda", package = "MitoHEAR")) load(system.file("extdata", "metadataanafinalsmall.Rda", package = "MitoHEAR")) cellnames <- as.vector(metadataanafinalbig$namecell) cellnamesbulk <- cellnames load(system.file("extdata", "outputSNPanamt.Rda", package = "MitoHEAR")) matrixallelecounts <- outputSNPanamt[[1]] namepositionallele <- outputSNPanamt[[2]] nameposition <- outputSNPanamt[[3]] row.names(metadataanafinalbig) <- metadataanafinalbig$namecell metadataanafinalbig <- metadataanafinalbig[row.names(matrixallelecounts),] deleteduplicate <- metadataanafinalbig$namecell[seq(1,48,3)] metadataanafinalbigfilter <- metadataanafinalbig[deleteduplicate,] bulksample <- metadataanafinalbigfilter$namecelloriginal matrixallelecounts <- matrixallelecounts[deleteduplicate,] row.names(matrixallelecounts) <- bulksample row.names(metadataanafinalsmall) <- metadataanafinalsmall$namefastq
bulkdatacompetition <- getheteroplasmy(matrixallelecounts[bulksample,],namepositionallele,nameposition,50,2000,filtering = 1) heteroplasmymatrixbulk <- bulkdatacompetition[[3]] allelematrixbulk <- bulkdatacompetition[[4]] clusterbulk <- as.character(metadataanafinalsmall[row.names(heteroplasmymatrixbulk),]$status) indexbulk <- bulkdata_competition[[5]] ```
Unsupervised hierarchical clustering on the samples based on a distance matrix with the function clusteringdistang.
```r resultclusteringsc <- clusteringangulardistance(heteroplasmymatrixbulk, allelematrixbulk, clusterbulk, length(colnames(heteroplasmymatrixbulk)), deepSplitparam = 1, minClusterSizeparam = 8, 0.2, minvalue = 0.001, index = indexbulk, relevantbases = NULL)
oldnewclassification <- resultclusteringsc[[1]] distmatrixsc <- resultclusteringsc[[2]] topdist <- resultclusteringsc[[3]] oldclassification <- as.vector(oldnewclassification[, 1])
plotdistancematrix(distmatrixsc, old_classification)
```

For more exhaustive information about the functions offered by MitoHEAR see Vignettes section below and the help page of the single functions. (?function_name).
Vignettes
The following vignettes are provided within the package MitoHEAR (development version from GitHub) and are accessible within R.
For a complete overview of the vignettes without the need of using R, please click here
cellcompetitionmtexamplenotebook.Rmd:
r
utils::vignette("cell_competition_mt_example_notebook")
This tutorial uses single cell RNA seq mouse embryo data (Lima et al., Nature Metabolism, 2021)(Smart-Seq2 protocol).
The heteroplasmy is computed for the mouse mitochondrial genome.
Identification and plotting of most different bases according to heteroplasmy between clusters (with getwilcoxtest, plot_heteroplasmy and plotallelefrequency) and along pseudo time (with dpt_test and plot_dpt) are shown.
The top 10 bases with highest variation in heteroplasmy belong to the genes mt-Rnr1 and mt-Rnr2 and in these positions the heteroplasmy always increases with the diffusion pseudo time.
cellcompetitionbulkdatamtexamplenotebook.Rmd:
r
utils::vignette("cell_competition_bulk_data_mt_example_notebook")
This tutorial uses bulk RNA seq data from data from two mtDNA cell lines( Lima et al., Nature Metabolism, 2021 ).
Cluster analysis among samples based on allele frequency values (done with clusteringangulardistance) reveals that we can perfectly distinguish between the two cell lines only by looking at the heteroplasmy values of the mitochondrial bases.
lineagetracingexample_notebook.Rmd:
r
utils::vignette("lineage_tracing_example_notebook")
This tutorial uses single cell RNA seq mouse embryo data from Goolam et al., Cell, 2016 (Smart-Seq2 protocol).
There are embryos at different stages from 2-cells to 8-cells stage. At each stage, for every cell it is known the embryo of origin.
We illustrate how unsupervised cluster analysis based on allele frequencies information (performed with clusteringangulardistance) could be used in order to perform a lineage tracing analysis, by grouping together cells which are from the same embryo.
Ludwigetalexamplenotebook.Rmd:
r
utils::vignette("Ludwig_et_al_example_notebook")
This tutorial uses two single cell RNA seq human cells dataset from Ludwig et al., Cell, 2019 .
We illustrate how unsupervised cluster analysis based on allele frequencies information (performed with clusteringangulardistance) can be used in order to aggregate cells. The result from unsupervised cluster analysis are consistent with previously available information (colonies of cells, donors).
Citation
Lubatti et al., (2022). MitoHEAR: an R package for the estimation and downstream statistical analysis of the mitochondrial DNA heteroplasmy calculated from single-cell datasets. Journal of Open Source Software, 7(74), 4265, https://doi.org/10.21105/joss.04265
Contributions and Support
Contributions in the form of feedback, comments, code and bug report are welcome. * For any contributions, feel free to fork the source code and submit a pull requests. * Please report any issues or bugs here: https://github.com/ScialdoneLab/MitoHEAR/issues. Any questions and requests for support can also be directed to the package maintainer (gabriele[dot]lubatti[at]helmholtz-muenchen[dot]de).
Owner
- Name: Scialdone Lab
- Login: ScialdoneLab
- Kind: organization
- Email: antonio.scialdone@helmholtz-muenchen.de
- Repositories: 7
- Profile: https://github.com/ScialdoneLab
JOSS Publication
MitoHEAR: an R package for the estimation and downstream statistical analysis of the mitochondrial DNA heteroplasmy calculated from single-cell datasets
Authors
Institute of Epigenetics and Stem Cells, Helmholtz Zentrum München, Munich, Germany, Institute of Functional Epigenetics, Helmholtz Zentrum München, Neuherberg, Germany, Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany
Institute of Epigenetics and Stem Cells, Helmholtz Zentrum München, Munich, Germany, Institute of Functional Epigenetics, Helmholtz Zentrum München, Neuherberg, Germany, Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany
Institute of Epigenetics and Stem Cells, Helmholtz Zentrum München, Munich, Germany, Institute of Functional Epigenetics, Helmholtz Zentrum München, Neuherberg, Germany, Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany
Tags
bioinformatics single cell RNA seq heteroplasmyGitHub Events
Total
- Issues event: 1
Last Year
- Issues event: 1
Committers
Last synced: 7 months ago
Top Committers
| Name | Commits | |
|---|---|---|
| gabrielelubatti | 6****i | 263 |
| Batool Almarzouq | 5****M | 7 |
| Daniel S. Katz | d****z@i****g | 1 |
| Charlotte Soneson | c****n@g****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 4
- Total pull requests: 3
- Average time to close issues: 1 day
- Average time to close pull requests: about 15 hours
- Total issue authors: 3
- Total pull request authors: 3
- Average comments per issue: 0.75
- Average comments per pull request: 0.0
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- BatoolMM (2)
- vdd346 (1)
- sahilseth (1)
Pull Request Authors
- BatoolMM (1)
- csoneson (1)
- danielskatz (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- R >= 4.0 depends
- Biostrings * imports
- ComplexHeatmap * imports
- GenomicRanges * imports
- IRanges * imports
- Rsamtools * imports
- circlize * imports
- dynamicTreeCut * imports
- ggplot2 * imports
- gridExtra * imports
- magrittr * imports
- mcclust * imports
- rdist * imports
- reshape2 * imports
- rlist * imports
- clustree * suggests
- fmsb * suggests
- gam * suggests
- karyoploteR * suggests
- knitr * suggests
- plotly * suggests
- regioneR * suggests
- rmarkdown * suggests
- testthat * suggests
