https://github.com/bioconductor-source/sparsedossa2

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (11.2%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: bioconductor-source
Language: HTML
Default Branch: devel
Size: 22.2 MB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created almost 2 years ago · Last pushed almost 2 years ago

Metadata Files

Readme

"Simulating realistic microbial observations with SparseDOSSA2"

Author Name: "Siyuan Ma"
Affiliation: Harvard T.H. Chan School of Public Health; Broad Institute
Email: siyuan.ma@pennmedicine.upenn.edu

Introduction

SparseDOSSA2 is an R package for fitting to and simulating realistic microbial abundance observations. It provides functionalities for: a) generation of realistic synthetic microbial observations, b) spiking-in of associations with metadata variables, e.g. for benchmarking or power analysis purposes, and c) fitting the SparseDOSSA2 model to real-world microbial abundance observations that can be used for a). This vignette is intended to provide working examples for these functionalities.

```
library(SparseDOSSA2)
# tidyverse packages for utilities
library(magrittr)
library(dplyr)
library(ggplot2)
```

Installation

SparseDOSSA2 is a Bioconductor package and can be installed via the following commands.

```
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("SparseDOSSA2")
```

Simulating realistic microbial observations with SparseDOSSA2

The most important functionality of SparseDOSSA2 is the simulation of realistic synthetic microbial observations. To this end, SparseDOSSA2 provides three pre-trained templates, "Stool", "Vaginal", and "IBD", targeting continuous, discrete, and diseased population structures, respectively.

```
Stool_simulation <- SparseDOSSA2(template = "Stool",
                                 n_sample = 100,
                                 n_feature = 100,
                                 verbose = TRUE)
Vaginal_simulation <- SparseDOSSA2(template = "Vaginal",
                                   n_sample = 100,
                                   n_feature = 100,
                                   verbose = TRUE)
```

Fitting to microbiome datasets with SparseDOSSA2

SparseDOSSA2 provides two functions, fit_SparseDOSSA2 and fitCV_SparseDOSSA2, to fit the SparseDOSSA2 model to microbial count or relative abundance observations. As input, both functions require a feature-by-sample table of microbial abundance observations. SparseDOSSA2 ships with a minimal example of such a dataset: a five-by-five subset of the HMP1-II stool study.

```
data("Stool_subset", package = "SparseDOSSA2")
# columns are samples.
Stool_subset[1:2, 1, drop = FALSE]
```
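To make the expected input shape concrete, the following base R sketch builds a small feature-by-sample count table and converts it to relative abundances. The taxon names, sample names, and counts are invented for illustration and are not taken from Stool_subset.

```r
# Hypothetical toy counts: rows are features (taxa), columns are samples.
toy_counts <- matrix(
  c(10,  0, 25,
     0,  5, 15,
    90, 45, 60),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("taxon_A", "taxon_B", "taxon_C"),    # made-up feature names
    c("sample_1", "sample_2", "sample_3")  # made-up sample names
  )
)

# Relative abundances: divide each column (sample) by its total count.
toy_rel <- sweep(toy_counts, 2, colSums(toy_counts), "/")

dim(toy_counts)   # features by samples
colSums(toy_rel)  # each sample sums to 1
```

Since the text above says both count and relative abundance observations are accepted, either toy_counts or toy_rel has the described layout.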

Fitting SparseDOSSA2 model with fit_SparseDOSSA2

fit_SparseDOSSA2 fits the SparseDOSSA2 model to estimate the model parameters: per-feature prevalence, mean and standard deviation of non-zero abundances, and feature-feature correlations. It also estimates the joint distribution of these parameters and (if the input is counts) a read count distribution.

```
fitted <- fit_SparseDOSSA2(data = Stool_subset,
                           control = list(verbose = TRUE))
# fitted mean log non-zero abundance values of the first two features
fitted$EM_fit$fit$mu[1:2]
```
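For intuition about what these per-feature parameters describe, the sketch below computes naive empirical analogues (prevalence, and the mean and standard deviation of log non-zero abundances) on a made-up relative abundance table in base R. This is not the estimation procedure used by fit_SparseDOSSA2; the data and the helper name `summarize_feature` are hypothetical.

```r
# Hypothetical relative abundances: rows are features, columns are samples.
toy_rel <- matrix(
  c(0.10, 0.00, 0.25, 0.40,
    0.00, 0.00, 0.15, 0.00,
    0.90, 1.00, 0.60, 0.60),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("taxon_", c("A", "B", "C")),
                  paste0("sample_", 1:4))
)

# Illustrative helper (not part of SparseDOSSA2).
summarize_feature <- function(x) {
  nonzero <- x[x > 0]
  c(
    prevalence = mean(x > 0),         # fraction of samples where the feature is present
    mean_log   = mean(log(nonzero)),  # mean of log non-zero abundances
    # standard deviation needs at least two non-zero values
    sd_log     = if (length(nonzero) > 1) sd(log(nonzero)) else NA_real_
  )
}

# One row of summaries per feature.
feature_summary <- t(apply(toy_rel, 1, summarize_feature))
feature_summary
```

A feature observed in only one sample (taxon_B here) has no empirical standard deviation, which hints at why a model-based fit is useful for sparse data.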

Fitting SparseDOSSA2 model with fitCV_SparseDOSSA2

The user can additionally select a cross-validated model fit via fitCV_SparseDOSSA2. They can either provide a vector of tuning parameter values (lambdas) to control sparsity in the estimation of the correlation matrix parameter, or let a grid be selected automatically. fitCV_SparseDOSSA2 uses cross validation to select an "optimal" model fit across these tuning parameters via average testing log-likelihood. This is a computationally intensive procedure, best suited to users who want an accurate fit to the input dataset in order to simulate new microbial observations on the same features as the input (i.e. not new features).

```
set.seed(1)
fitted_CV <- fitCV_SparseDOSSA2(data = Stool_subset,
                                lambdas = c(0.1, 1),
                                K = 2,
                                control = list(verbose = TRUE))
# the average log likelihood of different tuning parameters
apply(fitted_CV$EM_fit$logLik_CV, 2, mean)
# The second lambda (1) had better performance in terms of log likelihood,
# and will be selected as the default fit.
```
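The selection rule described above (pick the tuning parameter with the highest average testing log-likelihood across folds) can be sketched generically in base R. The example below is a toy stand-in, not the SparseDOSSA2 model: it uses simulated Gaussian data and a simple shrinkage estimator of the mean. The folds-in-rows, lambdas-in-columns layout of the local `logLik_CV` matrix is an assumption chosen to match the column-wise mean used above, and all other object names are illustrative.

```r
set.seed(1)
x <- rnorm(40, mean = 2, sd = 1)  # toy data, not microbial abundances
lambdas <- c(0.1, 1)              # candidate tuning parameter values
K <- 2                            # number of cross validation folds

# Randomly assign each observation to one of K folds.
folds <- sample(rep(seq_len(K), length.out = length(x)))

# Rows are folds, columns are tuning parameter values.
logLik_CV <- matrix(NA_real_, nrow = K, ncol = length(lambdas),
                    dimnames = list(paste0("fold", seq_len(K)),
                                    as.character(lambdas)))

for (k in seq_len(K)) {
  train <- x[folds != k]
  test  <- x[folds == k]
  for (j in seq_along(lambdas)) {
    # Shrink the training mean toward zero; a larger lambda shrinks more.
    mu_hat <- sum(train) / (length(train) + lambdas[j])
    # Average held-out (testing) Gaussian log-likelihood.
    logLik_CV[k, j] <- mean(dnorm(test, mean = mu_hat,
                                  sd = sd(train), log = TRUE))
  }
}

# Average over folds, then keep the lambda with the highest average.
avg_logLik <- apply(logLik_CV, 2, mean)
best_lambda <- lambdas[which.max(avg_logLik)]
```

Fixing the seed before assigning folds keeps the selection reproducible, which is also why the package example calls set.seed(1) before cross validation.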

Parallelization controls with future

SparseDOSSA2 internally uses the future package to allow for parallel computation. The user can thus specify parallelization through future's interface. See the reference manual for future for more details. This is particularly well suited to fitting SparseDOSSA2 in a high-performance computing environment.

```
# regular fitting
system.time(fitted_regular <-
              fit_SparseDOSSA2(data = Stool_subset,
                               control = list(verbose = FALSE)))

# parallel fitting with future:
future::plan(future::multisession)
system.time(fitted_parallel <-
              fit_SparseDOSSA2(data = Stool_subset,
                               control = list(verbose = FALSE)))

# For CV fitting, there are three components that can be parallelized, in order:
# different cross validation folds, different tuning parameter lambdas,
# and different samples. It is usually most efficient to parallelize at the
# sample level:
system.time(fitted_regular_CV <-
              fitCV_SparseDOSSA2(data = Stool_subset,
                                 lambdas = c(0.1, 1),
                                 K = 2,
                                 control = list(verbose = TRUE)))

# nested plan: folds and lambdas run sequentially, samples in parallel
future::plan(list(future::sequential, future::sequential, future::multisession))
system.time(fitted_parallel_CV <-
              fitCV_SparseDOSSA2(data = Stool_subset,
                                 lambdas = c(0.1, 1),
                                 K = 2,
                                 control = list(verbose = TRUE)))
```

Sessioninfo

```
sessionInfo()

R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] SparseDOSSA2_0.99.0 Rmpfr_0.8-2         gmp_0.6-1           igraph_1.2.6
 [5] truncnorm_1.0-8     magrittr_2.0.1      future.apply_1.7.0  future_1.21.0
 [9] huge_1.3.4.1        mvtnorm_1.1-1       ks_1.11.7           BiocCheck_1.22.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5          compiler_3.6.2      BiocManager_1.30.10 bitops_1.0-6
 [5] tools_3.6.2         digest_0.6.27       mclust_5.4.7        jsonlite_1.7.2
 [9] lattice_0.20-41     pkgconfig_2.0.3     Matrix_1.2-18       graph_1.64.0
[13] curl_4.3            parallel_3.6.2      xfun_0.20           stringr_1.4.0
[17] httr_1.4.2          knitr_1.30          globals_0.14.0      stats4_3.6.2
[21] grid_3.6.2          getopt_1.20.3       optparse_1.6.6      Biobase_2.46.0
[25] listenv_0.8.0       R6_2.5.0            parallelly_1.23.0   XML_3.99-0.3
[29] RBGL_1.62.1         codetools_0.2-18    biocViews_1.54.0    BiocGenerics_0.32.0
[33] MASS_7.3-53         stringdist_0.9.6.3  RUnit_0.4.32        KernSmooth_2.23-18
[37] stringi_1.5.3       RCurl_1.98-1.2
```

Owner

Name: (WIP DEV) Bioconductor Packages
Login: bioconductor-source
Kind: organization
Email: maintainer@bioconductor.org

Website: https://bioconductor.org
Repositories: 1
Profile: https://github.com/bioconductor-source

Source code for packages accepted into Bioconductor

Dependencies

DESCRIPTION cran

Rmpfr * depends
future.apply * depends
huge * depends
igraph * depends
ks * depends
magrittr * depends
mvtnorm * depends
truncnorm * depends
BiocStyle * suggests
cubature * suggests
knitr * suggests
rmarkdown * suggests
testthat >= 2.1.0 suggests
