latentcor

latentcor: An R Package for estimating latent correlations from mixed data types - Published in JOSS (2021)

https://github.com/mingzehuang/latentcor

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: arxiv.org, joss.theoj.org, zenodo.org
  • Committers with academic emails
    2 of 5 committers (40.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

data-analysis data-mining data-processing data-science data-structures machine-learning mixed-types r statistics

Scientific Fields

Engineering Computer Science - 40% confidence
Last synced: 6 months ago

Repository

latentcor is an R package that provides estimation of latent correlations for mixed data types (continuous, binary, truncated, and ternary).

Basic Info
Statistics
  • Stars: 16
  • Watchers: 3
  • Forks: 6
  • Open Issues: 1
  • Releases: 4
Created about 5 years ago · Last pushed about 2 years ago
Metadata Files
Readme Changelog Contributing License Code of conduct

README.md


latentcor: Latent Correlation for Mixed Types of Data

latentcor is an R package for estimation of latent correlations with mixed data types (continuous, binary, truncated, and ternary) under the latent Gaussian copula model. For references on the estimation framework, see

Statement of Need

No R software package is currently available that allows accurate and fast correlation estimation from mixed variable data in a unifying manner. The R package latentcor, introduced here, thus represents the first stand-alone R package for computation of latent correlation that takes into account all variable types (continuous/binary/ordinal/zero-inflated), comes with an optimized memory footprint, and is computationally efficient, essentially making latent correlation estimation almost as fast as rank-based correlation estimation.
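To give some intuition for the link between rank-based and latent correlation, the sketch below covers only the simplest case: two continuous variables. Under a Gaussian copula, Kendall's tau and the latent correlation r satisfy tau = (2 / pi) * asin(r), so r can be recovered as sin(pi / 2 * tau). The code uses base R only and is not a description of how latentcor is implemented internally; the sample size, seed, transformations, and the value rho = 0.6 are assumptions chosen for illustration.

```r
# Illustrative sketch only: continuous/continuous case, base R, no latentcor calls.
set.seed(1)
n <- 2000
rho <- 0.6  # latent correlation chosen for this illustration

# Latent bivariate normal pair with correlation rho
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)

# Observed variables: strictly increasing transformations of the latent ones
x <- exp(z1)
y <- z2^3

# Kendall's tau depends only on ranks, so the transformations do not change it
tau_latent <- cor(z1, z2, method = "kendall")
tau_observed <- cor(x, y, method = "kendall")

# Invert the continuous/continuous relation tau = (2 / pi) * asin(r)
r_hat <- sin(pi / 2 * tau_observed)

# Pearson correlation on the observed scale, for comparison
r_pearson <- cor(x, y)
```

Here `r_hat` should land close to `rho`, while `r_pearson` is generally distorted by the nonlinear transformations. The binary, ternary, and truncated cases need different relations between tau and r, which is where a dedicated package becomes useful.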

Multi-linear interpolation: Earlier versions of latentcor used multi-linear interpolation based on the functionality of the R package chebpol, written by Simen Gaure. This functionality is needed for faster computation of latent correlations with the approximation method. However, chebpol was removed from CRAN on 2022-02-07. The current version of latentcor reuses the multi-linear interpolation part of chebpol (provided under the Artistic-2 license), integrated directly within latentcor. To cite the multi-linear interpolation only, please cite the original chebpol package.
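As a rough illustration of the idea behind interpolating from a pre-stored grid, the sketch below implements two-dimensional (bilinear) interpolation in base R. It is not the chebpol code or the latentcor API: the helper `bilinear_interp`, the stand-in function `f`, and the grid spacing are all hypothetical choices made for this example.

```r
# Hypothetical helper for illustration; this is not the chebpol or latentcor API.
bilinear_interp <- function(gx, gy, gz, x, y) {
  # Index of the grid cell containing (x, y); all.inside keeps i + 1 and j + 1 valid
  i <- findInterval(x, gx, all.inside = TRUE)
  j <- findInterval(y, gy, all.inside = TRUE)
  # Relative position of the query point inside the cell
  tx <- (x - gx[i]) / (gx[i + 1] - gx[i])
  ty <- (y - gy[j]) / (gy[j + 1] - gy[j])
  # Weighted average of the four corner values
  (1 - tx) * (1 - ty) * gz[i, j] +
    tx * (1 - ty) * gz[i + 1, j] +
    (1 - tx) * ty * gz[i, j + 1] +
    tx * ty * gz[i + 1, j + 1]
}

# Arbitrary smooth stand-in for a function that is expensive to evaluate
f <- function(x, y) sin(pi / 2 * x) * cos(y)

# Pre-compute the function once on a regular grid
gx <- seq(0, 1, by = 0.05)
gy <- seq(0, 1, by = 0.05)
gz <- outer(gx, gy, f)

# Look up an off-grid point from the stored values and compare with the exact value
approx_value <- bilinear_interp(gx, gy, gz, x = 0.33, y = 0.71)
exact_value <- f(0.33, 0.71)
abs_error <- abs(approx_value - exact_value)
```

The trade-off is the usual one for lookup tables: a finer grid lowers `abs_error` but increases the amount of data that has to be stored.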

Accuracy: The approximation method for the ternary/ternary, truncated (zero-inflated)/ternary, and ternary/binary cases is less accurate close to the boundary (zero proportions) due to size limitations of CRAN packages on the pre-stored grid. If higher accuracy is desired and the original method is computationally prohibitive, latentcor is also available as a Python package, with a development version on GitHub.

Installation

To use latentcor, you need to install R. For a more comfortable workflow, you may also want to use an IDE (e.g. RStudio).

The development version of latentcor is available on GitHub. You can install it with the help of the devtools package in R as follows:

```r
install.packages("devtools")
devtools::install_github("https://github.com/mingzehuang/latentcor", build_vignettes = TRUE)
```

The stable release version of latentcor is available on CRAN. You can install it in R as follows:

```r
install.packages("latentcor")
```

Example

A simple example estimating latent correlation is shown below.

```r
library(latentcor)

# Generate two variables of sample size 1000.
# The first variable is ternary (pi0 = 0.3, pi1 = 0.5, pi2 = 1 - 0.3 - 0.5 = 0.2).
# The second variable is continuous.
# No copula transformation is applied.
X = gen_data(n = 1000, types = c("ter", "con"), XP = list(c(0.3, .5), NA))$X

# Estimate latent correlation matrix with the original method
latentcor(X = X, types = c("ter", "con"), method = "original")$R

# Estimate latent correlation matrix with the approximation method
latentcor(X = X, types = c("ter", "con"))$R

# Speed improvement by approximation method compared with original method
library(microbenchmark)
microbenchmark(latentcor(X, types = c("ter", "con"), method = "original"),
               latentcor(X, types = c("ter", "con")))
# Unit: milliseconds
#    min     lq     mean  median      uq     max neval
# 5.3444 5.8301 7.033555 6.06740 6.74975 20.8878   100
# 1.5049 1.6245 2.009371 1.73805 1.99820  5.0027   100
# This is run on Windows 10 with Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz 3.20 GHz

# Heatmap for latent correlation matrix.
latentcor(X = X, types = c("ter", "con"), showplot = TRUE)$plotR
```

Another example with the `mtcars` dataset.

```r
library(latentcor)

# Use built-in dataset mtcars
X = mtcars

# Check variable types for manual determination
apply(mtcars, 2, table)

# Or use built-in get_types function to get types suggestions
get_types(mtcars)

# Variable types for the 11 columns of mtcars, reused in the calls below
types = c("con", "ter", "con", "con", "con", "con", "con", "bin", "bin", "ter", "con")

# Estimate latent correlation matrix with original method
latentcor(mtcars, types = types, method = "original")$R

# Estimate latent correlation matrix with approximation method
latentcor(mtcars, types = types)$R

# Speed improvement by approximation method compared with original method
library(microbenchmark)
microbenchmark(latentcor(mtcars, types = types, method = "original"),
               latentcor(mtcars, types = types, method = "approx"))
# Unit: milliseconds
#      min       lq      mean   median        uq      max neval
# 201.9872 215.6438 225.30385 221.5364 226.58330 411.4940   100
#  71.8457  75.1681  82.42531  80.1688  84.77845 238.3793   100
# This is run on Windows 10 with Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz 3.20 GHz

# Heatmap for latent correlation matrix with approximation method.
latentcor(mtcars, types = types, showplot = TRUE)$plotR
```

For an interactive version, see: interactive heatmap of latent correlations (approx) for mtcars

Community Guidelines

  1. Contributions and suggestions to the software are always welcome. Please consult our contribution guidelines prior to submitting a pull request.
  2. Report issues or problems with the software using GitHub’s issue tracker.
  3. Contributors must adhere to the Code of Conduct.

Acknowledgments

We thank Dr. Grace Yoon for providing implementation details of the mixedCCA R package.

Owner

  • Name: Mingze “Rico” Huang
  • Login: mingzehuang
  • Kind: user
  • Location: Phoenix, AZ
  • Company: Western Alliance Bank

Ph.D. in Economics, M.S. in Statistics

JOSS Publication

latentcor: An R Package for estimating latent correlations from mixed data types
Published
September 21, 2021
Volume 6, Issue 65, Page 3634
Authors
Mingze Huang ORCID
Department of Statistics, Texas A&M University, College Station, TX; Department of Economics, Texas A&M University, College Station, TX
Christian L. Müller ORCID
Ludwig-Maximilians-Universität München, Germany; Helmholtz Zentrum München, Germany; Flatiron Institute, New York
Irina Gaynanova ORCID
Department of Statistics, Texas A&M University, College Station, TX
Editor
Chris Vernon ORCID
Tags
Statistics Latent Correlation

GitHub Events

Total
  • Pull request event: 1
  • Fork event: 1
Last Year
  • Pull request event: 1
  • Fork event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 668
  • Total Committers: 5
  • Avg Commits per committer: 133.6
  • Development Distribution Score (DDS): 0.325
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Mingze Huang m****g@g****m 451
Huang s****z@t****u 121
Irina Gaynanova i****g@s****u 50
Christian L. Müller m****n 45
Rico s****z@r****u 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 15
  • Total pull requests: 2
  • Average time to close issues: 13 days
  • Average time to close pull requests: 5 months
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 4.27
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • rmflight (8)
  • corybrunson (5)
  • Vlasovets (1)
  • zdk123 (1)
Pull Request Authors
  • olivroy (2)
  • irinagain (1)

Dependencies

DESCRIPTION cran
  • R >= 3.0.0 depends
  • MASS * imports
  • Matrix * imports
  • fMultivar * imports
  • ggplot2 * imports
  • graphics * imports
  • heatmaply * imports
  • mnormt * imports
  • pcaPP * imports
  • plotly * imports
  • stats * imports
  • covr * suggests
  • knitr * suggests
  • markdown * suggests
  • microbenchmark * suggests
  • rmarkdown * suggests
  • testthat >= 3.0.0 suggests
.github/workflows/check-standard.yaml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/upload-artifact main composite
  • r-lib/actions/setup-pandoc v1 composite
  • r-lib/actions/setup-r v1 composite
.github/workflows/test-coverage.yaml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • r-lib/actions/setup-pandoc v1 composite
  • r-lib/actions/setup-r v1 composite