miceFast

R environment - fast imputations :dragon:

https://github.com/polkas/micefast

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.2%) to scientific vocabulary

Keywords

cpp fast fast-imputations grouping imputation imputations matrix mro multiple-imputation package r rcpp rcpparmadillo vif weighting
Last synced: 6 months ago

Repository


Statistics
  • Stars: 20
  • Watchers: 2
  • Forks: 2
  • Open Issues: 2
  • Releases: 2
Created about 8 years ago · Last pushed 10 months ago

README.md

miceFast

Author: Maciej Nasinski

Check the miceFast website for more details


Overview

miceFast provides fast methods for imputing missing data, leveraging an object-oriented programming paradigm and optimized linear algebra routines.
The package includes convenient helper functions compatible with data.table, dplyr, and other popular R packages.

Major speed improvements occur when:
- Using a grouping variable, where the data is automatically sorted by group, significantly reducing computation time.
- Performing multiple imputations, by evaluating the underlying quantitative model only once for multiple draws.
- Running Predictive Mean Matching (PMM), thanks to presorting and binary search.

For performance details, see performance_validity.R in the extdata folder.
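To illustrate why presorting and binary search help PMM, here is a minimal base R sketch. It is not the miceFast implementation: `pmm_nearest` is a hypothetical helper, it uses a single nearest donor (k = 1) rather than a random draw among several donors, and the inputs are made-up numbers.

```r
# Hypothetical helper, for illustration only (not part of miceFast).
# pred_obs: model predictions for rows where y is observed (the donors)
# y_obs:    observed y values for those donors
# pred_mis: model predictions for rows where y is missing
pmm_nearest <- function(pred_obs, y_obs, pred_mis) {
  ord <- order(pred_obs)        # presort the donors once
  p_sorted <- pred_obs[ord]
  y_sorted <- y_obs[ord]
  n <- length(p_sorted)

  # findInterval() does a binary search on the sorted vector and returns,
  # for each query, how many donor predictions are <= the query
  lo <- findInterval(pred_mis, p_sorted)
  left <- pmax(lo, 1L)          # neighbour at or below the query
  right <- pmin(lo + 1L, n)     # neighbour above the query

  # pick whichever of the two neighbours is closer (ties go to the left)
  use_right <- abs(p_sorted[right] - pred_mis) < abs(p_sorted[left] - pred_mis)
  idx <- ifelse(use_right, right, left)
  y_sorted[idx]
}

pred_obs <- c(3.0, 1.0, 2.0, 5.0)
y_obs    <- c(30, 10, 20, 50)
pred_mis <- c(0.5, 2.2, 4.5, 9.0)
result <- pmm_nearest(pred_obs, y_obs, pred_mis)
print(result)
```

Each lookup costs O(log n) after a single O(n log n) sort, instead of scanning all donors for every missing row.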

It is recommended to read the Advanced Usage Vignette.

Installation

You can install miceFast from CRAN:

```r
install.packages("miceFast")
```

Or install the development version from GitHub:

```r
# install.packages("devtools")
devtools::install_github("polkas/miceFast")
```

Quick Example

Below is a short demonstration. See the vignette for advanced usage and best practices.

```r
library(miceFast)

set.seed(1234)
data(air_miss)

# Visualize the NA structure
upset_NA(air_miss, 6)

# Simple and naive fill
imputed_data <- naive_fill_NA(air_miss)

# Compare with other packages:
# Hmisc
library(Hmisc)
data.frame(Map(function(x) Hmisc::impute(x, "random"), air_miss))

# mice
library(mice)
mice::complete(mice::mice(air_miss, printFlag = FALSE))
```

Loop example

Multiple imputations are performed in a loop where a continuous variable is imputed using a Bayesian linear model (lm_bayes) that incorporates relevant predictors and weights for robust estimation. Simultaneously, a categorical variable is imputed using linear discriminant analysis (LDA) augmented with a randomly generated ridge penalty.

```r
library(dplyr)

# Define a function that performs the imputation on the dataset
impute_data <- function(data) {
  data %>%
    mutate(
      # Impute the continuous variable using lm_bayes
      Solar_R_imp = fill_NA(
        x = .,
        model = "lm_bayes",
        posit_y = "Solar.R",
        posit_x = c("Wind", "Temp", "Intercept"),
        w = weights # assuming 'weights' is a column in data
      ),
      # Impute the categorical variable using lda with a random ridge parameter
      Ozone_chac_imp = fill_NA(
        x = .,
        model = "lda",
        posit_y = "Ozone_chac",
        posit_x = c("Wind", "Temp"),
        ridge = runif(1, 0, 50)
      )
    )
}

# Set seed for reproducibility
set.seed(123456)

# Run the imputation process 3 times using replicate()
# This returns a list of imputed datasets.
res <- replicate(n = 3, expr = impute_data(air_miss), simplify = FALSE)

# Check results: Calculate the mean of the imputed Solar.R values in each dataset
means_imputed <- lapply(res, function(x) mean(x$Solar_R_imp, na.rm = TRUE))
print(means_imputed)

# Check results: Tabulate the imputed categorical variable for each dataset
tables_imputed <- lapply(res, function(x) table(x$Ozone_chac_imp))
print(tables_imputed)
```


Key Features

  • Object-Oriented Interface via miceFast objects (Rcpp modules).
  • Convenient Helpers:
    • fill_NA(): Single imputation (lda, lm_pred, lm_bayes, lm_noise).
    • fill_NA_N(): Multiple imputations (pmm, lm_bayes, lm_noise).
    • VIF(): Variance Inflation Factor calculations.
    • naive_fill_NA(): Automatic naive imputations.
    • compare_imp(): Compare original vs. imputed values.
    • upset_NA(): Visualize NA structure using UpSetR.

Quick Reference Table:

| Function        | Description                                                                  |
|-----------------|------------------------------------------------------------------------------|
| new(miceFast)   | Creates an OOP instance with numerous imputation methods (see the vignette). |
| fill_NA()       | Single imputation: lda, lm_pred, lm_bayes, lm_noise.                         |
| fill_NA_N()     | Multiple imputations (N repeats): pmm, lm_bayes, lm_noise.                   |
| VIF()           | Computes Variance Inflation Factors.                                         |
| naive_fill_NA() | Performs automatic, naive imputations.                                       |
| compare_imp()   | Compares imputations vs. original data.                                      |
| upset_NA()      | Visualizes NA structure using an UpSet plot.                                 |
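As background on the quantity that VIF() reports, the following base R sketch computes variance inflation factors from a single matrix inverse. It is an illustration under assumptions, not the package's code: `vif_from_cor` is a hypothetical helper, it uses the built-in `mtcars` data, and it covers only plain numeric predictors.

```r
# Hypothetical helper, for illustration only (not the miceFast VIF() code).
# For numeric predictors, the VIFs are the diagonal of the inverse of the
# predictor correlation matrix, so one inversion yields all of them.
vif_from_cor <- function(X) {
  R <- cor(X)
  vif <- diag(solve(R))
  names(vif) <- colnames(X)
  vif
}

X <- as.matrix(mtcars[, c("disp", "hp", "wt")])
vifs <- vif_from_cor(X)
print(vifs)

# Cross-check one value against the textbook definition 1 / (1 - R^2),
# where R^2 comes from regressing that predictor on the others.
r2 <- summary(lm(disp ~ hp + wt, data = mtcars))$r.squared
check <- all.equal(unname(vifs["disp"]), 1 / (1 - r2))
print(check)
```

The cross-check fits one regression per predictor, which is the slower route that a single inversion avoids.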


Performance Highlights

Benchmark testing (on R 4.4.3, macOS M3 Pro, optimized BLAS and LAPACK) shows miceFast can significantly reduce computation time, especially in these scenarios:

  • Linear Discriminant Analysis (LDA): ~5x faster.
  • Grouping Variable Imputations: ~10x faster (and can exceed 100x in some edge cases).
  • Multiple Imputations: the speedup grows roughly in proportion to the number of imputations, since the model is computed only once.
  • Variance Inflation Factors (VIF): ~5x faster, because we only compute the inverse of X'X.
  • Predictive Mean Matching (PMM): ~3x faster, thanks to presorting and binary search.

For performance details, see performance_validity.R in the extdata folder.
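The idea of computing the model only once for many imputations can be sketched in base R as follows. This is a simplified illustration in the spirit of a linear model with added noise, not the miceFast internals: the data are simulated, all variable names are made up for the example, and a fully Bayesian variant would also redraw the coefficients.

```r
set.seed(1)

# Simulated data with 40 missing values in y (illustrative only)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[sample(n, 40)] <- NA

obs <- !is.na(y)
X <- cbind(Intercept = 1, x = x)

# Fit the linear model a single time on the observed rows
fit <- lm.fit(X[obs, , drop = FALSE], y[obs])
sigma <- sqrt(sum(fit$residuals^2) / fit$df.residual)
pred_mis <- drop(X[!obs, , drop = FALSE] %*% fit$coefficients)

# Reuse the same fit for N imputations: only the noise is redrawn
N <- 5
draws <- replicate(N, pred_mis + rnorm(length(pred_mis), sd = sigma))
print(dim(draws)) # one row per missing value, one column per imputation
```

Refitting inside the loop would repeat the expensive decomposition N times, while here each extra imputation costs only one vector of random draws.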

Owner

  • Name: Maciej Nasinski
  • Login: Polkas
  • Kind: user
  • Location: Warsaw Poland
  • Company: @insightsengineering

Maciej Nasinski - Data Scientist

GitHub Events

Total
  • Issues event: 1
  • Watch event: 4
  • Delete event: 2
  • Push event: 52
  • Pull request event: 6
  • Create event: 2
Last Year
  • Issues event: 1
  • Watch event: 4
  • Delete event: 2
  • Push event: 52
  • Pull request event: 6
  • Create event: 2

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 106
  • Total Committers: 3
  • Avg Commits per committer: 35.333
  • Development Distribution Score (DDS): 0.019
Past Year
  • Commits: 21
  • Committers: 1
  • Avg Commits per committer: 21.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Maciej Nasinski n****j@g****m 104
ol-oxy o****a@g****m 1
Maciej Nasinski m****i@a****l 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 17
  • Total pull requests: 13
  • Average time to close issues: 5 months
  • Average time to close pull requests: about 1 hour
  • Total issue authors: 2
  • Total pull request authors: 2
  • Average comments per issue: 0.59
  • Average comments per pull request: 0.15
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 6
  • Average time to close issues: N/A
  • Average time to close pull requests: about 2 hours
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Polkas (15)
  • sebastian-fox (2)
Pull Request Authors
  • Polkas (12)
  • ol-oxy (1)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 404 last-month
  • Total docker downloads: 21,154
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 12
  • Total maintainers: 1
cran.r-project.org: miceFast

Fast Imputations Using 'Rcpp' and 'Armadillo'

  • Versions: 12
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 404 Last month
  • Docker Downloads: 21,154
Rankings
Stargazers count: 13.8%
Forks count: 17.8%
Average: 24.7%
Downloads: 26.5%
Dependent packages count: 29.8%
Dependent repos count: 35.5%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.6.0 depends
  • Rcpp >= 0.12.12 imports
  • data.table * imports
  • methods * imports
  • UpSetR * suggests
  • dplyr * suggests
  • ggplot2 * suggests
  • knitr * suggests
  • magrittr * suggests
  • mice * suggests
  • pacman * suggests
  • rmarkdown * suggests
  • testthat * suggests
.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v2 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml actions
  • actions/cache v1 composite
  • actions/checkout v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
.github/workflows/test-coverage.yaml actions
  • actions/cache v1 composite
  • actions/checkout v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite