hdimpute

A Batch Process for High Dimensional Imputation via Chained Random Forests

https://github.com/pdwaggoner/hdimpute

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: springer.com
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.4%) to scientific vocabulary

Keywords from Contributors

standardization
Last synced: 7 months ago

Repository

A Batch Process for High Dimensional Imputation via Chained Random Forests

Basic Info
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 2
Created about 4 years ago · Last pushed about 1 year ago
Metadata Files
Readme · Changelog · License · Code of conduct

README.md

hdImpute: Batched high dimensional imputation


hdImpute is a correlation-based batch process for addressing high dimensional imputation problems. Relatively few algorithms are designed to handle imputation of missing data in high dimensional contexts quickly and efficiently. Of those, even fewer natively handle mixed-type data; most require a great deal of preprocessing to get the data into the proper shape, and then postprocessing to return it to its original form. These requirements, along with the assumptions many algorithms make about, for example, the data generating process, limit their performance, flexibility, and usability. Built on top of a recent set of complementary algorithms for nonparametric imputation via chained random forests, missForest and missRanger, I offer a batch-based approach: subset the data based on ranked cross-feature correlations, impute each batch separately, and then join the imputed subsets in a final step. The process is extremely fast and accurate after a bit of tuning to find the optimal batch size. As a result, high dimensional imputation is more accessible, and researchers are not forced to choose between speed and accuracy.
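The batching idea can be illustrated conceptually: rank features by absolute cross-feature correlation, split the ranked list into batches, impute each batch independently with missRanger, and rejoin. Note this is a simplified sketch of the idea, not the package's internals; the specific ranking rule used here (mean absolute correlation) is an illustrative assumption.

```r
# Conceptual sketch of the batch approach (not hdImpute's actual internals)
library(missRanger)

batch_impute_sketch <- function(data, batch_size = 5) {
  # Rank features by mean absolute correlation with all other features
  cors <- abs(cor(data.matrix(data), use = "pairwise.complete.obs"))
  diag(cors) <- NA
  ranked <- names(sort(colMeans(cors, na.rm = TRUE), decreasing = TRUE))

  # Split the ranked feature list into batches and impute each separately
  batches <- split(ranked, ceiling(seq_along(ranked) / batch_size))
  imputed <- lapply(batches, function(vars) {
    missRanger(data[, vars, drop = FALSE], verbose = 0)
  })

  # Join the imputed batches and restore the original column order
  do.call(cbind, imputed)[, names(data)]
}
```

Because each batch is far smaller than the full feature space, the chained random forests fit much faster, which is where the speed gain in high dimensions comes from.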

See the R-Bloggers post overviewing a basic implementation of hdImpute in R here

See the detailed complementary paper (Computational Statistics, 2023) introducing hdImpute along with several experimental results here (journal site) or here (full paper)

Python

A complementary version of hdImpute is being actively developed in Python. Take a look here and please feel free to directly contribute!

Access

Dev:

```r
devtools::install_github("pdwaggoner/hdImpute")
```

Stable (on CRAN):

```r
install.packages("hdImpute")
library(hdImpute)
```

Usage

hdImpute includes five core functions and two helpers. The first three proceed by individual stages: (1) build the correlation matrix, (2) flatten and rank the matrix to give a ranked feature list, and (3) build batches, impute, and join. The fourth function (hdImpute()) runs all stages at once, which is slightly less flexible but much simpler. Finally, the latest release (v0.2.1) includes a fifth function that evaluates the quality of imputations by computing mean absolute differences ("MAD scores") for each variable between the original data and the imputed version.

  1. feature_cor(): creates the correlation matrix

  2. flatten_mat(): flattens the correlation matrix from the previous stage, and ranks the features based on absolute correlations. Thus, the input for flatten_mat() should be the stored output from feature_cor().

  3. impute_batches(): creates batches based on the feature rankings from flatten_mat(), and then imputes missing values for each batch, until all batches are completed. Then, joins the batches to give a completed, imputed data set.

  4. hdImpute(): does everything for you. At a minimum, pass the raw data object (data) and specify the batch size (batch) to return a complete, imputed data set (the same result you'd get from the three individual stages above).

  5. mad(): computes variable-wise mean absolute differences (MAD) between the original and imputed data frames. Returns the MAD scores for each variable as a tibble to ensure tidy compliance and easy interaction with other tidyverse functions (e.g., ggplot() for visualizing imputation error).
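Taken together, the staged and one-shot workflows look roughly like this. This is a hedged sketch: the toy data are invented, and the argument names for the staged functions and for mad() (e.g. features, orig, imp) are assumptions based on the descriptions above and may differ across package versions.

```r
library(hdImpute)

# Toy mixed-type data with some hypothetical missingness
set.seed(123)
d <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  x3 = sample(letters[1:3], 100, replace = TRUE)
)
d[sample(100, 10), "x1"] <- NA

# Stage by stage: correlate, rank, then batch-impute
all_cor <- feature_cor(d)                 # (1) correlation matrix
flat    <- flatten_mat(all_cor)           # (2) ranked feature list
full_d  <- impute_batches(data = d,       # (3) batch, impute, and join
                          features = flat,
                          batch = 2)

# Or all at once: pass the data and a batch size
full_d2 <- hdImpute(data = d, batch = 2)

# Evaluate imputation quality variable-wise (tibble of MAD scores)
mad(orig = d, imp = full_d)
```

The staged path is useful when you want to inspect the ranked feature list or reuse the correlation matrix while tuning the batch size; the one-shot hdImpute() call is simpler once a batch size is settled.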

There are several vignettes with deeper dives into the package functionality, which include a few ideas for how to use the software for any imputation project.

Contribute

This software is being actively developed, with many more features to come. Wide engagement and collaboration are welcome! Here's a sampling of ways to contribute:

  • Submit an issue reporting a bug, requesting a feature enhancement, etc.

  • Suggest changes directly via a pull request

  • Reach out directly with ideas if you're uneasy with public interaction

Thanks for using the tool. I hope it's useful.

Owner

  • Name: Philip Waggoner
  • Login: pdwaggoner
  • Kind: user

Director of Data Science @ YouGov · Research Scholar @ Columbia University

GitHub Events

Total
  • Push event: 2
Last Year
  • Push event: 2

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 71
  • Total Committers: 2
  • Avg Commits per committer: 35.5
  • Development Distribution Score (DDS): 0.099
Past Year
  • Commits: 30
  • Committers: 1
  • Avg Commits per committer: 30.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Philip Waggoner (3****r): 64 commits
  • Philip Waggoner (p****r@g****m): 7 commits

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 2
  • Total pull requests: 11
  • Average time to close issues: 8 months
  • Average time to close pull requests: less than a minute
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 0.5
  • Average comments per pull request: 0.09
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • pdwaggoner (2)
Pull Request Authors
  • pdwaggoner (9)
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 275 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 3
  • Total maintainers: 1
cran.r-project.org: hdImpute

A Batch Process for High Dimensional Imputation

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 275 Last month
Rankings
Forks count: 21.9%
Stargazers count: 28.5%
Dependent packages count: 29.8%
Dependent repos count: 35.5%
Average: 37.1%
Downloads: 69.7%
Maintainers (1)
Last synced: 8 months ago

Dependencies

DESCRIPTION (cran)
  • Imports: cli, dplyr, magrittr, missRanger, plyr, purrr, tibble, tidyselect
  • Suggests: knitr, missForest, rmarkdown, testthat (>= 3.0.0), tidyverse, usethis
.github/workflows/testing.yml (GitHub Actions)
  • actions/checkout v4
  • actions/upload-artifact v4
  • r-lib/actions/setup-r v2
  • r-lib/actions/setup-r-dependencies v2