RfEmpImp

Multiple Imputation using Chained Random Forests

https://github.com/shangzhi-hong/rfempimp

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.9%) to scientific vocabulary

Keywords

imputation missing-data random-forest

Keywords from Contributors

transformation

Repository

Basic Info
  • Host: GitHub
  • Owner: shangzhi-hong
  • Language: R
  • Default Branch: master
  • Size: 336 KB
Statistics
  • Stars: 4
  • Watchers: 2
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Created almost 6 years ago · Last pushed over 3 years ago
Metadata Files
Readme

README.Rmd

---
output: github_document
---



```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  fig.align = "center"
)
```

# RfEmpImp 

[![CRAN Status Badge](http://www.r-pkg.org/badges/version/RfEmpImp)](https://CRAN.R-project.org/package=RfEmpImp)
[![GitHub Version Badge](https://img.shields.io/static/v1?label=GitHub&message=2.1.8&color=3399ff)](https://github.com/shangzhi-hong/RfEmpImp)

An R package for random-forest-empowered imputation of missing data

## Random-forest-based multiple imputation evolved
`RfEmpImp` is an R package for multiple imputation using chained random forests
(RF).  
This R package provides prediction-based and node-based multiple imputation
algorithms using random forests, and currently operates under the multiple
imputation computation framework [`mice`](https://CRAN.R-project.org/package=mice).  
For more details of the implemented imputation algorithms, please refer to:
[arXiv:2004.14823](https://arxiv.org/abs/2004.14823) (further updates soon).


## Installation
Users can install the released version of `RfEmpImp` from CRAN, or the latest
development version from GitHub:  
```r
# Install from CRAN
install.packages("RfEmpImp")
# Install the development version from GitHub
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("shangzhi-hong/RfEmpImp")
# Install from a released source package
# (path_to_source_file is a placeholder for the path to the downloaded file)
install.packages(path_to_source_file, repos = NULL, type = "source")
# Attach
library(RfEmpImp)
```


## Prediction-based imputation
### For mixed types of variables
For data with mixed types of variables, users can call the function `imp.rfemp()` to
use the `RfEmp` method, which applies the `RfPred-Emp` method to continuous
variables and the `RfPred-Cate` method to categorical variables
(of type `logical` or `factor`, etc.).  
Starting with version `2.0.0`, the parameter names have been further simplified;
please refer to the documentation for details.

### Prediction-based imputation for continuous variables
For continuous variables, the `RfPred-Emp` method uses the empirical distribution
of the random forest's out-of-bag prediction errors to construct the conditional
distribution of the variable under imputation, rather than assuming a parametric
error distribution, providing conditional distributions of better quality. Users
can set `method = "rfpred.emp"` in the call to `mice` to use it.

Alternatively, the `RfPred-Norm` method assumes normality of the RF prediction
errors, as proposed by Shah *et al.*; users can set `method = "rfpred.norm"`
in the call to `mice` to use it.
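
The difference between the two approaches can be sketched directly with `ranger`.
This is a simplified illustration, not the package's implementation: the
`airquality` data, the forest size, and the use of the out-of-bag mean squared
error as the variance of the normal draws are all assumptions of this sketch.

```r
library(ranger)

set.seed(2020)
predictors <- c("Wind", "Temp", "Month")
dat <- airquality[, c("Ozone", predictors)]
obs <- dat[!is.na(dat$Ozone), ]            # cases with observed Ozone
mis <- dat[is.na(dat$Ozone), predictors]   # predictors of cases to impute

# Fit a small forest on the observed cases
rf <- ranger(Ozone ~ ., data = obs, num.trees = 10)

# For the training data, rf$predictions holds the out-of-bag (OOB) predictions
oob_err <- obs$Ozone - rf$predictions
oob_err <- oob_err[!is.na(oob_err)]        # drop cases that were never OOB

# Point predictions for the cases with missing Ozone
pred_mis <- predict(rf, data = mis)$predictions

# Empirical variant: add errors resampled from the observed OOB errors
imp_emp <- pred_mis + sample(oob_err, length(pred_mis), replace = TRUE)

# Normal variant: add errors drawn from N(0, OOB mean squared error)
imp_norm <- pred_mis + rnorm(length(pred_mis), sd = sqrt(mean(oob_err^2)))
```

The empirical variant keeps whatever skewness or heavy tails the OOB errors
show, while the normal variant summarizes them by a single variance.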

### Prediction-based imputation for categorical variables
For categorical variables, the `RfPred-Cate` method uses probability machine
theory: the predictions of missing categories are based on the predicted
probabilities for each missing observation. Users can set
`method = "rfpred.cate"` in the call to `mice` to use it.
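
As a minimal sketch of setting these methods per variable, the snippet below
passes a named method vector to `mice()`. The choice of which `nhanes` columns
to treat as categorical, and the values of `m`, `maxit` and `seed`, are
illustrative assumptions.

```r
library(RfEmpImp)  # depends on mice, which provides nhanes and mice()

# Treat hyp as categorical; bmi and chl stay continuous
df <- conv.factor(nhanes, "hyp")

# One method per column; "" skips the fully observed column age
meth <- c(age = "", bmi = "rfpred.emp", hyp = "rfpred.cate", chl = "rfpred.emp")

imp <- mice(df, method = meth, m = 5, maxit = 5, printFlag = FALSE, seed = 1)

# Inspect the first completed data set
head(complete(imp, 1))
```

The resulting object can be analyzed and pooled with `with()` and `pool()` in the
usual `mice` workflow.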

### Example for prediction-based imputation
```r
# Prepare data
df <- conv.factor(nhanes, c("age", "hyp"))
# Do imputation
imp <- imp.rfemp(df)
# Do analyses
regObj <- with(imp, lm(chl ~ bmi + hyp))
# Pool analyzed results
poolObj <- pool(regObj)
# Extract estimates
res <- reg.ests(poolObj)
```

## Node-based imputation
For continuous or categorical variables, the observations under the predicting
nodes of the random forest are used as candidates for imputation.  
Two methods are currently available in the `RfNode` algorithm series.  
Note that categorical variables should be of type `logical` or
`factor`, etc.

### Node-based imputation using predicting nodes
Users can call the function `imp.rfnode.cond()` to use the `RfNode-Cond` method,
which performs imputation using the conditional distribution formed by the
predicting nodes.  
The changes in observation weights caused by the bootstrapping of the random
forest are taken into account, and only the "in-bag" observations are used as
candidates for imputation.  
Users can also set `method = "rfnode.cond"` in the call to `mice` to use
it.

### Node-based imputation using proximities
Users can call the function `imp.rfnode.prox()` to use the `RfNode-Prox` method,
which performs imputation using the proximity matrices of random forests.  
All the observations that fall under the same predicting nodes are used as
candidates for imputation, including the out-of-bag ones.  
Users can also set `method = "rfnode.prox"` in the call to `mice`
to use it.
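
The idea of drawing imputations from observations that share a terminal node can
be sketched directly with `ranger`. This is a simplified illustration rather than
the package's implementation: it uses the `airquality` data and a small forest
(both assumptions of this sketch), picks one tree at random for each missing
value, and samples a donor among all observed cases in the same terminal node,
ignoring the bootstrap weights that `RfNode-Cond` takes into account.

```r
library(ranger)

set.seed(2024)
predictors <- c("Wind", "Temp", "Month")
dat <- airquality[, c("Ozone", predictors)]
obs <- dat[!is.na(dat$Ozone), ]            # potential donors
mis <- dat[is.na(dat$Ozone), predictors]   # cases to impute

rf <- ranger(Ozone ~ ., data = obs, num.trees = 10)

# Terminal node IDs: one row per observation, one column per tree
nodes_obs <- predict(rf, data = obs[, predictors],
                     type = "terminalNodes")$predictions
nodes_mis <- predict(rf, data = mis, type = "terminalNodes")$predictions

impute_one <- function(i) {
  tree <- sample.int(ncol(nodes_mis), 1)   # choose one tree at random
  # Observed cases that land in the same terminal node of that tree
  donors <- obs$Ozone[nodes_obs[, tree] == nodes_mis[i, tree]]
  donors[sample.int(length(donors), 1)]    # draw one donor value
}

imputed <- vapply(seq_len(nrow(mis)), impute_one, numeric(1))
```

Because every imputed value is copied from an observed case, the imputations
always stay within the range of the observed data.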

### Example for node-based imputation
```r
# Prepare data
df <- conv.factor(nhanes, c("age", "hyp"))
# Do imputation
imp <- imp.rfnode.cond(df)
# Or: imp <- imp.rfnode.prox(df)
# Do analyses
regObj <- with(imp, lm(chl ~ bmi + hyp))
# Pool analyzed results
poolObj <- pool(regObj)
# Extract estimates
res <- reg.ests(poolObj)
```


## Imputation functions
| Type                        | Imputation function | Univariate sampler        | Variable type |
|-----------------------------|---------------------|---------------------------|---------------|
| Prediction-based imputation | imp.rfemp()         | mice.impute.rfemp()       | Mixed         |
|                             | /                   | mice.impute.rfpred.emp()  | Continuous    |
|                             | /                   | mice.impute.rfpred.norm() | Continuous    |
|                             | /                   | mice.impute.rfpred.cate() | Categorical   |
| Node-based imputation       | imp.rfnode.cond()   | mice.impute.rfnode.cond() | Mixed         |
|                             | imp.rfnode.prox()   | mice.impute.rfnode.prox() | Mixed         |
|                             | /                   | mice.impute.rfnode()      | Mixed         |


## Package structure
The figure below shows how the imputation functions are organized in this R
package.  
*Figure: Package structure*


## Support for parallel computation
Random forests can be compute-intensive on their own. During multiple
imputation, random forest models are built for every variable containing missing
data over a number of iterations (usually 5 to 10), and the whole process is
repeated for each imputation performed (usually 5 to 20).
Thus, computational efficiency is of crucial importance for multiple imputation
using chained random forests, especially for large data sets.  
For this reason, `RfEmpImp` accelerates random forest model building using
parallel computation powered by [`ranger`](https://CRAN.R-project.org/package=ranger).
The `ranger` R package supports parallel computation through its native C++
implementation.
In our simulations, parallel computation provided a substantial performance boost
for the imputation process (about 4x faster on a quad-core laptop).
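
As a rough way to check the effect of threading on a given machine, the sketch
below times a single `ranger` fit with one thread and with all detected cores.
The simulated data, the forest size and the helper `time_fit()` are assumptions
made for illustration; how thread settings are passed through `RfEmpImp` should
be checked in the package documentation.

```r
library(ranger)

set.seed(1)
n <- 5000
sim <- data.frame(matrix(rnorm(n * 10), ncol = 10))  # columns X1, ..., X10
sim$y <- sim$X1 + sim$X2^2 + rnorm(n)

# Hypothetical helper: elapsed seconds for one forest fit
time_fit <- function(threads) {
  system.time(
    ranger(y ~ ., data = sim, num.trees = 200, num.threads = threads)
  )[["elapsed"]]
}

# Fall back to one thread if the core count cannot be detected
n_cores <- max(1, parallel::detectCores(), na.rm = TRUE)

t_single <- time_fit(1)
t_multi <- time_fit(n_cores)
c(single = t_single, multi = t_multi)
```

The observed speed-up depends on the number of cores, the data size and the
number of trees, so timings will differ between machines.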


## References
1. Hong, Shangzhi, et al. "Multiple imputation using chained random forests."
Preprint, submitted April 30, 2020. https://arxiv.org/abs/2004.14823.
2. Zhang, Haozhe, et al. "Random forest prediction intervals."
The American Statistician (2019): 1-15.
3. Wright, Marvin N., and Andreas Ziegler. "ranger: A Fast Implementation of
Random Forests for High Dimensional Data in C++ and R." Journal of Statistical
Software 77.1 (2017).
4. Shah, Anoop D., et al. "Comparison of random forest and parametric imputation
models for imputing missing data using MICE: a CALIBER study." American Journal
of Epidemiology 179.6 (2014): 764-774.
5. Doove, Lisa L., Stef Van Buuren, and Elise Dusseldorp. "Recursive partitioning
for missing data imputation in the presence of interaction effects."
Computational Statistics & Data Analysis 72 (2014): 92-104.
6. Malley, James D., et al. "Probability machines." Methods of Information in
Medicine 51.1 (2012): 74-81.
7. Van Buuren, Stef, and Karin Groothuis-Oudshoorn. "mice: Multivariate Imputation
by Chained Equations in R." Journal of Statistical Software 45.3 (2011).

Owner

  • Name: Shangzhi Hong
  • Login: shangzhi-hong
  • Kind: user
  • Location: Shanghai, China
  • Company: Fudan University

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 83
  • Total Committers: 3
  • Avg Commits per committer: 27.667
  • Development Distribution Score (DDS): 0.41
Top Committers
Name Email Commits
shangzhi-hong 49
shangzhi-hong h****t@h****m 25
shangzhi-hong 1****g@u****m 9

Issues and Pull Requests

Last synced: over 2 years ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Packages

  • Total packages: 1
  • Total downloads:
    • cran 147 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 3
  • Total maintainers: 1
cran.r-project.org: RfEmpImp

Rankings
Forks count: 17.8%
Stargazers count: 24.2%
Dependent packages count: 29.8%
Average: 35.0%
Dependent repos count: 35.5%
Downloads: 67.6%
Maintainers (1)
Last synced: over 2 years ago

Dependencies

DESCRIPTION cran
  • R >= 3.5.0 depends
  • mice >= 3.9.0 depends
  • ranger >= 0.12.1 depends
  • knitr * suggests
  • rmarkdown * suggests
  • testthat >= 2.1.0 suggests