RfEmpImp

Multiple Imputation using Chained Random Forests

https://github.com/shangzhi-hong/rfempimp

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links (links to: arxiv.org)
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity (low similarity, 15.9%, to scientific vocabulary)

Keywords

imputation missing-data random-forest

Keywords from Contributors

transformation

Last synced: 6 months ago · JSON representation

Repository

Multiple Imputation using Chained Random Forests

Basic Info

Host: GitHub
Owner: shangzhi-hong
Language: R
Default Branch: master
Homepage:
Size: 336 KB

Statistics

Stars: 4
Watchers: 2
Forks: 2
Open Issues: 0
Releases: 0

Topics

imputation missing-data random-forest

Created almost 6 years ago · Last pushed over 3 years ago

Metadata Files

Readme

README.Rmd

---
output: github_document
---



```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  fig.align = "center"
)
```

# RfEmpImp 

[![CRAN Status Badge](http://www.r-pkg.org/badges/version/RfEmpImp)](https://CRAN.R-project.org/package=RfEmpImp)
[![GitHub Version Badge](https://img.shields.io/static/v1?label=GitHub&message=2.1.8&color=3399ff)](https://github.com/shangzhi-hong/RfEmpImp)

An R package for random-forest-empowered imputation of missing data

## Random-forest-based multiple imputation evolved
`RfEmpImp` is an R package for multiple imputation using chained random forests
(RF).  
This R package provides prediction-based and node-based multiple imputation
algorithms using random forests, and currently operates under the multiple
imputation computation framework [`mice`](https://CRAN.R-project.org/package=mice).  
For more details of the implemented imputation algorithms, please refer to:
[arXiv:2004.14823](https://arxiv.org/abs/2004.14823) (further updates soon).


## Installation
Users can install the released version of `RfEmpImp` from CRAN, or the latest
development version from GitHub:  
```r
# Install from CRAN
install.packages("RfEmpImp")
# Install the development version from GitHub
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("shangzhi-hong/RfEmpImp")
# Install from a released source package
# (path_to_source_file is a placeholder for the path to the downloaded source file)
install.packages(path_to_source_file, repos = NULL, type = "source")
# Attach
library(RfEmpImp)
```


## Prediction-based imputation
### For mixed types of variables
For data with mixed types of variables, users can call the function `imp.rfemp()` to
use the `RfEmp` method, which applies the `RfPred-Emp` method to continuous variables
and the `RfPred-Cate` method to categorical variables
(of type `logical` or `factor`, etc.).  
Starting with version `2.0.0`, the parameter names were further simplified;
please refer to the documentation for details.

### Prediction-based imputation for continuous variables
For continuous variables, the `RfPred-Emp` method uses the empirical distribution of
the random forest's out-of-bag prediction errors to construct the conditional
distribution of the variable under imputation. This avoids assuming a parametric
form for the prediction errors and is intended to give conditional distributions
of better quality. Users can set `method = "rfpred.emp"` in the call to `mice()`
to use it.
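
The idea can be illustrated outside the package with `ranger` alone. The following is a minimal sketch, not the package's implementation: it imputes a single continuous variable once, using the built-in `airquality` data, and all object names (`dat`, `oobErr`, `impVals`, and so on) are chosen here for illustration only.

```r
library(ranger)

set.seed(2020)
# Impute Ozone from fully observed predictors in the built-in airquality data
dat <- airquality[, c("Ozone", "Wind", "Temp", "Month")]
predictors <- c("Wind", "Temp", "Month")
obs <- !is.na(dat$Ozone)

# Fit a regression forest on the observed cases only
rf <- ranger(Ozone ~ ., data = dat[obs, ], num.trees = 100)

# Out-of-bag (OOB) errors: observed value minus OOB prediction
oobErr <- dat$Ozone[obs] - rf$predictions
oobErr <- oobErr[is.finite(oobErr)]  # drop cases that were never out-of-bag

# Predict the missing cases, then add errors resampled from the
# empirical OOB error distribution
predMis <- predict(rf, data = dat[!obs, predictors])$predictions
impVals <- predMis + sample(oobErr, size = sum(!obs), replace = TRUE)

datImp <- dat
datImp$Ozone[!obs] <- impVals
```

In a chained-equations setting this step would be repeated for every incomplete variable and every iteration; the sketch shows one draw for one variable.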

Alternatively, the `RfPred-Norm` method assumes normality of the RF prediction
errors, as proposed by Shah *et al.*; users can set `method = "rfpred.norm"`
in the call to `mice()` to use it.
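
For comparison, a normal-error draw can be sketched with `ranger` as well. This is an illustration under the assumption that the error variance is estimated by the forest's out-of-bag mean squared error; it is not taken from the package source, and the object names are hypothetical.

```r
library(ranger)

set.seed(2020)
dat <- airquality[, c("Ozone", "Wind", "Temp", "Month")]
predictors <- c("Wind", "Temp", "Month")
obs <- !is.na(dat$Ozone)

rf <- ranger(Ozone ~ ., data = dat[obs, ], num.trees = 100)

# For regression forests, ranger reports the OOB mean squared error here
oobMse <- rf$prediction.error

# Draw imputations from a normal distribution centred on the predictions
predMis <- predict(rf, data = dat[!obs, predictors])$predictions
impVals <- rnorm(sum(!obs), mean = predMis, sd = sqrt(oobMse))
```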

### Prediction-based imputation for categorical variables
For categorical variables, the `RfPred-Cate` method uses probability machine
theory: the predictions of missing categories are based on the
predicted probabilities for each missing observation. Users can set 
`method = "rfpred.cate"` in the call to `mice()` to use it.
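
A probability forest in `ranger` gives a rough picture of this step. The sketch below is an assumption-laden illustration rather than the package's code: it deletes some `Species` values from `iris` at random, fits a forest with `probability = TRUE`, and samples one category per missing observation from the predicted class probabilities.

```r
library(ranger)

set.seed(2020)
# Create a categorical target with artificial missing values
dat <- iris
dat$Species[sample(nrow(dat), 20)] <- NA
obs <- !is.na(dat$Species)
predictors <- setdiff(names(dat), "Species")

# Probability forest: predicts class probabilities instead of hard labels
rf <- ranger(Species ~ ., data = dat[obs, ], num.trees = 100,
             probability = TRUE)
probMis <- predict(rf, data = dat[!obs, predictors])$predictions

# Draw one category per missing observation from its predicted probabilities
impCat <- apply(probMis, 1, function(p) {
  sample(colnames(probMis), size = 1, prob = p)
})
dat$Species[!obs] <- impCat
```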

### Example for prediction-based imputation
```r
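# Attach the package (mice, which provides nhanes, with() and pool(), is a dependency)
library(RfEmpImp)
# Prepare data: convert categorical variables to factors
df <- conv.factor(nhanes, c("age", "hyp"))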
# Do imputation
imp <- imp.rfemp(df)
# Do analyses
regObj <- with(imp, lm(chl ~ bmi + hyp))
# Pool analyzed results
poolObj <- pool(regObj)
# Extract estimates
res <- reg.ests(poolObj)
```

## Node-based imputation
For continuous or categorical variables, the observations under the predicting
nodes of the random forest are used as candidates for imputation.  
Two methods are currently available in the `RfNode` algorithm series.  
Note that categorical variables should be of type `logical` or
`factor`, etc.

### Node-based imputation using predicting nodes
Users can call the function `imp.rfnode.cond()` to use the `RfNode-Cond` method,
which performs imputation using the conditional distribution formed by the
predicting nodes.  
The changes in observation weights caused by the bootstrapping of the random
forest are taken into account, and only the "in-bag" observations are used as
candidates for imputation.  
Users can also set `method = "rfnode.cond"` in the call to `mice()` to use
it.
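
One way to picture the in-bag donor selection is the sketch below, which uses `ranger` directly. It is a simplified illustration and makes its own assumptions: for each missing value a single tree is picked at random, donors are the in-bag observations sharing the terminal node in that tree, and a donor is sampled with weight equal to its in-bag count. The package's actual procedure may differ in these details.

```r
library(ranger)

set.seed(2020)
dat <- airquality[, c("Ozone", "Wind", "Temp", "Month")]
predictors <- c("Wind", "Temp", "Month")
obs <- !is.na(dat$Ozone)
yObs <- as.numeric(dat$Ozone[obs])
numTrees <- 50

# keep.inbag = TRUE stores how often each observation was drawn per tree
rf <- ranger(Ozone ~ ., data = dat[obs, ], num.trees = numTrees,
             keep.inbag = TRUE)

# Terminal node IDs: one row per observation, one column per tree
nodesObs <- predict(rf, data = dat[obs, predictors],
                    type = "terminalNodes")$predictions
nodesMis <- predict(rf, data = dat[!obs, predictors],
                    type = "terminalNodes")$predictions
inbag <- do.call(cbind, rf$inbag.counts)

impVals <- vapply(seq_len(nrow(nodesMis)), function(i) {
  tree <- sample.int(numTrees, 1)
  # Donors: in-bag observations in the same terminal node of this tree
  donors <- which(nodesObs[, tree] == nodesMis[i, tree] & inbag[, tree] > 0)
  pick <- sample.int(length(donors), 1, prob = inbag[donors, tree])
  yObs[donors[pick]]
}, numeric(1))
```

Because every imputed value is copied from an observed donor, the imputations always lie within the range of the observed data.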

### Node-based imputation using proximities
Users can call the function `imp.rfnode.prox()` to use the `RfNode-Prox` method, 
which performs imputation using the proximity matrices of random forests.  
All observations that fall under the same predicting nodes are used as candidates
for imputation, including the out-of-bag ones.  
Users can also set `method = "rfnode.prox"` in the call to `mice()`
to use it.
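
A proximity-based draw can be sketched as follows. The proximity between a missing case and an observed case is taken here as the share of trees in which both land in the same terminal node, and donors are sampled with probability proportional to that proximity. These choices, and all object names, are assumptions made for illustration, not a description of the package internals.

```r
library(ranger)

set.seed(2020)
dat <- airquality[, c("Ozone", "Wind", "Temp", "Month")]
predictors <- c("Wind", "Temp", "Month")
obs <- !is.na(dat$Ozone)
yObs <- as.numeric(dat$Ozone[obs])

rf <- ranger(Ozone ~ ., data = dat[obs, ], num.trees = 50)

# Terminal node IDs for observed and missing cases (rows) in each tree (columns)
nodesObs <- predict(rf, data = dat[obs, predictors],
                    type = "terminalNodes")$predictions
nodesMis <- predict(rf, data = dat[!obs, predictors],
                    type = "terminalNodes")$predictions

# Proximity: fraction of trees in which two cases share a terminal node
prox <- matrix(0, nrow = nrow(nodesMis), ncol = nrow(nodesObs))
for (i in seq_len(nrow(nodesMis))) {
  prox[i, ] <- rowMeans(sweep(nodesObs, 2, nodesMis[i, ], `==`))
}

# Sample one donor per missing case, weighted by proximity;
# out-of-bag observations are eligible too
impVals <- vapply(seq_len(nrow(prox)), function(i) {
  yObs[sample.int(length(yObs), 1, prob = prox[i, ])]
}, numeric(1))
```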

### Example for node-based imputation
```r
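# Attach the package (mice, which provides nhanes, with() and pool(), is a dependency)
library(RfEmpImp)
# Prepare data: convert categorical variables to factors
df <- conv.factor(nhanes, c("age", "hyp"))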
# Do imputation
imp <- imp.rfnode.cond(df)
# Or: imp <- imp.rfnode.prox(df)
# Do analyses
regObj <- with(imp, lm(chl ~ bmi + hyp))
# Pool analyzed results
poolObj <- pool(regObj)
# Extract estimates
res <- reg.ests(poolObj)
```


## Imputation functions
| Type                        | Impute function   | Univariate sampler        | Variable type |
|-----------------------------|-------------------|---------------------------|---------------|
| Prediction-based imputation | imp.rfemp()       | mice.impute.rfemp()       | Mixed         |
|                             | /                 | mice.impute.rfpred.emp()  | Continuous    |
|                             | /                 | mice.impute.rfpred.norm() | Continuous    |
|                             | /                 | mice.impute.rfpred.cate() | Categorical   |
| Node-based imputation       | imp.rfnode.cond() | mice.impute.rfnode.cond() | Mixed         |
|                             | imp.rfnode.prox() | mice.impute.rfnode.prox() | Mixed         |
|                             | /                 | mice.impute.rfnode()      | Mixed         |


## Package structure
The figure below shows how the imputation functions are organized in this R
package.  



## Support for parallel computation
Random forests are compute-intensive on their own, and during multiple
imputation a random forest model is built for each variable containing missing
data at every iteration (usually 5 to 10 iterations), and the whole process is
repeated for each imputation (usually 5 to 20 imputations).
Thus, computational efficiency is crucial for multiple imputation
using chained random forests, especially for large data sets.  
In `RfEmpImp`, the random forest model building process is therefore accelerated
using parallel computation powered by [`ranger`](https://CRAN.R-project.org/package=ranger).
The `ranger` R package supports parallel computation in native C++.
In our simulations, parallel computation made the imputation process about 4x
faster on a quad-core laptop.
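
As a small illustration of the underlying mechanism, `ranger()` exposes a `num.threads` argument that sets the number of worker threads. The sketch below calls `ranger` directly on the built-in `airquality` data; it does not show how `RfEmpImp` forwards this setting, and on a data set this small the timing difference may be negligible.

```r
library(ranger)

dat <- airquality[complete.cases(airquality), ]

# Fit the same forest with one worker thread and with two
timeSingle <- system.time(
  rfSingle <- ranger(Ozone ~ ., data = dat, num.trees = 2000, num.threads = 1)
)
timeMulti <- system.time(
  rfMulti <- ranger(Ozone ~ ., data = dat, num.trees = 2000, num.threads = 2)
)

# Elapsed wall-clock time in seconds for each setting
print(c(single = timeSingle[["elapsed"]], multi = timeMulti[["elapsed"]]))
```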


## References
1. Hong, Shangzhi, et al. "Multiple imputation using chained random forests."
Preprint, submitted April 30, 2020. https://arxiv.org/abs/2004.14823.
2. Zhang, Haozhe, et al. "Random forest prediction intervals."
The American Statistician (2019): 1-15.
3. Wright, Marvin N., and Andreas Ziegler. "ranger: A Fast Implementation of
Random Forests for High Dimensional Data in C++ and R." Journal of Statistical
Software 77.i01 (2017).
4. Shah, Anoop D., et al. "Comparison of random forest and parametric imputation
models for imputing missing data using MICE: a CALIBER study." American Journal
of Epidemiology 179.6 (2014): 764-774.
5. Doove, Lisa L., Stef Van Buuren, and Elise Dusseldorp. "Recursive partitioning
for missing data imputation in the presence of interaction effects."
Computational Statistics & Data Analysis 72 (2014): 92-104.
6. Malley, James D., et al. "Probability machines." Methods of Information in
Medicine 51.01 (2012): 74-81.
7. Van Buuren, Stef, and Karin Groothuis-Oudshoorn. "mice: Multivariate Imputation
by Chained Equations in R." Journal of Statistical Software 45.i03 (2011).

Owner

Name: Shangzhi Hong
Login: shangzhi-hong
Kind: user
Location: Shanghai, China
Company: Fudan University

Repositories: 1
Profile: https://github.com/shangzhi-hong

Committers

Last synced: almost 3 years ago

All Time

Total Commits: 83
Total Committers: 3
Avg Commits per committer: 27.667
Development Distribution Score (DDS): 0.41

Top Committers

Name	Email	Commits
shangzhi-hong		49
shangzhi-hong	h**t@h**m	25
shangzhi-hong	1**g@u**m	9

Issues and Pull Requests

Last synced: over 2 years ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Packages

Total packages: 1
Total downloads:
- cran 147 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 3
Total maintainers: 1

cran.r-project.org: RfEmpImp

Multiple Imputation using Chained Random Forests

Homepage: https://github.com/shangzhi-hong/RfEmpImp
Documentation: http://cran.r-project.org/web/packages/RfEmpImp/RfEmpImp.pdf
License: GPL-3
Status: removed
Latest release: 2.1.8
published over 3 years ago

Versions: 3
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 147 Last month

Rankings

Forks count: 17.8%

Stargazers count: 24.2%

Dependent packages count: 29.8%

Average: 35.0%

Dependent repos count: 35.5%

Downloads: 67.6%

Maintainers (1)

shangzhi-hong@hotmail.com

Last synced: over 2 years ago

Dependencies

DESCRIPTION cran

R >= 3.5.0 depends
mice >= 3.9.0 depends
ranger >= 0.12.1 depends
knitr * suggests
rmarkdown * suggests
testthat >= 2.1.0 suggests
