Keywords
imputation
missing-data
random-forest
Keywords from Contributors
transformation
Multiple Imputation using Chained Random Forests
Basic Info
Statistics
- Stars: 4
- Watchers: 2
- Forks: 2
- Open Issues: 0
- Releases: 0
Topics
imputation
missing-data
random-forest
Created almost 6 years ago
· Last pushed over 3 years ago
Metadata Files
Readme
README.Rmd
---
output: github_document
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
fig.align = "center"
)
```
# RfEmpImp
[RfEmpImp on CRAN](https://CRAN.R-project.org/package=RfEmpImp)
[RfEmpImp on GitHub](https://github.com/shangzhi-hong/RfEmpImp)
An R package for random-forest-empowered imputation of missing data
## Random-forest-based multiple imputation evolved
`RfEmpImp` is an R package for multiple imputation using chained random forests
(RF).
This R package provides prediction-based and node-based multiple imputation
algorithms using random forests, and currently operates under the multiple
imputation computation framework [`mice`](https://CRAN.R-project.org/package=mice).
For more details on the implemented imputation algorithms, please refer to
[arXiv:2004.14823](https://arxiv.org/abs/2004.14823) (further updates soon).
## Installation
Users can install the released version of `RfEmpImp` from CRAN, or the latest
development version from GitHub:
```r
# Install from CRAN
install.packages("RfEmpImp")
# Install the development version from GitHub
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("shangzhi-hong/RfEmpImp")
# Install from a released source package
# (path_to_source_file is a placeholder for the path to the downloaded source file)
install.packages(path_to_source_file, repos = NULL, type = "source")
# Attach
library(RfEmpImp)
```
## Prediction-based imputation
### For mixed types of variables
For data with mixed types of variables, users can call the function `imp.rfemp()`
to use the `RfEmp` method, which applies the `RfPred-Emp` method to continuous
variables and the `RfPred-Cate` method to categorical variables
(of type `logical` or `factor`, etc.).
Starting with version `2.0.0`, the names of parameters were further simplified;
please refer to the documentation for details.
### Prediction-based imputation for continuous variables
For continuous variables, the `RfPred-Emp` method uses the empirical distribution
of the random forest's out-of-bag prediction errors when constructing the
conditional distribution of the variable under imputation, which improves the
quality of the conditional distributions. Users can set `method = "rfpred.emp"`
in the function call to `mice` to use it.
Alternatively, the `RfPred-Norm` method assumes normality for the RF prediction
errors, as proposed by Shah *et al.*, and users can set `method = "rfpred.norm"`
in the function call to `mice` to use it.
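
The difference between the two error models can be sketched directly with
`ranger`. This is a minimal illustration under stated assumptions, not the
package's implementation: the toy data, the variable names (`oobErr`, `impEmp`,
`impNorm`) and the choice of 100 trees are all hypothetical, and the normal draw
uses the out-of-bag mean squared error as its variance.

```r
library(ranger)

set.seed(2020)
# Toy data (assumption): y depends on x1 and x2, and 40 values of y are missing
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
dat$y[sample(n, 40)] <- NA
obs <- !is.na(dat$y)

# Fit a regression forest on the observed cases only
rf <- ranger(y ~ x1 + x2, data = dat[obs, ], num.trees = 100)

# Out-of-bag prediction errors of the observed cases
# (rf$predictions holds the out-of-bag predictions for the training data)
oobErr <- dat$y[obs] - rf$predictions

# Point predictions for the cases with missing y
pred <- predict(rf, data = dat[!obs, c("x1", "x2")])$predictions

# Empirical-error draw: prediction plus a resampled out-of-bag error
impEmp <- pred + sample(oobErr, length(pred), replace = TRUE)

# Normal-error draw: prediction plus normal noise whose variance is the
# out-of-bag mean squared error
impNorm <- pred + rnorm(length(pred), mean = 0, sd = sqrt(rf$prediction.error))
```

The empirical draw keeps any skewness or heavy tails present in the out-of-bag
errors, whereas the normal draw imposes a symmetric bell shape.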
### Prediction-based imputation for categorical variables
For categorical variables, the `RfPred-Cate` method uses probability machine
theory: the imputed categories are drawn according to the predicted
probabilities for each missing observation. Users can set
`method = "rfpred.cate"` in the function call to `mice` to use it.
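
As a rough sketch of this idea, a probability forest from `ranger` returns one
predicted probability per category, and a category can then be sampled for each
missing observation. The toy data and the names `prob` and `impCate` are
assumptions made for illustration; the package's own sampler may differ in
detail.

```r
library(ranger)

set.seed(2020)
# Toy data (assumption): binary factor y, with 40 values set to missing
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- factor(ifelse(dat$x1 + rnorm(n) > 0, "yes", "no"))
dat$y[sample(n, 40)] <- NA
obs <- !is.na(dat$y)

# Probability forest fitted on the observed cases
rf <- ranger(y ~ x1 + x2, data = dat[obs, ], num.trees = 100,
             probability = TRUE)

# Predicted class probabilities: one row per missing case, one column per category
prob <- predict(rf, data = dat[!obs, c("x1", "x2")])$predictions

# Draw one category per missing case according to its predicted probabilities
impCate <- apply(prob, 1, function(p) sample(colnames(prob), size = 1, prob = p))
impCate <- factor(impCate, levels = levels(dat$y))
```

Sampling from the probabilities, rather than always taking the most likely
category, is what carries the prediction uncertainty into the imputations.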
### Example for prediction-based imputation
```r
# Prepare data
df <- conv.factor(nhanes, c("age", "hyp"))
# Do imputation
imp <- imp.rfemp(df)
# Do analyses
regObj <- with(imp, lm(chl ~ bmi + hyp))
# Pool analyzed results
poolObj <- pool(regObj)
# Extract estimates
res <- reg.ests(poolObj)
```
## Node-based imputation
For continuous or categorical variables, the observations under the predicting
nodes of the random forest are used as candidates for imputation.
Two methods are now available in the `RfNode` algorithm series.
Note that categorical variables should be of type `logical` or `factor`, etc.
### Node-based imputation using predicting nodes
Users can call the function `imp.rfnode.cond()` to use the `RfNode-Cond` method,
which performs imputation using the conditional distribution formed by the
predicting nodes.
The changes in observation weights caused by the bootstrapping of the random
forest are taken into account, and only the "in-bag" observations are used as
candidates for imputation.
Alternatively, users can set `method = "rfnode.cond"` in the function call to
`mice` to use it.
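
The idea can be sketched with `ranger` by looking up the terminal node of each
missing case and sampling a donor from the in-bag observations in that node.
This is a simplified illustration under assumptions (toy data, one randomly
chosen tree per missing case, in-bag counts used as sampling weights), not the
package's actual code.

```r
library(ranger)

set.seed(2020)
# Toy data (assumption): continuous y with 40 missing values
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
dat$y[sample(n, 40)] <- NA
obs <- !is.na(dat$y)
donors <- dat[obs, ]
xMis <- dat[!obs, c("x1", "x2")]

# Keep the in-bag counts so the bootstrap weights of the donors are available
numTrees <- 10
rf <- ranger(y ~ x1 + x2, data = donors, num.trees = numTrees,
             keep.inbag = TRUE)

# Terminal node IDs: one row per case, one column per tree
nodeDonor <- predict(rf, data = donors[, c("x1", "x2")],
                     type = "terminalNodes")$predictions
nodeMis <- predict(rf, data = xMis, type = "terminalNodes")$predictions

# In-bag counts as a donors-by-trees matrix
inbag <- do.call(cbind, rf$inbag.counts)

# For each missing case: pick a tree, then sample an in-bag donor from the
# same terminal node, weighted by how often it was drawn in the bootstrap
impNode <- vapply(seq_len(nrow(xMis)), function(i) {
  tr <- sample.int(numTrees, 1)
  w <- inbag[, tr] * (nodeDonor[, tr] == nodeMis[i, tr])
  donors$y[sample.int(nrow(donors), 1, prob = w)]
}, numeric(1))
```

Because every terminal node is built from in-bag observations, each missing case
always has at least one eligible donor in the chosen tree.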
### Node-based imputation using proximities
Users can call the function `imp.rfnode.prox()` to use the `RfNode-Prox` method,
which performs imputation using the proximity matrices of random forests.
All observations that fall under the same predicting nodes are used as
candidates for imputation, including the out-of-bag ones.
Alternatively, users can set `method = "rfnode.prox"` in the function call to
`mice` to use it.
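
A proximity between two observations is commonly defined as the share of trees
in which they land in the same terminal node. The sketch below uses that
definition to sample donors with probability proportional to proximity. The toy
data, the names `prox` and `impProx`, and the proximity-weighted sampling rule
are assumptions for illustration; the package may select candidates differently.

```r
library(ranger)

set.seed(2020)
# Toy data (assumption): continuous y with 40 missing values
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
dat$y[sample(n, 40)] <- NA
obs <- !is.na(dat$y)
donors <- dat[obs, ]
xMis <- dat[!obs, c("x1", "x2")]

rf <- ranger(y ~ x1 + x2, data = donors, num.trees = 10)

# Terminal node IDs: one row per case, one column per tree
nodeDonor <- predict(rf, data = donors[, c("x1", "x2")],
                     type = "terminalNodes")$predictions
nodeMis <- predict(rf, data = xMis, type = "terminalNodes")$predictions

impProx <- vapply(seq_len(nrow(xMis)), function(i) {
  # Proximity of every donor to missing case i: share of trees in which both
  # fall into the same terminal node (in-bag and out-of-bag donors alike)
  prox <- rowMeans(sweep(nodeDonor, 2, nodeMis[i, ], FUN = "=="))
  donors$y[sample.int(nrow(donors), 1, prob = prox)]
}, numeric(1))
```

Unlike the in-bag sketch for `RfNode-Cond`, no bootstrap weights are involved
here, so out-of-bag donors are just as eligible as in-bag ones.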
### Example for node-based imputation
```r
# Prepare data
df <- conv.factor(nhanes, c("age", "hyp"))
# Do imputation
imp <- imp.rfnode.cond(df)
# Or: imp <- imp.rfnode.prox(df)
# Do analyses
regObj <- with(imp, lm(chl ~ bmi + hyp))
# Pool analyzed results
poolObj <- pool(regObj)
# Extract estimates
res <- reg.ests(poolObj)
```
## Imputation functions
| Type                        | Impute function   | Univariate sampler        | Variable type |
|-----------------------------|-------------------|---------------------------|---------------|
| Prediction-based imputation | imp.rfemp()       | mice.impute.rfemp()       | Mixed         |
|                             | /                 | mice.impute.rfpred.emp()  | Continuous    |
|                             | /                 | mice.impute.rfpred.norm() | Continuous    |
|                             | /                 | mice.impute.rfpred.cate() | Categorical   |
| Node-based imputation       | imp.rfnode.cond() | mice.impute.rfnode.cond() | Mixed         |
|                             | imp.rfnode.prox() | mice.impute.rfnode.prox() | Mixed         |
|                             | /                 | mice.impute.rfnode()      | Mixed         |
## Package structure
The figure below shows how the imputation functions are organized in this R
package.
## Support for parallel computation
Random forests can be compute-intensive on their own. During multiple
imputation, a random forest model is built for every variable containing missing
data in each iteration (usually 5 to 10 iterations), and the whole process is
repeated for each imputation (usually 5 to 20 imputations).
Computational efficiency is therefore crucial for multiple imputation using
chained random forests, especially for large data sets.
In `RfEmpImp`, the random forest model building process is accelerated using
parallel computation powered by [`ranger`](https://CRAN.R-project.org/package=ranger).
The `ranger` R package supports parallel computation using native C++.
In our simulations, parallel computation provided a substantial performance
boost for the imputation process (about 4x faster on a quad-core laptop).
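
To make the cost concrete, the sketch below first counts the forests fitted in
one imputation run and then times a single `ranger` fit with one thread and with
four threads via its `num.threads` argument. The counts, the toy data, and the
thread numbers are illustrative assumptions; how `RfEmpImp` passes thread
settings to `ranger` is not shown here, so please check the package
documentation.

```r
library(ranger)

# Forests fitted in one multiple-imputation run (illustrative values)
numVarsMissing <- 3  # variables containing missing data
numIter <- 5         # iterations per imputation
numImp <- 5          # number of imputations
numForests <- numVarsMissing * numIter * numImp  # 3 * 5 * 5 = 75 forests

set.seed(2020)
# Toy data (assumption): 2000 rows, 10 predictors named X1 to X10
n <- 2000
dat <- data.frame(matrix(rnorm(n * 10), ncol = 10))
dat$y <- dat$X1 + rnorm(n)

# Time a single forest fit with one thread
timeSingle <- system.time(
  rfSingle <- ranger(y ~ ., data = dat, num.trees = 200, num.threads = 1)
)["elapsed"]

# Time the same fit with four threads
timeMulti <- system.time(
  rfMulti <- ranger(y ~ ., data = dat, num.trees = 200, num.threads = 4)
)["elapsed"]

# Compare elapsed times (results depend on the machine)
print(c(single = unname(timeSingle), multi = unname(timeMulti)))
```

Any saving per forest is multiplied by the number of forests in the run, which
is why threading matters more as the data and the number of imputations grow.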
## References
1. Hong, Shangzhi, et al. "Multiple imputation using chained random forests."
Preprint, submitted April 30, 2020. https://arxiv.org/abs/2004.14823.
2. Zhang, Haozhe, et al. "Random forest prediction intervals."
The American Statistician (2019): 1-15.
3. Wright, Marvin N., and Andreas Ziegler. "ranger: A Fast Implementation of
Random Forests for High Dimensional Data in C++ and R." Journal of Statistical
Software 77.1 (2017).
4. Shah, Anoop D., et al. "Comparison of random forest and parametric imputation
models for imputing missing data using MICE: a CALIBER study." American Journal
of Epidemiology 179.6 (2014): 764-774.
5. Doove, Lisa L., Stef Van Buuren, and Elise Dusseldorp. "Recursive partitioning
for missing data imputation in the presence of interaction effects."
Computational Statistics & Data Analysis 72 (2014): 92-104.
6. Malley, James D., et al. "Probability machines." Methods of Information in
Medicine 51.1 (2012): 74-81.
7. Van Buuren, Stef, and Karin Groothuis-Oudshoorn. "mice: Multivariate Imputation
by Chained Equations in R." Journal of Statistical Software 45.3 (2011).
Owner
- Name: Shangzhi Hong
- Login: shangzhi-hong
- Kind: user
- Location: Shanghai, China
- Company: Fudan University
- Repositories: 1
- Profile: https://github.com/shangzhi-hong
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 83
- Total Committers: 3
- Avg Commits per committer: 27.667
- Development Distribution Score (DDS): 0.41
Top Committers
| Name | Email | Commits |
|---|---|---|
| shangzhi-hong | | 49 |
| shangzhi-hong | h****t@h****m | 25 |
| shangzhi-hong | 1****g@u****m | 9 |
Issues and Pull Requests
Last synced: over 2 years ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Packages
- Total packages: 1
- Total downloads: 147 last month (cran)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 3
- Total maintainers: 1
cran.r-project.org: RfEmpImp
Multiple Imputation using Chained Random Forests
- Homepage: https://github.com/shangzhi-hong/RfEmpImp
- Documentation: http://cran.r-project.org/web/packages/RfEmpImp/RfEmpImp.pdf
- License: GPL-3
- Status: removed
- Latest release: 2.1.8 (published over 3 years ago)
Rankings
Forks count: 17.8%
Stargazers count: 24.2%
Dependent packages count: 29.8%
Average: 35.0%
Dependent repos count: 35.5%
Downloads: 67.6%
Maintainers (1)
Last synced: over 2 years ago
Dependencies
DESCRIPTION
cran
- R >= 3.5.0 depends
- mice >= 3.9.0 depends
- ranger >= 0.12.1 depends
- knitr * suggests
- rmarkdown * suggests
- testthat >= 2.1.0 suggests