https://github.com/edoardocostantini/mi-hd
Repository hosting project high-dimensional imputation comparison
Basic Info
- Host: GitHub
- Owner: EdoardoCostantini
- License: mit
- Language: R
- Default Branch: master
- Size: 224 MB
Statistics
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 4
Metadata Files
README.md
Multiple Imputation with High-dimensional Imputation Models
Repository hosting project high-dimensional imputation comparison.
Summary
Including a large number of predictors in the imputation model underlying a Multiple Imputation (MI) procedure is one of the most challenging tasks imputers face. A variety of high-dimensional MI techniques (MI-HD) can facilitate this task, but there has been limited research on their relative performance. In this study, we investigate a wide range of extant MI-HD techniques that can handle a large number of predictors in the imputation model and general missing data patterns.
We assess the relative performance of seven MI-HD methods with a Monte Carlo simulation study and a resampling study based on real survey data. The performance of the methods is defined by the degree to which they facilitate unbiased and confidence-valid estimates of the parameters of complete data analysis models.
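As an illustration of the two criteria named above, bias and confidence interval coverage can be computed from the estimates collected across repetitions. The sketch below is not taken from the repository: the function name `performance`, its arguments, and the toy numbers are assumptions made for the example, and percent relative bias is only one common way to express bias.

```r
# Percent relative bias and confidence interval coverage for one parameter.
# est, lower, upper: one value per repetition; true_value: the known parameter.
performance <- function(est, lower, upper, true_value) {
  c(
    prb = 100 * (mean(est) - true_value) / true_value,
    coverage = mean(lower <= true_value & true_value <= upper)
  )
}

# Toy estimates from four hypothetical repetitions
est <- c(0.9, 1.0, 1.1, 1.2)
half_width <- 0.15
res <- performance(est, est - half_width, est + half_width, true_value = 1)
print(res)
```

With real results, the same function would be applied per method, condition, and parameter.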
We find that using regularized regression to select the predictors used in the MI model and using principal component analysis to reduce the dimensionality of auxiliary data produce the best results.
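The dimensionality reduction step can be sketched with base R as follows. This is a minimal example on simulated data, not the code used in the study: the object names are hypothetical, and the 50% threshold only mirrors the "50% rule" mentioned in the result file notes further down in this README.

```r
set.seed(1)

# Simulated auxiliary data: 200 observations on 10 auxiliary variables
aux <- matrix(rnorm(200 * 10), nrow = 200)

# Principal component analysis on the standardized auxiliary variables
pca <- prcomp(aux, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the components
prop_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Keep the smallest number of components explaining at least 50% of the variance
n_pcs <- which(prop_var >= 0.5)[1]
pcs <- pca$x[, seq_len(n_pcs), drop = FALSE]

dim(pcs)
```

The retained component scores in `pcs` would then stand in for the full set of auxiliary variables as predictors in the imputation model.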
Contents
This directory contains the following main subfolders:
- checks: contains scripts checking the expected behavior of different functions and setups
- code: the main software to run the study
- convergence: contains scripts to perform convergence checks
- crossvalidate: contains scripts to perform cross-validation of the ridge penalty for one of the methods used in the study (bridge)
- data: where the EVS data should be stored after cleaning
- input: the folder storing software and other files needed by the study and not available elsewhere
- output: the folder where the results of the scripts located in code are stored
- txt: the folder containing the descriptions of the lavaan model used in the project
How to replicate results
The content of this directory can be used to replicate the results reported in the manuscript: "SMR-21-0138.R1 - High-dimensional imputation for the social sciences: a comparison of state-of-the-art methods"
Running the simulations
We used R for these simulations.
Simulation study (exp1)
Installing Dependencies:
- Open the script init_general.R and install the packages with the traditional install.packages() function.
- Install the package PcAux using devtools::install_github("PcAux")
- Install the package blasso by downloading a compatible version of the package from the package author's website. If you are running on Windows, you need to install g++ to be able to install this package. You can follow these instructions
- Install IVEware by following this guide
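A small helper can avoid re-installing packages that are already present. The sketch below is an assumption about how one might do this, not part of the repository: the `pkgs` vector is a placeholder, and the actual package list is the one in init_general.R.

```r
# Placeholder package names; replace with the packages listed in init_general.R
pkgs <- c("stats", "utils")

# Check which packages can be loaded, without attaching them
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

# Install only what is missing
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```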
Running the simulation:
- Open the script exp1_init.R and make sure that the parameters and conditions of the simulation study are set to the desired values. In particular, pay attention to:
  - parms$IVEloc, which needs to be set to the correct path for the operating system you are running (for more info look for ~/srclib here)
- Open the script exp1simulationscript_win.R
- Make sure the working directory is set to the location of this script (./code/)
- Define the number of clusters to be used by specifying the first argument in the function makeCluster()
- Run the entire script
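For readers unfamiliar with makeCluster(), the following sketch shows how its first argument sets the number of parallel workers. It uses the parallel package shipped with R. The function `run_rep` is a hypothetical stand-in for one simulation repetition, not the function used in the simulation script.

```r
library(parallel)

# Number of workers: the first argument of makeCluster().
# parallel::detectCores() can help choose a value for your machine.
n_clusters <- 2
clus <- makeCluster(n_clusters)

# Hypothetical stand-in for a single simulation repetition
run_rep <- function(rep_id) {
  set.seed(rep_id)
  mean(rnorm(100))
}

# Distribute four repetitions over the workers, then release them
out <- parLapply(clus, 1:4, run_rep)
stopCluster(clus)

length(out)
```

Choosing fewer workers than available cores leaves the machine responsive while the simulation runs.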
Collinearity study (exp1.2)
- Installing Dependencies: same as above
- Running the simulation:
  - Open the script exp1.2_init.R and make sure that the parameters and conditions of the simulation study are set to the desired values. In particular, pay attention to:
    - parms$IVEloc, which needs to be set to the correct path for the operating system you are running (for more info look for ~/srclib here)
  - Open the script exp1.2simulationscript_win.R
  - Make sure the working directory is set to the location of this script (./code/)
  - Define the number of clusters to be used by specifying the first argument in the function makeCluster()
  - Run the entire script
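One way to set parms$IVEloc per operating system is sketched below. Both paths are placeholders, not real install locations, and the sketch assumes `parms` is a plain list, as the `parms$IVEloc` notation suggests. In the actual project, `parms` is defined in the init scripts.

```r
# Placeholder locations; replace with the actual IVEware srclib paths
ive_paths <- c(
  windows = "C:/path/to/srclib",
  unix    = "/path/to/srclib"
)

# Stand-in for the parameter list defined in the init script (assumption)
parms <- list()

# .Platform$OS.type is either "unix" or "windows"
parms$IVEloc <- ive_paths[[.Platform$OS.type]]

# Flag a wrong path early, before the simulation starts
if (!dir.exists(parms$IVEloc)) {
  message("IVEware location not found: ", parms$IVEloc)
}
```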
EVS resampling study (exp4)
- Installing Dependencies: same as above
- Preparing the EVS population data:
  - Download the EVS 2017 third pre-release (https://doi.org/10.4232/1.13511).
  - Store it in the ./data/ folder inside this project.
  - Run the script exp4_prepEVS.R to clean the data and prepare it for the analysis.
- Running the simulation:
  - Open the script exp4simulationscript_win.R
  - Make sure the working directory is set to the location of this script (./code/)
  - Define the number of clusters to be used by specifying the first argument in the function makeCluster()
  - Run the entire script
Obtaining the plots and tables
The procedure is described for the simulation study "exp1". By using the scripts for "exp1.2" and "exp4", the same procedure can be followed for the collinearity study and the EVS resampling study.
- Open the script exp1_results.R and make sure you specify the name of the .rds file obtained from the simulation study run. This script will extract the results reported in the study.
- Open the script exp1_analysis.R and make sure you specify the name of the .rds file obtained from the exp1_results.R run.
To obtain all the plots, you can adjust the parameters that define what the script plots. For example, by changing pm_grep <- "0.3" to 0.1 you will be able to produce the plots for the smaller proportion of missing cases.
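To show how a pattern such as pm_grep can select a subset of conditions, here is a minimal sketch. The condition labels are hypothetical, and how exp1_analysis.R actually uses pm_grep is not shown in this README, so treat this only as an illustration of pattern-based filtering.

```r
# Hypothetical condition labels (pm = proportion of missing cases)
cond_labels <- c("pm=0.1 p=50", "pm=0.3 p=50", "pm=0.1 p=500", "pm=0.3 p=500")

# Pattern identifying the conditions to plot
pm_grep <- "0.3"

# fixed = TRUE matches the string literally, so "." is not a regex wildcard
selected <- grep(pm_grep, cond_labels, fixed = TRUE, value = TRUE)
print(selected)
```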
Keeping track of the results
After a review, you may need to add conditions or repetitions, or tweak other aspects of a simulation study, so you need to be able to re-run only certain parts of the study. This requires being able to stitch together parts of the results. Here, I keep track of which filenames are important for the results. Because of their size, these result files are not stored in this repository directly. You can contact me if you want to get access to any of them.
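Stitching results from separate runs can be sketched as follows. The structure of the result objects is an assumption made for the example (a list with one element per repetition), and the files are written to temporary locations rather than to the output folder. The real combination is handled by the results scripts.

```r
# Toy stand-ins for the output of two separate simulation runs
run_a <- list(rep1 = c(bias = 0.01), rep2 = c(bias = 0.02))
run_b <- list(rep3 = c(bias = 0.03))

# Save each run to its own .rds file
file_a <- tempfile(fileext = ".rds")
file_b <- tempfile(fileext = ".rds")
saveRDS(run_a, file_a)
saveRDS(run_b, file_b)

# Read the runs back and concatenate the repetitions
runs <- lapply(c(file_a, file_b), readRDS)
combined <- do.call(c, runs)

names(combined)
```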
Simulation Study
1. exp1_simOut_20201130_1006.rds
   - 1e3 repetitions
   - all the original methods (pre-SMR submission)
2. exp1_simOut_20220201_1749.rds
   - 1e3 repetitions
   - only additional methods MI-qp and MI-am run as a result of the SMR review
3. exp1_simOut_20220201_1749_res.rds
   - outcome of the exp1_results.R script combining (1) and (2)
4. exp1_simOut_20220225_1035.rds
   - 1e3 repetitions
   - re-run of bridge with correct intercept inclusion
5. exp1_cv_bridge_20220224_1042.rds
   - Output for cross-validation of bridge with the correct use of intercept
6. exp1_simOut_20220225_1035_res.rds
   - Output of the exp1_results.R script combining (1), (2), and (4)
7. exp1_cv_IVEware_20230324_1326.rds
   - Output for cross-validation of IVEware minR2 using 70 iterations
8. exp1_conv_IVEware_20230327_1143.rds
   - Output for convergence checks for IVEware (above 5 iterations everything seems fine)
   - exp1_cv_IVEware_20230331_1121.rds is a version with 70 iterations and 100 multiply imputed datasets
9. exp1_simOut_20230403_1631.rds
   - Output for IVEware method
10. exp1_simOut_20230403_1631_res.rds
    - Output of the exp1_results.R script combining (1), (2), (4), and (9)
Extra Simulation Study on Collinearity
1. exp1_2_convergence_all_meth_20230403_1027.rds
   - Output for convergence checks for all R native methods.
2. exp1_2_cv_IVEware_20230405_1715.rds
   - Output for convergence checks for IVEware data.
3. exp1_2_cv_bridge_20230405_1449.rds
   - Output for cross-validation of ridge parameter for bridge
4. exp1_2_cv_IVEware_20230406_1053.rds
   - Output for cross-validation of minR2 parameter for IVEware
5. exp1_2_simOut_20230408_1748.rds
   - 30 repetitions for all methods (contains MI-QP time estimate!)
6. exp1_2_simOut_20230419_1403.rds
   - 500 repetitions for all R-based methods
7. exp1_2_simOut_20230421_1151.rds
   - 500 repetitions for IVEware method (stepFor / MI-SF)
8. exp1_2_simOut_20230424_0945.rds
   - 500 repetitions for MI-QP
9. exp1_2_simOut_20230421_1424.rds
   - Contains results of MI-PCA (using 50% rule) vs MI-AM test on all collinearity conditions.
10. exp1_2_simOut_20230426_0906.rds
    - Contains results of MI-PCA (using Kaiser rule) vs MI-AM test on all collinearity conditions.
11. exp1_2_simOut_main_results.rds
    - Concatenated version of 6, 7, 8, and MI-PCA-k (Kaiser rule) results from 10.
Resampling Study
1. exp4_simOut_20201204_2121.rds
   - first 500 repetitions
2. exp4_simOut_20201207_1134.rds
   - next 500 repetitions
3. exp4_simOut_20220131_1603.rds
   - 1e3 repetitions
   - only additional methods MI-qp and MI-am run as a result of the SMR review
4. exp4_simOut_20220226_0950.rds
   - 1e3 repetitions
   - re-run of bridge with correct intercept inclusion
5. exp4_simOut_20230323_1551.rds
   - 1e3 repetitions
   - run of IVEware with 70 iterations
6. exp4_simOut_20220226_0950_res.rds
   - outcome of the exp4_results.R script combining (1), (2), (3), and (4)
7. exp4_simOut_20230323_1551_res.rds
   - outcome of the exp4_results.R script combining (1), (2), (3), (4), and (5)
8. exp4_cv_bridge_20220223_1646.rds
   - contains the results for cross-validation of bridge with the correct use of intercept
9. exp4_cv_IVEware_20230322_1841.rds
   - contains the results for cross-validation of IVEware minR2 parameter
10. exp4_cv_IVEware_20230328_1544.rds
    - contains convergence checks results for IVEware on EVS data
Owner
- Name: Edo
- Login: EdoardoCostantini
- Kind: user
- Location: Tilburg, Netherlands
- Company: Tilburg University
- Website: https://edoardocostantini.github.io
- Repositories: 9
- Profile: https://github.com/EdoardoCostantini
Sociologist turned statistician, missed developer, born interior designer but never got there
GitHub Events
Total
- Watch event: 1
- Delete event: 2
Last Year
- Watch event: 1
- Delete event: 2