mi-spcr

Simulation study to compare different supervision approaches to PCR in MICE.

https://github.com/edoardocostantini/mi-spcr

Repository

Basic Info

Host: GitHub
Owner: EdoardoCostantini
License: mit
Language: R
Default Branch: main
Size: 7.07 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 3 years ago · Last pushed almost 2 years ago

Multiple imputation with the use of supervised principal component regression as a univariate imputation method (MI-SPCR)

Summary of project

The goal of the study was to understand how different approaches to supervised principal component analysis (PCA) can help specify the imputation models in a Multivariate Imputation by Chained Equations (MICE) procedure for handling missing values. In particular, I wanted to compare the performance of four univariate imputation methods based on supervised principal component regression (PCR). We refer to this use of supervised PCR as supervised MI-PCR. The purpose of this study was to evaluate the statistical properties of supervised MI-PCR in several settings that differed in the complexity of the latent structure of the data, the proportion of missing cases, the missing data mechanism, and the number of principal components (PCs) used by the imputation models.

Simulation study procedure

We used a Monte Carlo simulation study. The simulation study procedure involved four steps:

Data generation: We generated 500 data sets from a confirmatory factor analysis model.
Missing data imposition: We imposed missing values on three target items in each generated data set.
Imputation: We generated $d$ multiple imputed data tables for each generated data set using each of the different imputation methods.
Analysis: We estimated the mean, variance, covariance, and correlation of the three items with missing values on the $d$ differently imputed data tables, and we pooled the estimates according to Rubin's rules (Rubin, 1987, p. 76).
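
As a minimal sketch of the pooling step, the base R function below applies Rubin's rules to $d$ point estimates and their sampling variances. The function name pool_rubin and the example numbers are hypothetical and are not taken from the project code.

```r
# Pool d point estimates (q) and their squared standard errors (u)
# with Rubin's rules. pool_rubin is a hypothetical helper name.
pool_rubin <- function(q, u, conf = 0.95) {
  d <- length(q)
  q_bar <- mean(q)                    # pooled point estimate
  u_bar <- mean(u)                    # within-imputation variance
  b <- var(q)                         # between-imputation variance
  t_var <- u_bar + (1 + 1 / d) * b    # total variance
  r <- (1 + 1 / d) * b / u_bar        # relative increase in variance
  df <- (d - 1) * (1 + 1 / r)^2       # Rubin's (1987) degrees of freedom
  half <- qt(1 - (1 - conf) / 2, df) * sqrt(t_var)
  c(est = q_bar, se = sqrt(t_var), df = df,
    lwr = q_bar - half, upr = q_bar + half)
}

# Made-up estimates of an item mean from d = 5 imputed data sets
q <- c(5.02, 4.97, 5.10, 4.95, 5.06)
u <- rep(0.08^2, 5)
pooled <- pool_rubin(q, u)
pooled
```

The pooled standard error is larger than the within-imputation standard error whenever the estimates differ across imputed data sets, which is how the uncertainty due to missing values enters the confidence interval.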

We then assessed the performance of each imputation method by computing the following outcome measures:

RB: raw estimation bias;
PRB: percent relative estimation bias;
CIC: confidence interval coverage of the true parameter value;
CIW: average confidence interval width;
mcsd: standard deviation of the estimate across the Monte Carlo simulations;

for the following statistics:

cor: correlation between two items with missing values;
cov: covariance between two items with missing values;
mean: mean of an item with missing values;
var: variance of an item with missing values.
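
The outcome measures can be sketched in base R as follows. This is an illustration under common textbook definitions (for example, PRB as 100 times the raw bias divided by the true value, which is undefined when the true value is 0), not the code in 2-res-1-shape-results.R. The function name performance and the simulated inputs are hypothetical.

```r
# est, lwr, upr: pooled estimates and confidence limits, one per repetition
# truth: true parameter value
performance <- function(est, lwr, upr, truth) {
  rb <- mean(est) - truth
  c(RB   = rb,                                  # raw bias
    PRB  = 100 * rb / truth,                    # percent relative bias
    CIC  = mean(lwr <= truth & truth <= upr),   # CI coverage
    CIW  = mean(upr - lwr),                     # average CI width
    mcsd = sd(est))                             # Monte Carlo SD
}

# Made-up results for 500 repetitions of an unbiased estimator
set.seed(1)
truth <- 5
est <- rnorm(500, mean = truth, sd = 0.08)
half <- qnorm(0.975) * 0.08
out <- performance(est, est - half, est + half, truth)
round(out, 3)
```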

Simulation study fixed factors

These parameters were kept constant to generate the data:

dataset sample size (1000);
number of items per latent variable (3);
mean and standard deviation of observed items (mu = 5, sd = 2.5);
factor loadings (0.85);
correlation between the first two latent variables (0.8);
correlation between the first two latent variables and the others (0.1);
number of items receiving missing values (3);
"shape" of missing values imposed on the three variables with missing values (right, left, tails, respectively).

These parameters were kept constant to impute the data:

number of multiple imputations ($d = 5$)
MICE algorithm iterations (25)
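
To make the data-generation parameters above concrete, here is a minimal base R sketch that draws one data set from a factor model with these values. It assumes a simple structure (each item loads on one latent variable), standardized items before rescaling, and a correlation of 0.1 between every pair of latent variables other than the first two. These choices, and all object names, are assumptions for illustration and may differ from the project's generation script.

```r
set.seed(20221202)
n <- 1000        # sample size
nla <- 10        # number of latent variables (one experimental level)
n_items <- 3     # items per latent variable
loading <- 0.85  # factor loading

# Latent correlation matrix: 0.8 between the first two, 0.1 elsewhere (assumed)
Phi <- matrix(0.1, nla, nla)
Phi[1, 2] <- Phi[2, 1] <- 0.8
diag(Phi) <- 1

# Simple-structure loading matrix and unique variances for unit-variance items
Lambda <- kronecker(diag(nla), matrix(loading, n_items, 1))
Theta <- diag(1 - loading^2, nla * n_items)
Sigma <- Lambda %*% Phi %*% t(Lambda) + Theta

# Draw multivariate normal items, then rescale to mean 5 and sd 2.5
Z <- matrix(rnorm(n * ncol(Sigma)), n)
X <- 5 + 2.5 * (Z %*% chol(Sigma))
colnames(X) <- paste0("X", seq_len(ncol(X)))
dim(X)
```

With this layout, X1 to X3 measure the first latent variable and X4 to X6 measure the second, which matches the items that receive missing values and the MAR predictors.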

Simulation study experimental factors

The simulation study procedure is repeated for each of the conditions resulting from crossing the following experimental factors:

number of latent variables (nla = 2, 10, 50)

From previous work, we know that unsupervised PCA methods need at least as many PCs as there are latent variables in the data-generating model. We vary the true number of latent variables to verify this. The chosen values reflect:
- a simple case with only two latent variables: one measured by the items receiving amputation and imputation, and one measured by the MAR predictors (nla = 2, for a total of 6 items);
- a small dimensionality setup (nla = 10, for a total of 30 items);
- a large dimensionality setup (nla = 50, for a total of 150 items).

proportion of missing data per variable (pm = 0.1, 0.25, 0.5, levels chosen based on literature recommendations)
missing data mechanism (mech = MCAR, MAR)

The mechanisms are summarized by the following matrix, which indicates which predictors are involved (no = 0, yes = 1) in generating the missing values on items X1 to X3:

| mech | X1 | X2 | X3 | X4 | X5 | X6 |
| --- | --- | --- | --- | --- | --- | --- |
| MCAR | 0 | 0 | 0 | 0 | 0 | 0 |
| MAR | 0 | 0 | 0 | 1 | 1 | 1 |
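
The difference between the two mechanisms can be illustrated with a small base R sketch. It uses rank-weighted sampling so that exactly a proportion pm of each target item goes missing, with the "shape" controlling which part of the score distribution is most at risk. This is a simplified stand-in, not the amputation routine used in the study, and ampute_rows is a hypothetical helper.

```r
# Flag which rows become missing.
# score: values driving missingness; pm: proportion missing;
# shape: region of the score distribution most likely to go missing.
ampute_rows <- function(score, pm, shape = c("right", "left", "tails")) {
  shape <- match.arg(shape)
  n <- length(score)
  r <- rank(score)
  w <- switch(shape,
    right = r,                           # high scores more likely missing
    left  = n + 1 - r,                   # low scores more likely missing
    tails = abs(r - (n + 1) / 2) + 0.5   # extreme scores more likely missing
  )
  miss <- logical(n)
  miss[sample.int(n, size = round(n * pm), prob = w)] <- TRUE
  miss
}

set.seed(1234)
n <- 1000
X <- matrix(rnorm(n * 6, mean = 5, sd = 2.5), n, 6,
            dimnames = list(NULL, paste0("X", 1:6)))
pm <- 0.25
mech <- "MAR"
shapes <- c(X1 = "right", X2 = "left", X3 = "tails")

# MAR: missingness depends on the fully observed predictors X4 to X6.
# MCAR: it depends on pure noise, so on none of the items.
score <- if (mech == "MAR") rowMeans(X[, 4:6]) else runif(n)

X_mis <- X
for (j in names(shapes)) {
  X_mis[ampute_rows(score, pm, shapes[[j]]), j] <- NA
}
colMeans(is.na(X_mis))
```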

missing data treatment
- pcr: mice with principal component regression as univariate imputation method;
- spcr: mice with supervised principal component regression (Bair et al., 2006) as univariate imputation method;
- plsr: mice with partial least squares regression (Wold, 1975) as univariate imputation method;
- pcovr: mice with principal covariates regression (De Jong and Kiers, 1992) as univariate imputation method;
- qp: mice with the normal linear model with bootstrap as univariate imputation method and quickpred() used to select the predictors as described by Van Buuren, Boshuizen, and Knook (1999, pp. 687–688);
- am: mice with the normal linear model with bootstrap as univariate imputation method and the analysis model variables used as predictors;
- all: mice with the normal linear model with bootstrap as univariate imputation method and all available items used as predictors;
- cc: complete case analysis;
- fo: fully observed data (results if there had been no missing values).
number of principal components (npcs) used by the approaches based on PCA

These numbers depend on the number of latent variables used:
- for nla = 2, I chose npcs = 1 to 5;
- for nla = 10, I chose npcs = 1 to 12, 20, 29;
- for nla = 50, I chose npcs = 1 to 10, 20, 30, 40, 48 to 52, 60, 149.
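
As a sketch of how the factors above cross, the following base R code builds a table of conditions for the four PCA-based methods, with the npcs levels nested within nla. The object names are hypothetical, and the project may organize its conditions differently (for example, by also including the methods that do not use npcs).

```r
# npcs levels nested within the number of latent variables
npcs_by_nla <- list(
  "2"  = 1:5,
  "10" = c(1:12, 20, 29),
  "50" = c(1:10, 20, 30, 40, 48:52, 60, 149)
)
pca_methods <- c("pcr", "spcr", "plsr", "pcovr")

# Cross pm, mech, method, and npcs separately for each nla, then stack
conds <- do.call(rbind, lapply(names(npcs_by_nla), function(nla) {
  expand.grid(
    nla = as.integer(nla),
    pm = c(0.1, 0.25, 0.5),
    mech = c("MCAR", "MAR"),
    method = pca_methods,
    npcs = npcs_by_nla[[nla]],
    stringsAsFactors = FALSE
  )
}))
nrow(conds)
```

Under these assumptions the PCA-based methods alone account for (5 + 14 + 20) x 3 x 2 x 4 = 936 conditions.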

Results

Check out the results by playing with the Shiny app.

How to replicate results

To replicate the study, you first need to make sure you have installed all the packages used. You can use the ./code/0-prep-install.R script to install them. You should pay special attention to the version of mice you are using. This study uses the special version of this package that is stored in the input/ folder. The forked repository EdoardoCostantini/mice/tree/develop-pcr stores the code for this version. In the following guide, it is assumed that the machine on which the simulation is run already has all packages installed.

Before running the simulation study

To check for convergence issues, I recommend running the script ./code/prep-convergence-check.R before running the simulation study. This script runs the simulation study for the subset of most complex conditions and stores the mids objects so that trace plots can be easily obtained.

To perform the convergence checks, perform the following steps:

Run the first and second sections of the ./code/prep-convergence-check.R
Once the results have been stored, run the third section to read the results and manually check the combinations of npcs and methods that you want to inspect. You can do so by changing the values of the npcs and method objects defined in this script.
Update the parms$mice_iters value in the init.R file to match the number of iterations that you think is sufficient to avoid non-convergence with all methods.

Please note the following:

The object cindex in the script can be used to specify the desired subset of conditions to check convergence. It is acceptable to check convergence for the more challenging conditions and draw conclusions for the entire simulation study.
The number of iterations is set to 100, so that a possible lack of convergence for every multiple imputation method can be assessed.
The run is parallelized over the conditions.
The seed is set per condition.
Every condition is meant to be repeated only once.
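
For readers unfamiliar with trace plots of mids objects, here is a minimal sketch. It assumes the standard CRAN release of mice (not the forked version stored in input/), uses the nhanes example data shipped with mice, and the "norm.boot" method (linear regression with bootstrap). The block is skipped when mice is not installed.

```r
trace_dims <- NULL
if (requireNamespace("mice", quietly = TRUE)) {
  # Impute a small example data set with many iterations
  imp <- mice::mice(mice::nhanes, m = 5, maxit = 20,
                    method = "norm.boot", printFlag = FALSE, seed = 1234)

  # Trace plots: mean and sd of the imputed values per iteration and chain.
  # The plot is a lattice object, so it needs an explicit print() here.
  print(plot(imp))

  # The plotted chain means are stored in the mids object
  trace_dims <- dim(imp$chainMean)
}
trace_dims
```

Chains that mix freely and show no trend over the later iterations suggest that the chosen number of iterations is sufficient.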

After running the simulation study

The simulation study stores mids objects for a small number of repetitions. You can check for non-convergence directly on these: unzip the results from the simulation study, select the files that contain mids in their name, and check convergence as in the third section of ./code/prep-convergence-check.R.

Running the simulation study on Lisa

Lisa Cluster is a cluster computer system managed by SURFsara, a cooperative association of Dutch educational and research institutions. Researchers at most Dutch universities can request access to this cluster computer. Here it is assumed that you know how to access Lisa and upload material to the server. In the following, I list the specific tasks you should go through to replicate the results. Bullet points starting with "PC" and "Lisa" indicate that the task should be performed in a terminal session on either your personal computer or on Lisa, respectively. The idea is that you want (1) to prepare the simulation scripts on your computer, (2) upload them to Lisa, and (3) run the simulation on Lisa.

Prepare run on a personal computer:
- PC: Open 0-init-objects.R:
  - check/define the seed in parms$seed
  - check the fixed parameters and experimental factor levels are set to the desired values.
  - set run_descr to a meaningful description
- PC: Open 0-prep-estimate-time-per-rep.R:
  - Run it to check how long it takes to perform a single run across all the conditions with the chosen simulation study setup. This will create an R object called wall_time.
- PC: Open lisa-js-normal.sh:
  - replace the wall time in the header (#SBATCH -t) with the value of wall_time.
- PC: Open 1-sim-lisa-run.R:
  - Under the header Define stopos lines, define the number of cores to use per node, and the first and last repetitions. For example:
    ```
    # Define how many cores will be used on a node
    ncores <- 16

    # Define repetitions
    firstrep <- 49
    lastrep <- 256
    ```
    This will run the repetitions from 49 to 256 (usually you would start from 1, but you don't need to) and it will use 16 cores on every node.

- PC: Open lisa-do-runRep.R:
  - check that the # Subset conditions? if-statement is set to FALSE if you want to run the full simulation study, or to TRUE if you want to run a smaller trial study with just a few conditions.
- PC: Run prep-lisa-directory.sh:
  - In your terminal, run `. code/0-prep-lisa-directory.sh run-name`. This script creates a folder named run-name in the lisa/ folder on your computer.
- PC: Upload the folder lisa/run-name to Lisa with a command like `scp -r path/to/local/project/lisa/date-run user@lisa.surfsara.nl:mi-spcr`

Prepare and run on lisa:
- Lisa: If this is the first time you are running the study, run prep-install.R to install all R packages: `Rscript mi-spcr/code/prep-install.R`
- Lisa: Check that all the packages are available by running `Rscript mi-spcr/code/init-software.R`. If you don't get any errors, you are good to go.
- Lisa: Run the simulation with the following bash script: `. mi-spcr/code/1-sim-lisa-1-run.sh partition narray`, where:
  - partition should either be short if you are running a small trial or normal if you are running the complete simulation study
  - narray should be the size of the array of jobs; its value should be the one returned by the narray object in the script 1-sim-lisa-1-run.sh (ceiling(goal_reps/ncores))
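
As a worked example of the array size, the snippet below applies ceiling(goal_reps/ncores) to the repetition range used earlier. It assumes that goal_reps is the number of repetitions from firstrep to lastrep inclusive, which is an interpretation and not something stated in the scripts quoted here.

```r
ncores <- 16     # cores used on every node
firstrep <- 49   # first repetition
lastrep <- 256   # last repetition

# Assumption: goal_reps counts the repetitions from firstrep to lastrep
goal_reps <- lastrep - firstrep + 1

# One job per node, each job running ncores repetitions
narray <- ceiling(goal_reps / ncores)
narray
```

With these values, 208 repetitions spread over 16 cores per node give an array of 13 jobs.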
Store the results:
- PC: When the array of jobs is done, you can pull the results to your machine with `scp -r user@lisa.surfsara.nl:mi-spcr/output/folder path/to/local/project/output/folder`. For example, from a terminal session in the main folder: `scp -r user@lisa.surfsara.nl:mi-spcr/output/9829724 ./output/`
Read the results on your computer:
- PC: The script 1-sim-lisa-2-unzip.R goes through the Lisa result folder, unzips tar.gz packages, and puts results together.
- PC: Finally, you can use the script 2-res-1-shape-results.R to compute bias, CIC, and all the outcome measures and prepare the RDS objects that can be plotted with the shiny app in 2-res-2-plots.R.
- Open and run the script 2-res-1-patchwork.R to combine results from different results files.

Running the simulation on a PC / Mac

You can also replicate the simulation on a personal computer by following these steps:

Prepare run:
- Open and run 0-prep-install.R to install all the packages you need to run the simulation. This will override your mice installation. If this is undesirable, you can install these packages in a local library for this project.
- Open 0-init-objects.R:
  - check/define the seed in parms$seed
  - check the fixed parameters and experimental factor levels are set to the desired values.
  - set run_descr to a meaningful description
Run the simulation:
- Open 1-sim-pc-1-run.R
  - set the object reps to be an integer vector with values from 1 to the number of target repetitions you want to run
  - set the object clusters to the number of cores you want to use for parallelization
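
The roles of reps and clusters can be sketched with the base parallel package. This is only an illustration of parallelizing over repetitions: run_one_rep is a hypothetical stand-in for one repetition of the study, and the actual 1-sim-pc-1-run.R script may be organized differently.

```r
library(parallel)

# Hypothetical stand-in for one repetition of the simulation study
run_one_rep <- function(rp) {
  x <- rnorm(100, mean = 5, sd = 2.5)
  c(rep = rp, mean = mean(x))
}

reps <- 1:4     # repetitions to run
clusters <- 2   # number of cores used for parallelization

cl <- makeCluster(clusters)
clusterSetRNGStream(cl, iseed = 1234)  # reproducible streams across workers
out <- parLapply(cl, reps, run_one_rep)
stopCluster(cl)

# Stack the per-repetition results into one matrix
res <- do.call(rbind, out)
res
```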
Read the results:
- Open and run the script 1-sim-pc-2-unzip.R which unzips the results and creates a unique file with all of the important results.
- Open and run the script 2-res-1-shape-results.R to compute bias, CIC, and all the outcome measures and prepare the RDS objects that can be plotted with the shiny app in 2-res-2-plots.R.
- Open and run the script 2-res-1-patchwork.R to combine results from different results files.

Convergence checks

COMING SOON

Result files

For anyone

The final results used for writing up the report are stored in 20221202-105949-results.rds. You can read this file and use the plotting functionalities in 2-res-2-plots.R to interact as you wish with the results. There you can find a shiny app to interact freely with the results in this file. You also have a couple of regular plots.

For housekeeping of the project

Unfortunately, projects get big, people are not shy with the feedback, parts need to be re-run, and it all becomes a mess. The final result file is the result of pasting together results from different runs. Here is a guide to correctly managing them. The current important files are:

9945538-9944296-9943298 (Folder) with the following main results files associated
- 20220827-094950-run-lisa-9945538-9944296-9943298-unzipped.rds containing unzipped raw data
- 20220827-094950-run-lisa-9945538-9944296-9943298-main-res.rds containing processed data (bias, cic, ciw computed)
20221126-121849-pcovr-correct-alpha-tuning.tar.gz which contains the re-run of the PCovR method with the correct alpha tuning. This archive has the following main results files associated:
- 20221126-121849-pcovr-correct-alpha-tuning-pc-unzipped.rds containing unzipped raw data
- 20221126-121849-pcovr-correct-alpha-tuning-pc-main-res.rds containing processed data (bias, cic, ciw computed)
20221202-105949-results.rds contains the combined results you are using. This is the only file that can be found on GitHub; the rest is too big to be stored here.
20220729-151828-check-time-per-rep.tar.gz stores the time to impute for the version with the old alpha PCovR approach
20221222-075917-check-time-per-rep.tar.gz stores the time to impute for the version with the new alpha PCovR approach (taking longer for npcs = ncol(X), but more robust)

References

Bair, E., Hastie, T., Paul, D., & Tibshirani, R. (2006). Prediction by supervised principal components. Journal of the American Statistical Association, 101(473), 119–137.

De Jong, S., & Kiers, H. A. (1992). Principal covariates regression: Part I. Theory. Chemometrics and Intelligent Laboratory Systems, 14(1-3), 155–164.

Wold, H. (1975). Path models with latent variables: The NIPALS approach. In Quantitative sociology (pp. 307–357). Elsevier.

Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys (Vol. 519). New York, NY: John Wiley & Sons.

Van Buuren, S., Boshuizen, H. C., & Knook, D. L. (1999). Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine, 18(6), 681–694.

Owner

Name: Edo
Login: EdoardoCostantini
Kind: user
Location: Tilburg, Netherlands
Company: Tilburg University

Website: https://edoardocostantini.github.io
Repositories: 9
Profile: https://github.com/EdoardoCostantini

Sociologist turned statistician, missed developer, born interior designer but never got there

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Costantini"
  given-names: "Edoardo"
  orcid: "https://orcid.org/0000-0001-9581-9913"
title: "mi-spcr"
version: 3.0.0
doi: 10.5281/zenodo.7390470
date-released: 2022-12-02
url: "https://github.com/EdoardoCostantini/mi-spcr"
