comparison-clustering-longitudinal-data
Supplementary materials for the manuscript "A comparison of methods for clustering longitudinal data with slowly changing trends" by N. G. P. Den Teuling, S.C. Pauws, and E.R. van den Heuvel, published in Communications in Statistics - Simulation and Computation (2021).
https://github.com/philips-labs/comparison-clustering-longitudinal-data
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (16.1%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: philips-labs
- License: gpl-2.0
- Language: R
- Default Branch: main
- Homepage: https://doi.org/10.1080/03610918.2020.1861464
- Size: 144 KB
Statistics
- Stars: 4
- Watchers: 2
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
comparison-clustering-longitudinal-data
This repository contains all R code used in running and analyzing the simulation study and case study reported in the manuscript.
As the simulation study involves many simulation settings (over 27,000) and some methods have rather long estimation times, a custom parallel simulation framework was implemented for use on a computational cluster. A cluster is not strictly needed if you are only interested in replicating a subset of the simulation scenarios or methods, but you will need to configure a Redis database server (https://redis.io/) to run any simulations. Instructions are provided below.
The complete database of simulation results (600 MB) is available upon request.
Useful links
- MixTVEM source code used in the simulation study - https://github.com/dziakj1/MixTVEM
- lcmm R package, used for estimating GMM and GBTM - https://cran.r-project.org/package=lcmm
- kml R package, used for estimating KmL - https://cran.r-project.org/package=kml
- latrend R package: the longitudinal clustering framework that we created, building on the lessons learned from this work - https://github.com/philips-software/latrend
Getting started
- Either load the RStudio project file `comparison.Rproj`, or start an R session with the working directory set to the root repository directory.
- Install the required packages and dependencies:
```R
install.packages(
  c("assertthat", "data.table", "effects", "ggplot2", "igraph", "kml",
    "latex2exp", "lcmm", "lpSolve", "memoise", "mvnfast", "magrittr",
    "multcompView", "nlme", "polynom", "R.utils", "rredis", "scales", "weights"),
  dependencies = TRUE
)
```
- Create an `.Rprofile` file with the following content, changing the file and directory paths as needed:
```R
FIG_DIR <- 'figs' # directory to export figures to
TAB_DIR <- 'tabs' # directory to export model coefficient tables to
OSU_USAGE_DATA_FILE <- '../data/' # case study data
CASE_OSU_RESULTS_DIR <- '../caseresults' # directory where to store the models

REDIS_HOST_FILE <- 'redis/localhost.txt' # file specifying hostname and port
REDIS_PWD <- 'password' # server AUTH password

source('include.R')
```
- Restart the R session. The `.Rprofile` file is now run automatically, which you can tell by the output in the console on start-up. The `include.R` script loads all required packages and functions.
You should now be able to run all functions and scripts. Running simulation studies requires a Redis database server to be configured.
Redis database
The Redis database stores the open jobs as well as the results of completed jobs. Parallel workers fetch jobs from the Redis queue and store their results in the respective experiment set. Storing results in the database avoids the rather large file-system overhead of saving thousands of small result files.
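To make this pattern concrete, the sketch below illustrates the queue idea using the rredis package directly. It is not the repository's actual implementation, which wraps this logic in helpers such as experiment_submit() and redis/worker.R, and the key names and job structure used here are made up for illustration.

```R
library(rredis)

# Conceptual sketch only; key names and job contents are invented.
redisConnect(host = 'localhost', port = 6379)
redisAuth('password')

# Submitting side: push an open job onto a list acting as the queue
redisLPush('jobs:example', list(scenario = 'normal_known', dataseed = 1))

# Worker side: pop the next open job, evaluate it, and store the result
job <- redisRPop('jobs:example')
result <- list(converged = TRUE) # placeholder for the actual model output
redisSet(paste0('results:example:', job$dataseed), result)

redisClose()
```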
Installing Redis server
Windows
- Download the Redis binaries. Older Windows binaries are available at https://github.com/microsoftarchive/redis/
- Install Redis.
- Make sure Redis is added to your system's `PATH` environment variable.
- Let Redis use the default port (6379).
Unix
WIP:
1. Set `BASEDIR` in `redis.ksh`
Starting Redis server
You need to start the Redis server before you can run simulations or retrieve simulation results.
The Redis configuration file included in the repository (`redis/redis.conf`) configures a server on port 6379 with password "password" and a database saved to `redis/database.rdb`. A server password is required because the simulation R code connects to Redis using authentication.
Windows
In order to start the Redis server on Windows, run redis.bat. Alternatively, you can open the command line in the root repository directory and execute redis-server redis/redis.conf
If everything is configured correctly, a console window opens showing the Redis server start-up log.
If no window shows up, the Redis server failed to start. First check that the database directory path exists.
Unix
From the root directory of the repository, run
redis-server redis/redis.conf
Connect to Redis
After you have confirmed that the Redis server is running and you have opened an R session with all scripts loaded, connect to Redis in R by running redis_connect(). You should see the message "Connected to Redis at localhost:6379.".
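For example (the expected console message is shown as a comment):

```R
redis_connect()
#> Connected to Redis at localhost:6379.
```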
Running simulations
All simulation scenarios described in the manuscript are located inside the experiments folder. Simulation scenarios are defined in R scripts prefixed by exp_.
Generating simulation settings
As an example, the simulation settings for the scenario involving a known number of clusters are defined and generated in `exp_normal_known.R`.
Specifically, the scenario with a two-cluster dataset with quadratic trends and varying numbers of trajectories and observations, random-effect levels, and noise levels is generated using:
```R
cases_normal2 <- expand.grid(
  data = c('longdata_randquad2'),
  model = c('longmodel_kml', 'longmodel_gcm2km', 'longmodel_gbtm2', 'longmodel_gmm2', 'longmodel_mixtvem_nugget'),
  numtraj = c(200, 500, 1000),
  numobs = c(4, 10, 25),
  numclus = 2,
  re = c(RE_NORM_LOW, RE_NORM_MED, RE_NORM_HIGH),
  noise = c(.01, .1),
  dataseed = 1:100,
  seed = 1
) %>%
  as.data.table() %T>%
  print()
```
The model names passed through the model argument are names of the functions defined in the methods folder. This makes it relatively easy to define and evaluate new methods.
Providing dataseed = 1:100 results in 100 different datasets being generated.
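As an illustration of the method-function convention mentioned above, the sketch below shows the general shape such a function could take. It is hypothetical and not one of the methods used in the study; it only mirrors the idea that a method receives the case settings, fits a model, and returns a named list of scalar results (see "Retrieving results" below). kmeans is used purely as a stand-in estimator.

```R
# Hypothetical method sketch; the actual functions in the methods folder may
# use different argument names and return different measures.
longmodel_example <- function(data, numclus = 2, seed = 1) {
  # data is assumed here to be a trajectories-by-observations matrix
  set.seed(seed)
  fit <- stats::kmeans(data, centers = numclus) # stand-in for a longitudinal clustering method
  list(
    numclus = numclus,
    withinSS = fit$tot.withinss,   # example scalar outcome
    minClusterSize = min(fit$size) # example scalar outcome
  )
}

# Example usage on random data
longmodel_example(matrix(rnorm(200 * 10), nrow = 200), numclus = 2)
```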
Queueing simulation jobs
After generating the table of simulation settings, we can submit them to the job queue using the experiment_submit() function. Only jobs which have not been previously evaluated are added.
```R
redis_connect() # connect to Redis first
experiment_submit(name = 'normal_known', cases = cases_normal2)
```

Starting parallel workers
The submitted jobs now need to be evaluated. This evaluation is done by worker instances.
To start a simulation worker on Windows, run worker.bat.
However, for this to work, R needs to be in your PATH environment variable so Windows can locate the R executable file.
On Linux, run the following from the command line in the repository directory:
R --slave -f redis/worker.R
On computational clusters, you can start worker batch jobs in a similar manner.
You can start as many workers as your system allows. The workers will pull jobs from the queue and evaluate them. When no more jobs are open, the workers will terminate.
You can also evaluate jobs in the master R session by sourcing the redis/worker.R script.
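For example, in a session where the repository scripts have been loaded:

```R
# Process the open jobs in this R session; the worker stops once the queue is empty
source('redis/worker.R')
```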
Helper functions
Jobs
```R
job_monitor() # monitor the number of remaining jobs over time
job_count()   # returns the number of open jobs
job_clear()   # clear the job queue
```
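These helpers can also be combined into a simple wait loop, for example to block the master session until all submitted jobs have been evaluated (a sketch; adjust the polling interval as needed):

```R
# Poll the queue until no open jobs remain
while (job_count() > 0) {
  Sys.sleep(60) # check once per minute
}
```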
Experiments
```R
experiment_names()      # get the list of evaluated experiments
experiment_delete(name) # delete all results of the respective experiment
```
Evaluating simulation results
Simulation results can be retrieved and analyzed at any time; doing so returns all job results completed up to that moment. All simulation scenario analysis scripts are located inside the experiments folder, prefixed by analysis_.
Retrieving results
Methods output their results as a named list of scalar values. Results can therefore be easily combined into a table. All evaluated cases can be retrieved as a single data.table object using the experiment_getOutputTable() function.
```R
results_normal_all <- experiment_getOutputTable('normal_known')
head(results_normal_all)
```
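A typical next step is to summarize the outcomes per simulation setting using data.table. The sketch below assumes that the case settings (e.g., model, numtraj, numobs, noise) appear as columns in the output table and uses a hypothetical outcome column named ARI; substitute the columns actually present in your results.

```R
# Hypothetical aggregation; the ARI column name is an assumption
results_normal_all[,
  .(meanARI = mean(ARI, na.rm = TRUE), numResults = .N),
  keyby = .(model, numtraj, numobs, noise)
]
```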

Owner
- Name: Philips Labs
- Login: philips-labs
- Kind: organization
- Location: Netherlands
- Repositories: 131
- Profile: https://github.com/philips-labs
Philips Labs - Projects in development
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
preferred-citation:
  type: article
  authors:
    - family-names: "Den Teuling"
      given-names: "Niek G. P."
      orcid: "https://orcid.org/0000-0003-1026-5080"
    - family-names: "Pauws"
      given-names: "Steffen C."
      orcid: "https://orcid.org/0000-0003-2257-9239"
    - family-names: "van den Heuvel"
      given-names: "Edwin R."
      orcid: "https://orcid.org/0000-0001-9157-7224"
  doi: "10.1080/03610918.2020.1861464"
  journal: "Communications in Statistics - Simulation and Computation"
  start: 1 # First page number
  end: 28 # Last page number
  title: "A comparison of methods for clustering longitudinal data with slowly changing trends"
  year: 2021
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0