simulacrumWorkflowR

simulacrumWorkflowR: An R package for Streamlined Access and Analysis of the Simulacrum Cancer Dataset - Published in JOSS (2025)

https://github.com/clinda-aau/simulacrumworkflowr

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: joss.theoj.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.8%) to scientific vocabulary

Scientific Fields

Earth and Environmental Sciences Physical Sciences - 40% confidence
Last synced: 4 months ago

Repository

An R library to ease the process of using Simulacrum.

Basic Info
  • Host: GitHub
  • Owner: CLINDA-AAU
  • License: mit
  • Language: R
  • Default Branch: main
  • Homepage:
  • Size: 3.9 MB
Statistics
  • Stars: 1
  • Watchers: 4
  • Forks: 4
  • Open Issues: 0
  • Releases: 2
Created about 1 year ago · Last pushed 5 months ago
Metadata Files
Readme Contributing License Code of conduct

README.Rmd

---
output: github_document
---



```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
library(knitr)
```

# simulacrumWorkflowR

[![status](https://joss.theoj.org/papers/10.21105/joss.08120/status.svg)](https://doi.org/10.21105/joss.08120)


simulacrumWorkflowR is a package that helps users of the Simulacrum dataset prepare their analyses before requesting access to real patient data in the Cancer Analysis System (CAS).

The Simulacrum data is a synthetic version of the real patient data at CAS. It is publicly available and can be used to create and test analyses in R or STATA before executing them on the real data. However, setting up Simulacrum requires creating a local Oracle database, importing the data, and setting up an ODBC connection. To simplify this process, the simulacrumWorkflowR package automates the setup of a database within R and provides various utility functions for preprocessing, query generation, and query testing.

# Installation
simulacrumWorkflowR may be installed using the following command:
```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github(
  "CLINDA-AAU/simulacrumWorkflowR",
  dependencies = TRUE,
  force = TRUE
)
```

# Overview
The main functions of simulacrumWorkflowR are:

- Integrated SQL Environment: Leverages the sqldf (Grothendieck, 2017) package to enable SQL queries directly within R, eliminating the need for external database setup and ODBC connections by creating a local SQLite temporary database within the R environment.  

- Query Helper: Offers a collection of queries custom-made for Simulacrum, for pulling and merging certain tables. Additionally, the sqlite2oracle function helps translate queries so that they are compatible with the NHS servers.

- Helper Tools: Offers a range of data preprocessing functions for cleaning and preparing the data for analysis, ensuring data quality and consistency. Key functions include cancer type grouping, survival status, and detailed logging.

- Workflow Generator: Generates an R script containing the complete workflow, ensuring a correct layout and integrating all the code needed for a workflow suitable for submission to the NHS and execution on the CAS database.

# The process 
The process of using this package to gain access to the data at CAS through Simulacrum is as follows:

1) Download the latest version of Simulacrum at: 
```{r}
library(simulacrumWorkflowR)
open_simulacrum_request()
```
Or at the link: https://simulacrum.healthdatainsight.org.uk/using-the-simulacrum/requesting-data/

2) Copy the directory path of the Simulacrum files on your local machine

3) Use the package's data loader function to load the files into R 

4) Utilize R to handle data preprocessing and analysis 

5) Save the complete workflow with the workflow generator function 

6) Send the workflow to the NHS and wait for the results

# Explanation of the workflow
The workflow is built around the sqldf package, which lets the user set up a temporary SQLite database automatically and within seconds, without any visible database administration. Before the database is initialised, the user must download the latest version of the Simulacrum data (v2.1.0): https://simulacrum.healthdatainsight.org.uk/using-the-simulacrum/requesting-data/

The latest Simulacrum data is formatted identically to the real CAS data. Once downloaded, the read_simulacrum() function can automatically load the CSV files as data frames in R:

```{r}
dir <- system.file("extdata", "minisimulacrum", package = "simulacrumWorkflowR")
# Automated data loading 
data_frames_lists <- read_simulacrum(dir, selected_files = c("sim_av_patient", "sim_av_tumour")) 
```

Access individual data frames as follows:

```{r}
SIM_AV_PATIENT <- data_frames_lists$sim_av_patient
SIM_AV_TUMOUR <- data_frames_lists$sim_av_tumour
```
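
To make the loading step more concrete, the following is a minimal sketch of what a directory loader of this kind might do, using base R only. It is not the implementation of `read_simulacrum()`; the helper name `read_csv_dir()` and the toy CSV files are assumptions made purely for illustration.

```r
# Hypothetical helper: NOT the package's read_simulacrum() implementation.
# Reads every CSV file in a directory into a named list of data frames.
read_csv_dir <- function(dir, selected_files = NULL) {
  paths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  # Name each entry after its file, without the .csv extension
  names(paths) <- tools::file_path_sans_ext(basename(paths))
  if (!is.null(selected_files)) {
    paths <- paths[names(paths) %in% selected_files]
  }
  lapply(paths, read.csv, stringsAsFactors = FALSE)
}

# Toy files standing in for the Simulacrum CSV files
demo_dir <- file.path(tempdir(), "toy_simulacrum")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(PATIENTID = 1:3, GENDER = c(1, 2, 2)),
          file.path(demo_dir, "sim_av_patient.csv"), row.names = FALSE)
write.csv(data.frame(TUMOURID = 11:13, PATIENTID = c(1, 1, 3)),
          file.path(demo_dir, "sim_av_tumour.csv"), row.names = FALSE)

# Load only the patient table, then inspect what was read
toy_frames <- read_csv_dir(demo_dir, selected_files = "sim_av_patient")
names(toy_frames)
```

Returning a named list keeps each table addressable with `$`, which mirrors how the data frames are accessed above.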

Once data frames are loaded, you can start writing queries. It's recommended to keep queries simple and handle data management in R. Use the table_query_list function to access premade query templates. For example, to merge tables:
```{r}
query <- "SELECT *
FROM SIM_AV_PATIENT
INNER JOIN SIM_AV_TUMOUR ON SIM_AV_PATIENT.patientid = SIM_AV_TUMOUR.patientid;"
```


Execute queries with the query_sql() function:

```{r}
query_result <- query_sql(query)
```
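
As a cross-check on a join like the one above, the same inner join can be reproduced in base R with `merge()`. This is only a sketch on toy data; the data frames `toy_patient` and `toy_tumour` and their columns are made up for illustration and are not Simulacrum tables.

```r
# Toy tables: patient 1 has two tumours, patients 2 and 3 have none,
# and tumour patient 4 has no matching patient row
toy_patient <- data.frame(patientid = c(1, 2, 3), gender = c("F", "M", "F"))
toy_tumour  <- data.frame(patientid = c(1, 1, 4), site = c("C50", "C18", "C34"))

# merge() performs an inner join by default (all = FALSE)
joined <- merge(toy_patient, toy_tumour, by = "patientid")
nrow(joined)
```

One difference to keep in mind: `SELECT *` over a SQL join returns the key column from both tables, whereas `merge()` keeps a single `patientid` column.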
## SQLite to Oracle Query Translation

To accommodate differences between SQLite and Oracle queries, use the sqlite2oracle() function:

```{r}

query2 <- "select *
from SIM_AV_PATIENT
where age > 50
limit 500;"

sqlite2oracle(query2)
```
Note: This function is built into `create_workflow()`.
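
To illustrate the kind of dialect difference involved, the sketch below rewrites a trailing SQLite `LIMIT n` clause into the `FETCH FIRST n ROWS ONLY` form accepted by Oracle Database 12c and later. The helper `limit_to_fetch_first()` is hypothetical and handles only this single case; it is not the implementation of `sqlite2oracle()`, and the output of that function may differ.

```r
# Hypothetical helper: rewrites only a trailing LIMIT clause.
# NOT the package's sqlite2oracle() implementation.
limit_to_fetch_first <- function(query) {
  # Match "limit <n>" at the end of the query, case-insensitively,
  # together with an optional trailing semicolon
  pattern <- "(?i)\\s+limit\\s+(\\d+)\\s*;?\\s*$"
  sub(pattern, "\nFETCH FIRST \\1 ROWS ONLY", query, perl = TRUE)
}

toy_query <- "select *
from SIM_AV_PATIENT
where age > 50
limit 500;"

cat(limit_to_fetch_first(toy_query))
```

A query without a trailing `LIMIT` clause is returned unchanged. Note that this sketch also drops the trailing semicolon.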

## Preprocessing Functions

simulacrumWorkflowR includes functions to simplify data preprocessing:

- `cancer_grouping()`
- `group_ethnicity()`
- `extended_summary()`
- `survival_days()`
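
As an illustration of the kind of derived variable such preprocessing produces, the sketch below computes the number of days between a diagnosis date and a vital status date in base R. The column names `DIAGNOSISDATEBEST` and `VITALSTATUSDATE` are assumed here, and the calculation is not necessarily the one that `survival_days()` performs.

```r
# Toy data; the column names are assumptions for illustration
toy <- data.frame(
  DIAGNOSISDATEBEST = as.Date(c("2017-01-10", "2018-06-01")),
  VITALSTATUSDATE   = as.Date(c("2017-03-11", "2019-06-01"))
)

# Days elapsed between diagnosis and the date of last known vital status
toy$days_survived <- as.numeric(
  difftime(toy$VITALSTATUSDATE, toy$DIAGNOSISDATEBEST, units = "days")
)
toy
```

Setting `units = "days"` explicitly avoids relying on the unit that `difftime()` would otherwise choose automatically.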

## Workflow Generation

When data management and analysis are complete, use the workflow generator function to produce an R script ready for submission to the NHS:

```{r}
create_workflow(
  libraries = "library(dplyr)
               library(simulacrumWorkflowR)",
  query = "SELECT *
           FROM sim_av_patient
           INNER JOIN sim_av_tumour ON sim_av_patient.patientid = sim_av_tumour.patientid
           limit 500;",
  data_management = "
    # Group cancer types in the query result
    data <- cancer_grouping(query_result)

    # Additional preprocessing
    modified_data <- survival_days(data)
  ",
  analysis = "model <- glm(AGE ~ STAGE_BEST + GRADE, data = modified_data)",
  model_results = "html_table_model(model)"
)
```

This workflow automates the process, ensuring easy integration and preparation of your Simulacrum data.
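
Because each section is passed as a string, a syntax mistake would otherwise surface only when the script is run. The sketch below shows one way to assemble section strings into a script file and confirm that it parses, using base R only. The section contents are toy examples, and this is not how `create_workflow()` is implemented.

```r
# Hypothetical section strings, in the order they should appear in the script
sections <- c(
  libraries       = "library(stats)",
  data_management = "cars_subset <- cars[cars$speed > 10, ]",
  analysis        = "model <- lm(dist ~ speed, data = cars_subset)"
)

# Write one section per line to a temporary script file
script_file <- tempfile(fileext = ".R")
writeLines(sections, script_file)

# parse() raises an error on a syntax mistake without running any of the code
parsed <- parse(file = script_file)
length(parsed)
```

Parsing catches only syntax errors; it does not check that the functions or columns referenced in the script exist.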

If an error occurs on the NHS servers while the analysis pipeline is running, the `time_management` function and base R's `sink()` generate a detailed log to support debugging.
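
For readers unfamiliar with `sink()`, the following base R sketch shows how printed output and error messages can be diverted to a log file and how the diversion is undone afterwards. The log file location and the simulated failure are illustrative assumptions; the sketch does not reproduce the logging that the package itself sets up.

```r
log_file <- tempfile(fileext = ".log")
log_con <- file(log_file, open = "wt")

sink(log_con)                    # divert printed output to the log
sink(log_con, type = "message")  # divert messages, warnings and errors too

result <- tryCatch({
  cat("Started:", format(Sys.time()), "\n")
  stop("simulated failure")      # stands in for an error in the analysis
}, error = function(e) {
  message("Error: ", conditionMessage(e))
  NULL
}, finally = {
  # Always restore the console, even when the analysis fails
  sink(type = "message")
  sink()
  close(log_con)
})

log_lines <- readLines(log_file)
```

Restoring both sinks inside `finally` matters: if the diversion is left active, later output and error messages silently disappear into the log file.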

# References

- Grothendieck G. (2017). sqldf: Manipulate R Data Frames Using SQL. Link: ggrothendieck/sqldf: Perform SQL Selects on R Data Frames

- Frayling L, Jose S. (2023). Simulacrum v2 User Guide. Health Data Insight. Link: Simulacrum-v2-User-Guide.pdf

- National Disease Registration Service (NDRS). (2023). Guide to using Simulacrum and Submitting code. Link: https://digital.nhs.uk/ndrs/data/data-outputs/cancer-publications-and-tools/simulacrum/simulacrum-user-guide/developing-code-using-simulacrum-for-a-data-release-request 

Owner

  • Name: Center for Clinical Data Science (CLINDA)
  • Login: CLINDA-AAU
  • Kind: organization
  • Location: Aalborg, Denmark

Center for Clinical Data Science - Aalborg University & Aalborg University Hospital

GitHub Events

Total
  • Create event: 1
  • Issues event: 1
  • Release event: 1
  • Watch event: 2
  • Issue comment event: 34
  • Public event: 1
  • Push event: 75
  • Pull request event: 1
  • Fork event: 3
Last Year
  • Create event: 1
  • Issues event: 1
  • Release event: 1
  • Watch event: 2
  • Issue comment event: 34
  • Public event: 1
  • Push event: 75
  • Pull request event: 1
  • Fork event: 3

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 2
  • Total pull requests: 3
  • Average time to close issues: about 2 months
  • Average time to close pull requests: about 8 hours
  • Total issue authors: 2
  • Total pull request authors: 2
  • Average comments per issue: 21.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 3
  • Average time to close issues: about 2 months
  • Average time to close pull requests: about 8 hours
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 21.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • aghaynes (1)
  • goldingn (1)
Pull Request Authors
  • goldingn (2)
  • DrEspresso (1)

Dependencies

.github/workflows/pdf_gen.yml actions
  • actions/checkout v4 composite
  • actions/upload-artifact v4 composite
  • openjournals/openjournals-draft-action master composite
DESCRIPTION cran