simulacrumWorkflowR

simulacrumWorkflowR: An R package for Streamlined Access and Analysis of the Simulacrum Cancer Dataset - Published in JOSS (2025)

https://github.com/clinda-aau/simulacrumworkflowr

Repository

An R library to ease the process of using Simulacrum.

Basic Info

Host: GitHub
Owner: CLINDA-AAU
License: MIT
Language: R
Default Branch: main
Size: 3.9 MB

Statistics

Stars: 1
Watchers: 4
Forks: 4
Open Issues: 0
Releases: 2

Created over 1 year ago · Last pushed 6 months ago

README.Rmd

---
output: github_document
---



```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
library(knitr)
```

# simulacrumWorkflowR

[![status](https://joss.theoj.org/papers/10.21105/joss.08120/status.svg)](https://doi.org/10.21105/joss.08120)


simulacrumWorkflowR is a package that helps users of the Simulacrum dataset develop and test their analyses as a precursor to accessing real patient data in the Cancer Analysis System (CAS).

The Simulacrum data is a synthetic version of the real patient data at CAS. It is publicly available and can be used to create and test analyses in R or STATA before executing them on the real data. However, setting up Simulacrum requires creating a local Oracle database, importing the data, and setting up an ODBC connection. To simplify this process, the simulacrumWorkflowR package automates the setup of a database within R and provides various utility functions for preprocessing, query generation, and query testing.

# Installation
simulacrumWorkflowR may be installed using the following command:
```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github(
  "CLINDA-AAU/simulacrumWorkflowR",
  dependencies = TRUE,
  force = TRUE
)
```

# Overview
The main features of simulacrumWorkflowR are:

- Integrated SQL Environment: Leverages the sqldf (Grothendieck, 2017) package to enable SQL queries directly within R, eliminating the need for external database setup and ODBC connections by creating a local SQLite temporary database within the R environment.

- Query Helper: Offers a collection of queries custom-made for the Simulacrum, for pulling and merging certain tables. Additionally, the sqlite2oracle function assists in translating queries to be compatible with the NHS servers.

- Helper Tools: Offers a range of data preprocessing functions for cleaning and preparing the data for analysis, ensuring data quality and consistency. Key functions include cancer type grouping, survival status, and detailed logging.

- Workflow Generator: Generates an R script containing the complete workflow, with the correct layout and all the code needed for submission to the NHS and execution on the CAS database.

# The process 
The process of using this package to get access to the data at CAS through Simulacrum is as follows:

1) Download the latest version of Simulacrum at: 
```{r}
library(simulacrumWorkflowR)
open_simulacrum_request()
```
Or at the link: https://simulacrum.healthdatainsight.org.uk/using-the-simulacrum/requesting-data/

2) Copy the directory path of the Simulacrum files on your local machine

3) Use the package's data loader function to load the files into R 

4) Utilize R to handle data preprocessing and analysis 

5) Save the complete workflow with the workflow generator function 

6) Send the workflow to the NHS and wait for the results

# Explanation of the workflow
The workflow is built around the sqldf package, which sets up a temporary SQLite database automatically and within seconds, with no manual configuration. Before the database is initialised, the user must download the latest version of the Simulacrum data (v2.1.0): https://simulacrum.healthdatainsight.org.uk/using-the-simulacrum/requesting-data/ .

The latest Simulacrum data is formatted identically to the real CAS data. Once downloaded, the read_simulacrum() function can automatically load the CSV files as data frames in R:

```{r}
dir <- system.file("extdata", "minisimulacrum", package = "simulacrumWorkflowR")
# Automated data loading 
data_frames_lists <- read_simulacrum(dir, selected_files = c("sim_av_patient", "sim_av_tumour")) 
```

Access individual data frames as follows:

```{r}
SIM_AV_PATIENT <- data_frames_lists$sim_av_patient
SIM_AV_TUMOUR <- data_frames_lists$sim_av_tumour
```
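For readers curious about what a directory loader of this kind involves, the following is a minimal base R sketch. It is not the implementation of `read_simulacrum()`; the helper name `load_csv_dir()` and the toy files are assumptions made for illustration only.

```r
# Hypothetical helper (not the package's implementation): read every CSV file
# in a directory into a named list of data frames.
load_csv_dir <- function(dir, selected_files = NULL) {
  paths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  # Use the file name without its extension as the list name
  names(paths) <- tools::file_path_sans_ext(basename(paths))
  if (!is.null(selected_files)) {
    paths <- paths[names(paths) %in% selected_files]
  }
  lapply(paths, read.csv, stringsAsFactors = FALSE)
}

# Toy demonstration with two small files in a temporary directory
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(patientid = 1:3),
          file.path(demo_dir, "sim_av_patient.csv"), row.names = FALSE)
write.csv(data.frame(patientid = c(1, 3), tumourid = c(10, 30)),
          file.path(demo_dir, "sim_av_tumour.csv"), row.names = FALSE)

# Load only the patient table and inspect the list names
demo_frames <- load_csv_dir(demo_dir, selected_files = "sim_av_patient")
names(demo_frames)
```

Returning a named list keeps each table addressable with `$`, in the same way as the `data_frames_lists` object above.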

Once data frames are loaded, you can start writing queries. It's recommended to keep queries simple and handle data management in R. Use the table_query_list function to access premade query templates. For example, to merge tables:
```{r}
query <- "SELECT *
FROM SIM_AV_PATIENT
INNER JOIN SIM_AV_TUMOUR ON SIM_AV_PATIENT.patientid = SIM_AV_TUMOUR.patientid;"
```


Execute queries with the query_sql() function:

```{r}
query_result <- query_sql(query)
```

## SQLite to Oracle Query Translation

To accommodate differences between SQLite and Oracle queries, use the sqlite2oracle() function:

```{r}

query2 <- "select *
from SIM_AV_PATIENT
where age > 50
limit 500;"

sqlite2oracle(query2)
```
Note: This function is built into `create_workflow()`.
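One difference between the two dialects is row limiting: SQLite uses `LIMIT n`, whereas Oracle (12c and later) uses `FETCH FIRST n ROWS ONLY`. The sketch below illustrates that single rewrite in base R. It is not how `sqlite2oracle()` is implemented, and the helper name `limit_to_fetch_first()` is hypothetical.

```r
# Hypothetical illustration, not the package's sqlite2oracle() implementation:
# rewrite a trailing SQLite "LIMIT n" as Oracle's "FETCH FIRST n ROWS ONLY".
# The pattern also drops a trailing semicolon, if one is present.
limit_to_fetch_first <- function(query) {
  gsub(
    "limit\\s+(\\d+)\\s*;?\\s*$",
    "FETCH FIRST \\1 ROWS ONLY",
    query,
    ignore.case = TRUE,
    perl = TRUE
  )
}

translated <- limit_to_fetch_first(
  "select * from SIM_AV_PATIENT where age > 50 limit 500;"
)
print(translated)
```

A real translator has to handle more than this one clause, so treat the sketch as a picture of the idea rather than a substitute for the package function.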

## Preprocessing Functions

simulacrumWorkflowR includes functions to simplify data preprocessing:

- `cancer_grouping()`
- `group_ethnicity()`
- `extended_summary()`
- `survival_days()`
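As a rough idea of the kind of derived variable such preprocessing can produce, the base R sketch below computes the number of days between two dates on toy data. The column names, the toy values, and the helper `add_days_between()` are assumptions for illustration; they do not describe the actual behaviour or arguments of `survival_days()`.

```r
# Toy data; the column names below are assumptions for illustration only
toy <- data.frame(
  patientid = 1:3,
  diagnosis_date = as.Date(c("2016-01-10", "2017-03-01", "2018-06-15")),
  status_date = as.Date(c("2016-12-31", "2017-03-31", "2019-06-15"))
)

# Hypothetical helper: days elapsed between a start and an end date column
add_days_between <- function(df, start_col, end_col) {
  # Subtracting two Date vectors yields a difference measured in days
  df$days_between <- as.numeric(df[[end_col]] - df[[start_col]])
  df
}

toy <- add_days_between(toy, "diagnosis_date", "status_date")
print(toy$days_between)
```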

## Workflow Generation

When data management and analysis are complete, use the workflow generator function to produce an R script ready for submission to the NHS:

```{r}
create_workflow(
  libraries = "library(dplyr)
               library(simulacrumWorkflowR)",
  query = "SELECT *
           FROM sim_av_patient
           INNER JOIN sim_av_tumour ON sim_av_patient.patientid = sim_av_tumour.patientid
           LIMIT 500;",
  data_management = "
    # Group cancer types in the query result
    data <- cancer_grouping(query_result)

    # Additional preprocessing
    modified_data <- survival_days(data)
  ",
  analysis = "model = glm(AGE ~ STAGE_BEST + GRADE, data = modified_data)",
  model_results = "html_table_model(model)"
)
```

The generated script combines the libraries, query, data management, analysis, and model output steps into a single file laid out for execution on the CAS database.

If an error occurs on the NHS servers while the analysis pipeline is running, the `time_management` function and base R's `sink()` produce a log to help with debugging.
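To show how `sink()` can capture progress and error messages in a file, here is a minimal base R sketch. The log file name and the wrapper `run_logged()` are placeholders chosen for this example; the sketch does not reproduce `time_management` or the log format the package writes.

```r
# Minimal sketch of logging with base R's sink(); the file name and wrapper
# are placeholders, and this is not the package's time_management().
log_file <- file.path(tempdir(), "workflow_log.txt")

run_logged <- function(step_name, step) {
  sink(log_file, append = TRUE)   # divert console output to the log file
  on.exit(sink(), add = TRUE)     # always restore console output afterwards
  cat("Starting:", step_name, "at", format(Sys.time()), "\n")
  result <- tryCatch(
    step(),
    error = function(e) {
      # Record the error message instead of stopping the whole script
      cat("Error in", step_name, ":", conditionMessage(e), "\n")
      NULL
    }
  )
  cat("Finished:", step_name, "\n")
  result
}

ok <- run_logged("toy step", function() mean(c(1, 2, 3)))
failed <- run_logged("broken step", function() stop("column not found"))

# Read the log back to see what was recorded
log_lines <- readLines(log_file)
```

Because the script runs remotely and unattended, writing each step's start, end, and error message to a file is what lets the author diagnose a failure from the returned log alone.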

# References

- Grothendieck, G. (2017). sqldf: Manipulate R Data Frames Using SQL. Link: ggrothendieck/sqldf: Perform SQL Selects on R Data Frames

- Frayling, L., & Jose, S. (2023). Simulacrum v2 User Guide. Health Data Insight. Link: Simulacrum-v2-User-Guide.pdf

- National Disease Registration Service (NDRS). (2023). Guide to using Simulacrum and Submitting code. Link: https://digital.nhs.uk/ndrs/data/data-outputs/cancer-publications-and-tools/simulacrum/simulacrum-user-guide/developing-code-using-simulacrum-for-a-data-release-request

Owner

Name: Center for Clinical Data Science (CLINDA)
Login: CLINDA-AAU
Kind: organization
Location: Aalborg, Denmark

Website: https://clinda.aau.dk/
Repositories: 17
Profile: https://github.com/CLINDA-AAU

Center for Clinical Data Science - Aalborg University & Aalborg University Hospital

GitHub Events

Total

Create event: 1
Issues event: 1
Release event: 1
Watch event: 2
Issue comment event: 34
Public event: 1
Push event: 75
Pull request event: 1
Fork event: 3

Last Year

Create event: 1
Issues event: 1
Release event: 1
Watch event: 2
Issue comment event: 34
Public event: 1
Push event: 75
Pull request event: 1
Fork event: 3

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 2
Total pull requests: 3
Average time to close issues: about 2 months
Average time to close pull requests: about 8 hours
Total issue authors: 2
Total pull request authors: 2
Average comments per issue: 21.0
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 3
Average time to close issues: about 2 months
Average time to close pull requests: about 8 hours
Issue authors: 2
Pull request authors: 2
Average comments per issue: 21.0
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

aghaynes (1)
goldingn (1)

Pull Request Authors

goldingn (2)
DrEspresso (1)

Dependencies

.github/workflows/pdf_gen.yml actions

actions/checkout v4 composite
actions/upload-artifact v4 composite
openjournals/openjournals-draft-action master composite

DESCRIPTION cran
