simulacrumWorkflowR
simulacrumWorkflowR: An R package for Streamlined Access and Analysis of the Simulacrum Cancer Dataset - Published in JOSS (2025)
Repository
An R library to ease the process of using Simulacrum.
Basic Info
Statistics
- Stars: 1
- Watchers: 4
- Forks: 4
- Open Issues: 0
- Releases: 2
Created about 1 year ago
· Last pushed 5 months ago
README.Rmd
---
output: github_document
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
library(knitr)
```
# simulacrumWorkflowR
DOI: [10.21105/joss.08120](https://doi.org/10.21105/joss.08120)
simulacrumWorkflowR is a package that helps users of the Simulacrum dataset prepare their analyses as a precursor to accessing real patient data in the Cancer Analysis System (CAS).
The Simulacrum data is a synthetic version of the real patient data at CAS. It is publicly available and can be used to create and test analyses in R or STATA before executing them on the real data. However, setting up Simulacrum requires creating a local Oracle database, importing the data, and setting up an ODBC connection. To simplify this process, the simulacrumWorkflowR package automates the setup of a database within R and provides various utility functions for preprocessing, query generation, and query testing.
# Installation
simulacrumWorkflowR may be installed using the following command:
```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("CLINDA-AAU/simulacrumWorkflowR",
dependencies = TRUE, force = TRUE)
```
# Overview
The main functions of simulacrumWorkflowR are:
- Integrated SQL Environment: Leverages the sqldf (Grothendieck, 2017) package to run SQL queries directly within R. It creates a temporary local SQLite database inside the R environment, so no external database setup or ODBC connection is needed.
- Query Helper: Offers a collection of queries custom-made for the Simulacrum for pulling and merging certain tables. Additionally, the `sqlite2oracle()` function helps translate queries so that they are compatible with the NHS servers.
- Helper Tools: Offers a range of data preprocessing functions for cleaning and preparing the data for analysis, ensuring data quality and consistency. Key functions include cancer type grouping, survival status, and detailed logging.
- Workflow Generator: Generates an R script containing the complete workflow, with the correct layout and all the code needed for submission to the NHS and execution on the CAS database.
# The process
The process of using this package to get access to the data at CAS through Simulacrum is as follows:
1) Download the latest version of Simulacrum at:
```{r}
library(simulacrumWorkflowR)
open_simulacrum_request()
```
Or at the link: https://simulacrum.healthdatainsight.org.uk/using-the-simulacrum/requesting-data/
2) Copy the directory path of the Simulacrum files on your local machine
3) Use the package's data loader function to load the files into R
4) Utilize R to handle data preprocessing and analysis
5) Save the complete workflow with the workflow generator function
6) Send the workflow to the NHS and wait for the results
# Explanation of the workflow
The workflow is built around the sqldf package, which lets the user set up a temporary database behind the scenes in a matter of seconds, fully automated. Before the database is initialised, the user must download the latest version of the Simulacrum data (v2.1.0): https://simulacrum.healthdatainsight.org.uk/using-the-simulacrum/requesting-data/ .
The latest Simulacrum data is formatted identically to the real CAS data. Once downloaded, the read_simulacrum() function can automatically load the CSV files as data frames in R:
```{r}
dir <- system.file("extdata", "minisimulacrum", package = "simulacrumWorkflowR")
# Automated data loading
data_frames_lists <- read_simulacrum(dir, selected_files = c("sim_av_patient", "sim_av_tumour"))
```
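For readers curious what a loading step of this kind involves, the following is a minimal base R sketch, not the package's implementation. The helper name `load_csv_dir()` and the tiny demo files are hypothetical; the sketch simply reads every CSV file in a directory into a named list of data frames, optionally restricted to selected files.

```r
# Hypothetical helper (not part of simulacrumWorkflowR): read CSV files from a
# directory into a named list of data frames, keyed by file name without ".csv".
load_csv_dir <- function(dir, selected_files = NULL) {
  paths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  names(paths) <- tools::file_path_sans_ext(basename(paths))
  if (!is.null(selected_files)) {
    not_found <- setdiff(selected_files, names(paths))
    if (length(not_found) > 0) {
      stop("Files not found: ", paste(not_found, collapse = ", "))
    }
    paths <- paths[selected_files]
  }
  lapply(paths, read.csv, stringsAsFactors = FALSE)
}

# Demo with two tiny made-up tables written to a temporary directory
demo_dir <- file.path(tempdir(), "demo_simulacrum")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(PATIENTID = 1:3, GENDER = c(1, 2, 2)),
          file.path(demo_dir, "sim_av_patient.csv"), row.names = FALSE)
write.csv(data.frame(TUMOURID = 11:13, PATIENTID = c(1, 1, 3)),
          file.path(demo_dir, "sim_av_tumour.csv"), row.names = FALSE)

tables <- load_csv_dir(demo_dir, selected_files = "sim_av_patient")
names(tables)
```

Returning a named list keeps each table addressable with `$`, which mirrors how the loaded data frames are accessed in this README.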
Access individual data frames as follows:
```{r}
SIM_AV_PATIENT <- data_frames_lists$sim_av_patient
SIM_AV_TUMOUR <- data_frames_lists$sim_av_tumour
```
Once data frames are loaded, you can start writing queries. It's recommended to keep queries simple and handle data management in R. Use the table_query_list function to access premade query templates. For example, to merge tables:
```{r}
query <- "SELECT *
FROM SIM_AV_PATIENT
INNER JOIN SIM_AV_TUMOUR ON SIM_AV_PATIENT.patientid = SIM_AV_TUMOUR.patientid;"
```
Execute queries with the `query_sql()` function:
```{r}
query_result <- query_sql(query)
```
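To try the join pattern without the Simulacrum files, here is a small sketch that calls `sqldf::sqldf()` directly on two made-up data frames. The toy tables and their values are assumptions for illustration only, and the sketch uses the sqldf package itself rather than `query_sql()`.

```r
library(sqldf)

# Made-up stand-ins for the Simulacrum tables (illustrative values only)
patients <- data.frame(patientid = c(1, 2, 3), gender = c(1, 2, 2))
tumours  <- data.frame(tumourid = c(11, 12, 13), patientid = c(1, 1, 3))

# sqldf copies the data frames into a temporary SQLite database, runs the
# query there, and returns the result as a data frame.
joined <- sqldf("
  SELECT p.patientid, p.gender, t.tumourid
  FROM patients AS p
  INNER JOIN tumours AS t ON p.patientid = t.patientid
  ORDER BY t.tumourid
")
joined
```

Patient 2 has no tumour record in the toy data, so the inner join drops that row, while patient 1 appears twice.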
## SQLite to Oracle Query Translation
To accommodate differences between SQLite and Oracle queries, use the sqlite2oracle() function:
```{r}
query2 <- "select *
from SIM_AV_PATIENT
where age > 50
limit 500;"
sqlite2oracle(query2)
```
Note: This function is built into `create_workflow()`.
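One difference such a translation has to handle is row limiting: SQLite uses `LIMIT n`, while Oracle (12c and later) uses `FETCH FIRST n ROWS ONLY`. The sketch below is a hypothetical, deliberately narrow helper named `limit_to_fetch_first()`. It is not the `sqlite2oracle()` implementation, and it only rewrites a trailing `LIMIT` clause.

```r
# Hypothetical helper: rewrite a trailing SQLite "LIMIT n" clause into the
# Oracle 12c+ "FETCH FIRST n ROWS ONLY" form. It does not handle OFFSET,
# subqueries, or any other dialect difference.
limit_to_fetch_first <- function(query) {
  sub("\\blimit\\s+(\\d+)\\s*(;?)\\s*$",
      "FETCH FIRST \\1 ROWS ONLY\\2",
      query, ignore.case = TRUE, perl = TRUE)
}

sqlite_query <- "select *
from SIM_AV_PATIENT
where age > 50
limit 500;"

oracle_query <- limit_to_fetch_first(sqlite_query)
cat(oracle_query)
```

A query without a trailing `LIMIT` clause is returned unchanged, so the helper is safe to apply to every query in a script.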
## Preprocessing Functions
simulacrumWorkflowR includes functions to simplify data preprocessing:
- `cancer_grouping()`
- `group_ethnicity()`
- `extended_summary()`
- `survival_days()`
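As an illustration of the kind of derivation such helpers perform, the sketch below computes the number of days between two date columns in base R. It is not the package's `survival_days()` code: the function name `days_between_dates()`, the column names `DIAGNOSISDATEBEST` and `VITALSTATUSDATE`, and the sample rows are all assumptions.

```r
# Hypothetical sketch (not the package's survival_days()): add a column with
# the number of days between a start date column and an end date column.
days_between_dates <- function(df, start_col, end_col) {
  start <- as.Date(df[[start_col]])
  end   <- as.Date(df[[end_col]])
  df$days_elapsed <- as.numeric(end - start)  # Date subtraction yields days
  df
}

# Made-up rows; the column names are assumed for this illustration
toy <- data.frame(
  PATIENTID         = c(1, 2),
  DIAGNOSISDATEBEST = c("2017-01-01", "2018-06-15"),
  VITALSTATUSDATE   = c("2017-01-31", "2019-06-15"),
  stringsAsFactors  = FALSE
)

result <- days_between_dates(toy, "DIAGNOSISDATEBEST", "VITALSTATUSDATE")
result$days_elapsed
```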
## Workflow Generation
When data management and analysis are complete, use the workflow generator function to produce an R script ready for submission to the NHS:
```{r}
create_workflow(
libraries = "library(dplyr)
library(simulacrumWorkflowR)",
query = "SELECT *
FROM sim_av_patient
INNER JOIN sim_av_tumour ON sim_av_patient.patientid = sim_av_tumour.patientid
limit 500;",
data_management = "
# Run query on SQLite database
data <- cancer_grouping(query_result)
# Additional preprocessing
modified_data <- survival_days(data)
",
analysis = "model = glm(AGE ~ STAGE_BEST + GRADE, data=modified_data)",
model_results = "html_table_model(model)")
```
This workflow automates the process, ensuring easy integration and preparation of your Simulacrum data.
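The general idea of assembling a script from code supplied as strings can be sketched in base R as follows. This is a hypothetical illustration, not how `create_workflow()` is implemented: the function `write_workflow_script()` and its section names are assumptions.

```r
# Hypothetical sketch (not create_workflow() itself): join code sections,
# supplied as strings, into one R script in the order they are given.
write_workflow_script <- function(sections, file) {
  blocks <- vapply(names(sections), function(name) {
    paste0("# ---- ", name, " ----\n", trimws(sections[[name]]))
  }, character(1))
  writeLines(paste(blocks, collapse = "\n\n"), file)
  invisible(file)
}

script_path <- tempfile(fileext = ".R")
write_workflow_script(
  list(
    libraries       = "library(stats)",
    data_management = "df <- data.frame(x = 1:10, y = (1:10) * 2)",
    analysis        = "model <- lm(y ~ x, data = df)"
  ),
  script_path
)

# The generated file is ordinary R code, so it can be checked with source()
env <- new.env()
source(script_path, local = env)
coef(env$model)
```

Sourcing the generated script in a fresh environment is a cheap way to confirm that it runs from top to bottom before it is sent anywhere.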
If an error occurs on the NHS servers while the analysis pipeline is running, the `time_management` function and base R `sink()` generate a detailed log that can be used for debugging.
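A minimal sketch of log capture with base R `sink()` is shown below. It only illustrates the mechanism and makes no claim about how `time_management` works internally; the simulated error and the log layout are made up for the demo.

```r
# Sketch: capture printed output and error messages from a step in a log file
log_path <- tempfile(fileext = ".log")
log_con <- file(log_path, open = "wt")
sink(log_con)                    # redirect standard output
sink(log_con, type = "message")  # redirect messages, warnings, and errors

started <- Sys.time()
outcome <- tryCatch({
  print("Starting analysis step")
  stop("simulated failure")      # made-up error for the demo
}, error = function(e) {
  message("Error: ", conditionMessage(e))
  "failed"
}, finally = {
  elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
  message("Elapsed seconds: ", round(elapsed, 2))
  sink(type = "message")         # restore messages first
  sink()                         # then restore standard output
  close(log_con)
})

readLines(log_path)
```

Restoring both sinks inside `finally` matters: without it, a failure would leave later console output silently redirected to the log file.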
# References
- Grothendieck G. (2017). sqldf: Manipulate R Data Frames Using SQL. Link: ggrothendieck/sqldf: Perform SQL Selects on R Data Frames
- Frayling L, Jose S. (2023). Simulacrum v2 User Guide. Health Data Insight. Link: Simulacrum-v2-User-Guide.pdf
- National Disease Registration Service (NDRS). (2023). Guide to using Simulacrum and Submitting code. Link: https://digital.nhs.uk/ndrs/data/data-outputs/cancer-publications-and-tools/simulacrum/simulacrum-user-guide/developing-code-using-simulacrum-for-a-data-release-request
Owner
- Name: Center for Clinical Data Science (CLINDA)
- Login: CLINDA-AAU
- Kind: organization
- Location: Aalborg, Denmark
- Website: https://clinda.aau.dk/
- Repositories: 17
- Profile: https://github.com/CLINDA-AAU
Center for Clinical Data Science - Aalborg University & Aalborg University Hospital
GitHub Events
Total
- Create event: 1
- Issues event: 1
- Release event: 1
- Watch event: 2
- Issue comment event: 34
- Public event: 1
- Push event: 75
- Pull request event: 1
- Fork event: 3
Last Year
- Create event: 1
- Issues event: 1
- Release event: 1
- Watch event: 2
- Issue comment event: 34
- Public event: 1
- Push event: 75
- Pull request event: 1
- Fork event: 3
Issues and Pull Requests
All Time
- Total issues: 2
- Total pull requests: 3
- Average time to close issues: about 2 months
- Average time to close pull requests: about 8 hours
- Total issue authors: 2
- Total pull request authors: 2
- Average comments per issue: 21.0
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 3
- Average time to close issues: about 2 months
- Average time to close pull requests: about 8 hours
- Issue authors: 2
- Pull request authors: 2
- Average comments per issue: 21.0
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- aghaynes (1)
- goldingn (1)
Pull Request Authors
- goldingn (2)
- DrEspresso (1)
Dependencies
.github/workflows/pdf_gen.yml
actions
- actions/checkout v4 composite
- actions/upload-artifact v4 composite
- openjournals/openjournals-draft-action master composite
DESCRIPTION
cran