synthetic-data-generator

SDG generates synthetic breast cancer patient data

https://github.com/sdm-tib/synthetic-data-generator

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.2%) to scientific vocabulary

Keywords

breast-cancer csv data-generator process-based rdf sql

Repository

Basic Info

Host: GitHub
Owner: SDM-TIB
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 474 KB

Statistics

Stars: 2
Watchers: 3
Forks: 0
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed over 2 years ago

Synthetic Data Generator (SDG)

The Synthetic Data Generator (SDG) creates process-based data. These data model the treatment of breast cancer patients following the distribution of values in a real breast cancer patient population.

Parameters

number of patients - the number of patients that will be modeled
mutation probability - the probability that any given datum deviates from the treatment guidelines

Note: A mutation probability of 0.0 creates 'clean' data that comply with the treatment guidelines.
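To make the role of the mutation probability concrete, here is a minimal sketch of per-datum mutation. It is not SDG's actual implementation, which the text above does not describe; the function `apply_mutations`, the `mutate` callback, and the example values are all hypothetical.

```python
import random


def apply_mutations(values, mutation_prob, mutate, rng=None):
    """Return a copy of `values` in which each datum is independently
    replaced by `mutate(datum)` with probability `mutation_prob`.

    Hypothetical helper for illustration; not part of SDG.
    """
    if not 0.0 <= mutation_prob <= 1.0:
        raise ValueError("mutation_prob must be between 0.0 and 1.0")
    rng = rng or random.Random()
    # rng.random() is in [0.0, 1.0), so a probability of 0.0 never mutates
    # and a probability of 1.0 always mutates.
    return [mutate(v) if rng.random() < mutation_prob else v for v in values]


# Made-up guideline-compliant values, used only for this illustration.
guideline_doses = [50.0, 50.0, 50.0, 50.0]

# With a mutation probability of 0.0 the data stay 'clean'.
clean = apply_mutations(guideline_doses, 0.0, lambda v: v * 2)
```

With a probability between 0.0 and 1.0, the expected share of deviating data in this sketch equals the probability itself.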

Output Data Formats

On execution, SDG creates the same data set in three different formats.

CSV
RDF
SQL (MySQL 8.1 dump)

Data Generation

Requirements:
* Docker

There are two options for generating the data: with or without docker-compose. After executing either option, your generated data set can be found in ./data.

Option 1: With docker-compose

If you want to use the docker-compose option, run the following commands:

    docker-compose up -d --build
    docker exec -it SDG bash -c "SDG -n {patients} -p {mutation_prob}"
    docker-compose down -v

where
* {patients} is a placeholder for the number of patients
* {mutation_prob} is a placeholder for the mutation probability

This option is recommended if several data sets will be generated. The SDG writes the resulting files under the same names on every execution, so do not forget to move your generated data before creating another data set!
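When several data sets are generated from a script, the `docker exec` call from option 1 can be assembled programmatically. The following is a rough sketch: `build_sdg_command` is a hypothetical helper, the requirement that the number of patients be a positive integer is an assumption, and the `-it` flags are dropped because a scripted run is assumed to have no interactive terminal attached.

```python
def build_sdg_command(patients, mutation_prob, container="SDG"):
    """Build the argument list for running SDG inside the running container.

    Hypothetical helper; it only mirrors the `docker exec` command shown above.
    """
    # Assumption: SDG expects a positive whole number of patients.
    if not isinstance(patients, int) or patients < 1:
        raise ValueError("patients must be a positive integer")
    if not 0.0 <= mutation_prob <= 1.0:
        raise ValueError("mutation_prob must be between 0.0 and 1.0")
    inner = f"SDG -n {patients} -p {mutation_prob}"
    # No -it here: scripted runs are assumed to have no TTY attached.
    return ["docker", "exec", container, "bash", "-c", inner]


command = build_sdg_command(100, 0.1)
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the container has been started with `docker-compose up -d --build`.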

Option 2: Without docker-compose

If you do not want to use docker-compose, you can also execute:

    bash ./generate.sh {patients} {mutation_prob}

where
* {patients} is a placeholder for the number of patients
* {mutation_prob} is a placeholder for the mutation probability

Similar to option 1, generate.sh will build the Docker image, start a Docker container, execute the SDG, and stop and remove the Docker container again. You can use this option if you do not have docker-compose installed or if you want to generate only one data set.

As in option 1, the SDG writes the resulting files under the same names on every execution, so do not forget to move your generated data before creating another data set!
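One way to avoid overwriting earlier results is to move the contents of ./data into a separate folder after each run. The sketch below assumes an `archive` directory next to `data` and a timestamp-based folder name; both, like the helper `archive_output` itself, are illustrative choices and not part of SDG.

```python
import shutil
from datetime import datetime
from pathlib import Path


def archive_output(data_dir="data", archive_root="archive", label=None):
    """Move everything inside `data_dir` into a new subfolder of `archive_root`.

    Hypothetical helper; directory names and the label scheme are assumptions.
    """
    src = Path(data_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"no such directory: {src}")
    # Default label: a timestamp, so that consecutive runs do not collide.
    label = label or datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = Path(archive_root) / label
    # exist_ok=False refuses to reuse a label, which protects earlier data sets.
    dest.mkdir(parents=True, exist_ok=False)
    for item in src.iterdir():
        shutil.move(str(item), str(dest / item.name))
    return dest
```

Calling `archive_output(label="100-patients-p0.1")` after a run would leave ./data empty for the next execution.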

Output Data Description

The SDG generates data that could be collected during the treatment process of breast cancer patients, including demographic, gynecologic, diagnostic, tumor-related, treatment, comorbidity, and family history data. To illustrate the output data, the following figure shows the Entity-Relationship diagram of the data generated when choosing the relational database as the output format. For readability, only the key attributes are included; the remaining attributes are described in the data dictionary below. The other output formats contain equivalent data in their respective formats.

Entity-Relationship diagram of the generated data

Data dictionary:
* patient
  * ehr: INTEGER
  * birthdate: DATE
  * diagnosisdate: DATE
  * ageatdiagnosis: INTEGER
  * firsttreatmentdate: DATE
  * surgerydate: DATE
  * deathdate: DATE / NULL (if the patient has not died)
  * ageatdeath: INTEGER / NULL (if the patient has not died)
  * recurrenceyear: INTEGER / NULL (if the patient has not relapsed)
  * neoadjuvant: yes / no
  * erpositive: 1 / 0
  * prpositive: 1 / 0
  * her2overallpositive: 1 / 0
  * ki67percentmaxsimp: INTEGER (ranging from 0 to 100)
  * menarcheage: INTEGER
  * menopauseage: INTEGER
  * pregnancy: INTEGER
  * abort: INTEGER
  * birth: INTEGER
  * caesarean: INTEGER
* tumortnm
  * ehr: INTEGER
  * ntumor: INTEGER
  * tprefixy: 0
  * tprefix: C / P
  * tcategory: IS / 0 / 1 / 2 / 3 / 4
  * nprefixy: 0
  * nprefix: C / P
  * ncategory: 0 / 1 / 2 / 3
  * nsubcategory: MI / NULL
  * mcategory: 0 / 1
  * tprefixyafterneoadj: 1
  * tprefixafterneoadj: C / P / NULL (if not neoadjuvant)
  * tcategoryafterneoadj: IS / 0 / 1 / 2 / 3 / 4 / NULL (if not neoadjuvant)
  * nprefixyafterneoadj: 1
  * nprefixafterneoadj: C / P / NULL (if not neoadjuvant)
  * ncategoryafterneoadj: 0 / 1 / 2 / 3 / NULL (if not neoadjuvant)
  * nsubcategoryafterneoadj: MI / NULL
  * mcategoryafterneoadj: 0 / 1 / NULL (if not neoadjuvant)
  * ntumortype: INTEGER
  * ntumorgrade: INTEGER
  * stagediagnosis: 0 / IA / IB / IIA / IIB / IIIA / IIIB / IIIC / IV
  * stageafterneo: 0 / IA / IB / IIA / IIB / IIIA / IIIB / IIIC / IV
* tumortype
  * ehr: INTEGER
  * ntumortype: INTEGER
  * ductal: 1 / 0
  * lobular: 1 / 0
  * insitu: 1 / 0
  * invasive: 1 / 0
  * associatedinsitu: 1 / 0
* tumorgrade
  * ehr: INTEGER
  * ntumorgrade: INTEGER
  * grade: 1 / 2 / 3
* drug
  * iddrug: INTEGER
  * name: STRING
* chemoterapyschema
  * idschema: INTEGER
  * name: STRING
* drugchemoterapyschema
  * idschema: INTEGER
  * iddrug: INTEGER
* chemoterapycycle
  * ehr: INTEGER
  * idschema: INTEGER
  * date: DATE
  * cyclenumber: INTEGER
* surgery
  * ehr: INTEGER
  * surgery: STRING
  * nsurgery: INTEGER
  * dateyear: INTEGER
  * datemonth: INTEGER
  * dateday: INTEGER
* radiotherapy
  * ehr: INTEGER
  * datestart: DATE
  * dateend: DATE
  * nradiotherapy: INTEGER
  * dosegy: FLOAT
* comorbidity
  * id: INTEGER
  * ehr: INTEGER
  * comorbidity: STRING
  * negated: 0 / 1
* oraldrug
  * ehr: INTEGER
  * drug: STRING
* oraldrugtype
  * drug: STRING
  * drugtype: STRING
* familyhistory
  * ehr: INTEGER
  * cancercui: STRING
* cui_description
  * cui: STRING
  * description: STRING
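The value ranges in the data dictionary can serve as simple sanity checks on a generated patient record. The sketch below is one possible check, assuming the record has already been loaded into a dict with typed values (integers, not CSV strings). The function `check_patient` and the sample values are hypothetical, and the rule that deathdate and ageatdeath are NULL together is inferred from the dictionary rather than stated by SDG.

```python
def check_patient(row):
    """Return a list of problems found in one patient record.

    `row` is a dict keyed by the column names of the patient table.
    Hypothetical helper; the checks only mirror the data dictionary.
    """
    problems = []
    for col in ("erpositive", "prpositive", "her2overallpositive"):
        if row.get(col) not in (0, 1):
            problems.append(f"{col} must be 1 or 0")
    if row.get("neoadjuvant") not in ("yes", "no"):
        problems.append("neoadjuvant must be 'yes' or 'no'")
    ki67 = row.get("ki67percentmaxsimp")
    if not isinstance(ki67, int) or not 0 <= ki67 <= 100:
        problems.append("ki67percentmaxsimp must be an integer from 0 to 100")
    # Inferred rule: both fields are NULL exactly when the patient has not died.
    if (row.get("deathdate") is None) != (row.get("ageatdeath") is None):
        problems.append("deathdate and ageatdeath must both be set or both be NULL")
    return problems


# Made-up record for illustration only; not actual SDG output.
sample = {
    "ehr": 1,
    "neoadjuvant": "no",
    "erpositive": 1,
    "prpositive": 0,
    "her2overallpositive": 0,
    "ki67percentmaxsimp": 20,
    "deathdate": None,
    "ageatdeath": None,
}
problems = check_patient(sample)
```

Whether data generated with a mutation probability above 0.0 stay within these ranges is not stated in the description, so a non-empty result does not necessarily indicate a bug.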

Owner

Name: Scientific Data Management Group
Login: SDM-TIB
Kind: organization
Email: Philipp.Rohde@tib.eu
Location: Hannover, Germany

Website: https://www.tib.eu/en/research-development/scientific-data-management/
Repositories: 66
Profile: https://github.com/SDM-TIB

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use data generated with SDG in your work, please cite it using the following metadata."
authors:
- family-names: "Diaz-Honrubia"
  given-names: "Antonio Jesus"
  orcid: "https://orcid.org/0000-0001-5464-0714"
- family-names: "Rohde"
  given-names: "Philipp D."
  orcid: "https://orcid.org/0000-0002-9835-4354"
title: "Synthetic Data Generator"
url: "https://github.com/SDM-TIB/Synthetic-Data-Generator"

Dependencies

Dockerfile docker

mysql 8.1.0 build

docker-compose.yml docker

sdmtib/sdg latest

requirements.txt pypi

mysql-connector-python ==8.2.0
numpy ==1.26.1
pandas ==2.1.2
rdfizer ==4.7.2.7
