synthetic-data-generator

SDG generates synthetic breast cancer patient data

https://github.com/sdm-tib/synthetic-data-generator

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.2%) to scientific vocabulary

Keywords

breast-cancer csv data-generator process-based rdf sql

Repository

Basic Info
  • Host: GitHub
  • Owner: SDM-TIB
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 474 KB
Statistics
  • Stars: 2
  • Watchers: 3
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
breast-cancer csv data-generator process-based rdf sql
Created over 2 years ago · Last pushed over 2 years ago
Metadata Files
Readme License Citation

README.md

License: MIT

Synthetic Data Generator (SDG)

The Synthetic Data Generator (SDG) creates process-based data. These data model the treatment of breast cancer patients following the distribution of values in a real breast cancer patient population.

Parameters

  • number of patients - the number of patients that will be modeled
  • mutation probability - the probability that each datum deviates from the treatment guidelines

Note: A mutation probability of 0.0 creates 'clean' data that complies with the treatment guidelines.
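To illustrate how a per-datum mutation probability can behave, here is a minimal sketch. It is not SDG's implementation: the helper name `maybe_mutate`, the example treatment values, and the seeded `random.Random` are assumptions made purely for illustration.

```python
import random


def maybe_mutate(guideline_value, alternatives, mutation_prob, rng):
    """Return guideline_value, or a deviating value with probability mutation_prob.

    Hypothetical helper: SDG's actual mutation logic may differ.
    """
    if not 0.0 <= mutation_prob <= 1.0:
        raise ValueError("mutation_prob must be between 0.0 and 1.0")
    deviating = [v for v in alternatives if v != guideline_value]
    # rng.random() is in [0.0, 1.0), so a probability of 0.0 never mutates
    # and a probability of 1.0 always mutates (if a deviating value exists).
    if deviating and rng.random() < mutation_prob:
        return rng.choice(deviating)
    return guideline_value


rng = random.Random(42)
treatments = ["surgery", "chemotherapy", "radiotherapy"]  # illustrative values only
clean = [maybe_mutate("surgery", treatments, 0.0, rng) for _ in range(1000)]
noisy = [maybe_mutate("surgery", treatments, 0.3, rng) for _ in range(1000)]
print(sum(v != "surgery" for v in clean))  # 0: no datum deviates
print(sum(v != "surgery" for v in noisy))  # roughly 30% of the 1000 values deviate
```

With a probability of 0.0 every value follows the guideline, which matches the 'clean' data described in the note; higher probabilities make deviations proportionally more frequent.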

Output Data Formats

On execution, SDG creates the same data set in three different formats.

  • CSV
  • RDF
  • SQL (MySQL 8.1 dump)

Data Generation

Requirements:

  • Docker

There are two options for generating the data: with docker-compose or with the generate.sh script. With either option, the generated data set is written to ./data.

Option 1: With docker-compose

If you want to use the docker-compose option, run the following commands:

```bash
docker-compose up -d --build
docker exec -it SDG bash -c "SDG -n {patients} -p {mutation_prob}"
docker-compose down -v
```

where

  • {patients} is a placeholder for the number of patients
  • {mutation_prob} is a placeholder for the mutation probability

This option is recommended if several data sets will be generated. SDG writes the resulting files under the same names on every execution, so move your generated data before creating another data set!

Option 2: Without docker-compose

If you do not want to use docker-compose, you can also execute:

```bash
./generate.sh {patients} {mutation_prob}
```

where

  • {patients} is a placeholder for the number of patients
  • {mutation_prob} is a placeholder for the mutation probability
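If the script is called from Python, for example to generate several data sets in a loop, the two placeholders can be filled in and checked before Docker is started. The following is a minimal sketch; the function names are hypothetical, and the accepted ranges (a positive integer and a probability between 0.0 and 1.0) are assumptions, since the README does not state them.

```python
import subprocess


def build_generate_command(patients, mutation_prob, script="./generate.sh"):
    """Build the argument list for generate.sh after basic validation."""
    # Assumed valid ranges; the README does not document them explicitly.
    if not isinstance(patients, int) or patients < 1:
        raise ValueError("patients must be a positive integer")
    if not 0.0 <= mutation_prob <= 1.0:
        raise ValueError("mutation_prob must be between 0.0 and 1.0")
    return ["bash", script, str(patients), str(mutation_prob)]


def run_generator(patients, mutation_prob):
    """Run generate.sh from the repository root; raises if the script fails."""
    subprocess.run(build_generate_command(patients, mutation_prob), check=True)


print(build_generate_command(1000, 0.05))
# ['bash', './generate.sh', '1000', '0.05']
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` turns a non-zero exit status into an exception.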

Similar to option 1, generate.sh will build the Docker image, start a Docker container, execute the SDG, and stop and remove the Docker container again. You can use this option if you do not have docker-compose installed or if you want to generate only one data set.

As in option 1, SDG writes the resulting files under the same names on every execution, so move your generated data before creating another data set!
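Moving the generated files out of ./data between runs can be scripted. The sketch below assumes nothing about the output file names and simply moves every entry of the data directory into a labeled archive folder; the function name, the archive layout, and the label format are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def archive_data(data_dir, archive_root, label):
    """Move every entry in data_dir into archive_root/label and return that path."""
    data_dir = Path(data_dir)
    target = Path(archive_root) / label
    # exist_ok=False: refuse to overwrite a previously archived data set.
    target.mkdir(parents=True, exist_ok=False)
    for entry in sorted(data_dir.iterdir()):
        shutil.move(str(entry), str(target / entry.name))
    return target


# Demonstration on a temporary directory with a placeholder file name.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    (data / "example.csv").write_text("ehr\n1\n")
    archived = archive_data(data, Path(tmp) / "archive", "n1000_p0.05")
    print(sorted(p.name for p in archived.iterdir()))  # ['example.csv']
    print(list(data.iterdir()))                        # []
```

In a real run, `archive_data("./data", "./archive", "n1000_p0.05")` would be called after each execution and before the next one starts.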

Output Data Description

The SDG generates data that could be collected during the treatment process of breast cancer patients, including demographic, gynecologic, diagnostic, tumor-related, treatment, comorbidity, and family history data. To illustrate the output data, the following figure shows the Entity-Relationship diagram of the data generated when choosing the relational database as the output format. For readability, only the key attributes are included; the remaining attributes are described in the data dictionary below. The other output formats contain equivalent data in their respective representations.

Entity-Relationship diagram of the generated data

Data dictionary:

  • patient
      • ehr: INTEGER
      • birthdate: DATE
      • diagnosisdate: DATE
      • ageatdiagnosis: INTEGER
      • firsttreatmentdate: DATE
      • surgerydate: DATE
      • deathdate: DATE / NULL (if the patient has not died)
      • ageatdeath: INTEGER / NULL (if the patient has not died)
      • recurrenceyear: INTEGER / NULL (if the patient has not relapsed)
      • neoadjuvant: yes / no
      • erpositive: 1 / 0
      • prpositive: 1 / 0
      • her2overallpositive: 1 / 0
      • ki67percentmaxsimp: INTEGER (ranging from 0 to 100)
      • menarcheage: INTEGER
      • menopauseage: INTEGER
      • pregnancy: INTEGER
      • abort: INTEGER
      • birth: INTEGER
      • caesarean: INTEGER
  • tumortnm
      • ehr: INTEGER
      • ntumor: INTEGER
      • tprefixy: 0
      • tprefix: C / P
      • tcategory: IS / 0 / 1 / 2 / 3 / 4
      • nprefixy: 0
      • nprefix: C / P
      • ncategory: 0 / 1 / 2 / 3
      • nsubcategory: MI / NULL
      • mcategory: 0 / 1
      • tprefixyafterneoadj: 1
      • tprefixafterneoadj: C / P / NULL (if not neoadjuvant)
      • tcategoryafterneoadj: IS / 0 / 1 / 2 / 3 / 4 / NULL (if not neoadjuvant)
      • nprefixyafterneoadj: 1
      • nprefixafterneoadj: C / P / NULL (if not neoadjuvant)
      • ncategoryafterneoadj: 0 / 1 / 2 / 3 / NULL (if not neoadjuvant)
      • nsubcategoryafterneoadj: MI / NULL
      • mcategoryafterneoadj: 0 / 1 / NULL (if not neoadjuvant)
      • ntumortype: INTEGER
      • ntumorgrade: INTEGER
      • stagediagnosis: 0 / IA / IB / IIA / IIB / IIIA / IIIB / IIIC / IV
      • stageafterneo: 0 / IA / IB / IIA / IIB / IIIA / IIIB / IIIC / IV
  • tumortype
      • ehr: INTEGER
      • ntumortype: INTEGER
      • ductal: 1 / 0
      • lobular: 1 / 0
      • insitu: 1 / 0
      • invasive: 1 / 0
      • associatedinsitu: 1 / 0
  • tumorgrade
      • ehr: INTEGER
      • ntumorgrade: INTEGER
      • grade: 1 / 2 / 3
  • drug
      • iddrug: INTEGER
      • name: STRING
  • chemoterapyschema
      • idschema: INTEGER
      • name: STRING
  • drugchemoterapyschema
      • idschema: INTEGER
      • iddrug: INTEGER
  • chemoterapycycle
      • ehr: INTEGER
      • idschema: INTEGER
      • date: DATE
      • cyclenumber: INTEGER
  • surgery
      • ehr: INTEGER
      • surgery: STRING
      • nsurgery: INTEGER
      • dateyear: INTEGER
      • datemonth: INTEGER
      • dateday: INTEGER
  • radiotherapy
      • ehr: INTEGER
      • datestart: DATE
      • dateend: DATE
      • nradiotherapy: INTEGER
      • dosegy: FLOAT
  • comorbidity
      • id: INTEGER
      • ehr: INTEGER
      • comorbidity: STRING
      • negated: 0 / 1
  • oraldrug
      • ehr: INTEGER
      • drug: STRING
  • oraldrugtype
      • drug: STRING
      • drugtype: STRING
  • familyhistory
      • ehr: INTEGER
      • cancercui: STRING
  • cui_description
      • cui: STRING
      • description: STRING
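The enumerated values in the data dictionary can serve as a sanity check on generated rows. The sketch below validates a few tumortnm fields against the value sets listed above. It assumes each row is available as a dictionary of strings keyed by the attribute names (for example, as produced by `csv.DictReader`); the helper name `invalid_fields` and the example row are hypothetical.

```python
# Allowed values copied from the tumortnm entries of the data dictionary.
ALLOWED = {
    "tprefix": {"C", "P"},
    "tcategory": {"IS", "0", "1", "2", "3", "4"},
    "nprefix": {"C", "P"},
    "ncategory": {"0", "1", "2", "3"},
    "mcategory": {"0", "1"},
    "stagediagnosis": {"0", "IA", "IB", "IIA", "IIB", "IIIA", "IIIB", "IIIC", "IV"},
}


def invalid_fields(row):
    """Return the names of fields whose value is missing or not in the data dictionary."""
    return [name for name, allowed in ALLOWED.items()
            if str(row.get(name)) not in allowed]


# Hypothetical example row; real rows come from the generated data set.
row = {"ehr": "1", "ntumor": "1", "tprefix": "C", "tcategory": "2",
       "nprefix": "C", "ncategory": "1", "mcategory": "0",
       "stagediagnosis": "IIB"}
print(invalid_fields(row))  # []

bad = dict(row, tcategory="5", stagediagnosis="V")
print(invalid_fields(bad))  # ['tcategory', 'stagediagnosis']
```

Only fields without a NULL option are checked here; the "afterneoadj" attributes would additionally need whatever NULL representation the chosen output format uses.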

Owner

  • Name: Scientific Data Management Group
  • Login: SDM-TIB
  • Kind: organization
  • Email: Philipp.Rohde@tib.eu
  • Location: Hannover, Germany

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use data generated with SDG in your work, please cite it using the following metadata."
authors:
- family-names: "Diaz-Honrubia"
  given-names: "Antonio Jesus"
  orcid: "https://orcid.org/0000-0001-5464-0714"
- family-names: "Rohde"
  given-names: "Philipp D."
  orcid: "https://orcid.org/0000-0002-9835-4354"
title: "Synthetic Data Generator"
url: "https://github.com/SDM-TIB/Synthetic-Data-Generator"

Dependencies

Dockerfile docker
  • mysql 8.1.0 build
docker-compose.yml docker
  • sdmtib/sdg latest
requirements.txt pypi
  • mysql-connector-python ==8.2.0
  • numpy ==1.26.1
  • pandas ==2.1.2
  • rdfizer ==4.7.2.7