synthetic-data-generator
SDG generates synthetic breast cancer patient data
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.2%) to scientific vocabulary
Repository
Statistics
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Synthetic Data Generator (SDG)
The Synthetic Data Generator (SDG) creates process-based data. These data model the treatment of breast cancer patients following the distribution of values in a real breast cancer patient population.
Parameters
- number of patients - the number of patients that will be modeled
- mutation probability - the mutation probability describes how likely it is for each datum to deviate from the treatment guidelines.
[!NOTE] A mutation probability of 0.0 creates 'clean' data which complies with the treatment guidelines.
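To make the effect of the mutation probability concrete, here is a minimal sketch. It is not SDG's actual implementation. It assumes that each datum is independently replaced by a deviating value with probability `p`; the function name `mutate_values` and the sample values are hypothetical.

```python
import random


def mutate_values(values, alternatives, p, rng=None):
    """Return a copy of values where each datum is replaced with probability p.

    Hypothetical illustration only: SDG's real mutation logic may differ.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("mutation probability must be between 0.0 and 1.0")
    rng = rng or random.Random()
    mutated = []
    for value in values:
        if rng.random() < p:
            # Pick a deviating value that differs from the compliant one.
            choices = [a for a in alternatives if a != value]
            mutated.append(rng.choice(choices) if choices else value)
        else:
            mutated.append(value)
    return mutated


compliant = ["yes", "no", "yes", "yes"]  # made-up guideline-compliant values
# With p = 0.0 nothing is mutated, so the list is printed unchanged.
print(mutate_values(compliant, ["yes", "no"], p=0.0))
```

With `p=1.0` every value in this sketch is replaced, and values in between mutate a proportional share of the data on average.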
Output Data Formats
On execution, SDG creates the same data set in three different formats.
- CSV
- RDF
- SQL (MySQL 8.1 dump)
Data Generation
Requirements:
* Docker
There are two options for generating the data: with docker-compose (option 1) or with the generate.sh script (option 2).
After executing either of the available options, your generated data set can be found in ./data.
Option 1: With docker-compose
If you want to use the docker-compose option, run the following commands:
```bash
docker-compose up -d --build
docker exec -it SDG bash -c "SDG -n {patients} -p {mutation_prob}"
docker-compose down -v
```
where
* {patients} is a placeholder for the number of patients
* {mutation_prob} is a placeholder for the mutation probability
This option is recommended if several data sets will be generated. SDG writes the resulting files under the same names on every execution, so do not forget to move your generated data out of ./data before creating another data set!
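Since every run writes to the same file names, moving the output can be automated. The following is a minimal sketch, not part of SDG. The function name `archive_dataset`, the `./datasets` target directory, and the labels are hypothetical. It also assumes the current user is allowed to move the files in ./data.

```python
import shutil
from pathlib import Path


def archive_dataset(label, data_dir="./data", archive_root="./datasets"):
    """Move every generated file from data_dir into archive_root/label."""
    source = Path(data_dir)
    target = Path(archive_root) / label
    # Fail if the label was already used, so earlier data sets are never overwritten.
    target.mkdir(parents=True, exist_ok=False)
    moved = []
    for item in sorted(source.iterdir()):
        shutil.move(str(item), str(target / item.name))
        moved.append(item.name)
    return moved
```

For example, calling `archive_dataset("n100_p0.1")` after a run with 100 patients and a mutation probability of 0.1 would leave ./data empty for the next run.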
Option 2: Without docker-compose
If you do not want to use docker-compose, you can also execute:
```bash
./generate.sh {patients} {mutation_prob}
```
where
* {patients} is a placeholder for the number of patients
* {mutation_prob} is a placeholder for the mutation probability
Similar to option 1, generate.sh will build the Docker image, start a Docker container, execute the SDG, and stop and remove the Docker container again.
You can use this option if you do not have docker-compose installed or if you want to generate only one data set.
As in option 1, SDG writes the resulting files under the same names on every execution, so do not forget to move your generated data before creating another data set!
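If you call generate.sh from a script, it can help to check both parameters first. The sketch below is a hypothetical wrapper, not part of SDG. It assumes that the number of patients must be a positive integer and that the mutation probability lies between 0.0 and 1.0; the function names are illustrative.

```python
import subprocess


def build_generate_command(patients, mutation_prob, script="./generate.sh"):
    """Validate both parameters and return the argument list for generate.sh."""
    if isinstance(patients, bool) or not isinstance(patients, int) or patients < 1:
        raise ValueError("patients must be a positive integer")
    if not 0.0 <= mutation_prob <= 1.0:
        raise ValueError("mutation_prob must be between 0.0 and 1.0")
    return [script, str(patients), str(mutation_prob)]


def generate(patients, mutation_prob):
    """Run generate.sh and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_generate_command(patients, mutation_prob), check=True)
```

Keeping the validation separate from the `subprocess.run` call makes it possible to check the command without starting Docker.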
Output Data Description
SDG generates data that could be collected during the treatment process of breast cancer patients, including demographic, gynecologic, diagnostic, tumor-related, treatment, comorbidity, and family history data. To illustrate the output data, the following figure shows the Entity-Relationship diagram of the data generated for the relational database output format. For readability, only the key attributes are included; the remaining attributes are described in the data dictionary below. The other output formats contain equivalent data in their respective formats.

Data dictionary:
* patient
  * ehr: INTEGER
  * birthdate: DATE
  * diagnosisdate: DATE
  * ageatdiagnosis: INTEGER
  * firsttreatmentdate: DATE
  * surgerydate: DATE
  * deathdate: DATE / NULL (if the patient has not died)
  * ageatdeath: INTEGER / NULL (if the patient has not died)
  * recurrenceyear: INTEGER / NULL (if the patient has not relapsed)
  * neoadjuvant: yes / no
  * erpositive: 1 / 0
  * prpositive: 1 / 0
  * her2overallpositive: 1 / 0
  * ki67percentmaxsimp: INTEGER (ranging from 0 to 100)
  * menarcheage: INTEGER
  * menopauseage: INTEGER
  * pregnancy: INTEGER
  * abort: INTEGER
  * birth: INTEGER
  * caesarean: INTEGER
* tumortnm
  * ehr: INTEGER
  * ntumor: INTEGER
  * tprefixy: 0
  * tprefix: C / P
  * tcategory: IS / 0 / 1 / 2 / 3 / 4
  * nprefixy: 0
  * nprefix: C / P
  * ncategory: 0 / 1 / 2 / 3
  * nsubcategory: MI / NULL
  * mcategory: 0 / 1
  * tprefixyafterneoadj: 1
  * tprefixafterneoadj: C / P / NULL (if not neoadjuvant)
  * tcategoryafterneoadj: IS / 0 / 1 / 2 / 3 / 4 / NULL (if not neoadjuvant)
  * nprefixyafterneoadj: 1
  * nprefixafterneoadj: C / P / NULL (if not neoadjuvant)
  * ncategoryafterneoadj: 0 / 1 / 2 / 3 / NULL (if not neoadjuvant)
  * nsubcategoryafterneoadj: MI / NULL
  * mcategoryafterneoadj: 0 / 1 / NULL (if not neoadjuvant)
  * ntumortype: INTEGER
  * ntumorgrade: INTEGER
  * stagediagnosis: 0 / IA / IB / IIA / IIB / IIIA / IIIB / IIIC / IV
  * stageafterneo: 0 / IA / IB / IIA / IIB / IIIA / IIIB / IIIC / IV
* tumortype
  * ehr: INTEGER
  * ntumortype: INTEGER
  * ductal: 1 / 0
  * lobular: 1 / 0
  * insitu: 1 / 0
  * invasive: 1 / 0
  * associatedinsitu: 1 / 0
* tumorgrade
  * ehr: INTEGER
  * ntumorgrade: INTEGER
  * grade: 1 / 2 / 3
* drug
  * iddrug: INTEGER
  * name: STRING
* chemoterapyschema
  * idschema: INTEGER
  * name: STRING
* drugchemoterapyschema
  * idschema: INTEGER
  * iddrug: INTEGER
* chemoterapycycle
  * ehr: INTEGER
  * idschema: INTEGER
  * date: DATE
  * cyclenumber: INTEGER
* surgery
  * ehr: INTEGER
  * surgery: STRING
  * nsurgery: INTEGER
  * dateyear: INTEGER
  * datemonth: INTEGER
  * dateday: INTEGER
* radiotherapy
  * ehr: INTEGER
  * datestart: DATE
  * dateend: DATE
  * nradiotherapy: INTEGER
  * dosegy: FLOAT
* comorbidity
  * id: INTEGER
  * ehr: INTEGER
  * comorbidity: STRING
  * negated: 0 / 1
* oraldrug
  * ehr: INTEGER
  * drug: STRING
* oraldrugtype
  * drug: STRING
  * drugtype: STRING
* familyhistory
  * ehr: INTEGER
  * cancercui: STRING
* cui_description
  * cui: STRING
  * description: STRING
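The value domains in the data dictionary can be used to sanity-check a generated file. The following is a minimal sketch that uses only the Python standard library. It assumes that the CSV column names match the attribute names above and that values are stored as the literal strings listed in the dictionary. The sample rows are made up, and `find_domain_violations` is a hypothetical helper.

```python
import csv
import io

# Allowed values copied from the data dictionary for a few tumortnm attributes.
TUMORTNM_DOMAINS = {
    "tprefix": {"C", "P"},
    "tcategory": {"IS", "0", "1", "2", "3", "4"},
    "nprefix": {"C", "P"},
    "ncategory": {"0", "1", "2", "3"},
    "mcategory": {"0", "1"},
}


def find_domain_violations(rows, domains):
    """Return (row_index, column, value) for every value outside its domain."""
    violations = []
    for index, row in enumerate(rows):
        for column, allowed in domains.items():
            value = row.get(column)
            if value not in allowed:
                violations.append((index, column, value))
    return violations


# Made-up sample standing in for a generated CSV file.
sample = io.StringIO(
    "ehr,ntumor,tprefix,tcategory,nprefix,ncategory,mcategory\n"
    "1,1,C,2,C,0,0\n"
    "2,1,P,5,C,1,0\n"
)
print(find_domain_violations(csv.DictReader(sample), TUMORTNM_DOMAINS))
# → [(1, 'tcategory', '5')]
```

To check a real file, pass `csv.DictReader(open(path, newline=""))` instead of the in-memory sample, and adjust the column names if the generated CSV uses different headers.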
Owner
- Name: Scientific Data Management Group
- Login: SDM-TIB
- Kind: organization
- Email: Philipp.Rohde@tib.eu
- Location: Hannover, Germany
- Website: https://www.tib.eu/en/research-development/scientific-data-management/
- Repositories: 66
- Profile: https://github.com/SDM-TIB
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use data generated with SDG in your work, please cite it using the following metadata."
authors:
  - family-names: "Diaz-Honrubia"
    given-names: "Antonio Jesus"
    orcid: "https://orcid.org/0000-0001-5464-0714"
  - family-names: "Rohde"
    given-names: "Philipp D."
    orcid: "https://orcid.org/0000-0002-9835-4354"
title: "Synthetic Data Generator"
url: "https://github.com/SDM-TIB/Synthetic-Data-Generator"
Dependencies
- mysql 8.1.0 build
- sdmtib/sdg latest
- mysql-connector-python ==8.2.0
- numpy ==1.26.1
- pandas ==2.1.2
- rdfizer ==4.7.2.7