syndata

SYNDATA software includes a suite of statistical/machine learning models to generate discrete/categorical synthetic data.

https://github.com/llnl/syndata

Science Score: 65.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
✓ Institutional organization owner: Organization llnl has institutional domain (software.llnl.gov)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.7%) to scientific vocabulary

Keywords

clinical-research machine-learning statistics synthetic-data-generation

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: LLNL
License: bsd-3-clause
Language: Python
Default Branch: main
Homepage:
Size: 34.2 KB

Statistics

Stars: 1
Watchers: 5
Forks: 1
Open Issues: 0
Releases: 0

Created over 4 years ago · Last pushed over 4 years ago

Metadata Files

Readme Contributing License Code of conduct Citation

Synthetic Data Generation with Machine Learning (SYNDATA)

SYNDATA includes a suite of statistical and machine learning models for generating discrete/categorical synthetic data. To train a model, the user provides the input data from which the model parameters are inferred. Once trained, a model can be used to generate entirely synthetic data. In addition to the models themselves, SYNDATA includes code to process data, evaluate results (based on cross-validation), and create a PDF report.
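
The train-then-generate workflow described above can be illustrated with a deliberately simple baseline: an independent-marginals model, which estimates the frequency of each value in every categorical column and then samples each column independently. This is only a minimal sketch and not SYNDATA's API; the class name `IndependentMarginals`, its `fit` and `sample` methods, and the toy data are hypothetical, and the code uses only the Python standard library.

```python
import random
from collections import Counter


class IndependentMarginals:
    """Toy categorical generator (hypothetical; not part of SYNDATA).

    Learns the empirical frequency of every value in every column and
    samples each column independently, ignoring cross-column structure.
    """

    def fit(self, rows):
        # rows: list of equal-length tuples of categorical values
        if not rows:
            raise ValueError("need at least one training row")
        n_cols = len(rows[0])
        self.marginals_ = []
        for j in range(n_cols):
            counts = Counter(row[j] for row in rows)
            values = list(counts)
            # Empirical probability of each observed value in column j.
            weights = [counts[v] / len(rows) for v in values]
            self.marginals_.append((values, weights))
        return self

    def sample(self, n, seed=None):
        rng = random.Random(seed)
        # Draw n values per column from its estimated marginal.
        columns = [rng.choices(values, weights=weights, k=n)
                   for values, weights in self.marginals_]
        # Transpose column-wise samples back into rows.
        return list(zip(*columns))


# Made-up training data: two categorical columns.
real = [("yes", "low"), ("no", "high"), ("yes", "high"), ("yes", "low")]
model = IndependentMarginals().fit(real)
synthetic = model.sample(5, seed=0)
```

Because each column is sampled on its own, this baseline preserves per-column frequencies but not correlations between columns, which is the kind of gap that richer models aim to close.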

For details on the methods implemented and the metrics used to evaluate synthetic data generation models, see our paper: Generation and evaluation of synthetic patient data.

Installation

This software suite runs on specific versions of Python and its libraries. We recommend creating a Python virtual environment and installing all dependencies from the requirements.txt file. To create an environment and install the correct versions of the packages, first run:

python3 -m venv datagen_env

then activate the environment:

source datagen_env/bin/activate

finally, install all dependencies:

python -m pip install -r requirements.txt

Done. You can now start running your experiments.
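
Before running experiments, it can help to confirm that the environment is actually active, since installing into the system Python is an easy mistake. The check below is a small sketch using only the standard library; the function name `in_virtualenv` is hypothetical and not part of SYNDATA.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    status = "active" if in_virtualenv() else "NOT active"
    print(f"virtual environment is {status}: {sys.prefix}")
```

When datagen_env is active, the printed prefix should point inside the datagen_env directory.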

Quick Start

A demo file is available in the experiments/ folder. It runs an experiment with UCI's Breast Cancer data. You can build on this file to create new experiments.

python demo.py

A folder with logs and a PDF report will be created in the outputs/ folder; check it after running your experiment. The demo.py script may take a few minutes to complete. We recommend using a GPU-equipped computer for faster execution.
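
To locate the generated report after a run, a short helper can list the PDF files under the output folder, newest first. This is a sketch under assumptions: the internal layout of outputs/ is not documented here, so the code simply searches recursively for files ending in `.pdf`, and the function name `find_reports` is hypothetical.

```python
from pathlib import Path


def find_reports(output_dir="outputs"):
    """Return PDF files under output_dir, most recently modified first."""
    root = Path(output_dir)
    if not root.is_dir():
        # Nothing to list if the folder has not been created yet.
        return []
    pdfs = [p for p in root.rglob("*.pdf") if p.is_file()]
    return sorted(pdfs, key=lambda p: p.stat().st_mtime, reverse=True)


for report in find_reports():
    print(report)
```

Run it from the directory that contains outputs/, or pass the folder path explicitly.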

Authors:

Andre Goncalves (LLNL)
Rui Meng (LLNL)
Braden Soper (LLNL)
Priyadip Ray (LLNL)
Ana Paula Sales (LLNL)

Code Release

LLNL-CODE-831774

Owner

Name: Lawrence Livermore National Laboratory
Login: LLNL
Kind: organization
Email: github-admin@llnl.gov
Location: Livermore, CA, USA

Website: https://software.llnl.gov
Twitter: LLNL_OpenSource
Repositories: 520
Profile: https://github.com/LLNL

For over 70 years, the Lawrence Livermore National Laboratory has applied science and technology to make the world a safer place.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite our paper below."
authors:
  - family-names: "Goncalves"
    given-names: "Andre"
    orcid: "https://orcid.org/0000-0002-0320-280X"
  - family-names: "Meng"
    given-names: "Rui"
    orcid: "https://orcid.org/0000-0000-0000-0000"
  - family-names: "Ray"
    given-names: "Priyadip"
    orcid: "https://orcid.org/0000-0000-0000-0000"
  - family-names: "Soper"
    given-names: "Braden"
    orcid: "https://orcid.org/0000-0000-0000-0000"
  - family-names: "Sales"
    given-names: "Ana Paula"
    orcid: "https://orcid.org/0000-0000-0000-0000"
title: "SYNDATA Software"
version: v.1.0
doi: None
date-released: 2022-02-15
url: "https://github.com/LLNL/SYNDATA"
preferred-citation:
  type: article
  authors:
  - family-names: "Goncalves"
    given-names: "Andre"
    orcid: "https://orcid.org/0000-0000-0000-0000"
  - family-names: "Ray"
    given-names: "Priyadip"
    orcid: "https://orcid.org/0000-0000-0000-0000"
  - family-names: "Soper"
    given-names: "Braden"
    orcid: "https://orcid.org/0000-0000-0000-0000"
  - family-names: "Stevens"
    given-names: "Jennifer"
    orcid: "https://orcid.org/0000-0000-0000-0000"
  - family-names: "Coyle"
    given-names: "Linda"
    orcid: "https://orcid.org/0000-0000-0000-0000"
  - family-names: "Sales"
    given-names: "Ana Paula"
    orcid: "https://orcid.org/0000-0000-0000-0000"
  doi: "10.1186/s12874-020-00977-1"
  journal: "BMC Medical Research Methodology"
  month: 5
  start: 1 # First page number
  end: 40 # Last page number
  title: "Generation and evaluation of synthetic patient data"
  number: 1
  volume: 20
  year: 2020

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
