Synthia

Synthia: multidimensional synthetic data generation in Python - Published in JOSS (2021)

https://github.com/dmey/synthia

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 24 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

augmentation climate copula data-augmentation data-generation data-generator data-modelling data-science dependency-analysis dependency-modeling finance fpca functional-data machine-learning oversampling principal-component-analysis statistics synthetic-data weather xarray

Scientific Fields

Engineering Computer Science - 60% confidence
Last synced: 4 months ago

Repository

📈 🐍 Multidimensional synthetic data generation with Copula and fPCA models in Python

Statistics
  • Stars: 64
  • Watchers: 3
  • Forks: 10
  • Open Issues: 2
  • Releases: 6
Created almost 6 years ago · Last pushed about 2 years ago
Metadata Files
Readme Changelog Contributing License Zenodo

README.md

synthia

[![PyPI](https://img.shields.io/pypi/v/synthia)](https://pypi.org/project/synthia) [![CI](https://github.com/dmey/synthia/workflows/CI/badge.svg)](https://github.com/dmey/synthia/actions) [![DOI](https://joss.theoj.org/papers/10.21105/joss.02863/status.svg)](https://doi.org/10.21105/joss.02863)

[Overview](#overview) | [Documentation](#documentation) | [How to cite](#how-to-cite) | [Contributing](#contributing) | [Development notes](#development-notes) | [Copyright and license](#copyright-and-license) | [Acknowledgements](#acknowledgements)

Overview

Synthetic data need to preserve the statistical properties of real data in terms of their individual behavior and (inter-)dependences. Copulas and functional Principal Component Analysis (fPCA) are statistical models that allow these properties to be simulated (Joe 2014). As such, copula-generated data have shown potential to improve the generalization of machine learning (ML) emulators (Meyer et al. 2021) or to anonymize real datasets (Patki et al. 2016).

Synthia is an open source Python package to model univariate and multivariate data, parameterize data using empirical and parametric methods, and manipulate marginal distributions. It is designed to enable scientists and practitioners to handle labelled multivariate data typical of computational sciences. For example, given some vertical profiles of atmospheric temperature, we can use Synthia to generate new but statistically similar profiles in just three lines of code (Table 1).

Synthia supports three methods of multivariate data generation: (i) fPCA, (ii) parametric (Gaussian) copula, and (iii) vine copula models. All three handle continuous variables; vine copulas additionally handle discrete and categorical variables. It has a simple and succinct API that natively handles xarray's labelled arrays and datasets. fPCA and the Gaussian copula are implemented in pure Python, while vines are computed efficiently with the well-tested C++ library vinecopulib through pyvinecopulib's bindings. For more information, please see the website at https://dmey.github.io/synthia.
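To make method (ii) more concrete, here is a minimal NumPy/SciPy sketch of Gaussian copula sampling with empirical margins. It illustrates the general technique only and is not Synthia's implementation; the helper names `fit_gaussian_copula` and `sample_gaussian_copula` and the toy data are hypothetical.

```python
import numpy as np
from scipy import stats


def fit_gaussian_copula(data):
    """Return the correlation matrix of the normal scores of `data`.

    `data` is an (n_samples, n_features) array. Hypothetical helper,
    not part of Synthia's API.
    """
    n = data.shape[0]
    # Rank each column and rescale to (0, 1) to get pseudo-observations.
    ranks = np.apply_along_axis(stats.rankdata, 0, data)
    u = ranks / (n + 1)
    # Map the uniform pseudo-observations to standard normal scores.
    z = stats.norm.ppf(u)
    return np.corrcoef(z, rowvar=False)


def sample_gaussian_copula(data, corr, n_samples, rng):
    """Draw synthetic rows with the fitted dependence and empirical margins."""
    d = data.shape[1]
    # Sample correlated normals, then map them to uniforms.
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u = stats.norm.cdf(z)
    # Push each uniform column through the empirical quantile function.
    return np.column_stack(
        [np.quantile(data[:, j], u[:, j]) for j in range(d)]
    )


rng = np.random.default_rng(0)
# Toy source data (an assumption for illustration): two dependent
# variables, one normal and one log-normal.
latent = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=2000)
source = np.column_stack([latent[:, 0], np.exp(latent[:, 1])])

corr = fit_gaussian_copula(source)
synthetic = sample_gaussian_copula(source, corr, n_samples=500, rng=rng)
print(synthetic.shape)
```

Because the margins are empirical, synthetic values never leave the range of the source data; a parametric margin would be needed to extrapolate beyond it.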

Table 1. Example application of Gaussian and fPCA classes in Synthia. These are used to generate random profiles of atmospheric temperature similar to those included in the source data. The xarray dataset structure is maintained and returned by Synthia.

| Source | Synthetic with Gaussian Copula | Synthetic with fPCA |
| --- | --- | --- |
| `ds = syn.util.load_dataset()` | `g = syn.CopulaDataGenerator()` | `g = syn.fPCADataGenerator()` |
| | `g.fit(ds, syn.GaussianCopula())` | `g.fit(ds)` |
| | `g.generate(n_samples=500)` | `g.generate(n_samples=500)` |
| Source | Gaussian | fPCA |
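For intuition about the fPCA column of Table 1, the sketch below shows one generic way to generate profiles from principal components using plain NumPy: decompose the centered profiles with an SVD, then draw independent normal scores for the leading components. This is an assumption-laden illustration, not Synthia's code; `fit_pca`, `generate_profiles`, and the toy temperature profiles are hypothetical.

```python
import numpy as np


def fit_pca(profiles, n_components):
    """Return the mean profile, leading components, and score std devs.

    `profiles` is an (n_samples, n_levels) array. Hypothetical helper,
    not part of Synthia's API.
    """
    mean = profiles.mean(axis=0)
    centered = profiles - mean
    # Rows of `vt` are the principal directions, ordered by variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Singular values rescaled to standard deviations of the scores.
    score_std = s[:n_components] / np.sqrt(profiles.shape[0] - 1)
    return mean, components, score_std


def generate_profiles(mean, components, score_std, n_samples, rng):
    """Draw normal scores per component and map them back to profiles."""
    scores = rng.normal(scale=score_std, size=(n_samples, len(score_std)))
    return mean + scores @ components


rng = np.random.default_rng(0)
# Toy source data (an assumption): 300 temperature-like profiles on 20 levels.
levels = 20
base = np.linspace(290.0, 220.0, levels)
mode = np.sin(np.linspace(0.0, np.pi, levels))
amplitude = rng.normal(scale=5.0, size=(300, 1))
profiles = base + amplitude * mode + rng.normal(scale=0.5, size=(300, levels))

mean, components, score_std = fit_pca(profiles, n_components=3)
synthetic_profiles = generate_profiles(mean, components, score_std, 500, rng)
print(synthetic_profiles.shape)
```

Drawing scores from independent normal distributions is a simplification: it reproduces the variance along each component but not any non-Gaussian structure in the scores.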

Documentation

For installation instructions, getting started guides and tutorials, background information, and API reference summaries, please see the website at https://dmey.github.io/synthia.

How to cite

If you are using Synthia, please cite the following two papers using their respective Digital Object Identifiers (DOIs). Citations may be generated automatically using Crosscite's DOI Citation Formatter or from the BibTeX entries below.

| Synthia Software | Software Application |
| --- | --- |
| DOI: 10.21105/joss.02863 | DOI: 10.5194/gmd-14-5205-2021 |

```bibtex
@article{MeyerandNagler_2021,
  doi = {10.21105/joss.02863},
  url = {https://doi.org/10.21105/joss.02863},
  year = {2021},
  publisher = {The Open Journal},
  volume = {6},
  number = {65},
  pages = {2863},
  author = {David Meyer and Thomas Nagler},
  title = {Synthia: multidimensional synthetic data generation in Python},
  journal = {Journal of Open Source Software}
}

@article{MeyerandNaglerandHogan_2021,
  doi = {10.5194/gmd-14-5205-2021},
  url = {https://doi.org/10.5194/gmd-14-5205-2021},
  year = {2021},
  publisher = {Copernicus {GmbH}},
  volume = {14},
  number = {8},
  pages = {5205--5215},
  author = {David Meyer and Thomas Nagler and Robin J. Hogan},
  title = {Copula-based synthetic data augmentation for machine-learning emulators},
  journal = {Geoscientific Model Development}
}
```

If needed, you may also cite the specific software version with its corresponding Zenodo DOI.

Contributing

If you are looking to contribute, please read our Contributors' guide for details.

Development notes

If you would like to know more about specific development guidelines, testing and deployment, please refer to our development notes.

Copyright and license

Copyright 2020 D. Meyer and T. Nagler. Licensed under MIT.

Acknowledgements

Special thanks to @letmaik for his suggestions and contributions to the project.

Owner

  • Login: dmey
  • Kind: user

JOSS Publication

Synthia: multidimensional synthetic data generation in Python
Published
September 24, 2021
Volume 6, Issue 65, Page 2863
Authors
David Meyer
Department of Meteorology, University of Reading, Reading, UK; Department of Civil and Environmental Engineering, Imperial College London, London, UK
Thomas Nagler
Mathematical Institute, Leiden University, Leiden, The Netherlands
Editor
Olivia Guest
Tags
machine-learning data-science python

Papers & Mentions

Total mentions: 1

Revealing the morphological architecture of a shape memory polyurethane by simulation
Last synced: 2 months ago

GitHub Events

Total
  • Issues event: 1
  • Watch event: 9
  • Fork event: 1
Last Year
  • Issues event: 1
  • Watch event: 9
  • Fork event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 59
  • Total Committers: 4
  • Avg Commits per committer: 14.75
  • Development Distribution Score (DDS): 0.119
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
dmey d****y 52
Thomas Nagler t****r 4
Maik Riechert m****t@a****e 2
Konrad Hinsen k****n 1
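
The all-time figures above can be reproduced from the per-committer counts. The short sketch below assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is an assumption, not something stated on this page, although it is consistent with the reported value.

```python
# Commit counts taken from the Top Committers table.
commits = {
    "dmey": 52,
    "Thomas Nagler": 4,
    "Maik Riechert": 2,
    "Konrad Hinsen": 1,
}

total_commits = sum(commits.values())
avg_commits = total_commits / len(commits)

# Assumed definition: DDS = 1 - (commits by top committer / total commits).
dds = 1 - max(commits.values()) / total_commits

print(total_commits, avg_commits, round(dds, 3))
```

A score close to 0 indicates that a single committer accounts for almost all commits.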
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 11
  • Total pull requests: 21
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 24 days
  • Total issue authors: 6
  • Total pull request authors: 5
  • Average comments per issue: 1.91
  • Average comments per pull request: 0.1
  • Merged pull requests: 19
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mnarayan (3)
  • dmey (3)
  • khinsen (2)
  • BigTuna08 (1)
  • nathan-greeneltch (1)
  • NickBanana7 (1)
Pull Request Authors
  • dmey (12)
  • tnagler (4)
  • khinsen (2)
  • letmaik (2)
  • icarosadero (1)
Top Labels
Issue Labels
Pull Request Labels
documentation (1) enhancement (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 24 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 2
  • Total versions: 6
  • Total maintainers: 1
pypi.org: synthia

Multidimensional synthetic data generation in Python

  • Versions: 6
  • Dependent Packages: 1
  • Dependent Repositories: 2
  • Downloads: 24 Last month
Rankings
Dependent packages count: 3.3%
Stargazers count: 10.1%
Dependent repos count: 11.9%
Average: 13.3%
Forks count: 13.4%
Downloads: 28.1%
Maintainers (1)
Last synced: 4 months ago

Dependencies

setup.py pypi
  • bottleneck *
  • numpy *
  • scipy *
  • xarray *
.github/workflows/ci.yml actions
  • actions/checkout v2 composite
  • conda-incubator/setup-miniconda v2 composite
  • peaceiris/actions-gh-pages v3 composite
environment.yml conda
  • bottleneck
  • jupyter
  • matplotlib
  • myst-parser
  • nbsphinx
  • numpy
  • pip
  • pytest
  • python 3.8.*
  • scipy
  • seaborn
  • setuptools
  • sphinx
  • sphinx-autobuild
  • sphinx-copybutton
  • sphinx_rtd_theme
  • sphinxcontrib-bibtex 1.*
  • xarray