sdnist

SDNist: Benchmark data and evaluation tools for data synthesizers.

https://github.com/usnistgov/sdnist

Science Score: 65.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
    Organization usnistgov has institutional domain (www.nist.gov)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.4%) to scientific vocabulary

Keywords

dataset differential-privacy privacy python python3 synthetic-data synthetic-data-generator
Last synced: 4 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: usnistgov
  • License: other
  • Language: HTML
  • Default Branch: main
  • Homepage:
  • Size: 53.1 MB
Statistics
  • Stars: 37
  • Watchers: 2
  • Forks: 15
  • Open Issues: 2
  • Releases: 12
Created about 4 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation Dei

README.md

SDNist v2.5: Deidentified Data Report Tool

SDNist is the official software package for engaging in the NIST Collaborative Research Cycle

Welcome! SDNist is a python package that provides benchmark data and evaluation metrics for deidentified data generators. This version of SDNist only supports using the NIST ACS Data Excerpts, a geographically partitioned, limited feature data set. Future versions of SDNist will be extended to support additional NIST Excerpt Benchmark data sets.

The deidentified data report evaluates utility and privacy of a given deidentified dataset and generates a summary quality report with performance of a deidentified dataset enumerated and illustrated for each utility and privacy metric.

Preview sample reports produced by the tool here.

This tool is being actively developed. Please [raise an Issue](https://github.com/usnistgov/SDNist/issues) if you catch a bug or have feature suggestions.

Project Team

Karan Bhagat, Knexus Research - Developer sdnist.report package

Damon Streat, Knexus Research - Developer

Christine Task, Knexus Research - Project technical lead

Gary Howarth, NIST - Project PI gary.howarth@nist.gov

Acknowledgements

SDNist v2 grew from SDNist v1, developed in partnership with Saurus Technologies under CRADA CN-21-0143.

Reporting Issues

Help us improve the package and this guide by reporting issues here.

Temporal Map Challenge Environment

SDNist v2.0 and above does not support the Temporal Map Challenge environment.

To run the testing environment from the NIST PSCR Differential Privacy Temporal Map Challenge for the Chicago Taxi data sprint or the American Community Survey sprint, please go to the Temporal Map Challenge assets repository.

Setting Up the SDNIST Report Tool

Brief Setup Instructions

SDNist is compatible with Python versions from 3.9 to 3.12. If you have installed a previous version of the SDNist library, we recommend installing v2.5 in a virtual environment. v2.5 can be installed via Release 2.5 or from PyPI: pip install sdnist or, if you already have a version installed, pip install --upgrade sdnist.

The NIST ACS Data Excerpt data will download on the fly.
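As a quick pre-install check, the supported range stated above (Python 3.9 to 3.12) can be tested from Python itself. This is a minimal sketch; the helper name `is_supported_python` is hypothetical and is not part of the sdnist package.

```python
import sys

# Supported range as stated in this README (3.9 through 3.12, inclusive).
MIN_SUPPORTED = (3, 9)
MAX_SUPPORTED = (3, 12)


def is_supported_python(version=None):
    """Return True if (major, minor) falls in the range this README lists.

    `version` is any tuple starting with (major, minor); it defaults to the
    running interpreter's sys.version_info.
    """
    major, minor = tuple(version or sys.version_info)[:2]
    return MIN_SUPPORTED <= (major, minor) <= MAX_SUPPORTED


if __name__ == "__main__":
    current = "{}.{}".format(*sys.version_info[:2])
    status = "within" if is_supported_python() else "outside"
    print(f"Python {current} is {status} the documented 3.9 to 3.12 range")
```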

Detailed Setup Instructions Using PyPI

  1. The SDNist Report Tool is a part of the sdnist Python library that can be installed on a user’s MAC OS, Windows, or Linux machine.

  2. The sdnist library requires Python to be installed on the user's machine. It supports Python versions from 3.9 to 3.12. Check whether an installation exists on the machine by executing the following command in your terminal on Mac/Linux or powershell on Windows: c:\> python -V If Python is already installed, the above command should return the currently installed version. If Python is not found or the version is below 3.9, then you can download Python from the Python website.

  3. Create a local directory/folder on the machine to set up the SDNist library. This guide assumes the local directory to be sdnist-project; an example of a complete file path is c:\sdnist-project: c:\sdnist-project>

  4. In the already-opened terminal or powershell window, execute the following command to create a new Python environment. The sdnist library will be installed in this newly created Python environment:

    c:\sdnist-project> python -m venv venv

  5. The new Python environment will be created in the sdnist-project directory, and the files of the environment should be in the venv directory. To check whether a new Python environment was created successfully, use the following command to list all directories in the sdnist-project directory, and make sure the venv directory exists.

    MAC OS/Linux: sdnist-project> ls
    Windows: c:\sdnist-project> dir

  6. Now activate the Python environment and install the sdnist library into it.

    MAC OS/Linux: sdnist-project> . venv/bin/activate
    The Python virtual environment should now be activated. You should see the environment name (venv in this case) prepended to the terminal prompt as below:
    (venv) sdnist-project>

    Windows: c:\sdnist-project> . venv/Scripts/activate
    The Python virtual environment should now be activated. You should see the environment name (venv in this case) prepended to the command/powershell prompt as below:
    (venv) c:\sdnist-project>

    On Windows, some users may encounter the following error, because executing scripts is disabled by default on some Windows machines: C:\sdnist-project\venv\Scripts\Activate.ps1 cannot be loaded because running scripts is disabled on this system. Run the following command to let Windows execute scripts: Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope LocalMachine

  7. Install the sdnist Python library: (venv) c:\sdnist-project> pip install sdnist

  8. Installation is successful if executing the following command outputs a help menu for the sdnist.report package:

    (venv) c:\sdnist-project> python -m sdnist.report -h

    Output:

    ```
    usage: main.py [-h] [--labels LABELS] [--data-root DATA_ROOT] PATH_DEIDENTIFIED_DATASET TARGET_DATASET_NAME

    positional arguments:
      PATH_DEIDENTIFIED_DATASET
                            Location of deidentified dataset (csv or parquet
                            file).
      TARGET_DATASET_NAME   Select name of the target dataset that was used to
                            generated given deidentified dataset.
    
    options:
      -h, --help            show this help message and exit
      --labels LABELS       This argument is used to add meta-data to help
                            identify which deidentified data was was evaluated in
                            the report. The argument can be a string that is a
                            plain text label for the file, or it can be a file
                            path to a json file containing [label, value] pairs.
                            This labels will be included in the printed report.
      --data-root DATA_ROOT
                            Path of the directory to be used as the root for the
                            target datasets.
    
    Choices for Target Dataset Name:
      [DATASET NAME]        [FILENAME]
      MA                    ma2019
      TX                    tx2019
      NATIONAL              national2019
      SBO                   sbo_target
    

    ```

  9. These instructions install sdnist into a virtual environment. The virtual environment must be activated (step 6) each time a new terminal window is used with sdnist.
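Because the environment has to be re-activated in every new terminal, it can help to confirm that activation took effect before running the tool. The sketch below relies only on the standard library; the function name `in_virtual_environment` is hypothetical and not part of sdnist.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtual_environment():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; run the activate script first.")
```

Running this with the `python` found on the prompt shows whether `(venv)` is really in effect for that terminal window.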

Generate Data Quality Report

  1. The sdnist.report package requires a path to the deidentified dataset file and the name of the target dataset from which the deidentified dataset file was created. The command line usage of the sdnist.report package is: python -m sdnist.report PATH_DEIDENTIFIED_DATASET TARGET_DATASET_NAME

    The above command is just an example usage signature of the package. Steps 3 through 5 show the actual commands to run the tool, where the parameter PATH_DEIDENTIFIED_DATASET is replaced with the path of the deidentified dataset file on your machine, and the parameter TARGET_DATASET_NAME is replaced with one of the bundled dataset names (MA, TX, or NATIONAL).

    A deidentified dataset file can be anywhere on your machine. You only need the path of the file to pass it as an argument to the sdnist.report package. For illustration purposes, this guide assumes an example deidentified dataset file named syn_tx.csv is generated from the bundled dataset file named TX that is present in the sdnist-project directory. You can also use the bundled toy deidentified datasets for generating some toy evaluation reports using the sdnist.report package by following steps 5 and 6 in the next section, Setup Data for SDNIST Report Tool.

    The sdnist.report package comes bundled with three target datasets: MA, TX, and NATIONAL. If these datasets are not available locally, the package will download them automatically when you run any one of the commands in steps 3 through 5 for the first time. In case of any trouble while downloading the datasets, please refer to the next section, Setup Data for SDNIST Report Tool.

  2. If you have closed the terminal or the powershell window that was used for the tool setup, open a new one, and after navigating to the sdnist-project directory, run the activate script as explained in step 6 of the Setting Up the SDNIST Report Tool section.

  3. Use the following command to generate a data quality report for the example deidentified dataset (syn_tx.csv) that is generated using the bundled dataset TX:

    ```
    (venv) c:\sdnist-project> python -m sdnist.report syn_tx.csv TX
    ```

    At the completion of the process initiated by the above command, an .html report will open in the default web browser on your machine. Likewise, .html report files will be available in the reports directory created automatically in the sdnist-project directory.

  4. Use the following command to generate a data quality report for the example deidentified dataset (syn_ma.csv) that is generated using the bundled dataset MA:

    ```
    (venv) c:\sdnist-project> python -m sdnist.report syn_ma.csv MA
    ```

  5. Use the following command to generate a data quality report for the example deidentified dataset (syn_national.csv) that is generated using the bundled dataset NATIONAL:

    ```
    (venv) c:\sdnist-project> python -m sdnist.report syn_national.csv NATIONAL
    ```

  6. Use the following command to generate a data quality report for the example deidentified dataset (syn_sbo.csv) that is generated using the bundled dataset SBO:

    ```
    (venv) c:\sdnist-project> python -m sdnist.report syn_sbo.csv SBO
    ```

  7. Starting from version 2.1, SDNist allows users to add labels for the deidentified dataset used to generate the report:

    • To add a single string label to the report, use the command line option --labels followed by a string, as in the following example command: (venv) c:\sdnist-project> python -m sdnist.report syn_national.csv NATIONAL --labels used_epsilon_1 The string label used_epsilon_1 then appears in the report (screenshot: string label in report).
    • To add multiple string labels to the report, use the command line option --labels followed by a path to a json file containing the labels: (venv) c:\sdnist-project> python -m sdnist.report syn_national.csv NATIONAL --labels example_labels.json
      Where example_labels.json can be:
      ```
      {
        "epsilon": "1",
        "delta": "10^-5",
        "created on": "March 3, 2023",
        "deidentification method": "example_method"
      }
      ```
      The labels from example_labels.json then appear in the report (screenshot: multiple labels in report).
  8. The following are all the parameters offered by the sdnist.report package:

 - **PATH_DEIDENTIFIED_DATASET**: The absolute or relative path to the deidentified dataset .csv or parquet file. If the provided path is relative, it should be relative to the current working directory. This guide assumes the current working directory is sdnist-project.
 - **TARGET_DATASET_NAME**: This should be the name of one of the datasets bundled with the sdnist.report package. It is the name of the dataset from which the input deidentified dataset is generated, and it can be one of the following:
   - MA
   - TX
   - NATIONAL
   - SBO

 - **--data-root**: The absolute or relative path to the directory containing the bundled dataset, or the directory where the bundled dataset should be downloaded to if it is not available locally. The default directory is set to **BenchmarkData**.
 - **--labels**: This argument is used to add meta-data to help identify which deidentified data was evaluated in the report. The argument can be a string that is a plain text label for the file, or it can be a file path to a json file containing label, value pairs.
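The parameters above can also be assembled from a script, for example when generating reports for several deidentified files. The sketch below only builds the argument list for `python -m sdnist.report` using the options documented in this guide; the helper names `write_labels` and `build_report_command` are hypothetical and are not part of the sdnist API.

```python
import json
import sys
from pathlib import Path

# Target dataset names listed in the help menu shown in this guide.
TARGET_NAMES = ("MA", "TX", "NATIONAL", "SBO")


def write_labels(path, labels):
    """Write label/value pairs to a JSON file that can be passed to --labels."""
    Path(path).write_text(json.dumps(labels, indent=2), encoding="utf-8")
    return str(path)


def build_report_command(dataset_path, target_name, labels=None, data_root=None):
    """Assemble the argument list for `python -m sdnist.report`.

    `labels` is either a plain-text label or a path to a JSON labels file.
    """
    if target_name not in TARGET_NAMES:
        raise ValueError(f"unknown target dataset name: {target_name!r}")
    cmd = [sys.executable, "-m", "sdnist.report", str(dataset_path), target_name]
    if labels is not None:
        cmd += ["--labels", str(labels)]
    if data_root is not None:
        cmd += ["--data-root", str(data_root)]
    return cmd


if __name__ == "__main__":
    cmd = build_report_command("syn_national.csv", "NATIONAL", labels="used_epsilon_1")
    print(" ".join(cmd))
    # To actually run the report (requires sdnist to be installed):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Using `sys.executable` keeps the command tied to the interpreter of the currently activated virtual environment rather than whichever `python` happens to be first on the PATH.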

Setup Data for SDNIST Report Tool

  1. The sdnist.report package comes with built-in datasets. The package will automatically download the datasets from GitHub if they are not already available locally on your machine. You should see the following message in your terminal or powershell window when the datasets are downloaded by the sdnist.report package:

    ```
    (venv) c:\sdnist-project> python -m sdnist.report syn_tx.csv TX

    Downloading all SDNist datasets from:
    (link change) https://github.com/usnistgov/SDNist/releases/download/v2.2.0/BenchmarkData.zip ...
    ...5%, 47352 KB, 8265 KB/s, 5 seconds elapsed
    ```

    Follow the next subsection, Download Data Manually, if the sdnist.report package is unable to download the datasets.

  2. All the datasets required by the sdnist.report package are installed into the sdnist_toy_data directory, which should now be present inside the sdnist-project directory. sdnist_toy_data is also a data root directory. You can use some other directory as a data root by providing the --data-root argument to the sdnist.report package. If you provide a --data-root argument with a path, the sdnist.report package will look for datasets in the data root directory you have specified, and the package will download them if they are not present in the data root.

  3. The sdnist.report package also needs a deidentified dataset that it can evaluate against its original counterpart. Since the sdnist.report package comes bundled with the datasets, the deidentified dataset should be generated using the bundled datasets.

You can download a copy of the datasets from GitHub NIST ACS Data Excerpts. This copy is similar to the one bundled with the sdnist.report package, but it contains more documentation and a description of the datasets.

  4. You can download the toy deidentified datasets from GitHub Sdnist Toy Deidentified Dataset. Unzip the downloaded file, and move the unzipped toy_deidentified_dataset directory to the sdnist-project directory.

  5. Each toy deidentified dataset file is generated using the NIST ACS Data Excerpts. The syn_ma.csv, syn_tx.csv, and syn_national.csv deidentified dataset files are created from target datasets MA (ma2019.csv), TX (tx2019.csv), and NATIONAL (national2019.csv), respectively. You can use one of the toy deidentified dataset files for testing whether the sdnist.report package is installed correctly on your system.

  6. Use the following commands for generating reports if you are using a toy deidentified dataset file:

For evaluating the Massachusetts dataset: (venv) c:\sdnist-project> python -m sdnist.report toy_deidentified_data/syn_ma.csv MA

For evaluating the Texas dataset: (venv) c:\sdnist-project> python -m sdnist.report toy_deidentified_data/syn_tx.csv TX

For evaluating the national dataset: (venv) c:\sdnist-project> python -m sdnist.report toy_deidentified_data/syn_national.csv NATIONAL

  7. A deidentified dataset can be a .csv or a parquet file, and the path of this file is required by the sdnist.report package to generate a data quality report.
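Since the tool accepts only csv or parquet input, a small pre-flight check on the file path can catch a wrong argument before a report run starts. This is a sketch under the assumption that parquet files carry a `.parquet` extension; the helper `check_dataset_path` is hypothetical, and sdnist may apply its own, different validation.

```python
from pathlib import Path

# Assumed file extensions for the two formats named in this guide.
SUPPORTED_SUFFIXES = {".csv", ".parquet"}


def check_dataset_path(path, must_exist=False):
    """Return `path` as a Path if its extension looks like csv or parquet.

    With must_exist=True, also require that the file is present on disk.
    """
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"expected a .csv or .parquet file, got {p.name!r}")
    if must_exist and not p.is_file():
        raise FileNotFoundError(f"deidentified dataset not found: {p}")
    return p


if __name__ == "__main__":
    print(check_dataset_path("toy_deidentified_data/syn_ma.csv"))
```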

Download Data Manually

  1. If the sdnist.report package is not able to download the datasets, you can download them from GitHub NIST ACS Data Excerpts.
  2. Unzip the BenchmarkData.zip file and move the unzipped BenchmarkData directory to the sdnist-project directory.
  3. Delete the BenchmarkData.zip file once the data is successfully extracted from the zip.
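The manual steps above can be scripted with the standard library. The sketch below assumes, as step 2 implies, that the archive unpacks to a top-level BenchmarkData directory; the function name `extract_benchmark_data` is hypothetical and not part of sdnist.

```python
import zipfile
from pathlib import Path


def extract_benchmark_data(zip_path, project_dir, delete_zip=True):
    """Extract the archive into project_dir and optionally delete the zip.

    Returns the path where the BenchmarkData directory is expected to be
    (assumes the archive contains a top-level BenchmarkData folder).
    """
    zip_path = Path(zip_path)
    project_dir = Path(project_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(project_dir)
    # Remove the archive only after extraction finished without an error.
    if delete_zip:
        zip_path.unlink()
    return project_dir / "BenchmarkData"


if __name__ == "__main__":
    zip_file = Path("BenchmarkData.zip")
    if zip_file.is_file():
        print("Extracted to", extract_benchmark_data(zip_file, "."))
    else:
        print("BenchmarkData.zip not found in the current directory.")
```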

Citing SDNist Deidentified Data Report Tool

If you publish work that uses the SDNist Deidentified Data Report Tool, please cite the software. Citation recommendation:

Task C., Bhagat K., and Howarth G.S. (2023), SDNist v2: Deidentified Data Report Tool, National Institute of Standards and Technology, https://doi.org/10.18434/mds2-2943.

Credits

Owner

  • Name: National Institute of Standards and Technology
  • Login: usnistgov
  • Kind: organization
  • Location: Gaithersburg, Md.

Department of Commerce

Citation (CITATION.cff)

cff-version: 1.2.0
title: "SDNist: Deidentified Data Report Tool"
abstract: "SDNist provides benchmark data and a suite of both machine- and human-readable outputs with more than ten metrics including univariate and multivariate statistics, database distance metrics, principal component analysis, propensity, basic privacy evaluation, and other information-rich tools. "
message: >-
  If you use this repository or present information about it publicly, please cite us.
type: software
version: 2.3
doi: 10.18434/mds2-2943
date-released: 2023-04-14
contact:
  - affiliation: "National Institute of Standards and Technology"
    email: gary.howarth@nist.gov
    family-names: Howarth
    given-names: Gary
authors:
- family-names: Task
  given-names: Christine
  affiliation: Knexus Research Corporation
  email: christine.task@knexusresearch.com
- family-names: Bhagat
  given-names: Karan
  affiliation: Knexus Research Corporation
- family-names: Howarth
  given-names: Gary
  affiliation: National Institute of Standards and Technology
  email: gary.howarth@nist.gov
  orcid: "https://orcid.org/0000-0002-3587-0546"

GitHub Events

Total
  • Issues event: 9
  • Watch event: 3
  • Delete event: 8
  • Member event: 1
  • Issue comment event: 8
  • Push event: 25
  • Pull request review event: 1
  • Pull request event: 21
  • Fork event: 3
  • Create event: 3
Last Year
  • Issues event: 9
  • Watch event: 3
  • Delete event: 8
  • Member event: 1
  • Issue comment event: 8
  • Push event: 25
  • Pull request review event: 1
  • Pull request event: 21
  • Fork event: 3
  • Create event: 3

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 273
  • Total Committers: 9
  • Avg Commits per committer: 30.333
  • Development Distribution Score (DDS): 0.421
Top Committers
Name Email Commits
Karan Bhagat k****m@g****m 158
Gary Howarth 5****h@u****m 89
currid 9****d@u****m 9
Grégoire Lothe gl@s****m 5
Christine Task c****k@C****n 4
Peggy Currid (Contractor) c****d@g****m 3
Mary Ann Wall 5****l@u****m 3
Karan Bhagat 8****m@u****m 1
glipstein g****n@g****m 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 15
  • Total pull requests: 40
  • Average time to close issues: 7 months
  • Average time to close pull requests: 2 months
  • Total issue authors: 9
  • Total pull request authors: 7
  • Average comments per issue: 1.4
  • Average comments per pull request: 0.23
  • Merged pull requests: 28
  • Bot issues: 0
  • Bot pull requests: 16
Past Year
  • Issues: 3
  • Pull requests: 19
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 19 days
  • Issue authors: 3
  • Pull request authors: 3
  • Average comments per issue: 0.67
  • Average comments per pull request: 0.26
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 6
Top Authors
Issue Authors
  • garyhowarth (3)
  • ghost (3)
  • yoid2000 (3)
  • mikel-hernandezj (1)
  • iAmiRNA (1)
  • AwesomeLemon (1)
  • madi (1)
  • logmms (1)
  • djstreat (1)
Pull Request Authors
  • dependabot[bot] (17)
  • kbtriangulum (11)
  • garyhowarth (5)
  • djstreat (4)
  • glipstein (1)
  • ghost (1)
  • currid (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (17) python (2)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 272 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 10
  • Total maintainers: 1
pypi.org: sdnist

SDNist: Deidentified Data Report Generator

  • Versions: 10
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 272 Last month
Rankings
Dependent packages count: 10.1%
Downloads: 11.2%
Forks count: 11.4%
Stargazers count: 12.2%
Average: 13.3%
Dependent repos count: 21.6%
Maintainers (1)
Last synced: 4 months ago

Dependencies

Pipfile pypi
  • flask >=2 develop
  • ipykernel >=6 develop
  • ipywidgets >=7 develop
  • jupyterlab >=2 develop
  • PyQt5 >=5
  • PyQtWebEngine >=5
  • jinja2 >=3
  • loguru >=0.6
  • matplotlib >=3
  • networkx >=2
  • numpy >=1
  • pandas >=1
  • pyarrow >=7
  • pydot >=1
  • python-louvain 0.16
  • requests >=2
  • scikit-learn >=1
  • scipy >=1
  • tqdm >=4
setup.py pypi
  • jinja2 >=3
  • loguru >=0.6
  • matplotlib >=3
  • numpy >=1
  • pandas >=1
  • pyarrow >=7
  • requests >=2
  • scikit-learn >=1
  • scipy >=1
  • tqdm >=4