datasharing

Mirror of https://gitlab.com/CBDS/DataSharing

https://github.com/olemussmann/datasharing

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: ncbi.nlm.nih.gov
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.3%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: OleMussmann
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Size: 11.2 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created over 6 years ago · Last pushed almost 6 years ago
Metadata Files
Readme License Citation Zenodo

README.md

FAIRHealth Project: Privacy-Preserving Distributed Learning Infrastructure (PPDL)

Introduction

The FAIRHealth project is a collaboration between Maastricht University and Statistics Netherlands, running from Feb 2018 to Feb 2020. It is funded by the Dutch National Research Agenda (NWA) under the VWData program. In this project, we propose an innovative infrastructure for the secure and privacy-preserving analysis of personal health data from multiple providers with different governance policies. The approach uses distributed machine learning to analyze vertically partitioned data, where different variables (attributes, features) about the same individual are distributed over a set of data sources.

The main idea of our infrastructure is to send data-processing or analysis algorithms to the data sources rather than transferring data to the researchers. Only the final (verified) results are returned to the researchers. Our infrastructure is an extension of the Personal Health Train architecture. The trains (applications) containing analytic algorithms are sent to the data stations (sources). Each station can inspect whether the train is allowed to execute its application on (a subset of) the available data.
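The station-side inspection described above can be pictured with a minimal sketch. It assumes a station keeps a policy that maps each approved algorithm to the columns it may read. The policy format, the function name `station_allows`, and the column names are hypothetical illustrations, not part of the PPDL code base.

```python
def station_allows(policy, algorithm, requested_columns):
    """Return True only if the algorithm is approved by the station and
    every requested column lies within the approved subset of the data."""
    allowed = policy.get(algorithm)
    if allowed is None:
        # The station has never approved this algorithm.
        return False
    return set(requested_columns) <= set(allowed)


# Hypothetical policy of one data station.
policy = {"logistic_regression": {"age", "bmi"}}

print(station_allows(policy, "logistic_regression", ["age"]))            # True
print(station_allows(policy, "logistic_regression", ["age", "income"]))  # False
print(station_allows(policy, "unknown_model", ["age"]))                  # False
```

The check is deliberately a pure function of the policy and the request, so a station could log the decision before any data is touched.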

Please find our publications:

  1. Paper: A Privacy-Preserving Infrastructure for Analyzing Personal Health Data in a Vertically Partitioned Scenario
  2. Paper: Using the Personal Health Train for Automated and Privacy-Preserving Analytics on Vertically Partitioned Data
  3. Others: Slides, Video Demo 1, Video Demo 2, Video Demo 3

Structure of PPDL

As of Feb 2020, the PPDL infrastructure contains six components:

  1. Data transformation (Transform csv and sav data files to RDF data stored in a graph database)
  2. Overview of Data (Visualize the data and obtain basic information or a statistical summary)
  3. Pseudonymization & Encryption (Pseudonymize personal identifiers (PI) and encrypt data files)
  4. Matching & Merging (Match and merge multiple datasets on the pseudonymized PI)
  5. Analysis (Run the machine learning pipeline)
  6. Logging (Record the full data processing history)
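To make components 3 and 4 more concrete, here is a minimal sketch of pseudonymizing identifiers and merging two vertically partitioned datasets on the pseudonym. It assumes a keyed hash (HMAC-SHA256) with a secret shared by all data parties. This is an illustrative stand-in: the scheme PPDL actually uses is not described in this README, and the function names, the secret, and the toy data are all hypothetical.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret):
    """Replace a personal identifier with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(secret, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_rows(rows, id_field, secret):
    """Return {pseudonym: attributes} with the raw identifier removed."""
    result = {}
    for row in rows:
        attributes = {k: v for k, v in row.items() if k != id_field}
        result[pseudonymize(row[id_field], secret)] = attributes
    return result


def merge_on_pseudonym(left, right):
    """Inner join: keep only individuals present in both datasets."""
    return {p: {**left[p], **right[p]} for p in left.keys() & right.keys()}


# Hypothetical shared secret and toy data; both parties must use the same key.
SECRET = b"shared-secret-agreed-by-all-parties"
party_1 = [{"id": "A1", "age": 54}, {"id": "B2", "age": 61}]
party_2 = [{"id": "B2", "income": 30000}, {"id": "C3", "income": 42000}]

merged = merge_on_pseudonym(
    pseudonymize_rows(party_1, "id", SECRET),
    pseudonymize_rows(party_2, "id", SECRET),
)
print(list(merged.values()))  # [{'age': 61, 'income': 30000}]
```

Only the individual known to both parties survives the join, and the merged record no longer carries the raw identifier.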

Prerequisites

Operating system and hardware:

  • Windows 10 (Fall Creators Update or higher)
  • macOS 10.13 (High Sierra)
  • Ubuntu 16.04, 17.10 or 18.04
  • Moderately recent CPU (minimum i5 processor)
  • 8 GB of RAM (not occupied by many other applications/services)

Software:

  • Docker Community Edition
  • Python 3.6 (with pip as dependency manager)

How to use it? (Test on your local laptop)

  1. Install the base container: a base container needs to be installed on every station (the data parties and the Trusted Secure Environment). In your terminal:

```shell
docker pull sophia921025/datasharing_base:v0.1
```

  2. Get an overview of the data: at each data party station, create a folder and put the data file and request.yaml into it. Configure request.yaml according to the overview you need. From the folder that contains the data file and request.yaml, run the following on Mac/Linux (replace "data_party_1.csv" in the third line with the name of your own data file):

```shell
docker run --rm \
-v "$(pwd)/input/request.yaml:/inputVolume/request.yaml" \
-v "$(pwd)/input/data_party_1.csv:/data_party_1.csv" \
-v "$(pwd)/output:/output" sophia921025/datasharing_overview:local0.1
```

On Windows (Command Prompt), run the following (replace "data_party_1.csv" in the third line with the name of your own data file):

```shell
docker run --rm ^
-v "%cd%/input/request.yaml:/inputVolume/request.yaml" ^
-v "%cd%/input/data_party_1.csv:/data_party_1.csv" ^
-v "%cd%/output:/output" sophia921025/datasharing_overview:local0.1
```
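Because the Mac/Linux and Windows commands differ only in how the current directory is spelled, the same call can be assembled once in Python. The following is a minimal sketch, not part of the project: the function name `build_overview_command` is hypothetical, and it assumes that the container path of the data file mirrors the file name, as it does in the commands above.

```python
import shlex
from pathlib import Path

# Image name taken from the commands above.
IMAGE = "sophia921025/datasharing_overview:local0.1"


def build_overview_command(workdir, data_file="data_party_1.csv"):
    """Return the `docker run` argument list for the overview step."""
    base = Path(workdir).resolve()
    mounts = [
        (base / "input" / "request.yaml", "/inputVolume/request.yaml"),
        # Assumption: the container path mirrors the data file name.
        (base / "input" / data_file, "/" + data_file),
        (base / "output", "/output"),
    ]
    command = ["docker", "run", "--rm"]
    for host_path, container_path in mounts:
        command += ["-v", "{}:{}".format(host_path, container_path)]
    command.append(IMAGE)
    return command


command = build_overview_command(".", data_file="data_party_1.csv")
# Print a copy-pastable form; nothing is executed here.
print(" ".join(shlex.quote(part) for part in command))
# To execute it: import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list to `subprocess.run` avoids shell-specific quoting and line-continuation rules altogether.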

  3. Generate public/private key pairs, used for encryption and for transferring verification keys.

```shell
docker run --rm \
-v "$(pwd)/input/ppkeys_input.yaml:/inputVolume/ppkeys_input.yaml" \
-v "$(pwd)/output:/output" sophia921025/datasharing_ppkeys:local0.1
```

  4. Pseudonymization and encryption: pseudonymize the personal identifiers (PI) used for linking multiple datasets, and encrypt the data files (pseudonymized PI + actual data). Go to the folder that contains the data file and encrypt_input.yaml, and configure encrypt_input.yaml first. Then, in the terminal on Mac/Linux (replace "data_party_1.csv" in the second line with the name of your own data file):

```shell
docker run --rm \
-v "$(pwd)/input/data_party_1.csv:/data_party_1.csv" \
-v "$(pwd)/input/publicKey_dms.pem:/publicKey.pem" \
-v "$(pwd)/input/encrypt_input.yaml:/inputVolume/encrypt_input.yaml" \
-v "$(pwd)/output:/output" sophia921025/datasharing_encdata:local0.1
```

On Windows (Command Prompt; replace "data_party_1.csv" in the second line with the name of your own data file):

```shell
docker run --rm ^
-v "%cd%/input/data_party_1.csv:/data_party_1.csv" ^
-v "%cd%/input/encrypt_input.yaml:/inputVolume/encrypt_input.yaml" ^
-v "%cd%/output:/output" sophia921025/datasharing_encdata:local0.1
```

After successful execution, your encrypted data file and key file (keys.json) are stored locally or on the server (e.g., a trusted third party or trusted secure environment).

  5. Have all data parties sign your model file (a Python script). Create a folder that contains the model file and encrypt_input.yaml (which needs to be configured). Then, in the terminal on Mac/Linux (replace "MLmodel_test.py" in the second line with the name of your own model file):

```shell
docker run --rm \
-v "$(pwd)/input/MLmodel_test.py:/MLmodel_test.py" \
-v "$(pwd)/input/sign_model_input.yaml:/inputVolume/sign_model_input.yaml" \
-v "$(pwd)/output:/output" sophia921025/datasharing_signmodel:local0.1
```

On Windows (Command Prompt; replace "your_model.py" in the second line with the name of your own model file):

```shell
docker run --rm ^
-v "%cd%/input/your_model.py:/your_model.py" ^
-v "%cd%/input/encrypt_input.yaml:/inputVolume/encrypt_input.yaml" ^
-v "%cd%/output:/output" sophia921025/datasharing_signmodel:local0.1
```

  6. At the Trusted Secure Environment (TSE), create a folder and put into it the encrypted data files from the data parties, security_input.yaml, analysis_input.yaml, and your analysis Python script (ML models). Configure security_input.yaml based on the keys from the data parties, and analysis_input.yaml based on your analysis requirements. In your terminal:

Mac/Linux:

```shell
docker run --rm \
-v "$(pwd)/input:/input" \
-v "$(pwd)/output:/output" \
-v "$(pwd)/input/security_input.yaml:/inputVolume/security_input.yaml" \
-v "$(pwd)/input/analysis_input.yaml:/inputVolume/analysis_input.yaml" \
sophia921025/datasharing_tse:v0.1
```

Windows:

```shell
docker run --rm ^
-v "%cd%/input:/input" ^
-v "%cd%/output:/output" ^
-v "%cd%/input/security_input.yaml:/inputVolume/security_input.yaml" ^
sophia921025/datasharing_tse:v0.1
```

If the Docker container runs properly, you will see execution logs like the ones below. At the end, all results and the logging history (ppds.log) are stored in the output folder. To avoid leaking data through error output, error messages raised during execution are saved to ppds.log instead of being printed to the screen.

```powershell
INFO ░ 2020-02-02 19:40:56,751 ░ verDec ░ verDec.py line 14 ▓ Reading request.yaml file...
INFO ░ 2020-02-02 19:40:56,944 ░ verDec ░ verDec.py line 111 ▓ Signed models has been verified successfully!
INFO ░ 2020-02-02 19:40:56,945 ░ verDec ░ verDec.py line 151 ▓ Verification and decryption took 0.3028s to run
...
...
...
INFO ░ 2020-01-19 10:25:05,619 ░ main ░ main.py line 272 ▓ In total, all models training took 16.6441 to run.
```
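Since errors are written to ppds.log rather than to the screen, it can help to read the log programmatically. The following is a minimal sketch of a parser for the line layout shown above (level, timestamp, logger name, source file with line number, message). The regular expression is inferred from the sample output only, and the function name `parse_log_line` is hypothetical; the layout of other entries in ppds.log may differ.

```python
import re
from datetime import datetime

# Layout inferred from the sample lines above:
# LEVEL ░ timestamp ░ logger ░ file line N ▓ message
LOG_PATTERN = re.compile(
    r"^(?P<level>[A-Z]+)\s*░\s*"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s*░\s*"
    r"(?P<logger>\S+)\s*░\s*"
    r"(?P<source>\S+) line (?P<line>\d+)\s*▓\s*"
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%Y-%m-%d %H:%M:%S,%f"
    )
    record["line"] = int(record["line"])
    return record


sample = (
    "INFO ░ 2020-02-02 19:40:56,751 ░ verDec ░ verDec.py line 14 "
    "▓ Reading request.yaml file..."
)
record = parse_log_line(sample)
print(record["level"], record["logger"], record["line"], record["message"])
# INFO verDec 14 Reading request.yaml file...
```

Applied to every line of ppds.log, the parsed records can then be filtered by level to locate failures, assuming errors are logged with a distinct level name.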

Owner

  • Login: OleMussmann
  • Kind: user

Citation (CITATION.cff)

# YAML 1.2
---
abstract: |
    "FAIRHealth project is a collaboration between Maastricht University and
    Statistics Netherlands from Feb 2018 to Feb 2020. It is funded by Dutch
    National Research Agenda (NWA) under VWData program. In this project, we
    propose an innovative infrastructure for the secure and privacy-preserving
    analysis of personal health data from multiple providers with different
    governance policies.  The approach involves distributed machine learning to
    analyze vertically partitioned data (different variables/attributes/features
    about a particular individual are distributed over a set of data sources).
    
    The main idea of our infrastructure is to send data-processing or analysis
    algorithms to data sources rather than transferring data to the researchers.
    Only the final (verified) results are returned to the researchers. Our
    infrastructure is an extension of Personal Health Train Architecture. The
    trains (applications) containing analytic algorithms are sent to the data
    stations (sources). The stations (sources) can inspect whether the train is
    allowed to execute the application on (a subset of) the available data."
authors: 
  -
    affiliation: "UMC+ Maastricht"
    family-names: Soest
    given-names: Johan
    name-particle: van
    name-suffix: PhD
    orcid: "https://orcid.org/0000-0003-2548-0330"
  -
    affiliation: "UMC+ Maastricht"
    family-names: Sun
    given-names: Chang
    name-suffix: MSc
    orcid: "https://orcid.org/0000-0001-8325-8848"
  -
    affiliation: "Statistics Netherlands (CBS)"
    family-names: Mussmann
    given-names: "Bjoern Ole"
    name-suffix: PhD
    orcid: "https://orcid.org/0000-0002-3803-4287"
cff-version: "1.1.0"
date-released: 2020-02-01
keywords: 
  - FAIRhealth
  - VWData
  - "privacy preserving analysis"
  - PPA
license: "Apache-2.0"
message: "If you use this software, please cite it using these metadata."
repository-code: "https://gitlab.com/CBDS/DataSharing"
title: FAIRHealth
version: "0.0.5"
...

Dependencies

DataPartyStation/EncData/Dockerfile docker
  • datasharing_base v0.1 build
DataPartyStation/OverviewData/Dockerfile docker
  • datasharing_base v0.1 build
DataPartyStation/PPKeys/Dockerfile docker
  • datasharing_base v0.1 build
DataPartyStation/SignModel/Dockerfile docker
  • datasharing_base v0.1 build
DataforTesting/Dockerfile docker
  • datasharing/base 2020-01-15 build
TSEStation/Dockerfile docker
  • datasharing_base v0.1 build
baseContainer/Dockerfile docker
  • python 3.6.9-slim-stretch build