datasharing

Mirror of https://gitlab.com/CBDS/DataSharing

https://github.com/olemussmann/datasharing


Repository


Basic Info

Host: GitHub
Owner: OleMussmann
License: apache-2.0
Language: Python
Default Branch: master
Size: 11.2 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 1

Created over 6 years ago · Last pushed about 6 years ago


FAIRHealth Project: Privacy-Preserving Distributed Learning Infrastructure (PPDL)

Introduction

The FAIRHealth project is a collaboration between Maastricht University and Statistics Netherlands that ran from Feb 2018 to Feb 2020. It is funded by the Dutch National Research Agenda (NWA) under the VWData program. In this project, we propose an innovative infrastructure for the secure and privacy-preserving analysis of personal health data from multiple providers with different governance policies. The approach uses distributed machine learning to analyze vertically partitioned data, meaning that different variables/attributes/features about a particular individual are distributed over a set of data sources.
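To make "vertically partitioned" concrete, here is a minimal sketch in plain Python. The two sources, the person IDs, and the feature names are invented for illustration and do not come from the project. Each source holds different columns about overlapping individuals, and a combined record exists only for people present in both.

```python
# Hypothetical data: two sources hold different features about the same people.
hospital = {  # source 1: clinical variables keyed by person ID
    "p01": {"bmi": 27.4, "diabetes": True},
    "p02": {"bmi": 22.1, "diabetes": False},
    "p03": {"bmi": 31.0, "diabetes": True},
}
statistics_office = {  # source 2: socio-economic variables keyed by person ID
    "p02": {"income": 41000, "household_size": 2},
    "p03": {"income": 28000, "household_size": 4},
    "p04": {"income": 52000, "household_size": 1},
}


def join_vertical(left, right):
    """Combine the feature columns of individuals present in both sources."""
    shared_ids = sorted(left.keys() & right.keys())
    return {pid: {**left[pid], **right[pid]} for pid in shared_ids}


combined = join_vertical(hospital, statistics_office)
for pid, features in combined.items():
    print(pid, features)
```

In the infrastructure described here, such a join would not happen at the researcher's side on raw identifiers; the sketch only shows the shape of the data.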

The main idea of our infrastructure is to send data-processing or analysis algorithms to the data sources rather than transferring data to the researchers. Only the final (verified) results are returned to the researchers. Our infrastructure is an extension of the Personal Health Train Architecture. The trains (applications) containing analytic algorithms are sent to the data stations (sources). Each station can inspect whether the train is allowed to execute its application on (a subset of) the available data.
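The train and station idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's implementation: the `DataStation` class, its allow-lists, and the `mean_age` train are hypothetical names. The station checks the train before running it, exposes only the permitted columns, and hands back only the computed result.

```python
from statistics import mean


class DataStation:
    """A data source that runs approved 'trains' on data it never hands out."""

    def __init__(self, records, allowed_trains, allowed_columns):
        self._records = records
        self._allowed_trains = set(allowed_trains)
        self._allowed_columns = set(allowed_columns)

    def run(self, train_name, train_func, columns):
        # Inspect the train before executing anything.
        if train_name not in self._allowed_trains:
            raise PermissionError(f"train {train_name!r} is not approved")
        blocked = set(columns) - self._allowed_columns
        if blocked:
            raise PermissionError(f"columns not shared: {sorted(blocked)}")
        # Execute on the permitted subset; only the result leaves the station.
        subset = [{c: row[c] for c in columns} for row in self._records]
        return train_func(subset)


def mean_age(rows):
    """Example train: an aggregate that reveals no individual record."""
    return mean(row["age"] for row in rows)


station = DataStation(
    records=[{"age": 34, "postcode": "6211"}, {"age": 58, "postcode": "6229"}],
    allowed_trains={"mean_age"},
    allowed_columns={"age"},
)
print(station.run("mean_age", mean_age, ["age"]))
```

A request for the `postcode` column, or a train whose name is not on the allow-list, raises `PermissionError` instead of running.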

Please find our publications:

Structure of PPDL

As of Feb 2020, the PPDL infrastructure contains the following components:

Data transformation (Transform CSV and SAV data files to RDF data stored in a graph database)
Overview of Data (Visualize and obtain basic information/statistical summary of data)
Pseudonymization & Encryption (Pseudonymize personal identifiers (PI) and encrypt data files)
Matching & Merging (Match and merge multiple datasets on pseudonymized PI)
Analysis (Go through the machine learning pipeline)
Logging of all data processing history
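The pseudonymization and matching components can be illustrated with a short Python sketch. The README does not say which pseudonymization method the containers use, so the keyed SHA-256 digest (HMAC) below is an assumption, and the function names, the shared secret, and the sample rows are all hypothetical.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret):
    """Replace a personal identifier with a keyed SHA-256 digest (assumed scheme)."""
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()


def pseudonymize_rows(rows, id_field, secret):
    """Return {pseudonym: remaining fields}; the raw identifier is dropped."""
    result = {}
    for row in rows:
        data = {key: value for key, value in row.items() if key != id_field}
        result[pseudonymize(row[id_field], secret)] = data
    return result


def match_and_merge(left, right):
    """Merge the records whose pseudonyms appear in both datasets."""
    shared = left.keys() & right.keys()
    return [{**left[p], **right[p]} for p in sorted(shared)]


# Hypothetical secret and data; every party must use the same secret.
shared_secret = b"hypothetical-secret-agreed-by-all-parties"
party_1 = [{"pi": "123456782", "bmi": 27.4}, {"pi": "987654321", "bmi": 22.1}]
party_2 = [{"pi": " 123456782", "income": 41000}, {"pi": "555555555", "income": 30000}]

merged = match_and_merge(
    pseudonymize_rows(party_1, "pi", shared_secret),
    pseudonymize_rows(party_2, "pi", shared_secret),
)
print(merged)
```

Normalizing the identifier before hashing matters: without the `strip()` call, the stray space in the second dataset would produce a different pseudonym and the match would be missed.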

Prerequisites

Hardware and operating system:

Windows 10 (Fall Creators Update or higher)
macOS 10.13 (High Sierra)
Ubuntu 16.04, 17.10 or 18.04
Moderately recent CPU (minimum i5 processor)
8 GB of RAM (not occupied by many other applications/services)

Software:

Docker Community Edition
- native install on Ubuntu
- Docker for Windows
- Docker for Mac
Python 3.6 (with pip as dependency manager)

How to use it? (Test on your local laptop)

Install base containers: on all data stations (data parties and the Trusted Secure Environment), a base container needs to be installed. In your terminal:

```shell
docker pull sophia921025/datasharing_base:v0.1
```

Get an overview of data: at each data party station, create a folder and put the data file and request.yaml into it. Configure request.yaml based on the overview of the data you need. The commands below read both files from the input subfolder of the current directory and write results to the output subfolder. On Mac/Linux, run (please change "data_party_1.csv" in the third line to the name of your own data file):

```shell
docker run --rm \
  -v "$(pwd)/input/request.yaml:/inputVolume/request.yaml" \
  -v "$(pwd)/input/data_party_1.csv:/data_party_1.csv" \
  -v "$(pwd)/output:/output" \
  sophia921025/datasharing_overview:local0.1
```

On Windows (cmd.exe, where the line-continuation character is ^), run (please change "data_party_1.csv" in the third line to the name of your own data file):

```shell
docker run --rm ^
  -v "%cd%/input/request.yaml:/inputVolume/request.yaml" ^
  -v "%cd%/input/data_party_1.csv:/data_party_1.csv" ^
  -v "%cd%/output:/output" ^
  sophia921025/datasharing_overview:local0.1
```
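Because the Mac/Linux and Windows commands differ only in how the current directory and line continuations are written, the same invocation can also be assembled in Python, which sidesteps shell quoting altogether. The sketch below is a convenience under assumptions, not part of the project: `overview_command` is a hypothetical helper, and it assumes the container expects the data file at `/<file name>` as in the commands above.

```python
from pathlib import Path

# Image name taken from the commands above.
OVERVIEW_IMAGE = "sophia921025/datasharing_overview:local0.1"


def overview_command(workdir, data_file):
    """Build the 'docker run' argument list for the overview container."""
    input_dir = Path(workdir) / "input"
    mounts = [
        (input_dir / "request.yaml", "/inputVolume/request.yaml"),
        (input_dir / data_file, f"/{data_file}"),  # assumed container path
        (Path(workdir) / "output", "/output"),
    ]
    cmd = ["docker", "run", "--rm"]
    for host_path, container_path in mounts:
        cmd += ["-v", f"{host_path}:{container_path}"]
    cmd.append(OVERVIEW_IMAGE)
    return cmd


cmd = overview_command(Path.cwd(), "data_party_1.csv")
print(" ".join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list means no shell is involved, so neither `\` nor `^` continuations are needed.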

Generate the public-private key pairs used for encryption and for transferring the verification keys:

```shell
docker run --rm \
  -v "$(pwd)/input/ppkeys_input.yaml:/inputVolume/ppkeys_input.yaml" \
  -v "$(pwd)/output:/output" \
  sophia921025/datasharing_ppkeys:local0.1
```

Pseudonymization and encryption: pseudonymize the personal identifiers (PI) used for linking multiple datasets, and encrypt the data files (pseudonymized PI + actual data). Go to the folder which contains the data file and encrypt_input.yaml, and configure encrypt_input.yaml first. Then in the terminal (Mac/Linux; please change "data_party_1.csv" in the second line to the name of your own data file):

```shell
docker run --rm \
  -v "$(pwd)/input/data_party_1.csv:/data_party_1.csv" \
  -v "$(pwd)/input/publicKey_dms.pem:/publicKey.pem" \
  -v "$(pwd)/input/encrypt_input.yaml:/inputVolume/encrypt_input.yaml" \
  -v "$(pwd)/output:/output" \
  sophia921025/datasharing_encdata:local0.1
```

Windows (cmd.exe; please change "data_party_1.csv" in the second line to the name of your own data file):

```shell
docker run --rm ^
  -v "%cd%/input/data_party_1.csv:/data_party_1.csv" ^
  -v "%cd%/input/encrypt_input.yaml:/inputVolume/encrypt_input.yaml" ^
  -v "%cd%/output:/output" ^
  sophia921025/datasharing_encdata:local0.1
```

After successful execution, your encrypted data file and key file (keys.json) will be stored locally or on the server (e.g., a trusted third party or trusted secure environment).

Sign your model file (Python script): the model file must be signed by all data parties. Create a folder that contains your model file and the input YAML file named in the command below (it needs to be configured first). Then in the terminal, Mac/Linux (please change "MLmodel_test.py" in the second line to the name of your own model file):

```shell
docker run --rm \
  -v "$(pwd)/input/MLmodel_test.py:/MLmodel_test.py" \
  -v "$(pwd)/input/sign_model_input.yaml:/inputVolume/sign_model_input.yaml" \
  -v "$(pwd)/output:/output" \
  datasharing_signmodel:local0.1
```

Windows (cmd.exe; please change "your_model.py" in the second line to the name of your own model file):

```shell
docker run --rm ^
  -v "%cd%/input/your_model.py:/your_model.py" ^
  -v "%cd%/input/encrypt_input.yaml:/inputVolume/encrypt_input.yaml" ^
  -v "%cd%/output:/output" ^
  sophia921025/datasharing_signmodel:local0.1
```
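The reason every data party signs the model file is that the Trusted Secure Environment should only run exactly the script that all parties reviewed. The sketch below shows that integrity idea with a plain SHA-256 digest comparison from the Python standard library. It is a simplification and not the project's scheme: a real signature uses each party's private key, which this example leaves out, and the function names and sample script are hypothetical.

```python
import hashlib


def file_digest(content):
    """SHA-256 digest of the model file's bytes."""
    return hashlib.sha256(content).hexdigest()


def approved_by_all(received, approvals):
    """True only if every party approved exactly the bytes that were received."""
    digest = file_digest(received)
    return bool(approvals) and all(d == digest for d in approvals.values())


# Hypothetical model script and the digests each party recorded after review.
model_source = b"print('train model')\n"
approvals = {
    "party_1": file_digest(model_source),
    "party_2": file_digest(model_source),
}

print(approved_by_all(model_source, approvals))

# Any change to the script after review invalidates every approval.
tampered = model_source + b"print(open('/data.csv').read())\n"
print(approved_by_all(tampered, approvals))
```

With real signatures, the comparison step is replaced by verifying each party's signature over the digest with that party's public key.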

At the Trusted Secure Environment (TSE), create a folder and put into it the encrypted data files from the data parties, security_input.yaml, analysis_input.yaml, and your analysis Python script (ML models). Configure security_input.yaml based on the keys from the data parties, and analysis_input.yaml based on your analysis requirements. In your terminal:

Mac/Linux:

```shell
docker run --rm \
  -v "$(pwd)/input:/input" \
  -v "$(pwd)/output:/output" \
  -v "$(pwd)/input/security_input.yaml:/inputVolume/security_input.yaml" \
  -v "$(pwd)/input/analysis_input.yaml:/inputVolume/analysis_input.yaml" \
  sophia921025/datasharing_tse:v0.1
```

Windows:

```shell
docker run --rm ^
  -v "%cd%/input:/input" ^
  -v "%cd%/output:/output" ^
  -v "%cd%/input/security_input.yaml:/inputVolume/security_input.yaml" ^
  sophia921025/datasharing_tse:v0.1
```

If the Docker container runs properly, you will see execution logs like those below. At the end, all results and the logging history (ppds.log) are stored in the output folder. To avoid leaking data through error output, any error messages raised during execution are saved in ppds.log instead of being printed on the screen.

```powershell
INFO ░ 2020-02-02 19:40:56,751 ░ verDec ░ verDec.py line 14 ▓ Reading request.yaml file...
INFO ░ 2020-02-02 19:40:56,944 ░ verDec ░ verDec.py line 111 ▓ Signed models has been verified successfully!
INFO ░ 2020-02-02 19:40:56,945 ░ verDec ░ verDec.py line 151 ▓ Verification and decryption took 0.3028s to run
...
...
...
INFO ░ 2020-01-19 10:25:05,619 ░ main ░ main.py line 272 ▓ In total, all models training took 16.6441 to run.
```
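Keeping error details in ppds.log rather than on the screen can be reproduced with Python's standard `logging` module. The sketch below is an assumption about how such behaviour might be set up, not the project's code: the format string is reconstructed from the sample log lines, and `build_logger`, `run_step`, and `divide` are hypothetical names.

```python
import logging
import os
import tempfile

# Format reconstructed from the sample log lines (an assumption).
LOG_FORMAT = (
    "%(levelname)s ░ %(asctime)s ░ %(module)s ░ "
    "%(filename)s line %(lineno)d ▓ %(message)s"
)


def build_logger(log_path, name="ppds"):
    """Create a logger that writes to a file only, never to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records away from any root console handler
    logger.handlers.clear()
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    return logger


def run_step(logger, step, *args):
    """Run one step; on failure, log the traceback and print a generic notice."""
    try:
        return step(*args)
    except Exception:
        # The traceback may contain data values, so it goes to the file only.
        logger.exception("Step %s failed", step.__name__)
        print("An error occurred; see the log file for details.")
        return None


def divide(a, b):
    return a / b


log_path = os.path.join(tempfile.mkdtemp(), "ppds.log")
logger = build_logger(log_path)
logger.info("Reading request.yaml file...")
result = run_step(logger, divide, 1, 0)

with open(log_path, encoding="utf-8") as handle:
    log_text = handle.read()
```

After this runs, the screen shows only the generic notice, while `log_text` holds the full `ZeroDivisionError` traceback.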

Owner

Login: OleMussmann
Kind: user

Repositories: 4
Profile: https://github.com/OleMussmann

Citation (CITATION.cff)

# YAML 1.2
---
abstract: |
    "FAIRHealth project is a collaboration between Maastricht University and
    Statistics Netherlands from Feb 2018 to Feb 2020. It is funded by Dutch
    National Research Agenda (NWA) under VWData program. In this project, we
    propose an innovative infrastructure for the secure and privacy-preserving
    analysis of personal health data from multiple providers with different
    governance policies.  The approach involves distributed machine learning to
    analyze vertically partitioned data (different variables/attributes/features
    about a particular individual are distributed over a set of data sources).
    
    The main idea of our infrastructure is to send data-processing or analysis
    algorithms to data sources rather than transferring data to the researchers.
    Only the final (verified) results can be return to the researchers. Our
    infrastructure is an extension of Personal Health Train Archtecture. The
    trains (applications) containing analytic algorithms are sent to the data
    stations (sources). The stations (sources) can inspect whether the train is
    allowed to execute the application on (a subset of) the available data."
authors: 
  -
    affiliation: "UMC+ Maastricht"
    family-names: Soest
    given-names: Johan
    name-particle: van
    name-suffix: PhD
    orcid: "https://orcid.org/0000-0003-2548-0330"
  -
    affiliation: "UMC+ Maastricht"
    family-names: Sun
    given-names: Chang
    name-suffix: MSc
    orcid: "https://orcid.org/0000-0001-8325-8848"
  -
    affiliation: "Statistics Netherlands (CBS)"
    family-names: Mussmann
    given-names: "Bjoern Ole"
    name-suffix: PhD
    orcid: "https://orcid.org/0000-0002-3803-4287"
cff-version: "1.1.0"
date-released: 2020-02-01
keywords: 
  - FAIRhealth
  - VWData
  - "privacy preserving analysis"
  - PPA
license: "Apache-2.0"
message: "If you use this software, please cite it using these metadata."
repository-code: "https://gitlab.com/CBDS/DataSharing"
title: FAIRHealth
version: "0.0.5"
...


Dependencies

DataPartyStation/EncData/Dockerfile (docker): datasharing_base v0.1 (build)
DataPartyStation/OverviewData/Dockerfile (docker): datasharing_base v0.1 (build)
DataPartyStation/PPKeys/Dockerfile (docker): datasharing_base v0.1 (build)
DataPartyStation/SignModel/Dockerfile (docker): datasharing_base v0.1 (build)
DataforTesting/Dockerfile (docker): datasharing/base 2020-01-15 (build)
TSEStation/Dockerfile (docker): datasharing_base v0.1 (build)
baseContainer/Dockerfile (docker): python 3.6.9-slim-stretch (build)
