https://github.com/aehrc/fhir-analytics-pipeline

An example of how to build a pipeline for extracting, transforming and analyzing FHIR data.

Science Score: 39.0%

This score indicates how likely this project is to be science-related based on various indicators:

- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 2 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (6.8%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: aehrc
License: apache-2.0
Language: Python
Default Branch: main
Size: 278 MB

Statistics

Stars: 6
Watchers: 7
Forks: 1
Open Issues: 0
Releases: 1

Created about 3 years ago · Last pushed about 3 years ago

Metadata Files

Readme License

FHIR Analytics Pipeline

An example of how to build a pipeline for extracting, transforming and analyzing FHIR data.

Use case

The pipeline addresses the use case of self-service analytics for COVID-19 management.

In particular, we want a data analyst with some knowledge of SQL, armed with BI tools such as Power BI, to be able to leverage FHIR data and answer questions like: "How many non-vaccinated patients are there in various areas, stratified by a customisable risk score?"

To enable this use case, we will build a pipeline that extracts a simple, tabular, use-case centric view of the data from a FHIR server and loads it into an SQL database.

The use-case centric view includes the following columns:

- id: patient's id
- gender: patient's gender
- birthDate: patient's date of birth
- postalCode: patient's address postal code
- hasHD: has the patient ever been diagnosed with a heart disease
- hasCKD: has the patient ever been diagnosed with a chronic kidney disease
- hasBMIOver30: has the patient ever had a BMI over 30
- isCovidVaccinated: has the patient been vaccinated with any of the COVID-19 vaccines
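As a rough illustration of the analyst's question above, the following sketch loads a few made-up rows into an in-memory SQLite table and counts non-vaccinated patients per postal code and risk score. Only the column names come from the view; the table name `covid19_view`, the sample rows and the risk score (a plain sum of the three condition flags) are assumptions chosen for the example.

```python
import sqlite3

# Made-up sample rows; the columns follow the use-case centric view.
ROWS = [
    ("p1", "female", "1950-03-01", "02101", 1, 0, 1, 0),
    ("p2", "male", "1985-07-15", "02101", 0, 0, 0, 0),
    ("p3", "male", "1942-11-30", "02139", 1, 1, 1, 0),
    ("p4", "female", "1990-01-20", "02139", 0, 0, 1, 1),
]

# Hypothetical risk score: the number of risk factors present (0 to 3).
QUERY = """
SELECT postalCode,
       hasHD + hasCKD + hasBMIOver30 AS riskScore,
       COUNT(*) AS patients
FROM covid19_view
WHERE isCovidVaccinated = 0
GROUP BY postalCode, hasHD + hasCKD + hasBMIOver30
ORDER BY postalCode, riskScore
"""


def unvaccinated_by_area_and_risk(rows):
    """Count non-vaccinated patients per (postalCode, riskScore)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE covid19_view ("
        "id TEXT, gender TEXT, birthDate TEXT, postalCode TEXT, "
        "hasHD INTEGER, hasCKD INTEGER, hasBMIOver30 INTEGER, "
        "isCovidVaccinated INTEGER)"
    )
    conn.executemany("INSERT INTO covid19_view VALUES (?,?,?,?,?,?,?,?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in unvaccinated_by_area_and_risk(ROWS):
    print(row)
```

Because the score is computed in the query rather than stored in the view, an analyst can swap in a different weighting without touching the pipeline.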

Design and implementation

The design of the pipeline is shown in the figure below.

Pipeline design

The pipeline consists of the following steps:

- Extracting the ndjson data from the FHIR server using the FHIR Bulk Export API
- Encoding ndjson resource data as Apache Spark data frames and storing them as tables in a Delta Lake with an SQL projection similar to SQL on FHIR
- Transforming the data using FHIRPath into the use-case centric view and loading it into an SQL database
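The extraction step can be sketched without any network access. The helpers below are not taken from the repository's notebooks; they only show the shape of a system-level Bulk Data kick-off request (`$export` with the `Prefer: respond-async` header) and how the ndjson file URLs in a completed export manifest could be grouped by resource type. The function names, the base URL and the sample manifest are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlencode


def build_export_request(base_url, resource_types):
    """Return (url, headers) for a system-level Bulk Data kick-off request."""
    # Keep commas unescaped so the _type list stays readable.
    query = urlencode({"_type": ",".join(resource_types)}, safe=",")
    url = f"{base_url.rstrip('/')}/$export?{query}"
    headers = {"Accept": "application/fhir+json", "Prefer": "respond-async"}
    return url, headers


def urls_by_resource_type(manifest):
    """Group the ndjson file URLs of a completed export by resource type."""
    grouped = defaultdict(list)
    for entry in manifest.get("output", []):
        grouped[entry["type"]].append(entry["url"])
    return dict(grouped)


# Hypothetical server and manifest, for illustration only.
url, headers = build_export_request(
    "https://example.org/fhir/", ["Patient", "Condition"]
)
sample_manifest = {
    "output": [
        {"type": "Patient", "url": "https://example.org/files/1.Patient.ndjson"},
        {"type": "Condition", "url": "https://example.org/files/1.Condition.ndjson"},
        {"type": "Patient", "url": "https://example.org/files/2.Patient.ndjson"},
    ]
}
print(url)
print(urls_by_resource_type(sample_manifest))
```

In a real run the kick-off response points to a status endpoint that is polled until the manifest becomes available; that polling loop is left out here.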

The pipeline is implemented using Databricks notebooks and, in addition to Apache Spark, relies heavily on the Pathling Python libraries.

In particular, it leverages Pathling to encode FHIR resources as Spark data frames, and to transform the FHIR data to the use-case centric view with FHIRPath expressions and the extract() operation. The following code snippet shows how the transformation is performed for the COVID-19 use case:

```python
# Assumed import: `exp` is taken to be an alias for Pathling's Expression class.
from pathling import Expression as exp

# fhir_ds is assumed to be the Pathling data source with the encoded FHIR resources.
covid19_view_df = fhir_ds.extract(
    'Patient',
    columns=[
        exp("id"),
        exp("gender"),
        exp("birthDate"),
        exp("address.postalCode.first()").alias("postalCode"),
        exp("reverseResolve(Condition.subject).exists(code.subsumedBy(http://snomed.info/sct|56265001))").alias("hasHD"),
        exp("reverseResolve(Condition.subject).exists(code.subsumedBy(http://snomed.info/sct|709044004))").alias("hasCKD"),
        exp("reverseResolve(Observation.subject).where(code.subsumedBy(http://loinc.org|39156-5)).exists(valueQuantity > 30 'kg/m2')").alias("hasBMIOver30"),
        exp("reverseResolve(Immunization.patient).vaccineCode.memberOf('https://aehrc.csiro.au/fhir/ValueSet/covid-19-vaccines').anyTrue()").alias("isCovidVaccinated"),
    ],
    filters=[
        "address.country.first() = 'US'"
    ],
)
```
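To make the intent of two of these expressions more concrete, here is a plain-Python approximation over FHIR resources held as dictionaries. It is not how Pathling evaluates FHIRPath and it performs no terminology lookups: the set of vaccine codes is a hypothetical stand-in for the value set referenced above, and the helper names and sample resources are invented for the example.

```python
# Hypothetical stand-in for the COVID-19 vaccine value set (illustration only).
COVID_VACCINE_CODES = {
    ("http://hl7.org/fhir/sid/cvx", "207"),
    ("http://hl7.org/fhir/sid/cvx", "208"),
}


def first_postal_code(patient):
    """Approximates address.postalCode.first()."""
    for address in patient.get("address", []):
        if "postalCode" in address:
            return address["postalCode"]
    return None


def is_covid_vaccinated(patient, immunizations):
    """Approximates reverseResolve(Immunization.patient) plus a code membership test."""
    reference = f"Patient/{patient['id']}"
    for immunization in immunizations:
        if immunization.get("patient", {}).get("reference") != reference:
            continue
        for coding in immunization.get("vaccineCode", {}).get("coding", []):
            if (coding.get("system"), coding.get("code")) in COVID_VACCINE_CODES:
                return True
    return False


def to_view_row(patient, immunizations):
    """Build a partial row of the use-case centric view for one patient."""
    return {
        "id": patient["id"],
        "gender": patient.get("gender"),
        "birthDate": patient.get("birthDate"),
        "postalCode": first_postal_code(patient),
        "isCovidVaccinated": is_covid_vaccinated(patient, immunizations),
    }


# Invented sample resources.
patient = {
    "resourceType": "Patient",
    "id": "p1",
    "gender": "female",
    "birthDate": "1950-03-01",
    "address": [{"city": "Boston"}, {"postalCode": "02101", "country": "US"}],
}
immunizations = [
    {
        "resourceType": "Immunization",
        "patient": {"reference": "Patient/p1"},
        "vaccineCode": {
            "coding": [{"system": "http://hl7.org/fhir/sid/cvx", "code": "208"}]
        },
    }
]
print(to_view_row(patient, immunizations))
```

The condition and BMI columns are left out because they depend on subsumption checks against a terminology server, which a dictionary lookup cannot reproduce.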

The project

The project has the following structure:

- The etl directory contains Databricks notebooks with the implementation of the pipeline.
- The demos directory contains Databricks notebooks that explore various aspects of working with FHIR data, such as:
  - transforming with FHIRPath and (Spark) SQL
  - incorporating terminology queries
  - working with extensions
  - working with recursive structures such as QuestionnaireResponse and Questionnaire
- The ctrl directory contains notebooks that can be used to set up and clean up the environment.
- The analytics directory contains notebooks that provide examples of performing simple analytics.

For instructions on how to set up Pathling in a Databricks environment, see the Pathling documentation.

The pipeline requires Pathling version 6.2.1 or later.
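A notebook could guard against an older installation with a small check like the one below. This is a sketch rather than part of the project: the helper names are hypothetical, and only the leading numeric part of the version string is compared, so pre-release suffixes are ignored.

```python
import re

# Minimum version stated above.
MINIMUM_PATHLING = (6, 2, 1)


def parse_version(text):
    """Return the leading numeric release segment, e.g. '6.2.1' -> (6, 2, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"Unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum=MINIMUM_PATHLING):
    """True if the installed version is at least the minimum."""
    parts = parse_version(installed)
    width = max(len(parts), len(minimum))
    # Pad with zeros so that '7.0' and '7.0.0' compare as equal.
    padded_installed = parts + (0,) * (width - len(parts))
    padded_minimum = tuple(minimum) + (0,) * (width - len(minimum))
    return padded_installed >= padded_minimum


print(meets_minimum("6.2.1"), meets_minimum("6.2.0"))
```

In a notebook, the installed version string could be read with `importlib.metadata.version("pathling")` and passed to `meets_minimum`.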

Before running the notebooks in demos and etl, please run ctrl/SetUp to prepare the workspace (mainly to download the example data).

Workflow

The notebooks implementing the pipeline steps in the etl directory can be orchestrated into an automated pipeline with a Databricks Workflow.
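One way to picture such a workflow is as a job definition in which each notebook task depends on the previous one. The sketch below builds a payload in the style of the Databricks Jobs API (`tasks`, `notebook_task`, `depends_on`). The notebook paths and the job name are hypothetical, since the actual notebook names in etl are not listed here, and cluster settings are omitted.

```python
import json


def build_job_definition(name, notebook_paths):
    """Chain notebooks into sequential tasks, each depending on the previous one."""
    tasks = []
    previous_key = None
    for path in notebook_paths:
        # Use the notebook's file name as the task key.
        task_key = path.rsplit("/", 1)[-1]
        task = {"task_key": task_key, "notebook_task": {"notebook_path": path}}
        if previous_key is not None:
            task["depends_on"] = [{"task_key": previous_key}]
        tasks.append(task)
        previous_key = task_key
    return {"name": name, "tasks": tasks}


# Hypothetical notebook paths mirroring the extract, encode and transform steps.
job = build_job_definition(
    "fhir-analytics-pipeline",
    [
        "/Repos/example/fhir-analytics-pipeline/etl/Extract",
        "/Repos/example/fhir-analytics-pipeline/etl/Encode",
        "/Repos/example/fhir-analytics-pipeline/etl/Transform",
    ],
)
print(json.dumps(job, indent=2))
```

The same structure can also be assembled interactively in the Databricks Workflows UI.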

Pipeline Workflow

The screencasts (part1 and part2) demonstrate how to create such a workflow, which connects to the 1000-patient database from https://bulk-data.smarthealthit.org/ and creates the use-case centric view in the devdays_sql_1000 schema.

Analytics

Please watch the screencast to see how Power BI can be used to connect to the COVID-19 view in the Databricks Delta Lake SQL database, perform some simple analytics, and produce the COVID-19 risk map below:

COVID-19 Risk Map

Owner

Name: The Australian e-Health Research Centre
Login: aehrc
Kind: organization

Website: https://aehrc.com
Twitter: ehealthresearch
Repositories: 101
Profile: https://github.com/aehrc

The Australian e-Health Research Centre (AEHRC) is CSIRO’s digital health research program.

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
