ukbb-ehr-data

Prepare UK Biobank Electronic Health Record data for research

https://github.com/philipdarke/ukbb-ehr-data

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 7 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.1%) to scientific vocabulary

Keywords

ehr electronic-health-records healthcare uk-biobank
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: philipdarke
  • License: mit
  • Language: R
  • Default Branch: main
  • Homepage:
  • Size: 252 KB
Statistics
  • Stars: 26
  • Watchers: 1
  • Forks: 5
  • Open Issues: 0
  • Releases: 1
Topics
ehr electronic-health-records healthcare uk-biobank
Created over 4 years ago · Last pushed over 3 years ago
Metadata Files
Readme License Citation

README.md

Prepare UK Biobank EHR data for research

Clean and prepare UK Biobank primary care EHR for research. Tested with the interim EHR data release.

Installation

  1. Install the ukbbhelpr R package from here.
  2. Clone the EHR code set repository here.
  3. Clone this repository and follow the instructions below.

The following R packages are required. Install them using:

```R
required <- c("zoo", "dplyr", "plyr", "ggplot2", "cowplot")
optional <- c("caret", "QDiabetes", "survival")
install.packages(required)
install.packages(optional)  # needed to run code in the paper directory
```
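If some of these packages are already present, it may be preferable to install only the missing ones. The following is a minimal sketch of that idea; the helper `missing_packages()` is a hypothetical name introduced here and is not part of this repository.

```r
# Packages listed in the README; "optional" is only needed for the paper directory
required <- c("zoo", "dplyr", "plyr", "ggplot2", "cowplot")
optional <- c("caret", "QDiabetes", "survival")

# Hypothetical helper: return the packages in `pkgs` that are not yet installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(c(required, optional))
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install
}
```

The call to `install.packages()` is left commented out so the check can be run without changing the local library.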

UK Biobank data

Download the data for your UK Biobank application from the data showcase. The following fields are required to process the primary care EHR data:

Description | Field
----------- | -----
Year and month of birth | 34, 52
Date of assessment centre visit | 53
Linked date of death | 40000

The fields below are required to run the code in the 02_extract_records and paper directories:

Description | Field
----------- | -----
Demographic data | 31, 189, 21000
Anthropometric measurements | 48, 50, 21002
HbA1c blood glucose | 30750
Self-reported non-cancer medical history | 2986, 20002, 20003, 20008
Smoking history | 1249, 2887, 3456, 20116
Summary secondary care data | 41270, 41271, 41272, 41273, 41280, 41281, 41282, 41283

:warning: Edit 01_prepare_data/01_subset_visit_data.R if any of the optional fields above are unavailable.
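Before running the scripts, it can help to check which of the fields listed above are present in your download. The sketch below assumes the column names follow the `<field>-<instance>.<array>` pattern (for example `34-0.0`); your extract may use a different naming scheme depending on how it was converted. The helper names are hypothetical and not part of this repository.

```r
# Field IDs taken from the tables above
required_fields <- c(34, 52, 53, 40000)
optional_fields <- c(31, 189, 21000, 48, 50, 21002, 30750,
                     2986, 20002, 20003, 20008,
                     1249, 2887, 3456, 20116,
                     41270, 41271, 41272, 41273,
                     41280, 41281, 41282, 41283)

# Extract field IDs from column names such as "34-0.0" (assumed format)
fields_in_columns <- function(col_names) {
  matched <- grep("^[0-9]+-", col_names, value = TRUE)
  unique(as.integer(sub("^([0-9]+)-.*$", "\\1", matched)))
}

# Return the wanted field IDs that do not appear in the column names
missing_fields <- function(col_names, wanted) {
  setdiff(wanted, fields_in_columns(col_names))
}

# Example with made-up column names
cols <- c("eid", "34-0.0", "52-0.0", "53-0.0", "53-1.0", "31-0.0")
missing_fields(cols, required_fields)
missing_fields(cols, optional_fields)
```

In practice the column names could be taken from the header of the downloaded file, for example with `names(read.csv(path, nrows = 1, check.names = FALSE))`, where `path` is a placeholder for your own file.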

In addition, the primary care data is required:

Description | File
----------- | ----
Participant registration records | gp_registrations.txt
Clinical event records | gp_clinical.txt
Prescription records | gp_scripts.txt
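To get a feel for the primary care files before processing them, a small sample can be read directly. The sketch below assumes `gp_clinical.txt` is tab-delimited with the columns shown and that dates are in `DD/MM/YYYY` format; check these assumptions against your own download. The rows are made up for illustration and the scripts in this repository may read the files differently.

```r
# Write a tiny made-up sample so the example runs without UK Biobank data
sample_path <- tempfile(fileext = ".txt")
writeLines(c(
  "eid\tdata_provider\tevent_dt\tread_2\tread_3\tvalue1\tvalue2\tvalue3",
  "1000001\t3\t01/02/2010\t\tXaPbt\t48\t\t",
  "1000002\t1\t15/06/2005\t42W5.\t\t6.1\t\t"
), sample_path)

# Read every column as character so clinical codes are not coerced to numbers
gp_clinical <- read.delim(sample_path, colClasses = "character",
                          na.strings = "", quote = "")

# Convert the event date (assumed DD/MM/YYYY) to a Date
gp_clinical$event_dt <- as.Date(gp_clinical$event_dt, format = "%d/%m/%Y")

str(gp_clinical)
```

For the full files, passing `nrows` to `read.delim()` limits how much is read while exploring.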

Prepare the data for research

  1. Update file_paths.R with the paths to your downloaded data.
  2. Run the scripts in the 01_prepare_data directory sequentially to infer periods of data collection for each participant. The results are saved in data/data_period.rds by default.
  3. Run the scripts in the 02_extract_records directory sequentially to extract the files marked * in the table below.

Alternatively, run run_all.R instead of steps 2 and 3.
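Running the scripts in a directory sequentially can be sketched as below. This assumes the numeric file name prefixes (such as `01_`) give the intended order and that the working directory is the repository root; `run_directory()` is a hypothetical helper and not the implementation of run_all.R.

```r
# Source every .R script in a directory in alphabetical (numbered) order
run_directory <- function(dir) {
  scripts <- sort(list.files(dir, pattern = "\\.R$", full.names = TRUE))
  for (script in scripts) {
    message("Running ", script)
    source(script)
  }
  invisible(scripts)
}

# Uncomment to run from the repository root:
# run_directory("01_prepare_data")
# run_directory("02_extract_records")
```

Sourcing the scripts one at a time also makes it easier to see which step fails if memory runs out.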

:warning: The EHR data are large files and run_all.R in particular is very memory intensive. Use of a high performance computing service is recommended. UK Biobank data must be stored and processed as required under the Material Transfer Agreement.

Tested with the September 2019 interim EHR release on an Intel Xeon E5-2699 v4 processor (2.2 GHz, 22 cores, 55 MB cache) with 256 GB RAM, running R 3.6 on CentOS Linux 7. The code has not been tested on R 4.0+.
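Because the code has not been tested on R 4.0+, a quick version check before starting a long run may be useful. This is a minimal sketch; `check_r_version()` is a hypothetical helper introduced here.

```r
# Report whether the running R version falls in the untested range (4.0.0 or later)
check_r_version <- function(version = getRversion()) {
  untested <- version >= "4.0.0"
  if (untested) {
    message("R ", as.character(version),
            " detected; the README reports testing on R 3.6 only.")
  }
  invisible(untested)
}

check_r_version()
```

One known change in R 4.0.0 is that `stringsAsFactors` defaults to `FALSE` in `data.frame()` and `read.table()`, so results that depend on factor columns are worth checking on newer versions.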

Output summary

The following files are saved in the data directory by default:

File | Description
---- | -----------
data_period.rds | Period(s) of EHR data collection for each participant
gp_event.rds | Clean event/diagnosis data
gp_presc.rds | Clean prescription data
biomarkers.rds* | Extracted biomarkers
demographic.rds* | Ethnicity, smoking history and Townsend deprivation
family_history.rds* | Family history data
diagnoses.rds* | Extracted diagnosis codes for a range of common conditions
prescriptions.rds* | Estimated periods during which selected drugs were prescribed

Files marked * are generated by the scripts in the 02_extract_records directory.
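Once the scripts have finished, the saved files can be loaded with `readRDS()`. The sketch below loads whichever of the files listed above exist into a named list; `load_outputs()` is a hypothetical helper, and the structure of each object is not described here, so inspect it with `str()` after loading.

```r
# Output files listed in the table above
expected_outputs <- c("data_period.rds", "gp_event.rds", "gp_presc.rds",
                      "biomarkers.rds", "demographic.rds", "family_history.rds",
                      "diagnoses.rds", "prescriptions.rds")

# Load whichever expected outputs exist in `data_dir` into a named list
load_outputs <- function(data_dir = "data") {
  paths <- file.path(data_dir, expected_outputs)
  found <- paths[file.exists(paths)]
  outputs <- lapply(found, readRDS)
  names(outputs) <- sub("\\.rds$", "", basename(found))
  outputs
}

outputs <- load_outputs()
names(outputs)  # names of the files that were found
```

Loading every file at once may itself need a lot of memory; reading a single file with `readRDS("data/data_period.rds")` is the lighter option.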

Visualising the results

Estimating periods of EHR data collection

visualisation/01_algorithm.R can be used to plot the results of the algorithm used to infer periods of EHR data collection for a participant.

Data collection algorithm example

Diabetes phenotyping case study

visualisation/02_phenotyping.R can be used to plot the results of the diabetes phenotyping algorithm. paper/02_diabetes_phenotyping.R must be run first.

Example output from diabetes phenotyping tool

Citing this work

If you use this work, please cite it as below:

```bibtex
@article{10.1093/jamia/ocab260,
  author = {Darke, Philip and Cassidy, Sophie and Catt, Michael and Taylor, Roy and Missier, Paolo and Bacardit, Jaume},
  title = "{Curating a longitudinal research resource using linked primary care EHR data - a UK Biobank case study}",
  journal = {Journal of the American Medical Informatics Association},
  volume = {29},
  number = {3},
  pages = {546-552},
  year = {2021},
  month = {12},
  issn = {1527-974X},
  doi = {10.1093/jamia/ocab260},
  url = {https://doi.org/10.1093/jamia/ocab260},
  eprint = {https://academic.oup.com/jamia/article-pdf/29/3/546/42333190/ocab260.pdf},
}
```

Licence

Made available under the MIT Licence.

Owner

  • Name: Philip Darke
  • Login: philipdarke
  • Kind: user
  • Company: Newcastle University

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Darke"
  given-names: "Philip"
  orcid: "https://orcid.org/0000-0002-9033-2767"
- family-names: "Cassidy"
  given-names: "Sophie"
  orcid: "https://orcid.org/0000-0002-0228-7274"
- family-names: "Catt"
  given-names: "Michael"
- family-names: "Taylor"
  given-names: "Roy"
- family-names: "Missier"
  given-names: "Paolo"
- family-names: "Bacardit"
  given-names: "Jaume"
  orcid: "https://orcid.org/0000-0002-2692-7205"
title: "Curating a longitudinal research resource using linked primary care EHR data - a UK Biobank case study"
version: 1.0.0
doi: 10.1093/jamia/ocab260
date-released: 2020-12-13
url: "https://doi.org/10.1093/jamia/ocab260"

GitHub Events

Total
  • Watch event: 2
  • Fork event: 2
Last Year
  • Watch event: 2
  • Fork event: 2

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 16
  • Total Committers: 1
  • Avg Commits per committer: 16.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Philip Darke 4****e 16

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 4
  • Total pull requests: 0
  • Average time to close issues: 2 months
  • Average time to close pull requests: N/A
  • Total issue authors: 4
  • Total pull request authors: 0
  • Average comments per issue: 1.75
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • erenelci (1)
  • RoyeSie (1)
  • xiaonanl1996 (1)
  • Gizmodiat (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

poetry.lock pypi
  • numpy 1.23.2
  • pandas 1.4.3
  • python-dateutil 2.8.2
  • pytz 2022.2.1
  • six 1.16.0
pyproject.toml pypi
  • pandas ^1.4.3
  • python ^3.8