ukbb-ehr-data

Prepare UK Biobank Electronic Health Record data for research

https://github.com/philipdarke/ukbb-ehr-data

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 7 DOI reference(s) in README
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.1%) to scientific vocabulary

Keywords

ehr electronic-health-records healthcare uk-biobank

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: philipdarke
License: mit
Language: R
Default Branch: main
Homepage:
Size: 252 KB

Statistics

Stars: 26
Watchers: 1
Forks: 5
Open Issues: 0
Releases: 1

Topics

ehr electronic-health-records healthcare uk-biobank

Created over 4 years ago · Last pushed over 3 years ago

Metadata Files

Readme License Citation

Prepare UK Biobank EHR data for research

Clean and prepare UK Biobank primary care EHR for research. Tested with the interim EHR data release.

Installation

1. Install the ukbbhelpr R package.
2. Clone the EHR code set repository.
3. Clone this repository (https://github.com/philipdarke/ukbb-ehr-data) and follow the instructions below.

The following R packages are required. Install them using:

```r
required <- c("zoo", "dplyr", "plyr", "ggplot2", "cowplot")
optional <- c("caret", "QDiabetes", "survival")
install.packages(required)
install.packages(optional)  # needed to run code in the paper directory
```
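On a shared or high performance computing system some of these packages may already be installed. The following is a minimal sketch of a check that lists only the missing ones. `missing_packages` is a hypothetical helper written for illustration and is not part of this repository.

```r
# Hypothetical helper: return the packages in `wanted` that are not installed.
# `installed` defaults to the names of all packages R can currently find.
missing_packages <- function(wanted,
                             installed = rownames(installed.packages())) {
  setdiff(wanted, installed)
}

required <- c("zoo", "dplyr", "plyr", "ggplot2", "cowplot")
to_install <- missing_packages(required)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Passing `installed` explicitly makes the helper easy to check without touching the local library.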

UK Biobank data

Download the data for your UK Biobank application from the data showcase. The following fields are required to process the primary care EHR data:

Description | Field
----------- | -----
Year and month of birth | 34, 52
Date of assessment centre visit | 53
Linked date of death | 40000

The fields below are required to run the code in the 02_extract_records and paper directories:

Description | Field
----------- | -----
Demographic data | 31, 189, 21000
Anthropometric measurements | 48, 50, 21002
HbA1c blood glucose | 30750
Self-reported non-cancer medical history | 2986, 20002, 20003, 20008
Smoking history | 1249, 2887, 3456, 20116
Summary secondary care data | 41270, 41271, 41272, 41273, 41280, 41281, 41282, 41283

:warning: Edit 01_prepare_data/01_subset_visit_data.R if any of the optional fields above are unavailable.
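Before running the scripts it can help to confirm which fields are present in your download. The sketch below assumes the columns are named `<field>-<instance>.<array>` (for example `34-0.0`), as in a UK Biobank CSV export. That naming scheme, the helper `missing_fields` and the example column names are assumptions for illustration, so adjust the pattern if your data uses a different format.

```r
# Field IDs required to process the primary care EHR data (from the table above).
required_fields <- c(34L, 52L, 53L, 40000L)

# Hypothetical helper: return the field IDs that have no matching column.
# Assumes column names of the form "<field>-<instance>.<array>".
missing_fields <- function(col_names, fields) {
  present <- sub("-.*$", "", col_names)  # drop the "-instance.array" suffix
  fields[!(as.character(fields) %in% present)]
}

# Made-up column names standing in for names(your_data)
cols <- c("eid", "34-0.0", "52-0.0", "53-0.0", "53-1.0")
print(missing_fields(cols, required_fields))
```

Here the linked date of death (field 40000) would be reported as missing because no column starts with that ID.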

In addition, the primary care data is required:

Description | File
----------- | ----
Participant registration records | gp_registrations.txt
Clinical event records | gp_clinical.txt
Prescription records | gp_scripts.txt
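To inspect one of these files before running the pipeline, a sketch along the following lines may be useful. It assumes that `gp_clinical.txt` is tab-delimited with the columns shown and that dates are in day/month/year format. These details are assumptions about the UK Biobank release and are not stated in this README, and the two sample rows are made up.

```r
# Made-up two-row sample in the ASSUMED layout of gp_clinical.txt
sample_txt <- paste(
  "eid\tdata_provider\tevent_dt\tread_2\tread_3\tvalue1\tvalue2\tvalue3",
  "1000001\t3\t14/02/2005\t\tXaPbt\t48\t\t",
  "1000001\t3\t01/09/2010\t\t22A..\t81.5\t\t",
  sep = "\n"
)

# Read everything as character so clinical codes are not altered,
# and treat empty fields as missing.
gp_clinical <- read.delim(
  text = sample_txt,
  colClasses = "character",
  na.strings = ""
)

# Convert the assumed dd/mm/yyyy event date to a Date
gp_clinical$event_dt <- as.Date(gp_clinical$event_dt, format = "%d/%m/%Y")
str(gp_clinical)
```

For the real file, replace `text = sample_txt` with the path to `gp_clinical.txt`. The scripts in this repository perform their own loading and cleaning, so this is only for a quick look at the raw data.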

Prepare the data for research

1. Update file_paths.R with the paths to your downloaded data.
2. Run the scripts in the 01_prepare_data directory sequentially to infer periods of data collection for each participant. The results are saved in data/data_period.rds by default.
3. Run the scripts in the 02_extract_records directory sequentially to extract the files marked * in the table below.

Alternatively, run run_all.R instead of steps 2 and 3.
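Running a directory of scripts in sequence can be sketched as below. `run_scripts` is a hypothetical helper, not part of this repository. It assumes the numeric file-name prefixes (as in 01_subset_visit_data.R) define the intended order and that R is started in the repository root.

```r
# Hypothetical helper: source every .R script in `dir` in file-name order.
run_scripts <- function(dir) {
  scripts <- sort(list.files(dir, pattern = "\\.R$", full.names = TRUE))
  for (script in scripts) {
    message("Running ", script)
    source(script, local = FALSE)  # evaluate in the global environment
  }
  invisible(scripts)
}

# Usage (from the repository root):
# run_scripts("01_prepare_data")
# run_scripts("02_extract_records")
```

Stopping at the first error, which is the default behaviour of `source()`, avoids running later scripts on incomplete intermediate data.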

:warning: The EHR data are large files and run_all.R in particular is very memory intensive. Use of a high performance computing service is recommended. UK Biobank data must be stored and processed as required under the Material Transfer Agreement.

Tested with the September 2019 interim EHR release on an Intel Xeon E5-2699 v4 processor (2.2 GHz, 22 cores, 55 MB cache) with 256 GB RAM running R 3.6 on CentOS Linux 7. The code has not been tested on R 4.0+.

Output summary

The following files are saved in the data directory by default:

File | Description
---- | -----------
data_period.rds | Period(s) of EHR data collection for each participant
gp_event.rds | Clean event/diagnosis data
gp_presc.rds | Clean prescription data
biomarkers.rds* | Extracted biomarkers
demographic.rds* | Ethnicity, smoking history and Townsend deprivation
family_history.rds* | Family history data
diagnoses.rds* | Extracted diagnosis codes for a range of common conditions
prescriptions.rds* | Estimated periods during which selected drugs were prescribed

Files marked * are generated by the scripts in the 02_extract_records directory.
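The outputs are standard .rds files, so they can be loaded with `readRDS()`. The sketch below reads whichever of the listed files exist. `load_outputs` is a hypothetical helper, and the structure of each object is not documented here, so inspect the result with `str()` before relying on particular columns.

```r
# Hypothetical helper: read whichever of the listed .rds files exist in `dir`
# and return them as a named list (names are the file names without ".rds").
load_outputs <- function(dir = "data",
                         files = c("data_period.rds",
                                   "gp_event.rds",
                                   "gp_presc.rds")) {
  paths <- file.path(dir, files)
  found <- file.exists(paths)
  if (any(!found)) {
    warning("Missing output file(s): ", paste(files[!found], collapse = ", "))
  }
  out <- lapply(paths[found], readRDS)
  names(out) <- sub("\\.rds$", "", files[found])
  out
}

# Usage (after the pipeline has run):
# outputs <- load_outputs("data")
# str(outputs$data_period)
```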

Visualising the results

Estimating periods of EHR data collection

visualisation/01_algorithm.R can be used to plot the results of the algorithm used to infer periods of EHR data collection for a participant.

Figure: Data collection algorithm example

Diabetes phenotyping case study

visualisation/02_phenotyping.R can be used to plot the results of the diabetes phenotyping algorithm. paper/02_diabetes_phenotyping.R must be run first.

Figure: Example output from diabetes phenotyping tool

Citing this work

If you use this work, please cite it as below:

```bibtex
@article{10.1093/jamia/ocab260,
  author = {Darke, Philip and Cassidy, Sophie and Catt, Michael and Taylor, Roy and Missier, Paolo and Bacardit, Jaume},
  title = "{Curating a longitudinal research resource using linked primary care EHR data - a UK Biobank case study}",
  journal = {Journal of the American Medical Informatics Association},
  volume = {29},
  number = {3},
  pages = {546-552},
  year = {2021},
  month = {12},
  issn = {1527-974X},
  doi = {10.1093/jamia/ocab260},
  url = {https://doi.org/10.1093/jamia/ocab260},
  eprint = {https://academic.oup.com/jamia/article-pdf/29/3/546/42333190/ocab260.pdf},
}
```

Licence

Made available under the MIT Licence.

Owner

Name: Philip Darke
Login: philipdarke
Kind: user
Company: Newcastle University

Website: philipdarke.com
Repositories: 4
Profile: https://github.com/philipdarke

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Darke"
  given-names: "Philip"
  orcid: "https://orcid.org/0000-0002-9033-2767"
- family-names: "Cassidy"
  given-names: "Sophie"
  orcid: "https://orcid.org/0000-0002-0228-7274"
- family-names: "Catt"
  given-names: "Michael"
- family-names: "Taylor"
  given-names: "Roy"
- family-names: "Missier"
  given-names: "Paolo"
- family-names: "Bacardit"
  given-names: "Jaume"
  orcid: "https://orcid.org/0000-0002-2692-7205"
title: "Curating a longitudinal research resource using linked primary care EHR data - a UK Biobank case study"
version: 1.0.0
doi: 10.1093/jamia/ocab260
date-released: 2020-12-13
url: "https://doi.org/10.1093/jamia/ocab260"

GitHub Events

Total

Watch event: 2
Fork event: 2

Last Year

Watch event: 2
Fork event: 2

Committers

Last synced: over 1 year ago

All Time

Total Commits: 16
Total Committers: 1
Avg Commits per committer: 16.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Philip Darke	4****e	16

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 4
Total pull requests: 0
Average time to close issues: 2 months
Average time to close pull requests: N/A
Total issue authors: 4
Total pull request authors: 0
Average comments per issue: 1.75
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

erenelci (1)
RoyeSie (1)
xiaonanl1996 (1)
Gizmodiat (1)

Dependencies

poetry.lock pypi

numpy 1.23.2
pandas 1.4.3
python-dateutil 2.8.2
pytz 2022.2.1
six 1.16.0

pyproject.toml pypi

pandas ^1.4.3
python ^3.8
