ukbb-ehr-data
Prepare UK Biobank Electronic Health Record data for research
Prepare UK Biobank EHR data for research
Clean and prepare UK Biobank primary care EHR for research. Tested with the interim EHR data release.
Installation
- Install the ukbbhelpr R package from here.
- Clone the EHR code set repository here.
- Clone this repository and follow the instructions below.
The following R packages are required. Install them using:
```R
required <- c("zoo", "dplyr", "plyr", "ggplot2", "cowplot")
optional <- c("caret", "QDiabetes", "survival")
install.packages(required)
install.packages(optional)  # needed to run code in the paper directory
```
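Before running the scripts, it can help to confirm which of the packages listed above are actually available. The following is a minimal sketch; `missing_packages` is a hypothetical helper and is not part of this repository.

```r
# Package names copied from the installation instructions above.
required <- c("zoo", "dplyr", "plyr", "ggplot2", "cowplot")
optional <- c("caret", "QDiabetes", "survival")

# Hypothetical helper: return the entries of `pkgs` that cannot be loaded.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing required packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```

The same check can be run on `optional` before using the code in the paper directory.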
UK Biobank data
Download the data for your UK Biobank application from the data showcase. The following fields are required to process the primary care EHR data:
Description | Field
----------- | -----
Year and month of birth | 34, 52
Date of assessment centre visit | 53
Linked date of death | 40000
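To check that a download contains the fields in the table above, the column names can be matched against the field IDs. The sketch below assumes the export names its columns `<field>-<instance>.<array>` (for example `53-0.0`) and stores the participant identifier in a column called `eid`; adjust the pattern if your export uses a different naming scheme. `select_fields` is a hypothetical helper.

```r
# Field IDs taken from the table above, kept as strings so that they are
# pasted into the pattern exactly as written.
core_fields <- c("34", "52", "53", "40000")

# Hypothetical helper: keep "eid" plus every column belonging to `fields`.
# Assumes column names of the form "<field>-<instance>.<array>".
select_fields <- function(col_names, fields) {
  pattern <- paste0("^(", paste(fields, collapse = "|"), ")-[0-9]+\\.[0-9]+$")
  col_names[col_names == "eid" | grepl(pattern, col_names)]
}

# Toy header standing in for the first line of a real export.
header <- c("eid", "31-0.0", "34-0.0", "52-0.0", "53-0.0", "53-1.0",
            "40000-0.0", "400-0.0")
print(select_fields(header, core_fields))
```

The pattern is anchored at both ends so that a field such as 400 is not mistaken for field 40000.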
The fields below are required to run the code in the 02_extract_records and paper directories:
Description | Field
----------- | -----
Demographic data | 31, 189, 21000
Anthropometric measurements | 48, 50, 21002
HbA1c blood glucose | 30750
Self-reported non-cancer medical history | 2986, 20002, 20003, 20008
Smoking history | 1249, 2887, 3456, 20116
Summary secondary care data | 41270, 41271, 41272, 41273, 41280, 41281, 41282, 41283
:warning: Edit 01_prepare_data/01_subset_visit_data.R if any of the optional fields above are unavailable.
In addition, the primary care data are required:
Description | File
----------- | ----
Participant registration records | gp_registrations.txt
Clinical event records | gp_clinical.txt
Prescription records | gp_scripts.txt
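The sketch below shows one way to take a first look at a primary care file before running the pipeline. It assumes the files are tab-separated with a header row and that dates are written as dd/mm/yyyy. The column names and rows are made up for illustration and should be checked against your own download. `read_gp_file` is a hypothetical helper, not a function from this repository.

```r
# Two made-up rows in the assumed layout of gp_clinical.txt.
# Column names and values are placeholders, not taken from this README.
sample_rows <- c(
  "eid\tdata_provider\tevent_dt\tread_2\tread_3\tvalue1\tvalue2\tvalue3",
  "1000001\t3\t01/02/2010\t\tX0001\t48\t\t",
  "1000002\t1\t15/07/2012\tA000.\t\t\t\t"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_rows, path)

# Hypothetical helper: read every column as text so that clinical codes keep
# leading zeros and trailing dots, and treat empty fields as missing.
read_gp_file <- function(path) {
  read.delim(path, quote = "", colClasses = "character", na.strings = "")
}

clinical <- read_gp_file(path)
clinical$event_dt <- as.Date(clinical$event_dt, format = "%d/%m/%Y")
str(clinical)
```

For the full files, reading a limited number of rows first (the `nrows` argument of `read.delim`) keeps memory use manageable while checking the layout.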
Prepare the data for research
1. Update file_paths.R with the paths to your downloaded data.
2. Run the scripts in the 01_prepare_data directory sequentially to infer periods of data collection for each participant. The results are saved in data/data_period.rds by default.
3. Run the scripts in the 02_extract_records directory sequentially to extract the files marked * in the table below.

Alternatively, run_all.R can be run instead of steps 2 and 3.
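Running the scripts in a directory sequentially can be sketched as follows. `run_scripts` is a hypothetical helper and is not how run_all.R is implemented. It assumes every script in the directory has a numeric prefix (as in 01_subset_visit_data.R), so that alphabetical order is the intended order.

```r
# Hypothetical helper: source every .R script in `dir` in alphabetical order.
# Scripts are evaluated in the global environment, as if run one by one.
run_scripts <- function(dir) {
  scripts <- sort(list.files(dir, pattern = "\\.R$", full.names = TRUE))
  for (script in scripts) {
    message("Running ", basename(script))
    source(script)
  }
  invisible(basename(scripts))
}

# Example, assuming the working directory is the repository root:
# run_scripts("01_prepare_data")
# run_scripts("02_extract_records")
```

Because `source()` stops at the first error, a failure in one script prevents later scripts from running on incomplete inputs.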
:warning: The EHR data are large files and run_all.R in particular is very memory intensive. Use of a high performance computing service is recommended. UK Biobank data must be stored and processed as required under the Material Transfer Agreement.
Tested with the September 2019 interim EHR release on an Intel Xeon E5-2699 v4 processor (2.2 GHz, 22 cores, 55 MB cache) with 256 GB RAM running R 3.6 on CentOS Linux 7. The code has not been tested on R 4.0+.
Output summary
The following files are saved in the data directory by default:
File | Description
---- | -----------
data_period.rds | Period(s) of EHR data collection for each participant
gp_event.rds | Clean event/diagnosis data
gp_presc.rds | Clean prescription data
biomarkers.rds* | Extracted biomarkers
demographic.rds* | Ethnicity, smoking history and Townsend deprivation
family_history.rds* | Family history data
diagnoses.rds* | Extracted diagnosis codes for a range of common conditions
prescriptions.rds* | Estimated periods during which selected drugs were prescribed
Files marked * are generated by the scripts in the 02_extract_records directory.
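Once the pipeline has finished, the output files can be loaded back into R with `readRDS`. The sketch below reads every .rds file in a directory into a named list. `load_outputs` is a hypothetical helper, and the structure of each object should be inspected with `str()` rather than assumed.

```r
# Hypothetical helper: read every .rds file in `data_dir` into a list whose
# names are the file names without the extension (e.g. "data_period").
load_outputs <- function(data_dir = "data") {
  files <- list.files(data_dir, pattern = "\\.rds$", full.names = TRUE)
  names(files) <- tools::file_path_sans_ext(basename(files))
  lapply(files, readRDS)
}

# Example, assuming the default output directory:
# outputs <- load_outputs("data")
# str(outputs$data_period)
```

Loading all files at once may need a lot of memory, so for the larger files it may be preferable to call `readRDS` on one file at a time.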
Visualising the results
Estimating periods of EHR data collection
visualisation/01_algorithm.R can be used to plot the results of the algorithm used to infer periods of EHR data collection for a participant.

Diabetes phenotyping case study
visualisation/02_phenotyping.R can be used to plot the results of the diabetes phenotyping algorithm. paper/02_diabetes_phenotyping.R must be run first.

Citing this work
If you use this work, please cite it as below:
@article{10.1093/jamia/ocab260,
author = {Darke, Philip and Cassidy, Sophie and Catt, Michael and Taylor, Roy and Missier, Paolo and Bacardit, Jaume},
title = "{Curating a longitudinal research resource using linked primary care EHR data - a UK Biobank case study}",
journal = {Journal of the American Medical Informatics Association},
volume = {29},
number = {3},
pages = {546-552},
year = {2021},
month = {12},
issn = {1527-974X},
doi = {10.1093/jamia/ocab260},
url = {https://doi.org/10.1093/jamia/ocab260},
eprint = {https://academic.oup.com/jamia/article-pdf/29/3/546/42333190/ocab260.pdf},
}
Licence
Made available under the MIT Licence.
Owner
- Name: Philip Darke
- Login: philipdarke
- Kind: user
- Company: Newcastle University
- Website: philipdarke.com
- Repositories: 4
- Profile: https://github.com/philipdarke
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Darke"
    given-names: "Philip"
    orcid: "https://orcid.org/0000-0002-9033-2767"
  - family-names: "Cassidy"
    given-names: "Sophie"
    orcid: "https://orcid.org/0000-0002-0228-7274"
  - family-names: "Catt"
    given-names: "Michael"
  - family-names: "Taylor"
    given-names: "Roy"
  - family-names: "Missier"
    given-names: "Paolo"
  - family-names: "Bacardit"
    given-names: "Jaume"
    orcid: "https://orcid.org/0000-0002-2692-7205"
title: "Curating a longitudinal research resource using linked primary care EHR data - a UK Biobank case study"
version: 1.0.0
doi: 10.1093/jamia/ocab260
date-released: 2020-12-13
url: "https://doi.org/10.1093/jamia/ocab260"
Dependencies
- numpy 1.23.2
- pandas 1.4.3
- python-dateutil 2.8.2
- pytz 2022.2.1
- six 1.16.0
- pandas ^1.4.3
- python ^3.8