cprd-data-wrangle

Introduction to CPRD using synthetic datasets

https://github.com/aim-rsf/cprd-data-wrangle

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
    Organization aim-rsf has institutional domain (www.turing.ac.uk)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (19.1%) to scientific vocabulary

Keywords

cprd cprd-aurum database ehr-data etl-pipeline health notebook-jupyter postgresql python synthetic-data tutorial
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: aim-rsf
  • License: gpl-3.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 25.7 MB
Statistics
  • Stars: 6
  • Watchers: 3
  • Forks: 0
  • Open Issues: 4
  • Releases: 6
Created over 1 year ago · Last pushed 12 months ago
Metadata Files
Readme Contributing License Citation

README.md

Project Status: Inactive. The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.

Welcome

Who is this repository for?

This repository is for anyone new to working with datasets released by the Clinical Practice Research Datalink (CPRD). Researchers tasked with understanding the database tables, then querying and filtering to create a research cohort, may find our pre-processing pipeline and interactive notebooks a helpful guide to getting started.

Please note:

  • You need your own copy of CPRD's synthetic or real data to run the code. This repository does not contain any data files. Two of CPRD's synthetic datasets can be accessed free of charge, subject to a Data Sharing Agreement (DSA).

  • CPRD are moving towards a Trusted Research Environment (TRE) model of data access, in which researchers no longer download data onto their own computers. Read more here.

  • This is a work-in-progress repository. If you would like to suggest or contribute a change, please read our contributor guide.

Project Goals

We aim to make working with CPRD datasets easier for researchers by creating clear documentation, efficient data management strategies and analytical pipelines. We will start by developing workflows that use CPRD's medium-fidelity synthetic datasets because they resemble

"the real world CPRD data with respect to the data types, data values, data formats, data structure and table relationships" ref.

New to Synthetic Data? Read an introduction here.

We will create and share documentation and code in openly available languages. We will start by loading the data into a relational database and summarising some of its main features.
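As a rough illustration of that first step, the sketch below loads a small tab-delimited extract into a relational table and computes a simple summary. It is a sketch under stated assumptions, not the repository's pipeline: Python's built-in `sqlite3` stands in for PostgreSQL so the example runs without a database server, and the column names (`patid`, `pracid`, `gender`, `yob`) and sample rows are hypothetical stand-ins for a CPRD patient file.

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited extract; no real or synthetic CPRD data is used here.
SAMPLE_TSV = (
    "patid\tpracid\tgender\tyob\n"
    "1001\t20\t1\t1950\n"
    "1002\t20\t2\t1984\n"
    "1003\t21\t2\t1972\n"
)


def load_patients(conn, handle):
    """Create a patient table and insert every row read from a tab-delimited file handle."""
    conn.execute(
        "CREATE TABLE patient ("
        "patid INTEGER PRIMARY KEY, pracid INTEGER, gender INTEGER, yob INTEGER)"
    )
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [
        (int(r["patid"]), int(r["pracid"]), int(r["gender"]), int(r["yob"]))
        for r in reader
    ]
    conn.executemany("INSERT INTO patient VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def summarise(conn):
    """Return the patient count, number of distinct practices and range of birth years."""
    n, practices, min_yob, max_yob = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT pracid), MIN(yob), MAX(yob) FROM patient"
    ).fetchone()
    return {"patients": n, "practices": practices, "yob_range": (min_yob, max_yob)}


# sqlite3 in-memory database: an assumption made for portability, not PostgreSQL.
conn = sqlite3.connect(":memory:")
loaded = load_patients(conn, io.StringIO(SAMPLE_TSV))
summary = summarise(conn)
print(loaded, summary)
```

With PostgreSQL, the same pattern would likely use a driver such as psycopg and a bulk load (for example the `COPY` command) rather than row-by-row inserts, which matters once files reach the size of real extracts.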

Working with our research collaborators, we aim to test workflows written for the synthetic datasets on the real datasets, to ensure transferability and utility. We anticipate mismatches in the size of the data files and possibly in file format. Please reach out to us if you want to test our code on your real CPRD data, or if you have feedback on improving transferability and utility.

CPRD's most recently released data specifications can be found here for the real datasets and here for the synthetic datasets.

Current content

We include information on CPRD's Code Browser tool and how to request access to it.

The code-for-aurum folder uses Python and PostgreSQL to create a pre-processing workflow for CPRD Aurum data. The workflow converts the data files to a compatible format and then reads them into tables in a relational database. Workbooks have been created to familiarise users with the CPRD Aurum tables, including how they link together and how to build a sample cohort. See a preview below:

landing-page-demo-gif
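To give a flavour of what linking tables and building a sample cohort can look like, here is a minimal sketch. It is not taken from the workbooks: it assumes two hypothetical tables, `patient` and `observation`, linked by `patid`, uses made-up `medcodeid` values, and again substitutes Python's built-in `sqlite3` for PostgreSQL so that it runs anywhere.

```python
import sqlite3

# In-memory stand-in database with two hypothetical, simplified tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patient (patid INTEGER PRIMARY KEY, gender INTEGER, yob INTEGER);
    CREATE TABLE observation (
        obsid INTEGER PRIMARY KEY,
        patid INTEGER REFERENCES patient(patid),
        medcodeid TEXT,
        obsdate TEXT  -- ISO dates, so MIN() returns the earliest date
    );
    """
)
conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?)",
    [(1, 1, 1950), (2, 2, 1984), (3, 2, 1972)],
)
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?)",
    [
        (10, 1, "A100", "2015-03-01"),
        (11, 1, "B200", "2016-07-15"),
        (12, 2, "B200", "2018-01-20"),
        (13, 3, "A100", "2012-11-05"),
        (14, 3, "A100", "2019-02-10"),
    ],
)


def build_cohort(conn, codes, born_before):
    """Select patients born before `born_before` with at least one observation
    whose code is in `codes`; return (patid, earliest matching date) pairs."""
    placeholders = ", ".join("?" for _ in codes)
    query = f"""
        SELECT p.patid, MIN(o.obsdate) AS index_date
        FROM patient AS p
        JOIN observation AS o ON o.patid = p.patid
        WHERE o.medcodeid IN ({placeholders}) AND p.yob < ?
        GROUP BY p.patid
        ORDER BY p.patid
    """
    return conn.execute(query, [*codes, born_before]).fetchall()


# "A100" is a made-up code list entry used only for illustration.
cohort = build_cohort(conn, ["A100"], 1980)
print(cohort)
```

The join on `patid` is the linking step; the code list and the birth-year filter are the cohort criteria. In practice the code list would come from a reviewed set of codes rather than a hard-coded value.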

Similar resources

We have not searched exhaustively for public resources with similar content (loading and pre-processing of CPRD data). Of those we found, many were narrow in scope (tied to the goals of a specific research project) and/or not maintained (not updated for many months or years). However, these two resources may be worth a look: https://github.com/HFAnalyticsLab/aurumpipeline and https://github.com/Exeter-Diabetes.

Contributions and Acknowledgments

We acknowledge and thank these groups for making this project possible:

The views expressed within any file in this repository are those of the author(s) within the AIM-RSF programme, and not necessarily those of the NIHR, the Department of Health and Social Care, the Medicines and Healthcare products Regulatory Agency (MHRA) or CPRD.

Thanks to specific contributors

This project follows the all-contributors specification, using the emoji key.

  • Rachael Stickland
  • Mahwish Mohammad
  • Batool Almarzouq
  • Ann-Marie Mallon
  • Kirstie Whitaker

Would you like to contribute? Please read our contributor guide.

Licence

This project is licensed under the GNU General Public License v3.0; see the LICENSE file for details. For more information, refer to the GNU General Public License.

Citation

Almarzouq, B., Mallon, A.-M., Mohammad, M., Stickland, R., Whitaker, K., & AIM-RSF team. (2025). Introduction to CPRD using synthetic datasets (cprd-data-wrangle). Zenodo: https://doi.org/10.5281/zenodo.13693615


You got to the end of the README? You get our 🦭 of approval!

Owner

  • Name: AI for Multiple Long-term Conditions - Research Support Facility
  • Login: aim-rsf
  • Kind: organization
  • Location: United Kingdom

Developing data standards, best practice and community around AI for multiple long term conditions research

GitHub Events

Total
  • Create event: 6
  • Release event: 4
  • Issues event: 4
  • Watch event: 5
  • Delete event: 4
  • Issue comment event: 7
  • Push event: 13
  • Pull request review event: 1
  • Pull request event: 6
Last Year
  • Create event: 6
  • Release event: 4
  • Issues event: 4
  • Watch event: 5
  • Delete event: 4
  • Issue comment event: 7
  • Push event: 13
  • Pull request review event: 1
  • Pull request event: 6

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 4
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 month
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.25
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 4
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 month
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.25
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • RayStick (7)
  • Rainiefantasy (4)
  • BatoolMM (1)
Pull Request Authors
  • RayStick (6)
  • Rainiefantasy (4)
  • BatoolMM (1)
Top Labels
Issue Labels
question (7) enhancement (2) documentation (1)
Pull Request Labels
documentation (9) enhancement (2)