https://github.com/bethan-mallabar-rimmer/crc_irm
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: found 2 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: bethan-mallabar-rimmer
- Language: R
- Default Branch: main
- Size: 6.81 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
This repository contains files and pipeline for 'Colorectal cancer risk stratification using a polygenic risk score in symptomatic primary care patients – a UK Biobank retrospective cohort study'.
DOI: https://doi.org/10.1038/s41431-024-01654-3
analysis.R contains all the code used for the analysis. It mostly runs on R version 4.1.1.
The code is split into the sections listed below. Some sections require input files, many of which are hosted securely on UKBB and can't be shared publicly; for reproducibility, I've tried to describe the relevant details of these files (e.g. column headers) in code annotations. Some analysis decisions were made based on descriptive graphs of the data. In summary, the code isn't designed to run non-stop from beginning to end, and will need adapting if applied to different datasets.
Contents:
1. Identify a list of CRC symptoms.
2. Find UKBB participants with symptoms and make a table of the earliest symptom for each participant.
3. Find the earliest occurrence of CRC for participants (identify cases & controls) and remove participants with hereditary syndromes that increase the risk of CRC.
4. Add all lifestyle/symptom/health variables to the participant data frame.
5. Check case/control numbers by ancestry. Analysis continued with only the European cohort (due to case numbers) and only unrelated individuals (to avoid bias).
6. Generate the polygenic risk score for all participants and work out quintiles.
7. Split the cohort 80:20 into training and testing groups for validation. Stratify both testing & training cohorts by age and sex.
8. In the training cohort: logistic regression analysis to find variables associated with case or control groups.
9. In the training cohort: calculate the ROCAUC of each variable and build an integrated risk model iteratively based on ROCAUC values, with 5-fold cross validation.
10. In the training cohort: compare all possible integrated risk models with AIC.
11. Results of steps 9 and 10 concurred that a 6-variable integrated risk model performed best in the training cohort. Evaluate this model in the testing cohort.
Abbreviations: AIC = Akaike information criterion, CRC = colorectal cancer, ROCAUC = receiver operating characteristic area under the curve, UKBB = UK Biobank
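As a rough illustration of steps 7 to 10, here is a minimal base-R sketch on simulated data. It is not taken from analysis.R: the variable names (age, sex, prs, case), the effect sizes, the age bands and the candidate models are all assumptions made for the example, and step 7 is read here as sampling 80% within each age-band by sex stratum, which may differ from what the study did. ROCAUC is computed with the rank-based (Mann-Whitney) formula so that no extra packages are needed.

```r
# Sketch of steps 7-10 on SIMULATED data (all names and numbers are assumptions).
set.seed(1)
n <- 2000
dat <- data.frame(
  age = round(runif(n, 40, 70)),
  sex = factor(sample(c("F", "M"), n, replace = TRUE)),
  prs = rnorm(n)
)
# Simulated case status: age and PRS raise risk (arbitrary effect sizes)
dat$case <- rbinom(n, 1, plogis(-3 + 0.04 * (dat$age - 55) + 0.5 * dat$prs))

# Step 7: 80:20 split, sampling 80% within each age-band x sex stratum
dat$stratum <- interaction(cut(dat$age, c(39, 50, 60, 70)), dat$sex)
train_idx <- unlist(lapply(split(seq_len(n), dat$stratum), function(i) {
  i[sample.int(length(i), round(0.8 * length(i)))]
}))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Step 8: logistic regression in the training cohort
fit <- glm(case ~ age + sex + prs, family = binomial, data = train)

# Rank-based ROCAUC: probability that a random case scores above a random control
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Step 9: mean ROCAUC over k folds (the outcome column is assumed to be `case`)
cv_auc <- function(formula, data, k = 5) {
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  mean(sapply(seq_len(k), function(f) {
    m <- glm(formula, family = binomial, data = data[fold != f, ])
    p <- predict(m, newdata = data[fold == f, ], type = "response")
    auc(p, data$case[fold == f])
  }))
}

# Steps 9-10: compare a few candidate models by cross-validated ROCAUC and AIC
models <- list(
  age     = case ~ age,
  age_prs = case ~ age + prs,
  full    = case ~ age + sex + prs
)
cv_aucs <- sapply(models, cv_auc, data = train)
aics    <- sapply(models, function(f) AIC(glm(f, family = binomial, data = train)))

# Step 11 analogue: evaluate the chosen model once in the held-out testing cohort
test_auc <- auc(predict(fit, newdata = test, type = "response"), test$case)
```

Comparing cv_aucs and aics side by side mirrors the check described in step 11, where both criteria pointed to the same model before it was evaluated in the testing cohort.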
Please note the code to generate a polygenic risk score in this pipeline is no longer working after the rbgen package disappeared from the internet. New code for this can be found at: https://github.com/hdg204/GRS-Nexus - please contact the author of this repository with any questions.
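For orientation only, the sketch below shows the general shape of step 6, assuming the score is the usual weighted sum of allele dosages; this README does not spell out the formula, so treat that as an assumption. The dosage matrix, variant names and weights are simulated placeholders, not the study's variants, and reading real genotype files (the part that depended on rbgen) is deliberately left out.

```r
# Sketch of a polygenic risk score as a weighted sum of allele dosages.
# All data here are SIMULATED; variant names and weights are placeholders.
set.seed(42)
n_people <- 500
n_snps   <- 20

# Hypothetical dosage matrix: rows = participants, columns = variants, values 0-2
dosage <- matrix(
  rbinom(n_people * n_snps, size = 2, prob = 0.3),
  nrow = n_people,
  dimnames = list(NULL, paste0("snp", seq_len(n_snps)))
)

# Hypothetical per-allele weights (e.g. log odds ratios), named by variant
weights <- setNames(rnorm(n_snps, mean = 0, sd = 0.1), colnames(dosage))

# Score = sum over variants of dosage x weight (weights matched by variant name)
prs <- as.vector(dosage %*% weights[colnames(dosage)])

# Quintiles: cut the score at its 20th, 40th, 60th and 80th percentiles
breaks   <- quantile(prs, probs = seq(0, 1, by = 0.2))
quintile <- cut(prs, breaks = breaks, include.lowest = TRUE, labels = 1:5)
table(quintile)
```

Matching weights to columns by name rather than by position guards against a silently misordered variant list, which is an easy mistake when weights and genotypes come from different files.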
The findreadcodes folder contains an R function that takes Read codes as input and returns similar Read codes.
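How the function decides that two codes are "similar" isn't described here. Purely as an illustration of one plausible approach, the sketch below treats codes as similar when they share a leading prefix, relying on the hierarchical structure of Read codes. The function name, the prefix_len parameter and the example codes are all hypothetical and are not taken from the findreadcodes folder.

```r
# Hypothetical sketch: return dictionary codes sharing a prefix with any query code.
# This is NOT the repository's function; the matching rule is an assumption.
find_similar_codes_sketch <- function(query, dictionary, prefix_len = 3) {
  prefixes <- unique(substr(query, 1, prefix_len))
  hits <- dictionary[substr(dictionary, 1, prefix_len) %in% prefixes]
  setdiff(hits, query)  # drop the query codes themselves
}

# Made-up placeholder codes in a 5-character, dot-padded layout
dictionary <- c("AB1..", "AB12.", "AB13.", "AC1..", "ZZ9..")
similar <- find_similar_codes_sketch("AB1..", dictionary)
print(similar)
```

A shorter prefix_len widens the search to more distant relatives in the hierarchy, so any candidates returned this way would still need manual review before being used to define symptoms.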
CRCreadcodes contains some of the 227 Read codes for CRC symptoms which were used to include participants in this study (others are available upon request - see folder readme file), and the list of 49 Read codes used to identify cases of CRC in participants' GP records.
Owner
- Login: bethan-mallabar-rimmer
- Kind: user
- Repositories: 1
- Profile: https://github.com/bethan-mallabar-rimmer