epiclockinvasivebrca

https://github.com/danmonyak/epiclockinvasivebrca


Repository

Basic Info

Host: GitHub
Owner: danmonyak
License: mit
Language: HTML
Default Branch: main
Size: 63.8 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 1

Created over 2 years ago · Last pushed 10 months ago


Measuring the Age of Individual Breast Cancers Using an Entropy-Based Molecular Clock

EpiClockInvasiveBRCA

1. Introduction

The code in this repository can be used to generate all figures and results found in Monyak et al. (2025). Preprocessing of some of the data sources is performed by Python scripts, while all generation of figures and results must be done in Jupyter notebooks and R Markdown files.

Software requirements:
- Python 3
- R

2. Setup

Fork and clone this repository locally as normal.

Python

Use a bash shell to run all scripts and Jupyter notebooks. To see what shell is running, use echo $SHELL. Run the following line to append the path to the parent directory of the repository clone to the Python path:

repo_parent_dir=/PATH/TO/REPO/PARENT/DIR
echo "export PYTHONPATH=$PYTHONPATH:$repo_parent_dir" >> ~/.bash_profile

Note: the local clone of EpiClockInvasiveBRCA should be located directly in repo_parent_dir.
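To confirm the setting took effect in a new shell, a quick check along the following lines can help. This is a minimal sketch, not part of the repository: the function name parent_dir_on_pythonpath is hypothetical, and the path passed to it is whatever you used for repo_parent_dir.

```python
import os
from pathlib import Path


def parent_dir_on_pythonpath(repo_parent_dir, pythonpath=None):
    """Return True if repo_parent_dir is one of the PYTHONPATH entries.

    Hypothetical helper for illustration; pythonpath defaults to the
    current value of the PYTHONPATH environment variable.
    """
    if pythonpath is None:
        pythonpath = os.environ.get("PYTHONPATH", "")
    target = Path(repo_parent_dir).expanduser().resolve()
    # PYTHONPATH may contain empty entries (e.g. a leading colon); skip them.
    entries = [p for p in pythonpath.split(os.pathsep) if p]
    return any(Path(p).expanduser().resolve() == target for p in entries)
```

Calling parent_dir_on_pythonpath("/PATH/TO/REPO/PARENT/DIR") with your own path should return True once ~/.bash_profile has been sourced.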

R

In your R environment, preferably RStudio, run the following line and copy the path it prints:

file.path(Sys.getenv("R_HOME"), 'etc', 'Rprofile.site')

Append the following line to the file at that path (create the file if necessary):

repo_dir <- '/PATH/TO/REPO/PARENT/DIR/EpiClockInvasiveBRCA'

replacing the placeholder path with the path to the local repository clone.

Path variables

Open src/consts.json in a text editor and insert appropriate paths for the following attributes:
- repo_dir: Path to the repository (same as the R variable repo_dir in the previous step)
- official_indir: Path to a directory in an external file location (preferably Box) that can hold terabytes of data
- TCGA_datadir: Path to a directory that will hold the TCGA data (preferably a subdirectory of official_indir)
- Lund_datadir: Path to a directory that will hold the Lund cohort data (preferably a subdirectory of official_indir)
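Before running the pipeline, it may be worth checking that every path attribute has been filled in. The sketch below is illustrative only and assumes that src/consts.json is a flat JSON object whose values are path strings; the key names are as read from the list above (adjust them if your file differs), and check_consts is a hypothetical helper, not a function from the repository.

```python
import json
from pathlib import Path

# Key names as read from the README; adjust if consts.json differs.
REQUIRED_KEYS = ("repo_dir", "official_indir", "TCGA_datadir", "Lund_datadir")


def check_consts(consts_path):
    """Return a list of problems found in the consts file (empty if none).

    Assumes a flat JSON object mapping attribute names to path strings.
    """
    with open(consts_path) as fh:
        consts = json.load(fh)
    problems = []
    for key in REQUIRED_KEYS:
        value = consts.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key}: missing or empty")
        elif not Path(value).expanduser().is_dir():
            # The data directories may simply not have been created yet.
            problems.append(f"{key}: directory does not exist: {value}")
    return problems
```

Running check_consts("src/consts.json") from the repository root returns an empty list when all four attributes point at existing directories.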

2. Supplementary Data Retrieval

In the external file location (preferably Box), create a directory named after each cohort and download the relevant data from Gene Expression Omnibus (GEO) into it:
- Aurora: GSE212370
  - Methylation
  - Clinical
  - Download supplementary data directory 43018_2022_491_MOESM2_ESM from https://doi.org/10.1038/s43018-022-00491-x
- Desmedt: GSE39451
  - Series Matrix
- Germany: GSE69914
  - Series Matrix
- Lund: GSE25307
  - Methylation
  - Clinical
- Luo: GSE106360
  - Series Matrix
- Reyngold: GSE58999
  - Series Matrix
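The directory layout described above can be created in one step. The following is a minimal sketch under stated assumptions: the directory names are assumed to match the cohort names exactly as written in this README, and make_cohort_dirs and geo_url are hypothetical helpers rather than repository code. The downloads themselves still have to be done manually from each GEO record.

```python
from pathlib import Path

# Cohort name -> GEO accession, as listed in this README.
COHORTS = {
    "Aurora": "GSE212370",
    "Desmedt": "GSE39451",
    "Germany": "GSE69914",
    "Lund": "GSE25307",
    "Luo": "GSE106360",
    "Reyngold": "GSE58999",
}


def geo_url(accession):
    """Return the GEO record page for a series accession."""
    return f"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={accession}"


def make_cohort_dirs(base_dir):
    """Create one subdirectory per cohort under base_dir; return the paths."""
    base = Path(base_dir).expanduser()
    created = []
    for cohort in COHORTS:
        cohort_dir = base / cohort
        cohort_dir.mkdir(parents=True, exist_ok=True)
        created.append(cohort_dir)
    return created
```

For example, make_cohort_dirs("/PATH/TO/EXTERNAL/LOCATION") creates the six directories if they do not already exist, and geo_url(COHORTS["Lund"]) gives the page from which to download the Lund files.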

3. Pipeline

1. Simulation

To run all simulations, do:

sh runAllSimulations.sh

2. TCGA Retrieval

To retrieve the TCGA data and generate the HTML output, set the header parameters accordingly in DataPrep.Rmd, and in bash, do:

```
Rscript -e "rmarkdown::render('DataPrep.Rmd', output_format = 'html_document', output_file = paste0('DataPrep ', Sys.time(), '.html'))"
```

Rendering usually takes less than 1 hour but can take up to a few hours. Run it on a machine with at least 16 GB of memory.

3. Select fCpGs

Run all cells in the Jupyter notebook Select_fCpGs-Revision.ipynb.

4. Process Supplementary Data

To process all supplementary data, do:

sh processAllData.sh

5. Subtyping

To calculate PAM50 subtype for the TCGA tumors, do:

Rscript "subtype.R"

6. Beta Mixture Model

To run the beta mixture model decomposition analysis on the TCGA and Lund data and generate the HTML output, set the header parameters accordingly in FitBetaMixture.Rmd, and in bash, do:

```
Rscript -e "rmarkdown::render('FitBetaMixture.Rmd', output_format = 'html_document', output_file = paste0('FitBetaMixture ', Sys.time(), '.html'))"
```

It should take no more than 10 minutes to render.

7. Analysis

Run all cells in the following notebooks, in this order:
1. c_beta Analysis.ipynb
2. Estimate ages.ipynb
3. Multi-sample.ipynb

To perform beta value adjustment and subsequent analysis, and generate the HTML output, set the header parameters accordingly in beta_adjustment.Rmd, and in bash, do:

Rscript -e "rmarkdown::render('beta_adjustment.Rmd', output_format = 'html_document', output_file = paste0('beta_adjustment ', Sys.time(), '.html'))"

To generate the GSEA-related figures and the HTML output, set the header parameters accordingly in GSEA_Figure.Rmd, and in bash, do:

Rscript -e "rmarkdown::render('GSEA_Figure.Rmd', output_format = 'html_document', output_file = paste0('GSEA_Figure ', Sys.time(), '.html'))"
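The R Markdown render commands in this README all follow one pattern, differing only in the file name. As a convenience sketch (not part of the repository), the hypothetical helper render_command below rebuilds that Rscript invocation for any of the .Rmd files, so it can be passed to subprocess.run or printed for copy-pasting. It assumes Rscript and the rmarkdown package are installed, and it does not run anything itself.

```python
import shlex


def render_command(rmd_file):
    """Build the Rscript argument list used in this README for one .Rmd file.

    Hypothetical helper: it only constructs the command, it does not run it.
    """
    stem = rmd_file.rsplit(".", 1)[0]
    r_expr = (
        f"rmarkdown::render('{rmd_file}', output_format = 'html_document', "
        f"output_file = paste0('{stem} ', Sys.time(), '.html'))"
    )
    return ["Rscript", "-e", r_expr]


def render_command_string(rmd_file):
    """Return the same command as a single shell-quoted string."""
    return shlex.join(render_command(rmd_file))
```

For example, subprocess.run(render_command("beta_adjustment.Rmd"), check=True) would render that file from Python, provided the header parameters in the .Rmd file have already been set.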

Owner

Name: Daniel Monyak
Login: danmonyak
Kind: user

Repositories: 1
Profile: https://github.com/danmonyak

Citation (citations.md)

Some code was adapted from https://rowannicholls.github.io/python/graphs/ax_based/boxplots_significance.html to create the significance bars in the saveBoxPlotNew function in src/util.py.
