https://github.com/bioconductor-source/regionalpcs

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:
○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
○
.zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (8.3%) to scientific vocabulary
Last synced: 10 months ago · JSON representation
Repository

Basic Info

Host: GitHub
Owner: bioconductor-source
License: other
Language: R
Default Branch: devel
Size: 1.48 MB
Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0
Created almost 2 years ago · Last pushed almost 2 years ago
Metadata Files

Readme License
README.Rmd

---
title: "regionalpcs"
output: github_document
vignette: >
  %\VignetteIndexEntry{regionalpcs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Table of Contents

1. [Introduction](#introduction)
2. [Installation](#installation)
3. [regionalpcs R Package Tutorial](#regionalpcs-r-package-tutorial)
    - 3.1 [Loading Required Packages](#loading-required-packages)
    - 3.2 [Load the Dataset](#load-the-dataset)
        - 3.2.1 [Overview](#overview)
        - 3.2.2 [Inspecting the Data](#inspecting-the-data)
    - 3.3 [Obtaining Methylation Array Probe 
    Positions](#obtaining-methylation-array-probe-positions)
        - 3.3.1 [Introduction](#introduction-1)
        - 3.3.2 [Extract Probe Names and
        Positions](#extract-probe-names-and-positions)
        - 3.3.3 [Load Illumina 450k Array Probe
        Positions](#load-illuminaprobe-positions)
        - 3.3.4 [Merge Data Frames](#merge-data-frames)
        - 3.3.5 [Addressing Genome Build
        Discrepancies](#addressing-genome-build-discrepancies)
    - 3.4 [Processing and Filtering Methylation
    Data](#processing-and-filtering-methylation-data)
        - 3.4.1 [Introduction](#introduction-2)
        - 3.4.2 [Remove Low Variance CpGs](#remove-low-variance-cpgs)
        - 3.4.3 [Normalize Methylation Values](#normalize-methylation-values)
    - 3.5 [Summarizing Gene Region Types](#summarizing-gene-region-types)
        - 3.5.1 [Introduction](#introduction-3)
        - 3.5.2 [Load Gene Region Annotations](#load-gene-region-annotations)
        - 3.5.3 [Create a Region Map](#create-a-region-map)
            - 3.5.3.1 [Extract CpG Positions](#extract-cpg-positions)
            - 3.5.3.2 [Convert to GenomicRanges and Find
            Overlaps](#convert-to-genomicranges-and-find-overlaps)
        - 3.5.4 [Summarizing Gene Regions with Regional Principal
        Components](#summarizing-gene-regions)
            - 3.5.4.1 [Compute Regional PCs](#compute-regional-pcs)
            - 3.5.4.2 [Inspecting the Output](#inspecting-the-output)
            - 3.5.4.3 [Extracting and Viewing Regional
            PCs](#extracting-and-viewing-regional-pcs)
            - 3.5.4.4 [Understanding the Results](#understanding-the-results)




```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    echo = TRUE
)
```

# regionalpcs

Tiffany Eulalio

The `regionalpcs` package aims to address the challenge of summarizing 
and interpreting DNA methylation data at a regional level. Traditional methods 
of analysis may not capture the biological complexity of methylation patterns,
potentially leading to less accurate or less meaningful interpretations. 
This package introduces the concept of regional principal components (rPCs) 
as a tool for capturing more biologically relevant signals in DNA methylation 
data. By using rPCs, researchers can gain new insights into complex 
interactions and effects in methylation data that might otherwise be missed.


# Installation
You can install the regionalpcs package from Bioconductor 
using the following command:

```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

BiocManager::install("regionalpcs")
```

You can install the development version of regionalpcs from GitHub with:

```{r eval=FALSE}
# install devtool package if needed
if (!requireNamespace("devtools", quietly=TRUE))
    install.packages("devtools")

# download the regionalpcs package
devtools::install_github("tyeulalio/regionalpcs")
```

# `regionalpcs` R Package Tutorial 

## Loading Required Packages
This tutorial depends on several Bioconductor packages. These packages should be
loaded at the beginning of the analysis.

```{r load-packages, message=FALSE, warning=FALSE}
library(regionalpcs)
library(RNOmni)
library(GenomicRanges)
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
library(liftOver)
library(magrittr)
library(tidyr)
library(tibble)
library(dplyr)
```

Here, we load the regionalpcs package, which is the main package we’ll be 
using in this tutorial. We also load RNOmni, which provides normalization
functions, GenomicRanges, which provides tools for working with genomic
intervals, and tidyverse, which provides a suite of tools for data 
manipulation and visualization.

It’s important to note that you need to have these packages installed on your
machine before loading them. You can install them using the install.packages()
function in R.

Once the packages are loaded, we can start using the functions provided by each
package.


## Load the dataset

### Overview
The betas dataset in the `regionalpcs` package is a subset of 450k array
methylation data from TCGA, containing 293 methylation sites and 300 samples. We'll load this data into our R session and explore its structure.

```{r load-data}
data("betas", package = "regionalpcs")
```

### Inspecting the Data
We can take a quick look at the dimensions of the dataset and the first few 
rows to understand its structure.

```{r inspect-data}
head(betas)[, 1:3]
dim(betas)
```

Note that the row names are CpG IDs and genomic positions, and the columns 
contain methylation beta values ranging from 0 to 1 for individual samples.


## Obtaining Methylation Array Probe Positions

### Introduction
To perform accurate and informative analyses on methylation array data, it is 
critical to have precise genomic positions for each probe. The
`IlluminaHumanMethylation450kanno.ilmn12.hg19` package contains annotations for 
450k methylation arrays, which can be utilized for this purpose. This section 
will walk you through the steps to associate each probe in your dataset with 
its genomic position.

### Extract Probe Names and Positions
First, we'll extract the probe names from the betas data frame and use regular
expressions to separate the CpG identifier from its genomic position.

```{r extract-probe}
# Extract probe names and CpG positions from row names of 'betas'
cpg_df <- data.frame(cpgs = rownames(betas)) %>%
    separate(cpgs,
        into = c("cpg_pos", "probe"), sep = "_(?=[^_]+$)",
        extra = "merge"
    )
head(cpg_df)
```

### Load Illumina 450k Array Probe Positions
Next, let's load the Illumina 450k array probe positions for further annotation.

```{r illumina-anno}
data(Locations)
probe_positions <- data.frame(Locations)
head(probe_positions)
```

### Merge Data Frames
Now, we merge the extracted probe names with the Illumina 450k array probe 
positions.

```{r merge-probe}
formatted_probe_positions <- probe_positions %>%
    rownames_to_column("probe")

new_cpg_df <- cpg_df %>%
    left_join(formatted_probe_positions, by = "probe")
head(new_cpg_df)
```

### Addressing Genome Build Discrepancies
It's critical to ensure that the genome builds match across datasets. In this 
example, we'll use the `GenomicRanges` and `liftOver` packages to convert the 
genomic positions from hg19 to hg38. Here’s a quick example on how to lift over positions from one build to another. **Always ensure that you are working with 
the correct genome build and that the build matches across all your datasets, 
or else you will run into big issues!**

We need a chain file to lift the genomic positions. The chain file is an
annotation file that links the positions between the genome builds. You can
download this file from the (UCSC golden path download
site)[https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/]. Be sure to
download the file that maps between the appropriate builds. We’ll be mapping 
from hg19 to hg38. We’ve included the chain used in this analysis as a part of
the regionalpcs package, which can be accessed in the “extdata” folder as shown
in the code below.

```{r lift-over}
# Convert hg19 positions into a GenomicRanges object
hg19_pos <- new_cpg_df %>%
    select("chr", "pos", "strand", "probe") %>%
    mutate(start = pos, end = pos + 1)

hg19_pos_gr <- makeGRangesFromDataFrame(hg19_pos, keep.extra.columns = TRUE)

# Load chain file and liftOver positions
chain_file <- system.file("extdata", "hg19ToHg38.over.chain",
    package = "regionalpcs"
)
print(paste("Using chain file for liftOver", chain_file))
print(file.exists(chain_file))
chain <- import.chain(chain_file)

hg38_pos <- liftOver(hg19_pos_gr, chain) %>%
    as.data.frame()

# Merge the lifted positions back to the original data frame
formatted_hg38 <- hg38_pos %>%
    select(chrom_hg38 = seqnames, pos_hg38 = start, probe)

lifted_cpg_df <- new_cpg_df %>%
    left_join(formatted_hg38, by = "probe")
head(lifted_cpg_df)
```

Now that we have accurate genomic positions for each probe and have harmonized 
genome builds, we can proceed with preprocessing the methylation data.


## Processing and Filtering Methylation Data

### Introduction
Before conducting downstream analyses, it is essential to preprocess and clean 
the methylation data. In this section, we'll walk you through the steps to 
remove low variance CpGs and normalize the methylation beta values.

### Remove Low Variance CpGs
Firstly, we aim to filter out low variance CpGs. Variability is a crucial 
factor, as low variance CpGs may not provide much information for downstream 
analyses.

In this section, we'll remove low variance CpGs and normalize our methylation 
beta values using the inverse normal transform.

```{r remove-low-var}
# Remove CpGs with zero variance
var_betas <- betas[apply(betas, 1, var, na.rm = TRUE) != 0, ] %>%
    na.omit()
dim(var_betas)
```
We only remove CpGs that have zero variance in this example. You can adjust 
this threshold according to the requirements of your specific analysis.

### Normalize Methylation Values
Methylation data often exhibit heteroscedasticity. Therefore, we'll normalize 
the beta values using inverse normal transformation. For this, we'll use the
`RankNorm` function from the `RNOmni`  package.

```{r normalize-values}
# Apply inverse normal transformation to methylation beta values
int_meth <- apply(var_betas, 1, RankNorm) %>%
    t() %>%
    as.data.frame()
```

After these preprocessing steps, you will have a dataset ready for downstream 
analysis with the `regionalpcs` package. We'll cover how to perform these 
analyses in subsequent sections of this tutorial.

## Summarizing Gene Region Types

### Introduction
Gene regions are significant functional units of the genome, such as promoters, 
gene bodies, and intergenic regions. We'll focus on summarizing these regions 
to prepare for downstream analyses. We will use the `regionalpcs` package to 
perform these tasks.

### Load Gene Region Annotations
First, let's load the gene region annotations. Make sure to align the genomic 
builds of your annotations and methylation data.

**All annotations included with the `regionalpcs` package are in build hg38.**

```{r}
# Load the gene region annotation file
data("gene_annots", package = "regionalpcs")
head(gene_annots)
```
The `gene_annots`  dataset includes annotations for various gene regions.

### Create a Region Map

Before summarizing gene regions using `compute_regional_pcs`, we need to create 
a region map that assigns CpGs to gene regions. This map enables us to identify
which CpGs fall into each gene region.

#### Extract CpG Positions
Start by extracting the CpG positions from your methylation data frame's row 
names.

```{r extract-cpg}
head(int_meth)[1:4]
# Extract CpG information
cpg_info <- data.frame(cpg_id = rownames(int_meth)) %>%
    separate(cpg_id,
        into = c("chrom", "start", "end", "cpg_name"),
        sep = "_", remove = FALSE
    )
head(cpg_info)
```

#### Convert to GenomicRanges and Find Overlaps
Next, we'll use the `GenomicRanges` package to find overlaps between CpGs and 
gene regions.

```{r, warning=FALSE}
# Convert to GenomicRanges
cpg_gr <- makeGRangesFromDataFrame(cpg_info, keep.extra.columns = TRUE)
annots_gr <- makeGRangesFromDataFrame(gene_annots, keep.extra.columns = TRUE)

# Find overlaps between the two GRanges objects
overlaps <- findOverlaps(query = cpg_gr, subject = annots_gr) %>%
    as.data.frame()
head(overlaps)

# Match overlaps
matched_cpg <- cpg_gr[overlaps$queryHits, ] %>%
    as.data.frame() %>%
    select(cpg_id)

# Select overlapped rows and just keep the columns we need
matched_annots <- annots_gr[overlaps$subjectHits, ] %>%
    as.data.frame() %>%
    select(gencode_gene_id)

# Combine the matched CpGs and gene annotations to form the region map
region_map <- cbind(matched_annots, matched_cpg)
head(region_map)
length(unique(region_map$gencode_gene_id))
```
With these steps, you'll have a region map that assigns CpGs to specific gene 
regions, which can be essential for downstream analyses.




### Summarizing Gene Regions with Regional Principal Components


In this final section, we'll summarize gene regions using Principal 
Components (PCs) to capture the maximum variation. We'll utilize the
`compute_regional_pcs` function from the `regionalpcs` package for this.

#### Compute Regional PCs
Let's calculate the regional PCs using a subset of our gene regions for 
demonstration purposes.

```{r compute-regional-pcs}
# Display head of region map
head(region_map)

# Subset the region map
sub_region_map <- region_map %>%
    filter(gencode_gene_id %in% unique(region_map$gencode_gene_id)[1:1000])

# Compute regional PCs
res <- compute_regional_pcs(int_meth, sub_region_map)
```

#### Inspecting the Output
The function returns a list containing multiple elements. Let's first look at 
what these elements are.

```{r inspect-output}
# Inspect the output list elements
names(res)
```

#### Extracting and Viewing Regional PCs
The first element (`res$regional_pcs`) is a data frame containing the 
calculated regional PCs.


```{r extract-regional-pcs}
# Extract regional PCs
regional_pcs <- res$regional_pcs
head(regional_pcs)[1:4]
```

#### Understanding the Results
The output is a data frame with regional PCs for each region as rows and our
samples as columns. This is our new representation of methylation values, now 
on a gene regional PC scale. We can feed these into downstream analyses as is.

The number of regional PCs representing each gene region was determined by the
Gavish-Donoho method. This method allows us to identify PCs that capture actual 
signal in our data and not the noise that is inherent in any dataset. 
To explore alternative methods, we can change the `pc_method` parameter.


```{r count-pcs-regions}
# Count the number of unique gene regions and PCs
regions <- data.frame(gene_pc = rownames(regional_pcs)) %>%
    separate(gene_pc, into = c("gene", "pc"), sep = "-")
head(regions)

# number of genes that have been summarized
length(unique(regions$gene))

# how many of each PC did we get
table(regions$pc)
```
We have summarized each of our genes using just one PC. The number of PCs 
depends on three main factors: the number of samples, the number of CpGs in 
the gene region, and the noise in the methylation data.

By default, the `compute_regional_pcs` function uses the Gavish-Donoho method.
However, we can also use the Marcenko-Pasteur method by setting the `pc_method`
parameter:

```{r marcenko-pasteur-method}
# Using Marcenko-Pasteur method
mp_res <- compute_regional_pcs(int_meth, sub_region_map, pc_method = "mp")

# select the regional pcs
mp_regional_pcs <- mp_res$regional_pcs

# separate the genes from the pc numbers
mp_regions <- data.frame(gene_pc = rownames(mp_regional_pcs)) %>%
    separate(gene_pc, into = c("gene", "pc"), sep = "-")
head(mp_regions)

# number of genes that have been summarized
length(unique(mp_regions$gene))

# how many of each PC did we get
table(mp_regions$pc)
```

The Marcenko-Pasteur and the Gavish-Donoho methods are both based on Random 
Matrix Theory, and they aim to identify the number of significant PCs that 
capture the true signal in the data and not just the noise. However, these 
methods differ in how they select the number of significant PCs. The 
Marcenko-Pasteur method typically selects more PCs to represent a gene region 
compared to the Gavish-Donoho method. This may be due to the different ways in 
which the two methods estimate the noise level in the data.

Ultimately, the choice between the two methods depends on the specific needs 
and goals of the analysis. The Gavish-Donoho method tends to provide more 
conservative results, while the Marcenko-Pasteur method may capture more of 
the underlying signal in the data. Researchers should carefully consider 
their objectives and the characteristics of their data when selecting a method.


# Session Information

```{r sessionInfo}
sessionInfo()
```
Owner

Name: (WIP DEV) Bioconductor Packages
Login: bioconductor-source
Kind: organization
Email: maintainer@bioconductor.org
Website: https://bioconductor.org
Repositories: 1
Profile: https://github.com/bioconductor-source
Source code for packages accepted into Bioconductor
GitHub Events

Total

Last Year

Dependencies

DESCRIPTION cran
R >= 4.3.0 depends
GenomicRanges * imports
PCAtools * imports
dplyr * imports
tibble * imports
BiocStyle * suggests
IRanges * suggests
RMTstat * suggests
TxDb.Hsapiens.UCSC.hg19.knownGene * suggests
knitr * suggests
minfiData * suggests
rmarkdown * suggests
testthat >= 3.0.0 suggests
tidyr * suggests
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science