https://github.com/bioconductor-source/regionalpcs
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
○.zenodo.json file
-
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (8.3%) to scientific vocabulary
Last synced: 10 months ago
·
JSON representation
Repository
Basic Info
- Host: GitHub
- Owner: bioconductor-source
- License: other
- Language: R
- Default Branch: devel
- Size: 1.48 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Created almost 2 years ago
· Last pushed almost 2 years ago
Metadata Files
Readme
License
README.Rmd
---
title: "regionalpcs"
output: github_document
vignette: >
%\VignetteIndexEntry{regionalpcs}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Table of Contents
1. [Introduction](#introduction)
2. [Installation](#installation)
3. [regionalpcs R Package Tutorial](#regionalpcs-r-package-tutorial)
- 3.1 [Loading Required Packages](#loading-required-packages)
- 3.2 [Load the Dataset](#load-the-dataset)
- 3.2.1 [Overview](#overview)
- 3.2.2 [Inspecting the Data](#inspecting-the-data)
- 3.3 [Obtaining Methylation Array Probe
Positions](#obtaining-methylation-array-probe-positions)
- 3.3.1 [Introduction](#introduction-1)
- 3.3.2 [Extract Probe Names and
Positions](#extract-probe-names-and-positions)
- 3.3.3 [Load Illumina 450k Array Probe
Positions](#load-illuminaprobe-positions)
- 3.3.4 [Merge Data Frames](#merge-data-frames)
- 3.3.5 [Addressing Genome Build
Discrepancies](#addressing-genome-build-discrepancies)
- 3.4 [Processing and Filtering Methylation
Data](#processing-and-filtering-methylation-data)
- 3.4.1 [Introduction](#introduction-2)
- 3.4.2 [Remove Low Variance CpGs](#remove-low-variance-cpgs)
- 3.4.3 [Normalize Methylation Values](#normalize-methylation-values)
- 3.5 [Summarizing Gene Region Types](#summarizing-gene-region-types)
- 3.5.1 [Introduction](#introduction-3)
- 3.5.2 [Load Gene Region Annotations](#load-gene-region-annotations)
- 3.5.3 [Create a Region Map](#create-a-region-map)
- 3.5.3.1 [Extract CpG Positions](#extract-cpg-positions)
- 3.5.3.2 [Convert to GenomicRanges and Find
Overlaps](#convert-to-genomicranges-and-find-overlaps)
- 3.5.4 [Summarizing Gene Regions with Regional Principal
Components](#summarizing-gene-regions)
- 3.5.4.1 [Compute Regional PCs](#compute-regional-pcs)
- 3.5.4.2 [Inspecting the Output](#inspecting-the-output)
- 3.5.4.3 [Extracting and Viewing Regional
PCs](#extracting-and-viewing-regional-pcs)
- 3.5.4.4 [Understanding the Results](#understanding-the-results)
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
echo = TRUE
)
```
# regionalpcs
Tiffany Eulalio
The `regionalpcs` package aims to address the challenge of summarizing
and interpreting DNA methylation data at a regional level. Traditional methods
of analysis may not capture the biological complexity of methylation patterns,
potentially leading to less accurate or less meaningful interpretations.
This package introduces the concept of regional principal components (rPCs)
as a tool for capturing more biologically relevant signals in DNA methylation
data. By using rPCs, researchers can gain new insights into complex
interactions and effects in methylation data that might otherwise be missed.
# Installation
You can install the regionalpcs package from Bioconductor
using the following command:
```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("regionalpcs")
```
You can install the development version of regionalpcs from GitHub with:
```{r eval=FALSE}
# install devtool package if needed
if (!requireNamespace("devtools", quietly=TRUE))
install.packages("devtools")
# download the regionalpcs package
devtools::install_github("tyeulalio/regionalpcs")
```
# `regionalpcs` R Package Tutorial
## Loading Required Packages
This tutorial depends on several Bioconductor packages. These packages should be
loaded at the beginning of the analysis.
```{r load-packages, message=FALSE, warning=FALSE}
library(regionalpcs)
library(RNOmni)
library(GenomicRanges)
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
library(liftOver)
library(magrittr)
library(tidyr)
library(tibble)
library(dplyr)
```
Here, we load the regionalpcs package, which is the main package we’ll be
using in this tutorial. We also load RNOmni, which provides normalization
functions, GenomicRanges, which provides tools for working with genomic
intervals, and tidyverse, which provides a suite of tools for data
manipulation and visualization.
It’s important to note that you need to have these packages installed on your
machine before loading them. You can install them using the install.packages()
function in R.
Once the packages are loaded, we can start using the functions provided by each
package.
## Load the dataset
### Overview
The betas dataset in the `regionalpcs` package is a subset of 450k array
methylation data from TCGA, containing 293 methylation sites and 300 samples. We'll load this data into our R session and explore its structure.
```{r load-data}
data("betas", package = "regionalpcs")
```
### Inspecting the Data
We can take a quick look at the dimensions of the dataset and the first few
rows to understand its structure.
```{r inspect-data}
head(betas)[, 1:3]
dim(betas)
```
Note that the row names are CpG IDs and genomic positions, and the columns
contain methylation beta values ranging from 0 to 1 for individual samples.
## Obtaining Methylation Array Probe Positions
### Introduction
To perform accurate and informative analyses on methylation array data, it is
critical to have precise genomic positions for each probe. The
`IlluminaHumanMethylation450kanno.ilmn12.hg19` package contains annotations for
450k methylation arrays, which can be utilized for this purpose. This section
will walk you through the steps to associate each probe in your dataset with
its genomic position.
### Extract Probe Names and Positions
First, we'll extract the probe names from the betas data frame and use regular
expressions to separate the CpG identifier from its genomic position.
```{r extract-probe}
# Extract probe names and CpG positions from row names of 'betas'
cpg_df <- data.frame(cpgs = rownames(betas)) %>%
separate(cpgs,
into = c("cpg_pos", "probe"), sep = "_(?=[^_]+$)",
extra = "merge"
)
head(cpg_df)
```
### Load Illumina 450k Array Probe Positions
Next, let's load the Illumina 450k array probe positions for further annotation.
```{r illumina-anno}
data(Locations)
probe_positions <- data.frame(Locations)
head(probe_positions)
```
### Merge Data Frames
Now, we merge the extracted probe names with the Illumina 450k array probe
positions.
```{r merge-probe}
formatted_probe_positions <- probe_positions %>%
rownames_to_column("probe")
new_cpg_df <- cpg_df %>%
left_join(formatted_probe_positions, by = "probe")
head(new_cpg_df)
```
### Addressing Genome Build Discrepancies
It's critical to ensure that the genome builds match across datasets. In this
example, we'll use the `GenomicRanges` and `liftOver` packages to convert the
genomic positions from hg19 to hg38. Here’s a quick example on how to lift over positions from one build to another. **Always ensure that you are working with
the correct genome build and that the build matches across all your datasets,
or else you will run into big issues!**
We need a chain file to lift the genomic positions. The chain file is an
annotation file that links the positions between the genome builds. You can
download this file from the (UCSC golden path download
site)[https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/]. Be sure to
download the file that maps between the appropriate builds. We’ll be mapping
from hg19 to hg38. We’ve included the chain used in this analysis as a part of
the regionalpcs package, which can be accessed in the “extdata” folder as shown
in the code below.
```{r lift-over}
# Convert hg19 positions into a GenomicRanges object
hg19_pos <- new_cpg_df %>%
select("chr", "pos", "strand", "probe") %>%
mutate(start = pos, end = pos + 1)
hg19_pos_gr <- makeGRangesFromDataFrame(hg19_pos, keep.extra.columns = TRUE)
# Load chain file and liftOver positions
chain_file <- system.file("extdata", "hg19ToHg38.over.chain",
package = "regionalpcs"
)
print(paste("Using chain file for liftOver", chain_file))
print(file.exists(chain_file))
chain <- import.chain(chain_file)
hg38_pos <- liftOver(hg19_pos_gr, chain) %>%
as.data.frame()
# Merge the lifted positions back to the original data frame
formatted_hg38 <- hg38_pos %>%
select(chrom_hg38 = seqnames, pos_hg38 = start, probe)
lifted_cpg_df <- new_cpg_df %>%
left_join(formatted_hg38, by = "probe")
head(lifted_cpg_df)
```
Now that we have accurate genomic positions for each probe and have harmonized
genome builds, we can proceed with preprocessing the methylation data.
## Processing and Filtering Methylation Data
### Introduction
Before conducting downstream analyses, it is essential to preprocess and clean
the methylation data. In this section, we'll walk you through the steps to
remove low variance CpGs and normalize the methylation beta values.
### Remove Low Variance CpGs
Firstly, we aim to filter out low variance CpGs. Variability is a crucial
factor, as low variance CpGs may not provide much information for downstream
analyses.
In this section, we'll remove low variance CpGs and normalize our methylation
beta values using the inverse normal transform.
```{r remove-low-var}
# Remove CpGs with zero variance
var_betas <- betas[apply(betas, 1, var, na.rm = TRUE) != 0, ] %>%
na.omit()
dim(var_betas)
```
We only remove CpGs that have zero variance in this example. You can adjust
this threshold according to the requirements of your specific analysis.
### Normalize Methylation Values
Methylation data often exhibit heteroscedasticity. Therefore, we'll normalize
the beta values using inverse normal transformation. For this, we'll use the
`RankNorm` function from the `RNOmni` package.
```{r normalize-values}
# Apply inverse normal transformation to methylation beta values
int_meth <- apply(var_betas, 1, RankNorm) %>%
t() %>%
as.data.frame()
```
After these preprocessing steps, you will have a dataset ready for downstream
analysis with the `regionalpcs` package. We'll cover how to perform these
analyses in subsequent sections of this tutorial.
## Summarizing Gene Region Types
### Introduction
Gene regions are significant functional units of the genome, such as promoters,
gene bodies, and intergenic regions. We'll focus on summarizing these regions
to prepare for downstream analyses. We will use the `regionalpcs` package to
perform these tasks.
### Load Gene Region Annotations
First, let's load the gene region annotations. Make sure to align the genomic
builds of your annotations and methylation data.
**All annotations included with the `regionalpcs` package are in build hg38.**
```{r}
# Load the gene region annotation file
data("gene_annots", package = "regionalpcs")
head(gene_annots)
```
The `gene_annots` dataset includes annotations for various gene regions.
### Create a Region Map
Before summarizing gene regions using `compute_regional_pcs`, we need to create
a region map that assigns CpGs to gene regions. This map enables us to identify
which CpGs fall into each gene region.
#### Extract CpG Positions
Start by extracting the CpG positions from your methylation data frame's row
names.
```{r extract-cpg}
head(int_meth)[1:4]
# Extract CpG information
cpg_info <- data.frame(cpg_id = rownames(int_meth)) %>%
separate(cpg_id,
into = c("chrom", "start", "end", "cpg_name"),
sep = "_", remove = FALSE
)
head(cpg_info)
```
#### Convert to GenomicRanges and Find Overlaps
Next, we'll use the `GenomicRanges` package to find overlaps between CpGs and
gene regions.
```{r, warning=FALSE}
# Convert to GenomicRanges
cpg_gr <- makeGRangesFromDataFrame(cpg_info, keep.extra.columns = TRUE)
annots_gr <- makeGRangesFromDataFrame(gene_annots, keep.extra.columns = TRUE)
# Find overlaps between the two GRanges objects
overlaps <- findOverlaps(query = cpg_gr, subject = annots_gr) %>%
as.data.frame()
head(overlaps)
# Match overlaps
matched_cpg <- cpg_gr[overlaps$queryHits, ] %>%
as.data.frame() %>%
select(cpg_id)
# Select overlapped rows and just keep the columns we need
matched_annots <- annots_gr[overlaps$subjectHits, ] %>%
as.data.frame() %>%
select(gencode_gene_id)
# Combine the matched CpGs and gene annotations to form the region map
region_map <- cbind(matched_annots, matched_cpg)
head(region_map)
length(unique(region_map$gencode_gene_id))
```
With these steps, you'll have a region map that assigns CpGs to specific gene
regions, which can be essential for downstream analyses.
### Summarizing Gene Regions with Regional Principal Components
In this final section, we'll summarize gene regions using Principal
Components (PCs) to capture the maximum variation. We'll utilize the
`compute_regional_pcs` function from the `regionalpcs` package for this.
#### Compute Regional PCs
Let's calculate the regional PCs using a subset of our gene regions for
demonstration purposes.
```{r compute-regional-pcs}
# Display head of region map
head(region_map)
# Subset the region map
sub_region_map <- region_map %>%
filter(gencode_gene_id %in% unique(region_map$gencode_gene_id)[1:1000])
# Compute regional PCs
res <- compute_regional_pcs(int_meth, sub_region_map)
```
#### Inspecting the Output
The function returns a list containing multiple elements. Let's first look at
what these elements are.
```{r inspect-output}
# Inspect the output list elements
names(res)
```
#### Extracting and Viewing Regional PCs
The first element (`res$regional_pcs`) is a data frame containing the
calculated regional PCs.
```{r extract-regional-pcs}
# Extract regional PCs
regional_pcs <- res$regional_pcs
head(regional_pcs)[1:4]
```
#### Understanding the Results
The output is a data frame with regional PCs for each region as rows and our
samples as columns. This is our new representation of methylation values, now
on a gene regional PC scale. We can feed these into downstream analyses as is.
The number of regional PCs representing each gene region was determined by the
Gavish-Donoho method. This method allows us to identify PCs that capture actual
signal in our data and not the noise that is inherent in any dataset.
To explore alternative methods, we can change the `pc_method` parameter.
```{r count-pcs-regions}
# Count the number of unique gene regions and PCs
regions <- data.frame(gene_pc = rownames(regional_pcs)) %>%
separate(gene_pc, into = c("gene", "pc"), sep = "-")
head(regions)
# number of genes that have been summarized
length(unique(regions$gene))
# how many of each PC did we get
table(regions$pc)
```
We have summarized each of our genes using just one PC. The number of PCs
depends on three main factors: the number of samples, the number of CpGs in
the gene region, and the noise in the methylation data.
By default, the `compute_regional_pcs` function uses the Gavish-Donoho method.
However, we can also use the Marcenko-Pasteur method by setting the `pc_method`
parameter:
```{r marcenko-pasteur-method}
# Using Marcenko-Pasteur method
mp_res <- compute_regional_pcs(int_meth, sub_region_map, pc_method = "mp")
# select the regional pcs
mp_regional_pcs <- mp_res$regional_pcs
# separate the genes from the pc numbers
mp_regions <- data.frame(gene_pc = rownames(mp_regional_pcs)) %>%
separate(gene_pc, into = c("gene", "pc"), sep = "-")
head(mp_regions)
# number of genes that have been summarized
length(unique(mp_regions$gene))
# how many of each PC did we get
table(mp_regions$pc)
```
The Marcenko-Pasteur and the Gavish-Donoho methods are both based on Random
Matrix Theory, and they aim to identify the number of significant PCs that
capture the true signal in the data and not just the noise. However, these
methods differ in how they select the number of significant PCs. The
Marcenko-Pasteur method typically selects more PCs to represent a gene region
compared to the Gavish-Donoho method. This may be due to the different ways in
which the two methods estimate the noise level in the data.
Ultimately, the choice between the two methods depends on the specific needs
and goals of the analysis. The Gavish-Donoho method tends to provide more
conservative results, while the Marcenko-Pasteur method may capture more of
the underlying signal in the data. Researchers should carefully consider
their objectives and the characteristics of their data when selecting a method.
# Session Information
```{r sessionInfo}
sessionInfo()
```
Owner
- Name: (WIP DEV) Bioconductor Packages
- Login: bioconductor-source
- Kind: organization
- Email: maintainer@bioconductor.org
- Website: https://bioconductor.org
- Repositories: 1
- Profile: https://github.com/bioconductor-source
Source code for packages accepted into Bioconductor
GitHub Events
Total
Last Year
Dependencies
DESCRIPTION
cran
- R >= 4.3.0 depends
- GenomicRanges * imports
- PCAtools * imports
- dplyr * imports
- tibble * imports
- BiocStyle * suggests
- IRanges * suggests
- RMTstat * suggests
- TxDb.Hsapiens.UCSC.hg19.knownGene * suggests
- knitr * suggests
- minfiData * suggests
- rmarkdown * suggests
- testthat >= 3.0.0 suggests
- tidyr * suggests