hydroxymethylater
R workflow for preprocessing, analyzing, and annotating Illumina HumanMethylationEPIC hydroxymethylation data.
Repository
Basic Info
- Host: GitHub
- Owner: eirinisparaki
- License: apache-2.0
- Language: R
- Default Branch: main
- Size: 438 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
HydroxymethylateR
R workflow for preprocessing, analyzing, and annotating Illumina HumanMethylationEPIC hydroxymethylation data.
Computational Environment Requirements
Developed and tested on Linux. Other platforms (e.g., macOS, Windows) might also work.
System Requirements
- A Linux-based computer (tested on Ubuntu)
- R >= 4.5
- Bioconductor >= 3.2
- ChAMP >= 2.36
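The version minimums above can be checked from an R session before running the workflow. The following is a minimal sketch using base R only; the helper name `meets_minimum` is illustrative and is not part of HydroxymethylateR.

```r
# Hypothetical helper: TRUE if an installed package meets a minimum version,
# FALSE if it is older, NA if the package is not installed.
meets_minimum <- function(pkg, min_version) {
  # system.file() returns "" when the package is not installed
  if (!nzchar(system.file(package = pkg))) {
    return(NA)
  }
  utils::packageVersion(pkg) >= min_version
}

# R itself: compare the running version against the stated minimum
r_ok <- getRversion() >= "4.5"

# ChAMP: NA here simply means it has not been installed yet
champ_ok <- meets_minimum("ChAMP", "2.36")

message("R >= 4.5: ", r_ok)
message("ChAMP >= 2.36: ", champ_ok)
```

An `NA` result means the package is missing rather than outdated, which is usually the more useful distinction when setting up a new machine.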
Overview
This workflow is built to:
- Import and preprocess bisulfite (BS) and oxidative bisulfite (oxBS) array data.
- Normalize data using NOOB/FunNorm/RAW and filter out problematic probes.
- Estimate sex, cell type proportions and predict smoking status and age.
- Run the MLML method to quantify hydroxymethylation (5hmC) levels.
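To illustrate why paired BS and oxBS arrays are needed: bisulfite treatment reads 5mC and 5hmC together, while oxidative bisulfite reads 5mC only, so their difference approximates 5hmC. The sketch below shows that naive subtraction on toy beta values. It is not the MLML method used by the workflow, which estimates the proportions by constrained maximum likelihood; all values and object names here are made up for illustration.

```r
# Toy beta values (probes x samples); real values come from the arrays.
beta_bs <- matrix(
  c(0.80, 0.55, 0.30, 0.10), nrow = 2,
  dimnames = list(c("probe1", "probe2"), c("S01", "S02"))
)
beta_oxbs <- matrix(
  c(0.70, 0.60, 0.25, 0.10), nrow = 2,
  dimnames = list(c("probe1", "probe2"), c("S01", "S02"))
)

# BS signal = 5mC + 5hmC, oxBS signal = 5mC, so the difference approximates 5hmC.
hmc_naive <- beta_bs - beta_oxbs

# Measurement noise can make the difference negative (probe2 in S01 here),
# which is not a valid proportion. Clamping at zero is a crude workaround.
hmc_clamped <- pmax(hmc_naive, 0)

print(hmc_naive)
print(hmc_clamped)
```

Negative differences such as the one for `probe2` are the main motivation for a likelihood-based estimator that keeps all proportions within [0, 1].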
preprocess_hydroxymethylation_data()
Workflow Overview

Function Signature
```r
preprocess_hydroxymethylation_data(
  ox_file, bs_file,
  annotation_array = "IlluminaHumanMethylationEPICv2",
  annotation_version = "20a1.hg38",
  normalization = "NOOB",
  ChAMPfilter_arraytype_bs = "EPICv2",
  ChAMPfilter_ProbeCutoff_bs = 0.01,
  ChAMPfilter_arraytype_ox = "EPICv2",
  ChAMPfilter_ProbeCutoff_ox = 0.01,
  file_inaccuracies = NULL,
  low_variance_threshold_hmc = 0,
  predictSex = FALSE,
  predictSmoking = FALSE,
  predictAge = FALSE,
  calculateCellPropPCs = FALSE,
  plotCellProps = FALSE,
  plotPCA = FALSE,
  plotSVD = FALSE,
  plotHmC = FALSE,
  output_dir = getwd()
)
```
Arguments & Options
| Argument | Type / Accepted values | Default | Description |
| ---------------------------- | ------------------------------ | ---------------------------------- | ----------------------------------------------- |
| ox_file | character (path) | required | CSV of metadata for OxBS arrays. |
| bs_file | character (path) | required | CSV of metadata for BS arrays. |
| annotation_array | Valid minfi array string | "IlluminaHumanMethylationEPICv2" | Probe annotation. |
| annotation_version | character | "20a1.hg38" | Annotation version. |
| normalization | "NOOB", "FUNORM", "RAW" | "NOOB" | Normalization method. |
| ChAMPfilter_arraytype_bs | "450K", "EPIC", "EPICv2" | "EPICv2" | Array type for ChAMP filter (BS). |
| ChAMPfilter_ProbeCutoff_bs | numeric 0–1 | 0.01 | ProbeCutoff (BS). |
| ChAMPfilter_arraytype_ox | As above | "EPICv2" | Array type for ChAMP filter (OxBS). |
| ChAMPfilter_ProbeCutoff_ox | numeric 0–1 | 0.01 | ProbeCutoff (OxBS). |
| file_inaccuracies | NULL or path | NULL | CSV listing inaccurate probes to exclude (column IlmnID). |
| low_variance_threshold_hmc | numeric ≥ 0 | 0 | Low-variance threshold for filtering 5hmC values. |
| predictSex | logical | FALSE | Add sex prediction via minfi::getSex(). |
| predictSmoking | logical | FALSE | Add smoking score via EpiSmokEr. |
| predictAge | logical | FALSE | Add DNAm age (Horvath) via wateRmelon. |
| calculateCellPropPCs | logical | FALSE | Estimate blood-cell composition PCs. |
| plotCellProps | logical | FALSE | Save stacked-bar chart cell-composition plot. |
| plotPCA | logical | FALSE | Save PCA. |
| plotSVD | logical | FALSE | Save ChAMP SVD plots. |
| plotHmC | logical | FALSE | Save 5hmC density plot. |
| output_dir | character (path) | getwd() | Destination folder for all outputs. |
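As an illustration of how the options in the table combine, the sketch below builds a call that switches on a few optional steps. The file names are reused from the Minimal Example and the chosen option values are arbitrary; the call is only made when the package is installed and both metadata files exist, so the snippet is safe to run anywhere.

```r
# Arguments for a run with functional normalisation, sex prediction and a PCA plot.
# Everything not listed here keeps the default shown in the table above.
call_args <- list(
  ox_file       = "metadataoxbs.csv",  # placeholder path
  bs_file       = "metadatabs.csv",    # placeholder path
  normalization = "FUNORM",
  predictSex    = TRUE,
  plotPCA       = TRUE,
  output_dir    = "results"
)

# Guard: run only when the package and both input files are available.
can_run <- nzchar(system.file(package = "HydroxymethylateR")) &&
  all(file.exists(call_args$ox_file, call_args$bs_file))

if (can_run) {
  results <- do.call(
    HydroxymethylateR::preprocess_hydroxymethylation_data,
    call_args
  )
} else {
  message("Package or metadata files not found; skipping the run.")
}
```

Collecting the arguments in a list first makes it easy to log or reuse the exact configuration of a run.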
Required Inputs
1. Metadata CSV (ox_file, bs_file)
Each CSV must contain one row per array and these five columns (case-sensitive):
| Column | Description | Example |
| ------------- | ---------------------------------------------------- | -------------- |
| Sample_Name | Unique experiment ID (overwritten internally). | S01 |
| Array | Illumina barcode (last 10 digits of iDAT filenames). | 1234567890 |
| Slide | Illumina slide ID (first part of iDAT filenames). | 204905210066 |
| iDAT_PATH | Directory containing Red + Grn iDATs for that slide. | /data/iDATs/ |
| status | Custom label (case, control, etc.). | case |
Expected folder layout
iDAT_PATH/
└── SLIDE/
    ├── SLIDE_ARRAY_Red.iDAT
    └── SLIDE_ARRAY_Grn.iDAT
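A metadata file with the five required columns can be assembled and sanity-checked in base R before it is passed to the workflow. The sketch below is one way to do that, assuming the folder layout shown above; the sample values and the helper name `expected_idat` are illustrative only.

```r
# Example metadata: one row per array, five required columns.
# Barcodes are kept as character so long IDs are not converted to numbers.
meta <- data.frame(
  Sample_Name = c("S01", "S02"),
  Array       = c("1234567890", "1234567891"),
  Slide       = c("204905210066", "204905210066"),
  iDAT_PATH   = "/data/iDATs/",
  status      = c("case", "control"),
  stringsAsFactors = FALSE
)

# Column names are case-sensitive, so compare them exactly.
required_cols <- c("Sample_Name", "Array", "Slide", "iDAT_PATH", "status")
missing_cols <- setdiff(required_cols, names(meta))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}

# Hypothetical helper: build iDAT_PATH/SLIDE/SLIDE_ARRAY_<channel>.iDAT
expected_idat <- function(meta, channel) {
  base <- sub("/+$", "", meta$iDAT_PATH)  # drop trailing slashes
  file.path(base, meta$Slide,
            paste0(meta$Slide, "_", meta$Array, "_", channel, ".iDAT"))
}

red_files <- expected_idat(meta, "Red")
grn_files <- expected_idat(meta, "Grn")

# Report which expected files are present on this machine.
found <- file.exists(c(red_files, grn_files))
message(sum(found), " of ", length(found), " expected iDAT files found.")

# Write the metadata file that would be passed as bs_file or ox_file.
meta_path <- file.path(tempdir(), "metadata_example.csv")
write.csv(meta, meta_path, row.names = FALSE)
```

Checking the expected paths up front tends to catch typos in barcodes or directories before any array data is loaded.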
2. Optional Probe Inaccuracies (file_inaccuracies)
CSV with a column IlmnID listing probes to exclude.
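A probe exclusion file of this shape could be created as follows. This is a small sketch in base R; the probe IDs and the file name are placeholders, not a recommended exclusion list.

```r
# Placeholder probe IDs; replace with the probes you want to exclude.
probes_to_drop <- data.frame(
  IlmnID = c("cg00000029", "cg00000108"),
  stringsAsFactors = FALSE
)

# Write a CSV whose only required column is IlmnID.
inacc_path <- file.path(tempdir(), "inaccurate_probes.csv")
write.csv(probes_to_drop, inacc_path, row.names = FALSE)

# Read it back to confirm the column name survived as expected.
inacc_check <- read.csv(inacc_path, stringsAsFactors = FALSE)
print(inacc_check)
```

The resulting path would then be supplied through the `file_inaccuracies` argument.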
Outputs
Everything is written to output_dir (default: working directory):
output_dir/
├── phenotype_table.csv # per-sample metadata (+ optional sex, PCs, etc.)
├── filtered_hmC.csv # long-format 5hmC after variance filtering
├── cell_props.png # optional barplot of blood-cell composition
├── explained_variance.png # Explained variance
├── Hydroxymethylation Density by Sample.png # optional 5hmC densities
└── SVDsummary.pdf # created when plotSVD = TRUE
The function returns an invisible list:
- phenotype_df_bs – BS sample metadata
- filtered_hmC – long-format 5hmC
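Because the tables are also written to disk, results from an earlier run can be reloaded without repeating the preprocessing. The sketch below assumes the two CSV file names listed above and an output directory called `results`; the helper `read_if_present` is illustrative and not part of the package.

```r
# Hypothetical helper: read a CSV from the output directory,
# or return NULL when the file has not been written.
read_if_present <- function(output_dir, file_name) {
  path <- file.path(output_dir, file_name)
  if (file.exists(path)) {
    read.csv(path, stringsAsFactors = FALSE)
  } else {
    NULL
  }
}

output_dir <- "results"  # assumed location of a previous run
phenotype <- read_if_present(output_dir, "phenotype_table.csv")
hmc_long  <- read_if_present(output_dir, "filtered_hmC.csv")

if (is.null(phenotype) || is.null(hmc_long)) {
  message("One or both output tables were not found in '", output_dir, "'.")
} else {
  str(phenotype)
  str(hmc_long)
}
```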
Core Workflow (14 Steps)
- Read and validate metadata
- Read OxBS iDAT
- Read BS iDAT
- (Optional) predict sex
- Normalise (NOOB / FunNorm / Raw)
- ChAMP filter - BS
- ChAMP filter - OxBS
- Remove inaccurate probes
- Build phenotype dataframe for BS
- (Optional) estimate cell proportions
- (Optional) predict smoking score
- (Optional) predict DNAm age (Horvath)
- Estimate 5hmC via MLML2R
- Low-variance filter & write outputs
Each step frees memory with rm(); gc().
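The final step of the list above, low-variance filtering followed by a long-format table, can be sketched in base R as shown here. This is an illustration under stated assumptions rather than the package's code: it assumes probes are kept when their variance is strictly greater than the threshold, and the column names `probe`, `sample` and `hmC` are made up.

```r
set.seed(1)

# Toy 5hmC matrix (probes x samples).
hmc <- matrix(
  runif(12, min = 0, max = 0.2), nrow = 4,
  dimnames = list(paste0("probe", 1:4), paste0("S0", 1:3))
)
hmc["probe2", ] <- 0  # a probe with no variation across samples

# Per-probe variance across samples.
threshold <- 0  # mirrors the default of low_variance_threshold_hmc
probe_var <- apply(hmc, 1, var)

# Assumption: keep probes whose variance is strictly above the threshold.
hmc_filtered <- hmc[probe_var > threshold, , drop = FALSE]

# Reshape to long format: one row per probe/sample pair.
hmc_long <- as.data.frame(as.table(hmc_filtered), responseName = "hmC")
names(hmc_long) <- c("probe", "sample", "hmC")

# Free intermediate objects, as each workflow step does.
rm(hmc, probe_var)
invisible(gc())

head(hmc_long)
```

With a threshold of 0 and a strict comparison, only probes that are constant across all samples are dropped.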
Minimal Example
```r
library(HydroxymethylateR)

results <- preprocess_hydroxymethylation_data(
  ox_file = "metadataoxbs.csv",
  bs_file = "metadatabs.csv",
  output_dir = "results"
)

# Access outputs
head(results$phenotype_df_bs)
head(results$filtered_hmC)
```
Required Packages
The following R packages (from CRAN, Bioconductor, and GitHub) are required:
CRAN Packages:
- viridis, ggplot2, reshape2, MLML2R
Bioconductor Packages:
- FlowSorted.Blood.EPIC, sesame, wateRmelon, minfi, ChAMP
GitHub Packages:
- EpiSmokEr
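To see which of these dependencies are still missing on a given machine, a check along the following lines may help. It uses base R only and installs nothing; the variable names are illustrative.

```r
# All packages named in this section.
required_pkgs <- c(
  "viridis", "ggplot2", "reshape2",
  "FlowSorted.Blood.EPIC", "sesame", "wateRmelon",
  "MLML2R", "EpiSmokEr", "minfi", "ChAMP"
)

# Compare against the packages visible in the current library paths.
installed_pkgs <- rownames(installed.packages())
missing_pkgs <- setdiff(required_pkgs, installed_pkgs)

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```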
Installation Instructions
To install this workflow:
```r
devtools::install_github("eirinisparaki/HydroxymethylateR")
```
📖 Citation
If you use HydroxymethylateR in your research, please cite:
- This GitHub repository: eirinisparaki/HydroxymethylateR
For citations of the R packages used in this project, please refer to CITATIONS.md.
Contact
For questions or collaborations, feel free to contact:
Eirini Sparaki
📧 sparakiirini@gmail.com
🔗 https://github.com/eirinisparaki
Owner
- Name: Eirini
- Login: eirinisparaki
- Kind: user
- Repositories: 1
- Profile: https://github.com/eirinisparaki
Citation (CITATIONS.md)
### Package Citations

This project makes use of the following R packages. Please cite them as follows:

| Package | Citation |
| ----------------------- | -------- |
| `viridis` | Garnier S. *Colorblind-Friendly Color Maps for R*. [Link](https://sjmgarnier.github.io/viridis/) |
| `ggplot2` | *ggplot2: A system for declaratively creating graphics, based on “The Grammar of Graphics”*. [Link](https://ggplot2.tidyverse.org) |
| `reshape2` | Wickham H. *Reshaping data with the reshape package*. J. Stat. Softw., 21(12), 1–20 (2007). [DOI](https://doi.org/10.18637/jss.v021.i12) |
| `FlowSorted.Blood.EPIC` | Salas LA et al. *An optimized library for reference-based deconvolution...*. Genome Biol., 19(1), 64 (2018). [DOI](https://doi.org/10.1186/s13059-018-1448-7) |
| `sesame` | Zhou W et al. *SeSAMe...*. Nucleic Acids Res., 46(20), e123 (2018). [DOI](https://doi.org/10.1093/nar/gky691) |
| `wateRmelon` | Pidsley R et al. *A data-driven approach to preprocessing...*. BMC Genomics, 14:293 (2013). [DOI](https://doi.org/10.1186/1471-2164-14-293) |
| `MLML2R` | *Maximum Likelihood Estimation of DNA Methylation and Hydroxymethylation Proportions*. [CRAN](https://cran.r-project.org/package=MLML2R) |
| `EpiSmokEr` | Bollepalli S. *EpiSmokEr: Epigenetic Smoking status Estimator*. [GitHub](https://github.com/sailalithabollepalli/EpiSmokEr) |
| `minfi` | Aryee MJ et al. *Minfi...*. Bioinformatics, 30(10), 1363–1369 (2014). [DOI](https://doi.org/10.1093/bioinformatics/btu049) |
| `ChAMP` | Morris TJ et al. *ChAMP...*. Bioinformatics, 30(3), 428–430 (2014). [DOI](https://doi.org/10.1093/bioinformatics/btt684) |
GitHub Events
Total
- Push event: 3
- Fork event: 1
Last Year
- Push event: 3
- Fork event: 1