preciseTAD
preciseTAD: A machine learning framework for precise 3D boundary prediction at base-level resolution
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 4 committers (25.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.2%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: dozmorovlab
- License: other
- Language: R
- Default Branch: master
- Homepage: https://dozmorovlab.github.io/preciseTAD/
- Size: 4.57 MB
Statistics
- Stars: 8
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Created over 5 years ago · Last pushed almost 4 years ago
README.Rmd
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# preciseTAD: A transfer learning framework for 3D domain boundary prediction at base-pair resolution
[Lifecycle: maturing](https://www.tidyverse.org/lifecycle/#maturing)
**preciseTAD: A transfer learning framework for 3D domain boundary prediction at base-pair resolution** Spiro C. Stilianoudakis, Maggie A. Marshall, Mikhail G. Dozmorov. bioRxiv 2020.09.03.282186; doi: https://doi.org/10.1101/2020.09.03.282186
Predicted preciseTAD boundary points (PTBPs) and regions (PTBRs) for 60 cell lines are available [here](https://drive.google.com/drive/folders/15Rc6PhrrBjThwE-5dSyNX-ILELaUu6uG?usp=sharing).
## Overview
preciseTAD provides functions to predict the location of boundaries of topologically associating domains (TADs) and chromatin loops at base-level resolution. As input, it takes BED-formatted genomic coordinates of domain boundaries detected from low-resolution Hi-C data and coordinates of high-resolution genomic annotations from ENCODE or other consortia. preciseTAD applies several feature engineering strategies, uses resampling techniques to address class imbalance, and trains an optimized random forest model to predict low-resolution domain boundaries. The trained model is then applied at base-level resolution to predict the probability that each base is a boundary. Density-based clustering and scalable partitioning techniques are used to detect precise boundary regions and summit points. Compared with low-resolution boundaries, preciseTAD boundaries are highly enriched for CTCF, RAD21, SMC3, and ZNF143 signal and are more conserved across cell lines. A model pre-trained on one cell line can accurately predict boundaries in another cell line using CTCF, RAD21, SMC3, and ZNF143 annotation data for that cell line.
The main functions (in the order they are used) are:
- `extractBoundaries()` accepts a 3-column data.frame or matrix with the chromosomal coordinates of user-defined domains and outputs the unique boundaries. The second and third columns are the domain anchor centers.
- `bedToGRangesList()` accepts a filepath containing BED files representing the coordinates of ChIP-seq defined functional genomic annotations.
- `createTADdata()` accepts a set of unique boundaries and genomic annotations derived from `extractBoundaries()` and `bedToGRangesList()`, respectively, and creates the data matrix used to build a model that predicts domain boundary regions.
- `TADrandomForest()` is a wrapper around the `randomForest` package that fits a random forest binary classifier to domain boundary data.
- `preciseTAD()` leverages a domain boundary prediction model (i.e., random forest) and density-based clustering to predict TAD boundary coordinates at base-level resolution.
## Installation
`preciseTAD` can be installed from Bioconductor:
```{r}
# if (!requireNamespace("BiocManager", quietly=TRUE))
# install.packages("BiocManager")
# BiocManager::install("preciseTAD")
library(preciseTAD)
```
The latest version of `preciseTAD` can be installed directly from GitHub:
```{r eval=FALSE}
devtools::install_github("dozmorovlab/preciseTAD", build_vignettes = TRUE)
library(preciseTAD)
```
## Usage
Below is a brief workflow showing how to train `preciseTAD` on binned data from CHR1 and obtain precise base-pair coordinates of TAD boundaries for a 10 Mb section of CHR22. For more details, including an example of how to use the pre-trained model, see `vignette("preciseTAD")`.
First, you need to obtain called TAD boundaries using an established TAD-caller. As an example, consider the [Arrowhead](https://github.com/aidenlab/juicer/wiki/Arrowhead) TAD-caller, part of the juicer suite of tools developed by the Aiden Lab. Arrowhead outputs a .txt file with the chromosomal start and end coordinates of the called TADs. The package includes Arrowhead TADs for GM12878 at 5 kb resolution.
```{r}
data("arrowhead_gm12878_5kb")
head(arrowhead_gm12878_5kb)
```
The unique boundaries for CHR1 and CHR22 can be extracted as:
```{r}
bounds <- extractBoundaries(domains.mat = arrowhead_gm12878_5kb, filter = FALSE, CHR = c("CHR1", "CHR22"), resolution = 5000)
bounds
```
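To make the notion of "unique boundaries" concrete, here is a minimal base-R sketch. It assumes a 3-column table of chromosome, start, and end coordinates; `unique_boundaries()` and the toy coordinates are hypothetical and are not the package's `extractBoundaries()` implementation.

```r
# Hypothetical helper (not part of preciseTAD): collect the unique
# boundary coordinates implied by a table of domains.
unique_boundaries <- function(domains, chrs) {
  # Normalize the column names of the 3-column input
  names(domains)[1:3] <- c("chr", "start", "end")
  domains <- domains[domains$chr %in% chrs, ]
  # Every domain contributes its start and its end as a boundary
  bounds <- data.frame(
    chr = rep(domains$chr, times = 2),
    pos = c(domains$start, domains$end),
    stringsAsFactors = FALSE
  )
  # Adjacent domains share a boundary, so drop duplicates and sort
  bounds <- unique(bounds)
  bounds <- bounds[order(bounds$chr, bounds$pos), ]
  rownames(bounds) <- NULL
  bounds
}

# Toy domains: two adjacent TADs on chr1 and one on chr22
toy_domains <- data.frame(
  chr   = c("chr1", "chr1", "chr22"),
  start = c(100000, 250000, 500000),
  end   = c(250000, 400000, 650000)
)
toy_bounds <- unique_boundaries(toy_domains, chrs = c("chr1", "chr22"))
toy_bounds
```

The two chr1 domains share the coordinate 250,000, so three domains yield five boundaries rather than six.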
Next, you will need to download cell line-specific ChIP-seq data in the form of BED files from [ENCODE](https://www.encodeproject.org/chip-seq-matrix/?type=Experiment&replicates.library.biosample.donor.organism.scientific_name=Homo%20sapiens&assay_title=TF%20ChIP-seq&status=released). Once you have downloaded your preferred set of functional genomic annotations, store them in a single directory. These files can then be converted into a GRangesList object and used for downstream modeling with the following command:
```{r eval=FALSE}
path <- "pathToBEDfiles"
tfbsList <- bedToGRangesList(filepath = path, bedList = NULL, bedNames = NULL, pattern = "*.bed", signal = 4)
```
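For readers who want to see what this step amounts to, the following base-R sketch reads every BED file in a directory into a named list, one element per annotation. It is only a stand-in for a GRangesList: `read_bed_dir()`, the toy files, and the assumption that column 4 holds the signal value are all illustrative and not taken from the package.

```r
# Hypothetical base-R stand-in for a GRangesList: a named list of data.frames,
# one per BED file. Assumes tab-delimited BED with a signal value in column 4.
read_bed_dir <- function(path, signal_col = 4) {
  files <- list.files(path, pattern = "\\.bed$", full.names = TRUE)
  beds <- lapply(files, function(f) {
    bed <- read.table(f, sep = "\t", header = FALSE, stringsAsFactors = FALSE)
    data.frame(chr = bed[[1]], start = bed[[2]], end = bed[[3]],
               signal = bed[[signal_col]], stringsAsFactors = FALSE)
  })
  # Name each element after its file, without the .bed extension
  names(beds) <- sub("\\.bed$", "", basename(files))
  beds
}

# Write two toy BED files to a temporary directory and read them back
bed_dir <- file.path(tempdir(), "toy_beds")
dir.create(bed_dir, showWarnings = FALSE)
writeLines(c("chr22\t100\t200\t5.1", "chr22\t900\t1000\t2.3"),
           file.path(bed_dir, "CTCF.bed"))
writeLines("chr22\t400\t450\t7.0", file.path(bed_dir, "RAD21.bed"))
toy_annotations <- read_bed_dir(bed_dir)
names(toy_annotations)
```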
As an example, we have already provided a GRangesList object with a variety of transcription factor binding sites specific to the GM12878 cell line. Once you load it in, you can see the list of transcription factors using the following:
```{r}
data("tfbsList")
names(tfbsList)
```
For the purposes of this example, let's focus only on CTCF, RAD21, SMC3, and ZNF143 transcription factors.
```{r}
tfbsList_filt <- tfbsList[names(tfbsList) %in% c("Gm12878-Ctcf-Broad", "Gm12878-Rad21-Haib", "Gm12878-Smc3-Sydh", "Gm12878-Znf143-Sydh")]
```
Now, using the “ground-truth” boundaries and the TFBS selected above, we can build the data matrix that will be used for predictive modeling. The following command creates the training data from CHR1 and reserves CHR22 as the testing data. We specify 5 kb genomic bins (to match the resolution used to call the original TADs), a distance-type feature space, and random under-sampling (RUS) applied to the training data only.
```{r}
set.seed(123)
tadData <- createTADdata(bounds.GR = bounds,
                         resolution = 5000,
                         genomicElements.GR = tfbsList_filt,
                         featureType = "distance",
                         resampling = "rus",
                         trainCHR = "CHR1",
                         predictCHR = "CHR22"
                         )
```
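The two ideas in this step, a distance-type feature and random under-sampling, can be illustrated on toy data in base R. This is a sketch under stated assumptions, not the internals of `createTADdata()`: the peak positions are invented, and the `log2(distance + 1)` transform is an assumption about how a distance feature might be scaled.

```r
set.seed(123)

# Toy 5 kb bins on one chromosome; y = 1 marks bins that contain a boundary
bin_size <- 5000
bins <- data.frame(start = seq(0, 995000, by = bin_size))
bins$center <- bins$start + bin_size / 2
bins$y <- 0
bins$y[c(20, 75, 130, 180)] <- 1   # four boundary bins out of 200

# Hypothetical annotation peak centers (e.g. CTCF sites)
peak_centers <- c(101000, 372500, 648000, 903000)

# Distance-type feature: log2 distance from each bin center to the nearest peak
# (the +1 offset is an assumption that avoids log2(0))
nearest <- sapply(bins$center, function(x) min(abs(x - peak_centers)))
bins$ctcf_dist <- log2(nearest + 1)

# Random under-sampling: keep every boundary bin and an equally sized
# random subset of the non-boundary bins
pos <- which(bins$y == 1)
neg <- which(bins$y == 0)
balanced <- bins[c(pos, sample(neg, length(pos))), ]
table(balanced$y)
```

Under-sampling is applied to the training chromosome only, so the held-out chromosome keeps its natural class imbalance and the test metrics remain realistic.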
We can now implement our machine learning algorithm of choice to predict TAD-boundary regions. Here, we opt for the random forest algorithm.
```{r}
set.seed(123)
tadModel <- TADrandomForest(trainData = tadData[[1]],
                            testData = tadData[[2]],
                            tuneParams = list(mtry = 2, ntree = 500, nodesize = 1),
                            cvFolds = 3,
                            cvMetric = "Accuracy",
                            verbose = TRUE,
                            model = TRUE,
                            importances = TRUE,
                            impMeasure = "MDA",
                            performances = TRUE)
# The model itself
tadModel[[1]]
# Variable importances (mean decrease in accuracy)
tadModel[[2]]
# Model performance metrics
tadModel[[3]]
```
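When reading the performance metrics, keep in mind that the test chromosome is heavily imbalanced, so plain accuracy can look good even when few boundaries are recovered. The base-R sketch below computes a handful of standard metrics from toy labels; the labels are invented, and it makes no claim about which metrics `TADrandomForest()` reports or how it computes them.

```r
# Toy labels: 1 = boundary bin, 0 = non-boundary bin
actual    <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 0, 0, 0, 0, 0, 0, 1, 0)

# Confusion-matrix cells
tp <- sum(predicted == 1 & actual == 1)
fn <- sum(predicted == 0 & actual == 1)
fp <- sum(predicted == 1 & actual == 0)
tn <- sum(predicted == 0 & actual == 0)

sensitivity <- tp / (tp + fn)        # share of true boundaries recovered
specificity <- tn / (tn + fp)        # share of non-boundaries correctly rejected
accuracy <- (tp + tn) / length(actual)
# Balanced accuracy weights both classes equally, regardless of their sizes
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced_accuracy = balanced_accuracy), 3)
```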
Lastly, we take our TAD-boundary region predictive model and use it to make predictions on the 10 Mb region CHR22:35,000,000-45,000,000.
```{r}
# Run preciseTAD
set.seed(123)
pt <- preciseTAD(genomicElements.GR = tfbsList_filt,
                 featureType = "distance",
                 CHR = "CHR22",
                 chromCoords = list(35000000, 45000000),
                 tadModel = tadModel[[1]],
                 threshold = 1.0,
                 verbose = FALSE,
                 parallel = 2,
                 DBSCAN_params = list(30000, 100),
                 slope = 5000,
                 genome = "hg19")
# View preciseTAD predicted boundary coordinates between CHR22:35mb-45mb
pt[[2]]
```
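The step from per-base probabilities to boundary regions and summit points can be pictured with a simplified base-R sketch. It deliberately swaps in a 1-D gap-based grouping for DBSCAN and a per-cluster median for the summit selection, so it illustrates the idea rather than the package's algorithm; the probability curve, `threshold`, and `eps` values are all invented for the example.

```r
# Toy base-level probabilities over a 2,000 bp window with two boundary-like peaks
positions <- 1:2000
prob <- pmin(1, exp(-((positions - 400) / 30)^2) +
                exp(-((positions - 1500) / 40)^2))

threshold <- 0.9   # keep bases whose predicted probability is at least this high
eps <- 100         # retained bases within eps of a neighbour join the same cluster

hits <- positions[prob >= threshold]

# Single-linkage grouping in 1-D: start a new cluster whenever the gap
# between consecutive retained bases exceeds eps
cluster_id <- cumsum(c(TRUE, diff(hits) > eps))

# One row per cluster: the region it spans and a representative summit base
regions <- data.frame(
  start  = as.numeric(tapply(hits, cluster_id, min)),
  end    = as.numeric(tapply(hits, cluster_id, max)),
  summit = as.numeric(tapply(hits, cluster_id, median))
)
regions
```

Each row plays the role of a predicted boundary region with its summit point. A true density-based method additionally requires a minimum number of points per cluster, which filters out isolated high-probability bases.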
Owner
- Name: Dozmorov Lab
- Login: dozmorovlab
- Kind: organization
- Website: https://dozmorovlab.github.io/
- Twitter: mikhaildozmorov
- Repositories: 5
- Profile: https://github.com/dozmorovlab
Genomics, bioinformatics, computational biology, 3D genome, Hi-C
Citation (CITATION.cff)
# YAML 1.2
---
authors:
cff-version: "1.1.0"
message: |
"If you use this software, please cite preciseTAD: A transfer learning framework for 3D domain boundary prediction at base-pair resolution
Spiro C. Stilianoudakis, Maggie A. Marshall, Mikhail G. Dozmorov
bioRxiv 2020.09.03.282186; doi: https://doi.org/10.1101/2020.09.03.282186"
...
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Committers
Last synced: over 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| stilianoudakis | s****c@m****u | 80 |
| Mikhail Dozmorov | m****v@g****m | 69 |
| Nitesh Turaga | n****a@g****m | 6 |
| Stilianoudakis | s****c@p****m | 1 |
Committer Domains (Top 20 + Academic)
pg.com: 1
mymail.vcu.edu: 1
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Packages
- Total packages: 1
- Total downloads: 8,106 (bioconductor)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 5
- Total maintainers: 1
bioconductor.org: preciseTAD
preciseTAD: A machine learning framework for precise TAD boundary prediction
- Homepage: https://github.com/dozmorovlab/preciseTAD
- Documentation: https://bioconductor.org/packages/release/bioc/vignettes/preciseTAD/inst/doc/preciseTAD.pdf
- License: MIT + file LICENSE
- Latest release: 1.18.0 (published 10 months ago)
Rankings
Dependent repos count: 0.0%
Dependent packages count: 0.0%
Average: 27.8%
Downloads: 83.3%
Last synced: 6 months ago