regulatorygenomerankingmetrics

https://github.com/aj95b/regulatorygenomerankingmetrics


Repository

Basic Info

Host: GitHub
Owner: aj95b
Language: Jupyter Notebook
Default Branch: master
Size: 2.81 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created almost 11 years ago · Last pushed over 2 years ago


regulatorygenomeranking_metrics

Regulatory elements of the human genome correspond to regions approximately 100-200 base pairs long where the chromatin is more prone to cleavage by the DNase I enzyme. Such a site is referred to as a DNase I Hypersensitive Site (DHS). The underlying data was curated as part of https://www.nature.com/articles/s41586-020-2559-3

information_metrics

To sample interesting regions, we use the following metrics to measure the information content and quality of individual DHSs:

1. Entropy:

the randomness of a DHS

2. Average Normalized Signal:

a comparable way to measure the accessibility levels of DHSs

3. Signal to Noise Ratio (SNR):

the ratio of highly represented biosamples in a DHS to lowly represented ones

4. Mean Cosine Similarity (MCS):

measure of similarity among the most highly expressed biosamples of a DHS

5. Concordance Metric:

the only metric that uses the reduced representation of the data obtained with NMF. It is computed as the average similarity between the classification of a DHS into a cell type and the classification of the corresponding biosamples that are highly represented in it. We use this metric as the ground truth to evaluate the others. Co-ranking just the SNR and MCS produced the best results.
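The entropy, SNR, and MCS metrics above can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes a non-negative `signal` matrix with one row per DHS and one column per biosample, and the exact definitions (top-`k` split for SNR, cosine similarity between biosample columns for MCS, mean rank for co-ranking) are assumptions made for the example. All function names are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata


def entropy(signal):
    """Shannon entropy (bits) of each DHS's signal distribution over biosamples.

    Assumes every row has a positive sum.
    """
    signal = np.asarray(signal, dtype=float)
    p = signal / signal.sum(axis=1, keepdims=True)
    # Treat 0 * log2(0) as 0 by only taking the log where p > 0.
    logp = np.log2(p, where=p > 0, out=np.zeros_like(p))
    return -(p * logp).sum(axis=1)


def snr(signal, k=3, eps=1e-9):
    """Assumed SNR: mean of the k strongest biosamples over the mean of the rest."""
    s = np.sort(np.asarray(signal, dtype=float), axis=1)[:, ::-1]  # descending
    return s[:, :k].mean(axis=1) / (s[:, k:].mean(axis=1) + eps)


def mean_cosine_similarity(signal, k=3):
    """Assumed MCS: mean pairwise cosine similarity between the genome-wide
    profiles (columns) of the k strongest biosamples of each DHS."""
    signal = np.asarray(signal, dtype=float)
    cols = signal / np.linalg.norm(signal, axis=0, keepdims=True)
    cos = cols.T @ cols  # biosample x biosample similarity matrix
    top = np.argsort(signal, axis=1)[:, -k:]  # k strongest biosamples per DHS
    upper = np.triu_indices(k, 1)  # each unordered pair once
    return np.array([cos[np.ix_(t, t)][upper].mean() for t in top])


def co_rank(*scores):
    """Assumed co-ranking: mean of per-metric ranks, rank 1 = highest score."""
    return np.mean([rankdata(-np.asarray(s)) for s in scores], axis=0)
```

With these helpers, `co_rank(snr(signal), mean_cosine_similarity(signal))` gives one combined rank per DHS, where smaller values mean the DHS ranks highly on both metrics.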

metric_exploration

We cluster the most highly ranked DHSs for each metric and order them optimally to create a heatmap. The heatmap visualizes the effect of ranking by each metric and shows whether the DHSs picked by a given metric have any cell-type specificity.
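One way to obtain such an ordering is hierarchical clustering with optimal leaf ordering, as provided by SciPy. The sketch below assumes a DHS-by-biosample `signal` matrix and a per-DHS `scores` vector from one of the metrics; the function name, the linkage method, and the distance metric are illustrative choices, not taken from the notebooks.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage


def top_ranked_heatmap_matrix(signal, scores, n_top=1000,
                              method="average", metric="euclidean"):
    """Select the n_top highest-scoring DHSs and order their rows for a heatmap.

    Returns the reordered sub-matrix and the original row indices in that order.
    """
    signal = np.asarray(signal, dtype=float)
    top = np.argsort(scores)[::-1][:n_top]  # indices of the highest scores
    sub = signal[top]
    # optimal_ordering=True reorders leaves so adjacent rows are as similar
    # as possible without changing the tree itself.
    tree = linkage(sub, method=method, metric=metric, optimal_ordering=True)
    order = leaves_list(tree)
    return sub[order], top[order]
```

The returned matrix can be passed directly to a plotting call such as `matplotlib.pyplot.imshow(matrix, aspect="auto")`, with columns grouped by cell type to make any specificity visible.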

metricevaluationand_comparison

We compare the ranked lists from the various metrics and compute the similarity between rankings using the Fisher exact test statistic and Rank-Biased Overlap (RBO).
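Both comparisons can be sketched in a few lines. The RBO below is the simple truncated form, which for two identical lists of depth `k` evaluates to `1 - p**k` rather than 1; the notebooks may use an extrapolated variant. The Fisher test is applied to the overlap of the two top-`k` sets, which is one plausible setup and an assumption here. Function names and the default `p` are hypothetical.

```python
from scipy.stats import fisher_exact


def rbo_truncated(a, b, p=0.9, depth=None):
    """Truncated Rank-Biased Overlap of two rankings with unique items."""
    limit = min(len(a), len(b))
    depth = limit if depth is None else min(depth, limit)
    seen_a, seen_b = set(), set()
    overlap, total = 0, 0.0
    for d in range(1, depth + 1):
        x, y = a[d - 1], b[d - 1]
        if x == y:
            overlap += 1
        else:
            # A new item overlaps if the other list already contained it.
            overlap += (x in seen_b) + (y in seen_a)
        seen_a.add(x)
        seen_b.add(y)
        total += p ** (d - 1) * overlap / d  # weighted agreement at depth d
    return (1 - p) * total


def topk_overlap_fisher(a, b, k, universe_size):
    """One-sided Fisher exact test on the overlap of two top-k sets."""
    top_a, top_b = set(a[:k]), set(b[:k])
    both = len(top_a & top_b)
    table = [
        [both, len(top_a) - both],
        [len(top_b) - both, universe_size - len(top_a | top_b)],
    ]
    _, p_value = fisher_exact(table, alternative="greater")
    return both, p_value
```

RBO weights agreement near the top of the lists more heavily as `p` decreases, while the Fisher test only asks whether the top-`k` overlap is larger than expected by chance.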

25kbdhsregions

As a proof of concept for measuring information content at scale with the ranking metrics above, we divided the whole human genome into 25kb regions. The regions come from chromatin states data that use an information-theoretic metric to extract surprisal scores along the entire genome. We ranked the resulting approximately 100,000 regions based on:

1. Their significance scores, obtained from the mean co-ranks of the DHSs in the region. The co-ranks are based on the information metrics described above, and the Central Limit Theorem is then used to ascertain their significance.

2. The homogeneity of enrichment of the signal from various NMF components (cell types).

3. A co-ranking of the two scores above, which gives a ranking of the 25kb regions of DHS data that are significantly informative in terms of their constituent DHSs.
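The Central Limit Theorem step for the significance score can be illustrated as follows. The sketch assumes that, under the null hypothesis, the co-ranks of the DHSs in a region behave like independent draws from the uniform distribution on `1..n_total`, and that rank 1 is the most informative; both are assumptions for the example, and the function name is hypothetical.

```python
import numpy as np
from scipy.stats import norm


def region_significance(ranks, n_total):
    """Z-score and one-sided p-value for the mean co-rank of a region's DHSs.

    ranks   : co-ranks of the DHSs that fall inside the region (1 = best)
    n_total : total number of ranked DHSs genome-wide
    """
    ranks = np.asarray(ranks, dtype=float)
    n = ranks.size
    mu = (n_total + 1) / 2                   # mean of a uniform rank
    sigma = np.sqrt((n_total ** 2 - 1) / 12)  # std of a uniform rank
    # By the CLT the mean of n ranks is roughly normal with std sigma/sqrt(n).
    # The sign is flipped so that better (smaller) ranks give positive z.
    z = (mu - ranks.mean()) / (sigma / np.sqrt(n))
    return z, norm.sf(z)
```

A region whose DHSs all sit near the top of the genome-wide ranking gets a large positive `z` and a small p-value, while a region with typical ranks gets `z` near 0.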

scalemaxsignificance

rankingroiatscaleAJ20221122.ipynb: Given a genomic region of interest, find the scale at which its information content is maximized.
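The idea of scanning scales can be sketched as below. This is not the notebook's procedure: it assumes that "information content" at a given scale is a CLT-based z-score of the mean co-rank of the DHSs inside a window centered on the region of interest, and it simply picks the window width with the largest z-score. The function name, the inputs, and the candidate `widths` are all hypothetical.

```python
import numpy as np


def best_scale(center, positions, ranks, n_total, widths):
    """Return (width, z) for the window width that maximizes the z-score.

    center    : genomic coordinate of the region of interest
    positions : coordinates of all DHSs on the same chromosome
    ranks     : co-rank of each DHS (1 = best), aligned with positions
    n_total   : total number of ranked DHSs genome-wide
    widths    : candidate window widths in base pairs
    """
    positions = np.asarray(positions)
    ranks = np.asarray(ranks, dtype=float)
    mu = (n_total + 1) / 2
    sigma = np.sqrt((n_total ** 2 - 1) / 12)
    best_width, best_z = None, -np.inf
    for width in widths:
        inside = np.abs(positions - center) <= width / 2
        n = int(inside.sum())
        if n == 0:
            continue  # no DHSs at this scale, nothing to score
        z = (mu - ranks[inside].mean()) / (sigma / np.sqrt(n))
        if z > best_z:
            best_width, best_z = width, z
    return best_width, best_z
```

Widening the window adds more DHSs, which shrinks the standard error, but it also dilutes the mean rank once uninformative DHSs are included, so the maximum marks the scale where the two effects balance.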

Owner

Name: Arpita
Login: aj95b
Kind: user

Repositories: 2
Profile: https://github.com/aj95b

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Joshi
    given-names: Arpita
 
title: "Ranking Metrics for Genomic Regions"
#version: 1.2
#doi: 10.5281/zenodo.1234
date-released: 2022-12-25
