regulatory_genome_ranking_metrics
Repository
Basic Info
- Host: GitHub
- Owner: aj95b
- Language: Jupyter Notebook
- Default Branch: master
- Size: 2.81 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
regulatory_genome_ranking_metrics
Regulatory elements of the human genome correspond to regions of approximately 100-200 base pairs where the chromatin is more prone to cleavage by the DNase I enzyme. Such a site is referred to as a DNase I Hypersensitive Site (DHS). The underlying data were curated as part of https://www.nature.com/articles/s41586-020-2559-3
information_metrics
To sample interesting regions, we use metrics that measure the information content and quality of individual DHSs, namely:
1. Entropy:
the randomness of a DHS
2. Average Normalized Signal:
a comparable way to measure the accessibility levels of DHSs
3. Signal to Noise Ratio (SNR):
the ratio of highly represented biosamples in a DHS to lowly represented ones
4. Mean Cosine Similarity (MCS):
a measure of similarity among the most highly represented biosamples of a DHS
5. Concordance Metric:
the only metric that utilizes the reduced representation of the data, obtained using NMF. It is computed as the average similarity between the classification of a DHS into a cell type and the classification of the biosamples that are highly represented in it. We use this metric as the ground truth to evaluate the others. Co-ranking just the SNR and MCS produced the best results.
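As a rough illustration of how some of these metrics could be computed, here is a minimal sketch in plain Python. It is not the code from the notebooks. The function names, the `top_k` cutoff, the choice of "mean of the strongest biosamples over mean of the rest" for SNR, and the averaging of ranks as the co-ranking step are all assumptions made for this example.

```python
import math
from itertools import combinations


def shannon_entropy(signal):
    """Shannon entropy (in bits) of one DHS's signal across biosamples."""
    total = sum(signal)
    if total <= 0:
        return 0.0
    probs = [s / total for s in signal if s > 0]
    return -sum(p * math.log2(p) for p in probs)


def signal_to_noise(signal, top_k=2):
    """Assumed SNR: mean of the top_k strongest biosamples over the mean of the rest.

    top_k must be at least 1.
    """
    ordered = sorted(signal, reverse=True)
    top, rest = ordered[:top_k], ordered[top_k:]
    noise = sum(rest) / len(rest) if rest else 0.0
    if noise == 0:
        return math.inf
    return (sum(top) / len(top)) / noise


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_cosine_similarity(vectors):
    """Average pairwise cosine similarity among biosample vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def rank_descending(scores):
    """Rank 1 is the highest score; ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def co_rank(*score_lists):
    """Assumed co-ranking: the mean of each DHS's ranks under every metric."""
    rank_lists = [rank_descending(scores) for scores in score_lists]
    return [sum(ranks) / len(ranks) for ranks in zip(*rank_lists)]
```

With per-DHS score lists `snr_scores` and `mcs_scores` (hypothetical names), `co_rank(snr_scores, mcs_scores)` returns one combined rank per DHS, where lower values mean the DHS ranks highly under both metrics.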
metric_exploration
We cluster the most highly ranked DHSs by each metric and order them optimally to create a heatmap. The heatmap visualizes the effect of ranking by each metric and shows whether there is any cell-type specificity to the DHSs picked by a given metric.
metric_evaluation_and_comparison
We compare the ranked lists from the various metrics and compute the similarity in ranking using the Fisher exact statistic and Rank Biased Overlap.
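The two comparisons named above can be sketched as follows. This is an illustrative sketch, not the notebook's implementation. It assumes the truncated (non-extrapolated) form of Rank Biased Overlap with persistence parameter `p`, and it assumes that the Fisher exact comparison is a one-sided test on the overlap between the top `k` entries of two rankings drawn from the same universe of DHSs. The function names and parameters are hypothetical.

```python
import math


def rbo_truncated(list_a, list_b, p=0.9, depth=None):
    """Truncated Rank Biased Overlap of two rankings without duplicate items.

    Returns (1 - p) * sum over d of p**(d - 1) * (overlap at depth d) / d,
    which is a lower bound on the full RBO score.
    """
    if depth is None:
        depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    overlap = 0
    score = 0.0
    for d in range(1, depth + 1):
        a, b = list_a[d - 1], list_b[d - 1]
        if a == b:
            overlap += 1
        else:
            # Each new item counts once it has appeared in the other list.
            if a in seen_b:
                overlap += 1
            if b in seen_a:
                overlap += 1
        seen_a.add(a)
        seen_b.add(b)
        score += p ** (d - 1) * overlap / d
    return (1 - p) * score


def top_k_overlap_pvalue(list_a, list_b, k, universe_size):
    """One-sided Fisher exact p-value for the overlap of two top-k sets.

    Assumes both rankings cover the same universe of universe_size items
    and that each list has at least k entries.
    """
    overlap = len(set(list_a[:k]) & set(list_b[:k]))
    total = math.comb(universe_size, k)
    tail = sum(
        math.comb(k, i) * math.comb(universe_size - k, k - i)
        for i in range(overlap, k + 1)
    )
    return overlap, tail / total
```

A smaller `p` in `rbo_truncated` puts more weight on agreement at the very top of the lists, which may suit a setting where only the most highly ranked DHSs are of interest.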
25kb_dhs_regions
As a proof of concept for measuring information content at scale with the ranking metrics above, we divided the whole human genome into 25kb regions. The regions were created from the chromatin states data, which use an information-theoretic metric to extract surprisal scores along the entire genome. We ranked the resulting approximately 100,000 regions based on:
1. Their significance scores, obtained from the mean co-ranks of the DHSs in each region. The co-ranks are based on the information metrics described above, and the Central Limit Theorem is used to ascertain significance.
2. The homogeneity of enrichment of the signal from various NMF components (cell types).
Finally, we co-ranked the two scores above to obtain a ranking of the 25kb regions of DHS data that are significantly informative in terms of their constituent DHSs.
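One way the Central Limit Theorem step could work is sketched below. The null model is an assumption made for this example: each DHS rank in a region is treated as an independent draw from the uniform distribution on 1..N, where N is the total number of ranked DHSs, so the mean of n ranks is approximately normal with mean (N + 1) / 2 and variance (N^2 - 1) / (12 n). The function name and arguments are hypothetical, and the finite-population correction for sampling without replacement is ignored.

```python
import math
from statistics import NormalDist


def region_significance(region_ranks, n_total):
    """Z-score and one-sided p-value for the mean co-rank of a region's DHSs.

    region_ranks: co-ranks (1 = best) of the DHSs that fall in the region.
    n_total: total number of ranked DHSs genome-wide (must exceed 1).
    A strongly negative z means the region holds unusually well-ranked DHSs.
    """
    n = len(region_ranks)
    if n == 0:
        raise ValueError("region contains no DHSs")
    mean_rank = sum(region_ranks) / n
    null_mean = (n_total + 1) / 2
    null_sd = math.sqrt((n_total ** 2 - 1) / (12 * n))
    z = (mean_rank - null_mean) / null_sd
    p_value = NormalDist().cdf(z)  # probability of a mean rank this low or lower
    return z, p_value
```

The normal approximation is weakest for regions with only a handful of DHSs, so p-values for such regions should be read with caution.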
scale_max_significance
rankingroiatscaleAJ20221122.ipynb: Given a genomic region of interest, find the scale at which its information content is maximized.
Owner
- Name: Arpita
- Login: aj95b
- Kind: user
- Repositories: 2
- Profile: https://github.com/aj95b
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Joshi
    given-names: Arpita
title: "Ranking Metrics for Genomic Regions"
#version: 1.2
#doi: 10.5281/zenodo.1234
date-released: 2022-12-25