ngd
Normalized Google Distance between amino acid sequences
Repository
Normalized Google Distance between amino acid sequences
Basic Info
- Host: GitHub
- Owner: christopher-riccardi
- Language: TeX
- Default Branch: main
- Size: 551 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Normalized Google Distance
To the best of my knowledge, Choi and Rashid (2008, see citation file) were the first to integrate the Normalized Google Distance (NGD) into protein sequence comparison.
Given two amino acid sequences $X$ and $Y$, the NGD is calculated using the following formula:
$$
NGD_{X,Y} = \frac{\max\left(\sum W_X, \sum W_Y\right) - \sum \min(W_X, W_Y)}{\left( \sum W_X + \sum W_Y \right) - \min \left( \sum W_X, \sum W_Y\right)}
$$
where $\sum W_X$ and $\sum W_Y$ are the total $k$-mer counts of sequences $X$ and $Y$, respectively; $\sum \min(W_X, W_Y)$ runs over the $k$-mers common to both sequences and adds, for each one, the lesser of its two counts; and $\max\left(\sum W_X, \sum W_Y\right)$ and $\min \left( \sum W_X, \sum W_Y\right)$ are the larger and the smaller of the two total counts. The statistic quantifies the dissimilarity between two sequences based on their $k$-tuple compositions. It is derived from the Google similarity distance, which is used to find semantic similarity in web pages.
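The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's script: it assumes overlapping $k$-mers counted directly on the raw sequence strings, and the names `kmer_counts` and `ngd` are hypothetical helpers introduced only for this example.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers in a sequence (assumed counting scheme)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def ngd(x: str, y: str, k: int = 3) -> float:
    """Normalized Google Distance between two sequences, following the formula above."""
    wx, wy = kmer_counts(x, k), kmer_counts(y, k)
    total_x, total_y = sum(wx.values()), sum(wy.values())
    if total_x == 0 or total_y == 0:
        raise ValueError("both sequences must contain at least k residues")
    # For every k-mer present in both sequences, add the lesser of its two counts.
    shared = sum(min(wx[kmer], wy[kmer]) for kmer in wx.keys() & wy.keys())
    numerator = max(total_x, total_y) - shared
    denominator = (total_x + total_y) - min(total_x, total_y)
    return numerator / denominator


# "ACDEFG" and "ACDEKL" each have four 3-mers and share two of them (ACD, CDE).
print(ngd("ACDEFG", "ACDEKL"))  # 0.5
```

Because $\left(\sum W_X + \sum W_Y\right) - \min\left(\sum W_X, \sum W_Y\right)$ equals $\max\left(\sum W_X, \sum W_Y\right)$, the value always lies between 0 (identical $k$-mer composition) and 1 (no shared $k$-mers).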
Usage:
    python ngd <fasta1> <fasta2> [optional, the k-mer size. Default=3]
I include three test files, s1, s2 and s3. s1 and s2 are more similar to each other than either is to s3.
Owner
- Name: Christopher Riccardi
- Login: christopher-riccardi
- Kind: user
- Location: Los Angeles
- Repositories: 1
- Profile: https://github.com/christopher-riccardi
Evolutionary Biology and Ecology PhD Student, University of Florence (Italy). Currently at Sun Lab (University of Southern California) for 12 months.
Citation (Citation.bib)
@inproceedings{choi_adapting_2008,
title = {Adapting normalized google similarity in protein sequence comparison},
volume = {1},
url = {https://ieeexplore.ieee.org/document/4631601},
doi = {10.1109/ITSIM.2008.4631601},
abstract = {Biological sequence comparison faced various challenges. Although dynamic programming based solution claimed to be the optimal solution for the comparison process, the computation limitation and some fundamental challenges still make it inefficient for mass sequence comparison. Statistical method explores the statistics of sequences by the frequency of the words in the sequence; it provides a comparison solution without loss of statistical information, and also caters some of the fundamental problem in sequence comparison. Normalized Google Distance is a way of finding semantic similarity in web pages, with significant related characteristics; in this research, we propose an algorithm that will integrate Normalized Google Similarity into protein sequence comparison.},
eventtitle = {2008 International Symposium on Information Technology},
pages = {1--5},
booktitle = {2008 International Symposium on Information Technology},
author = {Choi, Lee Jun and Rashid, Nur'Aini Abdul},
urldate = {2024-08-22},
date = {2008-08},
note = {{ISSN}: 2155-899X},
keywords = {Accuracy, Bioinformatics, Biology, Correlation, Distance measurement, Frequency measurement, Proteins},
}