ngd

Normalized Google Distance between amino acid sequences

https://github.com/christopher-riccardi/ngd

Science Score: 31.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found
✓ codemeta.json file: found
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (5.4%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: christopher-riccardi
Language: TeX
Default Branch: main
Size: 551 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created almost 2 years ago · Last pushed almost 2 years ago

Metadata Files

Readme Citation

README.md

Normalized Google Distance

To the best of my knowledge, Choi and Rashid (2008, see citation file) were the first to integrate the Normalized Google Distance (NGD) into protein sequence comparison.

Given two amino acid sequences $X$ and $Y$, the NGD is calculated using the following formula:

$$NGD_{X,Y} = \frac{\max\left(\sum W_X, \sum W_Y\right) - \sum \min(W_X, W_Y)}{\left( \sum W_X + \sum W_Y \right) - \min \left( \sum W_X, \sum W_Y\right)}$$

where $W_X$ and $W_Y$ are the $k$-mer counts of sequences $X$ and $Y$, so $\sum W_X$ and $\sum W_Y$ are the total $k$-mer counts of the two sequences. $\sum \min(W_X, W_Y)$ runs over the $k$-mers common to both sequences and adds, for each one, the lesser of its two counts. $\max\left(\sum W_X, \sum W_Y\right)$ and $\min \left( \sum W_X, \sum W_Y\right)$ are the larger and the smaller of the two totals, respectively. The statistic quantifies the dissimilarity between two sequences based on their $k$-tuple compositions. It is derived from the Google similarity distance, which is used to find semantic similarity in web pages.
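The formula can be sketched in a few lines of Python. This is a minimal illustration, not the repository's own code: the function names `kmer_counts` and `ngd` are hypothetical, and it assumes that $k$-mers are counted with overlap (a sliding window of step 1), which the text above does not specify.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count k-mers in a sequence (assumption: overlapping windows, step 1)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def ngd(seq_x: str, seq_y: str, k: int = 3) -> float:
    """Normalized Google Distance between two sequences, per the formula above."""
    w_x = kmer_counts(seq_x, k)
    w_y = kmer_counts(seq_y, k)

    total_x = sum(w_x.values())  # sum of W_X
    total_y = sum(w_y.values())  # sum of W_Y

    # Counter intersection keeps, for each shared k-mer, the lesser count,
    # so its total is the sum of min(W_X, W_Y).
    shared = sum((w_x & w_y).values())

    denominator = (total_x + total_y) - min(total_x, total_y)
    if denominator == 0:
        raise ValueError("both sequences are shorter than k")
    return (max(total_x, total_y) - shared) / denominator
```

Because $(a + b) - \min(a, b) = \max(a, b)$, the denominator equals the larger total, so the value always lies between 0 (identical $k$-mer compositions) and 1 (no $k$-mer in common). For example, `ngd("ACDEFG", "ACDEWW")` shares two of four 3-mers and evaluates to 0.5.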

Usage: python ngd <fasta1> <fasta2> [optional, the k-mer size. Default=3]
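As a rough idea of how such a command line could be handled, the sketch below reads FASTA records and parses the three arguments. It is an assumption-laden illustration, not the script shipped in the repository: `read_fasta` and `parse_cli` are hypothetical names, and how the actual tool treats files with several records is not stated here.

```python
from typing import Dict, List, Tuple


def read_fasta(path: str) -> Dict[str, str]:
    """Parse a FASTA file into {header: sequence} (hypothetical helper)."""
    records: Dict[str, str] = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts; drop the leading '>'.
                header = line[1:]
                records[header] = ""
            elif header is not None:
                # Sequence lines may be wrapped, so concatenate them.
                records[header] += line
    return records


def parse_cli(argv: List[str]) -> Tuple[str, str, int]:
    """Return (fasta1, fasta2, k) from the arguments after the script name."""
    if len(argv) not in (2, 3):
        raise SystemExit("usage: python ngd <fasta1> <fasta2> [k]")
    # The k-mer size is optional and defaults to 3, as in the usage line.
    k = int(argv[2]) if len(argv) == 3 else 3
    return argv[0], argv[1], k
```

With these helpers, `parse_cli(["s1", "s2"])` would return the two paths with `k` set to the default of 3.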

I include three test files: s1, s2 and s3. s1 and s2 are more similar to each other than either is to s3.

Owner

Name: Christopher Riccardi
Login: christopher-riccardi
Kind: user
Location: Los Angeles

Repositories: 1
Profile: https://github.com/christopher-riccardi

Evolutionary Biology and Ecology PhD Student, University of Florence (Italy). Currently at Sun Lab (University of Southern California) for 12 months.

Citation (Citation.bib)

@inproceedings{choi_adapting_2008,
	title = {Adapting normalized google similarity in protein sequence comparison},
	volume = {1},
	url = {https://ieeexplore.ieee.org/document/4631601},
	doi = {10.1109/ITSIM.2008.4631601},
	abstract = {Biological sequence comparison faced various challenges. Although dynamic programming based solution claimed to be the optimal solution for the comparison process, the computation limitation and some fundamental challenges still make it inefficient for mass sequence comparison. Statistical method explores the statistics of sequences by the frequency of the words in the sequence; it provides a comparison solution without loss of statistical information, and also caters some of the fundamental problem in sequence comparison. Normalized Google Distance is a way of finding semantic similarity in web pages, with significant related characteristics; in this research, we propose an algorithm that will integrate Normalized Google Similarity into protein sequence comparison.},
	eventtitle = {2008 International Symposium on Information Technology},
	pages = {1--5},
	booktitle = {2008 International Symposium on Information Technology},
	author = {Choi, Lee Jun and Rashid, Nur'Aini Abdul},
	urldate = {2024-08-22},
	date = {2008-08},
	note = {{ISSN}: 2155-899X},
	keywords = {Accuracy, Bioinformatics, Biology, Correlation, Distance measurement, Frequency measurement, Proteins},
}
