https://github.com/arvestad/snc
My version of Neighborhood Correlation sequence comparisons
Repository
Basic Info
- Host: GitHub
- Owner: arvestad
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 104 KB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Created about 4 years ago
· Last pushed 11 months ago
README.md
SNC
An attempt at understanding and developing Dannie Durand's Neighborhood Correlation method (see references below) for sequence comparisons.
There are several ideas tested here.
- A pure Python approach, making use of SciPy and NumPy for the main computations.
- We are not assuming all-against-all comparisons. Rather, the input sequences we are interested in, call them Q, are assumed to be compared to a reference database R, which may or may not overlap with Q. In fact, you can have R=Q, but we want to enable the use of a recurrently used reference database, and the goal is to reduce the number of sequence comparisons.
- NC_standalone used a log-transform on E values instead of bitscores. The transform gave improved results. We are trying several other transforms, including a log-transform on "bitscore + 1" to avoid complications when the score/E value is close to zero.
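To make the "bitscore + 1" transform concrete, here is a minimal sketch, not the code in snc. It assumes that the NC score of two sequences is the Pearson correlation of their transformed score profiles against the same reference sequences, as in the references below. The function name nc_score and the example bitscores are made up for illustration.

```python
import numpy as np

def nc_score(scores_a, scores_b):
    """Pearson correlation of two log-transformed score profiles.

    Illustrative only: scores_a[i] and scores_b[i] are the bitscores of two
    query sequences against the same reference sequence i, with 0 for "no hit".
    """
    # log(bitscore + 1) maps a missing hit (score 0) to exactly 0,
    # so no special handling of zeros is needed.
    ta = np.log1p(np.asarray(scores_a, dtype=float))
    tb = np.log1p(np.asarray(scores_b, dtype=float))
    return float(np.corrcoef(ta, tb)[0, 1])

# Hypothetical profiles against five reference sequences.
profile_a = [250.0, 180.0, 0.0, 35.0, 0.0]
profile_b = [240.0, 170.0, 0.0, 40.0, 0.0]
profile_c = [0.0, 0.0, 300.0, 0.0, 120.0]

print(nc_score(profile_a, profile_b))  # similar neighborhoods: close to 1
print(nc_score(profile_a, profile_c))  # disjoint neighborhoods: negative
```

Note that the profiles only need scores against R, which is why no all-against-all comparison within Q is required.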
The code for snc is way easier to understand and modify than the code for NC_standalone (well, I
am biased), but snc is also far slower.
Install
For experimentation:
- Download this repo
- Ensure you have Python 3.8(?) and SciPy version 1.8.0 or later installed.
- Install the Python module flit (e.g., pip install flit)
- Run flit install
You should then be able to run snc similarities.tab on a file produced by Diamond or BLAST
using the option --outfmt 6.
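If you want to inspect such a file yourself, the sketch below reads the default 12-column layout of --outfmt 6, where column 1 is the query, column 2 the subject, and column 12 the bitscore. The function read_m8 and the two example lines are hypothetical and are not part of snc.

```python
import io

def read_m8(handle):
    """Yield (query, subject, bitscore) from tabular BLAST/Diamond output.

    Assumes the default 12 columns of --outfmt 6:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines (e.g., from commented output formats).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        yield fields[0], fields[1], float(fields[11])

# Two made-up hits standing in for a real similarities.tab file.
example = io.StringIO(
    "q1\tr1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t380\n"
    "q1\tr2\t45.0\t180\t99\t2\t5\t184\t10\t189\t2e-20\t85.5\n"
)
hits = list(read_m8(example))
print(hits)
```

For a real file, pass an open file object instead of the io.StringIO stand-in.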
Using snc
Notice the options!
- -3: Three-column input. Default is BLAST's m8 format.
- -c x: Set the "consideration threshold" to x. If two sequences a and b both have a similarity score of x or higher to a sequence r, then NC will be calculated for a and b. This has a big effect on runtime! The default is 30, expecting bitscores to be used. That is a low threshold, in my opinion, but I have not benchmarked this at all.
- -ci: Compute and output a confidence interval for your NC scores. This is only meaningful on smaller datasets, because on proteome-scale computations the NC score is relatively well estimated. The methodological differences with NC_standalone cause larger differences than the data uncertainty, it seems.
- -a: Don't bother, it is not implemented yet. The idea was to mimic NC_standalone with this option, but the NC statistics have not been adapted to that option.
- -t: This threshold applies purely to output, so it can be used as a means of reducing output space.
- -v: Output some progress information.
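The effect of the consideration threshold can be illustrated with a small sketch. This is one way to read the rule described for -c, not the actual implementation in snc. The function candidate_pairs, the sequence names, and the scores are made up.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(hits, threshold=30.0):
    """Return the pairs (a, b) that share at least one reference sequence r
    to which both have a score of at least `threshold`.

    `hits` is an iterable of (query, reference, score) triples.
    """
    # For each reference sequence, collect the queries that hit it well enough.
    by_reference = defaultdict(set)
    for query, reference, score in hits:
        if score >= threshold:
            by_reference[reference].add(query)

    # Every pair of queries sharing a reference becomes a candidate.
    # Sorting makes each pair canonical, so duplicates collapse in the set.
    pairs = set()
    for queries in by_reference.values():
        pairs.update(combinations(sorted(queries), 2))
    return pairs

# Hypothetical hits: c's score to r1 is below the default threshold of 30.
hits = [
    ("a", "r1", 120.0),
    ("b", "r1", 45.0),
    ("c", "r1", 12.0),
    ("c", "r2", 80.0),
    ("b", "r2", 31.0),
]
pairs = candidate_pairs(hits)
print(sorted(pairs))
print(sorted(candidate_pairs(hits, threshold=50.0)))
```

With the default threshold, a and b share r1, and b and c share r2, so two pairs are considered. Raising the threshold to 50 leaves no pair at all, which shows why this option has such an effect on runtime.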
References
- JM Joseph and D Durand, Family classification without domain chaining, Bioinformatics, 2009.
- N Song, R Sedgewick, D Durand, Domain architecture comparison for multidomain homology identification, J Comp Biol, 2007.
Owner
- Name: Lars Arvestad
- Login: arvestad
- Kind: user
- Location: Kräftriket, Stockholm
- Company: Stockholm University
- Website: https://www.su.se/english/profiles/arve-1.232358
- Repositories: 16
- Profile: https://github.com/arvestad