https://github.com/arvestad/snc

My version of Neighborhood Correlation sequence comparisons

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.5%) to scientific vocabulary

Last synced: 10 months ago

Repository


Basic Info

Host: GitHub
Owner: arvestad
License: gpl-3.0
Language: Python
Default Branch: main
Size: 104 KB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created about 4 years ago · Last pushed 11 months ago

Metadata Files

Readme License

README.md

SNC

An attempt at understanding and developing Dannie Durand's Neighborhood Correlation method (see references below) for sequence comparisons.

There are several ideas tested here.

A pure Python approach, making use of SciPy and numpy for main computations.
We are not assuming all-against-all comparisons. Rather, the input sequences we are interested in, call them Q, are assumed to be compared to a reference database R, which may or may not overlap with Q. In fact, you can have R=Q, but we want to enable reuse of the same reference database across runs, and the goal is to reduce the number of sequence comparisons.
NC_standalone used a log-transform on E values instead of bitscores. The transform gave improved results. We are trying several other transforms, including log-transform on "bitscore + 1" to avoid complications when the score/E value is close to zero.
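As a rough illustration of the ideas above, here is a minimal sketch that scores two query sequences by the Pearson correlation of their transformed score profiles against a reference database R. It assumes the published formulation of Neighborhood Correlation as a correlation between score profiles and uses the log("bitscore + 1") transform mentioned above. The function names and example scores are hypothetical and are not taken from the snc code, which uses SciPy and numpy rather than the plain standard library used here.

```python
import math


def transform(scores):
    """Apply log(score + 1); a missing hit (score 0) maps to 0."""
    return [math.log1p(s) for s in scores]


def pearson(x, y):
    """Pearson correlation of two equally long sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        # Assumption for this sketch: a constant profile gets correlation 0.
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def neighborhood_correlation(a_scores, b_scores):
    """Correlate two score profiles, one score per sequence in R, same order."""
    return pearson(transform(a_scores), transform(b_scores))


# Hypothetical bitscores of three queries against five reference sequences.
a = [250.0, 180.0, 0.0, 35.0, 0.0]
b = [240.0, 175.0, 0.0, 40.0, 0.0]
c = [0.0, 0.0, 310.0, 0.0, 90.0]

print(neighborhood_correlation(a, b))  # close to 1: similar neighborhoods
print(neighborhood_correlation(a, c))  # negative: disjoint neighborhoods
```

Note that only comparisons against R are needed to build the profiles, so Q is never compared all-against-all.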

The code for snc is way easier to understand and modify than the code for NC_standalone (well, I am biased), but snc is also far slower.

Install

For experimentation:

Download this repo
Ensure you have Python 3.8(?) and SciPy version 1.8.0 or later installed.
Install the Python module flit (e.g., pip install flit)
Run flit install

You should then be able to run snc similarities.tab on a file produced by Diamond or BLAST using the option --outfmt 6.
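For orientation, the default 12-column tabular output of --outfmt 6 (BLAST's m8 format) lists the query id, subject id, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, E value, and bitscore. The sketch below reads such lines into (query, subject, bitscore) triples. The function name and sample lines are hypothetical, and how snc itself parses the file is not shown here.

```python
import csv
import io


def read_outfmt6(handle):
    """Yield (query, subject, bitscore) from 12-column tabular lines."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        # Column 1 = query id, column 2 = subject id, column 12 = bitscore.
        yield row[0], row[1], float(row[11])


# Two made-up hits in the default column order.
example = io.StringIO(
    "q1\tr1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-110\t395.0\n"
    "q1\tr2\t45.0\t180\t90\t4\t5\t180\t10\t190\t2e-20\t88.2\n"
)
hits = list(read_outfmt6(example))
print(hits)
```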

Using snc

Notice the options!

-3: Three-column input. Default is BLAST's m8 format.
-c x: Set the "consideration threshold" to x. If two sequences a and b both have a similarity score of x or higher to a sequence r, then NC will be calculated for a and b. This has a big effect on runtime! Default is set to 30, expecting bitscores to be used. That is a low threshold, in my opinion, but I have not benchmarked this at all.
-ci: Compute and output a confidence interval for your NC scores. This is only meaningful on smaller datasets, because on proteome-scale computations the NC score is relatively well estimated. The methodological differences with NC_standalone cause larger differences than the data uncertainty, it seems.
-a: Don't bother, it is not implemented yet. The idea was to mimic NC_standalone with this option, but the NC statistic has not been adapted to that option.
-t: This threshold applies only to the output, so it can be used to reduce the size of the output.
-v: Output some progress information.
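To make the effect of the consideration threshold (-c) concrete, the following sketch selects the pairs of query sequences for which NC would be computed: two sequences form a candidate pair when both score at least the threshold against the same reference sequence. This is an illustration of the rule as described above, not the snc implementation; candidate_pairs and the example hits are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(hits, threshold=30.0):
    """Return pairs of queries sharing a reference hit scoring >= threshold.

    hits is an iterable of (query, reference, score) triples.
    """
    by_reference = defaultdict(set)
    for query, reference, score in hits:
        if score >= threshold:
            by_reference[reference].add(query)

    pairs = set()
    for queries in by_reference.values():
        # Sort so that each pair is stored in one canonical order.
        pairs.update(combinations(sorted(queries), 2))
    return pairs


# Made-up bitscores against two reference sequences.
hits = [
    ("a", "r1", 120.0),
    ("b", "r1", 45.0),
    ("c", "r1", 12.0),   # below the default threshold, ignored
    ("c", "r2", 80.0),
    ("b", "r2", 31.0),
]
print(sorted(candidate_pairs(hits)))
print(sorted(candidate_pairs(hits, threshold=100.0)))
```

Raising the threshold shrinks the set of candidate pairs, which is why the option has such a large effect on runtime.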

References

JM Joseph and D Durand, Family classification without domain chaining, Bioinformatics, 2009.
N Song, R Sedgewick, D Durand, Domain architecture comparison for multidomain homology identification, J Comp Biol, 2007.

Owner

Name: Lars Arvestad
Login: arvestad
Kind: user
Location: Kräftriket, Stockholm
Company: Stockholm University

Website: https://www.su.se/english/profiles/arve-1.232358
Repositories: 16
Profile: https://github.com/arvestad

GitHub Events

Total

Push event: 1
Pull request event: 1

Last Year

Push event: 1
Pull request event: 1
