https://github.com/arvestad/snc

My version of Neighborhood Correlation sequence comparisons

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.5%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: arvestad
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Size: 104 KB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 4 years ago · Last pushed 11 months ago

README.md

SNC

An attempt at understanding and developing Dannie Durand's Neighborhood Correlation method (see references below) for sequence comparisons.

There are several ideas tested here.

  1. A pure Python approach, making use of SciPy and NumPy for the main computations.
  2. We are not assuming all-against-all comparisons. Rather, the input sequences we are interested in, call them Q, are assumed to be compared to a reference database R, which may or may not overlap with Q. In fact, you can have R=Q, but we want to enable the use of a recurrently used reference database, and the goal is to reduce the number of sequence comparisons.
  3. NC_standalone used a log-transform on E values instead of bitscores. The transform gave improved results. We are trying several other transforms, including log-transform on "bitscore + 1" to avoid complications when the score/E value is close to zero.
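
Ideas 2 and 3 can be illustrated with a minimal sketch: the Neighborhood Correlation of two query sequences is taken here as the Pearson correlation between their transformed score profiles against the same reference sequences. This is not the snc code. The names `log_transform` and `neighborhood_correlation`, the use of plain Python lists instead of SciPy/NumPy arrays, and the choice of score 0 for a missing hit are all assumptions made for the example.

```python
import math

def log_transform(bitscore):
    """Log-transform on "bitscore + 1", so a missing hit (score 0) maps to 0."""
    return math.log(bitscore + 1.0)

def neighborhood_correlation(profile_a, profile_b):
    """Pearson correlation of two score profiles over the same reference set R."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same reference sequences")
    a = [log_transform(s) for s in profile_a]
    b = [log_transform(s) for s in profile_b]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return covariance / math.sqrt(var_a * var_b)

# Hypothetical bitscores of two queries against five reference sequences.
scores_a = [250.0, 180.0, 0.0, 35.0, 0.0]
scores_b = [240.0, 150.0, 0.0, 0.0, 0.0]
nc = neighborhood_correlation(scores_a, scores_b)
```

Because each profile only needs scores against R, two queries can be compared without ever being aligned to each other.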

The code for snc is way easier to understand and modify than the code for NC_standalone (well, I am biased), but snc is also far slower.

Install

For experimentation:

  • Download this repo
  • Ensure you have Python 3.8(?) and SciPy version 1.8.0 or later installed.
  • Install the Python module flit (e.g., pip install flit)
  • Run flit install

You should then be able to run snc similarities.tab, where similarities.tab is a file produced by Diamond or BLAST in tabular format (option --outfmt 6 in Diamond, -outfmt 6 in BLAST+).
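
For orientation, the default tabular format has twelve tab-separated columns, with the bitscore in the last one. The sketch below reads such lines and keeps only query, subject, and bitscore. The helper `read_hits` and the sample lines are hypothetical and not part of snc; the default column order is assumed, so a file written with custom output fields would need a different column list.

```python
import csv
import io

# Default column order of BLAST/Diamond tabular output (format 6).
BLAST_TAB_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def read_hits(handle):
    """Yield (query, subject, bitscore) for each tab-separated hit line."""
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(BLAST_TAB_COLUMNS, row))
        yield record["qseqid"], record["sseqid"], float(record["bitscore"])

# Two made-up hit lines standing in for a similarities.tab file.
sample = (
    "q1\tr1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t380\n"
    "q2\tr1\t45.0\t180\t90\t4\t5\t184\t10\t190\t2e-30\t120.5\n"
)
hits = list(read_hits(io.StringIO(sample)))
```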

Using snc

Notice the options!

  • -3: Three-column input. Default is BLAST's m8 format.
  • -c x: Set the "consideration threshold" to x. If two sequences a and b both have a similarity score of x or higher to a sequence r, then NC will be calculated for a and b. This has a big effect on runtime! Default is set to 30, expecting bitscores to be used. That is a low threshold, in my opinion, but I have not benchmarked this at all.
  • -ci: Compute and output a confidence interval for your NC scores. This is only meaningful on smaller datasets, because on proteome-scale computations the NC score is relatively well estimated. The methodological differences with NC_standalone cause larger differences than the data uncertainty, it seems.
  • -a: Don't bother, it is not implemented yet. The idea was to mimic NC_standalone with this option, but the NC statistics have not been adapted to that option.
  • -t: This threshold is applied only to the output, so it can be used to reduce the size of the output.
  • -v: Output some progress information.
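
To see why the consideration threshold (-c) matters for runtime, here is a minimal sketch of the pair selection it describes: two sequences become a candidate pair only if both score at least x against some shared sequence r. The function `candidate_pairs` and the sample hits are hypothetical, and the actual selection in snc may be organized differently.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(hits, threshold=30.0):
    """Return pairs (a, b) that both reach `threshold` against a shared r."""
    by_reference = defaultdict(set)
    for query, reference, score in hits:
        if score >= threshold:
            by_reference[reference].add(query)
    pairs = set()
    for queries in by_reference.values():
        # Every pair of queries sharing this reference gets an NC computation.
        pairs.update(combinations(sorted(queries), 2))
    return pairs

# Made-up (query, reference, bitscore) triples.
hits = [
    ("a", "r1", 120.0), ("b", "r1", 45.0), ("c", "r1", 12.0),
    ("c", "r2", 80.0), ("a", "r2", 31.0),
]
pairs = candidate_pairs(hits)
strict_pairs = candidate_pairs(hits, threshold=50.0)
```

The number of candidate pairs per reference grows quadratically with the number of sequences passing the threshold, which is why a lower x can increase runtime sharply.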

References

  • JM Joseph and D Durand, Family classification without domain chaining, Bioinformatics, 2009.
  • N Song, R Sedgewick, D Durand, Domain architecture comparison for multidomain homology identification, J Comp Biol, 2007.

Owner

  • Name: Lars Arvestad
  • Login: arvestad
  • Kind: user
  • Location: Kräftriket, Stockholm
  • Company: Stockholm University

GitHub Events

Total
  • Push event: 1
  • Pull request event: 1
Last Year
  • Push event: 1
  • Pull request event: 1