`SimilaritySearch.jl`

`SimilaritySearch.jl`: Autotuned nearest neighbor indexes for Julia - Published in JOSS (2022)

https://github.com/sadit/similaritysearch.jl

Science Score: 98.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

information-retrieval julia metric-search search-engine similarity-search

Keywords from Contributors

surrogate fluxes pde tracer standardization

Scientific Fields

Computer Science Computer Science - 32% confidence
Last synced: 4 months ago · JSON representation ·

Repository

A nearest neighbor search library with exact and approximate algorithms

Basic Info
Statistics
  • Stars: 48
  • Watchers: 2
  • Forks: 8
  • Open Issues: 10
  • Releases: 114
Topics
information-retrieval julia metric-search search-engine similarity-search
Created over 8 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

Dev Build Status Coverage DOI

SimilaritySearch.jl

SimilaritySearch.jl is a library for nearest neighbor search. In particular, it contains the implementation for SearchGraph, a fast and flexible search index using any metric function. It is designed to support multithreading in most of its functions and structures.

The package provides the following indexes:

  • ParallelExhaustiveSearch: A brute force search index where each query is solved using all available threads.
  • ExhaustiveSearch: A brute force search index, each query is solved using a single thread.
  • SearchGraph: An approximate search index with parallel construction.

The main set of functions are:

  • search: Solves a single query.
  • searchbatch: Solves a set of queries.
  • allknn: Computes the $k$ nearest neighbors for all elements in an index.
  • neardup: Removes near-duplicates from a metric dataset.
  • closestpair: Computes the closest pair in a metric dataset.

The precise definitions of these functions and the complete set of functions and structures can be found in the documentation.

Similarity search ecosystem in Julia

Currently, there exists several packages dedicated to nearest neighbor search, for instance we have NearestNeighbors.jl, RegionTrees.jl, and JuliaNeighbors implement search structures like kd-trees, ball trees, quadtrees, octrees, bk-trees, vp-tree and other multidimensional and metric structures. These structures work quite well for low dimensional data since they are designed to solve exact similarity queries.

There exist several packages performing approximate similarity search, like Rayuela.jl using product quantization schemes, the wrapper for the FAISS library Faiss.jl. The FAISS library provides high-performance implementations of product quantization schemes and locality-sensitive hashing schemes, along with an industrial-strength implementation of the HNSW index. The NearestNeighborDescent.jl implements the search algorithm behind pynndescent.

The SimilaritySearch.jl package tries to enrich the ecosystem with search structures and algorithms designed to take advantage of multithreading systems and a unique autotuning feature that simplifies its usage for practitioners. These features are succinctly and efficiently implemented due to the Julia programming language dynamism and performance. Regarding performance characteristics, the construction times are vastly reduced compared to similar approaches without reducing search performance or result quality.

Installing SimilaritySearch

You may install the package as follows julia ] add SimilaritySearch.jl

also, you can run the set of tests as follows julia ] test SimilaritySearch

Using the library

Please see examples. You will find a list of Jupyter and Pluto notebooks, and some scripts that exemplifies its usage.

Contribute

Contributions are welcome. Please fill a pull request for documentating and implementation contributions. For issues, please fill an issue with the necessary information (see below.) If you already have a solution please also provide a pull request.

Issues

Report issues in the package providing a minimal reproducible example. If the issue is data dependant, please don't forget to provide the necessary data to reproduce it.

Limitations of SearchGraph

The main search structure, the SearchGraph, is a graph with several characteristics, many of them induced by the dataset being indexed. Some of its known limitations are related to these characteristics. For instance:

  • Metric distances work well; on the other hand, semi-metric should work, but routing capabilities are not yet characterized.
  • Even when it performs pretty well compared to alternatives, discrete metrics like Levenshtein distance and others that take few possible values may also get low performances.
  • Something similar will happen when there are many near-duplicates (elements that are pretty close). In this case, it is necessary to remove near-duplicates and put them in bags associated with some of its near objects.
  • Very high dimensional datasets will produce long-tail distributions of the number of edges per vertex. In extreme cases, you must prune large neighborhoods and enrich single-edge paths.

About the structures and algorithms

The following manuscript describes and benchmarks the SearchGraph index (package version 0.6):

``` @article{tellezscalable, title={A scalable solution to the nearest neighbor search problem through local-search methods on neighbor graphs}, author={Tellez, Eric S and Ruiz, Guillermo and Chavez, Edgar and Graff, Mario}, journal={Pattern Analysis and Applications}, pages={1--15}, publisher={Springer} }

```

The current algorithm (version 0.8 and 0.9) is described and benchmarked in the following manuscript: ```

@misc{tellez2022similarity, title={Similarity search on neighbor's graphs with automatic Pareto optimal performance and minimum expected quality setups based on hyperparameter optimization}, author={Eric S. Tellez and Guillermo Ruiz}, year={2022}, eprint={2201.07917}, archivePrefix={arXiv}, primaryClass={cs.IR} } ```

This package is also described in the JOSS paper:

Eric S. Tellez and Guillermo Ruiz. SimilaritySearch.jl: Autotuned nearest neighbor indexes for Julia. Journal of Open Source Software https://doi.org/10.21105/joss.04442.

About v0.9.X series

The algorithms of this version are the same as v0.8 but break API compatibility:

  • Now, it uses the Polyester package to handle multithreading instead of Threads.@threads
  • Multithreading methods are enabled by default if the process is started with several threads; in v0.8 was the contrary
  • allknn now preserves self-references to simplify algorithms and improve efficiency (allknn in v0.8 removes self-references automatically)

Others:

  • Adds function docs and benchmarks
  • Adds SearchGraph graph pruning methods
  • Removes the timedsearchbatch function

About v0.10.X series

It makes easy to adjust the SearchGraph structure to different workloads and applications. For instance, - More control for construction parameters - Loading and saving - Refactors search API to be consistent across structs

Please refer to https://github.com/sadit/SimilaritySearchDemos and https://github.com/sadit/SimilaritySearch.jl/blob/main/test/testsearchgraph.jl for working examples.

About v0.11 series

It introduces a major refactoring. In particular, it makes explicit use of context objects for most functions. It also introduces simple logging procedures. However, we preserve compatibility in many public functions using implicit use of default context objects.

About v0.12 series

Breaking changes: - The context objects are now required; there are no default use of them. - Added ProgressMeter for allknn, a small impact in the performance but pretty nice for large datasets. - Removes dependencies of LoopVectorization; we only use it for @turbo based distance functions; these functions can be deployed in another package.

New features: - Distances and dataset wrappers to handle non-Float32 that are casted to Float32 just before distance computations. This could improve the performance in several high throughput setups.

Owner

  • Name: Eric S. Tellez
  • Login: sadit
  • Kind: user
  • Location: Mexico
  • Company: CONACYT - INFOTEC

JOSS Publication

`SimilaritySearch.jl`: Autotuned nearest neighbor indexes for Julia
Published
July 05, 2022
Volume 7, Issue 75, Page 4442
Authors
Eric S. Tellez ORCID
Consejo Nacional de Ciencia y Tecnología, México., INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación, México.
Guillermo Ruiz ORCID
Consejo Nacional de Ciencia y Tecnología, México., CentroGEO Centro de Investigación en Ciencias de Información Geoespacial, México.
Editor
Mehmet Hakan Satman ORCID
Tags
Autotuned similarity search indexes K nearest neighbor search All KNN queries Near duplicates detection Closest pair queries

Citation (CITATION.bib)

@article{SimilaritySearch-2021,
  title={A scalable solution to the nearest neighbor search problem through local-search methods on neighbor graphs},
  author={Tellez, Eric S. and Ruiz, Guillermo and Chavez, Edgar and Graff, Mario},
  journal={Pattern Analysis and Applications},
  pages={763--777},
  volume={24},
  year={2021},
  issn={1433-755X},
  doi={10.1007/s10044-020-00946-w}
  publisher={Springer}
}

GitHub Events

Total
  • Create event: 2
  • Commit comment event: 4
  • Issues event: 1
  • Release event: 3
  • Watch event: 5
  • Issue comment event: 6
  • Push event: 17
  • Pull request event: 2
  • Fork event: 1
Last Year
  • Create event: 2
  • Commit comment event: 4
  • Issues event: 1
  • Release event: 3
  • Watch event: 5
  • Issue comment event: 6
  • Push event: 17
  • Pull request event: 2
  • Fork event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 588
  • Total Committers: 4
  • Avg Commits per committer: 147.0
  • Development Distribution Score (DDS): 0.009
Past Year
  • Commits: 43
  • Committers: 1
  • Avg Commits per committer: 43.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Eric S. Tellez d****t@g****m 583
Eric S. Tellez d****t@g****m 3
github-actions[bot] 4****] 1
Julia TagBot 5****t 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 14
  • Total pull requests: 31
  • Average time to close issues: 3 months
  • Average time to close pull requests: 2 months
  • Total issue authors: 8
  • Total pull request authors: 9
  • Average comments per issue: 8.21
  • Average comments per pull request: 0.48
  • Merged pull requests: 12
  • Bot issues: 0
  • Bot pull requests: 16
Past Year
  • Issues: 2
  • Pull requests: 4
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.5
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 3
Top Authors
Issue Authors
  • zsz00 (3)
  • attobot (3)
  • nlw0 (2)
  • davidoskky (1)
  • rssdev10 (1)
  • D3MZ (1)
  • AbrJA (1)
  • JuliaTagBot (1)
Pull Request Authors
  • github-actions[bot] (16)
  • dependabot[bot] (9)
  • msubrayada (4)
  • sadit (4)
  • AbrJA (2)
  • danielskatz (2)
  • PallHaraldsson (1)
  • jbytecode (1)
  • JuliaTagBot (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (9)

Packages

  • Total packages: 1
  • Total downloads:
    • julia 11 total
  • Total dependent packages: 11
  • Total dependent repositories: 16
  • Total versions: 107
juliahub.com: SimilaritySearch

A nearest neighbor search library with exact and approximate algorithms

  • Versions: 107
  • Dependent Packages: 11
  • Dependent Repositories: 16
  • Downloads: 11 Total
Rankings
Dependent repos count: 2.2%
Dependent packages count: 6.1%
Average: 10.7%
Stargazers count: 15.9%
Forks count: 18.8%
Last synced: 4 months ago

Dependencies

.github/workflows/TagBot.yml actions
  • JuliaRegistries/TagBot v1 composite
.github/workflows/ci.yml actions
  • actions/cache v1 composite
  • actions/checkout v2 composite
  • codecov/codecov-action v1 composite
  • julia-actions/julia-buildpkg v1 composite
  • julia-actions/julia-processcoverage v1 composite
  • julia-actions/julia-runtest v1 composite
  • julia-actions/setup-julia v1 composite
.github/workflows/documentation.yml actions
  • actions/checkout v2 composite
  • julia-actions/setup-julia latest composite
.github/workflows/CompatHelper.yml actions