scikit-hubness

scikit-hubness: Hubness Reduction and Approximate Neighbor Search - Published in JOSS (2020)

https://github.com/varir/scikit-hubness

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: arxiv.org, joss.theoj.org, zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

approximate-nearest-neighbor-search data-mining data-science high-dimensional-data hubness machine-learning nearest-neighbor-search

Scientific Fields

Mathematics Computer Science - 84% confidence
Engineering Computer Science - 60% confidence
Last synced: 4 months ago

Repository

A Python package for hubness analysis and high-dimensional data mining

Basic Info
  • Host: GitHub
  • Owner: VarIr
  • License: bsd-3-clause
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 3.85 MB
Statistics
  • Stars: 45
  • Watchers: 3
  • Forks: 9
  • Open Issues: 15
  • Releases: 7
Topics
approximate-nearest-neighbor-search data-mining data-science high-dimensional-data hubness machine-learning nearest-neighbor-search
Created over 6 years ago · Last pushed over 1 year ago
Metadata Files
Readme Contributing License Code of conduct

README.md


scikit-hubness

scikit-hubness provides tools for the analysis and reduction of hubness in high-dimensional data. Hubness is an aspect of the curse of dimensionality and is detrimental to many machine learning and data mining tasks.

The skhubness.analysis and skhubness.reduction packages allow you to

  • analyze whether your data sets show hubness
  • reduce hubness via a variety of different techniques
  • perform downstream analysis (performance assessment) with scikit-learn, thanks to compatible data structures
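
To build intuition for what such an analysis measures, here is a minimal sketch in plain numpy (not the skhubness API): hubness can be quantified as the skewness of the k-occurrence distribution N_k, i.e. how often each point appears among the k nearest neighbors of all other points. The function name and toy data below are illustrative assumptions.

```python
import numpy as np

def k_occurrence_skewness(X, k=10):
    # Squared Euclidean distance matrix via the dot-product identity.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)               # exclude self-neighbors
    knn = np.argsort(d2, axis=1)[:, :k]        # k nearest neighbors per point
    occ = np.bincount(knn.ravel(), minlength=len(X))  # k-occurrence counts N_k
    # Sample skewness (third standardized moment) of N_k.
    return ((occ - occ.mean()) ** 3).mean() / occ.std() ** 3

rng = np.random.default_rng(0)
low_dim = rng.standard_normal((500, 3))      # low-dimensional: little hubness
high_dim = rng.standard_normal((500, 500))   # high-dimensional: pronounced hubness
print(f"skewness (3-d):   {k_occurrence_skewness(low_dim):.2f}")
print(f"skewness (500-d): {k_occurrence_skewness(high_dim):.2f}")
```

A strongly right-skewed N_k means a few "hub" points dominate the neighbor lists, which is exactly what tends to emerge as dimensionality grows.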

The skhubness.neighbors package provides approximate nearest neighbor (ANN) search. It is compatible with scikit-learn classes and functions that rely on neighbors graphs, due to compliance with the KNeighborsTransformer API and data structures. Using ANN can speed up many scikit-learn classification, clustering, embedding, and other methods, including:

  • KNeighborsClassifier
  • DBSCAN
  • TSNE
  • and many more.

scikit-hubness thus provides

  • approximate nearest neighbor search
  • hubness reduction
  • and combinations of both,

which allows for fast hubness-reduced neighbor search in large datasets (tested with >1M objects).

Installation

Make sure you have a working Python 3 environment (3.8 or later).

Use pip to install the latest stable version of scikit-hubness from PyPI:

```bash
pip install scikit-hubness
```

NOTE: v0.30 is currently under development and not yet available on PyPI. Install from sources to obtain the bleeding edge version.

Dependencies are installed automatically, if necessary. scikit-hubness is based on the SciPy stack, including numpy, scipy, and scikit-learn. Approximate nearest neighbor search and approximate hubness reduction additionally require at least one of the following packages:

  • nmslib for hierarchical navigable small-world graphs in skhubness.neighbors.NMSlibTransformer
  • ngtpy for nearest neighbor graphs (ANNG, ONNG) in skhubness.neighbors.NGTTransformer
  • puffinn for locality-sensitive hashing in skhubness.neighbors.PuffinnTransformer
  • annoy for random projection forests in skhubness.neighbors.AnnoyTransformer

Additional ANN libraries might be added in future releases. Please reach out to us in a GitHub issue if you think a specific library is missing (pull requests welcome).
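
Since each of these backends is optional, it can be handy to check which ones are importable in the current environment. The following stdlib-only probe is an illustrative sketch (the module-to-transformer mapping mirrors the packages listed above; the dict name is an assumption, not part of skhubness):

```python
import importlib.util

# Optional ANN backends and the skhubness transformer each one powers.
OPTIONAL_BACKENDS = {
    "nmslib": "skhubness.neighbors.NMSlibTransformer",
    "ngtpy": "skhubness.neighbors.NGTTransformer",
    "puffinn": "skhubness.neighbors.PuffinnTransformer",
    "annoy": "skhubness.neighbors.AnnoyTransformer",
}

# find_spec() reports whether a module could be imported, without importing it.
available = {
    module: importlib.util.find_spec(module) is not None
    for module in OPTIONAL_BACKENDS
}
for module, ok in available.items():
    print(f"{module:8s} -> {OPTIONAL_BACKENDS[module]}: "
          f"{'available' if ok else 'not installed'}")
```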

For more details and alternatives, please see the Installation instructions.

Documentation

Additional documentation is available online: http://scikit-hubness.readthedocs.io/en/latest/index.html

What's new

See the changelog to find what's new in the latest package version.

Quickstart

Users of scikit-hubness may want to

  1. analyze whether their data show hubness
  2. reduce hubness
  3. perform learning (classification, regression, ...)

The following example shows all these steps for an example dataset from the text domain (dexter). (Please make sure you have installed scikit-hubness.)

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier, KNeighborsTransformer

from skhubness import Hubness
from skhubness.data import load_dexter
from skhubness.neighbors import NMSlibTransformer
from skhubness.reduction import MutualProximity

# Load the example dataset 'dexter', which is embedded in a
# high-dimensional space and could, thus, be prone to hubness.
X, y = load_dexter()
print(f'X.shape = {X.shape}, y.shape = {y.shape}')

# Assess the actual degree of hubness in dexter.
hub = Hubness(k=10, metric='cosine')
hub.fit(X)
k_skew = hub.score()
print(f'Skewness = {k_skew:.3f}')

# Additional hubness indices are available, for example:
hub = Hubness(k=10, return_value="all", metric='cosine')
scores = hub.fit(X).score()
print(f'Robin hood index: {scores.get("robinhood"):.3f}')
print(f'Antihub occurrence: {scores.get("antihub_occurrence"):.3f}')
print(f'Hub occurrence: {scores.get("hub_occurrence"):.3f}')

# There is considerable hubness in dexter. Let's see whether
# hubness reduction can improve kNN classification performance.

# We first create a kNN graph:
knn = KNeighborsTransformer(n_neighbors=50, metric="cosine")
# Alternatively, create an approximate KNeighborsTransformer, e.g.,
# knn = NMSlibTransformer(n_neighbors=50, metric="cosine")
k_neighbors_graph = knn.fit_transform(X, y)

# Vanilla kNN without hubness reduction:
clf = KNeighborsClassifier(n_neighbors=5, metric='precomputed')
acc_standard = cross_val_score(clf, k_neighbors_graph, y, cv=5)

# kNN with hubness reduction (mutual proximity) reuses the
# precomputed graph and works in sklearn workflows:
mp = MutualProximity(method="normal")
mp_graph = mp.fit_transform(k_neighbors_graph)
acc_mp = cross_val_score(clf, mp_graph, y, cv=5)

print(f'Accuracy (vanilla kNN): {acc_standard.mean():.3f}')
print(f'Accuracy (kNN with hubness reduction): {acc_mp.mean():.3f}')

# Accuracy was considerably improved by mutual proximity.
# Did it actually reduce hubness?
mp_scores = hub.fit(mp_graph).score()
print(f'k-skewness after MP: {mp_scores.get("k_skewness"):.3f} '
      f'(reduction of {scores.get("k_skewness") - mp_scores.get("k_skewness"):.3f})')
print(f'Robinhood after MP: {mp_scores.get("robinhood"):.3f} '
      f'(reduction of {scores.get("robinhood") - mp_scores.get("robinhood"):.3f})')
```

Check the User Guide for additional example usage.

Development

The developers of scikit-hubness welcome all kinds of contributions! Get in touch with us if you have comments, would like to see an additional feature implemented, would like to contribute code or have any other kind of issue. Don't hesitate to file an issue here on GitHub.

For more information about contributing, please have a look at the contributors guidelines.

(c) 2018-2022, Roman Feldbauer
-2018: Austrian Research Institute for Artificial Intelligence (OFAI) and
-2021: University of Vienna, Division of Computational Systems Biology (CUBE)
2021-: Independent researcher
Contact: <sci@feldbauer.org>

Citation

If you use scikit-hubness in your scientific publication, please cite:

@Article{Feldbauer2020,
  author  = {Roman Feldbauer and Thomas Rattei and Arthur Flexer},
  title   = {scikit-hubness: Hubness Reduction and Approximate Neighbor Search},
  journal = {Journal of Open Source Software},
  year    = {2020},
  volume  = {5},
  number  = {45},
  pages   = {1957},
  issn    = {2475-9066},
  doi     = {10.21105/joss.01957},
}

To specifically acknowledge approximate hubness reduction, please cite:

@INPROCEEDINGS{8588814,
author={R. {Feldbauer} and M. {Leodolter} and C. {Plant} and A. {Flexer}},
booktitle={2018 IEEE International Conference on Big Knowledge (ICBK)},
title={Fast Approximate Hubness Reduction for Large High-Dimensional Data},
year={2018},
volume={},
number={},
pages={358-367},
keywords={computational complexity;data analysis;data mining;mobile computing;public domain software;software packages;mobile device;open source software package;high-dimensional data mining;fast approximate hubness reduction;massive mobility data;linear complexity;quadratic algorithmic complexity;dimensionality curse;Complexity theory;Indexes;Estimation;Data mining;Approximation algorithms;Time measurement;curse of dimensionality;high-dimensional data mining;hubness;linear complexity;interpretability;smartphones;transport mode detection},
doi={10.1109/ICBK.2018.00055},
ISSN={},
month={Nov},}

The technical report Fast approximate hubness reduction for large high-dimensional data is available at OFAI.

Additional reading

Local and Global Scaling Reduce Hubs in Space, Journal of Machine Learning Research 2012, Link.

A comprehensive empirical comparison of hubness reduction in high-dimensional spaces, Knowledge and Information Systems 2018, DOI.

License

scikit-hubness is licensed under the terms of the BSD-3-Clause license.


Note: Individual files contain the following tag instead of the full license text.

    SPDX-License-Identifier: BSD-3-Clause

This enables machine processing of license information based on the SPDX License Identifiers, which are available at https://spdx.org/licenses/

Acknowledgements

Parts of scikit-hubness adapt code from scikit-learn. We thank all the authors and contributors of this project for the tremendous work they have done.

PyVmMonitor is being used to support the development of this free open source software package. For more information, go to http://www.pyvmmonitor.com

Owner

  • Name: Roman Feldbauer
  • Login: VarIr
  • Kind: user
  • Location: Vienna, AT
  • Company: @ProxygenTx

Researcher in bioinformatics and machine learning. Alumnus of @univieCUBE @OFAI

JOSS Publication

scikit-hubness: Hubness Reduction and Approximate Neighbor Search
Published
January 17, 2020
Volume 5, Issue 45, Page 1957
Authors
Roman Feldbauer ORCID
Division of Computational Systems Biology, Department of Microbiology and Ecosystem Science, University of Vienna, Althanstraße 14, 1090 Vienna, Austria
Thomas Rattei ORCID
Division of Computational Systems Biology, Department of Microbiology and Ecosystem Science, University of Vienna, Althanstraße 14, 1090 Vienna, Austria
Arthur Flexer ORCID
Austrian Research Institute for Artificial Intelligence (OFAI), Freyung 6/6/7, 1010 Vienna, Austria
Editor
Yuan Tang ORCID
Tags
scikit-learn hubness curse of dimensionality nearest neighbors

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 275
  • Total Committers: 2
  • Avg Commits per committer: 137.5
  • Development Distribution Score (DDS): 0.004
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Roman Feldbauer s****i@f****g 274
Silvan David Peter s****n@1****a 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 40
  • Total pull requests: 63
  • Average time to close issues: 4 months
  • Average time to close pull requests: 2 days
  • Total issue authors: 12
  • Total pull request authors: 2
  • Average comments per issue: 3.0
  • Average comments per pull request: 1.21
  • Merged pull requests: 58
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • VarIr (20)
  • ivan-marroquin (9)
  • QueSabz (2)
  • j-bac (1)
  • davnn (1)
  • clkruse (1)
  • mrdrozdov (1)
  • eduamf (1)
  • jolespin (1)
  • BlaiseMuhirwa (1)
  • jlevy44 (1)
  • ryEllison (1)
Pull Request Authors
  • VarIr (63)
  • sildater (1)
Top Labels
Issue Labels
enhancement (8) bug (3) documentation (1)
Pull Request Labels
enhancement (6) bug (1) documentation (1) invalid (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 33 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 2
  • Total versions: 10
  • Total maintainers: 1
pypi.org: scikit-hubness

Hubness reduction and analysis tools

  • Versions: 10
  • Dependent Packages: 0
  • Dependent Repositories: 2
  • Downloads: 33 Last month
Rankings
Stargazers count: 10.1%
Dependent packages count: 10.1%
Dependent repos count: 11.5%
Forks count: 11.9%
Average: 12.8%
Downloads: 20.2%
Maintainers (1)
Last synced: 4 months ago

Dependencies

requirements-rtd.txt pypi
  • codecov *
  • flake8 *
  • graphviz *
  • joblib >=0.12
  • mock *
  • nose *
  • numpy *
  • numpydoc *
  • pandas *
  • pytest *
  • pytest-cov *
  • scikit-learn *
  • scipy >=1.2
  • sphinx >=2.1
  • sphinx-automodapi *
  • sphinx-gallery *
  • sphinx-pdj-theme *
  • tqdm *
requirements-win.txt pypi
  • annoy *
  • codecov *
  • flake8 *
  • joblib >=0.12
  • nmslib *
  • nose *
  • numpy *
  • pandas *
  • pytest *
  • pytest-cov *
  • scikit-learn *
  • scipy >=1.2
  • tqdm *
requirements.txt pypi
  • annoy *
  • codecov *
  • flake8 *
  • joblib >=0.12
  • ngt >=1.8
  • nmslib *
  • nose *
  • numba *
  • numpy *
  • pandas *
  • pytest *
  • pytest-cov *
  • scikit-learn *
  • scipy >=1.2
  • tqdm *
.github/workflows/scikit-hubness_ci.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
pyproject.toml pypi