pyminhash

MinHash implementation in Python

https://github.com/fritshermans/pyminhash

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.8%) to scientific vocabulary
Last synced: 7 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: fritshermans
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 70.3 KB
Statistics
  • Stars: 11
  • Watchers: 1
  • Forks: 5
  • Open Issues: 0
  • Releases: 0
Created over 4 years ago · Last pushed over 1 year ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

PyMinHash

MinHashing is a very efficient way of finding similar records in a dataset based on Jaccard similarity. PyMinHash implements efficient minhashing for Pandas dataframes. See instructions below or look at the example notebook to get started.
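To make the idea concrete, here is a minimal sketch of how a MinHash signature approximates Jaccard similarity. It is not PyMinHash's implementation: the helper names (`shingles`, `jaccard`, `minhash_signature`, `estimate_jaccard`), the use of character 3-grams, and the choice of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of lowercase character n-grams of a string."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(item, seed):
    """64-bit hash of a string; the seed selects one member of the hash family."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, n_hashes=64):
    """Keep the minimum hash value per seeded hash function."""
    return [min(_hash(item, seed) for item in items) for seed in range(n_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = shingles("Frits Hermans")
b = shingles("Frits K. Hermans")
exact = jaccard(a, b)
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact={exact:.3f} estimated={approx:.3f}")
```

The estimate gets closer to the exact value as `n_hashes` grows. The efficiency gain comes from comparing short fixed-length signatures instead of full sets.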

Developed by Frits Hermans

Documentation

Documentation can be found here

Installation

Normal installation

Using PyPI

pip install pyminhash

Using conda

conda install -c conda-forge pyminhash

Install to contribute

Clone this GitHub repo and install in editable mode:

python -m pip install -e ".[dev]"
python setup.py develop

Usage

Apply record matching to the column name of your Pandas dataframe df as follows:

from pyminhash import MinHash

myHasher = MinHash(n_hash_tables=10)
myHasher.fit_predict(df, 'name')

This will return the row pairs from df that have non-zero Jaccard similarity.
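To show what "row pairs with non-zero Jaccard similarity" means, the sketch below computes that result by brute force over a small list of names. It does not call PyMinHash and does not reproduce its output format: the function `nonzero_jaccard_pairs`, the whitespace tokenisation, and the sample names are hypothetical. MinHashing aims to find such pairs without comparing every row against every other row.

```python
from itertools import combinations


def word_set(text):
    """Split a string into a set of lowercase words."""
    return set(text.lower().split())


def nonzero_jaccard_pairs(names):
    """Return (row_i, row_j, similarity) for every pair sharing at least one word."""
    tokens = [word_set(name) for name in names]
    pairs = []
    for i, j in combinations(range(len(names)), 2):
        union = tokens[i] | tokens[j]
        if not union:
            continue  # both rows are empty, so similarity is undefined
        similarity = len(tokens[i] & tokens[j]) / len(union)
        if similarity > 0:
            pairs.append((i, j, similarity))
    return pairs


names = ["acme holding bv", "acme holding", "globex corporation"]
pairs = nonzero_jaccard_pairs(names)
for i, j, similarity in pairs:
    print(i, j, round(similarity, 3))
```

This exhaustive comparison scales quadratically with the number of rows, which is the cost that a hashing-based approach is meant to avoid.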

Owner

  • Login: fritshermans
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
title: PyMinHash
message: >-
  If you use PyMinHash, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Frits
    family-names: Hermans
repository-code: 'https://github.com/fritshermans/pyminhash'
abstract: >-
  MinHashing is a very efficient way of finding similar
  records in a dataset based on Jaccard similarity.
  PyMinHash implements efficient minhashing for Pandas
  dataframes.
keywords:
  - minhash
  - string matching
  - fuzzy matching
license: MIT

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 33
  • Total Committers: 4
  • Avg Commits per committer: 8.25
  • Development Distribution Score (DDS): 0.121
Past Year
  • Commits: 4
  • Committers: 2
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.5
Top Committers
Name                   Email            Commits
fritshermans           p****t@f****l    29
frankhoogmoed          f****d@r****m    2
Andrej Zachar          a****j@c****u    1
Frits (F.K.) Hermans   f****s@i****m    1
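The all-time figures above can be reproduced from the per-committer counts. The sketch assumes that the Development Distribution Score is one minus the top committer's share of all commits. The page does not state this definition, but it matches the listed value. The function name is hypothetical.

```python
def development_distribution_score(commit_counts):
    """Assumed definition: 1 - (commits by top committer / total commits)."""
    total = sum(commit_counts)
    if total == 0:
        return 0.0
    return 1 - max(commit_counts) / total


# All-time commits per committer, taken from the Top Committers table.
all_time = [29, 2, 1, 1]

avg_commits = sum(all_time) / len(all_time)
dds = development_distribution_score(all_time)

print(avg_commits)    # 8.25
print(round(dds, 3))  # 0.121
```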

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 0
  • Total pull requests: 7
  • Average time to close issues: N/A
  • Average time to close pull requests: about 8 hours
  • Total issue authors: 0
  • Total pull request authors: 3
  • Average comments per issue: 0
  • Average comments per pull request: 0.71
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • fritshermans (4)
  • azachar (2)
  • hokkiefrank (1)

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 2,009 last-month
  • Total dependent packages: 2
    (may contain duplicates)
  • Total dependent repositories: 1
    (may contain duplicates)
  • Total versions: 12
  • Total maintainers: 1
pypi.org: pyminhash

Efficient MinHashing

  • Versions: 7
  • Dependent Packages: 1
  • Dependent Repositories: 1
  • Downloads: 2,009 Last month
Rankings
Dependent packages count: 4.7%
Downloads: 9.6%
Average: 13.9%
Forks count: 15.3%
Stargazers count: 18.5%
Dependent repos count: 21.6%
Maintainers (1)
Last synced: 8 months ago
conda-forge.org: pyminhash

MinHashing is a very efficient way of finding similar records in a dataset based on Jaccard similarity. PyMinHash implements efficient minhashing for Pandas dataframes. See instructions below or look at the example notebook to get started.

Developed by Frits Hermans (https://www.linkedin.com/in/frits-hermans-data-scientist/)

PyPI: https://pypi.org/project/PyMinHash/

  • Versions: 5
  • Dependent Packages: 1
  • Dependent Repositories: 0
Rankings
Dependent packages count: 48.7%
Average: 51.6%
Stargazers count: 52.3%
Forks count: 53.7%
Last synced: 8 months ago

Dependencies

docs/docs-requirements.txt pypi
  • nbsphinx *
  • sphinx ==3.5.4
  • sphinx_rtd_theme *
pyproject.toml pypi
setup.py pypi