Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.8%) to scientific vocabulary
Repository
MinHash implementation in Python
Basic Info
- Host: GitHub
- Owner: fritshermans
- License: MIT
- Language: Jupyter Notebook
- Default Branch: main
- Size: 70.3 KB
Statistics
- Stars: 11
- Watchers: 1
- Forks: 5
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
PyMinHash
MinHashing is a very efficient way of finding similar records in a dataset based on Jaccard similarity. PyMinHash implements efficient minhashing for Pandas dataframes. See instructions below or look at the example notebook to get started.
Developed by Frits Hermans
Documentation
Documentation can be found at https://pyminhash.readthedocs.io/
Installation
Normal installation
Using PyPI
pip install pyminhash
Using conda
conda install -c conda-forge pyminhash
Install to contribute
Clone this GitHub repo and install in editable mode:
python -m pip install -e ".[dev]"
python setup.py develop
Usage
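Apply record matching to the column name of your Pandas dataframe df as follows:
# Import path assumed from the package name; df is an existing DataFrame with a 'name' column
from pyminhash import MinHash

myHasher = MinHash(n_hash_tables=10)
myHasher.fit_predict(df, 'name')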
This will return the row pairs from df that have non-zero Jaccard similarity.
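To make the idea of Jaccard similarity and minhashing concrete, here is a minimal pure-Python sketch. It is not PyMinHash's implementation: the helper names (`tokens`, `jaccard`, `minhash_signature`, `estimate_jaccard`), the word-level tokenisation, and the choice of salted BLAKE2b hashes are all assumptions made for illustration only.

```python
import hashlib


def tokens(text):
    """Split a string into a set of lowercase word tokens (illustrative choice)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def _hash(token, seed):
    """Deterministic 64-bit hash of a token, salted with an integer seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(token_set, n_hashes=128):
    """Keep the minimum hash value per seeded hash function."""
    if not token_set:
        return []
    return [min(_hash(t, seed) for t in token_set) for seed in range(n_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if not sig_a or not sig_b:
        return 0.0
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = tokens("acme trading company ltd")
b = tokens("acme trading company limited")

exact = jaccard(a, b)  # 3 shared tokens out of 5 distinct ones
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact={exact:.2f} approx={approx:.2f}")
```

The estimate approaches the exact value as `n_hashes` grows, and two signatures can be compared without revisiting the original strings, which is what makes minhashing attractive for large datasets.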
Owner
- Login: fritshermans
- Kind: user
- Repositories: 3
- Profile: https://github.com/fritshermans
Citation (CITATION.cff)
cff-version: 1.2.0
title: PyMinHash
message: >-
If you use PyMinHash, please cite it using the
metadata from this file.
type: software
authors:
- given-names: Frits
family-names: Hermans
repository-code: 'https://github.com/fritshermans/pyminhash'
abstract: >-
MinHashing is a very efficient way of finding similar
records in a dataset based on Jaccard similarity.
PyMinHash implements efficient minhashing for Pandas
dataframes.
keywords:
- minhash
- string matching
- fuzzy matching
license: MIT
Committers
Last synced: almost 3 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| fritshermans | p****t@f****l | 29 |
| frankhoogmoed | f****d@r****m | 2 |
| Andrej Zachar | a****j@c****u | 1 |
| Frits (F.K.) Hermans | f****s@i****m | 1 |
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 0
- Total pull requests: 7
- Average time to close issues: N/A
- Average time to close pull requests: about 8 hours
- Total issue authors: 0
- Total pull request authors: 3
- Average comments per issue: 0
- Average comments per pull request: 0.71
- Merged pull requests: 6
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- fritshermans (4)
- azachar (2)
- hokkiefrank (1)
Packages
- Total packages: 2
- Total downloads: pypi 2,009 last-month
- Total dependent packages: 2 (may contain duplicates)
- Total dependent repositories: 1 (may contain duplicates)
- Total versions: 12
- Total maintainers: 1
pypi.org: pyminhash
Efficient MinHashing
- Homepage: https://github.com/fritshermans/pyminhash
- Documentation: https://pyminhash.readthedocs.io/
- License: MIT License
- Latest release: 0.1.5 (published about 3 years ago)
conda-forge.org: pyminhash
MinHashing is a very efficient way of finding similar records in a dataset based on Jaccard similarity. PyMinHash implements efficient minhashing for Pandas dataframes. See instructions below or look at the example notebook to get started.
Developed by [Frits Hermans](https://www.linkedin.com/in/frits-hermans-data-scientist/)
PyPI: [https://pypi.org/project/PyMinHash/](https://pypi.org/project/PyMinHash/)
- Homepage: https://github.com/fritshermans/pyminhash
- License: MIT
- Latest release: 0.1.4 (published about 4 years ago)
Dependencies
- nbsphinx *
- sphinx ==3.5.4
- sphinx_rtd_theme *