deduplipy

Python package for deduplication/entity resolution using active learning

https://github.com/fritshermans/deduplipy

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.0%) to scientific vocabulary

Keywords

deduplication entity-resolution fuzzy-matching record-linkage
Last synced: 6 months ago

Repository

Python package for deduplication/entity resolution using active learning

Basic Info
  • Host: GitHub
  • Owner: fritshermans
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage: https://www.deduplipy.com
  • Size: 521 KB
Statistics
  • Stars: 81
  • Watchers: 5
  • Forks: 9
  • Open Issues: 1
  • Releases: 0
Topics
deduplication entity-resolution fuzzy-matching record-linkage
Created almost 5 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

DedupliPy

Deduplication is the task of combining different representations of the same real-world entity. This package implements deduplication using active learning, which allows for rapid training without a large, manually labelled dataset.
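To make "different representations of the same entity" concrete, the sketch below scores how similar two records are, field by field, using only Python's standard library. It illustrates the problem and is not DedupliPy's implementation: the helpers `field_similarity` and `record_similarity` and the sample records are hypothetical, and `difflib.SequenceMatcher` merely stands in for whatever string metrics the package uses.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_similarity(rec_a: dict, rec_b: dict, fields: list) -> list:
    """One similarity score per field; a classifier could use these as features."""
    return [field_similarity(rec_a[f], rec_b[f]) for f in fields]


# Hypothetical records: the first two describe the same entity, the third does not.
r1 = {"name": "John Smith", "address": "Main Street 1"}
r2 = {"name": "john smith ", "address": "Main Str. 1"}
r3 = {"name": "Jane Doe", "address": "Other Road 99"}

same = record_similarity(r1, r2, ["name", "address"])
different = record_similarity(r1, r3, ["name", "address"])
```

The scores in `same` are high for both fields, while those in `different` are low; deciding where to draw the line between the two is what the labelled examples are for.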

DedupliPy is an end-to-end solution with advantages over existing solutions:

  • active learning; no large manually labelled dataset required
  • during active learning, the user is notified when the model has converged and training may be finished
  • works out of the box; advanced users can choose settings as desired (custom blocking rules, custom metrics, interaction features)
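The list above mentions custom blocking rules. As a rough illustration of the general idea (not of DedupliPy's own API), a blocking rule maps each record to a key, and only records that share a key are compared, which avoids scoring every possible pair. The function names `first_three_letters` and `candidate_pairs` and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def first_three_letters(record: dict) -> str:
    """Hypothetical blocking rule: records share a block if their names start alike."""
    return record["name"].strip().lower()[:3]


def candidate_pairs(records: list, blocking_rule) -> set:
    """Return index pairs (i, j), i < j, of records that fall in the same block."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_rule(record)].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


records = [
    {"name": "John Smith"},
    {"name": "john smith jr"},
    {"name": "Jane Doe"},
    {"name": "Johanna Smit"},
]
pairs = candidate_pairs(records, first_three_letters)
```

With four records there are six possible pairs; this rule keeps only the three pairs among the records whose names start with "joh". A stricter rule yields fewer comparisons at the risk of missing true matches.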

Developed by Frits Hermans

Documentation

Documentation can be found on the project homepage: https://www.deduplipy.com

Installation

Normal installation

With pip

Install directly from PyPI.

pip install deduplipy

With conda

Install using conda from conda-forge channel.

conda install -c conda-forge deduplipy

Install to contribute

Clone this GitHub repo and install in editable mode:

python -m pip install -e ".[dev]"
python setup.py develop

Usage

Apply deduplication to your pandas DataFrame df as follows:

from deduplipy.deduplicator import Deduplicator

myDedupliPy = Deduplicator(col_names=['name', 'address'])
myDedupliPy.fit(df)

This starts the interactive learning session, in which you indicate whether each presented pair is a match (y) or not (n). During active learning you are notified that training may be finished once the model has converged. Predictions on (new) data are obtained as follows:

result = myDedupliPy.predict(df)
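For intuition about what such a labelling session does, here is a minimal, self-contained sketch of active learning by uncertainty sampling with a convergence notice. It is an assumption-laden toy, not DedupliPy's algorithm: the model is a single similarity threshold, the function `active_learning_session`, its `tol` and `patience` parameters, the scores, and the simulated answers are all hypothetical.

```python
def active_learning_session(scores, ask, tol=0.01, patience=3):
    """Label the most uncertain pairs until the decision threshold stabilises.

    scores: similarity score in [0, 1] for each candidate pair
    ask:    callable taking a pair index and returning 'y' (match) or 'n'
    Returns the learned threshold and the labels collected.
    """
    labels = {}
    threshold = 0.5
    stable_rounds = 0
    while len(labels) < len(scores):
        # Uncertainty sampling: query the unlabelled pair closest to the threshold.
        idx = min(
            (i for i in range(len(scores)) if i not in labels),
            key=lambda i: abs(scores[i] - threshold),
        )
        labels[idx] = ask(idx) == "y"

        # Refit: put the threshold midway between the highest-scoring known
        # non-match and the lowest-scoring known match.
        lo = max((scores[i] for i, m in labels.items() if not m), default=0.0)
        hi = min((scores[i] for i, m in labels.items() if m), default=1.0)
        new_threshold = (lo + hi) / 2

        # Count consecutive rounds in which the model barely changed.
        stable_rounds = stable_rounds + 1 if abs(new_threshold - threshold) < tol else 0
        threshold = new_threshold
        if stable_rounds >= patience:
            print("Training may be finished: the model has converged.")
            break
    return threshold, labels


# Simulated user: pairs scoring at least 0.65 are true matches (hypothetical data).
scores = [0.05, 0.20, 0.35, 0.47, 0.55, 0.60, 0.66, 0.70, 0.90, 0.97]
truth = [s >= 0.65 for s in scores]
threshold, labels = active_learning_session(
    scores, ask=lambda i: "y" if truth[i] else "n"
)
```

In an interactive setting, `ask` could wrap `input()` instead of looking up simulated answers. The session stops before every pair has been labelled because the threshold stops moving, which is the situation the convergence message is meant to signal.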

Owner

  • Login: fritshermans
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
title: DedupliPy
message: >-
  If you use DedupliPy, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Frits
    family-names: Hermans
repository-code: "https://github.com/fritshermans/deduplipy"
url: "https://www.deduplipy.com"
abstract: >-
  Deduplication is the task to combine different representations
  of the same real world entity. This package implements
  deduplication using active learning.
keywords:
  - deduplication
  - entity resolution
  - string matching
  - fuzzy matching
  - active learning
license: MIT

GitHub Events

Total
  • Watch event: 7
Last Year
  • Watch event: 7

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 266
  • Total Committers: 4
  • Avg Commits per committer: 66.5
  • Development Distribution Score (DDS): 0.064
Past Year
  • Commits: 4
  • Committers: 2
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.5
Top Committers
Name                  Email          Commits
fritshermans          p****t@f****l  249
Frits (F.K.) Hermans  f****s@i****m  15
Sugato Ray            s****y@u****m  1
vincent d warmerdam   v****m@g****m  1

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 10
  • Total pull requests: 19
  • Average time to close issues: 2 months
  • Average time to close pull requests: about 2 hours
  • Total issue authors: 10
  • Total pull request authors: 3
  • Average comments per issue: 4.3
  • Average comments per pull request: 0.05
  • Merged pull requests: 19
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • gregskol (1)
  • Murat-Topuz (1)
  • bingbong-sempai (1)
  • abhilashchowdhary (1)
  • azachar (1)
  • sugatoray (1)
  • koaning (1)
  • Pacman1984 (1)
  • NickCrews (1)
  • AlexAdolfoKohan (1)
Pull Request Authors
  • fritshermans (17)
  • sugatoray (1)
  • koaning (1)

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 27 last-month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 1
    (may contain duplicates)
  • Total versions: 27
  • Total maintainers: 1
pypi.org: deduplipy

End-to-end deduplication solution

  • Versions: 23
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 27 Last month
Rankings
Dependent packages count: 7.4%
Downloads: 7.9%
Stargazers count: 8.3%
Average: 11.5%
Forks count: 11.5%
Dependent repos count: 22.3%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: deduplipy

Deduplication is the task of combining different representations of the same real-world entity. This package implements deduplication using active learning. Active learning allows for rapid training without having to provide a large, manually labelled dataset.

DedupliPy is an end-to-end solution with advantages over existing solutions:

  • active learning; no large manually labelled dataset required
  • during active learning, the user is notified when the model has converged and training may be finished
  • works out of the box; advanced users can choose settings as desired (custom blocking rules, custom metrics, interaction features)

Developed by Frits Hermans (https://www.linkedin.com/in/frits-hermans-data-scientist/)

PyPI: https://pypi.org/project/DedupliPy/

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 34.0%
Stargazers count: 34.3%
Average: 41.0%
Forks count: 44.7%
Dependent packages count: 51.2%
Last synced: 6 months ago

Dependencies

docs/docs-requirements.txt pypi
  • Jinja2 <3.1
  • nbsphinx *
  • sphinx ==3.5.4
  • sphinx_rtd_theme *
pyproject.toml pypi
setup.py pypi