dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.

https://github.com/dedupeio/dedupe

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    3 of 70 committers (4.3%) from academic institutions
  • Institutional organization owner
    Organization dedupeio has institutional domain (dedupe.io)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.6%) to scientific vocabulary

Keywords

clustering datamade de-duplicating dedupe dedupe-library entity-resolution python python-library record-linkage

Keywords from Contributors

interaction tensor autograd distribution profiles quantum-circuit test-data-generator test-data faker-generator faker
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: dedupeio
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Homepage: https://docs.dedupe.io
  • Size: 5.99 MB
Statistics
  • Stars: 4,362
  • Watchers: 120
  • Forks: 566
  • Open Issues: 88
  • Releases: 0
Topics
clustering datamade de-duplicating dedupe dedupe-library entity-resolution python python-library record-linkage
Created almost 14 years ago · Last pushed 7 months ago
Metadata Files
Readme, Changelog, Contributing, License, Code of conduct, Citation

README.md

Dedupe Python Library

dedupe is a Python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on structured data.

dedupe will help you:

  • remove duplicate entries from a spreadsheet of names and addresses
  • link a list with customer information to another with order history, even without unique customer IDs
  • take a database of campaign contributions and figure out which ones were made by the same person, even if the names were entered slightly differently for each record

dedupe takes in human training data and comes up with the best rules for your dataset to quickly and automatically find similar records, even with very large databases.
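The idea of deriving matching rules from a few human-labelled examples can be illustrated with a toy sketch. This is not dedupe's API or its algorithm: it swaps in the standard library's `difflib.SequenceMatcher` for dedupe's learned similarity functions, compares every pair of records (which would not scale to large databases), and the sample records, field names, and the `similarity`, `learn_threshold`, and `cluster` helpers are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b, fields):
    """Average per-field string similarity, between 0 and 1."""
    scores = [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in fields
    ]
    return sum(scores) / len(scores)


def learn_threshold(labeled_pairs, fields):
    """Place the cutoff halfway between the weakest labelled match
    and the strongest labelled non-match (a deliberately naive rule)."""
    matches = [similarity(a, b, fields) for a, b, same in labeled_pairs if same]
    distinct = [similarity(a, b, fields) for a, b, same in labeled_pairs if not same]
    return (min(matches) + max(distinct)) / 2


def cluster(records, fields, threshold):
    """Group record ids whose similarity reaches the threshold (union-find)."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if similarity(records[a], records[b], fields) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for rid in records:
        groups.setdefault(find(rid), []).append(rid)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical sample data: a small list of names and addresses.
fields = ["name", "address"]
records = {
    1: {"name": "Jon Smith", "address": "12 Main St"},
    2: {"name": "John Smith", "address": "12 Main Street"},
    3: {"name": "Maria Garcia", "address": "400 Oak Ave"},
    4: {"name": "Maria Garcia", "address": "400 Oak Avenue"},
    5: {"name": "Wei Chen", "address": "7 Lake Rd"},
}

# "Human training data": one pair labelled as a match, one as distinct.
labeled_pairs = [
    (records[1], records[2], True),
    (records[1], records[3], False),
]

threshold = learn_threshold(labeled_pairs, fields)
clusters = cluster(records, fields, threshold)
print(clusters)
```

With this sample data, records 1 and 2 end up in one group, records 3 and 4 in another, and record 5 stays on its own. For the real workflow, see the dedupe documentation and the examples repository linked below.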

Important links

  • Documentation: https://docs.dedupe.io/
  • Repository: https://github.com/dedupeio/dedupe
  • Issues: https://github.com/dedupeio/dedupe/issues
  • Mailing list: https://groups.google.com/forum/#!forum/open-source-deduplication
  • Examples: https://github.com/dedupeio/dedupe-examples

dedupe library consulting

If you or your organization would like professional assistance in working with the dedupe library, Dedupe.io LLC offers consulting services. Read more about pricing and available services here.

Tools built with dedupe

Dedupe.io

A cloud service powered by the dedupe library for de-duplicating and finding matches in your data. It provides a step-by-step wizard for uploading your data, setting up a model, training, clustering and reviewing the results.

Dedupe.io also supports record linkage across data sources and continuous matching and training through an API.

For more, see the Dedupe.io product site, tutorials on how to use it, and differences between it and the dedupe library.

Dedupe is widely used in the Python community. Check out this blog post, a YouTube video on how to use Dedupe with Python, and a YouTube video on how to apply Dedupe at scale using Spark.

csvdedupe

Command line tool for de-duplicating and linking CSV files. Read about it on Source Knight-Mozilla OpenNews.

Installation

Using dedupe

If you only want to use dedupe, install it this way:

    pip install dedupe

Familiarize yourself with dedupe's API, and get started on your project. Need inspiration? Have a look at some examples.

Developing dedupe

We recommend using virtualenv and virtualenvwrapper for working in a virtualized development environment. Read how to set up virtualenv.

Once you have virtualenvwrapper set up,

    mkvirtualenv dedupe
    git clone https://github.com/dedupeio/dedupe.git
    cd dedupe
    pip install -e . --config-settings editable_mode=compat
    pip install -r requirements.txt

If these tests pass, then everything should have been installed correctly!

    pytest

Afterwards, whenever you want to work on dedupe,

    workon dedupe

Testing

Unit tests of core dedupe functions

    pytest

Test using canonical dataset from Bilenko's research

Using Deduplication

    python -m pip install -e ./benchmarks
    python benchmarks/benchmarks/canonical.py

Using Record Linkage

    python -m pip install -e ./benchmarks
    python benchmarks/benchmarks/canonical_matching.py

Team

  • Forest Gregg, DataMade
  • Derek Eder, DataMade

Credits

Dedupe is based on Mikhail Yuryevich Bilenko's Ph.D. dissertation: Learnable Similarity Functions and their Application to Record Linkage and Clustering.

Errors / Bugs

If something is not behaving intuitively, it is a bug and should be reported. Report it at https://github.com/dedupeio/dedupe/issues.

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Send us a pull request. Bonus points for topic branches.

Copyright

Copyright (c) 2022 Forest Gregg and Derek Eder. Released under the MIT License.

Third-party copyright in this distribution is noted where applicable.

Citing Dedupe

If you use Dedupe in an academic work, please give this citation:

Forest Gregg and Derek Eder. 2022. Dedupe. https://github.com/dedupeio/dedupe.

Owner

  • Name: Dedupe.io
  • Login: dedupeio
  • Kind: organization
  • Email: dedupe@datamade.us
  • Location: Chicago, IL

De-duplicate and find matches in your Excel spreadsheet or database

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Gregg"
  given-names: "Forest"
- family-names: "Eder"
  given-names: "Derek"
title: "dedupe"
version: 2.0.11
date-released: 2022-01-27
url: "https://github.com/dedupeio/dedupe"

GitHub Events

Total
  • Issues event: 5
  • Watch event: 210
  • Delete event: 2
  • Issue comment event: 14
  • Push event: 2
  • Pull request event: 6
  • Fork event: 26
  • Create event: 3
Last Year
  • Issues event: 5
  • Watch event: 210
  • Delete event: 2
  • Issue comment event: 14
  • Push event: 2
  • Pull request event: 6
  • Fork event: 26
  • Create event: 3

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 2,968
  • Total Committers: 70
  • Avg Commits per committer: 42.4
  • Development Distribution Score (DDS): 0.236
Past Year
  • Commits: 27
  • Committers: 4
  • Avg Commits per committer: 6.75
  • Development Distribution Score (DDS): 0.222
Top Committers
Name Email Commits
Forest Gregg f****g@u****u 2,269
Derek Eder d****r@g****m 225
Nick Crews n****s@g****m 125
nikitsaraf n****f@g****m 102
Cathy Deng c****5@g****m 48
markhuberty m****y@g****m 30
Eric van Zanten e****n@g****m 25
dependabot[bot] 4****] 22
Jean Cochrane j****n@j****m 13
Lorenzo Moreschini l****i@g****m 12
Jeff Hendricks j****s@c****m 7
Wade Leftwich w****h@r****m 5
Atul Varma v****a@g****m 4
Flávio Juvenal f****o@v****r 4
Zack Maril z****k@z****m 4
Michael E. Karpeles m****s@g****m 4
Nathan Hoeft f****8@g****m 3
Frits (F.K.) Hermans f****s@i****m 3
Mark Huberty m****y@m****) 3
daniel-acuna d****a@n****u 3
Jochen Brissier j****r@g****m 2
Primož k****z@g****m 2
nmiranda n****a@d****l 2
Leobouloc L****o@b****u 2
Geoff Hing g****g@a****g 2
Kevin Dwyer d****r@t****m 2
John O'Leary j****y@c****m 2
Benjamin Manns b****s@g****m 1
Ben Smithgall b****l@g****m 1
Azat Abubakirov k****t@g****m 1
and 40 more...
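
The all-time summary figures above can be reproduced from the listed counts. The sketch below assumes the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is not stated on this page, but it is consistent with the numbers shown.

```python
# All-time figures copied from the committer statistics above.
total_commits = 2968
total_committers = 70
top_committer_commits = 2269  # Forest Gregg, first row of the table

# Average commits per committer.
avg_commits = total_commits / total_committers

# Assumed definition: DDS = 1 - (top committer's commits / total commits).
# A value near 0 means one person wrote almost everything;
# a value near 1 means the work is spread across many committers.
dds = 1 - top_committer_commits / total_commits

print(round(avg_commits, 1), round(dds, 3))
```

Under that assumption, the rounded results agree with the listed 42.4 commits per committer and DDS of 0.236.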

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 81
  • Total pull requests: 73
  • Average time to close issues: about 1 month
  • Average time to close pull requests: about 1 month
  • Total issue authors: 48
  • Total pull request authors: 16
  • Average comments per issue: 2.83
  • Average comments per pull request: 2.16
  • Merged pull requests: 35
  • Bot issues: 0
  • Bot pull requests: 30
Past Year
  • Issues: 9
  • Pull requests: 8
  • Average time to close issues: N/A
  • Average time to close pull requests: 10 days
  • Issue authors: 8
  • Pull request authors: 5
  • Average comments per issue: 0.67
  • Average comments per pull request: 0.88
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 4
Top Authors
Issue Authors
  • fgregg (16)
  • NickCrews (10)
  • ArVar (3)
  • lmores (3)
  • havardox (2)
  • pecade (2)
  • saipraneeth171 (2)
  • rderidder-lda (2)
  • Pobby321 (2)
  • oreccb (2)
  • Abhishek-thetechie (1)
  • leifericf (1)
  • EvanOman (1)
  • jaime-varela (1)
  • raulsperoni (1)
Pull Request Authors
  • dependabot[bot] (42)
  • fgregg (15)
  • NickCrews (15)
  • lmores (6)
  • AhmedNader42 (2)
  • andrea-gi (2)
  • ArVar (2)
  • jorenham (1)
  • regel (1)
  • f-hafner (1)
  • jack-odonoghue (1)
  • EvanOman (1)
  • graeme-russell (1)
  • PaulM5406 (1)
  • benmanns (1)
Top Labels
Issue Labels
research (1) enhancement (1)
Pull Request Labels
dependencies (42) github_actions (41) python (1)

Packages

  • Total packages: 4
  • Total downloads:
    • pypi 53,499 last-month
  • Total docker downloads: 2,053
  • Total dependent packages: 7
    (may contain duplicates)
  • Total dependent repositories: 132
    (may contain duplicates)
  • Total versions: 302
  • Total maintainers: 3
pypi.org: dedupe

A Python library for accurate and scalable data deduplication and entity resolution

  • Homepage: https://github.com/dedupeio/dedupe
  • Documentation: https://docs.dedupe.io/en/latest/
  • License: The MIT License (MIT), Copyright (c) 2014 Forest Gregg, Derek Eder, DataMade and Contributors
  • Latest release: 3.0.3
    published over 1 year ago
  • Versions: 179
  • Dependent Packages: 7
  • Dependent Repositories: 132
  • Downloads: 53,464 Last month
  • Docker Downloads: 2,053
Rankings
Dependent packages count: 1.1%
Downloads: 1.2%
Stargazers count: 1.2%
Dependent repos count: 1.3%
Average: 1.6%
Forks count: 2.3%
Docker downloads count: 2.5%
Maintainers (2)
Last synced: 6 months ago
proxy.golang.org: github.com/dedupeio/dedupe
  • Versions: 110
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.6%
Average: 5.8%
Dependent repos count: 6.0%
Last synced: 6 months ago
conda-forge.org: dedupe
  • Versions: 11
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Stargazers count: 5.5%
Forks count: 6.8%
Average: 24.4%
Dependent repos count: 34.0%
Dependent packages count: 51.2%
Last synced: 6 months ago
pypi.org: dedupe-fork-eccovia

A Python library for accurate and scalable data deduplication and entity resolution

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 35 Last month
Rankings
Dependent packages count: 6.6%
Average: 28.4%
Forks count: 30.5%
Dependent repos count: 30.6%
Downloads: 35.0%
Stargazers count: 39.1%
Maintainers (1)
Last synced: 6 months ago
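
The totals at the top of the Packages section appear to be plain sums over the four registry listings, which would explain the "may contain duplicates" caveat. The sketch below checks that arithmetic; it assumes that registries showing no download count contribute nothing to the download total, and the dictionary layout is purely illustrative.

```python
# Per-registry figures copied from the listings in this section.
packages = {
    "pypi.org: dedupe": {
        "versions": 179, "downloads": 53464,
        "dependent_packages": 7, "dependent_repos": 132,
    },
    "proxy.golang.org: github.com/dedupeio/dedupe": {
        "versions": 110, "dependent_packages": 0, "dependent_repos": 0,
    },
    "conda-forge.org: dedupe": {
        "versions": 11, "dependent_packages": 0, "dependent_repos": 0,
    },
    "pypi.org: dedupe-fork-eccovia": {
        "versions": 2, "downloads": 35,
        "dependent_packages": 0, "dependent_repos": 0,
    },
}

# Sum each metric across registries; a missing metric counts as 0 (assumption).
metrics = ("versions", "downloads", "dependent_packages", "dependent_repos")
totals = {m: sum(p.get(m, 0) for p in packages.values()) for m in metrics}

print(len(packages), totals)
```

The sums agree with the listed totals: 4 packages, 302 versions, 53,499 downloads last month, 7 dependent packages and 132 dependent repositories.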

Dependencies

.github/workflows/benchmark-bot.yml actions
  • actions/checkout v3 composite
  • actions/github-script v6 composite
  • actions/setup-python v3 composite
.github/workflows/codeql-analysis.yml actions
  • actions/checkout v3 composite
  • github/codeql-action/analyze v2 composite
  • github/codeql-action/autobuild v2 composite
  • github/codeql-action/init v2 composite
.github/workflows/lock.yml actions
  • dessant/lock-threads v4 composite
.github/workflows/pythonpackage.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • pypa/cibuildwheel v2.11.3 composite
docs/requirements.txt pypi
  • sphinx >=4.3.0
  • sphinx-autodoc-typehints *
  • sphinx-rtd-theme >=0.5.1
  • sphinxcontrib-htmlhelp *
  • sphinxcontrib-jsmath *
  • sphinxcontrib-serializinghtml *
requirements.txt pypi
  • asv *
  • black *
  • coverage *
  • coveralls *
  • flake8 *
  • isort *
  • mock *
  • mypy *
  • pytest *
  • pytest-cov *
  • virtualenv *