dedupe
:id: A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
Science Score: 62.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 3 of 70 committers (4.3%) from academic institutions
- ✓ Institutional organization owner: Organization dedupeio has institutional domain (dedupe.io)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.6%) to scientific vocabulary
Repository
:id: A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
Basic Info
- Host: GitHub
- Owner: dedupeio
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://docs.dedupe.io
- Size: 5.99 MB
Statistics
- Stars: 4,362
- Watchers: 120
- Forks: 566
- Open Issues: 88
- Releases: 0
Metadata Files
README.md
Dedupe Python Library
dedupe is a Python library that uses machine learning to perform fuzzy matching, deduplication, and entity resolution quickly on structured data.
dedupe will help you:
- remove duplicate entries from a spreadsheet of names and addresses
- link a list with customer information to another with order history, even without unique customer IDs
- take a database of campaign contributions and figure out which ones were made by the same person, even if the names were entered slightly differently for each record
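As a rough illustration of the first use case in the list above, the following sketch groups near-identical rows using only the Python standard library. It is a toy stand-in for the idea of fuzzy matching, not dedupe's algorithm or API: the `similarity` and `cluster_duplicates` helpers, the sample rows, and the 0.8 threshold are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def cluster_duplicates(records: list[str], threshold: float = 0.85) -> list[list[int]]:
    """Group record indices whose pairwise similarity reaches the threshold."""
    parent = list(range(len(records)))  # union-find forest, one node per record

    def find(i: int) -> int:
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair; merge the two clusters when the pair looks alike.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters: dict[int, list[int]] = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


rows = [
    "Jon Smith, 12 Main St",
    "John Smith, 12 Main Street",
    "Maria Garcia, 40 Oak Ave",
    "jon smith, 12 main st",
]
print(cluster_duplicates(rows, threshold=0.8))
```

Comparing every pair is quadratic in the number of rows, so a sketch like this only works for small inputs. A hand-picked threshold is also fragile, which is part of the motivation for learning the matching rules from labeled examples instead.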
dedupe takes in human training data and comes up with the best rules for your dataset to quickly and automatically find similar records, even with very large databases.
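One way to picture how labeled examples can produce rules that keep matching fast is blocking: only records that share a key are compared, and the key is chosen so that known duplicates stay together while the number of comparisons stays small. The sketch below is a simplified illustration under stated assumptions, not dedupe's implementation. The candidate keys in `CANDIDATE_KEYS`, the record layout, and the selection criterion are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical candidate blocking keys: records sharing a key value are compared.
CANDIDATE_KEYS = {
    "first_letter_of_name": lambda r: r["name"][:1].lower(),
    "first_three_of_name": lambda r: r["name"][:3].lower(),
    "zip_code": lambda r: r["zip"],
}


def candidate_pairs(records, key_fn):
    """Return the set of index pairs (i < j) that land in the same block."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[key_fn(record)].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


def choose_blocking_key(records, labeled_duplicates):
    """Pick the key that keeps every labeled duplicate pair together
    while producing the fewest candidate comparisons."""
    best_name, best_cost = None, None
    for name, key_fn in CANDIDATE_KEYS.items():
        pairs = candidate_pairs(records, key_fn)
        if not labeled_duplicates <= pairs:
            continue  # this key would separate a known duplicate pair
        if best_cost is None or len(pairs) < best_cost:
            best_name, best_cost = name, len(pairs)
    return best_name, best_cost


records = [
    {"name": "Jon Smith", "zip": "60614"},
    {"name": "John Smith", "zip": "60614"},
    {"name": "Joan Baker", "zip": "60622"},
    {"name": "Maria Garcia", "zip": "60601"},
    {"name": "M. Garcia", "zip": "60601"},
]
# Index pairs that a person marked as referring to the same entity.
labeled_duplicates = {(0, 1), (3, 4)}
print(choose_blocking_key(records, labeled_duplicates))
```

In this toy data the first three letters of the name would split both labeled pairs, and the first letter keeps them together at the cost of four comparisons, so the ZIP code key wins with two.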
Important links
- Documentation: https://docs.dedupe.io/
- Repository: https://github.com/dedupeio/dedupe
- Issues: https://github.com/dedupeio/dedupe/issues
- Mailing list: https://groups.google.com/forum/#!forum/open-source-deduplication
- Examples: https://github.com/dedupeio/dedupe-examples
dedupe library consulting
If you or your organization would like professional assistance in working with the dedupe library, Dedupe.io LLC offers consulting services. Read more about pricing and available services here.
Tools built with dedupe
Dedupe.io
A cloud service powered by the dedupe library for de-duplicating and finding matches in your data. It provides a step-by-step wizard for uploading your data, setting up a model, training, clustering and reviewing the results.
Dedupe.io also supports record linkage across data sources and continuous matching and training through an API.
For more, see the Dedupe.io product site, tutorials on how to use it, and differences between it and the dedupe library.
Dedupe is well adopted by the Python community. Check out this blog post, a YouTube video on how to use Dedupe with Python, and a YouTube video on how to apply Dedupe at scale using Spark.
csvdedupe
Command line tool for de-duplicating and linking CSV files. Read about it on Source Knight-Mozilla OpenNews.
Installation
Using dedupe
If you only want to use dedupe, install it this way:
```bash
pip install dedupe
```
Familiarize yourself with dedupe's API, and get started on your project. Need inspiration? Have a look at some examples.
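Before starting a project, it can help to confirm that the package is visible to the interpreter you intend to use. The snippet below is a small sketch that relies only on the standard library's `importlib.metadata`. The `installed_version` helper is a name made up for this example and is not part of dedupe.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("dedupe")
print(f"dedupe {found}" if found else "dedupe is not installed in this environment")
```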
Developing dedupe
We recommend using virtualenv and virtualenvwrapper for working in a virtualized development environment. Read how to set up virtualenv.
Once you have virtualenvwrapper set up,
```bash
mkvirtualenv dedupe
git clone https://github.com/dedupeio/dedupe.git
cd dedupe
pip install -e . --config-settings editable_mode=compat
pip install -r requirements.txt
```
If these tests pass, then everything should have been installed correctly!
```bash
pytest
```
Afterwards, whenever you want to work on dedupe,
```bash
workon dedupe
```
Testing
Unit tests of core dedupe functions
```bash
pytest
```
Test using canonical dataset from Bilenko's research
Using Deduplication
```bash
python -m pip install -e ./benchmarks
python benchmarks/benchmarks/canonical.py
```
Using Record Linkage
```bash
python -m pip install -e ./benchmarks
python benchmarks/benchmarks/canonical_matching.py
```
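The two benchmarks exercise different problems: deduplication looks for matches within a single dataset, while record linkage matches records across two datasets. For readers new to the distinction, here is a minimal standard-library sketch of the linkage case. It does not reproduce the benchmark scripts or dedupe's API. The `link_records` helper, the sample names, and the 0.6 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import Optional


def score(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(
    left: list[str], right: list[str], threshold: float = 0.8
) -> dict[int, Optional[int]]:
    """Map each left index to the index of its best right match, or None."""
    links: dict[int, Optional[int]] = {}
    for i, left_name in enumerate(left):
        scores = [score(left_name, right_name) for right_name in right]
        # Pick the highest-scoring candidate, if there are any candidates at all.
        best = max(range(len(right)), key=scores.__getitem__) if right else None
        # Keep the link only when the best score clears the threshold.
        links[i] = best if best is not None and scores[best] >= threshold else None
    return links


customers = ["Acme Corporation", "Globex LLC", "Initech"]
orders = ["Initech Inc.", "ACME Corp", "Umbrella Co"]
print(link_records(customers, orders, threshold=0.6))
```

Unlike the deduplication case, comparisons here only cross the two lists, so records within `customers` are never compared with each other.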
Team
- Forest Gregg, DataMade
- Derek Eder, DataMade
Credits
Dedupe is based on Mikhail Yuryevich Bilenko's Ph.D. dissertation: Learnable Similarity Functions and their Application to Record Linkage and Clustering.
Errors / Bugs
If something is not behaving intuitively, it is a bug and should be reported. Report it on the issue tracker: https://github.com/dedupeio/dedupe/issues
Note on Patches/Pull Requests
- Fork the project.
- Make your feature addition or bug fix.
- Send us a pull request. Bonus points for topic branches.
Copyright
Copyright (c) 2022 Forest Gregg and Derek Eder. Released under the MIT License.
Third-party copyright in this distribution is noted where applicable.
Citing Dedupe
If you use Dedupe in an academic work, please give this citation:
Forest Gregg and Derek Eder. 2022. Dedupe. https://github.com/dedupeio/dedupe.
Owner
- Name: Dedupe.io
- Login: dedupeio
- Kind: organization
- Email: dedupe@datamade.us
- Location: Chicago, IL
- Website: https://dedupe.io/
- Repositories: 31
- Profile: https://github.com/dedupeio
De-duplicate and find matches in your Excel spreadsheet or database
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Gregg"
    given-names: "Forest"
  - family-names: "Eder"
    given-names: "Derek"
title: "dedupe"
version: 2.0.11
date-released: 2022-01-27
url: "https://github.com/dedupeio/dedupe"
GitHub Events
Total
- Issues event: 5
- Watch event: 210
- Delete event: 2
- Issue comment event: 14
- Push event: 2
- Pull request event: 6
- Fork event: 26
- Create event: 3
Last Year
- Issues event: 5
- Watch event: 210
- Delete event: 2
- Issue comment event: 14
- Push event: 2
- Pull request event: 6
- Fork event: 26
- Create event: 3
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Forest Gregg | f****g@u****u | 2,269 |
| Derek Eder | d****r@g****m | 225 |
| Nick Crews | n****s@g****m | 125 |
| nikitsaraf | n****f@g****m | 102 |
| Cathy Deng | c****5@g****m | 48 |
| markhuberty | m****y@g****m | 30 |
| Eric van Zanten | e****n@g****m | 25 |
| dependabot[bot] | 4****] | 22 |
| Jean Cochrane | j****n@j****m | 13 |
| Lorenzo Moreschini | l****i@g****m | 12 |
| Jeff Hendricks | j****s@c****m | 7 |
| Wade Leftwich | w****h@r****m | 5 |
| Atul Varma | v****a@g****m | 4 |
| Flávio Juvenal | f****o@v****r | 4 |
| Zack Maril | z****k@z****m | 4 |
| Michael E. Karpeles | m****s@g****m | 4 |
| Nathan Hoeft | f****8@g****m | 3 |
| Frits (F.K.) Hermans | f****s@i****m | 3 |
| Mark Huberty | m****y@m****) | 3 |
| daniel-acuna | d****a@n****u | 3 |
| Jochen Brissier | j****r@g****m | 2 |
| Primož | k****z@g****m | 2 |
| nmiranda | n****a@d****l | 2 |
| Leobouloc | L****o@b****u | 2 |
| Geoff Hing | g****g@a****g | 2 |
| Kevin Dwyer | d****r@t****m | 2 |
| John O'Leary | j****y@c****m | 2 |
| Benjamin Manns | b****s@g****m | 1 |
| Ben Smithgall | b****l@g****m | 1 |
| Azat Abubakirov | k****t@g****m | 1 |
| and 40 more... | | |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 81
- Total pull requests: 73
- Average time to close issues: about 1 month
- Average time to close pull requests: about 1 month
- Total issue authors: 48
- Total pull request authors: 16
- Average comments per issue: 2.83
- Average comments per pull request: 2.16
- Merged pull requests: 35
- Bot issues: 0
- Bot pull requests: 30
Past Year
- Issues: 9
- Pull requests: 8
- Average time to close issues: N/A
- Average time to close pull requests: 10 days
- Issue authors: 8
- Pull request authors: 5
- Average comments per issue: 0.67
- Average comments per pull request: 0.88
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 4
Top Authors
Issue Authors
- fgregg (16)
- NickCrews (10)
- ArVar (3)
- lmores (3)
- havardox (2)
- pecade (2)
- saipraneeth171 (2)
- rderidder-lda (2)
- Pobby321 (2)
- oreccb (2)
- Abhishek-thetechie (1)
- leifericf (1)
- EvanOman (1)
- jaime-varela (1)
- raulsperoni (1)
Pull Request Authors
- dependabot[bot] (42)
- fgregg (15)
- NickCrews (15)
- lmores (6)
- AhmedNader42 (2)
- andrea-gi (2)
- ArVar (2)
- jorenham (1)
- regel (1)
- f-hafner (1)
- jack-odonoghue (1)
- EvanOman (1)
- graeme-russell (1)
- PaulM5406 (1)
- benmanns (1)
Packages
- Total packages: 4
- Total downloads:
  - pypi: 53,499 last month
- Total docker downloads: 2,053
- Total dependent packages: 7 (may contain duplicates)
- Total dependent repositories: 132 (may contain duplicates)
- Total versions: 302
- Total maintainers: 3
pypi.org: dedupe
A Python library for accurate and scalable data deduplication and entity-resolution
- Homepage: https://github.com/dedupeio/dedupe
- Documentation: https://docs.dedupe.io/en/latest/
- License: The MIT License (MIT)

  Copyright (c) 2014 Forest Gregg, Derek Eder, DataMade and Contributors

  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- Latest release: 3.0.3 (published over 1 year ago)
Maintainers (2)
proxy.golang.org: github.com/dedupeio/dedupe
- Documentation: https://pkg.go.dev/github.com/dedupeio/dedupe#section-documentation
- License: mit
- Latest release: v3.0.3+incompatible (published over 1 year ago)
conda-forge.org: dedupe
- Homepage: https://github.com/dedupeio/dedupe
- License: MIT
- Latest release: 2.0.19 (published over 3 years ago)
pypi.org: dedupe-fork-eccovia
A Python library for accurate and scalable data deduplication and entity-resolution
- Homepage: https://github.com/tigerang22/dedupe
- Documentation: https://docs.dedupe.io/en/latest/
- License: MIT License
- Latest release: 2.0.13 (published about 3 years ago)
Maintainers (1)
Dependencies
- actions/checkout v3 composite
- actions/github-script v6 composite
- actions/setup-python v3 composite
- actions/checkout v3 composite
- github/codeql-action/analyze v2 composite
- github/codeql-action/autobuild v2 composite
- github/codeql-action/init v2 composite
- dessant/lock-threads v4 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- pypa/cibuildwheel v2.11.3 composite
- sphinx >=4.3.0
- sphinx-autodoc-typehints *
- sphinx-rtd-theme >=0.5.1
- sphinxcontrib-htmlhelp *
- sphinxcontrib-jsmath *
- sphinxcontrib-serializinghtml *
- asv *
- black *
- coverage *
- coveralls *
- flake8 *
- isort *
- mock *
- mypy *
- pytest *
- pytest-cov *
- virtualenv *