BibDedupe

BibDedupe: An Open-Source Python Library for Bibliographic Record Deduplication - Published in JOSS (2024)

https://github.com/colrev-environment/bib-dedupe

Science Score: 100.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
    1 of 4 committers (25.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords from Contributors

standardization
Last synced: 6 months ago

Repository

Statistics
  • Stars: 6
  • Watchers: 1
  • Forks: 3
  • Open Issues: 1
  • Releases: 16
Created over 2 years ago · Last pushed 8 months ago
Metadata Files
Readme, Changelog, Contributing, License, Citation

README.md

# BibDedupe

[![status](https://joss.theoj.org/papers/b954027d06d602c106430e275fe72130/status.svg)](https://joss.theoj.org/papers/b954027d06d602c106430e275fe72130) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/bib-dedupe)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit) [![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/CoLRev-Environment/bib-dedupe/.github%2Fworkflows%2Ftests.yml?label=tests)](https://github.com/CoLRev-Environment/bib-dedupe/actions/workflows/tests.yml) [![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/CoLRev-Environment/bib-dedupe/.github%2Fworkflows%2Fdocs.yml?label=docs)](https://github.com/CoLRev-Environment/bib-dedupe/actions/workflows/docs.yml) [![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/CoLRev-Environment/bib-dedupe/.github%2Fworkflows%2Fevaluate.yml?label=continuous%20evaluation)](https://github.com/CoLRev-Environment/bib-dedupe/actions/workflows/evaluate.yml)

Overview

BibDedupe is an open-source Python library for deduplication of bibliographic records, tailored for literature reviews. Unlike traditional deduplication methods, BibDedupe focuses on entity resolution, linking duplicate records instead of simply deleting them.

Features

  • Automated Duplicate Linking with Zero False Positives: BibDedupe automates duplicate linking, and its matching rules are designed to avoid false positives, that is, distinct records wrongly linked as duplicates.
  • Preprocessing Approach: BibDedupe's preprocessing reflects the error-generation processes specific to academic databases, such as author re-formatting, journal abbreviations, and translations.
  • Entity Resolution: Rather than simply deleting duplicates, BibDedupe links them to resolve the underlying entity and integrates their data. This allows for validation and undo operations.
  • Programmatic Access: BibDedupe is designed for seamless integration into existing research workflows, providing programmatic access for easy incorporation into scripts and applications.
  • Transparent and Reproducible Rules: BibDedupe's blocking and matching rules are transparent, so deduplication results can be inspected and reproduced.
  • Continuous Benchmarking: Continuous integration tests running on GitHub Actions ensure ongoing benchmarking, maintaining the library's reliability and performance across datasets.
  • Efficient and Parallel Computation: BibDedupe implements computations efficiently and in parallel, using appropriate data structures and functions for optimal performance.
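
The blocking, matching, and linking steps named in the features above can be illustrated with a minimal, standard-library-only sketch. This is not BibDedupe's implementation or API: the function names (`normalize`, `block`, `match`, `link`), the blocking key (year plus first title word), and the 0.9 similarity threshold are all assumptions chosen for illustration. The point it shows is that matched records are linked into clusters that keep every original ID, rather than being deleted.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and drop punctuation so formatting differences do not hide a match."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def block(records):
    """Group record IDs by a cheap key (hypothetical: year + first title word)."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        words = normalize(rec["title"]).split()
        key = (rec["year"], words[0] if words else "")
        blocks[key].append(rec_id)
    return blocks


def match(records, blocks, threshold=0.9):
    """Compare titles only within a block; the threshold is an illustrative assumption."""
    pairs = []
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            sim = SequenceMatcher(
                None, normalize(records[a]["title"]), normalize(records[b]["title"])
            ).ratio()
            if sim >= threshold:
                pairs.append((a, b))
    return pairs


def link(records, pairs):
    """Union-find: connect matched pairs into clusters without discarding any ID."""
    parent = {rec_id: rec_id for rec_id in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = defaultdict(list)
    for rec_id in records:
        clusters[find(rec_id)].append(rec_id)
    return {root: sorted(members) for root, members in clusters.items()}


# Toy records (made up for this sketch): r1 and r2 differ only in formatting.
records = {
    "r1": {"title": "BibDedupe: An Open-Source Python Library for Bibliographic Record Deduplication", "year": "2024"},
    "r2": {"title": "BibDedupe - an open-source Python library for bibliographic record deduplication.", "year": "2024"},
    "r3": {"title": "A different paper on entity resolution", "year": "2024"},
}

clusters = link(records, match(records, block(records)))
print(clusters)  # {'r1': ['r1', 'r2'], 'r3': ['r3']}
```

Because each cluster retains the IDs of all its members, a linked pair can be reviewed and split again later, which is the property that makes validation and undo possible.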

Documentation

Explore the official documentation for comprehensive information on installation, usage, and customization of BibDedupe.

Citation

If you use BibDedupe in your research, please cite it as follows:

Wagner, G. (2024). BibDedupe: An Open-Source Python Library for Bibliographic Record Deduplication. Journal of Open Source Software, 9(97), 6318. https://doi.org/10.21105/joss.06318

Contribution Guidelines

We welcome contributions from the community to enhance and expand BibDedupe. If you would like to contribute, please follow our contribution guidelines.

License

BibDedupe is released under the MIT License, allowing free and open use and modification.

Contact

For any questions, issues, or feedback, please open an issue on our GitHub repository.

Happy deduplicating with BibDedupe!

Owner

  • Name: CoLRev-Environment
  • Login: CoLRev-Environment
  • Kind: organization

JOSS Publication

BibDedupe: An Open-Source Python Library for Bibliographic Record Deduplication
  • Published: May 22, 2024
  • Volume 9, Issue 97, Page 6318
  • Authors: Gerit Wagner (Otto-Friedrich Universität Bamberg)
  • Editor: Ana Trisovic
  • Tags: Bibliographic Records, Deduplication, Data Preprocessing, Blocking

Citation (CITATION.cff)

cff-version: "1.2.0"
authors:
- family-names: Wagner
  given-names: Gerit
  orcid: "https://orcid.org/0000-0003-3926-7717"
doi: 10.5281/zenodo.11223590
message: If you use this software, please cite our article in the
  Journal of Open Source Software.
preferred-citation:
  authors:
  - family-names: Wagner
    given-names: Gerit
    orcid: "https://orcid.org/0000-0003-3926-7717"
  date-published: 2024-05-22
  doi: 10.21105/joss.06318
  issn: 2475-9066
  issue: 97
  journal: Journal of Open Source Software
  publisher:
    name: Open Journals
  start: 6318
  title: "BibDedupe: An Open-Source Python Library for Bibliographic
    Record Deduplication"
  type: article
  url: "https://joss.theoj.org/papers/10.21105/joss.06318"
  volume: 9
title: "BibDedupe: An Open-Source Python Library for Bibliographic
  Record Deduplication"

GitHub Events

Total
  • Create event: 7
  • Issues event: 3
  • Release event: 2
  • Watch event: 3
  • Delete event: 6
  • Issue comment event: 1
  • Push event: 72
  • Pull request event: 9
  • Fork event: 2
Last Year
  • Create event: 7
  • Issues event: 3
  • Release event: 2
  • Watch event: 3
  • Delete event: 6
  • Issue comment event: 1
  • Push event: 72
  • Pull request event: 9
  • Fork event: 2

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 239
  • Total Committers: 4
  • Avg Commits per committer: 59.75
  • Development Distribution Score (DDS): 0.41
Past Year
  • Commits: 72
  • Committers: 3
  • Avg Commits per committer: 24.0
  • Development Distribution Score (DDS): 0.444
Top Committers
Name                   Email            Commits
Gerit Wagner           g****r@u****e    141
bib-dedupe evaluator   y****l@e****m    78
Poetry updater         a****s           19
github-actions[bot]    4****]           1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 12
  • Total pull requests: 31
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 10 days
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 1.83
  • Average comments per pull request: 0.0
  • Merged pull requests: 20
  • Bot issues: 1
  • Bot pull requests: 29
Past Year
  • Issues: 3
  • Pull requests: 7
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 17 days
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 1
  • Bot pull requests: 6
Top Authors
Issue Authors
  • DrMattG (6)
  • linuxscout (2)
  • geritwagner (2)
Pull Request Authors
  • github-actions[bot] (48)
  • geritwagner (4)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 6,891 last month
  • Total dependent packages: 1
  • Total dependent repositories: 0
  • Total versions: 16
  • Total maintainers: 1
pypi.org: bib-dedupe

Identify and merge duplicates in bibliographic records

  • Versions: 16
  • Dependent Packages: 1
  • Dependent Repositories: 0
  • Downloads: 6,891 Last month
Rankings
Dependent packages count: 9.9%
Average: 38.8%
Dependent repos count: 67.7%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/publish.yml actions
  • actions/checkout v3 composite
.github/workflows/tests.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
pyproject.toml pypi
  • colrev ^0.10.4
  • coverage ^7.3.2
  • pylint 3.0.1
  • pytest ^7.2.1
  • python ^3.8
.github/workflows/evaluate.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/docs.yml actions
  • sphinx-notes/pages v3 composite
notebooks/buhos/Gemfile rubygems
  • mutant >= 0 development
  • mutant-rspec >= 0 development
  • pkgr >= 0 development
  • pry >= 0 development
  • rubocop >= 0 development
  • sassc >= 0 development
  • simplecov >= 0 development
  • test-prof >= 0 development
  • yard >= 0 development
  • yard-sinatra >= 0 development
  • ai4r >= 0
  • bibtex-ruby >= 0
  • caxlsx >= 0
  • certified >= 0
  • dotenv >= 0
  • elsevier_api >= 0
  • grim >= 0
  • haml >= 0
  • i18n >= 0
  • json >= 0
  • levenshtein >= 0
  • levenshtein-ffi >= 0
  • libcache >= 0
  • mail >= 0
  • mimemagic >= 0
  • moneta >= 0
  • mysql2 >= 0
  • narray >= 0
  • nokogiri >= 0
  • pdf-reader >= 0
  • puma >= 0
  • rack >= 0
  • rack-test >= 0
  • rake >= 13.0.0
  • ref_parsers >= 0
  • rspec >= 0
  • ruby-stemmer >= 0
  • rubyzip >= 1.3.0
  • rufus-scheduler >= 0
  • sequel >= 0
  • serrano >= 0
  • simple_xlsx_reader >= 0
  • sinatra >= 2.0.1
  • sqlite3 >= 0
  • tf-idf-similarity >= 0
  • thin >= 0
  • treetop >= 0
  • tzinfo-data >= 0
  • unicode >= 0
  • zip-zip >= 0
notebooks/buhos/Gemfile.lock rubygems
  • 127 dependencies