dothash

Estimating Set Similarity Metrics for Link Prediction and Document Deduplication

https://github.com/mikeheddes/dothash

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.6%) to scientific vocabulary

Repository

Statistics
  • Stars: 8
  • Watchers: 1
  • Forks: 1
  • Open Issues: 2
  • Releases: 0
Created almost 3 years ago · Last pushed almost 3 years ago
Metadata Files: Readme, License, Citation

README.md

DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication

This repository contains the source code for the research paper published at the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2023.
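For orientation, the general idea of estimating set similarity from dot products can be sketched as follows. This is a generic random-vector sketch written for illustration, not code taken from the repository: the function names (`element_vector`, `sketch`, `estimate_intersection`, `estimate_jaccard`), the ±1 vectors, and the default dimension are all assumptions. Each element is mapped to a pseudo-random ±1 vector, a set is summarized by the sum of its element vectors, and the dot product of two summaries divided by the dimension approximates the intersection size.

```python
import random


def element_vector(element, dim, seed=0):
    """Pseudo-random +/-1 vector that depends only on (seed, element)."""
    rng = random.Random(f"{seed}:{element!r}")
    return [1 if rng.random() < 0.5 else -1 for _ in range(dim)]


def sketch(items, dim=2048, seed=0):
    """Summarize a set as the element-wise sum of its element vectors."""
    total = [0] * dim
    for item in items:
        for i, value in enumerate(element_vector(item, dim, seed)):
            total[i] += value
    return total


def estimate_intersection(sketch_a, sketch_b):
    """Dot product divided by the dimension approximates |A & B|.

    Shared elements contribute exactly 1 each after the division;
    distinct elements contribute zero-mean noise that shrinks as dim grows.
    """
    dim = len(sketch_a)
    return sum(x * y for x, y in zip(sketch_a, sketch_b)) / dim


def estimate_jaccard(sketch_a, sketch_b, size_a, size_b):
    """Jaccard estimate from the estimated intersection and the known set sizes."""
    inter = estimate_intersection(sketch_a, sketch_b)
    union = size_a + size_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    a = set(range(0, 60))
    b = set(range(30, 90))  # true intersection size is 30
    sa, sb = sketch(a), sketch(b)
    print("estimated |A & B|:", estimate_intersection(sa, sb))
    print("estimated Jaccard:", estimate_jaccard(sa, sb, len(a), len(b)))
```

The estimate is noisy, and a larger dimension trades memory and time for lower variance. The repository's scripts should be treated as the reference for how the paper's experiments were actually run.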

Requirements

The code is written in Python 3.10. The packages required to run the experiments are listed in requirements.txt. To install them, run the following command:

    pip install -r requirements.txt

Experiments

The experiments are divided into two parts: (1) link prediction and (2) document deduplication. Each part has its own script; run it with --help to list the available options:

Link Prediction

    python link_prediction.py --help
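To make the connection between set similarity and link prediction concrete, here is a minimal sketch that scores unconnected node pairs by the exact Jaccard similarity of their neighbor sets. It is an illustration only and does not reflect what link_prediction.py does internally; the function names and the toy edge list are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_candidate_links(edges):
    """Score every non-adjacent node pair by the Jaccard similarity of its neighbor sets."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    scores = []
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # already connected, nothing to predict
        scores.append(((u, v), jaccard(neighbors[u], neighbors[v])))
    # Highest-scoring pairs are the most likely missing links.
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
    for pair, score in rank_candidate_links(toy_edges):
        print(pair, round(score, 3))
```

Computing exact similarities for all pairs scales poorly on large graphs, which is the setting where an estimator of set similarity becomes attractive.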

Document Deduplication

Experiments with the core dataset require data to be downloaded from Google Drive and placed in the data directory.

    python document_deduplication.py --help
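As a rough picture of how set similarity applies to deduplication, the sketch below represents each document as a set of word shingles and reports pairs whose exact Jaccard similarity meets a threshold. It is not taken from document_deduplication.py; the function names, the shingle length `k=3`, and the threshold of 0.5 are assumptions chosen for the example.

```python
def shingles(text, k=3):
    """Set of k-word shingles (as tuples) from lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def near_duplicates(docs, threshold=0.5, k=3):
    """Return name pairs whose shingle sets have Jaccard similarity >= threshold."""
    sets = {name: shingles(text, k) for name, text in docs.items()}
    names = sorted(sets)
    pairs = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if jaccard(sets[x], sets[y]) >= threshold:
                pairs.append((x, y))
    return pairs


if __name__ == "__main__":
    docs = {
        "d1": "the quick brown fox jumps over the lazy dog",
        "d2": "the quick brown fox jumps over the lazy cat",
        "d3": "an entirely different sentence about set similarity",
    }
    print(near_duplicates(docs))
```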

Citation

If you use this code for your research, please cite our paper:

@inproceedings{dothash,
  title={DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication},
  author={Nunes, Igor and Heddes, Mike and Vergés, Pere and Abraham, Danny and Veidenbaum, Alex and Nicolau, Alexandru and Givargis, Tony},
  booktitle={Proceedings of the 29th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  year={2023}
}

Owner

  • Name: Mike Heddes
  • Login: mikeheddes
  • Kind: user
  • Location: Irvine, California

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this code for your research, please cite our paper."
authors:
- family-names: "Nunes"
  given-names: "Igor"
- family-names: "Heddes"
  given-names: "Mike"
  orcid: "https://orcid.org/0000-0002-9276-458X"
- family-names: "Vergés"
  given-names: "Pere"
- family-names: "Abraham"
  given-names: "Danny"
- family-names: "Veidenbaum"
  given-names: "Alex"
- family-names: "Nicolau"
  given-names: "Alexandru"
- family-names: "Givargis"
  given-names: "Tony"
title: "DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication"
url: "https://github.com/mikeheddes/dothash"
preferred-citation:
  type: conference-paper
  authors:
  - family-names: "Nunes"
    given-names: "Igor"
  - family-names: "Heddes"
    given-names: "Mike"
    orcid: "https://orcid.org/0000-0002-9276-458X"
  - family-names: "Vergés"
    given-names: "Pere"
  - family-names: "Abraham"
    given-names: "Danny"
  - family-names: "Veidenbaum"
    given-names: "Alex"
  - family-names: "Nicolau"
    given-names: "Alexandru"
  - family-names: "Givargis"
    given-names: "Tony"
  title: "DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication"
  collection-title: "Proceedings of the 29th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining"
  collection-type: proceedings
  year: 2023

GitHub Events

Total
  • Watch event: 2
  • Fork event: 1
Last Year
  • Watch event: 2
  • Fork event: 1

Committers

All Time
  • Total Commits: 4
  • Total Committers: 1
  • Avg Commits per committer: 4.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name          Email           Commits
Mike Heddes   m****s@g****m   4

Issues and Pull Requests

All Time
  • Total issues: 2
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 0
  • Average comments per issue: 3.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 2.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • jianshu93 (1)
  • ksrinivs64 (1)