https://github.com/chenghaomou/awesome-data-deduplication

An awesome list of data deduplication use cases, papers, tools, and methods.

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.5%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: ChenghaoMou
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 6.84 KB
Statistics
  • Stars: 4
  • Watchers: 1
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Created almost 3 years ago · Last pushed about 2 years ago
Metadata Files
  • Readme
  • License

README.md

Awesome Data Deduplication

An awesome list of data deduplication use cases, papers, tools, and methods.

How to contribute

  1. Fork this repository;
  2. Install the dependencies with `pip install -r requirements.txt`, then run `pre-commit install`;
  3. Add your data to the corresponding folder by copying the `template.json` file;
  4. Run `pre-commit run --all-files` to format the data;
  5. Commit your changes and open a pull request to this repository.
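
Step 3 above can be scripted. The following is a minimal sketch only: the real layout of `template.json` is not shown here, so the keys below (which simply mirror the table columns), the stand-in folder name `text`, and the helper `add_entry` are all hypothetical. Check the actual template in the repository before relying on any of them.

```python
import json
import shutil
import tempfile
from pathlib import Path


def add_entry(folder: Path, name: str, fields: dict) -> Path:
    """Copy template.json to <name>.json inside `folder` and fill in `fields`."""
    target = folder / f"{name}.json"
    shutil.copy(folder / "template.json", target)
    entry = json.loads(target.read_text(encoding="utf-8"))
    # Refuse keys that the template does not define, to catch typos early.
    unknown = set(fields) - set(entry)
    if unknown:
        raise KeyError(f"fields not in template: {sorted(unknown)}")
    entry.update(fields)
    target.write_text(
        json.dumps(entry, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return target


# Demo in a throwaway directory with a stand-in template (hypothetical keys).
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "text"
    folder.mkdir()
    template = {
        "paper": "", "dataset": "", "final_data_size": "", "method": "",
        "hardware": "", "license": "", "comments": "",
    }
    (folder / "template.json").write_text(json.dumps(template), encoding="utf-8")
    path = add_entry(folder, "slimpajama", {
        "dataset": "SlimPajama",
        "final_data_size": "627B Tokens",
        "method": "MinHash + LSH",
        "license": "Apache 2.0",
    })
    print(path.read_text(encoding="utf-8"))
```

After creating the file, `pre-commit run --all-files` (step 4) is still what formats the data.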

Textual Data

| Paper | Dataset | Final Data Size | Method | Hardware | License | Comments |
|:------|:--------|:----------------|:-------|:---------|:--------|:---------|
| NA | RedPajama | 1.2T Tokens | SimHash (partial) | NA | Apache 2.0 | |
| NA | RedPajama | 1.2T Tokens | SimHash (partial) | NA | Apache 2.0 | |
| NA | SlimPajama | 627B Tokens | MinHash + LSH | NA | Apache 2.0 | |
| arXiv | Multiple Sources | 200B ~ 400B Tokens | MinHash | 200GB w/ 64 cores | Apache 2.0 | [^1] |
| arXiv | CulturaX | 6.3T Tokens | MinHashLSH (per language) | 600 AWS c5.24xlarge (96/192GB * 600) | | [^1] |
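
Several entries in the table list MinHash, optionally combined with locality-sensitive hashing (LSH), as the deduplication method. As a rough illustration of that technique, and not the code used by any of the listed datasets, here is a minimal pure-Python sketch. The word 3-gram shingling, the 128 hash functions, the 32 bands, and all function names are assumptions chosen for illustration.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a document."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(item, seed):
    """64-bit hash of `item`; the seed (used as a salt) selects the hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_perm=128):
    """For each seeded hash function, keep the minimum hash over all items."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=32):
    """Split signatures into bands; documents sharing any band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank today",
    "b": "the quick brown fox jumps over the lazy dog near the river bank yesterday",
    "c": "completely unrelated text about distributed hash tables and storage systems",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
for left, right in sorted(lsh_candidate_pairs(signatures)):
    score = estimated_jaccard(signatures[left], signatures[right])
    print(left, right, round(score, 2))
```

The banding step is what keeps this tractable at scale: only documents that collide in at least one band are compared, instead of every pair. A real pipeline would then confirm each candidate pair against a similarity threshold and keep one document per duplicate cluster.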

Image Data

Multi-modal Data

[^1]: This uses a variant of the Spark script from text-dedup 🎉.

Owner

  • Name: Chenghao Mou
  • Login: ChenghaoMou
  • Kind: user
  • Location: Ireland

NLP/AI

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Dependencies

requirements.txt pypi
  • pandas ==2.0.3
  • pre-commit ==2.21.0
  • tabulate ==0.9.0