https://github.com/chenghaomou/awesome-data-deduplication

An awesome list of data deduplication use cases, papers, tools, and methods.

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links (links to: arxiv.org)
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity (low similarity, 7.5%, to scientific vocabulary)

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: ChenghaoMou
License: MIT
Language: Python
Default Branch: main
Size: 6.84 KB

Statistics

Stars: 4
Watchers: 1
Forks: 1
Open Issues: 1
Releases: 0

Created almost 3 years ago · Last pushed about 2 years ago

Metadata Files

Readme License

Awesome Data Deduplication

An awesome list of data deduplication use cases, papers, tools, and methods.

How to contribute

1. Fork this repository.
2. Install the dependencies with `pip install -r requirements.txt`, then run `pre-commit install`.
3. Add your data to the corresponding folder by copying the `template.json` file.
4. Run `pre-commit run --all-files` to format the data.
5. Commit your changes and open a pull request to this repository.
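As a rough illustration of what a contributed entry might turn into, the sketch below renders a JSON record as a Markdown table row. The schema of `template.json` is not shown on this page, so the field names here are hypothetical and simply mirror the column headers of the "Textual Data" table; `entry_to_row` and `render_table` are illustrative helpers, not functions from this repository.

```python
import json

# Assumption: field names mirror the "Textual Data" table header.
# The real template.json may use different keys.
COLUMNS = [
    "Paper", "Dataset", "Final Data Size", "Method",
    "Hardware", "License", "Comments",
]


def entry_to_row(entry):
    """Render one entry (a dict) as a Markdown table row.

    Missing fields become empty cells; literal pipes are escaped so they
    cannot break the table layout.
    """
    cells = [str(entry.get(col, "")).replace("|", "\\|") for col in COLUMNS]
    return "| " + " | ".join(cells) + " |"


def render_table(entries):
    """Render a header, a left-aligned divider, and one row per entry."""
    header = "| " + " | ".join(COLUMNS) + " |"
    divider = "|" + "|".join(":---" for _ in COLUMNS) + "|"
    return "\n".join([header, divider] + [entry_to_row(e) for e in entries])


# Example record, using values taken from the SlimPajama row of the table.
raw = """
{
  "Paper": "NA",
  "Dataset": "SlimPajama",
  "Final Data Size": "627B Tokens",
  "Method": "MinHash + LSH",
  "Hardware": "NA",
  "License": "Apache 2.0",
  "Comments": ""
}
"""
entry = json.loads(raw)
print(render_table([entry]))
```

The repository lists `pandas` and `tabulate` as dependencies, so its own tooling may well build the table differently; this version uses only the standard library to stay self-contained.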

Textual Data

| Paper | Dataset | Final Data Size | Method | Hardware | License | Comments |
|:------|:--------|:----------------|:-------|:---------|:--------|:---------|
| NA | RedPajama | 1.2T Tokens | SimHash (partial) | NA | Apache 2.0 | |
| NA | SlimPajama | 627B Tokens | MinHash + LSH | NA | Apache 2.0 | |
| arXiv | Multiple Sources | 200B ~ 400B Tokens | MinHash | 200GB w/ 64 cores | Apache 2.0 | [^1] |
| arXiv | CulturaX | 6.3T Tokens | MinHashLSH (per language) | 600 AWS c5.24xlarge (96 vCPUs / 192GB each) | | [^1] |
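Several rows above name MinHash with LSH as the deduplication method. The following is a minimal, standard-library sketch of that general technique, not the pipeline used by any of the listed datasets: documents are split into word 3-gram shingles, each shingle set is reduced to a MinHash signature, and signatures are banded so that documents sharing any band land in the same bucket. The function names, the shingle size, `num_perm=64`, and `bands=16` are all illustrative assumptions.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations


def shingles(text, n=3):
    """Return the set of lowercase word n-grams of a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(token, seed):
    """64-bit hash of a token; each seed acts as one hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_perm=64):
    """Signature = minimum hash of the token set under each seed."""
    if not tokens:
        return [2**64 - 1] * num_perm
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=16):
    """Return pairs of document ids that agree on at least one full band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "The quick brown fox jumps over the lazy dog.",
    "b": "the quick brown fox jumps over the lazy dog",
    "c": "Deduplication removes repeated records from a large corpus",
}
signatures = {
    doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()
}
print(lsh_candidate_pairs(signatures))
```

Documents `a` and `b` differ only in case and punctuation, so they share every shingle and are reported as a candidate pair. A real pipeline would typically verify candidates against a similarity threshold and then cluster them before deciding which copies to drop.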

Image Data

Multi-modal Data

[^1]: This uses a variant of the Spark script from text-dedup 🎉

Owner

Name: Chenghao Mou
Login: ChenghaoMou
Kind: user
Location: Ireland

Website: https://sleeplessindebugging.blog/
Repositories: 32
Profile: https://github.com/ChenghaoMou

NLP/AI

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Dependencies

requirements.txt pypi

pandas ==2.0.3
pre-commit ==2.21.0
tabulate ==0.9.0
