https://github.com/chenghaomou/awesome-data-deduplication
An awesome list of data deduplication use cases, papers, tools, and methods.
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (7.5%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: ChenghaoMou
- License: MIT
- Language: Python
- Default Branch: main
- Size: 6.84 KB
Statistics
- Stars: 4
- Watchers: 1
- Forks: 1
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
Awesome Data Deduplication
An awesome list of data deduplication use cases, papers, tools, and methods.
How to contribute
- Fork this repository;
- Install the dependencies with `pip install -r requirements.txt` and `pre-commit install`;
- Add your data to the corresponding folder by copying the `template.json` file;
- Run `pre-commit run --all-files` to format the data;
- Commit your changes and open a pull request to this repository.
Textual Data
| Paper | Dataset | Final Data Size | Method | Hardware | License | Comments |
|:------|:--------|:----------------|:-------|:---------|:--------|:---------|
| NA | RedPajama | 1.2T Tokens | SimHash (partial) | NA | Apache 2.0 | |
| NA | RedPajama | 1.2T Tokens | SimHash (partial) | NA | Apache 2.0 | |
| NA | SlimPajama | 627B Tokens | MinHash + LSH | NA | Apache 2.0 | |
| arXiv | Multiple Sources | 200B ~ 400B Tokens | MinHash | 200GB w/ 64 cores | Apache 2.0 | [^1] |
| arXiv | CulturaX | 6.3T Tokens | MinHashLSH (per language) | 600 AWS c5.24xlarge (96/192GB * 600) | | [^1] |
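For readers unfamiliar with the entries in the Method column, here is a minimal, standard-library-only sketch of MinHash with LSH banding for near-duplicate detection. It is an illustration of the general technique, not the code used by any dataset listed above. The function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `lsh_candidate_pairs`) and the parameters (`num_perm=64`, `bands=16`, word 3-grams) are assumptions chosen for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_perm=64, n=3):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    tokens = shingles(text, n)
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def lsh_candidate_pairs(signatures, bands=16):
    """Split each signature into bands; docs sharing any full band are candidates.

    `signatures` maps a document id to its MinHash signature.
    """
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog near the river bank",
        "b": "the quick brown fox jumps over the lazy dog near the river",
        "c": "data deduplication removes repeated records from large corpora",
    }
    sigs = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
    print(estimated_jaccard(sigs["a"], sigs["b"]))
    print(lsh_candidate_pairs(sigs))
```

Banding trades recall for speed: more bands with fewer rows each surface lower-similarity pairs as candidates, while fewer, wider bands keep only close matches. At the scales in the table, a distributed implementation would be needed rather than this in-memory sketch.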
Image Data
Multi-modal Data
[^1]: This uses a variant of the Spark script from text-dedup 🎉
Owner
- Name: Chenghao Mou
- Login: ChenghaoMou
- Kind: user
- Location: Ireland
- Website: https://sleeplessindebugging.blog/
- Repositories: 32
- Profile: https://github.com/ChenghaoMou
NLP/AI
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Dependencies
- pandas ==2.0.3
- pre-commit ==2.21.0
- tabulate ==0.9.0