Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 3 DOI reference(s) in README
- ✓ Academic publication links: links to arxiv.org, zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.9%) to scientific vocabulary
Keywords
Repository
All-in-one text de-duplication
Basic Info
Statistics
- Stars: 710
- Watchers: 3
- Forks: 74
- Open Issues: 2
- Releases: 1
Topics
Metadata Files
README.md

Installation
Only tested with Python 3.10 so far.
```bash
pip install text-dedup
```
or
```bash
pip install git+https://github.com/ChenghaoMou/text-dedup
```
Documentation
Features
This repository contains a collection of text deduplication scripts that are ready to use, or modify based on your needs:
- RETSim/UniSim, an embedding-based near-deduplication method (WIP)
- MinHash + MinHashLSH, including a Spark implementation suitable for large (TB) datasets
- 64 or 128 bit SimHash
- SuffixArray Substring
- Bloom Filter
- Exact Hash (document-level, line-level/ccnet)
I also have big plans for the future:
- [ ] Memory benchmark for streaming processing
- [ ] Inter-dataset deduplication
- [ ] Rewrite suffix array in Python
- [ ] A collection of other deduplication methods: SuperMinHash, ProbMinHash, TreeMinHash, BagMinHash, Optimal Densification for Fast and Accurate Minwise Hashing, Fast Similarity Sketching
However, I do not intend to build a general-purpose deduplication library, which was the goal of this repo early on. I will gradually retire the PyPI package as well. The reason is that each use case can be wildly different and requires careful design and consideration. I sincerely encourage you to read the scripts first (they are relatively short) so you understand what is at stake when using them. You can use them to bootstrap your own script, or just as a reference.
Acknowledgements
This repository is inspired by the following projects, and is heavily influenced by lessons learned from my own participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedback is welcome!
- Datasketch (MIT)
- simhash-py and simhash-cpp (MIT)
- Deduplicating Training Data Makes Language Models Better (Apache 2.0)
- Gaoya (MIT)
Quick Examples
Native PySpark
_MODIFY `text_dedup/minhash_spark.py` FOR YOUR OWN PROJECT AND DATASET FIRST!_

Assuming you have a downloaded dataset (in parquet files) under "./temp-data", you can process the files with your local compute by:

```bash
export PYSPARK_PYTHON="path to your python with scipy, xxhash, and numpy installed"
spark-submit --executor-memory 16g \
    --driver-memory 20g \
    --executor-cores 3 \
    --num-executors 2 \
    --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    text_dedup/minhash_spark.py \
    --input "./temp-data" \
    --output "./temp-output" \
    --column "text" \
    --threshold 0.7
```

```
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Using B=25, R=10
DEBUG __main__ - Loaded documents: 88803
DEBUG __main__ - args.input='./temp-data'
DEBUG __main__ - args.output='./temp-output'
DEBUG __main__ - args.threshold=0.7
DEBUG __main__ - args.ngram_size=5
DEBUG __main__ - args.min_length=5
DEBUG __main__ - args.num_perm=250
DEBUG __main__ - args.column='text'
DEBUG __main__ - id : bigint
DEBUG __main__ - text : string
DEBUG __main__ - meta : struct
```
UniSim (WIP)
Based on Google's RETSim model ([GitHub](https://github.com/google/unisim), [arXiv](https://arxiv.org/abs/2311.17264)), it is an embedding-based near-deduplication method. For a large dataset, it would require GPU(s) for fast inference.

```bash
python text_dedup/ann_unisim.py --path truthful_qa --name generation --split validation --output temp --column question
```

Output:

```
INFO Load Dataset : 5.56s
INFO Index Dataset : 8.13s
INFO Clustering : 8.72s
INFO Filtering : 0.35s
INFO Saving : 0.01s
INFO Cleaning : 0.00s
INFO Total : 22.77s
INFO Before : 817
INFO After : 788
```
Suffix Array Substring Exact Deduplication
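For intuition about what substring-level exact deduplication looks for, here is a minimal sketch. It is not the code the script runs (the command below delegates to the repository passed via `--google_repo_path`); it builds a naive suffix array and reports byte ranges covered by a substring that occurs more than once. The function names and the length threshold `k` are hypothetical, chosen for illustration.

```python
def suffix_array(data: bytes) -> list[int]:
    """Naive suffix array: sort suffix start offsets lexicographically.

    This is quadratic or worse and only meant for tiny inputs.
    """
    return sorted(range(len(data)), key=lambda i: data[i:])


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the longest common prefix of two byte strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def duplicate_spans(data: bytes, k: int) -> list[tuple[int, int]]:
    """Merged (start, end) ranges covered by a repeated substring of length >= k."""
    sa = suffix_array(data)
    spans = []
    # Suffixes sharing a long prefix end up next to each other in the suffix array.
    for prev, cur in zip(sa, sa[1:]):
        lcp = common_prefix_len(data[prev:], data[cur:])
        if lcp >= k:
            spans.append((prev, prev + lcp))
            spans.append((cur, cur + lcp))
    # Merge overlapping ranges so each duplicated region is reported once.
    spans.sort()
    merged: list[tuple[int, int]] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    print(duplicate_spans(b"abcXabcY", 3))
```

A production implementation would use a linear-time suffix array construction and would also have to decide which occurrence of each repeated span to keep; the sketch only shows how adjacency in the suffix array exposes repeats.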
```bash
# input
python -m text_dedup.suffix_array \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/suffix_array/oscar_gl_dedup" \
    --column "text" \
    --google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets" \
    --use_auth_token true

# output
INFO Loading : 2.75 seconds
INFO Preprocessing : 4.78 seconds
INFO SuffixArray : 98.29 seconds
INFO SelfSimilar : 4.24 seconds
INFO Restore : 0.25 seconds
INFO Deduplicate : 6.23 seconds
INFO Saving : 8.91 seconds
INFO Total : 125.45 seconds
INFO Before : 180332342 bytes (88803)
INFO After : 97646271 bytes (40404)
```
MinHash Near Deduplication
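As a rough illustration of MinHash with LSH banding, the sketch below uses the parameters reported in the Spark log above (`num_perm=250`, `B=25`, `R=10`, `ngram_size=5`). It is a simplified stand-in, not the script's implementation: whitespace tokenization, salted BLAKE2b as the hash family, and all function names are assumptions made for this example.

```python
import hashlib
from collections import defaultdict

# Values taken from the Spark log above; everything else here is illustrative.
NUM_PERM, BANDS, ROWS = 250, 25, 10


def ngrams(text: str, n: int = 5) -> set[str]:
    """Word n-grams (shingles) of a document, assuming whitespace tokenization."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash(text: str) -> list[int]:
    """One minimum per salted hash function, approximating random permutations."""
    grams = ngrams(text)
    signature = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def candidate_pairs(docs: list[str]) -> set[tuple[int, int]]:
    """Documents that agree on all rows of at least one band become candidates."""
    buckets = defaultdict(list)
    for idx, doc in enumerate(docs):
        sig = minhash(doc)
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].append(idx)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs


def collision_probability(s: float, bands: int = BANDS, rows: int = ROWS) -> float:
    """Probability that two documents with Jaccard similarity s share a band."""
    return 1.0 - (1.0 - s ** rows) ** bands


if __name__ == "__main__":
    docs = ["a b c d e f g h", "a b c d e f g h", "one two three four five six seven"]
    print(candidate_pairs(docs))
    print(round(collision_probability(0.7), 2))
```

With 25 bands of 10 rows, a pair at Jaccard similarity 0.7 becomes a candidate with probability of about 0.51, which is why that band/row split pairs naturally with a 0.7 threshold: the S-curve crosses one half near the threshold.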
```bash
# input
python -m text_dedup.minhash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/minhash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 10000 \
    --use_auth_token true

# output
INFO Loading : 2.62 seconds
INFO MinHashing : 0.08 seconds
INFO Clustering : 2.20 seconds
INFO Filtering : 0.53 seconds
INFO Saving : 9.86 seconds
INFO Total : 15.29 seconds
INFO Data Number (before) : 88803
INFO Data Number (after) : 44124 (49.69%)
INFO Duplicate Number : 44679 (50.31%)
INFO 🤗 Happy Deduplicating 🤗
```
SimHash Near Deduplication
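The idea behind SimHash can be sketched in a few lines: every n-gram votes on each bit of the fingerprint, and near-duplicates end up with fingerprints at a small Hamming distance. This is a hedged illustration rather than the script's code; the tokenization, the n-gram size of 3, the BLAKE2b hash, and the function names are assumptions.

```python
import hashlib


def simhash(text: str, bits: int = 64, n: int = 3) -> int:
    """Fingerprint a document by letting each word n-gram vote on every bit."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))]
    votes = [0] * bits
    for gram in grams:
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(bits):
            # A set bit votes +1, an unset bit votes -1.
            votes[bit] += 1 if (value >> bit) & 1 else -1
    # The fingerprint keeps the bits where positive votes won.
    return sum(1 << bit for bit in range(bits) if votes[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    a = simhash("the quick brown fox jumps over the lazy dog near the river bank")
    b = simhash("the quick brown fox jumps over the lazy dog near the river side")
    print(hamming(a, b))
```

Passing `bits=128` gives the 128-bit variant. Documents are then treated as near-duplicates when the Hamming distance falls under a chosen cutoff, and finding such pairs efficiently requires an index over the fingerprints rather than all-pairs comparison.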
```bash
# input
python -m text_dedup.simhash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/simhash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 10000 \
    --use_auth_token true

# output
INFO Loading : 2.60 seconds
INFO SimHashing : 0.04 seconds
INFO Indexing : 28.88 seconds
INFO Filtering : 0.88 seconds
INFO Saving : 10.41 seconds
INFO Total : 42.80 seconds
INFO Data Number (before) : 88803
INFO Data Number (after) : 46163 (51.98%)
INFO Duplicate Number : 42640 (48.02%)
INFO 🤗 Happy Deduplicating 🤗
```
Exact Hash Exact Deduplication
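Exact-hash deduplication is the simplest of the methods listed here, so a short sketch may help when adapting the script. The document-level variant keeps the first copy of each identical text; the line-level variant drops lines already seen anywhere in the corpus, in the spirit of ccnet. The hash choice (SHA-256), the line normalization (strip plus lowercase), and the function names are assumptions for illustration, not the script's actual settings.

```python
import hashlib


def dedup_documents(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each byte-identical document."""
    seen: set[bytes] = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def dedup_lines(docs: list[str]) -> list[str]:
    """Drop lines already seen in any earlier document (or earlier in the same one)."""
    seen: set[bytes] = set()
    result = []
    for doc in docs:
        kept_lines = []
        for line in doc.splitlines():
            # Assumed normalization: surrounding whitespace and case are ignored.
            key = hashlib.sha256(line.strip().lower().encode("utf-8")).digest()
            if key not in seen:
                seen.add(key)
                kept_lines.append(line)
        result.append("\n".join(kept_lines))
    return result


if __name__ == "__main__":
    print(dedup_documents(["Hello world", "hello world", "Hello world"]))
    print(dedup_lines(["Header\nfoo", "header\nbar"]))
```

Storing digests instead of full texts keeps memory proportional to the number of distinct items rather than their total size.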
```bash
# input
python -m text_dedup.exact_hash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/exact_hash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 1000 \
    --use_auth_token true

# output
INFO Loading : 2.95s
INFO Processing : 3.79s
INFO Filtering : 0.10s
INFO Saving : 2.89s
INFO Total : 9.72s
INFO Before : 88803
INFO After : 47049
```
Bloom Filter Exact Deduplication
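A Bloom filter trades a small false-positive rate for constant memory: it never misses a true duplicate, but it may occasionally flag a new document as already seen. The sketch below is a minimal, self-contained version under stated assumptions; the class name, the double-hashing scheme over SHA-256, and the `capacity` argument are hypothetical and not taken from the script, which exposes the false-positive rate through `--error_rate`.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size bit array sized for `capacity` items at a target error rate."""

    def __init__(self, capacity: int, error_rate: float = 1e-5):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str) -> list[int]:
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> bool:
        """Insert `item`; return True if it was (probably) already present."""
        present = True
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                present = False
                self.bits[byte] |= mask
        return present


if __name__ == "__main__":
    bloom = BloomFilter(capacity=1000, error_rate=1e-5)
    docs = ["a", "b", "a", "c", "b"]
    print([doc for doc in docs if not bloom.add(doc)])
```

Because false positives remove non-duplicates, a stricter error rate can leave slightly more documents than a looser one, and Bloom-filter counts may differ marginally from exact-hash counts on the same data.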
```bash
# input
python -m text_dedup.bloom_filter \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/bloom_filter/oscar_gl_dedup" \
    --error_rate 1e-5 \
    --column "text" \
    --use_auth_token true --batch_size 1000

# output
INFO Loading : 2.72s
INFO Processing : 4.84s
INFO Filtering : 0.10s
INFO Saving : 2.88s
INFO Total : 10.54s
INFO Before : 88803
INFO After : 47045
```
Benchmarks
> [!NOTE]
> Spark implementation has some overhead for small datasets, so I recommend using the script only when you have a large dataset and enough compute resources.
pinecone/core-2020-05-10-deduplication
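The columns in the table below follow the usual per-class definitions. As a hedged sketch, assuming each document carries a boolean "is duplicate" label and that macro F1 is the unweighted mean of the two per-class F1 scores, they could be computed as follows. The function names are hypothetical; `tests/benchmark_core.py` remains the authoritative reference.

```python
def precision_recall_f1(y_true: list[bool], y_pred: list[bool], positive: bool) -> tuple[float, float, float]:
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def report(y_true: list[bool], y_pred: list[bool]) -> dict[str, float]:
    """Per-class precision/recall plus macro F1 and accuracy."""
    p_dup, r_dup, f_dup = precision_recall_f1(y_true, y_pred, True)
    p_non, r_non, f_non = precision_recall_f1(y_true, y_pred, False)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {
        "precision_duplicates": p_dup,
        "recall_duplicates": r_dup,
        "precision_non_duplicates": p_non,
        "recall_non_duplicates": r_non,
        "macro_f1": (f_dup + f_non) / 2,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    truth = [True, True, False, False]
    predicted = [True, False, False, False]
    for name, value in report(truth, predicted).items():
        print(f"{name}: {value:.4f}")
```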
See `tests/benchmark_core.py` for reproduction.

| Algorithm                       | Precision (Duplicates) | Recall (Duplicates) | Precision (Non Duplicates) | Recall (Non Duplicates) | Macro F1 score |  Accuracy | Time     |
| :------------------------------ | ---------------------: | ------------------: | -------------------------: | ----------------------: | -------------: | --------: | :------- |
| UniSim                          |                 0.9307 |              0.8924 |                     0.9055 |                  0.9394 |         0.9181 |    0.9054 | 1305.79s |
| MinHash Spark                   |                  0.957 |              0.9445 |                     0.9471 |                   0.959 |          0.952 |    0.9202 | 691.77s  |
| MinHash                         |                 0.9594 |              0.9445 |                     0.9474 |                  0.9616 |     **0.9534** |     0.924 | 18.88s   |
| SimHash                         |                 0.9042 |               0.721 |                      0.792 |                  0.9329 |         0.8481 |    0.8321 | 644.36s  |
| Exact Title                     |                 0.8302 |              0.5521 |                     0.7098 |                  0.9065 |           0.77 |    0.7456 | -        |
| Exact Title Matching [^1]       |                  0.830 |                0.50 |                      0.709 |                   0.992 |          0.757 |     0.746 | -        |
| Simhash Matching [^1]           |                  0.697 |               0.247 |                      0.598 |                   0.985 |          0.631 |     0.616 | -        |
| Document Vector Similarity [^1] |                  0.912 |               0.779 |                      0.861 |                   0.986 |          0.885 |     0.883 | -        |
| Hybrid Method [^1]              |                  0.908 |               0.828 |                      0.899 |                   0.979 |          0.904 |     0.903 | -        |
| LaBSE[^2]                       |                  0.937 |               0.923 |                      0.930 |                   0.943 |          0.933 |     0.919 | -        |
| Multilingual USE[^2]            |                  0.917 |               0.907 |                      0.918 |                   0.927 |          0.917 |     0.909 | -        |
| Multilingual E5-Base[^2]        |                  0.931 |               0.908 |                      0.919 |                   0.939 |          0.924 |     0.920 | -        |
| MinHash + LSH[^2]               |                  0.929 |               0.902 |                      0.915 |                   0.938 |          0.921 |     0.918 | -        |
| RETSim Partial-Dup[^2]          |                  0.945 |               0.941 |                      0.945 |                   0.949 |          0.945 | **0.928** | -        |
| RETSim Near-Dup[^2]             |                  0.928 |               0.937 |                      0.942 |                   0.934 |          0.935 | **0.926** | -        |

NEWS-COPY
See `tests/benchmark_news.py` for reproduction.

Adjusted Rand Index (ARI) on NEWS-COPY dataset:

| Model/Algorithm          | ARI       |
| :----------------------- | :-------- |
| SimHash                  | 0.612     |
| MinHash (Spark)          | 0.740     |
| MinHash                  | 0.742     |
| RETSim Near-Dup + ANN\*  | _0.051_   |
| n-gram [^3]              | 0.440     |
| SimHash[^2]              | 0.695     |
| MinHash[^3]              | 0.737     |
| MinHash[^2]              | 0.783     |
| Multilingual USE[^2]     | 0.730     |
| Multilingual E5-Base[^2] | 0.742     |
| S-BERT[^3]               | 0.700     |
| RETSim Partial-Dup[^2]   | 0.831     |
| RETSim Near-Dup[^2]      | 0.704     |
| Re-ranking [^3]          | **0.937** |
| Bi-encoder [^3]          | 0.915     |

\*: I can't seem to reproduce the results from the paper.

[^1]: [Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings](https://aclanthology.org/2020.lrec-1.113)
[^2]: [RETSim: Resilient and Efficient Text Similarity](https://arxiv.org/abs/2311.17264)
[^3]: [Noise-Robust De-Duplication at Scale](https://www.semanticscholar.org/paper/Noise-Robust-De-Duplication-at-Scale-Silcock-D'Amico-Wong/7ca41cc5fc364b713aba5b573ae4ada801fd788a)

License
Citations
Generally, you can cite this repository as:
```bibtex
@software{chenghao_mou_2023_8364980,
  author    = {Chenghao Mou and
               Chris Ha and
               Kenneth Enevoldsen and
               Peiyuan Liu},
  title     = {ChenghaoMou/text-dedup: Reference Snapshot},
  month     = sep,
  year      = 2023,
  publisher = {Zenodo},
  version   = {2023.09.20},
  doi       = {10.5281/zenodo.8364980},
  url       = {https://doi.org/10.5281/zenodo.8364980}
}
```
The Spark version was born from BigCode (Apache 2.0) and BigScience (Apache 2.0), and you can cite the original paper if you want:
```bibtex
@article{
kocetkov2023the,
title={The Stack: 3 {TB} of permissively licensed source code},
author={Denis Kocetkov and Raymond Li and Loubna Ben allal and Jia LI and Chenghao Mou and Yacine Jernite and Margaret Mitchell and Carlos Mu{\~n}oz Ferrandis and Sean Hughes and Thomas Wolf and Dzmitry Bahdanau and Leandro Von Werra and Harm de Vries},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=pxpbTdUEpD},
note={}
}
```
GitHub Events
Total
- Issues event: 45
- Watch event: 104
- Issue comment event: 17
- Push event: 5
- Pull request event: 4
- Fork event: 5
- Create event: 3
Last Year
- Issues event: 45
- Watch event: 104
- Issue comment event: 17
- Push event: 5
- Pull request event: 4
- Fork event: 5
- Create event: 3
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 16
- Total pull requests: 4
- Average time to close issues: 9 days
- Average time to close pull requests: about 1 month
- Total issue authors: 4
- Total pull request authors: 2
- Average comments per issue: 0.56
- Average comments per pull request: 0.5
- Merged pull requests: 1
- Bot issues: 12
- Bot pull requests: 0
Past Year
- Issues: 16
- Pull requests: 4
- Average time to close issues: 9 days
- Average time to close pull requests: about 1 month
- Issue authors: 4
- Pull request authors: 2
- Average comments per issue: 0.56
- Average comments per pull request: 0.5
- Merged pull requests: 1
- Bot issues: 12
- Bot pull requests: 0
Top Authors
Issue Authors
- linear[bot] (17)
- Dodero10 (3)
- mmpouya (2)
- Jason3900 (2)
- bowspider-man (2)
- alielfilali01 (1)
- hancheolcho (1)
- MiladMolazadeh (1)
- maoxiangyi (1)
- 311dada (1)
- cjmp1 (1)
- wuodar (1)
- simplew2011 (1)
- XChen-Zero (1)
- siebeniris (1)
Pull Request Authors
- ChenghaoMou (4)
- louisowen6 (2)
- qxuanson (1)
- chris-ha458 (1)
- mohamedlekarim (1)
- dependabot[bot] (1)
- hahmad2008 (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads:
  - pypi: 449 last month
- Total dependent packages: 0
- Total dependent repositories: 2
- Total versions: 24
- Total maintainers: 1
pypi.org: text-dedup
- Documentation: https://text-dedup.readthedocs.io/
- License: Apache 2.0
- Latest release: 0.4.0 (published over 1 year ago)