https://github.com/ai4bharat/text-dedup

All-in-one text de-duplication

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links (links to: arxiv.org)
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (14.6%) to scientific vocabulary

Last synced: 9 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: AI4Bharat
License: apache-2.0
Language: Jupyter Notebook
Default Branch: main
Homepage:
Size: 4.46 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Fork of ChenghaoMou/text-dedup

Created almost 3 years ago · Last pushed over 2 years ago

Metadata Files

Readme License

Documentation

Github Pages

Features

This repository contains a collection of text deduplication scripts that are ready to use, or modify based on your needs:

MinHash + MinHashLSH, including a spark implementation suitable for large (TB) datasets
64 or 128 bit SimHash
SuffixArray Substring
Bloom Filter
Exact Hash (document-level, line-level/ccnet)

I also have big plans for the future:

[ ] Memory benchmark for streaming processing
[ ] Inter-dataset deduplication
[ ] Rewrite suffix array in Python
[ ] A collection of other deduplication methods: SuperMinHash, ProbMinHash, TreeMinHash, BagMinHash, Optimal Densification for Fast and Accurate Minwise Hashing, Fast Similarity Sketching

However, I do not intend to build a general-purpose deduplication library, which was the goal of this repo early on. I will gradually retire the PyPI package as well. The reason is that each use case can be wildly different and requires careful design and consideration. I sincerely encourage you to read the scripts first (they are relatively short) so you understand what is at stake when using them. You can use them to bootstrap your own script, or just as a reference.

Acknowledgements

This repository is inspired by the following projects, and is heavily influenced by lessons learned from my own participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedback is welcome!

Datasketch (MIT)
simhash-py and simhash-cpp (MIT)
Deduplicating Training Data Makes Language Models Better (Apache 2.0)
Gaoya (MIT)

Quick Examples

PySpark with DataProc

Few people have the compute resources, or the need, to deduplicate TB-scale datasets. If you do, this is an example of how to run the Spark script on GCP DataProc.

MODIFY text_dedup/minhash_spark.py FOR YOUR OWN PROJECT AND DATASET FIRST!

```bash
export CLUSTERNAME=chenghao-temp
export PROJECTID=xx
export REGION=us-central1
export ZONE=us-central1-a
export INPUTGCSPATH="gs://chenghao-temp-exp/data/ada"
export OUTPUTGCSPATH="gs://chenghao-temp-exp/output/ada"

gcloud dataproc clusters create $CLUSTERNAME \
    --enable-component-gateway \
    --region $REGION \
    --zone $ZONE \
    --master-machine-type c2d-standard-16 \
    --master-boot-disk-size 500 \
    --num-workers 10 \
    --worker-machine-type c2d-standard-16 \
    --worker-boot-disk-size 500 \
    --image-version 2.0-debian10 \
    --project $PROJECTID

gcloud dataproc jobs submit pyspark --cluster ${CLUSTERNAME} \
    --region $REGION \
    --jars gs://spark-lib/bigquery/spark-3.3-bigquery-0.32.2.jar \
    --driver-log-levels root=FATAL,__main__=DEBUG \
    --properties="spark.executor.memory"="50g","spark.driver.memory"="8g","spark.executor.cores"="14" \
    minhash_spark.py -- --input $INPUTGCSPATH --output $OUTPUTGCSPATH
```

For reference, the script finished deduplicating 42 million rows in less than 40 minutes with the above settings (160 cores, 640GB memory in total), while the Python version would take around 10 hours on an 80-core machine with 1.8TB memory.

The following examples all deduplicate the same dataset: the gl subset of oscar-corpus/OSCAR-2201.

Suffix Array Substring Exact Deduplication

```bash
# input
python -m text_dedup.suffix_array \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/suffixarray/oscargldedup" \
    --column "text" \
    --google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets"

# output
INFO     Loading       : 2.75 seconds
INFO     Preprocessing : 4.78 seconds
INFO     SuffixArray   : 98.29 seconds
INFO     SelfSimilar   : 4.24 seconds
INFO     Restore       : 0.25 seconds
INFO     Deduplicate   : 6.23 seconds
INFO     Saving        : 8.91 seconds
INFO     Total         : 125.45 seconds
INFO     Before        : 180332342 bytes (88803)
INFO     After         : 97646271 bytes (40404)
```
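To make the SuffixArray and SelfSimilar steps in the log more concrete, here is a minimal pure-Python sketch of the underlying idea: sort all suffixes of a text, then compare neighbouring suffixes to find substrings that occur more than once. This is not the code the script runs (the script appears to rely on the external repository passed on the command line); the function names are hypothetical, and the naive sort is only practical for tiny inputs.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return suffix start offsets in lexicographic order (naive; small inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def duplicated_spans(text: str, min_len: int) -> list[tuple[int, int]]:
    """Return merged (start, end) spans covered by substrings of at least
    `min_len` characters that occur two or more times in `text`."""
    sa = build_suffix_array(text)
    spans = []
    # Suffixes sharing a long prefix end up adjacent after sorting.
    for prev, cur in zip(sa, sa[1:]):
        lcp = common_prefix_len(text[prev:], text[cur:])
        if lcp >= min_len:
            spans.append((prev, prev + lcp))
            spans.append((cur, cur + lcp))
    # Merge overlapping or touching spans into maximal ranges.
    spans.sort()
    merged: list[tuple[int, int]] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # "abc" occurs at offsets 0 and 4.
    print(duplicated_spans("abcXabcY", min_len=3))
```

What to do with the reported spans (drop every occurrence, or keep the first one) is a policy decision that this sketch deliberately leaves open.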

MinHash Near Deduplication

```bash
# input
python -m text_dedup.minhash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/minhash/oscargldedup" \
    --column "text" \
    --batch_size 10000

# output
INFO     Loading               : 2.62 seconds
INFO     MinHashing            : 0.08 seconds
INFO     Clustering            : 2.20 seconds
INFO     Filtering             : 0.53 seconds
INFO     Saving                : 9.86 seconds
INFO     Total                 : 15.29 seconds
INFO     Data Number (before)  : 88803
INFO     Data Number (after)   : 44124 (49.69%)
INFO     Duplicate Number      : 44679 (50.31%)
INFO     🤗 Happy Deduplicating 🤗
```
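For readers new to MinHash, the following is a minimal sketch of the signature step only, assuming word 3-gram shingles and 128 salted BLAKE2b hashes. These choices, and all function names, are illustrative assumptions rather than the script's actual settings, and the LSH/clustering step reported in the log is not covered.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Split text into a set of lowercase word n-grams."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(item: str, seed: int) -> int:
    """64-bit hash of `item`, salted with `seed` to emulate one permutation."""
    salt = seed.to_bytes(4, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """For each salted hash function, keep the minimum hash over all items."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_perm)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = shingles("the quick brown fox jumps over the lazy dog near the old river bank today")
    b = shingles("the quick brown fox jumps over the lazy dog near the old river bank tomorrow")
    print("exact    :", len(a & b) / len(a | b))
    print("estimate :", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The estimate converges on the exact Jaccard similarity as `num_perm` grows, which is why comparing fixed-size signatures can stand in for comparing full shingle sets.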

SimHash Near Deduplication

```bash
# input
python -m text_dedup.simhash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/simhash/oscargldedup" \
    --column "text" \
    --batch_size 10000

# output
INFO     Loading               : 2.60 seconds
INFO     SimHashing            : 0.04 seconds
INFO     Indexing              : 28.88 seconds
INFO     Filtering             : 0.88 seconds
INFO     Saving                : 10.41 seconds
INFO     Total                 : 42.80 seconds
INFO     Data Number (before)  : 88803
INFO     Data Number (after)   : 46163 (51.98%)
INFO     Duplicate Number      : 42640 (48.02%)
INFO     🤗 Happy Deduplicating 🤗
```
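As a rough illustration of what a SimHash fingerprint is, the sketch below hashes each whitespace token, lets every token vote on each bit, and compares fingerprints by Hamming distance. The tokenization, the use of BLAKE2b, and the function names are assumptions made for this example; the script's own features and indexing strategy may differ.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a `bits`-wide SimHash fingerprint (64 or 128) over whitespace tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        # Each token votes +1 for every set bit and -1 for every unset bit.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "text deduplication removes repeated documents from a large corpus"
    doc_b = "text deduplication removes repeated documents from a huge corpus"
    print(hamming_distance(simhash(doc_a), simhash(doc_b)))
```

Two documents are then treated as near-duplicates when their distance falls below a chosen threshold; that threshold is a tuning decision that this sketch does not make.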

Exact Hash Exact Deduplication

```bash
# input
python -m text_dedup.exact_hash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/exacthash/oscargldedup" \
    --column "text" \
    --batch_size 1000

# output
INFO     Loading     : 2.95s
INFO     Processing  : 3.79s
INFO     Filtering   : 0.10s
INFO     Saving      : 2.89s
INFO     Total       : 9.72s
INFO     Before      : 88803
INFO     After       : 47049
```
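The two exact-hash variants named in the feature list (document-level and line-level) can be sketched as follows. The choice of SHA-256 and whitespace stripping before hashing are assumptions for illustration, as are the function names; the script may normalize and hash differently.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash the text after stripping surrounding whitespace (assumed normalization)."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def dedup_documents(docs: list[str]) -> list[str]:
    """Document-level: keep only the first document for each distinct hash."""
    seen: set[str] = set()
    kept = []
    for doc in docs:
        digest = content_hash(doc)
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def dedup_lines(docs: list[str]) -> list[str]:
    """Line-level: drop any line already seen in this or an earlier document."""
    seen: set[str] = set()
    result = []
    for doc in docs:
        kept_lines = []
        for line in doc.splitlines():
            digest = content_hash(line)
            if digest not in seen:
                seen.add(digest)
                kept_lines.append(line)
        result.append("\n".join(kept_lines))
    return result


if __name__ == "__main__":
    print(dedup_documents(["a", "b", "a ", "c"]))
    print(dedup_lines(["header\nfirst body", "header\nsecond body"]))
```

Storing digests instead of full texts keeps memory bounded by the number of distinct items rather than by their total size.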

Bloom Filter Exact Deduplication

```bash
# input
python -m text_dedup.bloom_filter \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/bloomfilter/oscargldedup" \
    --error_rate 1e-5 \
    --column "text" \
    --batch_size 1000

# output
INFO     Loading     : 2.72s
INFO     Processing  : 4.84s
INFO     Filtering   : 0.10s
INFO     Saving      : 2.88s
INFO     Total       : 10.54s
INFO     Before      : 88803
INFO     After       : 47045
```
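The Bloom filter result (47045) is slightly lower than the exact hash result (47049), which is consistent with a small number of false positives at the configured error rate. The sketch below shows how such a filter works and how an error rate like 1e-5 translates into a bit-array size and hash count, using the standard sizing formulas. It is a from-scratch illustration with hypothetical names, not the library the script depends on.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter sized for `capacity` items at a target `error_rate`."""

    def __init__(self, capacity: int, error_rate: float = 1e-5) -> None:
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
        self.num_bits = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str) -> list[int]:
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> bool:
        """Insert `item`; return True if it was (probably) already present."""
        present = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                present = False
                self.bits[byte] |= 1 << bit
        return present


if __name__ == "__main__":
    bloom = BloomFilter(capacity=1000, error_rate=1e-5)
    docs = ["a", "b", "a", "c", "b"]
    print([doc for doc in docs if not bloom.add(doc)])
```

A filter can report an unseen document as already present (a false positive) but never the reverse, so this approach may drop a few unique documents and will not let a true duplicate through.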

Benchmarks

A benchmark of the different methods can be found in benchmarks/wiki40.ipynb. A notebook evaluating MinHash on pinecone/core-2020-05-10-deduplication can be found in benchmarks/pinecone.ipynb.

For quick reference, here are the results:

| Method | Precision | Recall | F1 | Time |
| --- | --- | --- | --- | --- |
| MinHash | 0.9464 | 0.9446 | 0.9455 | 24s |
| SimHash* | 0.9011 | 0.6959 | 0.7853 | 210s |
| SimHash (Gyawali et al., LREC 2020) | 0.697 | 0.247 | 0.3647 | - |
| Exact Title (my implementation) | 0.8302 | 0.5521 | 0.6632 | - |
| Exact Title (Gyawali et al., LREC 2020) | 0.830 | 0.50 | 0.624 | - |

*Best SimHash result from benchmarks/hyperparameter.ipynb.
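The F1 column is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes it from the precision and recall values reported above; the function name is illustrative and the notebooks may compute the metric differently.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the results table.
results = {
    "MinHash": (0.9464, 0.9446),
    "SimHash*": (0.9011, 0.6959),
    "SimHash (Gyawali et al., LREC 2020)": (0.697, 0.247),
    "Exact Title (my implementation)": (0.8302, 0.5521),
    "Exact Title (Gyawali et al., LREC 2020)": (0.830, 0.50),
}

if __name__ == "__main__":
    for method, (precision, recall) in results.items():
        print(f"{method}: {f1_score(precision, recall):.4f}")
```

The recomputed values agree with the reported F1 column up to rounding.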

License

Apache 2.0

Owner

Name: AI4Bhārat
Login: AI4Bharat
Kind: organization
Email: opensource@ai4bharat.org
Location: India

Website: https://ai4bharat.org
Twitter: AI4Bharat
Repositories: 37
Profile: https://github.com/AI4Bharat

Artificial-Intelligence-For-Bhārat : Building open-source AI solutions for India!

Dependencies

.github/workflows/coverage.yaml actions

actions/checkout v2 composite
codacy/codacy-coverage-reporter-action v1 composite

.github/workflows/docs.yaml actions

actions/checkout v3 composite
actions/deploy-pages v1 composite
actions/setup-python v3 composite
actions/upload-artifact v3 composite

poetry.lock pypi

aiohttp 3.8.5
aiosignal 1.3.1
alabaster 0.7.13
async-timeout 4.0.3
attrs 23.1.0
babel 2.12.1
beautifulsoup4 4.12.2
bitarray 2.8.1
certifi 2023.7.22
cffi 1.15.1
cfgv 3.4.0
charset-normalizer 3.2.0
colorama 0.4.6
commonmark 0.9.1
contourpy 1.1.0
coverage 6.5.0
cycler 0.11.0
datasets 2.14.4
dill 0.3.7
distlib 0.3.7
docutils 0.17.1
exceptiongroup 1.1.3
filelock 3.12.2
fonttools 4.42.1
frozenlist 1.4.0
fsspec 2023.6.0
ftfy 6.1.1
furo 2022.12.7
huggingface-hub 0.16.4
identify 2.5.27
idna 3.4
imagesize 1.4.1
iniconfig 2.0.0
insegel 1.3.1
isort 5.12.0
jinja2 3.1.2
kiwisolver 1.4.5
latexcodec 2.0.1
livereload 2.6.3
markupsafe 2.1.3
matplotlib 3.7.2
multidict 6.0.4
multiprocess 0.70.15
nodeenv 1.8.0
numpy 1.25.2
packaging 23.1
pandas 2.0.3
pillow 10.0.0
platformdirs 3.10.0
pluggy 1.2.0
pre-commit 2.21.0
py4j 0.10.9.7
pyarrow 13.0.0
pybloom-live 4.0.0
pybtex 0.24.0
pybtex-docutils 1.0.3
pycparser 2.21
pygments 2.16.1
pyparsing 3.0.9
pyspark 3.4.1
pytest 7.4.0
python-dateutil 2.8.2
pytz 2023.3
pyyaml 6.0.1
regex 2023.8.8
requests 2.31.0
rich 12.6.0
ruff 0.0.265
scipy 1.10.1
seaborn 0.12.2
setuptools 68.1.2
six 1.16.0
snowballstemmer 2.2.0
soupsieve 2.4.1
sphinx 5.3.0
sphinx-autobuild 2021.3.14
sphinx-basic-ng 1.0.0b2
sphinxcontrib-applehelp 1.0.7
sphinxcontrib-bibtex 2.6.0
sphinxcontrib-devhelp 1.0.5
sphinxcontrib-htmlhelp 2.0.4
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.6
sphinxcontrib-serializinghtml 1.1.9
tomli 2.0.1
tornado 6.3.3
tqdm 4.66.1
typing-extensions 4.7.1
tzdata 2023.3
urllib3 1.26.16
virtualenv 20.24.3
wcwidth 0.2.6
xxhash 3.3.0
yarl 1.9.2
zstandard 0.21.0

pyproject.toml pypi

bitarray ^2.6.2
datasets ^2.4.0
ftfy ^6.1.1
numpy ^1.23.2
pybloom-live ^4.0.0
pyspark ^3.3.1
python ^3.10,<3.12
regex ^2023.5.5
rich ^12.5.1
scipy 1.10.1
sphinxcontrib-bibtex ^2.5.0
tqdm ^4.64.1
urllib3 <=2.0
xxhash ^3.0.0
zstandard ^0.21.0
