text-minhash-priority

This repository implements the MinHash Near Deduplication with Priority algorithm.

https://github.com/zmzhang2000/text-minhash-priority

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.1%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: zmzhang2000
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 5.52 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 8 months ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

Text-MinHash-Priority


Installation

Only tested with Python 3.10 so far.

```bash
pip install git+https://github.com/zmzhang2000/text-minhash-priority
```

Features

This repository implements the MinHash Near Deduplication with Priority algorithm. It differs from the original MinHash near-deduplication algorithm in one way:

  • it supports selecting which item to keep from a set of duplicates according to a priority
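As a rough illustration of the idea (not the repository's actual implementation), priority-based selection within clusters of near-duplicates could be sketched in plain Python; the cluster structure is an assumption, and only the `__minhash_priority__` field name comes from the README:

```python
# Sketch: given clusters of near-duplicate records (assumed to have been
# found already by MinHash/LSH), keep the highest-priority record per
# cluster instead of an arbitrary one.

def keep_by_priority(clusters):
    """Return one record per cluster: the one with the highest priority."""
    kept = []
    for cluster in clusters:
        best = max(cluster, key=lambda rec: rec.get("__minhash_priority__", 0))
        kept.append(best)
    return kept

clusters = [
    [
        {"text": "What's your name?", "__minhash_priority__": 20},
        {"text": "what's your name",  "__minhash_priority__": 1},
    ],
    [
        {"text": "My name is John.", "__minhash_priority__": 5},
    ],
]
kept = keep_by_priority(clusters)
```

Without priorities, which duplicate survives depends on processing order; the priority field makes that choice explicit and deterministic.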

Usage

  1. Process your dataset into the Hugging Face datasets format. For example, a dataset in JSONL format:

```json
{"source": "ABC", "text": "What's your name?"}
{"source": "ABC", "text": "My name is John."}
```

  2. Add a __keep__ or __minhash_priority__ key to your dataset:

```json
{"source": "ABC", "text": "What's your name?", "__keep__": true}
{"source": "ABC", "text": "My name is John.", "__keep__": false}
```

or

```json
{"source": "ABC", "text": "What's your name?", "__minhash_priority__": 20}
{"source": "ABC", "text": "My name is John.", "__minhash_priority__": 1}
```

  3. Run the MinHash deduplication script. Use --column to specify the column to deduplicate on:

```bash
python -m text_dedup.minhash \
    --path "json" \
    --data_files "dataset.jsonl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "dataset_deduplicated" \
    --column "text" \
    --ngram 4 \
    --threshold 0.8 \
    --batch_size 10000 \
    --use_auth_token true
```

  4. The results are saved in the Hugging Face datasets format and can be loaded with datasets.load_from_disk().
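Steps 1 and 2 above can be sketched with the standard library alone; the filename matches the command in step 3, but the priority rule here (longer text wins) is a made-up example — any application-specific rule works:

```python
import json

records = [
    {"source": "ABC", "text": "What's your name?"},
    {"source": "ABC", "text": "My name is John."},
]

# Step 2: attach a priority to each record. The rule below is purely
# illustrative; in practice it might encode source quality or recency.
for rec in records:
    rec["__minhash_priority__"] = len(rec["text"])

# Step 1: one JSON object per line, as expected by --path "json".
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

The same priority assignment could also be applied to an already-loaded Hugging Face dataset via Dataset.map().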

Acknowledgements

This repository builds on ChenghaoMou/text-dedup. More details can be found in the original repository.

Owner

  • Name: Zongmeng Zhang
  • Login: zmzhang2000
  • Kind: user
  • Company: University of Science and Technology of China

I am currently pursuing a master's degree at the University of Science and Technology of China (USTC).

Citation (CITATION.bib)

@software{chenghao_mou_2023_8364980,
  author       = {Chenghao Mou and
                  Chris Ha and
                  Kenneth Enevoldsen and
                  Peiyuan Liu},
  title        = {ChenghaoMou/text-dedup: Reference Snapshot},
  month        = sep,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {2023.09.20},
  doi          = {10.5281/zenodo.8364980},
  url          = {https://doi.org/10.5281/zenodo.8364980}
}

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1

Dependencies

.github/workflows/bot.yml actions
  • actions/stale v9.0.0 composite
.github/workflows/coverage.yaml actions
  • actions/checkout v2 composite
  • codacy/codacy-coverage-reporter-action v1 composite
.github/workflows/docs.yaml actions
  • actions/checkout v3 composite
  • actions/deploy-pages v1 composite
  • actions/setup-python v3 composite
  • actions/upload-artifact v3 composite
Dockerfile docker
  • python 3.10-slim build
poetry.lock pypi
  • 123 dependencies
pyproject.toml pypi
  • coverage ^7.4.3 develop
  • insegel ^1.3.1 develop
  • pre-commit ^3.6.2 develop
  • pytest ^8.0.2 develop
  • ruff ^0.3.2 develop
  • scikit-learn ^1.4.1.post1 develop
  • tabulate ^0.9.0 develop
  • bitarray >=2.6.2
  • click ^8.1.7
  • click-option-group ^0.5.6
  • datasets >=2.17.0
  • fire ^0.6.0
  • ftfy >=6.1.1
  • matplotlib >=3.10.3
  • numpy >=1.26.4
  • pandarallel ^1.6.5
  • psutil >=5.9.8
  • pybloom-live >=4.0.0
  • pyspark >=3.3.1
  • python ^3.10
  • regex >=2023.5.5
  • rich ^13.7.1
  • scipy >=1.10.1
  • setuptools >=69.1.0
  • sphinxcontrib-bibtex >=2.5.0
  • tensorflow ^2.16.1
  • tqdm >=4.64.1
  • unisim ^0.0.1
  • urllib3 <=2.0
  • xxhash >=3.0.0
  • zstandard >=0.21.0