text-minhash-priority
This repository implements the MinHash Near Deduplication with Priority algorithm.
Repository
Basic Info
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Text-MinHash-Priority
Installation
Only tested with Python 3.10 so far.
```bash
pip install git+https://github.com/zmzhang2000/text-minhash-priority
```
Features
This repository implements the MinHash Near Deduplication with Priority algorithm. It differs from the original MinHash Near Deduplication algorithm in one respect:
- It supports selecting which item to keep among duplicated items according to a priority.
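The idea of keeping one item per duplicate group according to a priority can be illustrated with a minimal sketch. This is not the repository's implementation: it swaps MinHash and LSH for exact Jaccard similarity over word n-grams (only practical for tiny inputs), and it assumes that a larger `__minhash_priority__` value wins and that a missing key counts as 0. The function names `shingles`, `jaccard`, and `dedup_with_priority` are hypothetical.

```python
from itertools import combinations


def shingles(text, n=2):
    """Return the set of word n-grams in text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets (MinHash only estimates this)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dedup_with_priority(records, column="text", threshold=0.8, n=2):
    """Cluster near-duplicates, then keep one record per cluster.

    Assumption: the record with the largest __minhash_priority__ is kept;
    ties go to the record that appears first.
    """
    parent = list(range(len(records)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    sets = [shingles(r[column], n) for r in records]
    for i, j in combinations(range(len(records)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(j)] = find(i)  # merge the two clusters

    best = {}  # cluster root -> index of the record to keep
    for idx, rec in enumerate(records):
        root = find(idx)
        prio = rec.get("__minhash_priority__", 0)
        if root not in best or prio > records[best[root]].get("__minhash_priority__", 0):
            best[root] = idx
    return [records[i] for i in sorted(best.values())]
```

Given two records with identical text, the one with the higher priority survives under these assumptions; records that are not similar to anything else are always kept.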
Usage
1. Process your dataset into `huggingface dataset` format. The example below uses a sample dataset in `jsonl` format.

   ```json
   {"source": "ABC", "text": "What's your name?"}
   {"source": "ABC", "text": "My name is John."}
   ```

2. Add a `__keep__` or `__minhash_priority__` key to your dataset.

   ```json
   {"source": "ABC", "text": "What's your name?", "__keep__": true}
   {"source": "ABC", "text": "My name is John.", "__keep__": false}
   ```

   or

   ```json
   {"source": "ABC", "text": "What's your name?", "__minhash_priority__": 20}
   {"source": "ABC", "text": "My name is John.", "__minhash_priority__": 1}
   ```

3. Run the minhash deduplication script. Use `--column` to specify the column to deduplicate.

   ```bash
   python -m text_dedup.minhash \
     --path "json" \
     --data_files "dataset.jsonl" \
     --split "train" \
     --cache_dir "./cache" \
     --output "dataset_deduplicated" \
     --column "text" \
     --ngram 4 \
     --threshold 0.8 \
     --batch_size 10000 \
     --use_auth_token true
   ```

   The results will be saved in `huggingface dataset` format. You can load the results with `datasets.load_from_disk()`.
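For the step that adds a `__minhash_priority__` key, a small helper like the one below could annotate an existing `jsonl` file before running the script. It is only a sketch using the standard library: the function name `add_priority`, the per-source mapping, and the default of 0 are assumptions made for illustration, not part of this repository.

```python
import json


def add_priority(in_path, out_path, priority_by_source, default=0):
    """Copy a JSONL file, adding a __minhash_priority__ key to every record.

    priority_by_source is a hypothetical mapping from the "source" field
    to an integer priority; unknown sources receive `default`.
    """
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            record["__minhash_priority__"] = priority_by_source.get(
                record.get("source"), default
            )
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For example, `add_priority("raw.jsonl", "dataset.jsonl", {"ABC": 20})` would write a file that can be passed to `--data_files`; the file names here are placeholders.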
Acknowledgements
This repository is developed based on ChenghaoMou/text-dedup. More details can be found in the original repository.
Owner
- Name: Zongmeng Zhang
- Login: zmzhang2000
- Kind: user
- Company: University of Science and Technology of China
- Website: https://zmzhang2000.github.io/
- Repositories: 1
- Profile: https://github.com/zmzhang2000
I am currently pursuing a master's degree at the University of Science and Technology of China (USTC).
Citation (CITATION.bib)
@software{chenghao_mou_2023_8364980,
author = {Chenghao Mou and
Chris Ha and
Kenneth Enevoldsen and
Peiyuan Liu},
title = {ChenghaoMou/text-dedup: Reference Snapshot},
month = sep,
year = 2023,
publisher = {Zenodo},
version = {2023.09.20},
doi = {10.5281/zenodo.8364980},
url = {https://doi.org/10.5281/zenodo.8364980}
}
GitHub Events
Total
- Push event: 1
Last Year
- Push event: 1
Dependencies
- actions/stale v9.0.0 composite
- actions/checkout v2 composite
- codacy/codacy-coverage-reporter-action v1 composite
- actions/checkout v3 composite
- actions/deploy-pages v1 composite
- actions/setup-python v3 composite
- actions/upload-artifact v3 composite
- python 3.10-slim build
- 123 dependencies
- coverage ^7.4.3 develop
- insegel ^1.3.1 develop
- pre-commit ^3.6.2 develop
- pytest ^8.0.2 develop
- ruff ^0.3.2 develop
- scikit-learn ^1.4.1.post1 develop
- tabulate ^0.9.0 develop
- bitarray >=2.6.2
- click ^8.1.7
- click-option-group ^0.5.6
- datasets >=2.17.0
- fire ^0.6.0
- ftfy >=6.1.1
- matplotlib >=3.10.3
- numpy >=1.26.4
- pandarallel ^1.6.5
- psutil >=5.9.8
- pybloom-live >=4.0.0
- pyspark >=3.3.1
- python ^3.10
- regex >=2023.5.5
- rich ^13.7.1
- scipy >=1.10.1
- setuptools >=69.1.0
- sphinxcontrib-bibtex >=2.5.0
- tensorflow ^2.16.1
- tqdm >=4.64.1
- unisim ^0.0.1
- urllib3 <=2.0
- xxhash >=3.0.0
- zstandard >=0.21.0