text-minhash-priority

This repository implements the MinHash Near Deduplication with Priority algorithm.

https://github.com/zmzhang2000/text-minhash-priority

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.1%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: zmzhang2000
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 5.52 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 8 months ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

Text-MinHash-Priority


Installation

Only tested with Python 3.10 so far.

```bash
pip install git+https://github.com/zmzhang2000/text-minhash-priority
```

Features

This repository implements the MinHash Near Deduplication with Priority algorithm. It differs from the original MinHash near-deduplication algorithm in one way:

  • it supports selecting which item to keep from a set of duplicates according to a priority
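As a rough illustration of the idea (not the repository's actual implementation), priority-based selection within clusters of near-duplicates could be sketched in plain Python; the cluster structure is an assumption, and only the `__minhash_priority__` field name comes from the README:

```python
# Sketch: given clusters of near-duplicate records (assumed to have been
# found already by MinHash/LSH), keep the highest-priority record per
# cluster instead of an arbitrary one.

def keep_by_priority(clusters):
    """Return one record per cluster: the one with the highest priority."""
    kept = []
    for cluster in clusters:
        best = max(cluster, key=lambda rec: rec.get("__minhash_priority__", 0))
        kept.append(best)
    return kept

clusters = [
    [
        {"text": "What's your name?", "__minhash_priority__": 20},
        {"text": "what's your name",  "__minhash_priority__": 1},
    ],
    [
        {"text": "My name is John.", "__minhash_priority__": 5},
    ],
]
kept = keep_by_priority(clusters)
```

Without priorities, which duplicate survives depends on processing order; the priority field makes that choice explicit and deterministic.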

Usage

  1. Process your dataset into the Hugging Face datasets format. For example, a dataset in JSONL format:

```json
{"source": "ABC", "text": "What's your name?"}
{"source": "ABC", "text": "My name is John."}
```

  2. Add a __keep__ or __minhash_priority__ key to your dataset:

```json
{"source": "ABC", "text": "What's your name?", "__keep__": true}
{"source": "ABC", "text": "My name is John.", "__keep__": false}
```

or

```json
{"source": "ABC", "text": "What's your name?", "__minhash_priority__": 20}
{"source": "ABC", "text": "My name is John.", "__minhash_priority__": 1}
```

  3. Run the MinHash deduplication script. Use --column to specify the column to deduplicate on:

```bash
python -m text_dedup.minhash \
    --path "json" \
    --data_files "dataset.jsonl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "dataset_deduplicated" \
    --column "text" \
    --ngram 4 \
    --threshold 0.8 \
    --batch_size 10000 \
    --use_auth_token true
```

  4. The results are saved in the Hugging Face datasets format and can be loaded with datasets.load_from_disk().
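Steps 1 and 2 above can be sketched with the standard library alone; the filename matches the command in step 3, but the priority rule here (longer text wins) is a made-up example — any application-specific rule works:

```python
import json

records = [
    {"source": "ABC", "text": "What's your name?"},
    {"source": "ABC", "text": "My name is John."},
]

# Step 2: attach a priority to each record. The rule below is purely
# illustrative; in practice it might encode source quality or recency.
for rec in records:
    rec["__minhash_priority__"] = len(rec["text"])

# Step 1: one JSON object per line, as expected by --path "json".
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

The same priority assignment could also be applied to an already-loaded Hugging Face dataset via Dataset.map().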

Acknowledgements

This repository builds on ChenghaoMou/text-dedup. More details can be found in the original repository.

Owner

  • Name: Zongmeng Zhang
  • Login: zmzhang2000
  • Kind: user
  • Company: University of Science and Technology of China

I am currently pursuing a master's degree at the University of Science and Technology of China (USTC).

Citation (CITATION.bib)

@software{chenghao_mou_2023_8364980,
  author       = {Chenghao Mou and
                  Chris Ha and
                  Kenneth Enevoldsen and
                  Peiyuan Liu},
  title        = {ChenghaoMou/text-dedup: Reference Snapshot},
  month        = sep,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {2023.09.20},
  doi          = {10.5281/zenodo.8364980},
  url          = {https://doi.org/10.5281/zenodo.8364980}
}

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1

Dependencies

.github/workflows/bot.yml actions
  • actions/stale v9.0.0 composite
.github/workflows/coverage.yaml actions
  • actions/checkout v2 composite
  • codacy/codacy-coverage-reporter-action v1 composite
.github/workflows/docs.yaml actions
  • actions/checkout v3 composite
  • actions/deploy-pages v1 composite
  • actions/setup-python v3 composite
  • actions/upload-artifact v3 composite
Dockerfile docker
  • python 3.10-slim build
poetry.lock pypi
  • 123 dependencies
pyproject.toml pypi
  • coverage ^7.4.3 develop
  • insegel ^1.3.1 develop
  • pre-commit ^3.6.2 develop
  • pytest ^8.0.2 develop
  • ruff ^0.3.2 develop
  • scikit-learn ^1.4.1.post1 develop
  • tabulate ^0.9.0 develop
  • bitarray >=2.6.2
  • click ^8.1.7
  • click-option-group ^0.5.6
  • datasets >=2.17.0
  • fire ^0.6.0
  • ftfy >=6.1.1
  • matplotlib >=3.10.3
  • numpy >=1.26.4
  • pandarallel ^1.6.5
  • psutil >=5.9.8
  • pybloom-live >=4.0.0
  • pyspark >=3.3.1
  • python ^3.10
  • regex >=2023.5.5
  • rich ^13.7.1
  • scipy >=1.10.1
  • setuptools >=69.1.0
  • sphinxcontrib-bibtex >=2.5.0
  • tensorflow ^2.16.1
  • tqdm >=4.64.1
  • unisim ^0.0.1
  • urllib3 <=2.0
  • xxhash >=3.0.0
  • zstandard >=0.21.0