Science Score: 31.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.0%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: wheynelau
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 5.93 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

MAKING DEDUP GO BRRRRRR 🚀🚀🚀 (... somewhat)

"The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet." — Michael A. Jackson

"If it ain't broke, don't fix it." — Bert Lance

Description

The original algorithm is from here; I just ported it to Rust.
It uses a minhash LSH algorithm to find similar documents.
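For readers new to the idea, here is a minimal Python sketch of MinHash plus LSH banding. It is not the code from this repo or from the original: the names (`shingles`, `minhash`, `candidate_pairs`), the parameter values (`NUM_PERM`, `BANDS`), and the use of salted `blake2b` as the hash family are all assumptions picked to keep the example short and stdlib-only.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64               # signature length (assumed value)
BANDS = 16                  # number of LSH bands (assumed value)
ROWS = NUM_PERM // BANDS    # signature rows per band


def shingles(text, n=3):
    """Return the set of word n-grams in `text`."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash(text):
    """Keep the minimum hash per seeded hash function.

    Two documents agree at a given position with probability roughly
    equal to the Jaccard similarity of their shingle sets.
    """
    grams = shingles(text)
    signature = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")  # one salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def candidate_pairs(docs):
    """Bucket documents band by band; any shared bucket makes a candidate pair."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for band in range(BANDS):
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs
```

Calling `candidate_pairs({"a": text_a, "b": text_b})` returns the set of id pairs that collide in at least one band. More bands with fewer rows each catch lower-similarity pairs at the cost of more false positives, so candidates are normally re-checked against a similarity threshold afterwards.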

Learning points

I intend to write a gist for the additional issues for learning purposes.

  • flamegraph: Learnt quite a bit about flamegraph when my initial runs were slower than Python. Oof
  • Concepts I never had to think about in Python: integer overflow, bytes encoding and decoding, memory reallocation, etc.
  • Writing tests for Rust code
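To make the overflow and bytes points concrete, here is a small Python sketch. It is not taken from this repo; the helper names (`hash32`, `wrapping_mul_u32`) and the choice of SHA-1 with a little-endian decode are assumptions for illustration. Python integers grow as needed, while a Rust `u32` keeps only 32 bits, so the same arithmetic can give different numbers unless the Python side is masked.

```python
import hashlib
import struct

MASK32 = (1 << 32) - 1  # largest value a u32 can hold: 4_294_967_295


def hash32(data: bytes) -> int:
    """Decode the first 4 bytes of a SHA-1 digest as a little-endian u32."""
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]


def wrapping_mul_u32(a: int, b: int) -> int:
    """Emulate a wrapping 32-bit multiply by keeping only the low 32 bits."""
    return (a * b) & MASK32


big = 4_000_000_000                 # still fits in a u32
exact = big * 3                     # Python: 12_000_000_000, no overflow
wrapped = wrapping_mul_u32(big, 3)  # 32-bit wrap-around: 3_410_065_408

# Byte order matters when decoding: the same 4 bytes give different integers.
raw = b"\x01\x00\x00\x00"
little = int.from_bytes(raw, "little")  # 1
big_endian = int.from_bytes(raw, "big")  # 16_777_216
```

In Rust, a plain `*` on `u32` panics on overflow in debug builds and wraps in release builds, so the wrapping behaviour has to be chosen explicitly (for example with `wrapping_mul`) to get the same result in both.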

How to run

I have not tested this in a fresh environment, so expect "It works on my machine" issues.
Additionally, I have not set up wheels, so only building from source is possible.

```bash
# from the original repo
# this assumes you have python, environment management is up to you
curl https://sh.rustup.rs -sSf | sh # installing rust
pip install .
python tests/benchmark_core.py
```

Docker

```bash
docker build -t text-dedup .
docker run text-dedup "python tests/benchmark_core.py"
docker run text-dedup "python tests/benchmark_news.py"
```

Changes made to original code

  • Removed pyspark from requirements
  • Removed simhash and pyspark runs from benchmark core
  • Added minhashrust and pure Rust runs to benchmark core and news
  • Added a new dataset for speed and memory testing

Results

Benchmark results are in BENCHMARKS

Issues

  1. There are some minor differences in the scores; you can review them in the benchmarks.
  2. The major difference seen previously turned out to be due to the u32 type.

For testing purposes, the parallelized version is kept. The original code is at this point

TODO

  • [x] Write setup.py -> setup pyproject.toml for pip install .
  • [ ] Remove hard codes
  • [x] ~~End goal: Make it work with pyspark~~ Check out the experimental-pyspark branch
  • [ ] Check for potential improvements
  • [ ] Write idiomatic rust code
  • [ ] Allow generics for u32,u64,u128
  • [ ] Move to numpy crate
  • [ ] Implement test for u64
  • [ ] Try to get a closer match on the benchmark scores

Owner

  • Name: Wayne Lau
  • Login: wheynelau
  • Kind: user
  • Location: Singapore

Aspiring machine learning engineer to be

Citation (CITATION.bib)

@software{chenghao_mou_2023_8364980,
  author       = {Chenghao Mou and
                  Chris Ha and
                  Kenneth Enevoldsen and
                  Peiyuan Liu},
  title        = {ChenghaoMou/text-dedup: Reference Snapshot},
  month        = sep,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {2023.09.20},
  doi          = {10.5281/zenodo.8364980},
  url          = {https://doi.org/10.5281/zenodo.8364980}
}

GitHub Events

Total
  • Push event: 10
  • Create event: 2
Last Year
  • Push event: 10
  • Create event: 2

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 254 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 1
  • Total maintainers: 1
pypi.org: dedup-rs

A Rust library for deduplication of documents

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 254 Last month
Rankings
Dependent packages count: 10.9%
Average: 36.0%
Dependent repos count: 61.1%
Maintainers (1)
Last synced: 11 months ago

Dependencies

.github/workflows/CI.yml actions
  • PyO3/maturin-action v1 composite
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
Cargo.toml cargo
Dockerfile docker
  • python 3.10-slim build
poetry.lock pypi
  • 101 dependencies
pyproject.toml pypi
  • bitarray >=2.6.2
  • click ~=8.1.7
  • click-option-group ~=0.5.6
  • datasets >=2.17.0
  • fire ~=0.6.0
  • ftfy >=6.1.1
  • numpy >=1.26.4
  • psutil >=5.9.8
  • pybloom-live >=4.0.0
  • regex >=2023.5.5
  • rich ~=13.7.1
  • scipy >=1.10.1
  • setuptools >=69.1.0
  • sphinxcontrib-bibtex >=2.5.0
  • tqdm >=4.64.1
  • unisim ~=0.0.1
  • urllib3 <=2.0
  • xxhash >=3.0.0
  • zstandard >=0.21.0