Science Score: 31.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: wheynelau
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 5.93 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md

MAKING DEDUP GO BRRRRRR 🚀🚀🚀 (... somewhat)
"The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet." — Michael A. Jackson
"If it ain't broke, don't fix it." — Bert Lance
Description
The original algorithm is from here; I just ported it to Rust.
It uses a MinHash LSH algorithm to find similar documents.
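For readers new to MinHash LSH, here is a minimal pure-Python sketch of the general technique. It is not the code from this repo or from the original project: the function names, the word 3-gram shingling, the SHA-1 base hash, and the band/row settings are all assumptions chosen for illustration.

```python
import hashlib
import random
from collections import defaultdict

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the permutation hashes (assumed)
MAX_HASH = (1 << 32) - 1        # keep each signature value within 32 bits


def shingles(text, n=3):
    """Return the set of word n-grams in the text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def base_hash(shingle):
    """Stable 32-bit hash: the first 4 bytes of the SHA-1 digest."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def make_permutations(num_perm, seed=42):
    """Random (a, b) pairs for the hash family h -> (a * h + b) % p."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_perm)
    ]


def minhash_signature(text, perms, n=3):
    """For each permutation, keep the minimum permuted hash over all shingles."""
    hashes = [base_hash(s) for s in shingles(text, n)]
    return [
        min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashes)
        for a, b in perms
    ]


def lsh_candidates(signatures, bands, rows):
    """Split signatures into bands; documents sharing any band become candidates."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


if __name__ == "__main__":
    perms = make_permutations(num_perm=128)
    docs = {
        "a": "the quick brown fox jumps over the lazy dog near the river bank",
        "b": "the quick brown fox jumps over the lazy dog near the river",
        "c": "rust ownership rules make memory reallocation bugs easier to spot",
    }
    sigs = {doc_id: minhash_signature(text, perms) for doc_id, text in docs.items()}
    print(lsh_candidates(sigs, bands=32, rows=4))
```

The band/row split is the tuning knob: more rows per band makes a candidate match stricter, while more bands gives near-duplicates more chances to collide.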
Learning points
I intend to write a gist covering the additional issues I ran into, for learning purposes.
- flamegraph: Learnt quite a bit about flamegraph when my initial runs were slower than Python. Oof.
- Concepts I never had to think about in Python: integer overflow, bytes encoding and decoding, memory reallocation, etc.
- Writing tests for Rust code
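As a small illustration of the integer overflow and bytes points above, the sketch below contrasts Python's arbitrary-precision integers with 32-bit wrapping arithmetic. It does not reproduce the actual bugs hit during the port; the helper name `wrapping_mul_u32` and the sample values are made up for the example.

```python
U32_MASK = 0xFFFFFFFF


def wrapping_mul_u32(a, b):
    """Emulate Rust's u32::wrapping_mul by keeping only the low 32 bits."""
    return (a * b) & U32_MASK


a, b = 4_000_000_000, 3

# Python ints grow as needed, so the product is exact.
exact = a * b

# A 32-bit unsigned integer can only hold the product modulo 2**32.
wrapped = wrapping_mul_u32(a, b)

# The same four bytes decode to different integers depending on byte order,
# so both sides of a port must agree on "little" vs "big".
raw = "dedup".encode("utf-8")[:4]
little = int.from_bytes(raw, "little")
big = int.from_bytes(raw, "big")

print(exact, wrapped)
print(hex(little), hex(big))
```

In Rust, a plain `*` on `u32` values panics on overflow in debug builds and wraps in release builds, which is why a port has to decide the behaviour explicitly.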
How to run
I have not tested this in a fresh environment, so expect "It works on my machine" issues.
Additionally, I have not set up wheels, so only building from source is possible.
```bash
# from the original repo
# this assumes you have python, environment management is up to you
curl https://sh.rustup.rs -sSf | sh  # installing rust
pip install .
python tests/benchmark_core.py
```
Docker
```bash
docker build -t text-dedup .
docker run text-dedup "python tests/benchmark_core.py"
docker run text-dedup "python tests/benchmark_news.py"
```
Changes made to original code
- Removed pyspark from requirements
- Removed simhash and pyspark runs from benchmark core
- Added minhashrust and pure Rust runs to the core and news benchmarks
- Added a new dataset for speed and memory testing
Results
Benchmark results are in BENCHMARKS
Issues
- There are some minor differences in the scores; you can review them in the benchmarks.
- The major difference seen previously turned out to be caused by the u32 type.
For testing purposes, the parallelized version is kept. The original code is at this point.
TODO
- [x] Write setup.py -> set up pyproject.toml for `pip install .`
- [ ] Remove hard-coded values
- [x] ~~End goal: Make it work with pyspark~~ Check out the experimental-pyspark branch
- [ ] Check for potential improvements
- [ ] Write idiomatic rust code
- [ ] Allow generics for u32, u64, u128
- [ ] Move to numpy crate
- [ ] Implement test for u64
- [ ] Try to get a closer match on the benchmark scores
Owner
- Name: Wayne Lau
- Login: wheynelau
- Kind: user
- Location: Singapore
- Repositories: 2
- Profile: https://github.com/wheynelau
Aspiring machine learning engineer to be
Citation (CITATION.bib)
@software{chenghao_mou_2023_8364980,
author = {Chenghao Mou and
Chris Ha and
Kenneth Enevoldsen and
Peiyuan Liu},
title = {ChenghaoMou/text-dedup: Reference Snapshot},
month = sep,
year = 2023,
publisher = {Zenodo},
version = {2023.09.20},
doi = {10.5281/zenodo.8364980},
url = {https://doi.org/10.5281/zenodo.8364980}
}
GitHub Events
Total
- Push event: 10
- Create event: 2
Last Year
- Push event: 10
- Create event: 2
Packages
- Total packages: 1
- Total downloads: pypi 254 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 1
- Total maintainers: 1
pypi.org: dedup-rs
A Rust library for deduplication of documents
- Documentation: https://dedup-rs.readthedocs.io/
- License: apache-2.0
- Latest release: 0.1.1, published about 2 years ago
Dependencies
- PyO3/maturin-action v1 composite
- actions/checkout v4 composite
- actions/download-artifact v4 composite
- actions/setup-python v5 composite
- actions/upload-artifact v4 composite
- python 3.10-slim build
- 101 dependencies
- bitarray >=2.6.2
- click ~=8.1.7
- click-option-group ~=0.5.6
- datasets >=2.17.0
- fire ~=0.6.0
- ftfy >=6.1.1
- numpy >=1.26.4
- psutil >=5.9.8
- pybloom-live >=4.0.0
- regex >=2023.5.5
- rich ~=13.7.1
- scipy >=1.10.1
- setuptools >=69.1.0
- sphinxcontrib-bibtex >=2.5.0
- tqdm >=4.64.1
- unisim ~=0.0.1
- urllib3 <=2.0
- xxhash >=3.0.0
- zstandard >=0.21.0