dedup-rs

https://github.com/wheynelau/text-dedup-rs

Science Score: 31.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.0%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: wheynelau
License: apache-2.0
Language: Python
Default Branch: main
Size: 5.93 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme License Citation

README.md

MAKING DEDUP GO BRRRRRR 🚀🚀🚀 (... somewhat)

"The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet." — Michael A. Jackson

"If it ain't broke, don't fix it." — Bert Lance

Description

The original algorithm is from the original text-dedup project (see Citation below), and I just ported it to Rust.
It uses a MinHash LSH algorithm to find similar documents.
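For readers new to MinHash LSH, here is a minimal Python sketch of the general idea. It is not the code from this repo or from the original project. The names (`shingles`, `minhash`, `candidate_pairs`) and the parameters (`NUM_PERM`, `BANDS`, `ROWS`, word 3-grams, BLAKE2b as the hash) are assumptions chosen for illustration only.

```python
import hashlib
from collections import defaultdict

# Hypothetical parameters: BANDS * ROWS must equal NUM_PERM.
NUM_PERM = 64
BANDS, ROWS = 16, 4


def shingles(text, n=3):
    """Return the set of word n-grams in `text`."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def hash64(shingle, seed):
    """64-bit hash of a shingle, salted with `seed` to simulate one permutation."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """Signature: for each seed, the minimum hash over all shingles."""
    grams = shingles(text)
    return [min(hash64(g, seed) for g in grams) for seed in range(NUM_PERM)]


def candidate_pairs(docs):
    """Split each signature into bands; documents sharing a band are candidates."""
    buckets = defaultdict(list)
    for doc_id, text in enumerate(docs):
        sig = minhash(text)
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "completely unrelated sentence about rust and python bindings",
]
print(candidate_pairs(docs))
```

The two identical documents produce identical signatures, so they share every band and come out as a candidate pair. The third document shares no shingles with them, so it should not be paired with either. A real pipeline would then verify candidates and cluster them, which this sketch leaves out.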

Learning points

I intend to write a gist for the additional issues for learning purposes.

- flamegraph: Learnt quite a bit about flamegraph when my initial runs were slower than Python. Oof
- Concepts I never knew existed in Python: integer overflow, bytes encoding and decoding, memory reallocation, etc.
- Writing tests for Rust code
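To make the integer overflow and bytes encoding points concrete, here is a small Python sketch. It is an illustration under assumptions, not code from this repo: `wrapping_mul_u32` is a hypothetical helper that mimics what Rust's `u32::wrapping_mul` does, and the numbers are arbitrary.

```python
MASK32 = (1 << 32) - 1


def wrapping_mul_u32(a, b):
    """Multiply like Rust's u32 `wrapping_mul`: keep only the low 32 bits."""
    return (a * b) & MASK32


a, b = 4_000_000_000, 3

# Python ints grow as needed, so the product is exact.
exact = a * b

# A fixed-width u32 cannot hold that value; only the low 32 bits survive.
wrapped = wrapping_mul_u32(a, b)

print(exact, wrapped)

# Strings and bytes are different things: one character can be several bytes.
word = "na\u00efve"            # 5 characters
encoded = word.encode("utf-8")  # 6 bytes, because the i-diaeresis takes 2 bytes
print(len(word), len(encoded), encoded.decode("utf-8") == word)
```

In Rust, a plain `a * b` on `u32` values that overflows panics in debug builds and wraps in release builds, so the wrapping has to be chosen explicitly (for example with `wrapping_mul`) to get the same result everywhere. In Python the same arithmetic never overflows, which is why a straight port can produce different hash values unless the width is masked by hand.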

How to run

I have not tested this in a fresh environment, so expect "It works on my machine" issues.
Additionally, I have not set up wheels, so only building from source is possible.

```bash
# from the original repo
# this assumes you have python, environment management is up to you
curl https://sh.rustup.rs -sSf | sh # installing rust
pip install .
python tests/benchmark_core.py
```

Docker

```bash
docker build -t text-dedup .
docker run text-dedup "python tests/benchmark_core.py"
docker run text-dedup "python tests/benchmark_news.py"
```

Changes made to original code

- Removed pyspark from requirements
- Removed simhash and pyspark runs from benchmark core
- Added minhashrust and pure Rust runs to benchmark core and news
- Added a new dataset for speed and memory testing

Results

Benchmark results are in BENCHMARKS

Issues

There are some minor differences in the scores; you can review them in the benchmarks.
I found out that the major difference previously was due to the u32 type.

For testing purposes, the parallelized version is kept. The original code is at this point

TODO

- [x] Write setup.py -> set up pyproject.toml for pip install .
- [ ] Remove hard-coded values
- [x] ~~End goal: Make it work with pyspark~~ Check out the experimental-pyspark branch
- [ ] Check for potential improvements
- [ ] Write idiomatic Rust code
- [ ] Allow generics for u32, u64, u128
- [ ] Move to numpy crate
- [ ] Implement test for u64
- [ ] Try to get a closer match on the benchmark scores

Owner

Name: Wayne Lau
Login: wheynelau
Kind: user
Location: Singapore

Repositories: 2
Profile: https://github.com/wheynelau

Aspiring machine learning engineer to be

Citation (CITATION.bib)

@software{chenghao_mou_2023_8364980,
  author       = {Chenghao Mou and
                  Chris Ha and
                  Kenneth Enevoldsen and
                  Peiyuan Liu},
  title        = {ChenghaoMou/text-dedup: Reference Snapshot},
  month        = sep,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {2023.09.20},
  doi          = {10.5281/zenodo.8364980},
  url          = {https://doi.org/10.5281/zenodo.8364980}
}

GitHub Events

Total

Push event: 10
Create event: 2

Last Year

Push event: 10
Create event: 2

Packages

Total packages: 1
Total downloads:
- pypi 254 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 1
Total maintainers: 1

pypi.org: dedup-rs

A Rust library for deduplication of documents

Documentation: https://dedup-rs.readthedocs.io/
License: apache-2.0
Latest release: 0.1.1
published about 2 years ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 254 Last month

Rankings

Dependent packages count: 10.9%

Average: 36.0%

Dependent repos count: 61.1%

Maintainers (1)

wheynelau

Last synced: 11 months ago

Dependencies

.github/workflows/CI.yml actions

PyO3/maturin-action v1 composite
actions/checkout v4 composite
actions/download-artifact v4 composite
actions/setup-python v5 composite
actions/upload-artifact v4 composite

Cargo.toml cargo

Dockerfile docker

python 3.10-slim build

poetry.lock pypi

101 dependencies

pyproject.toml pypi

bitarray >=2.6.2
click ~=8.1.7
click-option-group ~=0.5.6
datasets >=2.17.0
fire ~=0.6.0
ftfy >=6.1.1
numpy >=1.26.4
psutil >=5.9.8
pybloom-live >=4.0.0
regex >=2023.5.5
rich ~=13.7.1
scipy >=1.10.1
setuptools >=69.1.0
sphinxcontrib-bibtex >=2.5.0
tqdm >=4.64.1
unisim ~=0.0.1
urllib3 <=2.0
xxhash >=3.0.0
zstandard >=0.21.0
