Recent Releases of parallel-urls-classifier
parallel-urls-classifier - Dataset
Dataset used to train and evaluate the released model. Necessary steps to use the dataset in the code:
Decompress:
```bash
xz -d train.tsv.xz
xz -d dev.tsv.xz
xz -d test.tsv.xz
```
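If you would rather not keep decompressed copies on disk, the files can also be read straight from the `.xz` archives. The following is a minimal sketch, not part of the released code: `iter_tsv_rows` is a hypothetical helper, and it assumes the files are UTF-8 and tab-separated (as the `.tsv` extension suggests). The release notes do not describe the column layout, so the sketch returns raw fields and leaves their interpretation to you.

```python
import csv
import lzma
from typing import Iterator, List


def iter_tsv_rows(path: str) -> Iterator[List[str]]:
    """Yield each row of an xz-compressed TSV file as a list of fields.

    Hypothetical helper: assumes UTF-8 text and tab-separated columns.
    """
    # lzma.open in text mode decompresses on the fly, so no temporary file is needed.
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quote characters (which may appear in URLs) as plain text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row
```

For example, `next(iter_tsv_rows("train.tsv.xz"))` returns the fields of the first row, which is a quick way to inspect the layout before wiring the data into the training scripts.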
Published by cgr71ii almost 2 years ago
parallel-urls-classifier - PyTorch model
PyTorch model that can be used with the code provided in this repository. A manually converted, Hugging Face-compatible model is also available: https://huggingface.co/Transducens/xlm-roberta-base-parallel-urls-classifier
You may prefer this version over the Hugging Face one if, for example, you want to run the Gunicorn server and the available scripts without writing new code.
Published by cgr71ii almost 2 years ago
parallel-urls-classifier - MaCoCu v1 wordfreq files
Created following the method described in the Bicleaner AI repo:
```bash
l="bg"
cat monolingual.${l} \
    | sacremoses -l ${l} tokenize -x \
    | awk '{print tolower($0)}' \
    | tr ' ' '\n' \
    | LC_ALL=C sort | uniq -c \
    | LC_ALL=C sort -nr \
    | grep -v '[[:space:]]*1' \
    | pigz -c > wordfreq-${l}.gz
```
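For readers who want to see the counting logic outside the shell, here is a rough Python sketch of the same idea. It is not the script used to build the released files, and it differs from the pipeline in several ways that are assumptions of this sketch: input lines are assumed to be already tokenized (the `sacremoses` step is omitted), tokens are split on any whitespace rather than only on spaces, counts are written without the left padding that `uniq -c` produces, and the final `grep` is approximated by a `min_count` threshold that drops tokens seen only once. The names `count_tokens`, `write_wordfreq`, and `min_count` are hypothetical.

```python
import gzip
from collections import Counter
from typing import Iterable


def count_tokens(lines: Iterable[str]) -> Counter:
    """Count lowercased, whitespace-separated tokens (input assumed pre-tokenized)."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def write_wordfreq(counts: Counter, out_path: str, min_count: int = 2) -> None:
    """Write '<count> <token>' lines, most frequent first, gzip-compressed.

    min_count=2 is an assumption standing in for the grep filter in the shell pipeline.
    """
    # Sort by descending count, then by token so the output order is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for token, count in ordered:
            if count >= min_count:
                out.write(f"{count} {token}\n")
```

A possible usage, assuming a pre-tokenized file named `monolingual.bg.tok`: open it in text mode, pass the file object to `count_tokens`, then call `write_wordfreq(counts, "wordfreq-bg.gz")`.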
Published by cgr71ii about 3 years ago