Recent Releases of parallel-urls-classifier

parallel-urls-classifier - Dataset

Dataset used to train and evaluate the released model. Steps needed before the dataset can be used with the code:

Decompress:

```bash
xz -d train.tsv.xz
xz -d dev.tsv.xz
xz -d test.tsv.xz
```
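Where the `xz` command-line tool is not available, the decompression step above can probably be reproduced with Python's standard `lzma` module. This is only a sketch: the helper name `decompress_xz` is made up for this example, and unlike `xz -d` it leaves the compressed originals in place.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(path):
    """Decompress a .xz file next to itself and return the output path.

    Hypothetical helper for illustration; not part of the repository.
    """
    src = Path(path)
    if src.suffix != ".xz":
        raise ValueError(f"expected a .xz file, got {src}")
    dst = src.with_suffix("")  # e.g. train.tsv.xz -> train.tsv
    with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
        # Stream in chunks so large files are not loaded into memory at once.
        shutil.copyfileobj(fin, fout)
    return dst


if __name__ == "__main__":
    # File names taken from the release; missing files are skipped.
    for name in ("train.tsv.xz", "dev.tsv.xz", "test.tsv.xz"):
        if Path(name).exists():
            print(decompress_xz(name))
```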

Published by cgr71ii almost 2 years ago

parallel-urls-classifier - PyTorch model

PyTorch model that can be used with the code provided in this repository. A manually converted, HuggingFace-compatible model is also available: https://huggingface.co/Transducens/xlm-roberta-base-parallel-urls-classifier

You may prefer this version over the HuggingFace one if, for example, you want to use the Gunicorn server and the available scripts without writing new code.

Published by cgr71ii almost 2 years ago

parallel-urls-classifier - MaCoCu v1 wordfreq files

Created following the method described in the Bicleaner AI repo:

```bash
l="bg"

cat monolingual.${l} \
    | sacremoses -l ${l} tokenize -x \
    | awk '{print tolower($0)}' \
    | tr ' ' '\n' \
    | LC_ALL=C sort | uniq -c \
    | LC_ALL=C sort -nr \
    | grep -v '[[:space:]]*1' \
    | pigz -c > wordfreq-${l}.gz
```
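For readers who prefer to stay in Python, the counting part of the pipeline above can be sketched roughly as follows. This is an illustration, not the script used to build the released files. It assumes the input is already tokenized (the `sacremoses` step is not reproduced) and that the final `grep` is meant to drop words seen only once. The names `build_wordfreq` and `write_wordfreq` are made up for this example, and the order of words with equal counts may differ from what `sort -nr` produces.

```python
import gzip
from collections import Counter


def build_wordfreq(lines, min_count=2):
    """Count lowercased, space-separated tokens and keep the frequent ones.

    Hypothetical helper; min_count=2 encodes the assumption that words
    seen only once should be dropped.
    """
    counts = Counter()
    for line in lines:
        # Lowercase and split on single spaces; empty tokens are skipped here.
        tokens = line.rstrip("\n").lower().split(" ")
        counts.update(tok for tok in tokens if tok)
    kept = [(word, n) for word, n in counts.items() if n >= min_count]
    # Highest count first; ties broken alphabetically (an arbitrary choice).
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept


def write_wordfreq(pairs, path):
    """Write one "<count> <word>" line per entry to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for word, n in pairs:
            out.write(f"{n} {word}\n")


if __name__ == "__main__":
    sample = ["The cat saw the dog", "the dog ran"]
    print(build_wordfreq(sample))
```

Reading a real corpus would mean passing an open file object (for example `open("monolingual.bg", encoding="utf-8")`) as `lines`, since the function only iterates over it line by line.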

Published by cgr71ii about 3 years ago