Updated 6 months ago
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
https://github.com/commoncrawl/web-languages
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
https://github.com/commoncrawl/language-detection-cld2
Natural language detection, Java bindings for CLD2
https://github.com/adbar/py3langid
Faster, modernized fork of the language identification tool langid.py
colibri-utils
NLP utilities that rely on Colibri Core; currently limited to language identification