https://github.com/alexeyev/awesome-kyrgyz-nlp
Kyrgyz language processing software, models and datasets.
Keywords
awesome-list
corpus
kyrgyz
morphology
natural-language-processing
turkic
turkic-languages
Statistics
- Stars: 30
- Watchers: 3
- Forks: 4
- Open Issues: 1
- Releases: 0
Created over 3 years ago
· Last pushed 12 months ago
Awesome Kyrgyz NLP 
A curated list of awesome Kyrgyz language processing software, models and datasets. Inspired by awesome-ML.
The main focus is on open source tools, downloadable data and research papers with code.
If you want to contribute to this list (please do), send me a pull request. Also, a listed repository should be tagged as deprecated if:
- The repository's owners explicitly say that "this library is not maintained".
- The repository has not been committed to for a long time (2-3 years).
Table of Contents
- Awesome Kyrgyz NLP
- Table of Contents
- Datasets
- Pretrained models
- Methods/Software
- Online Demos
- Miscellaneous
Datasets
Corpora
- Manas-UdS: 1.2M words, 84 literary texts, 5 genres: novel, novelette, epic, minor epic, and fairy tale; lemmata, PoS tags, rich per-text metadata.
- kyWaC: Kyrgyz corpus from the web, 19M words, Jan 2012 [not open]
- Kyrgyz in Leipzig Corpora Collection: Community data / Newscrawl (1M sentences) / Wikipedia (300K sentences)
- TilCorpusu: Kyrgyz corpus, 100M words, news+fiction, made public in July 2023 (just the News part due to legal restrictions)
- TurkLang-7: parallel corpora mentioned in the 2020 work "First Results of the 'TurkLang-7' Project: Creating Russian-Turkic Parallel Corpora and MT Systems" by Khusainov, A., Suleymanov, D., Gilmullin, R., Minsafina, A., Kubedinova, L., Abdurakhmonova, N. [status?]
Character recognition
- Kyrgyz language hand-written letters (Kyrgyz MNIST): hand-written Kyrgyz alphabet letters collection for machine learning applications; original images (a total of 80213) have been transformed to 50x50 images, then to CSV format
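A minimal sketch of turning one CSV row back into a 50x50 pixel grid. The column layout is an assumption (one label followed by 2500 pixel values, no header); the actual files should be checked before relying on it. The function names and the demo row are hypothetical and not taken from the dataset.

```python
import csv
import io

SIDE = 50  # image side length given in the dataset description


def row_to_image(row, label_first=True):
    """Split one CSV row into (label, SIDE x SIDE grid of ints).

    ASSUMPTION: a row holds one label plus SIDE*SIDE pixel values;
    whether the label comes first or last is controlled by label_first.
    """
    if len(row) != SIDE * SIDE + 1:
        raise ValueError(f"expected {SIDE * SIDE + 1} columns, got {len(row)}")
    label = row[0] if label_first else row[-1]
    pixels = row[1:] if label_first else row[:-1]
    values = [int(float(p)) for p in pixels]
    grid = [values[r * SIDE:(r + 1) * SIDE] for r in range(SIDE)]
    return label, grid


def read_images(text_stream, has_header=False, label_first=True):
    """Yield (label, grid) pairs from a CSV text stream."""
    reader = csv.reader(text_stream)
    if has_header:
        next(reader, None)
    for row in reader:
        if row:  # skip blank lines
            yield row_to_image(row, label_first)


# Synthetic demo row (not real dataset content): label "7" and a black
# image with a single bright pixel at row 1, column 2.
demo_pixels = [0] * (SIDE * SIDE)
demo_pixels[1 * SIDE + 2] = 255
demo_csv = io.StringIO(",".join(["7"] + [str(p) for p in demo_pixels]) + "\n")
images = list(read_images(demo_csv))
```

With a real file, the `io.StringIO` object would be replaced by `open(path, newline="", encoding="utf-8")`.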
Raw text
- kloop corpus: 16'826 articles (sqlite3 DB file) + crawler code
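Since the table layout of the sqlite3 file is not described here, a reasonable first step is to inspect the schema rather than guess table names. The sketch below lists the tables and their columns using only the standard library; the `articles` table in the demo is hypothetical and stands in for whatever the real file contains.

```python
import sqlite3


def describe_database(conn):
    """Return {table_name: [column names]} for an open SQLite connection."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    schema = {}
    for name in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        schema[name] = [col[1] for col in columns]
    return schema


# Demo on an in-memory database with a HYPOTHETICAL table layout.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE articles (url TEXT, title TEXT, body TEXT)")
schema = describe_database(demo)
demo.close()
```

For the downloaded corpus, the connection would instead be opened with `sqlite3.connect(path_to_db_file)`; note that `sqlite3.connect` silently creates an empty database if the path does not exist.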
Morphology & Syntax
- UD project comments on difficulties in Turkish language processing, which may shed light on why parsing Kyrgyz is hard as well
- KTMU's UD Treebank: 781 sentences originally; since updated with more sentences and some fixes to the previous version of the dataset
- Small UD Treebank: 145 sentences (incl. 20 Cairo sentences) and ~100 sentences suggested by the UD Turkic Group; a part of the UD Turkic Treebank; note that translations into English, Azerbaijani and Turkish are also available
- Verbal paradigms for Kyrgyz (100 Kyrgyz verbs fully conjugated in all tenses) by Aytnatova Alima, annotation for Unimorph by E. Chodroff
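UD treebanks such as the ones above are distributed in the CoNLL-U format: ten tab-separated columns per token, `#` comment lines, and a blank line between sentences. The following is a minimal reader sketch using only the standard library. The toy sentence is hand-written for illustration and is not taken from any of the listed treebanks; its annotation should not be treated as gold.

```python
import io
from collections import Counter

FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(stream):
    """Yield sentences as lists of token dicts from a CoNLL-U text stream."""
    sentence = []
    for line in stream:
        line = line.rstrip("\n")
        if not line:  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):  # sentence-level comment / metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        token = dict(zip(FIELDS, cols))
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token["id"] or "." in token["id"]:
            continue
        sentence.append(token)
    if sentence:  # file may not end with a blank line
        yield sentence


# Hand-written toy example (ILLUSTRATIVE ONLY, not treebank data).
toy_rows = [
    ("1", "Мен", "мен", "PRON", "_", "_", "3", "nsubj", "_", "_"),
    ("2", "китеп", "китеп", "NOUN", "_", "_", "3", "obj", "_", "_"),
    ("3", "окудум", "оку", "VERB", "_", "_", "0", "root", "_", "_"),
    ("4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"),
]
toy = ("# text = Мен китеп окудум.\n"
       + "\n".join("\t".join(r) for r in toy_rows) + "\n\n")

sentences = list(read_conllu(io.StringIO(toy)))
upos_counts = Counter(tok["upos"] for sent in sentences for tok in sent)
```

For real treebank files, dedicated parsers handle more edge cases; this sketch is only meant to show the file structure.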
Named Entity Recognition
Text Classification
- Kyrgyz Multi-Label News Classification: training and evaluation code as well as the dataset of 1000/500 news documents are available
Word Similarity Data
- Kyrgyz Word Embedding Evaluation: [not published yet]; the 2 best models have been released
Instructions
- Machine-Translated Alpaca: Stanford Alpaca instructions translated into Kyrgyz using ChatGPT and Google Translate
Machine-readable dictionaries
- Country names table: Kyrgyz-Russian-English
- Thesaurus KyrSpell (note, however, that unpacking it appears to violate the license)
- Tatu Ylonen's enwiktionary-based dictionary (also please see the derived En-Ky Anki deck for language learners)
Pretrained models
- Polyglot morfessor — pretrained morfessor model, number 6
- fastText — 300-dimensional fastText vectors provided by the authors: bin, txt.
- compressed fastText — fasttext-ky-mini prepared by Liebl Bernhard in 2021.
- fastText trained on Leipzig Corpora — best-performing 100/300-dimensional fastText vectors provided by the authors of the HJ-Ky-0.1 paper.
- fastText from Kuriyozov et al.'2020 — trained on SketchEngine's KyWaC
- BERT-based NER — bert-base-multilingual-cased fine-tuned on Wikiann for NER on Kyrgyz. The author warns that this model is not usable and is built just as a proof of concept. Will be updated later.
- Manas-GPT — Janar Osmonaliev's fun personal project: training nanoGPT on Sayakbai Karalaev's version of Epic of Manas
- kyrgyz-tokenizers-collection — pre-trained subword tokenizers for Kyrgyz (by @metinovadilet)
- KyrgyzBert — BERT (6 encoders, 8 heads, hidden dim 512) trained on Kyrgyz texts (data is not available) from scratch (by @metinovadilet)
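The txt variants of fastText vectors mentioned above use a simple text layout: a header line with the vocabulary size and dimensionality, then one word per line followed by its space-separated values. Below is a minimal sketch of reading such a file and comparing two words by cosine similarity, using only the standard library. The three-word sample and its values are made up for the demo; the helper names are hypothetical.

```python
import io
import math


def load_vec(stream, limit=None):
    """Read word vectors in the fastText text format.

    Returns (dim, {word: [floats]}). `limit` caps the number of lines read,
    which is handy because the full files are large.
    """
    header = stream.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dim>"
    vectors = {}
    for i, line in enumerate(stream):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(v) for v in values]
    return dim, vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Tiny MADE-UP sample in the same layout (values are not real embeddings).
sample = io.StringIO(
    "3 2\n"
    "китеп 1.0 0.0\n"
    "китепкана 0.8 0.6\n"
    "суу 0.0 1.0\n"
)
dim, vectors = load_vec(sample)
similarity = cosine(vectors["китеп"], vectors["китепкана"])
```

For the bin files, or for out-of-vocabulary words via subword n-grams, the official fastText library would be needed instead of this plain-text reader.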
Methods/Software
- spaCy basic support: tokenization, stopwords, like_num
- stanza-ky pipeline called 'ktmu'; use with care, as its handling of brackets seems very suspicious
- kyrgyz-nlp/disambiguator project studies the ability of popular embedding models to select word senses based on the word hints (anchor words)
Morphology
- Kyrgyz for Apertium: morphological analysis and generation, PoS-tagging; installation script: installapertiumkir.sh. A much, much easier way: import apertium; apertium.installer.install_module("kir").
- [DEPRECATED] kymopl: Kyrgyz morphology in Prolog
Hate Speech detection
Other
- Tilchi electronic Russian-Kyrgyz dictionary, open source desktop application
- ӨҮҢизатор: demo code for a proof-of-concept letter-replacement Telegram bot; it fixes incorrect uses of 'О', 'У', 'Н' in place of 'Ө', 'Ү', 'Ң'
- Number-to-words conversion (JavaScript) by @AzamatSooldaev
- Number-to-words conversion (TypeScript) by @timursaurus
- Telegram bot for Kyrgyz morphological analysis by @sasha-kir based on Apertium data for Kyrgyz
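To give an idea of what the number-to-words converters listed above do, here is a minimal Python sketch for cardinal numbers. It is not a port of either linked library, and its conventions are assumptions: it always spells out "бир" before "жүз" and "миң" (so 100 becomes "бир жүз"), whereas other implementations may omit it, and it only handles integers from 0 up to, but not including, 10^12.

```python
ONES = ["", "бир", "эки", "үч", "төрт", "беш", "алты", "жети", "сегиз", "тогуз"]
TENS = ["", "он", "жыйырма", "отуз", "кырк", "элүү",
        "алтымыш", "жетимиш", "сексен", "токсон"]
SCALES = [(1_000_000_000, "миллиард"), (1_000_000, "миллион"), (1_000, "миң")]


def _below_thousand(n):
    """Return the words for 0 <= n < 1000 as a list (empty for 0)."""
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        # ASSUMPTION: the multiplier is always explicit, e.g. "бир жүз".
        words += [ONES[hundreds], "жүз"]
    tens, ones = divmod(rest, 10)
    if tens:
        words.append(TENS[tens])
    if ones:
        words.append(ONES[ones])
    return words


def number_to_kyrgyz(n):
    """Spell out a non-negative integer below 10**12 in Kyrgyz."""
    if not 0 <= n < 10**12:
        raise ValueError("supported range is 0 <= n < 10**12")
    if n == 0:
        return "нөл"
    words = []
    for value, name in SCALES:
        count, n = divmod(n, value)
        if count:
            words += _below_thousand(count) + [name]
    words += _below_thousand(n)
    return " ".join(words)


examples = {n: number_to_kyrgyz(n) for n in (0, 21, 105, 2023)}
```

Ordinals, decimals and case suffixes are outside the scope of this sketch.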
Online Demos
Miscellaneous
- Kyrgyz NLP bibliography: kyrgyznlp.github.io
- Turkic Interlingua community and SIGTURK (ACL Turkic languages special interest group)
- Apertium's useful list of tools and other resources
- Online dictionaries and other useful resources on el-sozduk.kg
- Turkic languages-related resources compiled by Dr. Gülşen Eryiğit and her team at Istanbul Technical University
- Data prepared by CSLT: 128h of speech, 163 speakers (100m/63f), transcriptions of the speech audio, word-level lexicon; link (requires extra steps, quote: "You should ask for license before you can download the datasets. Please send Email to shiying@cslt.org or lilt@cslt.org to get the license."). Note that the original texts in Kyrgyz are not available; only the phonetic transcription is shared.
Contributions to this list
Owner
- Name: Anton Alekseev
- Login: alexeyev
- Kind: user
- Website: https://ai.pdmi.ras.ru/
- Repositories: 52
- Profile: https://github.com/alexeyev
GitHub Events
Total
- Watch event: 5
- Push event: 14
Last Year
- Watch event: 5
- Push event: 14
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Anton Alekseev | a****v@g****m | 54 |
| Timur Turatali | z****o@g****m | 2 |
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: about 3 hours
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0