https://github.com/alexeyev/awesome-kyrgyz-nlp

Kyrgyz language processing software, models and datasets.

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file
  Found codemeta.json file
✓ .zenodo.json file
  Found .zenodo.json file
○ DOI references
✓ Academic publication links
  Links to: arxiv.org, zenodo.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity
  Low similarity (11.9%) to scientific vocabulary

Keywords

awesome-list corpus kyrgyz morphology natural-language-processing turkic turkic-languages

Last synced: 9 months ago

Repository

Kyrgyz language processing software, models and datasets.

Basic Info

Host: GitHub
Owner: alexeyev
Language: Shell
Default Branch: main
Homepage:
Size: 66.4 KB

Statistics

Stars: 30
Watchers: 3
Forks: 4
Open Issues: 1
Releases: 0

Topics

awesome-list corpus kyrgyz morphology natural-language-processing turkic turkic-languages

Created over 3 years ago · Last pushed over 1 year ago

Metadata Files

Readme

README.md

Awesome Kyrgyz NLP

A curated list of awesome Kyrgyz language processing software, models and datasets. Inspired by awesome-ML.

The main focus is on open source tools, downloadable data and research papers with code.

If you want to contribute to this list (please do), send me a pull request. Also, a listed repository should be tagged as deprecated if:

The repository's owners explicitly say that "this library is not maintained".
It has not been committed to for a long time (2-3 years).

Awesome Kyrgyz NLP
- Table of Contents
- Datasets
- Pretrained models
- Methods/Software
  - Morphology
- Online Demos
- Miscellaneous

Datasets

Corpora

Manas-UdS: 1.2M words, 84 literary texts, 5 genres: novel, novelette, epic, minor epic, and fairy tale; lemmata, PoS tags, rich per-text metadata.
kyWaC: Kyrgyz corpus from the web, 19M words, Jan 2012 [not open]
Kyrgyz in Leipzig Corpora Collection: Community data / Newscrawl (1M sentences) / Wikipedia (300K sentences)
TilCorpusu: Kyrgyz corpus, 100M words, news+fiction, made public in July 2023 (just the News part due to legal restrictions)
TurkLang-7: parallel corpora mentioned in the 2020 work "First Results of the 'TurkLang-7' Project: Creating Russian-Turkic Parallel Corpora and MT Systems" by Khusainov, A., Suleymanov, D., Gilmullin, R., Minsafina, A., Kubedinova, L., Abdurakhmonova, N. [status?]

Character recognition

Kyrgyz language hand-written letters (Kyrgyz MNIST): hand-written Kyrgyz alphabet letters collection for machine learning applications; original images (a total of 80213) have been transformed to 50x50 images, then to CSV format
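
The CSV form of the images can be read with the standard library alone. Below is a minimal sketch, assuming each row holds a label in the first column followed by 50x50 = 2500 pixel values in row-major order. The actual column layout is not stated in this list, so check it against the dataset's own description. The function names and the demo row are illustrative only.

```python
import csv
import io

SIDE = 50  # the list entry says images were resized to 50x50


def row_to_image(row, side=SIDE, has_label=True):
    """Split one CSV row into (label, image); image is a list of `side` pixel rows.

    ASSUMPTION: label first, then side*side pixel values in row-major order.
    """
    label = row[0] if has_label else None
    values = row[1:] if has_label else row
    pixels = [int(float(v)) for v in values]
    if len(pixels) != side * side:
        raise ValueError(f"expected {side * side} pixels, got {len(pixels)}")
    image = [pixels[i * side:(i + 1) * side] for i in range(side)]
    return label, image


def read_images(fileobj, side=SIDE, has_label=True, has_header=False):
    """Yield (label, image) pairs from an open CSV file object."""
    reader = csv.reader(fileobj)
    if has_header:
        next(reader, None)  # skip the header line
    for row in reader:
        if row:  # ignore blank lines
            yield row_to_image(row, side, has_label)


# Synthetic demo row (not real data): label "7" followed by 2500 pixel values.
demo = io.StringIO("7," + ",".join(str(i % 256) for i in range(SIDE * SIDE)) + "\n")
label, image = next(read_images(demo))
print(label, len(image), len(image[0]))
```

For the real file, pass an object opened with `open(path, newline="")` and set `has_header` and `has_label` to match what the file actually contains.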

Raw text

kloop corpus: 16'826 articles (sqlite3 DB file) + crawler code
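
Since the articles ship as a sqlite3 DB file and its schema is not described here, a reasonable first step is to list the tables and columns before querying. The sketch below does that with Python's built-in `sqlite3` module. The `articles` table in the demo is hypothetical and only stands in for whatever the real file contains.

```python
import sqlite3


def describe_sqlite(conn):
    """Return {table_name: [column names]} for every table in the database."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    schema = {}
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [col[1] for col in columns]
    return schema


# Demo on an in-memory database with a HYPOTHETICAL table layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
schema = describe_sqlite(conn)
print(schema)
```

To inspect the downloaded corpus, replace `":memory:"` with the path to the DB file.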

Morphology & Syntax

UD project comments on difficulties in Turkish language processing, which may shed light on why parsing Kyrgyz is hard as well
KTMU's UD Treebank, 781 sentences; UPD: now even more sentences! + some fixes in the previous version of the dataset
Small UD Treebank: 145 sentences (incl. 20 Cairo sentences), and ~ 100 sentences suggested by UD Turkic Group; a part of UD Turkic Treebank; also note that the translations to English, Azerbaijani and Turkish are available
Verbal paradigms for Kyrgyz (100 Kyrgyz verbs fully conjugated in all tenses) by Aytnatova Alima, annotation for Unimorph by E. Chodroff
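
UD treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, `#` comment lines, and a blank line between sentences. The following is a minimal reader sketch using only the standard library. It skips multiword-token ranges and empty nodes for simplicity. The sample sentence and its annotation are hand-written for illustration and are not taken from either treebank.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines):
    """Yield sentences as lists of token dicts from an iterable of CoNLL-U lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # blank line closes a sentence
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):        # sentence-level comment / metadata
            continue
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            if "-" in cols[0] or "." in cols[0]:
                continue                  # multiword-token range or empty node
            sentence.append(dict(zip(FIELDS, cols)))
    if sentence:                          # file may lack a trailing blank line
        yield sentence


# Hand-written illustrative sample ("I read a book."), NOT real treebank data.
SAMPLE = "\n".join([
    "# text = Мен китеп окудум.",
    "\t".join(["1", "Мен", "мен", "PRON", "_", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "китеп", "китеп", "NOUN", "_", "_", "3", "obj", "_", "_"]),
    "\t".join(["3", "окудум", "оку", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"]),
    "",
])

sentences = list(read_conllu(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"], token["deprel"])
```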

Named Entity Recognition

WikiANN has a Kyrgyz language part
KyrgyzNER: [not published yet]

Text Classification

Kyrgyz Multi-Label News Classification: training and evaluation code as well as the dataset of 1000/500 news documents are available

Word Similarity Data

Kyrgyz Word Embedding Evaluation: [not published yet]; the 2 best models have been released

Instructions

Machine-Translated Alpaca: Stanford Alpaca instructions translated into Kyrgyz using ChatGPT and Google Translate

Machine-readable dictionaries

Country names table: Kyrgyz-Russian-English
Thesaurus KyrSpell (however, unpacking it seems to violate the license)
Tatu Ylonen's enwiktionary-based dictionary (also please see the derived En-Ky Anki deck for language learners)

Pretrained models

Polyglot morfessor — pretrained morfessor model, number 6
fastText — 300-dimensional fastText vectors provided by the authors: bin, txt.
compressed fastText — fasttext-ky-mini prepared by Liebl Bernhard in 2021.
fastText trained on Leipzig Corpora — best-performing 100/300-dimensional fastText vectors provided by the authors of the HJ-Ky-0.1 paper.
fastText from Kuriyozov et al.'2020 — trained on SketchEngine's KyWaC
BERT-based NER — bert-base-multilingual-cased fine-tuned on Wikiann for NER on Kyrgyz. The author warns that this model is not usable and is built just as a proof of concept. Will be updated later.
Manas-GPT — Janar Osmonaliev's fun personal project: training nanoGPT on Sayakbai Karalaev's version of Epic of Manas
kyrgyz-tokenizers-collection — pre-trained subword tokenizers for Kyrgyz (by @metinovadilet)
KyrgyzBert — BERT (6 encoders, 8 heads, hidden dim 512) trained on Kyrgyz texts (data is not available) from scratch (by @metinovadilet)
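
The txt variant of the fastText vectors is a plain-text file: a header line with the vocabulary size and the dimensionality, then one line per word with space-separated values. For a quick look without extra dependencies, a small loader and a cosine similarity function are enough. This is a sketch using toy 3-dimensional vectors invented for the demo. Real files are large, so the hypothetical `limit` parameter caps how many lines are read.

```python
import io
import math


def load_vec(fileobj, limit=None):
    """Read fastText text-format vectors; return (dim, {word: [floats]})."""
    header = fileobj.readline().split()
    dim = int(header[1])                 # header is "<vocab size> <dim>"
    vectors = {}
    for i, line in enumerate(fileobj):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:           # skip malformed lines
            continue
        vectors[word] = [float(v) for v in values]
    return dim, vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy vectors invented for the demo, NOT taken from any released model.
TOY = io.StringIO(
    "3 3\n"
    "китеп 1.0 0.0 0.0\n"
    "китептер 0.9 0.1 0.0\n"
    "суу 0.0 0.0 1.0\n"
)
dim, vectors = load_vec(TOY)
print(dim, sorted(vectors))
print(cosine(vectors["китеп"], vectors["китептер"]) > cosine(vectors["китеп"], vectors["суу"]))
```

For the bin files, or for out-of-vocabulary words via subword information, the official fastText bindings are the appropriate tool; this loader only covers the text format.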

Methods/Software

spaCy basic support: tokenization, stopwords, like_num
stanza-ky pipeline called 'ktmu'; use with care, as it seems to handle brackets in a very suspicious way
kyrgyz-nlp/disambiguator project studies the ability of popular embedding models to select word senses based on the word hints (anchor words)

Morphology

Kyrgyz for Apertium: morphological analysis and generation, PoS-tagging; installation script: installapertiumkir.sh. A much, much easier way: import apertium; apertium.installer.install_module("kir").
[DEPRECATED] kymopl: Kyrgyz morphology in Prolog
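
Apertium analyzers emit text in the Apertium stream format, where each lexical unit looks like `^surface/reading1/reading2$`, each reading is a lemma followed by tags in angle brackets, and unknown words are marked with `*`. The sketch below turns such a string into plain Python dictionaries. It is a simplification: it ignores escaped characters and treats readings joined with `+` as a single reading. The sample string is hand-written for illustration and is not actual output of the Kyrgyz analyzer.

```python
import re

UNIT_RE = re.compile(r"\^(.*?)\$")   # one lexical unit: ^...$
TAG_RE = re.compile(r"<([^>]+)>")    # one tag: <...>


def parse_stream(text):
    """Parse Apertium stream-format text into a list of lexical units."""
    units = []
    for body in UNIT_RE.findall(text):
        surface, *readings = body.split("/")
        parsed = []
        for reading in readings:
            if reading.startswith("*"):
                # Unknown word: the analyzer had no entry for it.
                parsed.append({"lemma": reading[1:], "tags": [], "known": False})
            else:
                lemma = reading.split("<", 1)[0]
                parsed.append({"lemma": lemma,
                               "tags": TAG_RE.findall(reading),
                               "known": True})
        units.append({"surface": surface, "readings": parsed})
    return units


# Hand-written illustrative input, NOT real analyzer output.
SAMPLE = "^үйлөр/үй<n><pl><nom>$ ^xyz/*xyz$"
units = parse_stream(SAMPLE)
for unit in units:
    print(unit["surface"], unit["readings"])
```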

Hate Speech detection

Jupyter Notebook for hate speech detection

Other

Tilchi electronic Russian-Kyrgyz dictionary, open source desktop application
ӨҮҢизатор: a proof-of-concept letter replacement Telegram bot demo code, fixes incorrect usages of 'О','У', 'Н' => 'Ө', 'Ү','Ң'
Number-to-words conversion (JavaScript) by @AzamatSooldaev
Number-to-words conversion (TypeScript) by @timursaurus
Telegram bot for Kyrgyz morphological analysis by @sasha-kir based on Apertium data for Kyrgyz
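
To give an idea of what the number-to-words converters listed above do, here is an independent minimal sketch in Python for integers from 0 to 999. It is not derived from either listed project and makes no claim about how they work. Rendering 100 as "жүз" rather than "бир жүз" is a choice made for this sketch.

```python
ONES = ["", "бир", "эки", "үч", "төрт", "беш", "алты", "жети", "сегиз", "тогуз"]
TENS = ["", "он", "жыйырма", "отуз", "кырк", "элүү",
        "алтымыш", "жетимиш", "сексен", "токсон"]
HUNDRED = "жүз"
ZERO = "нөл"


def number_to_words(n):
    """Spell out an integer in the range 0..999 in Kyrgyz."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only supports 0..999")
    if n == 0:
        return ZERO
    hundreds, rest = divmod(n, 100)
    tens, ones = divmod(rest, 10)
    parts = []
    if hundreds:
        if hundreds > 1:          # sketch choice: plain "жүз" for one hundred
            parts.append(ONES[hundreds])
        parts.append(HUNDRED)
    if tens:
        parts.append(TENS[tens])
    if ones:
        parts.append(ONES[ones])
    return " ".join(parts)


for n in (0, 7, 21, 100, 345):
    print(n, number_to_words(n))
```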

Online Demos

Cyrillic-to-Latin online converter based on this resource.

Miscellaneous

Kyrgyz NLP bibliography: kyrgyznlp.github.io
Turkic Interlingua community and SIGTURK (ACL Turkic languages special interest group)
Apertium's useful list of tools and other resources
Online dictionaries and other useful resources on el-sozduk.kg
Turkic languages-related resources compiled by Dr. Gülşen Eryiğit and her team at Istanbul Technical University
Data prepared by CSLT: 128h speech, 163 speakers (100m/63f), transcription of the speech audio, word-level lexicon; link (requires extra steps, quote: "You should ask for license before you can download the datasets. Please send Email to shiying@cslt.org or lilt@cslt.org to get the license."). Note that the original texts in Kyrgyz are not available; only the phonetic transcription is shared.

Contributions to this list

@golden-ratio

Owner

Name: Anton Alekseev
Login: alexeyev
Kind: user

Website: https://ai.pdmi.ras.ru/
Repositories: 52
Profile: https://github.com/alexeyev

GitHub Events

Total

Watch event: 5
Push event: 14

Last Year

Watch event: 5
Push event: 14

Committers

Last synced: 12 months ago

All Time

Total Commits: 56
Total Committers: 2
Avg Commits per committer: 28.0
Development Distribution Score (DDS): 0.036

Past Year

Commits: 16
Committers: 1
Avg Commits per committer: 16.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Anton Alekseev	a**v@g**m	54
Timur Turatali	z**o@g**m	2

Issues and Pull Requests

Last synced: 12 months ago

All Time

Total issues: 0
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: about 3 hours
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
