https://github.com/alexeyev/awesome-kyrgyz-nlp

Kyrgyz language processing software, models and datasets.

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org, zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.9%) to scientific vocabulary

Keywords

awesome-list corpus kyrgyz morphology natural-language-processing turkic turkic-languages
Last synced: 5 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: alexeyev
  • Language: Shell
  • Default Branch: main
  • Homepage:
  • Size: 66.4 KB
Statistics
  • Stars: 30
  • Watchers: 3
  • Forks: 4
  • Open Issues: 1
  • Releases: 0
Created over 3 years ago · Last pushed 12 months ago
Metadata Files
Readme

README.md

Awesome Kyrgyz NLP

A curated list of awesome Kyrgyz language processing software, models and datasets. Inspired by awesome-ML.

The main focus is on open source tools, downloadable data and research papers with code.

If you want to contribute to this list (please do), send me a pull request. A listed repository should be tagged as deprecated if:

  • The repository's owners explicitly say that "this library is not maintained".
  • The repository has had no commits for a long time (2-3 years).

Datasets

Corpora

  • Manas-UdS: 1.2M words, 84 literary texts, 5 genres: novel, novelette, epic, minor epic, and fairy tale; lemmata, PoS tags, rich per-text metadata.
  • kyWaC: Kyrgyz corpus from the web, 19M words, Jan 2012 [not open]
  • Kyrgyz in Leipzig Corpora Collection: Community data / Newscrawl (1M sentences) / Wikipedia (300K sentences)
  • TilCorpusu: Kyrgyz corpus, 100M words, news + fiction; made public in July 2023 (only the news part, due to legal restrictions)
  • TurkLang-7: parallel corpora mentioned in the 2020 work 'First Results of the "TurkLang-7" Project: Creating Russian-Turkic Parallel Corpora and MT Systems' by Khusainov, A., Suleymanov, D., Gilmullin, R., Minsafina, A., Kubedinova, L., Abdurakhmonova, N. [status?]

Character recognition

Raw text

  • kloop corpus: 16,826 articles (sqlite3 DB file) + crawler code
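
Since the kloop corpus ships as a sqlite3 DB file, a first step is usually to see which tables it contains and how many rows each holds. The sketch below does only that with Python's standard sqlite3 module. The schema of the kloop database is not described in this list, so the code assumes nothing about table or column names, and the file name in the usage note is hypothetical.

```python
import sqlite3
from typing import Dict


def summarize_sqlite(db_path: str) -> Dict[str, int]:
    """Return {table_name: row_count} for every user table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master lists all schema objects; keep only ordinary tables.
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
            if not row[0].startswith("sqlite_")  # skip SQLite's internal tables
        ]
        counts = {}
        for name in names:
            # Table names cannot be bound as parameters, so quote them instead.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        conn.close()
```

Calling summarize_sqlite("kloop.db") (hypothetical path) on the downloaded file would show which table holds the articles before any query is written against it. Note that sqlite3.connect creates an empty database if the path does not exist, so check the path first.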

Morphology & Syntax

Named Entity Recognition

Text Classification

Word Similarity Data

Instructions

Machine-readable dictionaries

Pretrained models

Methods/Software

  • spaCy basic support: tokenization, stopwords, like_num
  • stanza-ky pipeline called 'ktmu'; use with care, as its handling of brackets looks unreliable
  • The kyrgyz-nlp/disambiguator project studies how well popular embedding models select word senses given hint words (anchor words)
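
To illustrate the spaCy entry above, here is a minimal sketch of what "basic support" looks like in practice, assuming a spaCy 3.x installation that includes the Kyrgyz language data under the code "ky". It uses a blank pipeline, so no trained model is involved. The example sentence is an illustration, not taken from any listed resource.

```python
import spacy

# spacy.blank("ky") builds a tokenizer-only pipeline from spaCy's built-in
# Kyrgyz language data; no trained model is downloaded or required.
nlp = spacy.blank("ky")

# Illustrative sentence (an assumption, not taken from the list).
doc = nlp("Мен 5 китеп окудум.")

for token in doc:
    # is_stop consults the bundled stopword list;
    # like_num flags tokens that look like numbers.
    print(token.text, token.is_stop, token.like_num)

# The stopword list itself is exposed as a set on the language defaults.
print(len(nlp.Defaults.stop_words), "stopwords bundled")
```

Anything beyond these lexical attributes (PoS tags, lemmas, dependencies) needs a trained pipeline, which is outside what this entry covers.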

Morphology

Hate Speech detection

Other

Online Demos

Miscellaneous

  • Kyrgyz NLP bibliography: kyrgyznlp.github.io
  • Turkic Interlingua community and SIGTURK (ACL Turkic languages special interest group)
  • Apertium's useful list of tools and other resources
  • Online dictionaries and other useful resources on el-sozduk.kg
  • Turkic languages-related resources compiled by Dr. Gülşen Eryiğit and her team at Istanbul Technical University
  • Data prepared by CSLT: 128 hours of speech from 163 speakers (100 male, 63 female), transcriptions of the speech audio, and a word-level lexicon; link (requires extra steps; quote: "You should ask for license before you can download the datasets. Please send Email to shiying@cslt.org or lilt@cslt.org to get the license."). Note that the original Kyrgyz texts are not available; only the phonetic transcription is shared.

Contributions to this list

@golden-ratio

Owner

  • Name: Anton Alekseev
  • Login: alexeyev
  • Kind: user

GitHub Events

Total
  • Watch event: 5
  • Push event: 14
Last Year
  • Watch event: 5
  • Push event: 14

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 56
  • Total Committers: 2
  • Avg Commits per committer: 28.0
  • Development Distribution Score (DDS): 0.036
Past Year
  • Commits: 16
  • Committers: 1
  • Avg Commits per committer: 16.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Anton Alekseev a****v@g****m 54
Timur Turatali z****o@g****m 2
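
The committer figures above can be reproduced from the per-author commit counts. The sketch below assumes the usual definition of the Development Distribution Score, one minus the top committer's share of all commits; this page does not state the formula, but that definition matches both reported values (0.036 all time, 0.0 for the past year).

```python
from typing import Mapping


def dds(commits_by_author: Mapping[str, int]) -> float:
    """Development Distribution Score: 1 - (top committer's commits / total).

    Assumed definition; 0.0 means a single author made every commit.
    """
    total = sum(commits_by_author.values())
    if total == 0:
        return 0.0
    return 1 - max(commits_by_author.values()) / total


# Per-author counts taken from the Top Committers table above.
all_time = {"Anton Alekseev": 54, "Timur Turatali": 2}

print(round(dds(all_time), 3))                 # 0.036
print(sum(all_time.values()) / len(all_time))  # 28.0 average commits per committer
print(dds({"sole committer": 16}))             # 0.0, as in the Past Year block
```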

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: about 3 hours
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0