https://github.com/alexeyev/awesome-azerbaijani-nlp
Azerbaijani language processing software, models and datasets.
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: researchgate.net, academia.edu
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low (10.2%)
Statistics
- Stars: 30
- Watchers: 4
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Awesome Azeri NLP 
A curated list of awesome Azerbaijani language processing software, models and datasets. Inspired by awesome-ML.
The main focus is on open source tools, downloadable data and research papers with code.
If you want to contribute to this list (please do), send me a pull request. A listed repository should be tagged as deprecated if:
- The repository's owners explicitly say that "this library is not maintained".
- It has had no commits for a long time (2 to 3 years).
Table of Contents
- Awesome Azeri NLP
- Table of Contents
- Datasets
- Pretrained models
- Methods/Software
- Online Demos
- Miscellaneous
Datasets
Raw text
- University of Leipzig corpus collection — Newscrawl (2011, 2013) and Wikipedia (misc) datasets
- Helsinki University corpus — New Testament in the Azerbaijani language
- Latest azwiki dump: download directly
- Azeri at An Crúbadán — 8M+ words, Latin script
- az-corpus-nlp — 2000+ texts, Latin script
- azWaC: Azerbaijani corpus from the web — SketchEngine-hosted corpus crawled from the web in 2012, ~94 million words
- Domrachev-Sudoplatova scraped corpus — 2189398 words, 100560 sentences
- Azerbaijani Named Entity Recognition (NER) Dataset — A dataset for training and evaluating NER models in Azerbaijani, including annotated text data with various named entities.
Several corpora are also mentioned in research works:
- S. Mammadova, G. Azimova, and A. Fatullayev. 2010. Text corpora and its role in development of the linguistic technologies for the Azerbaijani language. In The Third International Conference Problems of Cybernetics and Informatics.
- Baisa, Vít, and Vít Suchomel. "Large corpora for Turkic languages and unsupervised morphological analysis." Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA). 2012. [SketchEngine corpora?]
- C. Biemann, S. Bordag, G. Heyer, U. Quasthoff, and C. Wolff. 2004. Language-independent methods for compiling monolingual lexical data. Computational Linguistics and Intelligent Text Processing, pages 217–228.
- Domrachev M. A., Sudoplatova S. N. Testing Methods for Automatic Detection of Morpheme Boundaries in the Azerbaijani Language. Vestnik NSU. Series: Linguistics and Intercultural Communication, 2018, vol. 16, no. 2, p. 34–47. (in Russ.) Downloadable corpus
- Özenç B., Ehsani R., Solak E. Moraz: an open-source morphological analyzer for Azerbaijani Turkish. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2018. pp. 25-29. [BBC Azerbaijan]
Syntax
- UD_Azerbaijani-TueCL: a treebank that contains a total of ~110 sentences including 20 Cairo sentences, and ~90 sentences suggested by UD Turkic Group; part of the UD Turkic Treebank. Translations of all the sentences are available in English, Turkish and Kyrgyz languages
- UD project comments on difficulties in Turkish language processing, which may shed light on why parsing Azeri is hard as well
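Universal Dependencies treebanks such as UD_Azerbaijani-TueCL are distributed as CoNLL-U files: one token per line with ten tab-separated columns, `#` comment lines, and a blank line after each sentence. A minimal standard-library reader might look like the sketch below. The sample sentence and its annotations are hand-made for illustration only and are not taken from the treebank; `read_conllu` is a hypothetical helper name.

```python
import io
from typing import Dict, Iterator, List, TextIO

# The ten CoNLL-U column names, in file order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(handle: TextIO) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in handle:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
        sentence.append(dict(zip(FIELDS, columns)))
    if sentence:
        # The file may end without a trailing blank line.
        yield sentence


# Hand-made illustrative sample, NOT copied from UD_Azerbaijani-TueCL.
SAMPLE = (
    "# text = Mən kitab oxuyuram.\n"
    "1\tMən\t_\tPRON\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tkitab\t_\tNOUN\t_\t_\t3\tobj\t_\t_\n"
    "3\toxuyuram\t_\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t_\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(io.StringIO(SAMPLE)))
for token in sentences[0]:
    print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```

To read a real treebank file, pass an open file handle (for example `open(path, encoding="utf-8")`) instead of the in-memory sample.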
Machine-readable dictionaries
TODO
Summarization
- AZ summarization — articles and titles, available on request
Translation
- AZ-EN parallel corpus — 68K+ sentences, available on request
Sentiment
Mentioned in:
- N. Gasimli's MS thesis "Analysis of the use of Twitter in Azerbaijan" — 2194+700 tweets
- Mammad Hajili's 160K customer reviews with scores and upvotes
Pretrained models
- Polyglot morfessor — pretrained morfessor model, number 53
- fastText — 300-dimensional fastText vectors provided by the authors
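Pretrained fastText vectors are commonly shipped in a plain-text `.vec` format: the first line holds the vocabulary size and the dimensionality, and each following line holds a word followed by its vector components. The sketch below reads that format with the standard library only and compares two words by cosine similarity. The toy 3-dimensional data stands in for the real 300-dimensional file, and `read_vec` and `cosine` are hypothetical helper names, not part of the fastText API.

```python
import io
import math
from typing import Dict, List, Optional, TextIO


def read_vec(handle: TextIO, limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Read word vectors from a fastText-style .vec text stream."""
    header = handle.readline().split()
    dim = int(header[1])  # header[0] is the vocabulary size
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        # Split from the right so exactly `dim` numbers are separated from the word.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy 3-dimensional data with made-up numbers, for illustration only.
TOY_VEC = (
    "2 3\n"
    "kitab 1.0 0.0 0.0\n"
    "dəftər 0.6 0.8 0.0\n"
)

vectors = read_vec(io.StringIO(TOY_VEC))
similarity = cosine(vectors["kitab"], vectors["dəftər"])
print(f"{similarity:.2f}")
```

For a downloaded file, open it with `encoding="utf-8"` and consider passing `limit` to load only the most frequent words, since the full vocabulary may not fit comfortably in memory as Python lists.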
Methods/Software
Morphology
- Azmorph — morphological analyzer for Azerbaijani (Azerbaycan dili), said to be in a pre-alpha state; however, it was used for web corpora preparation
- Wiktionary word forms extraction — Python code on github
- MorAz — open-source morph. analyzer, paper, demo, related slides on AZ morphology.
Mentioned in papers:
- POS-tagging paper — Mammadov, S., Rustamov, S., Mustafali, A., Sadigov, Z., Mollayev, R., & Mammadov, Z. (2018, October). Part-of-Speech Tagging for Azerbaijani Language. In 2018 IEEE 12th International Conference on Application of Information and Communication Technologies (AICT) (pp. 1-6). IEEE. [Probable implementation: aznlp repo]
- Stemming paper, 2019 — Alizadeh, M. B. H., & Seyyedi, S. A. H. (2019). Auto Stemming of Azerbaijani Language. Problems of Information Technology, 59-66.
- N. Gasimli's MS thesis "Analysis of the use of Twitter in Azerbaijan" — Zemberek is extended for Azerbaijani, though it is stated that a lot of effort is still required for it to work properly for the Azeri language.
Syntax
- TODO
Online Demos
- Cyrillic ⇄ Latin conversion — PHP-based online tool
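How the online tool works internally is not documented here, so the following is only a sketch of one way to do such a conversion. It assumes the one-to-one letter correspondence between the Azerbaijani Cyrillic alphabet (as used from 1958 to 1991) and the current Latin alphabet. The names `to_latin` and `to_cyrillic` are hypothetical, and characters outside the two alphabets (digits, punctuation, the apostrophe) pass through unchanged.

```python
# Letters of the Azerbaijani Cyrillic alphabet and their Latin counterparts,
# aligned position by position (assumed one-to-one correspondence).
CYRILLIC = (
    "абвгғдеәжзиыјкҝлмноөпрстуүфхһчҹш"
    "АБВГҒДЕӘЖЗИЫЈКҜЛМНОӨПРСТУҮФХҺЧҸШ"
)
LATIN = (
    "abvqğdeəjziıykglmnoöprstuüfxhçcş"
    "ABVQĞDEƏJZİIYKGLMNOÖPRSTUÜFXHÇCŞ"
)

# Upper-case letters are listed explicitly because generic case conversion
# does not pair the dotted and dotless forms (i/İ and ı/I) correctly.
TO_LATIN = str.maketrans(CYRILLIC, LATIN)
TO_CYRILLIC = str.maketrans(LATIN, CYRILLIC)


def to_latin(text: str) -> str:
    """Convert Azerbaijani text from Cyrillic to Latin script."""
    return text.translate(TO_LATIN)


def to_cyrillic(text: str) -> str:
    """Convert Azerbaijani text from Latin to Cyrillic script."""
    return text.translate(TO_CYRILLIC)


latin_text = to_latin("Азәрбајҹан дили")
print(latin_text)
print(to_cyrillic(latin_text))
```

A character-level table like this cannot handle loanword spellings or mixed-script input; a production converter would likely need extra rules on top of it.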
Miscellaneous
- Turkic languages-related resources compiled by Dr. Gülşen Eryiğit and her team at Istanbul Technical University
- Azerbaijani corpora data review
- Dilmanc — government-funded Azerbaijani language-related initiative
- Dilmanc EAMT paper on MT peculiarities
- Apertium page — a list of various online language-related resources
- AZNLP github — a repo hub with various language-related software: stemmer, POS-tagger
- MozillaAZ community spellchecker — spellchecker plugin
Owner
- Name: Anton Alekseev
- Login: alexeyev
- Kind: user
- Website: https://ai.pdmi.ras.ru/
- Repositories: 52
- Profile: https://github.com/alexeyev
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Anton Alekseev | a****v@g****m | 41 |
| Ismat | i****v@g****m | 2 |
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 0
- Total pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 10 minutes
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.5
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 10 minutes
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.5
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- Ismat-Samadov (2)