speech-recognition-uk

🇺🇦 Speech Recognition & Synthesis for Ukrainian

https://github.com/egorsmkv/speech-recognition-uk

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (4.8%) to scientific vocabulary

Keywords

speech speech-recognition speech-synthesis speech-to-text text-to-speech tts ukrainian
Last synced: 6 months ago

Repository

Statistics
  • Stars: 388
  • Watchers: 20
  • Forks: 21
  • Open Issues: 9
  • Releases: 1
Created over 5 years ago · Last pushed 9 months ago
Metadata Files
Readme Funding Citation

README.md

🇺🇦 Speech Recognition & Synthesis for Ukrainian

Overview

This repository collects links to models, datasets, and tools for Ukrainian Speech-to-Text and Text-to-Speech.

Speech-UK initiative

Our datasets, models, and leaderboards are on Hugging Face:

  • https://huggingface.co/speech-uk

Community

  • Discord: https://bit.ly/discord-uds
  • Telegram (Speech Recognition): https://t.me/speechrecognitionuk
  • Telegram (Speech Synthesis): https://t.me/speechsynthesisuk

🎤 Speech-to-Text

📦 Implementations

wav2vec2-bert

- 600M params: https://huggingface.co/Yehor/w2v-bert-uk-v2.1 (demo: https://huggingface.co/spaces/Yehor/w2v-bert-uk-v2.1-demo)
- 600M params: https://huggingface.co/Yehor/w2v-bert-uk (demo: https://huggingface.co/spaces/Yehor/w2v-bert-uk-demo)

wav2vec2

- 300M params (with language model based on Wikipedia texts): https://huggingface.co/Yehor/w2v-xls-r-uk
- 300M params: https://huggingface.co/robinhad/wav2vec2-xls-r-300m-uk
- 1B params: https://huggingface.co/arampacha/wav2vec2-xls-r-1b-uk

You can check demos out here: https://github.com/egorsmkv/wav2vec2-uk-demo

HuBERT

- hubert-uk: https://huggingface.co/Yehor/hubert-uk

Citrinet

- NVIDIA Streaming Citrinet 1024 (uk): https://huggingface.co/nvidia/stt_uk_citrinet_1024_gamma_0_25
- NVIDIA Streaming Citrinet 512 (uk): https://huggingface.co/neongeckocom/stt_uk_citrinet_512_gamma_0_25

ContextNet

- NVIDIA Streaming ContextNet 512 (uk): https://huggingface.co/theodotus/stt_uk_contextnet_512

FastConformer

- FastConformer Hybrid Transducer-CTC Large P&C: https://huggingface.co/nvidia/stt_ua_fastconformer_hybrid_large_pc
- FastConformer Hybrid Transducer-CTC Large P&C: https://huggingface.co/theodotus/stt_ua_fastconformer_hybrid_large_pc
  - Demo: https://huggingface.co/spaces/theodotus/asr-uk-punctuation-capitalization

Squeezeformer

- Squeezeformer-CTC ML: https://huggingface.co/theodotus/stt_uk_squeezeformer_ctc_ml
  - Demo 1: https://huggingface.co/spaces/theodotus/streaming-asr-uk
  - Demo 2: https://huggingface.co/spaces/theodotus/buffered-asr-uk
- Squeezeformer-CTC SM: https://huggingface.co/theodotus/stt_uk_squeezeformer_ctc_sm
- Squeezeformer-CTC XS: https://huggingface.co/theodotus/stt_uk_squeezeformer_ctc_xs

Conformer-CTC

- https://huggingface.co/taras-sereda/uk-pods-conformer

Whisper

- official whisper: https://github.com/openai/whisper
- whisper (small, fine-tuned for Ukrainian): https://github.com/egorsmkv/whisper-ukrainian
- whisper (large, fine-tuned for Ukrainian): https://huggingface.co/arampacha/whisper-large-uk-2
- https://huggingface.co/mitchelldehaven/whisper-medium-uk
- https://huggingface.co/mitchelldehaven/whisper-large-v2-uk

Quantized variants:

- https://huggingface.co/Yehor/whisper-large-v2-quantized-uk
- https://huggingface.co/Yehor/whisper-large-v3-turbo-quantized-uk

Lite Whisper:

- https://huggingface.co/collections/efficient-speech/lite-whisper-67c0fa0e01cef6d4b9a1ab5d

OWSM, OWSM-CTC, and OWLS

- https://huggingface.co/espnet/owsm_v3.2
- https://huggingface.co/espnet/owsm_ctc_v3.2_ft_1B
- https://huggingface.co/espnet/owls_025B_180K

Flashlight

- Flashlight Conformer: https://huggingface.co/Yehor/flashlight-uk

MMS

- mms-1b-fl102: https://huggingface.co/facebook/mms-1b-fl102

data2vec

- data2vec-large: https://huggingface.co/robinhad/data2vec-large-uk

VOSK

Models: https://huggingface.co/Yehor/vosk-uk

DeepSpeech

- [DeepSpeech](https://github.com/mozilla/DeepSpeech) using transfer learning from English model: https://github.com/robinhad/voice-recognition-ua
  - v0.5: https://github.com/robinhad/voice-recognition-ua/releases/tag/v0.5 (1230+ hours)
  - v0.4: https://github.com/robinhad/voice-recognition-ua/releases/tag/v0.4 (1230 hours)
  - v0.3: https://github.com/robinhad/voice-recognition-ua/releases/tag/v0.3 (751 hours)

M-CTC-T

- m-ctc-t-large: https://huggingface.co/speechbrain/m-ctc-t-large

📊 Benchmarks

These benchmarks use the Common Voice 10 test split.

  • WER: Word Error Rate
  • CER: Character Error Rate
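For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and CER are commonly computed from an edit distance. It is not the script used to produce the numbers below; the helper names (`edit_distance`, `wer`, `cer`) and the example strings are illustrative assumptions, and real evaluations usually normalize case and punctuation first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical example: one wrong character in the last word.
reference = "місто в області"
hypothesis = "місто в облості"
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"CER: {cer(reference, hypothesis):.2%}")
```

One misrecognized character makes a whole word count as wrong, which is why WER is typically several times higher than CER in the tables below.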

wav2vec2-bert

| Model | WER | CER | Accuracy (words) |
|-------|-----|-----|------------------|
| Yehor/w2v-bert-uk (FP16) | 6.6% | 1.34% | 93.4% |
| Yehor/w2v-bert-uk-v2.1 (FP16) | 17.34% | 3.33% | 82.66% |

wav2vec2

| Model | WER | CER | Accuracy (words) |
|-------|-----|-----|------------------|
| Yehor/w2v-xls-r-uk | 20.24% | 3.64% | 79.76% |
| robinhad/wav2vec2-xls-r-300m-uk | 27.36% | 5.37% | 72.64% |
| arampacha/wav2vec2-xls-r-1b-uk | 16.52% | 2.93% | 83.48% |

HuBERT

| Model | WER | CER | Accuracy (words) |
|-------|-----|-----|------------------|
| Yehor/hubert-uk (FP16) | 37.07% | 6.87% | 62.93% |

Citrinet

| Model | WER | CER | Accuracy (words) |
|-------|-----|-----|------------------|
| nvidia/stt_uk_citrinet_1024_gamma_0_25 | 4.32% | 0.94% | 95.68% |
| neongeckocom/stt_uk_citrinet_512_gamma_0_25 | 7.46% | 1.6% | 92.54% |

ContextNet

| Model | WER | CER | Accuracy (words) |
|-------|-----|-----|------------------|
| theodotus/stt_uk_contextnet_512 | 6.69% | 1.45% | 93.31% |

FastConformer P&C

These models output punctuated and capitalized text.

| Model | WER | CER | Accuracy (words) |
|-------|-----|-----|------------------|
| nvidia/stt_ua_fastconformer_hybrid_large_pc | 4.52% | 1% | 95.48% |
| theodotus/stt_ua_fastconformer_hybrid_large_pc | 4% | 1.02% | 96% |

Squeezeformer

| Model | WER | CER | Accuracy (words) |
|-------|-----|-----|------------------|
| theodotus/stt_uk_squeezeformer_ctc_xs | 10.78% | 2.29% | 89.22% |
| theodotus/stt_uk_squeezeformer_ctc_sm | 8.2% | 1.75% | 91.8% |
| theodotus/stt_uk_squeezeformer_ctc_ml | 5.91% | 1.26% | 94.09% |
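The "Accuracy (words)" column in these tables appears to be simply 100% minus WER. The README does not state this, so treat it as an inference from the numbers. A small sketch that checks the relationship on the Squeezeformer rows above (the `word_accuracy` helper is a hypothetical name):

```python
# (model, WER %, reported word accuracy %) copied from the Squeezeformer table.
rows = [
    ("theodotus/stt_uk_squeezeformer_ctc_xs", 10.78, 89.22),
    ("theodotus/stt_uk_squeezeformer_ctc_sm", 8.2, 91.8),
    ("theodotus/stt_uk_squeezeformer_ctc_ml", 5.91, 94.09),
]


def word_accuracy(wer_percent):
    """Assumed definition: word accuracy is the complement of WER."""
    return 100.0 - wer_percent


# Collect rows where the reported accuracy differs from 100 - WER
# by more than rounding error.
mismatches = [
    name for name, wer_percent, reported in rows
    if abs(word_accuracy(wer_percent) - reported) > 0.005
]
print(mismatches)
```

An empty list means every reported accuracy matches the assumed definition to two decimal places.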

Conformer-CTC

| Model | WER | CER | Accuracy (words) |
|-------|-----|-----|------------------|
| taras-sereda/uk-pods-conformer | 6.75% | 1.41% | 93.25% |

Whisper

| Model | WER | CER | Accuracy (words) |
|-------|-----|-----|------------------|
| tiny | 63.08% | 18.59% | 36.92% |
| base | 52.1% | 14.08% | 47.9% |
| small | 30.57% | 7.64% | 69.43% |
| medium | 18.73% | 4.4% | 81.27% |
| large (v1) | 16.42% | 3.93% | 83.58% |
| large (v2) | 13.72% | 3.18% | 86.28% |
| large (v3) | 20.53% | 5.28% | 79.47% |
| turbo | 22.83% | 7.05% | 77.17% |

Quantized versions:

| Model | WER | CER | Accuracy (words) |
|-------|-----|-----|------------------|
| Yehor/whisper-large-v2-quantized-uk | 14.95% | 4.23% | 85.05% |
| Yehor/whisper-large-v3-turbo-quantized-uk | 12.75% | 3.25% | 87.25% |
| efficient-speech/lite-whisper-large-v3-turbo | 42.89% | 12.59% | 57.11% |
| efficient-speech/lite-whisper-large-v3-turbo-acc | 17.79% | 4.34% | 82.21% |

To fine-tune a Whisper model on your own data, use this repository: https://github.com/egorsmkv/whisper-ukrainian

Flashlight

| Model | WER | CER | Accuracy (words) |
|-------|-----|-----|------------------|
| Flashlight Conformer | 19.15% | 2.44% | 80.85% |

data2vec

| Model | WER | CER | Accuracy (words) |
|-------|-----|-----|------------------|
| robinhad/data2vec-large-uk | 31.17% | 7.31% | 68.83% |

VOSK

| Model | WER | CER | Accuracy (words) |
|-------|-----|-----|------------------|
| v3 | 53.25% | 38.78% | 46.75% |

m-ctc-t

| Model | WER | CER | Accuracy (words) |
|-------|-----|-----|------------------|
| speechbrain/m-ctc-t-large | 57% | 10.94% | 43% |

DeepSpeech

| Model | WER | CER | Accuracy (words) |
|-------|-----|-----|------------------|
| v0.5 | 70.25% | 20.09% | 29.75% |

📖 Development

  • How to train your own model using Kaldi
  • How to train a KenLM model based on Ukrainian Wikipedia data: https://github.com/egorsmkv/ukwiki-kenlm
  • Export a traced JIT version of wav2vec2 models: https://github.com/egorsmkv/wav2vec2-jit

📚 Datasets

Compiled dataset: ~1200 hours

  • Dataset: https://nx16725.your-storageshare.de/s/cAbcBeXtdz7znDN. Download it with Wget (browser downloads are speed-limited) or use the torrent file.

Voice of America: ~390 hours

  • Dataset: https://huggingface.co/datasets/speech-uk/voice-of-america

FLEURS

  • Ukrainian subset: https://huggingface.co/datasets/google/fleurs/viewer/uk_ua/train

Ukrainian broadcast: ~300 hours

  • Ukrainian broadcast speech: https://huggingface.co/datasets/Yehor/broadcast-speech-uk

YODAS2: ~400 hours

  • Ukrainian subsets:
    • https://huggingface.co/datasets/espnet/yodas2/tree/main/data/uk000
    • https://huggingface.co/datasets/espnet/yodas2/tree/main/data/uk100

Ukrainian podcasts

  • https://huggingface.co/datasets/taras-sereda/uk-pods

Cleaned Common Voice 10 (test set)

  • Repository: https://github.com/egorsmkv/cv10-uk-testset-clean

Noised Common Voice 10

  • Transcriptions: https://www.dropbox.com/s/ohj3y2cq8f4207a/transcriptions.zip?dl=0
  • Audio files: https://www.dropbox.com/s/v8crgclt9opbrv1/data.zip?dl=0

Other

  • ASR Corpus created using a Telegram bot for Ukrainian: https://huggingface.co/datasets/Yehor/tg-voices-uk
  • Speech Dataset with Ukrainian: https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/
  • Mozilla Common Voice has the Ukrainian dataset: https://commonvoice.mozilla.org/uk/datasets
  • M-AILABS Ukrainian Corpus: http://www.caito.de/data/Training/stt_tts/uk_UK.tgz
  • Espreso TV subset: https://blog.gdeltproject.org/visual-explorer-quick-workflow-for-downloading-belarusian-russian-ukrainian-transcripts-translations/
  • VoxForge Repository: http://www.repository.voxforge1.org/downloads/uk/Trunk/

⭐ Related works

Language models

  • Ukrainian LMs: https://huggingface.co/Yehor/kenlm-uk

Inverse Text Normalization

  • WFST for Ukrainian Inverse Text Normalization: https://github.com/lociko/ukraine_itn_wfst

Text Enhancement

  • Punctuation and capitalization model: https://huggingface.co/dchaplinsky/punctuation_uk_bert (demo: https://huggingface.co/spaces/Yehor/punctuation-uk)

Aligners

  • NeMo Forced Aligner: https://github.com/NVIDIA/NeMo/tree/main/tools/nemo_forced_aligner
  • Aligner for wav2vec2-bert models: https://github.com/egorsmkv/w2v2-bert-aligner
  • Aligner based on FasterWhisper (mostly for TTS): https://github.com/patriotyk/narizaka
  • Aligner based on Kaldi: https://github.com/proger/uk

Other

  • A space to calculate ASR metrics: https://huggingface.co/spaces/Yehor/evaluate-asr-outputs
  • A space to see ASR outputs: https://huggingface.co/spaces/Yehor/see-asr-outputs

📢 Text-to-Speech

Test sentence with stresses:

К+ам'ян+ець-Под+ільський - м+істо в Хмельн+ицькій +області Укра+їни, ц+ентр Кам'ян+ець-Под+ільської міськ+ої об'+єднаної територі+альної гром+ади +і Кам'ян+ець-Под+ільського рай+ону.

Without stresses:

Кам'янець-Подільський - місто в Хмельницькій області України, центр Кам'янець-Подільської міської об'єднаної територіальної громади і Кам'янець-Подільського району.
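Judging by the two sentences above, stress is marked with a `+` placed immediately before the stressed vowel. A minimal sketch of converting between the notations, under that assumption (the function names are illustrative and do not come from any of the listed tools):

```python
import re

STRESS_MARK = "+"
COMBINING_ACUTE = "\u0301"  # Unicode combining acute accent


def strip_stress(text):
    """Remove '+' stress marks to get plain, unstressed text."""
    return text.replace(STRESS_MARK, "")


def plus_to_acute(text):
    """Rewrite '+vowel' as the vowel followed by a combining acute accent."""
    return re.sub(r"\+(\w)", lambda m: m.group(1) + COMBINING_ACUTE, text)


stressed = "м+істо в Хмельн+ицькій +області"
print(strip_stress(stressed))
print(plus_to_acute(stressed))
```

Some TTS models expect one notation and some the other, so check each model's documentation before feeding it stressed text.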

📦 Implementations

StyleTTS2

- [StyleTTS2 demo & the code](https://huggingface.co/spaces/patriotyk/styletts2-ukrainian)

P-Flow TTS

- [P-Flow TTS](https://huggingface.co/spaces/patriotyk/pflowtts_ukr_demo)

https://github.com/egorsmkv/speech-recognition-uk/assets/7875085/18cfc074-f8a1-4842-90b6-9503d0bb7250

RAD-TTS

- [RAD-TTS](https://github.com/egorsmkv/ukrainian-radtts), the voice "Lada"
- [RAD-TTS with three voices](https://github.com/egorsmkv/radtts-uk), voices of Lada, Tetiana, and Mykyta

https://user-images.githubusercontent.com/7875085/206881140-bf8c09e7-5553-43d9-8807-065c36b2904b.mp4

Coqui TTS

- v1.0.0 using M-AILABS dataset: https://github.com/robinhad/ukrainian-tts/releases/tag/v1.0.0 (200,000 steps)
- v2.0.0 using Mykyta/Olena dataset: https://github.com/robinhad/ukrainian-tts/releases/tag/v2.0.0 (140,000 steps)

https://user-images.githubusercontent.com/5759207/167480982-275d8ca0-571f-4d21-b8d7-3776b3091956.mp4

Neon TTS

- [Coqui TTS](https://github.com/coqui-ai/TTS) model implemented in the [Neon Coqui TTS Python Plugin](https://pypi.org/project/neon-tts-plugin-coqui/). An interactive demo is available [on Hugging Face](https://huggingface.co/spaces/neongeckocom/neon-tts-plugin-coqui). This model and others can be downloaded [from Hugging Face](https://huggingface.co/neongeckocom), and more information can be found at [neon.ai](https://neon.ai/languages).

https://user-images.githubusercontent.com/96498856/170762023-d4b3f6d7-d756-4cb7-89de-dc50e9049b96.mp4

FastPitch

- NVIDIA FastPitch: https://huggingface.co/theodotus/tts_uk_fastpitch

Balacoon TTS

- [Balacoon TTS](https://huggingface.co/spaces/balacoon/tts), voices of Lada, Tetiana and Mykyta. [Blog post](https://balacoon.com/blog/uk_release/) on model release.

https://github.com/clementruhm/speech-recognition-uk/assets/87281103/a13493ce-a5e5-4880-8b72-42b02feeee50

MMS

- https://huggingface.co/facebook/mms-tts-ukr

📚 Datasets

⭐ Related works

Accentors

  • https://github.com/NeonBohdan/ukrainian-accentor-transformer
  • https://github.com/lang-uk/ukrainian-word-stress
  • https://github.com/egorsmkv/ukrainian-accentor

Grapheme-to-Phoneme

ipa-uk:

- https://github.com/lang-uk/ipa-uk
- https://github.com/patriotyk/ipa-uk

Charsiu G2P:

- https://huggingface.co/charsiu/g2p_multilingual_byT5_tiny_16_layers_100
- https://huggingface.co/charsiu/g2p_multilingual_byT5_small_100
- https://huggingface.co/charsiu/g2p_multilingual_mT5_small

Other:

- https://github.com/dmort27/epitran
- https://montreal-forced-aligner.readthedocs.io/en/v1.0/pretrained_models.html
- https://huggingface.co/darkproger/ukpron

Misc

  • Tool to make high quality text to speech (TTS) corpus from audio + text books: https://github.com/patriotyk/narizaka
  • Text Normalization:
    • https://huggingface.co/skypro1111/m2m100-ukr-verbalization (see the demo)
    • https://huggingface.co/skypro1111/mbart-large-50-verbalization
  • Audio Aesthetics for opentts-uk: https://huggingface.co/datasets/Yehor/opentts-uk-aesthetics

Owner

  • Name: Yehor Smoliakov
  • Login: egorsmkv
  • Kind: user
  • Location: 50.4501° N, 30.5234° E

Speech-to-Text, Text-to-Speech, Voice over Internet Protocol

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you cite this repository, please cite it as below."
authors:
- family-names: "Smoliakov"
  given-names: "Yehor"
  orcid: "https://orcid.org/0000-0002-8272-2095"
title: "Speech Recognition & Synthesis for Ukrainian"
version: 1.0.0
doi: 10.5281/zenodo.14895258
date-released: 2025-02-19
url: "https://github.com/egorsmkv/speech-recognition-uk"

GitHub Events

Total
  • Issues event: 65
  • Watch event: 59
  • Issue comment event: 46
  • Push event: 59
  • Fork event: 2
Last Year
  • Issues event: 65
  • Watch event: 59
  • Issue comment event: 46
  • Push event: 59
  • Fork event: 2

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 305
  • Total Committers: 9
  • Avg Commits per committer: 33.889
  • Development Distribution Score (DDS): 0.102
Past Year
  • Commits: 86
  • Committers: 1
  • Avg Commits per committer: 86.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
|------|-------|---------|
| Yehor Smoliakov | y****h@u****t | 274 |
| AlexeyBoiler | 6****r | 14 |
| Yurii Paniv | m****d@g****m | 6 |
| Богдан Михайленко | m****2@g****m | 3 |
| tarasfrompir | t****r@i****u | 2 |
| Sitdzikau Ihar | i****u@y****y | 2 |
| NeonBohdan | b****n@n****i | 2 |
| Clement Ruhm | c****t@b****m | 1 |
| Alexander Veysov | a****v@g****m | 1 |
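
The all-time averages above can be reproduced from the commit counts. The sketch below assumes the Development Distribution Score is one minus the top committer's share of all commits. This page does not define DDS, so that formula is an assumption, though it does reproduce the reported value.

```python
total_commits = 305        # "Total Commits" (all time)
total_committers = 9       # "Total Committers" (all time)
top_committer_commits = 274  # commits by the top committer


def avg_commits_per_committer(commits, committers):
    """Mean number of commits per committer."""
    return commits / committers


def development_distribution_score(top_commits, commits):
    """Assumed definition: 1 - (top committer's commits / all commits)."""
    return 1 - top_commits / commits


avg = round(avg_commits_per_committer(total_commits, total_committers), 3)
dds = round(development_distribution_score(top_committer_commits, total_commits), 3)
print(avg, dds)
```

Under the same assumption, a year with a single committer gives a DDS of 0, consistent with the "Past Year" figures.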

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 61
  • Total pull requests: 16
  • Average time to close issues: 3 months
  • Average time to close pull requests: about 18 hours
  • Total issue authors: 4
  • Total pull request authors: 7
  • Average comments per issue: 0.8
  • Average comments per pull request: 0.0
  • Merged pull requests: 16
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 44
  • Pull requests: 0
  • Average time to close issues: 4 days
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.7
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • egorsmkv (58)
  • Alexbeard (1)
  • hellcezar (1)
  • shahidjabbar (1)
  • mitchelldehaven (1)
Pull Request Authors
  • robinhad (6)
  • Theodotus1243 (3)
  • NeonBohdan (2)
  • tarasfrompir (2)
  • snakers4 (1)
  • igorsitdikov (1)
  • clementruhm (1)

Packages

  • Total packages: 1
  • Total downloads: unknown
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 1
proxy.golang.org: github.com/egorsmkv/speech-recognition-uk
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.4%
Average: 5.6%
Dependent repos count: 5.8%
Last synced: 7 months ago