cv10-uk-testset-clean

The cleaned Common Voice 10 (test set) that has been checked by a human for Ukrainian 🇺🇦

https://github.com/egorsmkv/cv10-uk-testset-clean

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ✓ CITATION.cff file: Found CITATION.cff file
  • ✓ codemeta.json file: Found codemeta.json file
  • ✓ .zenodo.json file: Found .zenodo.json file
  • ○ DOI references
  • ○ Academic publication links
  • ○ Academic email domains
  • ○ Institutional organization owner
  • ○ JOSS paper metadata
  • ○ Scientific vocabulary similarity: Low similarity (7.5%) to scientific vocabulary

Keywords

asr automatic-speech-recognition speech speech-recognition speech-to-text ukrainian
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: egorsmkv
  • Default Branch: main
  • Size: 409 KB
Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created over 3 years ago · Last pushed about 1 year ago
Metadata Files
  • Readme
  • Citation

README.md

The cleaned Common Voice 10 (test set) that has been checked by a human for Ukrainian 🇺🇦

Overview

This repository contains an archive of the Common Voice 10 (CV10) test set with verified Ukrainian transcriptions and audio. Every audio clip has been checked by a human to make sure it is correct.

This archive is used to test all ASR models listed here: https://github.com/egorsmkv/speech-recognition-uk
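Testing an ASR model against this archive usually means comparing each model output with the reference transcription. The README does not say which metric is used, so the sketch below assumes word error rate (WER), a common choice. `word_error_rate` is a hypothetical helper, not part of this repository, and it applies no text normalization (case, punctuation).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; not part of this repository.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

In practice the function would be called once per row, with `row["transcription"]` as the reference and the model output as the hypothesis. A corpus-level WER is normally computed by summing edit distances and reference lengths over all rows rather than averaging per-row scores.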

Hugging Face dataset

  • URL: https://huggingface.co/datasets/Yehor/cv10-uk-testset-clean

Usage

Example with datasets:

```python
from datasets import load_dataset

ds = load_dataset('Yehor/cv10-uk-testset-clean')

print(ds)

for row in ds['train']:
    # Each decoded audio entry is a dict with the samples, rate, and file path.
    audio = row["audio"]

    sampling_rate = audio["sampling_rate"]
    audio_bytes = audio["array"]
    filename = audio["path"]

    print(len(audio_bytes), sampling_rate, filename)
    print(row["duration"], row["transcription"])

    print('---')
```

Example with polars: https://colab.research.google.com/drive/1upeXw3WbLjK37b1LetpM0HxFXDdOZqSK?usp=sharing

Google Colabs

Use the following colabs to see how you can download this dataset in Python:

  • datasets: https://colab.research.google.com/drive/1qqnr5-WkaJi8iqHa_Pmlx7PbbXwXiimD?usp=sharing

  • polars: https://colab.research.google.com/drive/1upeXw3WbLjK37b1LetpM0HxFXDdOZqSK?usp=sharing

Statistics

Duration statistics

Duration: 4.6 hours

| Metric | Value (seconds) |
| ------ | ------ |
| mean | 5.201474 |
| std | 1.764957 |
| min | 1.704 |
| 25% | 3.816 |
| 50% | 4.896 |
| 75% | 6.384 |
| max | 10.536 |
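The README does not say how these figures were produced. As a minimal sketch, the same kind of summary can be computed from a list of per-clip durations with the Python standard library. It assumes durations are given in seconds, uses the sample standard deviation, and interpolates quartiles linearly. `duration_summary` and the example values are hypothetical, not taken from the dataset.

```python
import statistics


def duration_summary(durations):
    """Summarize clip durations given in seconds (hypothetical helper)."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, q2, q3 = statistics.quantiles(durations, n=4, method="inclusive")
    return {
        "total_hours": sum(durations) / 3600,
        "mean": statistics.mean(durations),
        "std": statistics.stdev(durations),  # sample standard deviation
        "min": min(durations),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(durations),
    }


# Illustrative values only; real durations would come from row["duration"].
example_durations = [2.0, 4.0, 6.0, 8.0]
summary = duration_summary(example_durations)
for name, value in summary.items():
    print(f"{name}: {value:.6f}")
```

With the real dataset, the list could be built from the `duration` field of each row shown in the usage example.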

Download from GitHub

We recommend using the Hugging Face dataset, but if you need the raw dataset, use:

  • Audio data: https://github.com/egorsmkv/cv10-uk-testset-clean/releases/download/v1.1/filtered-cv10-test.zip

  • Labels list (TAB format) with absolute paths: https://github.com/egorsmkv/cv10-uk-testset-clean/blob/main/labels_absolute.lst

  • Labels list (CSV format) with absolute paths: https://github.com/egorsmkv/cv10-uk-testset-clean/blob/main/labels_absolute.csv

  • Labels list (CSV format) with relative paths: https://github.com/egorsmkv/cv10-uk-testset-clean/blob/main/labels_relative.csv
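The label links above point at GitHub `blob` pages, which serve an HTML view rather than the file itself. A small sketch of rewriting such a link to its `raw.githubusercontent.com` form for scripted downloads follows. `github_blob_to_raw` is a hypothetical helper, and it assumes the branch name contains no slashes (true for `main`).

```python
from urllib.parse import urlsplit


def github_blob_to_raw(url: str) -> str:
    """Rewrite a github.com blob URL to its raw-content URL (hypothetical helper)."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected path shape: owner/repo/blob/branch/path/to/file
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner, repo, _, branch, *file_path = segments
    return (
        f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/"
        + "/".join(file_path)
    )


labels_url = github_blob_to_raw(
    "https://github.com/egorsmkv/cv10-uk-testset-clean/blob/main/labels_relative.csv"
)
print(labels_url)
```

The release asset link for the audio archive already points at the file directly and needs no rewriting.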

Owner

  • Name: Yehor Smoliakov
  • Login: egorsmkv
  • Kind: user
  • Location: 50.4501° N, 30.5234° E

Speech-to-Text, Text-to-Speech, Voice over Internet Protocol

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you cite this repository, please cite it as below."
authors:
- family-names: "Smoliakov"
  given-names: "Yehor"
  orcid: "https://orcid.org/0000-0002-8272-2095"
title: "cv10-uk-testset-clean"
version: 1.0.0
doi: 10.57967/hf/4559
date-released: 2025-02-19
url: "https://huggingface.co/datasets/Yehor/cv10-uk-testset-clean"

GitHub Events

Total
  • Watch event: 1
  • Push event: 11
Last Year
  • Watch event: 1
  • Push event: 11

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0