cv10-uk-testset-clean
The cleaned Common Voice 10 (test set) that has been checked by a human for Ukrainian 🇺🇦
Repository
The cleaned Common Voice 10 (test set) that has been checked by a human for Ukrainian 🇺🇦
Statistics
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
The cleaned Common Voice 10 (test set) that has been checked by a human for Ukrainian 🇺🇦
Overview
This repository contains an archive of the Common Voice 10 (CV10) test set with checked Ukrainian transcriptions and audio. Every recording has been checked by a human to confirm that it is correct.
This archive is used to test all ASR models listed here: https://github.com/egorsmkv/speech-recognition-uk
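The source does not say which metric is used when testing the models. Word error rate (WER) is a common choice for ASR, so the following is only a minimal sketch under that assumption. The function name `word_error_rate` is hypothetical, and the normalization (lowercasing and splitting on whitespace) is an illustrative choice rather than the procedure used by the linked project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both strings are normalized by lowercasing and
    whitespace splitting; a real evaluation may normalize differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # substitution or exact match
            ))
        previous = current
    return previous[-1] / len(ref)
```

In practice, `reference` would be a row's `transcription` field and `hypothesis` the text produced by the model under test.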
Hugging Face dataset
- URL: https://huggingface.co/datasets/Yehor/cv10-uk-testset-clean
Usage
Example with datasets:
```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
ds = load_dataset('Yehor/cv10-uk-testset-clean')
print(ds)

for row in ds['train']:
    audio = row["audio"]

    sampling_rate = audio["sampling_rate"]
    audio_bytes = audio["array"]  # decoded audio samples
    filename = audio["path"]

    print(len(audio_bytes), sampling_rate, filename)
    print(row["duration"], row["transcription"])

    print('---')
```
Example with polars: https://colab.research.google.com/drive/1upeXw3WbLjK37b1LetpM0HxFXDdOZqSK?usp=sharing
Google Colabs
Use the following colabs to see how you can download this dataset in Python:
datasets:
- https://colab.research.google.com/drive/1qqnr5-WkaJi8iqHa_Pmlx7PbbXwXiimD?usp=sharing
polars:
- https://colab.research.google.com/drive/1upeXw3WbLjK37b1LetpM0HxFXDdOZqSK?usp=sharing
Statistics
Duration statistics
Duration: 4.6 hours
| Metrics | Value |
| ------ | ------ |
| mean | 5.201474 |
| std | 1.764957 |
| min | 1.704 |
| 25% | 3.816 |
| 50% | 4.896 |
| 75% | 6.384 |
| max | 10.536 |
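The duration table lists a mean, standard deviation, and quartiles. A minimal sketch of how such a summary could be computed with the Python standard library follows. It assumes the durations are in seconds (for example, collected from each row's `duration` field), that the standard deviation is the sample standard deviation, and that quartiles use linear interpolation. The source does not state how its numbers were produced, and the helper name `describe_durations` is hypothetical.

```python
import statistics


def describe_durations(durations):
    """Summarize a list of clip durations given in seconds.

    Requires at least two values, because both the sample standard
    deviation and the quartiles are undefined for a single point.
    """
    # method="inclusive" interpolates linearly between data points.
    q25, q50, q75 = statistics.quantiles(durations, n=4, method="inclusive")
    return {
        "total_hours": sum(durations) / 3600,
        "mean": statistics.mean(durations),
        "std": statistics.stdev(durations),  # sample standard deviation
        "min": min(durations),
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "max": max(durations),
    }
```

Calling `describe_durations([row["duration"] for row in ds["train"]])` on the loaded dataset would produce a dictionary with the same keys as the table, plus the total duration in hours.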
Download from GitHub
We recommend using the Hugging Face dataset, but if you need the raw dataset, use:
Audio data: https://github.com/egorsmkv/cv10-uk-testset-clean/releases/download/v1.1/filtered-cv10-test.zip
Labels list (TAB format) with absolute paths: https://github.com/egorsmkv/cv10-uk-testset-clean/blob/main/labels_absolute.lst
Labels list (CSV format) with absolute paths: https://github.com/egorsmkv/cv10-uk-testset-clean/blob/main/labels_absolute.csv
Labels list (CSV format) with relative paths: https://github.com/egorsmkv/cv10-uk-testset-clean/blob/main/labels_relative.csv
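The label links above are GitHub `blob` pages, which serve an HTML view rather than the file itself. A small sketch of converting such a link into its `raw.githubusercontent.com` form, so the file can be fetched programmatically, is shown below. The helper name `blob_to_raw_url` is hypothetical and not part of this repository.

```python
from urllib.parse import urlsplit


def blob_to_raw_url(blob_url: str) -> str:
    """Convert a github.com '/blob/' file URL to its raw-content URL."""
    parts = urlsplit(blob_url)
    segments = parts.path.strip("/").split("/")
    # Expected path layout: <owner>/<repo>/blob/<branch>/<file path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    owner, repo, _blob, *rest = segments
    # Drop the 'blob' segment and switch to the raw-content host.
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])
```

The resulting URL can be passed to any HTTP client or CSV reader. The column layout of the label files is not described here, so inspect the first few lines before parsing.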
Owner
- Name: Yehor Smoliakov
- Login: egorsmkv
- Kind: user
- Location: 50.4501° N, 30.5234° E
- Twitter: yehor_smoliakov
- Repositories: 22
- Profile: https://github.com/egorsmkv
Speech-to-Text, Text-to-Speech, Voice over Internet Protocol
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you cite this repository, please cite it as below."
authors:
  - family-names: "Smoliakov"
    given-names: "Yehor"
    orcid: "https://orcid.org/0000-0002-8272-2095"
title: "cv10-uk-testset-clean"
version: 1.0.0
doi: 10.57967/hf/4559
date-released: 2025-02-19
url: "https://huggingface.co/datasets/Yehor/cv10-uk-testset-clean"
GitHub Events
Total
- Watch event: 1
- Push event: 11
Last Year
- Watch event: 1
- Push event: 11
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0