speech-dataset-parser

Parser for several speech datasets.

https://github.com/stefantaubert/speech-dataset-parser

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 3 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.0%) to scientific vocabulary

Last synced: 9 months ago

Repository

Parser for several speech datasets.

Basic Info

Host: GitHub
Owner: stefantaubert
License: mit
Language: Python
Default Branch: master
Size: 330 KB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 1
Releases: 4

Created over 5 years ago · Last pushed over 2 years ago

Metadata Files

Readme License Citation

speech-dataset-parser

Library to parse speech datasets stored in a generic format based on TextGrids. A command-line tool (CLI) for converting common datasets such as LJ Speech into this generic format is included. A speech dataset consists of pairs of `.TextGrid` and `.wav` files. Each TextGrid needs to contain a tier in which every symbol has its own interval, e.g., `T|h|i|s| |i|s| |a| |t|e|x|t|`.
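To illustrate the pairing of `.wav` and `.TextGrid` files described above, here is a minimal sketch that collects the pairs below a speaker folder using only the standard library. The helper `iter_recording_pairs` is hypothetical and not part of this package; it assumes that both files of a pair share the same file stem and directory.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def iter_recording_pairs(speaker_dir: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (wav, TextGrid) pairs with the same stem, including subfolders."""
    for grid_path in sorted(speaker_dir.rglob("*.TextGrid")):
        wav_path = grid_path.with_suffix(".wav")
        # Skip TextGrids that have no matching audio file next to them
        if wav_path.is_file():
            yield wav_path, grid_path


# Demo on a throwaway directory; the file names are made up for illustration
with tempfile.TemporaryDirectory() as tmp:
    speaker_dir = Path(tmp) / "Linda Johnson;2;eng;North American"
    (speaker_dir / "wavs").mkdir(parents=True)
    for name in ("0000001.wav", "0000001.TextGrid", "0000002.TextGrid"):
        (speaker_dir / "wavs" / name).touch()
    for wav_path, grid_path in iter_recording_pairs(speaker_dir):
        print(wav_path.name, grid_path.name)
```

In this sketch, a TextGrid without a matching `.wav` file is skipped silently; whether the library itself skips or reports such files is not stated here.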

Generic Format

The format is as follows: `{Dataset name}/{Speaker name};{Speaker gender};{Speaker language}[;{Speaker accent}]/[Subfolder(s)]/{Recordings as .wav- and .TextGrid-pairs}`

Example: `LJ Speech/Linda Johnson;2;eng;North American/wavs/...`

Speaker names can be any string (excluding ; symbols). Genders are defined via their ISO/IEC 5218 Code. Languages are defined via their ISO 639-2 Code (bibliographic). Accents are optional and can be any string (excluding ; symbols).
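As a rough illustration of the naming rules above, the following sketch splits a speaker folder name into its fields. `SpeakerInfo` and `parse_speaker_folder` are hypothetical names introduced for this example and are not the library's API; the gender labels follow ISO/IEC 5218 (0, 1, 2, 9).

```python
from dataclasses import dataclass
from typing import Optional

# ISO/IEC 5218 codes for the representation of human sexes
GENDER_LABELS = {0: "not known", 1: "male", 2: "female", 9: "not applicable"}


@dataclass(frozen=True)
class SpeakerInfo:  # hypothetical helper type, not part of the library
    name: str
    gender: int
    language: str
    accent: Optional[str] = None


def parse_speaker_folder(folder_name: str) -> SpeakerInfo:
    """Split '{name};{gender};{language}[;{accent}]' into its fields."""
    parts = folder_name.split(";")
    if len(parts) not in (3, 4):
        raise ValueError(f"Expected 3 or 4 ';'-separated fields: {folder_name!r}")
    name, gender_code, language = parts[:3]
    gender = int(gender_code)
    if gender not in GENDER_LABELS:
        raise ValueError(f"Unknown ISO/IEC 5218 code: {gender}")
    # The accent is optional and therefore may be missing
    accent = parts[3] if len(parts) == 4 else None
    return SpeakerInfo(name, gender, language, accent)


info = parse_speaker_folder("Linda Johnson;2;eng;North American")
print(info.name, GENDER_LABELS[info.gender], info.language, info.accent)
```

Because names and accents must not contain `;`, a plain `split(";")` is sufficient here.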

Installation

```sh
pip install speech-dataset-parser --user
```

Library Usage

```py
from speech_dataset_parser import parse_dataset

entries = list(parse_dataset({folder}, {grid-tier-name}))
```

The resulting entries list contains dataclass-instances with these properties:

- `symbols: Tuple[str, ...]`: contains the mark of each interval
- `intervals: Tuple[float, ...]`: contains the max-time of each interval
- `symbols_language: str`: contains the language
- `speaker_name: str`: contains the name of the speaker
- `speaker_accent: str`: contains the accent of the speaker
- `speaker_gender: int`: contains the gender of the speaker
- `audio_file_abs: Path`: contains the absolute path to the speech audio
- `min_time: float`: the min-time of the grid
- `max_time: float`: the max-time of the grid (equal to `intervals[-1]`)
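Since `intervals` stores only the max-time of each interval, the start of each symbol has to be derived. The sketch below shows one way to do that, assuming the intervals are contiguous (each one starts where the previous one ends, and the first starts at `min_time`). `EntryStandIn` is a hypothetical stand-in carrying a subset of the properties listed above so that the example runs without a dataset; `symbol_spans` is likewise not part of the library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EntryStandIn:  # hypothetical stand-in for a parsed entry
    symbols: Tuple[str, ...]
    intervals: Tuple[float, ...]
    min_time: float
    max_time: float


def symbol_spans(entry) -> List[Tuple[str, float, float]]:
    """Return (symbol, start, end) for each interval of an entry."""
    # Start times: the grid's min-time, then every max-time except the last
    starts = (entry.min_time,) + tuple(entry.intervals[:-1])
    return list(zip(entry.symbols, starts, entry.intervals))


entry = EntryStandIn(
    symbols=("H", "i", "!"),
    intervals=(0.25, 0.5, 1.0),
    min_time=0.0,
    max_time=1.0,
)
for symbol, start, end in symbol_spans(entry):
    print(f"{symbol!r}: {start:.2f}s to {end:.2f}s")
```

The function only reads `symbols`, `intervals` and `min_time`, so it should work with any object that exposes these attributes.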

CLI Usage

```txt
usage: dataset-converter-cli [-h] [-v] {convert-ljs,convert-l2arctic,convert-thchs,convert-thchs-cslt,restore-structure} ...

This program converts common speech datasets into a generic representation.

positional arguments:
  {convert-ljs,convert-l2arctic,convert-thchs,convert-thchs-cslt,restore-structure}
                        description
    convert-ljs         convert LJ Speech dataset to a generic dataset
    convert-l2arctic    convert L2-ARCTIC dataset to a generic dataset
    convert-thchs       convert THCHS-30 (OpenSLR Version) dataset to a generic dataset
    convert-thchs-cslt  convert THCHS-30 (CSLT Version) dataset to a generic dataset
    restore-structure   restore original dataset structure of generic datasets

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
```

CLI Example

```sh
# Convert LJ Speech dataset with symbolic links to the audio files
dataset-converter-cli convert-ljs \
  "/data/datasets/LJSpeech-1.1" \
  "/tmp/ljs" \
  --tier "Symbols" \
  --symlink
```
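If the conversion should be triggered from a Python script, the same command can be assembled as an argument list. This is only a sketch: `build_convert_ljs_command` is a hypothetical helper, and it uses just the subcommand and flags shown in the shell example above. The command is only run when the CLI is found on `PATH` and the source folder exists.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_convert_ljs_command(
    source: str, target: str, tier: str = "Symbols", symlink: bool = True
) -> List[str]:
    """Assemble the argument list for 'dataset-converter-cli convert-ljs'."""
    command = ["dataset-converter-cli", "convert-ljs", source, target, "--tier", tier]
    if symlink:
        # Link to the audio files instead of copying them
        command.append("--symlink")
    return command


command = build_convert_ljs_command("/data/datasets/LJSpeech-1.1", "/tmp/ljs")
print(" ".join(command))

# Run the conversion only if the CLI is installed and the dataset is present
if shutil.which(command[0]) and Path(command[2]).is_dir():
    result = subprocess.run(command, check=False)
    print("exit code:", result.returncode)
```

Passing a list to `subprocess.run` avoids shell quoting issues with paths that contain spaces.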

Dependencies

tqdm
TextGrid>=1.5
ordered_set>=4.1.0
importlib_resources; python_version < '3.8'

Roadmap

Supporting conversion of more datasets
Adding more tests

Contributing

If you notice an error, please don't hesitate to open an issue.

Development setup

```sh
# update
sudo apt update
# install Python 3.7, 3.8, 3.9, 3.10 & 3.11 for ensuring that tests can be run
sudo apt install python3-pip \
  python3.7 python3.7-dev python3.7-distutils python3.7-venv \
  python3.8 python3.8-dev python3.8-distutils python3.8-venv \
  python3.9 python3.9-dev python3.9-distutils python3.9-venv \
  python3.10 python3.10-dev python3.10-distutils python3.10-venv \
  python3.11 python3.11-dev python3.11-distutils python3.11-venv
# install pipenv for creation of virtual environments
python3.8 -m pip install pipenv --user

# check out repo
git clone https://github.com/stefantaubert/speech-dataset-parser.git
cd speech-dataset-parser
# create virtual environment
python3.8 -m pipenv install --dev
```

Running the tests

```sh
# first install the tool like in "Development setup"
# then, navigate into the directory of the repo (if not already done)
cd speech-dataset-parser
# activate environment
python3.8 -m pipenv shell
# run tests
tox
```

Final lines of test result output:

```log
py37: commands succeeded
py38: commands succeeded
py39: commands succeeded
py310: commands succeeded
py311: commands succeeded
congratulations :)
```

License

MIT License

Acknowledgments

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – CRC 1410

Citation

If you want to cite this repo, you can use this BibTeX-entry generated by GitHub (see About => Cite this repository).

Changelog

v0.0.5 (unreleased)
- Added:
  - Added option `--use-un-normalized-text` to parse LJ Speech
v0.0.4 (2023-01-12)
- Added:
  - Added support to parse OpenSLR THCHS-30 version
  - Added returning of an exit code
- Changed:
  - Changed default command to be parsing the OpenSLR version for THCHS-30 by renaming the previous command to convert-thchs-cslt
v0.0.3 (2023-01-02)
- added option to restore original file structure
- added option to THCHS-30 to opt in for adding of punctuation
- change file naming format to numbers with preceding zeros
v0.0.2 (2022-09-08)
- added support for L2Arctic
- added support for THCHS-30
v0.0.1 (2022-06-03)
- Initial release

Owner

Name: Stefan Taubert
Login: stefantaubert
Kind: user
Location: Chemnitz, Germany
Company: Chemnitz University of Technology

Website: https://stefantaubert.com
Twitter: Stefan_Taubert
Repositories: 75
Profile: https://github.com/stefantaubert

Currently I am working on my PhD about the topic of speech synthesis at Chemnitz University of Technology.

Citation (CITATION.cff)

cff-version: 1.2.0
title: speech-dataset-parser
abstract: Library to parse speech datasets stored in a generic format based on TextGrids. A tool (CLI) for converting common datasets like LJ Speech into a generic format is included.
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - email: github@stefantaubert.com
    given-names: Stefan
    family-names: Taubert
    affiliation: Chemnitz University of Technology
    orcid: 'https://orcid.org/0000-0002-4932-2874'
    website: 'https://stefantaubert.com'
version: 0.0.4
date-released: 2023-01-12
license: MIT
url: https://github.com/stefantaubert/speech-dataset-parser
doi: 10.5281/zenodo.7529425

GitHub Events

Total

Last Year

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 1
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

stefantaubert (1)

Pull Request Authors

Top Labels

Issue Labels

enhancement (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 36 last-month

Total dependent packages: 1
Total dependent repositories: 2
Total versions: 4
Total maintainers: 1

pypi.org: speech-dataset-parser

Library to parse speech datasets stored in a generic format based on TextGrids. A tool (CLI) for converting common datasets like LJ Speech into a generic format is included.

Homepage: https://github.com/stefantaubert/speech-dataset-parser
Documentation: https://speech-dataset-parser.readthedocs.io/
License: MIT
Latest release: 0.0.4 (published over 3 years ago)

Versions: 4
Dependent Packages: 1
Dependent Repositories: 2
Downloads: 36 Last month

Rankings

Dependent packages count: 4.7%

Dependent repos count: 11.6%

Average: 24.2%

Forks count: 29.8%

Downloads: 36.2%

Stargazers count: 38.8%

Maintainers (1)

stefantaubert

Last synced: 10 months ago

Dependencies

Pipfile pypi

autoflake * develop
autopep8 * develop
isort * develop
pycodestyle * develop
pylint * develop
pytest * develop
rope * develop
speech-dataset-parser * develop
tox * develop
twine * develop
TextGrid >=1.5
ordered-set >=4.1.0
tox-wheel *
tqdm *

Pipfile.lock pypi

astroid ==2.11.5 develop
attrs ==21.4.0 develop
autoflake ==1.4 develop
autopep8 ==1.6.0 develop
bleach ==5.0.0 develop
certifi ==2022.5.18.1 develop
cffi ==1.15.0 develop
charset-normalizer ==2.0.12 develop
commonmark ==0.9.1 develop
cryptography ==37.0.2 develop
dill ==0.3.5.1 develop
distlib ==0.3.4 develop
docutils ==0.18.1 develop
filelock ==3.7.1 develop
idna ==3.3 develop
importlib-metadata ==4.11.4 develop
iniconfig ==1.1.1 develop
isort ==5.10.1 develop
jeepney ==0.8.0 develop
keyring ==23.5.1 develop
lazy-object-proxy ==1.7.1 develop
mccabe ==0.7.0 develop
ordered-set ==4.1.0 develop
packaging ==21.3 develop
pkginfo ==1.8.2 develop
platformdirs ==2.5.2 develop
pluggy ==1.0.0 develop
py ==1.11.0 develop
pycodestyle ==2.8.0 develop
pycparser ==2.21 develop
pyflakes ==2.4.0 develop
pygments ==2.12.0 develop
pylint ==2.14.0 develop
pyparsing ==3.0.9 develop
pytest ==7.1.2 develop
readme-renderer ==35.0 develop
requests ==2.27.1 develop
requests-toolbelt ==0.9.1 develop
rfc3986 ==2.0.0 develop
rich ==12.4.4 develop
rope ==1.1.1 develop
secretstorage ==3.3.2 develop
setuptools ==62.3.2 develop
six ==1.16.0 develop
speech-dataset-parser * develop
textgrid ==1.5 develop
toml ==0.10.2 develop
tomli ==2.0.1 develop
tomlkit ==0.11.0 develop
tox ==3.25.0 develop
tqdm ==4.64.0 develop
twine ==4.0.1 develop
typing-extensions ==4.2.0 develop
urllib3 ==1.26.9 develop
virtualenv ==20.14.1 develop
webencodings ==0.5.1 develop
wrapt ==1.14.1 develop
zipp ==3.8.0 develop
distlib ==0.3.4
filelock ==3.7.1
ordered-set ==4.1.0
packaging ==21.3
platformdirs ==2.5.2
pluggy ==1.0.0
py ==1.11.0
pyparsing ==3.0.9
six ==1.16.0
textgrid ==1.5
toml ==0.10.2
tox ==3.25.0
tox-wheel ==0.7.0
tqdm ==4.64.0
virtualenv ==20.14.1
wheel ==0.37.1

pyproject.toml pypi

TextGrid >=1.5
importlib_resources python_version < '3.8'
ordered_set >=4.1.0
tqdm *
