speech-dataset-parser

Parser for several speech datasets.

https://github.com/stefantaubert/speech-dataset-parser

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.0%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: stefantaubert
  • License: mit
  • Language: Python
  • Default Branch: master
  • Size: 330 KB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 4
Created over 5 years ago · Last pushed over 2 years ago
Metadata Files
Readme License Citation

README.md

speech-dataset-parser

Library to parse speech datasets stored in a generic format based on TextGrids. A command-line tool (CLI) for converting common datasets like LJ Speech into this generic format is included. A speech dataset consists of pairs of .TextGrid and .wav files. Each TextGrid needs to contain a tier in which every symbol occupies its own interval, e.g., T|h|i|s| |i|s| |a| |t|e|x|t|.

Generic Format

The format is as follows: {Dataset name}/{Speaker name};{Speaker gender};{Speaker language}[;{Speaker accent}]/[Subfolder(s)]/{Recordings as .wav- and .TextGrid-pairs}

Example: LJ Speech/Linda Johnson;2;eng;North American/wavs/...

Speaker names can be any string (excluding ; symbols). Genders are defined via their ISO/IEC 5218 Code. Languages are defined via their ISO 639-2 Code (bibliographic). Accents are optional and can be any string (excluding ; symbols).
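The speaker folder naming rule above can be illustrated with a small sketch. It is not part of the library: `SpeakerInfo` and `parse_speaker_dir` are hypothetical names introduced here for illustration, and the gender table lists the four codes defined by ISO/IEC 5218.

```python
from dataclasses import dataclass
from typing import Optional

# ISO/IEC 5218 codes for the representation of human sexes
ISO_5218 = {0: "not known", 1: "male", 2: "female", 9: "not applicable"}


@dataclass(frozen=True)
class SpeakerInfo:  # hypothetical helper type, not part of the library
    name: str
    gender: int
    language: str
    accent: Optional[str] = None


def parse_speaker_dir(dir_name: str) -> SpeakerInfo:
    """Split '{name};{gender};{language}[;{accent}]' into its fields."""
    parts = dir_name.split(";")
    if len(parts) not in (3, 4):
        raise ValueError(f"expected 3 or 4 ';'-separated fields, got {len(parts)}")
    name, gender, language = parts[:3]
    gender_code = int(gender)
    if gender_code not in ISO_5218:
        raise ValueError(f"not an ISO/IEC 5218 code: {gender_code}")
    # The accent is optional, so it is None when only three fields are given
    accent = parts[3] if len(parts) == 4 else None
    return SpeakerInfo(name, gender_code, language, accent)


speaker = parse_speaker_dir("Linda Johnson;2;eng;North American")
print(speaker.name, ISO_5218[speaker.gender], speaker.language, speaker.accent)
```

Because names and accents must not contain `;`, a plain split on that character is enough to recover the fields; the sketch does not check whether the language is a valid ISO 639-2 code.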

Installation

```sh
pip install speech-dataset-parser --user
```

Library Usage

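```py
from speech_dataset_parser import parse_dataset

# {folder} and {grid-tier-name} are placeholders for the dataset directory and the name of the symbol tier
entries = list(parse_dataset({folder}, {grid-tier-name}))
```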

The resulting entries list contains dataclass-instances with these properties:

  • symbols: Tuple[str, ...]: contains the mark of each interval
  • intervals: Tuple[float, ...]: contains the max-time of each interval
  • symbols_language: str: contains the language
  • speaker_name: str: contains the name of the speaker
  • speaker_accent: str: contains the accent of the speaker
  • speaker_gender: int: contains the gender of the speaker
  • audio_file_abs: Path: contains the absolute path to the speech audio
  • min_time: float: the min-time of the grid
  • max_time: float: the max-time of the grid (equal to intervals[-1])
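Since `intervals` stores only the max-time of each interval, per-symbol durations have to be derived from consecutive values. The following is a minimal sketch of that calculation. It uses a hypothetical stand-in class `EntrySketch` with a subset of the properties listed above rather than the library's own dataclass, and it assumes that the intervals are contiguous and that the first one starts at `min_time`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EntrySketch:  # hypothetical stand-in with a subset of the fields listed above
    symbols: Tuple[str, ...]
    intervals: Tuple[float, ...]  # max-time of each interval
    min_time: float
    max_time: float


def symbol_durations(entry: EntrySketch) -> List[Tuple[str, float]]:
    """Pair each symbol with its duration, derived from consecutive max-times."""
    # Assumption: each interval starts where the previous one ends
    starts = (entry.min_time,) + entry.intervals[:-1]
    return [
        (symbol, end - start)
        for symbol, start, end in zip(entry.symbols, starts, entry.intervals)
    ]


entry = EntrySketch(
    symbols=("T", "h", "i", "s"),
    intervals=(0.5, 1.0, 1.5, 2.0),
    min_time=0.0,
    max_time=2.0,
)
for symbol, duration in symbol_durations(entry):
    print(symbol, duration)
```

With real data, the same function could be applied to each item returned by `parse_dataset`, provided the items expose the properties under the names listed above.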

CLI Usage

```txt
usage: dataset-converter-cli [-h] [-v] {convert-ljs,convert-l2arctic,convert-thchs,convert-thchs-cslt,restore-structure} ...

This program converts common speech datasets into a generic representation.

positional arguments:
  {convert-ljs,convert-l2arctic,convert-thchs,convert-thchs-cslt,restore-structure}
                        description
    convert-ljs         convert LJ Speech dataset to a generic dataset
    convert-l2arctic    convert L2-ARCTIC dataset to a generic dataset
    convert-thchs       convert THCHS-30 (OpenSLR Version) dataset to a generic dataset
    convert-thchs-cslt  convert THCHS-30 (CSLT Version) dataset to a generic dataset
    restore-structure   restore original dataset structure of generic datasets

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
```

CLI Example

```sh
# Convert LJ Speech dataset with symbolic links to the audio files
dataset-converter-cli convert-ljs \
  "/data/datasets/LJSpeech-1.1" \
  "/tmp/ljs" \
  --tier "Symbols" \
  --symlink
```

Dependencies

  • tqdm
  • TextGrid>=1.5
  • ordered_set>=4.1.0
  • importlib_resources; python_version < '3.8'

Roadmap

  • Supporting conversion of more datasets
  • Adding more tests

Contributing

If you notice an error, please don't hesitate to open an issue.

Development setup

```sh
# update
sudo apt update
# install Python 3.7, 3.8, 3.9, 3.10 & 3.11 for ensuring that tests can be run
sudo apt install python3-pip \
  python3.7 python3.7-dev python3.7-distutils python3.7-venv \
  python3.8 python3.8-dev python3.8-distutils python3.8-venv \
  python3.9 python3.9-dev python3.9-distutils python3.9-venv \
  python3.10 python3.10-dev python3.10-distutils python3.10-venv \
  python3.11 python3.11-dev python3.11-distutils python3.11-venv
# install pipenv for creation of virtual environments
python3.8 -m pip install pipenv --user

# check out repo
git clone https://github.com/stefantaubert/speech-dataset-parser.git
cd speech-dataset-parser
# create virtual environment
python3.8 -m pipenv install --dev
```

Running the tests

```sh
# first install the tool like in "Development setup"
# then, navigate into the directory of the repo (if not already done)
cd speech-dataset-parser
# activate environment
python3.8 -m pipenv shell
# run tests
tox
```

Final lines of test result output:

```log
  py37: commands succeeded
  py38: commands succeeded
  py39: commands succeeded
  py310: commands succeeded
  py311: commands succeeded
  congratulations :)
```

License

MIT License

Acknowledgments

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – CRC 1410

Citation

If you want to cite this repo, you can use this BibTeX-entry generated by GitHub (see About => Cite this repository).

Changelog

  • v0.0.5 (unreleased)
    • Added:
      • Added option --use-un-normalized-text to parse LJ Speech
  • v0.0.4 (2023-01-12)
    • Added:
      • Added support to parse OpenSLR THCHS-30 version
      • Added returning of an exit code
    • Changed:
      • Changed default command to be parsing the OpenSLR version for THCHS-30 by renaming the previous command to convert-thchs-cslt
  • v0.0.3 (2023-01-02)
    • added option to restore original file structure
    • added option to THCHS-30 to opt in for adding of punctuation
    • change file naming format to numbers with preceding zeros
  • v0.0.2 (2022-09-08)
    • added support for L2Arctic
    • added support for THCHS-30
  • v0.0.1 (2022-06-03)
    • Initial release

Owner

  • Name: Stefan Taubert
  • Login: stefantaubert
  • Kind: user
  • Location: Chemnitz, Germany
  • Company: Chemnitz University of Technology

I am currently working on my PhD on speech synthesis at Chemnitz University of Technology.

Citation (CITATION.cff)

cff-version: 1.2.0
title: speech-dataset-parser
abstract: Library to parse speech datasets stored in a generic format based on TextGrids. A tool (CLI) for converting common datasets like LJ Speech into a generic format is included.
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - email: github@stefantaubert.com
    given-names: Stefan
    family-names: Taubert
    affiliation: Chemnitz University of Technology
    orcid: 'https://orcid.org/0000-0002-4932-2874'
    website: 'https://stefantaubert.com'
version: 0.0.4
date-released: 2023-01-12
license: MIT
url: https://github.com/stefantaubert/speech-dataset-parser
doi: 10.5281/zenodo.7529425

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • stefantaubert (1)
Pull Request Authors
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 36 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 2
  • Total versions: 4
  • Total maintainers: 1
pypi.org: speech-dataset-parser

Library to parse speech datasets stored in a generic format based on TextGrids. A tool (CLI) for converting common datasets like LJ Speech into a generic format is included.

  • Versions: 4
  • Dependent Packages: 1
  • Dependent Repositories: 2
  • Downloads: 36 Last month
Rankings
Dependent packages count: 4.7%
Dependent repos count: 11.6%
Average: 24.2%
Forks count: 29.8%
Downloads: 36.2%
Stargazers count: 38.8%
Maintainers (1)
Last synced: 10 months ago

Dependencies

Pipfile pypi
  • autoflake * develop
  • autopep8 * develop
  • isort * develop
  • pycodestyle * develop
  • pylint * develop
  • pytest * develop
  • rope * develop
  • speech-dataset-parser * develop
  • tox * develop
  • twine * develop
  • TextGrid >=1.5
  • ordered-set >=4.1.0
  • tox-wheel *
  • tqdm *
Pipfile.lock pypi
  • astroid ==2.11.5 develop
  • attrs ==21.4.0 develop
  • autoflake ==1.4 develop
  • autopep8 ==1.6.0 develop
  • bleach ==5.0.0 develop
  • certifi ==2022.5.18.1 develop
  • cffi ==1.15.0 develop
  • charset-normalizer ==2.0.12 develop
  • commonmark ==0.9.1 develop
  • cryptography ==37.0.2 develop
  • dill ==0.3.5.1 develop
  • distlib ==0.3.4 develop
  • docutils ==0.18.1 develop
  • filelock ==3.7.1 develop
  • idna ==3.3 develop
  • importlib-metadata ==4.11.4 develop
  • iniconfig ==1.1.1 develop
  • isort ==5.10.1 develop
  • jeepney ==0.8.0 develop
  • keyring ==23.5.1 develop
  • lazy-object-proxy ==1.7.1 develop
  • mccabe ==0.7.0 develop
  • ordered-set ==4.1.0 develop
  • packaging ==21.3 develop
  • pkginfo ==1.8.2 develop
  • platformdirs ==2.5.2 develop
  • pluggy ==1.0.0 develop
  • py ==1.11.0 develop
  • pycodestyle ==2.8.0 develop
  • pycparser ==2.21 develop
  • pyflakes ==2.4.0 develop
  • pygments ==2.12.0 develop
  • pylint ==2.14.0 develop
  • pyparsing ==3.0.9 develop
  • pytest ==7.1.2 develop
  • readme-renderer ==35.0 develop
  • requests ==2.27.1 develop
  • requests-toolbelt ==0.9.1 develop
  • rfc3986 ==2.0.0 develop
  • rich ==12.4.4 develop
  • rope ==1.1.1 develop
  • secretstorage ==3.3.2 develop
  • setuptools ==62.3.2 develop
  • six ==1.16.0 develop
  • speech-dataset-parser * develop
  • textgrid ==1.5 develop
  • toml ==0.10.2 develop
  • tomli ==2.0.1 develop
  • tomlkit ==0.11.0 develop
  • tox ==3.25.0 develop
  • tqdm ==4.64.0 develop
  • twine ==4.0.1 develop
  • typing-extensions ==4.2.0 develop
  • urllib3 ==1.26.9 develop
  • virtualenv ==20.14.1 develop
  • webencodings ==0.5.1 develop
  • wrapt ==1.14.1 develop
  • zipp ==3.8.0 develop
  • distlib ==0.3.4
  • filelock ==3.7.1
  • ordered-set ==4.1.0
  • packaging ==21.3
  • platformdirs ==2.5.2
  • pluggy ==1.0.0
  • py ==1.11.0
  • pyparsing ==3.0.9
  • six ==1.16.0
  • textgrid ==1.5
  • toml ==0.10.2
  • tox ==3.25.0
  • tox-wheel ==0.7.0
  • tqdm ==4.64.0
  • virtualenv ==20.14.1
  • wheel ==0.37.1
pyproject.toml pypi
  • TextGrid >=1.5
  • importlib_resources python_version < '3.8'
  • ordered_set >=4.1.0
  • tqdm *