speech-dataset-parser
Parser for several speech datasets.
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to: zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.0%) to scientific vocabulary
Repository
Parser for several speech datasets.
Basic Info
- Host: GitHub
- Owner: stefantaubert
- License: mit
- Language: Python
- Default Branch: master
- Size: 330 KB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
- Releases: 4
Metadata Files
README.md
speech-dataset-parser
Library to parse speech datasets stored in a generic format based on TextGrids. A tool (CLI) for converting common datasets like LJ Speech into a generic format is included.
Speech datasets consist of pairs of .TextGrid and .wav files. The TextGrids need to contain a tier in which each symbol occupies its own interval, e.g., T|h|i|s| |i|s| |a| |t|e|x|t|.
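As a rough illustration of that tier layout, the sketch below splits a sentence into one symbol per interval and renders it in the pipe-separated notation used above. The helper names `to_symbols` and `render_tier` are hypothetical and not part of this library.

```py
from typing import Tuple


def to_symbols(text: str) -> Tuple[str, ...]:
    """Split a text into single-character symbols, one per interval."""
    return tuple(text)


def render_tier(symbols: Tuple[str, ...]) -> str:
    """Render symbols in the notation used above: each mark followed by '|'."""
    return "".join(f"{symbol}|" for symbol in symbols)


symbols = to_symbols("This is a text")
print(render_tier(symbols))
```

Spaces are treated as symbols of their own here, which matches the example string above.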
Generic Format
The format is as follows: {Dataset name}/{Speaker name};{Speaker gender};{Speaker language}[;{Speaker accent}]/[Subfolder(s)]/{Recordings as .wav- and .TextGrid-pairs}
Example: LJ Speech/Linda Johnson;2;eng;North American/wavs/...
Speaker names can be any string (excluding ; symbols).
Genders are defined via their ISO/IEC 5218 Code.
Languages are defined via their ISO 639-2 Code (bibliographic).
Accents are optional and can be any string (excluding ; symbols).
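The naming rules above can be sketched as a small parser for the speaker folder name. This is only an illustration of the format, not the library's own code; `SpeakerInfo` and `parse_speaker_folder` are hypothetical names, and the gender table lists the four codes defined by ISO/IEC 5218.

```py
from dataclasses import dataclass
from typing import Optional

# ISO/IEC 5218 codes for the representation of human sexes.
GENDERS = {0: "not known", 1: "male", 2: "female", 9: "not applicable"}


@dataclass(frozen=True)
class SpeakerInfo:  # hypothetical helper type, not part of the library
    name: str
    gender: int
    language: str
    accent: Optional[str] = None


def parse_speaker_folder(folder_name: str) -> SpeakerInfo:
    """Parse '{name};{gender};{language}[;{accent}]' into its parts."""
    parts = folder_name.split(";")
    if len(parts) not in (3, 4):
        raise ValueError(f"Expected 3 or 4 ';'-separated fields: {folder_name!r}")
    name, gender_code, language = parts[:3]
    gender = int(gender_code)
    if gender not in GENDERS:
        raise ValueError(f"Unknown ISO/IEC 5218 code: {gender}")
    # The accent is optional, so it is only read when a fourth field exists.
    accent = parts[3] if len(parts) == 4 else None
    return SpeakerInfo(name, gender, language, accent)


speaker = parse_speaker_folder("Linda Johnson;2;eng;North American")
print(speaker.name, GENDERS[speaker.gender], speaker.language, speaker.accent)
```

Because `;` is the field separator, it cannot appear inside a speaker name or accent, which is why the format excludes it.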
Installation
```sh
pip install speech-dataset-parser --user
```
Library Usage
```py
from speech_dataset_parser import parse_dataset

entries = list(parse_dataset({folder}, {grid-tier-name}))
```
The resulting entries list contains dataclass-instances with these properties:
- `symbols: Tuple[str, ...]`: contains the mark of each interval
- `intervals: Tuple[float, ...]`: contains the max-time of each interval
- `symbols_language: str`: contains the language
- `speaker_name: str`: contains the name of the speaker
- `speaker_accent: str`: contains the accent of the speaker
- `speaker_gender: int`: contains the gender of the speaker
- `audio_file_abs: Path`: contains the absolute path to the speech audio
- `min_time: float`: the min-time of the grid
- `max_time: float`: the max-time of the grid (equal to `intervals[-1]`)
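To show how these properties fit together, here is a minimal sketch that derives per-symbol durations from `min_time` and `intervals`. The `Entry` class is a hypothetical stand-in that mirrors the properties listed above (the library's real class name may differ), and the code assumes the intervals are contiguous and start at `min_time`.

```py
from dataclasses import dataclass
from pathlib import Path
from typing import List, Tuple


@dataclass
class Entry:  # hypothetical stand-in mirroring the properties listed above
    symbols: Tuple[str, ...]
    intervals: Tuple[float, ...]
    symbols_language: str
    speaker_name: str
    speaker_accent: str
    speaker_gender: int
    audio_file_abs: Path
    min_time: float
    max_time: float


def symbol_durations(entry: Entry) -> List[Tuple[str, float]]:
    """Pair each symbol with its duration, derived from consecutive max-times."""
    durations = []
    start = entry.min_time  # assumption: the first interval starts at min_time
    for symbol, end in zip(entry.symbols, entry.intervals):
        durations.append((symbol, end - start))
        start = end
    return durations


# Made-up values for illustration only.
example = Entry(
    symbols=("H", "i"),
    intervals=(0.25, 0.75),
    symbols_language="eng",
    speaker_name="Linda Johnson",
    speaker_accent="North American",
    speaker_gender=2,
    audio_file_abs=Path("/tmp/example.wav"),
    min_time=0.0,
    max_time=0.75,
)
print(symbol_durations(example))
```

Since `intervals` stores only the max-time of each interval, the start of an interval is taken from the end of the previous one.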
CLI Usage
```txt
usage: dataset-converter-cli [-h] [-v] {convert-ljs,convert-l2arctic,convert-thchs,convert-thchs-cslt,restore-structure} ...

This program converts common speech datasets into a generic representation.

positional arguments:
  {convert-ljs,convert-l2arctic,convert-thchs,convert-thchs-cslt,restore-structure}
                        description
    convert-ljs         convert LJ Speech dataset to a generic dataset
    convert-l2arctic    convert L2-ARCTIC dataset to a generic dataset
    convert-thchs       convert THCHS-30 (OpenSLR Version) dataset to a generic dataset
    convert-thchs-cslt  convert THCHS-30 (CSLT Version) dataset to a generic dataset
    restore-structure   restore original dataset structure of generic datasets

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
```
CLI Example
```sh
# Convert LJ Speech dataset with symbolic links to the audio files
dataset-converter-cli convert-ljs \
  "/data/datasets/LJSpeech-1.1" \
  "/tmp/ljs" \
  --tier "Symbols" \
  --symlink
```
Dependencies
- `tqdm`
- `TextGrid>=1.5`
- `ordered_set>=4.1.0`
- `importlib_resources; python_version < '3.8'`
Roadmap
- Supporting conversion of more datasets
- Adding more tests
Contributing
If you notice an error, please don't hesitate to open an issue.
Development setup
```sh
# update
sudo apt update
# install Python 3.7, 3.8, 3.9, 3.10 & 3.11 for ensuring that tests can be run
sudo apt install python3-pip \
  python3.7 python3.7-dev python3.7-distutils python3.7-venv \
  python3.8 python3.8-dev python3.8-distutils python3.8-venv \
  python3.9 python3.9-dev python3.9-distutils python3.9-venv \
  python3.10 python3.10-dev python3.10-distutils python3.10-venv \
  python3.11 python3.11-dev python3.11-distutils python3.11-venv
# install pipenv for creation of virtual environments
python3.8 -m pip install pipenv --user

# check out repo
git clone https://github.com/stefantaubert/speech-dataset-parser.git
cd speech-dataset-parser
# create virtual environment
python3.8 -m pipenv install --dev
```
Running the tests
```sh
# first install the tool like in "Development setup"
# then, navigate into the directory of the repo (if not already done)
cd speech-dataset-parser
# activate environment
python3.8 -m pipenv shell
# run tests
tox
```
Final lines of test result output:
```log
py37: commands succeeded
py38: commands succeeded
py39: commands succeeded
py310: commands succeeded
py311: commands succeeded
congratulations :)
```
License
MIT License
Acknowledgments
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – CRC 1410
Citation
If you want to cite this repo, you can use the BibTeX entry generated by GitHub (see About => Cite this repository).
Changelog
- v0.0.5 (unreleased)
  - Added:
    - Added option `--use-un-normalized-text` to parse LJ Speech
- v0.0.4 (2023-01-12)
  - Added:
    - Added support to parse OpenSLR THCHS-30 version
    - Added returning of an exit code
  - Changed:
    - Changed default command to be parsing the OpenSLR version for THCHS-30 by renaming the previous command to `convert-thchs-cslt`
- v0.0.3 (2023-01-02)
  - added option to restore original file structure
  - added option to THCHS-30 to opt in for adding of punctuation
  - change file naming format to numbers with preceding zeros
- v0.0.2 (2022-09-08)
  - added support for L2Arctic
  - added support for THCHS-30
- v0.0.1 (2022-06-03)
  - Initial release
Owner
- Name: Stefan Taubert
- Login: stefantaubert
- Kind: user
- Location: Chemnitz, Germany
- Company: Chemnitz University of Technology
- Website: https://stefantaubert.com
- Twitter: Stefan_Taubert
- Repositories: 75
- Profile: https://github.com/stefantaubert
I am currently working on my PhD on speech synthesis at Chemnitz University of Technology.
Citation (CITATION.cff)
cff-version: 1.2.0
title: speech-dataset-parser
abstract: Library to parse speech datasets stored in a generic format based on TextGrids. A tool (CLI) for converting common datasets like LJ Speech into a generic format is included.
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - email: github@stefantaubert.com
    given-names: Stefan
    family-names: Taubert
    affiliation: Chemnitz University of Technology
    orcid: 'https://orcid.org/0000-0002-4932-2874'
    website: 'https://stefantaubert.com'
version: 0.0.4
date-released: 2023-01-12
license: MIT
url: https://github.com/stefantaubert/speech-dataset-parser
doi: 10.5281/zenodo.7529425
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 1
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- stefantaubert (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 36 last month
- Total dependent packages: 1
- Total dependent repositories: 2
- Total versions: 4
- Total maintainers: 1
pypi.org: speech-dataset-parser
Library to parse speech datasets stored in a generic format based on TextGrids. A tool (CLI) for converting common datasets like LJ Speech into a generic format is included.
- Homepage: https://github.com/stefantaubert/speech-dataset-parser
- Documentation: https://speech-dataset-parser.readthedocs.io/
- License: MIT
- Latest release: 0.0.4 (published over 3 years ago)
Dependencies
- autoflake * develop
- autopep8 * develop
- isort * develop
- pycodestyle * develop
- pylint * develop
- pytest * develop
- rope * develop
- speech-dataset-parser * develop
- tox * develop
- twine * develop
- TextGrid >=1.5
- ordered-set >=4.1.0
- tox-wheel *
- tqdm *
- astroid ==2.11.5 develop
- attrs ==21.4.0 develop
- autoflake ==1.4 develop
- autopep8 ==1.6.0 develop
- bleach ==5.0.0 develop
- certifi ==2022.5.18.1 develop
- cffi ==1.15.0 develop
- charset-normalizer ==2.0.12 develop
- commonmark ==0.9.1 develop
- cryptography ==37.0.2 develop
- dill ==0.3.5.1 develop
- distlib ==0.3.4 develop
- docutils ==0.18.1 develop
- filelock ==3.7.1 develop
- idna ==3.3 develop
- importlib-metadata ==4.11.4 develop
- iniconfig ==1.1.1 develop
- isort ==5.10.1 develop
- jeepney ==0.8.0 develop
- keyring ==23.5.1 develop
- lazy-object-proxy ==1.7.1 develop
- mccabe ==0.7.0 develop
- ordered-set ==4.1.0 develop
- packaging ==21.3 develop
- pkginfo ==1.8.2 develop
- platformdirs ==2.5.2 develop
- pluggy ==1.0.0 develop
- py ==1.11.0 develop
- pycodestyle ==2.8.0 develop
- pycparser ==2.21 develop
- pyflakes ==2.4.0 develop
- pygments ==2.12.0 develop
- pylint ==2.14.0 develop
- pyparsing ==3.0.9 develop
- pytest ==7.1.2 develop
- readme-renderer ==35.0 develop
- requests ==2.27.1 develop
- requests-toolbelt ==0.9.1 develop
- rfc3986 ==2.0.0 develop
- rich ==12.4.4 develop
- rope ==1.1.1 develop
- secretstorage ==3.3.2 develop
- setuptools ==62.3.2 develop
- six ==1.16.0 develop
- speech-dataset-parser * develop
- textgrid ==1.5 develop
- toml ==0.10.2 develop
- tomli ==2.0.1 develop
- tomlkit ==0.11.0 develop
- tox ==3.25.0 develop
- tqdm ==4.64.0 develop
- twine ==4.0.1 develop
- typing-extensions ==4.2.0 develop
- urllib3 ==1.26.9 develop
- virtualenv ==20.14.1 develop
- webencodings ==0.5.1 develop
- wrapt ==1.14.1 develop
- zipp ==3.8.0 develop
- distlib ==0.3.4
- filelock ==3.7.1
- ordered-set ==4.1.0
- packaging ==21.3
- platformdirs ==2.5.2
- pluggy ==1.0.0
- py ==1.11.0
- pyparsing ==3.0.9
- six ==1.16.0
- textgrid ==1.5
- toml ==0.10.2
- tox ==3.25.0
- tox-wheel ==0.7.0
- tqdm ==4.64.0
- virtualenv ==20.14.1
- wheel ==0.37.1
- TextGrid >=1.5
- importlib_resources python_version < '3.8'
- ordered_set >=4.1.0
- tqdm *