speech-recognition
Develop speech recognition models with TensorFlow 2
Statistics
- Stars: 7
- Watchers: 1
- Forks: 4
- Open Issues: 1
- Releases: 1
Speech Recognition
- This repository provides speech recognition models together with train, evaluate, and inference scripts based on TensorFlow 2.
- You can run the script examples described below with the bundled test data.
- The `resources/configs` directory contains default configs for datasets (LibriSpeech, KsponSpeech, Clovacall) and models (LAS, DeepSpeech2).
- The `resources/sp-models` directory contains a default sentencepiece tokenizer for each dataset.
- I trained the LAS small model on the LibriSpeech dataset. You can download the pretrained model from the release page.
The trained model's performance is shown below.
| | LibriSpeech dev-clean | LibriSpeech dev-other |
| --- | --- | --- |
| WER (Word Error Rate) | 9.35% | 24.53% |
| CER (Character Error Rate) | 4.24% | 13.29% |
References
LAS Model
- Listen, Attend and Spell
- On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition
- SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
DeepSpeech2 Model
- Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Dataset Format
- The dataset file is in TSV (tab-separated values) format.
- The dataset file must have a header line.
- The 1st column is the audio file path, relative to the directory that contains the dataset TSV file.
- The 2nd column is the transcript text.
- Refer to the `tests/data/dataset.tsv` file.
| FilePath | Text |
| --- | --- |
| audio/001.wav | |
| audio/002.wav | |
| audio/003.wav | ? |
| ... | ... |

- This is an example of a TSV file.
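The format described above can be read with a few lines of Python. The following is a minimal sketch, not the repository's own loader: the function name `read_dataset_tsv` is hypothetical, and it only assumes what this section states (a header line, tab-separated columns, audio paths relative to the TSV file's directory).

```python
import csv
from pathlib import Path
from typing import List, Tuple


def read_dataset_tsv(tsv_path: str) -> List[Tuple[Path, str]]:
    """Return (audio_path, text) pairs from a dataset TSV file.

    Hypothetical helper for illustration; not part of speech_recognition.
    """
    tsv_file = Path(tsv_path)
    base_dir = tsv_file.parent  # audio paths are relative to this directory
    pairs = []
    with tsv_file.open(encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quote characters in transcripts untouched.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the header line
        for row in reader:
            if len(row) < 2:
                continue  # ignore blank or malformed lines
            pairs.append((base_dir / row[0], row[1]))
    return pairs
```

Calling `read_dataset_tsv("tests/data/dataset.tsv")` would then yield paths such as `tests/data/audio/001.wav` paired with their transcripts.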
Train
Example
You can start training by running the script as in the example below.
```sh
$ python -m speech_recognition.run.train \
    --data-config resources/configs/libri_config.yml \
    --model-config resources/configs/las_small.yml \
    --sp-model-path resources/sp-models/sp_model_unigram_16K_libri.model \
    --train-dataset-paths tests/data/wav_dataset.tsv \
    --dev-dataset-paths tests/data/wav_dataset.tsv \
    --train-dataset-size 1000 \
    --steps-per-epoch 100 \
    --epochs 10 \
    --batch-size 32 \
    --dev-batch-size 32 \
    --learning-rate 2e-4 \
    --mixed-precision \
    --device CPU
```
You can also start training from a train configuration file by using the `--from-file` parameter.
```sh
$ python -m speech_recognition.run.train --from-file resources/configs/train_config_sample.yml
```
You can also override parameters from the file with command line arguments, as below.
```sh
$ python -m speech_recognition.run.train \
    --from-file resources/configs/train_config_sample.yml \
    --epochs 1 \
    --batch-size 128 \
    --device GPU
```
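The override behavior can be pictured as "file values first, command line values on top". The sketch below illustrates that idea with plain `argparse`; it is not the repository's implementation. The function names and the keys in `file_config` are assumptions, since the contents of `train_config_sample.yml` are not shown here.

```python
import argparse
from typing import Any, Dict, List


def build_parser() -> argparse.ArgumentParser:
    # Defaults stay None so "not given on the command line" is detectable.
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--batch-size", type=int)
    parser.add_argument("--device", choices=["CPU", "GPU", "TPU"])
    return parser


def merge_config(file_config: Dict[str, Any], argv: List[str]) -> Dict[str, Any]:
    """Start from the file values, then apply any explicitly passed arguments."""
    args = vars(build_parser().parse_args(argv))
    merged = dict(file_config)
    for key, value in args.items():
        if value is not None:  # only explicit command line values override
            merged[key] = value
    return merged


# ASSUMPTION: these keys stand in for whatever the YAML file actually contains.
file_config = {"epochs": 10, "batch_size": 32, "device": "CPU", "learning_rate": 2e-4}
merged = merge_config(file_config, ["--epochs", "1", "--batch-size", "128", "--device", "GPU"])
print(merged)
```

Values that are absent from the command line, such as `learning_rate` here, keep the value loaded from the file.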
Arguments
```text
  --from-file FROM_FILE
                        load configs from file
  --data-config DATA_CONFIG
                        data processing config file
  --model-config MODEL_CONFIG
                        model config file
  --sp-model-path SP_MODEL_PATH
                        sentencepiece model path
  --train-dataset-paths TRAIN_DATASET_PATHS
                        a tsv/tfrecord dataset file or multiple files ex)
                        *.tsv
  --dev-dataset-paths DEV_DATASET_PATHS
                        a tsv/tfrecord dataset file or multiple files ex)
                        *.tsv
  --train-dataset-size TRAIN_DATASET_SIZE
                        the number of training dataset examples
  --output-path OUTPUT_PATH
                        output directory to save log and model checkpoints
  --pretrained-model-path PRETRAINED_MODEL_PATH
                        pretrained model checkpoint
  --epochs EPOCHS
  --steps-per-epoch STEPS_PER_EPOCH
  --learning-rate LEARNING_RATE
  --min-learning-rate MIN_LEARNING_RATE
  --warmup-rate WARMUP_RATE
  --warmup-steps WARMUP_STEPS
  --batch-size BATCH_SIZE
  --dev-batch-size DEV_BATCH_SIZE
  --shuffle-buffer-size SHUFFLE_BUFFER_SIZE
                        shuffle buffer size
  --max-over-policy {filter,slice}
                        policy for sequence whose length is over max
  --use-tfrecord        use tfrecord dataset
  --tensorboard-update-freq TENSORBOARD_UPDATE_FREQ
  --mixed-precision     use mixed precision FP16
  --seed SEED           Set random seed
  --skip-epochs SKIP_EPOCHS
                        skip first N epochs and start N + 1 epoch
  --device {CPU,GPU,TPU}
                        device to use (TPU or GPU or CPU)
```
- data-config is the config file path for data processing. An example config is resources/configs/libri_config.yml.
- model-config is the config file path used to initialize the model. The default config is resources/configs/las_small.yml.
- sp-model-path is the sentencepiece model path used to tokenize the target text.
- pretrained-model-path is the pretrained model checkpoint path, used when you continue training from a pretrained model.
- warmup-rate or warmup-steps specifies the number of warmup steps. The default is zero. If both are provided, warmup-steps is used.
- max-over-policy sets how to handle sequences longer than the maximum sequence length: filter drops the longer examples and slice truncates them to fit.
- use-tfrecord should be provided when using a TFRecord format dataset.
- mixed-precision enables FP16 mixed precision.
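Two of the options above have simple decision rules that can be sketched in a few lines. This is an illustration only, with hypothetical function names. In particular, it assumes that warmup-rate is a fraction of the total number of training steps, which the text above does not state, and it treats sequences as plain Python lists rather than tensors.

```python
from typing import List, Optional, Sequence


def resolve_warmup_steps(
    total_steps: int,
    warmup_rate: Optional[float] = None,
    warmup_steps: Optional[int] = None,
) -> int:
    """Pick the warmup step count following the precedence described above."""
    if warmup_steps is not None:
        return warmup_steps  # warmup-steps wins when both are provided
    if warmup_rate is not None:
        # ASSUMPTION: the rate is a fraction of the total training steps.
        return int(total_steps * warmup_rate)
    return 0  # default is zero


def apply_max_over_policy(
    sequences: List[Sequence[int]], max_length: int, policy: str
) -> List[Sequence[int]]:
    """Handle sequences that are longer than max_length."""
    if policy == "filter":
        return [s for s in sequences if len(s) <= max_length]  # drop long ones
    if policy == "slice":
        return [s[:max_length] for s in sequences]  # truncate to fit
    raise ValueError(f"unknown policy: {policy!r}")
```

With `filter` the dataset gets smaller, while `slice` keeps every example but may cut off the end of long utterances or transcripts.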
Evaluate
Example
You can evaluate your trained model using the evaluate.py script.
The evaluation reports CER and WER, as in the example below.
```sh
$ python -m speech_recognition.run.evaluate \
    --data-config resources/configs/libri_config.yml \
    --model-config tests/data/model-configs/las_mini_for_test.yml \
    --dataset-paths tests/data/wav_dataset.tsv \
    --model-path tests/data/model-checkpoints/las.ckpt \
    --sp-model-path resources/sp-models/sp_model_unigram_16K_libri.model \
    --device CPU
...
[2021-06-07 13:22:48,599] [+] Load Tokenizer from resources/sp-models/sp_model_unigram_16K_libri.model
[2021-06-07 13:22:48,626] [+] Load Data Config from resources/configs/libri_config.yml
[2021-06-07 13:22:48,629] [+] Load dataset from tests/data/wav_dataset.tsv
2021-06-07 13:22:49.018137: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 FMA
[2021-06-07 13:22:49,662] [+] Use delta and deltas accelerate
[2021-06-07 13:22:53,122] [+] Load weights of model from tests/data/model-checkpoints/las.ckpt
Model: "las"
...
[2021-06-07 13:22:53,135] [+] Start Inference
2021-06-07 13:22:53.171394: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-06-07 13:22:53.188758: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2198835000 Hz
[2021-06-07 13:22:56,352] [+] Ended Inference
[2021-06-07 13:22:56,589] [+] Average WER: 2494.6429%
[2021-06-07 13:22:56,589] [+] Average CER: 7256.3131%
```
Argument
```sh
  --data-config DATA_CONFIG
                        data processing config file
  --model-config MODEL_CONFIG
                        model config file
  --dataset-paths DATASET_PATHS
                        a tsv/tfrecord dataset file or multiple files ex)
                        *.tsv
  --model-path MODEL_PATH
                        pretrained model checkpoint
  --sp-model-path SP_MODEL_PATH
                        sentencepiece model path
  --output-path OUTPUT_PATH
                        output tsv file path to save generated sentences
  --batch-size BATCH_SIZE
  --beam-size BEAM_SIZE
                        not given, use greedy search else beam search with
                        this value as beam size
  --use-tfrecord        use tfrecord dataset
  --mixed-precision     Use mixed precision FP16
  --device DEVICE       device to train
```
- dataset-paths is the same as the dataset paths in the train script.
- If you pass the output-path argument, the recognized text, the real target text, and the distance metric are exported in TSV format.
- You can select CER or WER as your metric by passing the metric argument.
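For readers unfamiliar with the two metrics, the sketch below computes WER and CER for a single utterance using the standard definition: Levenshtein edit distance divided by the reference length. It is an illustration and not the repository's evaluation code. How the script tokenizes text or averages over utterances is not specified here, so the function names and the whitespace word split are assumptions.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to hyp[:j]
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    # ASSUMPTION: words are separated by whitespace.
    return error_rate(reference.split(), hypothesis.split())


def cer(reference: str, hypothesis: str) -> float:
    return error_rate(list(reference), list(hypothesis))
```

Because the denominator is the reference length, a hypothesis with many inserted tokens can score above 100% under this definition, for example `wer("hi", "a b c d")` is 400%.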
Inference
Example
You can run inference on your audio files with a trained model, as in the example below.
```sh
$ python -m speech_recognition.run.inference \
    --data-config resources/configs/libri_config.yml \
    --model-config tests/data/model-configs/las_mini_for_test.yml \
    --audio-files "tests/data/audio_files/*.wav" \
    --model-path tests/data/model-checkpoints/las.ckpt \
    --sp-model-path resources/sp-models/sp_model_unigram_16K_libri.model \
    --batch-size 3 \
    --device CPU \
    --beam-size 2
...
[2021-06-07 13:28:27,696] [+] Use delta and deltas accelerate
[2021-06-07 13:28:31,202] Loaded weights of model from tests/data/model-checkpoints/las.ckpt
Model: "las"
(MODEL SUMMARY)
[2021-06-07 13:28:31,204] Start Inference
2021-06-07 13:28:31.238552: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-06-07 13:28:31.256769: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2198835000 Hz
[2021-06-07 13:28:35,693] Ended Inference, Start to save...
[2021-06-07 13:28:35,694] Saved (audio path,decoded sentence) pairs to output.tsv
```
The inference results are then saved to the output path.
Argument
```sh
  --data-config DATA_CONFIG
                        data processing config file
  --model-config MODEL_CONFIG
                        model config file
  --audio-files AUDIO_FILES
                        an audio file or glob pattern of multiple files ex)
                        *.pcm
  --model-path MODEL_PATH
                        pretrained model checkpoint
  --output-path OUTPUT_PATH
                        output tsv file path to save generated sentences
  --sp-model-path SP_MODEL_PATH
                        sentencepiece model path
  --batch-size BATCH_SIZE
  --beam-size BEAM_SIZE
                        not given, use greedy search else beam search with
                        this value as beam size
  --mixed-precision     Use mixed precision FP16
  --device DEVICE       device to train
```
- audio-files is a glob pattern for audio files, e.g. "*.pcm", "data[0-9]+.wav"
- model-path is the TensorFlow model checkpoint path.
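To see which files a pattern would select before running inference, you can try it with Python's standard `glob` and `fnmatch` modules. This is a sketch under an assumption: the text above does not say which library the script uses to expand the pattern, so the matching rules shown here are those of Python's shell-style globbing only. The helper names are hypothetical.

```python
import fnmatch
import glob
from typing import Iterator, List


def expand_audio_files(pattern: str) -> List[str]:
    """Expand a glob pattern into a sorted list of existing file paths."""
    return sorted(glob.glob(pattern))


def batched(paths: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size paths."""
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


# Pattern semantics, checked against plain names (no filesystem access needed).
names = ["data1.wav", "data12.wav", "data1+.wav", "001.pcm"]
print(fnmatch.filter(names, "*.pcm"))           # any name ending in .pcm
print(fnmatch.filter(names, "data[0-9].wav"))   # exactly one digit
print(fnmatch.filter(names, "data[0-9]+.wav"))  # here "+" is a literal character
```

In shell-style globbing `[0-9]` matches a single character and `+` has no repetition meaning, unlike in regular expressions, so a pattern such as `data[0-9]*.wav` is the usual way to match names with one or more digits followed by anything. Quoting the pattern on the command line, as in the example above, passes it to the script unexpanded.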
Make TFRecord
Example
You can convert a dataset into TFRecord format, as in the example below.
```sh
$ python -m speech_recognition.run.make_tfrecord \
    --data-config resources/configs/libri_config.yml \
    --dataset-paths tests/data/wav_dataset.tsv \
    --sp-model-path resources/sp-models/sp_model_unigram_16K_libri.model \
    --output-dir .
[2021-06-07 13:31:10,444] [+] Number of Dataset Files: 1
[2021-06-07 13:31:10,445] [+] Load Config From resources/configs/libri_config.yml
[2021-06-07 13:31:10,447] [+] Load Tokenizer From resources/sp-models/sp_model_unigram_16K_libri.model
...
2021-06-07 13:31:10.491991: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
[2021-06-07 13:31:10,519] [+] Start Saving Dataset...
0%| | 0/1 [00:00<?, ?it/s]2021-06-07 13:31:10.848397: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 FMA
2021-06-07 13:31:11.530043: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-06-07 13:31:11.548833: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2198835000 Hz
100%|| 1/1 [00:01<00:00, 1.35s/it]
[2021-06-07 13:31:11,867] [+] Done
```
Argument
```text
  --data-config DATA_CONFIG
                        data processing config file
  --dataset-paths DATASET_PATHS
                        dataset file path glob pattern
  --output-dir OUTPUT_DIR
                        output directory path, default is input dataset file
                        directory
  --sp-model-path SP_MODEL_PATH
                        sentencepiece model path
```
- The arguments are the same as the train script arguments.
- The output TFRecord file contains pre-processed audio tensors and tokenized text tensors, so you can train with only the TFRecord file, without the TSV or audio files.
Owner
- Name: ParkSangJun
- Login: cosmoquester
- Kind: user
- Location: Seoul, Korea
- Company: @scatterlab @pingpong-ai
- Website: https://cosmoquester.github.io
- Repositories: 12
- Profile: https://github.com/cosmoquester
Machine Learning Engineer @scatterlab Korea. Thank you.
Dependencies
- black * development
- codecov * development
- isort * development
- pytest * development
- pytest-cov * development
- pytest-xdist * development
- pydantic *
- pyyaml *
- tensorflow >=2
- tensorflow-addons *
- tensorflow-io *
- tensorflow-text *
- tqdm *