Speech-Recognition-using-WhisperAI
Multilingual speech recognition using WhisperAI that extracts words and their timestamps and writes them to a .TextGrid file.
https://github.com/esantiago28/Speech-Recognition-using-WhisperAI
Speech Recognition using WhisperAI is a Python script that uses Whisper and whisper-timestamped to transcribe audio automatically and write a .TextGrid file containing the start and end timestamps of each word. It supports any Whisper model.
Installation
First Installation
Requirements:
- python3 (version 3.7 or higher; 3.9 or higher is recommended)
- ffmpeg (see instructions for installation on the whisper repository)
- tgt (version 1.4.4)
Installation of whisper-timestamped with pip:
```bash
pip3 install git+https://github.com/linto-ai/whisper-timestamped
```
or clone the repository and run the installation:
```bash
git clone https://github.com/linto-ai/whisper-timestamped
cd whisper-timestamped/
python3 setup.py install
```
Installation of tgt with pip:
```bash
pip install tgt
```
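Once everything is installed, a quick check can confirm that the requirements listed above are visible to Python. The following is a minimal sketch using only the standard library; the helper name check_requirements is hypothetical and not part of this repository.

```python
import importlib.util
import shutil
import sys


def check_requirements():
    """Return a dict mapping each requirement to True (found) or False."""
    return {
        # Python 3.7 is the stated minimum; 3.9 or higher is recommended.
        "python>=3.7": sys.version_info >= (3, 7),
        # ffmpeg must be on PATH so that audio files can be decoded.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # find_spec returns None when a top-level package is not importable.
        "whisper_timestamped": importlib.util.find_spec("whisper_timestamped") is not None,
        "tgt": importlib.util.find_spec("tgt") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Any entry reported as MISSING points back to the corresponding installation step.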
Additional Packages that might be needed (Depends on usage)
Usage
Python
In Python, you can use the function whisper_timestamped.transcribe(), which is similar to the function whisper.transcribe():
```python
import whisper_timestamped
help(whisper_timestamped.transcribe)
```
The main difference from whisper.transcribe() is that the output includes a "words" key for every segment, giving the start and end position of each word. Note that each word includes its punctuation. See the example below.
In addition, the default decoding options differ in order to favour efficient decoding (greedy decoding instead of beam search, and no temperature sampling fallback). To get the same defaults as in whisper, use beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0).
There are also additional options related to word alignment.
In general, if you import whisper_timestamped instead of whisper in your Python script and use transcribe(model, ...) instead of model.transcribe(...), it should do the job:
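```
import json
import whisper_timestamped as whisper

# Load the audio and a Whisper model, then transcribe with word timestamps
audio = whisper.load_audio("AUDIO.wav")
model = whisper.load_model("tiny", device="cpu")
result = whisper.transcribe(model, audio, language="fr")

# Pretty-print the result, keeping non-ASCII characters readable
print(json.dumps(result, indent=2, ensure_ascii=False))
```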
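To get from the nested result to a flat list of words with their timestamps, the segments and their "words" entries can be walked in order. This is a minimal sketch that runs on a hand-written sample with the same shape as the transcribe() output; the helper name iter_words is hypothetical.

```python
def iter_words(result):
    """Yield (text, start, end) for every word in a transcription result."""
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            yield word["text"], word["start"], word["end"]


# Hand-written sample with the same shape as the transcribe() output.
sample = {
    "segments": [
        {"words": [{"text": "Bonjour!", "start": 0.5, "end": 1.2}]},
        {"words": [{"text": "Est-ce", "start": 2.02, "end": 3.78},
                   {"text": "que", "start": 3.78, "end": 3.84}]},
    ]
}

for text, start, end in iter_words(sample):
    print(f"{start:6.2f} {end:6.2f}  {text}")
```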
Note that you can use a fine-tuned Whisper model from HuggingFace or a local folder by using the load_model method of whisper_timestamped. For instance, if you want to use whisper-large-v2-nob, you can simply do the following:
```
import whisper_timestamped as whisper
model = whisper.load_model("NbAiLab/whisper-large-v2-nob", device="cpu")
...
```
Command line
You can also use whisper_timestamped on the command line, similarly to whisper. See help with:
```bash
whisper_timestamped --help
```
The main differences with whisper CLI are:
* Output files:
  * The output JSON contains word timestamps and confidence scores. See example below.
  * There is an additional CSV output format.
  * For SRT, VTT, TSV formats, there will be additional files saved with word timestamps.
* Some default options are different:
  * By default, no output folder is set: Use --output_dir . for Whisper default.
  * By default, there is no verbose: Use --verbose True for Whisper default.
  * By default, beam search decoding and temperature sampling fallback are disabled, to favour an efficient decoding.
    To set the same as Whisper default, you can use --accurate (which is an alias for --beam_size 5 --temperature_increment_on_fallback 0.2 --best_of 5).
* There are some additional specific options:
  <!-- * --efficient to use a faster greedy decoding (without beam search neither several sampling at each step),
  which enables a special path where word timestamps are computed on the fly (no need to run inference twice).
  Note that transcription results might be significantly worse on challenging audios with this option. -->
  * --compute_confidence to enable/disable the computation of confidence scores for each word.
  * --punctuations_with_words to decide whether punctuation marks should be included or not with preceding words.
An example command to process several files using the tiny model and output the results in the current folder, as would be done by default with whisper, is as follows:
```bash
whisper_timestamped audio1.flac audio2.mp3 audio3.wav --model tiny --output_dir .
```
Note that you can use a fine-tuned Whisper model from HuggingFace or a local folder. For instance, if you want to use the whisper-large-v2-nob model, you can simply do the following:
```bash
whisper_timestamped --model NbAiLab/whisper-large-v2-nob <...>
```
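Because the JSON output carries a confidence score for each word, it can be post-processed to flag words that may need manual review. The sketch below assumes the JSON has already been read into a string; the helper name low_confidence_words and the 0.5 threshold are hypothetical choices, and the sample values are illustrative.

```python
import json


def low_confidence_words(result, threshold=0.5):
    """Return (text, confidence) pairs for words scoring below threshold."""
    flagged = []
    for segment in result["segments"]:
        for word in segment["words"]:
            if word["confidence"] < threshold:
                flagged.append((word["text"], word["confidence"]))
    return flagged


# Abbreviated JSON in the shape of the word-level output; values are illustrative.
raw = '''
{"segments": [{"words": [
    {"text": "que", "start": 3.78, "end": 3.84, "confidence": 0.948},
    {"text": "allez", "start": 4.0, "end": 4.14, "confidence": 0.347}
]}]}
'''
print(low_confidence_words(json.loads(raw)))
```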
Example output
Here is an example output of the whisper_timestamped.transcribe() function, which can be viewed by using the CLI:
```bash
whisper_timestamped AUDIO_FILE.wav --model tiny --language fr
```
```json
{
  "text": " Bonjour! Est-ce que vous allez bien?",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.5,
      "end": 1.2,
      "text": " Bonjour!",
      "tokens": [ 25431, 2298 ],
      "temperature": 0.0,
      "avg_logprob": -0.6674491882324218,
      "compression_ratio": 0.8181818181818182,
      "no_speech_prob": 0.10241222381591797,
      "confidence": 0.51,
      "words": [
        {
          "text": "Bonjour!",
          "start": 0.5,
          "end": 1.2,
          "confidence": 0.51
        }
      ]
    },
    {
      "id": 1,
      "seek": 200,
      "start": 2.02,
      "end": 4.48,
      "text": " Est-ce que vous allez bien?",
      "tokens": [ 50364, 4410, 12, 384, 631, 2630, 18146, 3610, 2506, 50464 ],
      "temperature": 0.0,
      "avg_logprob": -0.43492694334550336,
      "compression_ratio": 0.7714285714285715,
      "no_speech_prob": 0.06502953916788101,
      "confidence": 0.595,
      "words": [
        {
          "text": "Est-ce",
          "start": 2.02,
          "end": 3.78,
          "confidence": 0.441
        },
        {
          "text": "que",
          "start": 3.78,
          "end": 3.84,
          "confidence": 0.948
        },
        {
          "text": "vous",
          "start": 3.84,
          "end": 4.0,
          "confidence": 0.935
        },
        {
          "text": "allez",
          "start": 4.0,
          "end": 4.14,
          "confidence": 0.347
        },
        {
          "text": "bien?",
          "start": 4.14,
          "end": 4.48,
          "confidence": 0.998
        }
      ]
    }
  ],
  "language": "fr"
}
```
This gets parsed into a .TextGrid file with the following format:
```
File type = "ooTextFile"
Object class = "TextGrid"
xmin = 0
xmax = 10
tiers? <exists>
```
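To make the layout of the file concrete, here is a minimal sketch that builds a long-format TextGrid string with a single interval tier from a list of word timestamps. It is written by hand with the standard library purely for illustration; this project relies on tgt for the actual file handling, and the function name words_to_textgrid and the tier name "words" are hypothetical. The sketch assumes the words are sorted and do not overlap, and it fills the gaps between words with empty intervals so that the tier has no holes.

```python
def words_to_textgrid(words, xmax, tier_name="words"):
    """Build a long-format TextGrid string with one interval tier.

    words: list of (text, start, end) tuples, sorted and non-overlapping.
    Gaps between words become empty intervals so the tier covers 0..xmax.
    """
    intervals = []
    cursor = 0.0
    for text, start, end in words:
        if start > cursor:
            intervals.append((cursor, start, ""))  # silence before the word
        intervals.append((start, end, text))
        cursor = end
    if cursor < xmax:
        intervals.append((cursor, xmax, ""))  # trailing silence

    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = 0",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        "        xmin = 0",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (start, end, text) in enumerate(intervals, start=1):
        escaped = text.replace('"', '""')  # quotes inside labels are doubled
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"


words = [("Bonjour!", 0.5, 1.2), ("Est-ce", 2.02, 3.78)]
print(words_to_textgrid(words, xmax=10))
```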
Options that may improve results
Here are some options that are not enabled by default but might improve results.
Accurate Whisper transcription
As mentioned earlier, some decoding options are disabled by default to offer better efficiency. However, this can impact the quality of the transcription. To run with the options that have the best chance of providing a good transcription, use the following options.
* In Python:
```python
results = whisper_timestamped.transcribe(model, audio, beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0), ...)
```
* On the command line:
```bash
whisper_timestamped --accurate ...
```
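The Python call passes a tuple of temperatures, while the command-line alias uses --temperature_increment_on_fallback 0.2. Both describe the same fallback schedule, assuming the increment is applied from 0 up to 1.0 as the Whisper CLI does. The following sketch shows the correspondence; the helper name temperature_schedule is hypothetical.

```python
def temperature_schedule(start=0.0, increment=0.2, stop=1.0):
    """Return the fallback temperatures from start to stop, inclusive."""
    steps = int(round((stop - start) / increment))
    # round() removes floating-point noise such as 0.6000000000000001
    return tuple(round(start + i * increment, 10) for i in range(steps + 1))


print(temperature_schedule())
```

Decoding starts at the first temperature and moves to the next one only when a result fails Whisper's quality checks.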
Running Voice Activity Detection (VAD) before sending to Whisper
Whisper models can "hallucinate" text when given a segment without speech. This can be avoided by running VAD and gluing speech segments together before transcribing with the Whisper model. This is possible with whisper-timestamped.
* In Python:
```python
results = whisper_timestamped.transcribe(model, audio, vad=True, ...)
```
* On the command line:
```bash
whisper_timestamped --vad True ...
```
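Gluing speech segments together shortens the audio, so a timestamp measured on the glued audio has to be shifted back onto the original timeline. The sketch below illustrates that idea only; it is not whisper-timestamped's internal code, and the function name glued_to_original and the sample segments are hypothetical.

```python
def glued_to_original(t, speech_segments):
    """Map a time in the glued audio back to the original timeline.

    speech_segments: list of (start, end) pairs in original time, sorted
    and non-overlapping, as a VAD step might produce them.
    """
    offset = 0.0  # duration of glued audio consumed so far
    for start, end in speech_segments:
        duration = end - start
        if t <= offset + duration:
            return start + (t - offset)
        offset += duration
    # Past the end of the glued audio: clamp to the end of the last segment.
    return speech_segments[-1][1]


segments = [(1.0, 3.0), (5.0, 6.0)]  # speech at 1-3 s and 5-6 s
print(glued_to_original(0.5, segments))  # falls inside the first segment
print(glued_to_original(2.5, segments))  # falls inside the second segment
```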
Detecting disfluencies
Whisper models tend to remove speech disfluencies (filler words, hesitations, repetitions, etc.). Without precautions, disfluencies that are not transcribed affect the timestamp of the following word: the reported start of that word will actually be the start of the disfluency. whisper-timestamped provides heuristics to avoid this.
* In Python:
```python
results = whisper_timestamped.transcribe(model, audio, detect_disfluencies=True, ...)
```
* On the command line:
```bash
whisper_timestamped --detect_disfluencies True ...
```
Important: Note that when using this option, possible disfluencies will appear in the transcription as a special "[*]" word.
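If the "[*]" entries are not wanted in the final word list, they can be filtered out after transcription. This is a minimal sketch on hand-written sample data; the helper name drop_disfluencies is hypothetical.

```python
DISFLUENCY_MARK = "[*]"


def drop_disfluencies(words):
    """Return the word dicts whose text is not the disfluency marker."""
    return [w for w in words if w["text"].strip() != DISFLUENCY_MARK]


words = [
    {"text": "[*]", "start": 0.4, "end": 0.9},
    {"text": "Bonjour!", "start": 0.9, "end": 1.2},
]
print(drop_disfluencies(words))
```

Keeping the marker instead can be useful when the disfluent stretch should appear as its own labelled interval.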
Acknowledgment
- whisper: Whisper speech recognition (License MIT).
- dtw-python: Dynamic Time Warping (License GPL v3).
- whisper-timestamped: Multilingual Automatic Speech Recognition (License AGPL v3).
- TextGridTools: Read, write, and manipulate Praat TextGrid files with Python (License GPL v3).
Citation (CITATION)
@misc{lintoai2023whispertimestamped,
title={whisper-timestamped},
author={Louradour, J{\'e}r{\^o}me},
journal={GitHub repository},
year={2023},
publisher={GitHub},
howpublished = {\url{https://github.com/linto-ai/whisper-timestamped}}
}
@article{radford2022robust,
title={Robust speech recognition via large-scale weak supervision},
author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
journal={arXiv preprint arXiv:2212.04356},
year={2022}
}
@article{JSSv031i07,
title={Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package},
author={Giorgino, Toni},
journal={Journal of Statistical Software},
year={2009},
volume={31},
number={7},
doi={10.18637/jss.v031.i07}
}
Buschmeier, H. & Włodarczak, M. (2013). TextGridTools: A TextGrid processing and analysis toolkit for Python. In Proceedings der 24. Konferenz zur Elektronischen Sprachsignalverarbeitung, pp. 152–157, Bielefeld, Germany.