Speech-Recognition-using-WhisperAI

Multilingual speech recognition using WhisperAI that extracts words and their timestamps and writes them to a .TextGrid file.

https://github.com/esantiago28/Speech-Recognition-using-WhisperAI

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

Multilingual speech recognition using WhisperAI that extracts words and their timestamps and writes them to a .TextGrid file.

Basic Info
  • Host: GitHub
  • Owner: esantiago28
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 20.5 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed over 2 years ago
Metadata Files
  • Readme
  • Citation

README.md

Speech-Recognition-using-WhisperAI

Speech Recognition using WhisperAI is a Python script that uses Whisper and whisper-timestamped to automatically transcribe audio and output a .TextGrid file containing the start and end timestamp of each word. It supports any model and language that Whisper provides.

Installation

First Installation

Requirements:
  • python3 (version 3.7 or higher; at least 3.9 is recommended)
  • ffmpeg (see the installation instructions in the whisper repository)
  • tgt (version 1.4.4)

Installation of whisper-timestamped with pip:

```bash
pip3 install git+https://github.com/linto-ai/whisper-timestamped
```

or clone the repository and run the installation:

```bash
git clone https://github.com/linto-ai/whisper-timestamped
cd whisper-timestamped/
python3 setup.py install
```

Installation of tgt with pip:

```bash
pip install tgt
```

Additional packages might be needed, depending on usage.

Usage

Python

In Python, you can use the function whisper_timestamped.transcribe(), which is similar to the function whisper.transcribe():

```python
import whisper_timestamped
help(whisper_timestamped.transcribe)
```

The main difference with whisper.transcribe() is that the output will include a key "words" for all segments, with the start and end position of each word. Note that words will include punctuation. See the example below.

Besides, the default decoding options are different, to favour efficient decoding (greedy decoding instead of beam search, and no temperature sampling fallback). To get the same defaults as in whisper, use beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0).

There are also additional options related to word alignment.

In general, if you import whisper_timestamped instead of whisper in your Python script and use transcribe(model, ...) instead of model.transcribe(...), it should do the job:

```python
import whisper_timestamped as whisper

audio = whisper.load_audio("AUDIO.wav")

model = whisper.load_model("tiny", device="cpu")

result = whisper.transcribe(model, audio, language="fr")

import json
print(json.dumps(result, indent=2, ensure_ascii=False))
```
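Because the result is a plain dict, the per-segment "words" lists are easy to post-process. A minimal sketch: flatten them into a single list of (text, start, end) tuples. The `result` dict here is a hand-built stand-in mimicking the structure shown in the Example output section, not an actual transcription.

```python
# Sketch: flatten the per-segment "words" lists of a whisper_timestamped
# result into one (text, start, end) list. `result` is a hand-built
# stand-in with the field names shown in the example output.
result = {
    "segments": [
        {"words": [{"text": "Bonjour!", "start": 0.5, "end": 1.2}]},
        {"words": [{"text": "Est-ce", "start": 2.02, "end": 3.78},
                   {"text": "que", "start": 3.78, "end": 3.84}]},
    ]
}

words = [(w["text"], w["start"], w["end"])
         for seg in result["segments"]
         for w in seg["words"]]
print(words)
```

Such a flat word list is exactly what is needed to build the intervals of a TextGrid tier.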

Note that you can use a fine-tuned Whisper model from HuggingFace or a local folder by using the load_model method of whisper_timestamped. For instance, if you want to use whisper-large-v2-nob, you can simply do the following:

```python
import whisper_timestamped as whisper

model = whisper.load_model("NbAiLab/whisper-large-v2-nob", device="cpu")

...
```

Command line

You can also use whisper_timestamped on the command line, similarly to whisper. See help with:

```bash
whisper_timestamped --help
```

The main differences with the whisper CLI are:
  • Output files:
    • The output JSON contains word timestamps and confidence scores. See the example below.
    • There is an additional CSV output format.
    • For the SRT, VTT, and TSV formats, additional files are saved with word timestamps.
  • Some default options are different:
    • By default, no output folder is set: use --output_dir . for the Whisper default.
    • By default, there is no verbose output: use --verbose True for the Whisper default.
    • By default, beam search decoding and temperature sampling fallback are disabled, to favour efficient decoding. To get the same behaviour as the Whisper default, use --accurate (an alias for --beam_size 5 --temperature_increment_on_fallback 0.2 --best_of 5).
  • There are some additional specific options:
    • --compute_confidence to enable/disable the computation of confidence scores for each word.
    • --punctuations_with_words to decide whether punctuation marks should be included with the preceding words.

An example command to process several files using the tiny model and output the results in the current folder, as would be done by default with whisper, is as follows:

```bash
whisper_timestamped audio1.flac audio2.mp3 audio3.wav --model tiny --output_dir .
```

Note that you can use a fine-tuned Whisper model from HuggingFace or a local folder. For instance, if you want to use the whisper-large-v2-nob model, you can simply do the following:

```bash
whisper_timestamped --model NbAiLab/whisper-large-v2-nob <...>
```

Example output

Here is an example output of the whisper_timestamped.transcribe() function, which can be viewed by using the CLI:

```bash
whisper_timestamped AUDIO_FILE.wav --model tiny --language fr
```

```json
{
  "text": " Bonjour! Est-ce que vous allez bien?",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.5,
      "end": 1.2,
      "text": " Bonjour!",
      "tokens": [25431, 2298],
      "temperature": 0.0,
      "avg_logprob": -0.6674491882324218,
      "compression_ratio": 0.8181818181818182,
      "no_speech_prob": 0.10241222381591797,
      "confidence": 0.51,
      "words": [
        {"text": "Bonjour!", "start": 0.5, "end": 1.2, "confidence": 0.51}
      ]
    },
    {
      "id": 1,
      "seek": 200,
      "start": 2.02,
      "end": 4.48,
      "text": " Est-ce que vous allez bien?",
      "tokens": [50364, 4410, 12, 384, 631, 2630, 18146, 3610, 2506, 50464],
      "temperature": 0.0,
      "avg_logprob": -0.43492694334550336,
      "compression_ratio": 0.7714285714285715,
      "no_speech_prob": 0.06502953916788101,
      "confidence": 0.595,
      "words": [
        {"text": "Est-ce", "start": 2.02, "end": 3.78, "confidence": 0.441},
        {"text": "que", "start": 3.78, "end": 3.84, "confidence": 0.948},
        {"text": "vous", "start": 3.84, "end": 4.0, "confidence": 0.935},
        {"text": "allez", "start": 4.0, "end": 4.14, "confidence": 0.347},
        {"text": "bien?", "start": 4.14, "end": 4.48, "confidence": 0.998}
      ]
    }
  ],
  "language": "fr"
}
```

This gets parsed into a .TextGrid file with the following format:

```
File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 10
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "name"
        xmin = 0
        xmax = 4.653038548752835
        intervals: size = 7
        intervals [1]:
            xmin = 0
            xmax = 0.7242654159817712
            text = ""
        intervals [2]:
            xmin = 0.7242654159817712
            xmax = 0.9192658855632697
            text = "This"
        intervals [3]:
            xmin = 0.9192658855632697
            xmax = 1.08
            text = "is"
        intervals [4]:
            xmin = 1.08
            xmax = 1.1402664177556348
            text = "an"
        intervals [5]:
            xmin = 1.1402664177556348
            xmax = 1.48
            text = "example"
        intervals [6]:
            xmin = 1.48
            xmax = 1.9885184604351533
            text = "demo"
        intervals [7]:
            xmin = 1.9885184604351533
            xmax = 10
            text = ""
```
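The conversion itself is mechanical: pad the gaps between words with empty intervals so the tier covers the whole recording, then emit the long TextGrid format line by line. A minimal, dependency-free sketch is shown below; the hypothetical write_textgrid helper is for illustration only, and the actual script relies on the tgt package for this step.

```python
# Sketch (not the repository's actual code, which uses the tgt package):
# write a long-format Praat TextGrid from (text, start, end) word tuples,
# padding gaps between words with empty intervals.
def write_textgrid(words, total_duration, tier_name="words"):
    # Build a gap-free interval list covering [0, total_duration].
    intervals, cursor = [], 0.0
    for text, start, end in words:
        if start > cursor:
            intervals.append(("", cursor, start))  # silence before word
        intervals.append((text, start, end))
        cursor = end
    if cursor < total_duration:
        intervals.append(("", cursor, total_duration))  # trailing silence

    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = 0",
        f"xmax = {total_duration}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        "        xmin = 0",
        f"        xmax = {total_duration}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (text, start, end) in enumerate(intervals, 1):
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{text}"',
        ]
    return "\n".join(lines)

words = [("Bonjour!", 0.5, 1.2), ("Est-ce", 2.02, 3.78)]
print(write_textgrid(words, 10))
```

With tgt installed, the same result is obtained by adding tgt.Interval objects to a tgt.IntervalTier and calling tgt.write_to_file with format='long'.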

Options that may improve results

Here are some options that are not enabled by default but might improve results.

Accurate Whisper transcription

As mentioned earlier, some decoding options are disabled by default to offer better efficiency. However, this can impact the quality of the transcription. To run with the options that have the best chance of providing a good transcription, use the following options.

In Python:

```python
results = whisper_timestamped.transcribe(model, audio, beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0), ...)
```

On the command line:

```bash
whisper_timestamped --accurate ...
```

Running Voice Activity Detection (VAD) before sending to Whisper

Whisper models can "hallucinate" text when given a segment without speech. This can be avoided by running VAD and gluing speech segments together before transcribing with the Whisper model. This is possible with whisper-timestamped.

In Python:

```python
results = whisper_timestamped.transcribe(model, audio, vad=True, ...)
```

On the command line:

```bash
whisper_timestamped --vad True ...
```

Detecting disfluencies

Whisper models tend to remove speech disfluencies (filler words, hesitations, repetitions, etc.). Without precautions, disfluencies that are not transcribed will affect the timestamp of the following word: the start timestamp of the word will actually be the start timestamp of the disfluency. whisper-timestamped has heuristics to avoid this.

In Python:

```python
results = whisper_timestamped.transcribe(model, audio, detect_disfluencies=True, ...)
```

On the command line:

```bash
whisper_timestamped --detect_disfluencies True ...
```

Important: Note that when using these options, possible disfluencies will appear in the transcription as a special "[*]" word.

Acknowledgment

  • whisper: Whisper speech recognition (License MIT).
  • dtw-python: Dynamic Time Warping (License GPL v3).
  • whisper-timestamped: Multilingual Automatic Speech Recognition (License AGPL v3).
  • TextGridTools: Read, write, and manipulate Praat TextGrid files with Python (License GPL v3).

Owner

  • Login: esantiago28
  • Kind: user

Citation (CITATION)

@misc{lintoai2023whispertimestamped,
  title={whisper-timestamped},
  author={Louradour, J{\'e}r{\^o}me},
  journal={GitHub repository},
  year={2023},
  publisher={GitHub},
  howpublished = {\url{https://github.com/linto-ai/whisper-timestamped}}
}

@article{radford2022robust,
  title={Robust speech recognition via large-scale weak supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

@article{JSSv031i07,
  title={Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package},
  author={Giorgino, Toni},
  journal={Journal of Statistical Software},
  year={2009},
  volume={31},
  number={7},
  doi={10.18637/jss.v031.i07}
}

Buschmeier, H. & Włodarczak, M. (2013). TextGridTools: A TextGrid processing and analysis toolkit for Python. In Proceedings der 24. Konferenz zur Elektronischen Sprachsignalverarbeitung, pp. 152–157, Bielefeld, Germany.
