speechmlpipeline

SpeechMLPipeline is a complete pipeline to deploy Machine Learning Models to generate labelled and timestamped transcripts from audio inputs

https://github.com/princeton-ddss/speechmlpipeline

Science Score: 75.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 3 DOI reference(s) in README
✓ Academic publication links: Links to: zenodo.org
○ Academic email domains
✓ Institutional organization owner: Organization princeton-ddss has institutional domain (ddss.princeton.edu)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (7.7%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: princeton-ddss
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 126 MB

Statistics

Stars: 0
Watchers: 2
Forks: 1
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed about 2 years ago

Metadata Files

Readme License Citation

SpeechMLPipeline

SpeechMLPipeline is a Python package that runs the complete speech machine learning pipeline through one simple function, producing transcriptions with speaker labels from input audio files. At each step of the pipeline, SpeechMLPipeline applies widely used and innovative machine learning models:

* Audio-to-Text Transcription: OpenAI Whisper with timestamp adjustment
* Speaker Change Detection: PyAnnote, Audio-based Spectral Clustering Model, Text-based Llama2-70b Speaker Change Detection Model, Rule-based NLP Speaker Change Detection Model, Ensemble Audio-and-text-based Speaker Change Detection Model
* Speaker Identification: Speechbrain

Audio-to-text Transcription

* OpenAI Whisper is selected for audio-to-text transcription as it is the most accurate model available for English transcription. Whisper with timestamp adjustment reduces the misalignment between the timestamps and the transcribed text by identifying silent parts and predicting timestamps at the word level.

Speaker Change Detection

* PyAnnote is one of the most popular models for speaker diarization. It detects speaker changes by applying clustering methods to audio features; the speaker change detection results are inferred directly from the speaker diarization results.
* The Audio-based Spectral Clustering Model extracts audio features with Librosa and applies spectral clustering to them. This approach is one of the most common speaker change detection models used in academic research.
* The Text-based Llama2-70b Speaker Change Detection Model is an innovative speaker change detection model based on LLMs. It asks Llama2 whether the speaker changes between two consecutive text segments, relying on the semantic relationship between the two texts.
* The Rule-based NLP Speaker Change Detection Model detects speaker changes by analyzing text with well-defined rules based on human comprehension.
* The Ensemble Audio-and-text-based Speaker Change Detection Models are built by ensembling the audio-based and text-based speaker change detection models. Voting methods aggregate the predictions of the speaker change detection models above, except for the Rule-based NLP model; the aggregated predictions are then corrected by the Rule-based NLP model. There are two ensemble models: the Majority Model, based on majority voting, and the Singularity Model, based on singularity voting.
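The two voting schemes can be sketched as follows. This is a minimal illustration and not the package's implementation. It assumes that majority voting predicts a speaker change when more than half of the models do, and that singularity voting predicts a change when at least one model does. The function names, and the way the rule-based result overrides the aggregated vote, are hypothetical.

```python
from typing import Callable, Optional, Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Assumed rule: predict a change when more than half of the models predict it."""
    return sum(votes) > len(votes) / 2


def singularity_vote(votes: Sequence[bool]) -> bool:
    """Assumed rule: predict a change when at least one model predicts it."""
    return any(votes)


def ensemble(
    votes: Sequence[bool],
    rule_based: Optional[bool],
    voting: Callable[[Sequence[bool]], bool] = majority_vote,
) -> bool:
    """Aggregate the model votes, then apply the rule-based correction.

    Assumption: `rule_based` is None when no rule fired; otherwise its
    value replaces the aggregated prediction.
    """
    aggregated = voting(votes)
    return aggregated if rule_based is None else rule_based


# Votes from three hypothetical models for one pair of consecutive segments.
votes = [True, False, False]
print(majority_vote(votes))     # False: only one of three models voted for a change
print(singularity_vote(votes))  # True: a single vote is enough
```

Under these assumptions, singularity voting predicts at least as many speaker changes as majority voting, so it favors recall over precision.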

Speaker Identification

* The Speechbrain models perform speaker identification by comparing the similarity between the vector embedding of each input audio segment and the embeddings of the labelled speakers' audio segments.
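The embedding comparison can be illustrated with a small sketch. It assumes cosine similarity as the verification score and uses plain lists of floats in place of real Speechbrain embeddings. The function names and the example speakers are hypothetical and are not part of the package.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify_speaker(
    segment: Sequence[float],
    labelled: Dict[str, Sequence[float]],
    threshold: float,
) -> str:
    """Return the best-matching labelled speaker, or "OTHERS" when the
    best score falls below the threshold."""
    scores = {name: cosine_similarity(segment, emb) for name, emb in labelled.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "OTHERS"


# Toy two-dimensional embeddings for two hypothetical labelled speakers.
labelled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
print(identify_speaker([0.9, 0.1], labelled, threshold=0.5))    # alice
print(identify_speaker([-1.0, -1.0], labelled, threshold=0.5))  # OTHERS
```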

Create a New Python Environment to Avoid Package Version Conflicts If Needed

python -m venv <envname>
source <envname>/bin/activate

Dependencies Installation

Please download requirements.txt from the main repo folder to install the package dependencies.

pip install -r <.../requirements.txt>

Package Installation

The speechmlpipeline package can be installed from either PyPI or GitHub.

Install speechmlpipeline via PyPI for the stable version of the package

pip install speechmlpipeline

Install speechmlpipeline via GitHub for the latest version of the package

git lfs install
git clone https://github.com/princeton-ddss/SpeechMLPipeline
cd <.../SpeechMLPipeline>
pip install .

Download Models Offline to Run Them without Internet Connection

Download the spaCy NLP Model by Running the Command Below in a Terminal

python -m spacy download en_core_web_lg

Download Whisper, Llama2, and Speechbrain Models by using the Download Module from the Package

<hfaccesstoken> is the access token to Hugging Face. Please create a Hugging Face account if you do not have one. A new access token can be created by following the instructions.

<modelslist> is the list of names of models to be downloaded. Usually, the value of models_list should be set as ['whisper', 'llama2-70b', 'speechbrain'].

<downloadmodelpath> is the local path where all the downloaded models will be saved.

```python
from speechmlpipeline.DownloadModels.downloadmodelsmainfunction import downloadmodelsmainfunction

downloadmodelsmainfunction(<downloadmodelpath>, <modelslist>, <hfaccesstoken>)
```

Download PyAnnote Models using Dropbox Link

To download the PyAnnote models, please download the pyannote3.1 folder from this Dropbox Link.

To use the PyAnnote models, please replace the placeholder path in pyannote3.1/Diarization/config.yaml and pyannote3.1/Segmentation/config.yaml with the local parent folder of the downloaded pyannote3.1 folder.
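The path replacement in the two config files can also be scripted. The sketch below is a minimal example and assumes the text to be replaced appears literally in both config.yaml files. The helper name `set_model_folder` and its `placeholder` argument are hypothetical, so please check the downloaded files for the exact text to replace.

```python
from pathlib import Path


def set_model_folder(pyannote_dir: str, placeholder: str, parent_folder: str) -> None:
    """Replace `placeholder` with `parent_folder` in both PyAnnote config files.

    `pyannote_dir` is the local path of the downloaded pyannote3.1 folder.
    """
    for sub in ("Diarization", "Segmentation"):
        config = Path(pyannote_dir) / sub / "config.yaml"
        text = config.read_text(encoding="utf-8")
        config.write_text(text.replace(placeholder, parent_folder), encoding="utf-8")


# Example call (all three arguments are illustrative values):
# set_model_folder("/models/pyannote3.1", "<text to replace>", "/models")
```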

Usage

The complete pipeline can be run with the run_speech_ml_pipeline function, which can be imported directly:

from speechmlpipeline import run_speech_ml_pipeline

The run_speech_ml_pipeline function takes four class instances, one for each step of the Speech Machine Learning Pipeline, as inputs:

* transcription: TranscriptionInputs Class to specify inputs to run OpenAI Whisper for Audio-to-Text Transcription with Timestamps Adjustment
* speakerchangedetection: SpeakerChangeDetectionInputs Class to specify inputs to run various models including PyAnnote Model, Spectral Clustering, Llama2, and NLP Rule-Based Analysis for Speaker Change Detection
* ensembledetection: EnsembleDetectionInputs Class to specify inputs to build an Ensemble Model of Speaker Change Detection by considering both audio and textual features
* speakeridentification: SpeakerIdentificationInputs Class to specify inputs to run Speechbrain Verification Model for Speaker Identification

To run the complete pipeline, call the function as follows:

run_speech_ml_pipeline(transcription=TranscriptionInputs(...), speakerchangedetection=SpeakerChangeDetectionInputs(...), ensembledetection=EnsembleDetectionInputs(...), speakeridentification=SpeakerIdentificationInputs(...))

To run only particular steps, pass only the inputs corresponding to those steps.

For instance, to run all remaining steps of the pipeline with existing transcriptions:

run_speech_ml_pipeline(speakerchangedetection=SpeakerChangeDetectionInputs(...), ensembledetection=EnsembleDetectionInputs(...), speakeridentification=SpeakerIdentificationInputs(...))

To run speaker change detection with existing transcriptions:

run_speech_ml_pipeline(speakerchangedetection=SpeakerChangeDetectionInputs(...), ensembledetection=EnsembleDetectionInputs(...))

To run speaker identification with existing transcriptions and speaker change detection results:

run_speech_ml_pipeline(speakeridentification=SpeakerIdentificationInputs(...))

Please see the descriptions below to specify the attributes of the class instance corresponding to each step of the pipeline. Please convert audio files to wav format before running the whole pipeline or speaker identification.

* TranscriptionInputs
  * audiofileinputpath: A path which contains the audio file
  * audiofileinputname: An audio file name ending with .wav
  * whispermodelpath: A path where the Whisper model files are saved
  * whisperoutputpath: A path to save the csv file of transcription outputs
  * device: Torch device type to run the model; if device is set as None, the GPU is used automatically if it is available
  * onlyruninenglish: True or False, indicating whether Whisper is run only when the language identified in the audio file is English
* SpeakerChangeDetectionInputs
  * audiofileinputpath: A path which contains an input audio file
  * audiofileinputname: An audio file name including the file type
  * minspeakers: The minimum number of speakers in the input audio file
  * maxspeakers: The maximum number of speakers in the input audio file
  * whisperoutputpath: A path where a Whisper transcription output csv file is saved
  * whisperoutputfilename: A Whisper transcription output csv file name ending with .csv
  * detectionmodels: A list of names of speaker change detection models to be run
  * detectionoutputpath: A path to save the speaker change detection output as a csv file
  * hfaccesstoken: Access token to HuggingFace
  * llama2modelpath: A path where the Llama2 model files are saved
  * pyannotemodelpath: A path where the PyAnnote model files are saved
  * device: Torch device type to run the model; if device is set as None, the GPU is used automatically if it is available
  * detectionllama2outputpath: A path where the pre-run Llama2 speaker change detection output csv file is saved
  * tempoutputpath: A path to save the Llama2 speaker change detection output of the current run, to avoid rerunning it in the future
* EnsembleDetectionInputs
  * detectionfileinputpath: A path where the speaker change detection output csv file is saved
  * detectionfileinputname: A speaker change detection output csv file name ending with .csv
  * ensembleoutputpath: A path to save the ensemble detection output as a csv file
  * ensemblevoting: A list of voting methods to be used to build the final ensemble model
* SpeakerIdentificationInputs
  * detectionfileinputpath: A path where the speaker change detection output csv file is saved
  * detectionfileinputname: A speaker change detection output csv file name ending with .csv
  * audiospeakerfileinputpath: A path which contains a verified audio file of each speaker
  * audiofileinputpath: A path which contains an input audio file
  * verificationmodelpath: A path where the speaker verification model files are saved, default to None
  * speakerchangecol: A column name in the detection output csv file which specifies which speaker change detection model result is used for speaker identification
  * verificationscorethreshold: A score threshold, ranging from a negative value to 1; the speaker is identified as "OTHERS" if the verification score is below this threshold
  * identificationoutputpath: A path to save the speaker identification output as a csv file
  * tempoutputpath: A path to save the temporary cut audio file of each segment

Please see the sample code for running the function in sample_run.py and samplerunexistingllama2output.py in the src/speechmlpipeline folder. For detailed function and class descriptions, please refer to src/speechmlpipeline/mainpipelinelocal_function.py

Evaluation

The prediction performance of the following speaker change detection models is summarized below:

* Audio-based Model: PyAnnote
* Text-based Model: The Llama2 Model
* Audio-and-Text based Models: The Singularity Model and the Majority Model

VoxConverse Dataset v0.3

VoxConverse is an audio-visual diarization dataset consisting of over 50 hours of multispeaker clips of human speech extracted from YouTube videos, usually from political debates or news segments to ensure multi-speaker dialogue. The proportion of speaker changes varies widely across the audio files in the dataset, which makes it well suited to evaluating the robustness of the models.

Average Coverage, Purity, Precision, and Recall

|           | PyAnnote | Llama2 | Singularity | Majority |
|-----------|----------|--------|-------------|----------|
| Coverage  | 86%      | 45%    | 59%         | 84%      |
| Purity    | 83%      | 89%    | 87%         | 70%      |
| Precision | 23%      | 14%    | 24%         | 32%      |
| Recall    | 19%      | 32%    | 41%         | 19%      |

AMI Headset Mix

The AMI Meeting Corpus is a multi-modal data set consisting of 100 hours of meeting recordings. Around two-thirds of the data has been elicited using a scenario in which the participants play different roles in a design team, taking a design project from kick-off to completion over the course of a day. The rest consists of naturally occurring meetings in a range of domains. Unlike the VoxConverse dataset, the AMI dataset is less diverse, as it consists only of meeting recordings. The median and average proportions of speaker change are both around 78%, and the minimum proportion is above 59%. Thus, the evaluation based on AMI is better suited to measuring model performance in a regular conversational setting.

Average Coverage, Purity, Precision, and Recall

|           | PyAnnote | Llama2 | Singularity | Majority |
|-----------|----------|--------|-------------|----------|
| Coverage  | 89%      | 75%    | 80%         | 92%      |
| Purity    | 60%      | 65%    | 64%         | 46%      |
| Precision | 44%      | 32%    | 40%         | 46%      |
| Recall    | 18%      | 18%    | 25%         | 11%      |
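For readers unfamiliar with the Precision and Recall rows, the sketch below shows the standard definitions applied to per-segment speaker change labels (True means the speaker changes at that segment). It is an illustration only: the evaluation reported above may compute these metrics differently, the function name is hypothetical, and Coverage and Purity are not covered here.

```python
from typing import Sequence, Tuple


def precision_recall(predicted: Sequence[bool], actual: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall of predicted speaker changes against the true ones."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)      # changes correctly detected
    fp = sum(p and not a for p, a in pairs)  # changes predicted but not real
    fn = sum(a and not p for p, a in pairs)  # real changes that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


predicted = [True, False, True, False, False]
actual = [True, True, False, False, True]
print(precision_recall(predicted, actual))  # (0.5, 0.3333333333333333)
```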

For the detailed descriptions of the models, metrics, and analysis, please download the evaluation_analysis pdf file from the AudioAndTextBasedSpeakerChangeDetection repo.

License

Owner

Name: Princeton DDSS
Login: princeton-ddss
Kind: organization
Location: United States of America

Website: https://ddss.princeton.edu
Repositories: 1
Profile: https://github.com/princeton-ddss

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: speechmlpipeline
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Junying
    family-names: Fang
    email: jf3375@princeton.edu
identifiers:
  - type: doi
    value: 10.5281/zenodo.10712895
repository-code: 'https://github.com/princeton-ddss/SpeechMLPipeline'
abstract: >-
  SpeechMLPipeline is a Python package for users to run the
  complete speech machine learning pipeline via one simple
  function (Audio-to-Text Transcription, Speaker Change
  Detection, and Speaker Identification) to get
  transcriptions with speaker labels from input audio
  files. 
license: MIT

Packages

Total packages: 1
Total downloads:
- pypi 7 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 3
Total maintainers: 1

pypi.org: speechmlpipeline

A package of speech machine learning pipeline to automatically get transcriptions with speaker labels from audio inputs

Homepage: https://github.com/princeton-ddss/SpeechMLPipeline
Documentation: https://speechmlpipeline.readthedocs.io/
License: MIT License
Latest release: 1.1.0
published about 2 years ago

Versions: 3
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 7 Last month

Rankings

Dependent packages count: 9.8%

Average: 37.2%

Dependent repos count: 64.6%

Maintainers (1)

fjying

Dependencies

pyproject.toml pypi

numpy *

requirements.txt pypi

accelerate *
audiotextspeakerchangedetect *
auditok *
huggingface-hub *
pyannote.audio ==3.1.1
pyannote.core ==5.0.0
pyannote.database ==5.0.1
pyannote.metrics ==3.2.1
pyannote.pipeline ==3.0.1
pydub *
resemblyzer *
spacy *
spectralcluster *
speechbrain ==0.5.16
torch *
torchaudio *
torchvision *
transformers *
whisper-timestamped ==1.14.1
