audiotextspeakerchangedetect
Detect Speaker Change based on Textual Features via LLMs & Rule-Based NLP and Audio Features via Pyannote & Spectral Clustering
https://github.com/princeton-ddss/audioandtextbasedspeakerchangedetection
Science Score: 75.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 3 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ✓ Institutional organization owner: organization princeton-ddss has institutional domain (ddss.princeton.edu)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.9%) to scientific vocabulary
Repository
Detect Speaker Change based on Textual Features via LLMs & Rule-Based NLP and Audio Features via Pyannote & Spectral Clustering
Basic Info
Statistics
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 4
- Releases: 0
Metadata Files
README.md
Audiotextspeakerchangedetect
Audiotextspeakerchangedetect is a Python package to detect speaker change by analyzing both audio and textual features.
The package develops and applies Large Language Models and a rule-based NLP model to detect speaker change based on textual features.
Currently, the package provides a main function so users can pass transcriptions directly to Llama2-70b for speaker change detection. The speaker change detection prompt is meticulously crafted so that Llama2 understands its role, performs the detection for almost every segment, and returns its answer in a standardized JSON format. Specifically, the texts of the current segment and the next segment are shown to Llama2, which is asked whether the speaker changes across the two segments based on how the texts interrelate. The code parses input CSV files into prompts and parses the returned answers back into CSV files, handling possible missing values and mismatches.
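As a rough illustration of this pairing-and-parsing flow, the sketch below builds a prompt from two consecutive segments and parses a JSON reply. Both helper names (`build_prompt`, `parse_answer`) and the prompt wording are hypothetical, not the package's actual prompt:

```python
import json

def build_prompt(current_text: str, next_text: str) -> str:
    # Hypothetical prompt; the package's real prompt is more elaborate.
    return (
        "You are detecting speaker change between two consecutive "
        "transcription segments.\n"
        f"Segment A: {current_text}\n"
        f"Segment B: {next_text}\n"
        'Answer only in JSON, e.g. {"speaker_change": true}.'
    )

def parse_answer(raw: str):
    # Treat malformed or unexpected replies as missing values, since the
    # parser must tolerate missing values and mismatches.
    try:
        answer = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return answer.get("speaker_change") if isinstance(answer, dict) else None
```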
In addition to Llama2, a rule-based NLP model is developed to detect speaker change by encoding human comprehension of the text. Well-defined patterns in text segments allow humans to identify, with a high degree of certainty, that the speaker changes across those segments. Using the spaCy NLP model, this human comprehension is written as rules in code that check whether the well-defined patterns occur in the text segments. The rules were developed by analyzing OpenAI Whisper transcription segments. Specifically:
* If the segment starts with a lowercase character, the segment continues the previous sentence, so the speaker does not change.
* If a sentence ends with ? and the following sentence ends with ., the speaker changes in the next segment.
* If the segment begins with a conjunction, the speaker does not change.
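A minimal sketch of these three rules, using plain string checks rather than the spaCy pipeline the package actually uses; the function name and conjunction list are illustrative, and `None` means no rule fired so the other models must decide:

```python
def rule_based_speaker_change(prev_segment: str, segment: str):
    """Apply the three hand-written rules. Returns True/False when a rule
    fires, or None when none applies and other models must decide."""
    prev, seg = prev_segment.strip(), segment.strip()
    if not seg:
        return None
    # Illustrative conjunction list; the package derives this with spaCy.
    conjunctions = {"and", "but", "or", "so", "because", "although"}
    # Rule 1: a lowercase opening character continues the previous
    # sentence, so the speaker does not change.
    if seg[0].islower():
        return False
    # Rule 3: a segment beginning with a conjunction keeps the same speaker.
    if seg.split()[0].lower() in conjunctions:
        return False
    # Rule 2: a question followed by a declarative sentence implies the
    # speaker changes (question asked, someone else answers).
    if prev.endswith("?") and seg.endswith("."):
        return True
    return None
```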
Besides text features, audio features are used to detect speaker change via two widely used clustering methods, Pyannote and Spectral Clustering.
In the end, the ensemble audio-and-text-based speaker change detection model is built by aggregating the predictions of all the speaker change detection models. Two types of ensemble model differ in how they combine the predictions of the three models above (Pyannote, Spectral Clustering, and Llama2-70b):
* The Majority Model uses majority voting: it predicts a speaker change as true if the majority of the models predict it as true.
* The Singularity Model uses singularity voting: it predicts a speaker change as true if any of the models predicts it as true.
The ensemble models then correct the aggregated predictions using the rule-based NLP analysis to produce their final predictions. Specifically, when a rule developed from human comprehension fires, the ensemble model adopts the rule's true/false prediction in place of the aggregated one.
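The two voting schemes and the rule-based correction can be sketched as follows; `ensemble_vote` and `apply_rule_correction` are hypothetical helper names, not the package's API:

```python
def ensemble_vote(predictions, voting="majority"):
    """Aggregate per-segment True/False predictions from the base models
    (Pyannote, Spectral Clustering, Llama2-70b)."""
    if voting == "majority":
        # Majority Model: true when most base models say true.
        return sum(predictions) > len(predictions) / 2
    if voting == "singularity":
        # Singularity Model: true when any base model says true.
        return any(predictions)
    raise ValueError(f"unknown voting method: {voting}")


def apply_rule_correction(ensemble_prediction, rule_prediction):
    """When the rule-based NLP analysis fires (returns True/False rather
    than None), its verdict overrides the aggregated prediction."""
    return rule_prediction if rule_prediction is not None else ensemble_prediction
```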
Create a New Python Environment to Avoid Package Version Conflicts, If Needed
python -m venv <envname>
source <envname>/bin/activate
Install the Package
The package Audiotextspeakerchangedetect can be installed via either PyPI or GitHub.
Install via PyPI for the Stable Version of the Package
pip install audiotextspeakerchangedetect
Install via GitHub Repo for the Latest Version of the Package
git lfs install
git clone https://github.com/princeton-ddss/AudioAndTextBasedSpeakerChangeDetection
cd <.../AudioAndTextBasedSpeakerChangeDetection>
pip install .
Download Models Offline to Run Them without Internet Connection
Download Spacy NLP Model by Running Commands below in Terminal
python -m spacy download en_core_web_lg
Download the Llama2 Model by Running the Code below in Python
A new access token can be created by following the HuggingFace instructions.
from huggingface_hub import login
login(token=<your_hf_access_token>)  # replace <your_hf_access_token> with your HuggingFace access token
Download PyAnnote Models using Dropbox Link
To download the Pyannote models, please download the pyannote3.1 folder in this Dropbox Link.
To use the Pyannote models, please replace
Usage
The audio-and-text-based ensemble speaker change detection model can be applied to get speaker change detection results by running a single function: run_ensemble_audio_text_based_speaker_change_detection_model in src/audiotextspeakerchangedetect/main.py.
```
from audiotextspeakerchangedetect.main import run_ensemble_audio_text_based_speaker_change_detection_model

run_ensemble_audio_text_based_speaker_change_detection_model(detection_models, min_speakers, max_speakers, audio_file_input_path, audio_file_input_name, transcription_input_path, transcription_file_input_name, detection_output_path, hf_access_token, llama2_model_path, pyannote_model_path, device, detection_llama2_output_path, temp_output_path, ensemble_voting)
```
Please view the descriptions of the function inputs:
* detection_models: A list of names of the speaker change detection models to run
* min_speakers: The minimum number of speakers in the input audio file
* max_speakers: The maximum number of speakers in the input audio file
* audio_file_input_path: The path containing the input audio file
* audio_file_input_name: The audio file name, including the file extension
* transcription_input_path: The path where the transcription output CSV file is saved
* transcription_file_input_name: The transcription output CSV file name, ending with .csv
* detection_output_path: The path to save the speaker change detection output CSV file
* hf_access_token: HuggingFace access token
* llama2_model_path: The path where the Llama2 model files are saved
* pyannote_model_path: The path where the Pyannote model files are saved
* device: Torch device type to run the model; defaults to None, so a GPU is used automatically if available
* detection_llama2_output_path: The path where a pre-run Llama2 speaker change detection output CSV file is saved, if one exists; defaults to None
* temp_output_path: The path to save the current run's Llama2 speaker change detection output to avoid future rerunning; defaults to None
* ensemble_output_path: The path to save the ensemble detection output CSV file
* ensemble_voting: A list of voting methods used to build the final ensemble model
Please view the sample code for running the function in sample_run.py and sample_run_existing_llama2_output.py in the src/audiotextspeakerchangedetect folder. Please view the detailed descriptions of the function and its inputs inside the Python file src/audiotextspeakerchangedetect/main.py.
Please note that running Llama2-70b requires at least 2 GPUs and 250GB of memory. If the computing resources for running Llama2-70b are not available, please exclude llama2-70b from the detection_models input.
Evaluation
VoxConverse is an audio-visual diarization dataset consisting of over 50 hours of multispeaker clips of human speech, extracted from YouTube videos, usually in a political debate or news segment context to ensure multi-speaker dialogue. The proportion of speaker changes varies widely across the audio files, which makes the dataset effective for evaluating the models' robustness.
Average Coverage, Purity, Precision, and Recall
|           | PyAnnote | Llama2 | Singularity | Majority |
|-----------|----------|--------|-------------|----------|
| Coverage  | 86%      | 45%    | 59%         | 84%      |
| Purity    | 83%      | 89%    | 87%         | 70%      |
| Precision | 23%      | 14%    | 24%         | 32%      |
| Recall    | 19%      | 32%    | 41%         | 19%      |
The AMI Meeting Corpus is a multi-modal dataset consisting of 100 hours of meeting recordings. Around two-thirds of the data was elicited using a scenario in which the participants play different roles in a design team, taking a design project from kick-off to completion over the course of a day. The rest consists of naturally occurring meetings in a range of domains. Unlike the VoxConverse dataset, the AMI dataset is less diverse, as it consists only of meeting recordings. Both the median and average proportion of speaker change are around 78%, and the minimum proportion is above 59%. Thus, the AMI-based evaluation is more applicable for measuring the models' performance in a regular conversational setting.
Average Coverage, Purity, Precision, and Recall
|           | PyAnnote | Llama2 | Singularity | Majority |
|-----------|----------|--------|-------------|----------|
| Coverage  | 89%      | 75%    | 80%         | 92%      |
| Purity    | 60%      | 65%    | 64%         | 46%      |
| Precision | 44%      | 32%    | 40%         | 46%      |
| Recall    | 18%      | 18%    | 25%         | 11%      |
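The precision and recall figures above score predicted change points against reference ones. A toy version of such matching, with a hypothetical `change_point_precision_recall` helper and a tolerance window in seconds, might look like this; the reported numbers come from the full evaluation analysis, not this sketch:

```python
def change_point_precision_recall(pred, true, tolerance=0.5):
    """Match each predicted speaker-change timestamp to an unused true
    change point within `tolerance` seconds, then compute precision and
    recall over the matches. Illustrative only."""
    matched = set()  # indices of true change points already claimed
    tp = 0
    for p in pred:
        for i, t in enumerate(true):
            if i not in matched and abs(p - t) <= tolerance:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    return precision, recall
```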
For the detailed evaluation analysis, please refer to evaluation_analysis.pdf in the main repo folder.
License
Owner
- Name: Princeton DDSS
- Login: princeton-ddss
- Kind: organization
- Location: United States of America
- Website: https://ddss.princeton.edu
- Repositories: 1
- Profile: https://github.com/princeton-ddss
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: 'audiotextspeakerchangedetect '
message: >-
If you use this software, please cite it using the
metadata from this file.
type: software
authors:
- given-names: Junying
family-names: Fang
email: jf3375@princeton.edu
identifiers:
- type: doi
value: 10.5281/zenodo.10712695
repository-code: >-
https://github.com/princeton-ddss/AudioAndTextBasedSpeakerChangeDetection
abstract: >-
Audiotextspeakerchangedetect is a Python Package to Detect
Speaker Change based on Textual Features via LLMs &
Rule-Based NLP and Audio Features via Pyannote & Spectral
Clustering
license: MIT
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Packages
- Total packages: 2
- Total downloads: pypi 56 last-month
- Total dependent packages: 0 (may contain duplicates)
- Total dependent repositories: 0 (may contain duplicates)
- Total versions: 6
- Total maintainers: 1
pypi.org: audiotextspeakerchangedetect
A Package to Detect Speaker Change based on Textual Features via LLMs & Rule-Based NLP and Audio Features via Pyannote & Spectral Clustering
- Homepage: https://github.com/princeton-ddss/AudioAndTextBasedSpeakerChangeDetection
- Documentation: https://audiotextspeakerchangedetect.readthedocs.io/
- License: MIT License
- Latest release: 1.3.0 (published almost 2 years ago)
Rankings
Maintainers (1)
pypi.org: speakerchangedetect
A Package to Detect Speaker Change based on Textual Features via LLMs & Rule-Based NLP and Audio Features via Pyannote & Spectral Clustering
- Homepage: https://github.com/princeton-ddss/AudioAndTextBasedSpeakerChangeDetection
- Documentation: https://speakerchangedetect.readthedocs.io/
- License: MIT License
- Latest release: 0.0.1 (published about 2 years ago)