audiotextspeakerchangedetect
Detect Speaker Change based on Textual Features via LLMs & Rule-Based NLP and Audio Features via Pyannote & Spectral Clustering
https://github.com/princeton-ddss/audioandtextbasedspeakerchangedetection
Science Score: 75.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 3 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ✓ Institutional organization owner: organization princeton-ddss has institutional domain (ddss.princeton.edu)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.9%) to scientific vocabulary
Repository
Detect Speaker Change based on Textual Features via LLMs & Rule-Based NLP and Audio Features via Pyannote & Spectral Clustering
Basic Info
Statistics
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 4
- Releases: 0
Metadata Files
README.md
Audiotextspeakerchangedetect
Audiotextspeakerchangedetect is a Python package to detect speaker change by analyzing both audio and textual features.
The package develops and applies Large Language Models and a rule-based NLP model to detect speaker change based on textual features.
Currently, the package provides a main function so users can pass transcriptions directly to Llama2-70b for speaker change detection. The speaker change detection prompt is meticulously crafted so that Llama2 understands its role, performs the detection for almost every segment, and returns its answer in a standardized JSON format. Specifically, the texts of the current segment and the next segment are shown to Llama2, which is asked whether the speaker changes across the two segments based on how the texts interrelate. The code parses input CSV files into prompts and parses the returned answers back into CSV files, handling possible missing values and mismatches.
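As a rough illustration of this pairing-and-parsing flow, the sketch below builds a prompt from two consecutive segments and parses a JSON reply. Both helper names (`build_prompt`, `parse_answer`) and the prompt wording are hypothetical, not the package's actual prompt:

```python
import json

def build_prompt(current_text: str, next_text: str) -> str:
    # Hypothetical prompt; the package's real prompt is more elaborate.
    return (
        "You are detecting speaker change between two consecutive "
        "transcription segments.\n"
        f"Segment A: {current_text}\n"
        f"Segment B: {next_text}\n"
        'Answer only in JSON, e.g. {"speaker_change": true}.'
    )

def parse_answer(raw: str):
    # Treat malformed or unexpected replies as missing values, since the
    # parser must tolerate missing values and mismatches.
    try:
        answer = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return answer.get("speaker_change") if isinstance(answer, dict) else None
```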
In addition to Llama2, a rule-based NLP model is developed to detect speaker change by encoding human comprehension of the text. Well-defined patterns in text segments allow humans to identify, with a high degree of certainty, that the speaker changes across those segments. Using the spaCy NLP model, this human comprehension is written as rules in code that check whether the well-defined patterns occur in the text segments. The rules were developed by analyzing OpenAI Whisper transcription segments. Specifically:
* If the segment starts with a lowercase character, the segment continues the previous sentence, so the speaker does not change.
* If a sentence ends with ? and the following sentence ends with ., the speaker changes in the next segment.
* If the segment begins with a conjunction, the speaker does not change.
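A minimal sketch of these three rules, using plain string checks rather than the spaCy pipeline the package actually uses; the function name and conjunction list are illustrative, and `None` means no rule fired so the other models must decide:

```python
def rule_based_speaker_change(prev_segment: str, segment: str):
    """Apply the three hand-written rules. Returns True/False when a rule
    fires, or None when none applies and other models must decide."""
    prev, seg = prev_segment.strip(), segment.strip()
    if not seg:
        return None
    # Illustrative conjunction list; the package derives this with spaCy.
    conjunctions = {"and", "but", "or", "so", "because", "although"}
    # Rule 1: a lowercase opening character continues the previous
    # sentence, so the speaker does not change.
    if seg[0].islower():
        return False
    # Rule 3: a segment beginning with a conjunction keeps the same speaker.
    if seg.split()[0].lower() in conjunctions:
        return False
    # Rule 2: a question followed by a declarative sentence implies the
    # speaker changes (question asked, someone else answers).
    if prev.endswith("?") and seg.endswith("."):
        return True
    return None
```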
Besides text features, audio features are used to detect speaker change via two widely used clustering methods, Pyannote and Spectral Clustering.
In the end, the ensemble audio-and-text-based speaker change detection model is built by aggregating the predictions of all the speaker change detection models. Two types of ensemble model differ in how they combine the predictions of the three models above (Pyannote, Spectral Clustering, and Llama2-70b):
* The Majority Model uses majority voting: it predicts a speaker change as true if the majority of the models predict it as true.
* The Singularity Model uses singularity voting: it predicts a speaker change as true if any of the models predicts it as true.
The ensemble models then correct the aggregated predictions using the rule-based NLP analysis to produce their final predictions. Specifically, when a rule developed from human comprehension fires, the ensemble model adopts the rule's true/false prediction in place of the aggregated one.
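The two voting schemes and the rule-based correction can be sketched as follows; `ensemble_vote` and `apply_rule_correction` are hypothetical helper names, not the package's API:

```python
def ensemble_vote(predictions, voting="majority"):
    """Aggregate per-segment True/False predictions from the base models
    (Pyannote, Spectral Clustering, Llama2-70b)."""
    if voting == "majority":
        # Majority Model: true when most base models say true.
        return sum(predictions) > len(predictions) / 2
    if voting == "singularity":
        # Singularity Model: true when any base model says true.
        return any(predictions)
    raise ValueError(f"unknown voting method: {voting}")


def apply_rule_correction(ensemble_prediction, rule_prediction):
    """When the rule-based NLP analysis fires (returns True/False rather
    than None), its verdict overrides the aggregated prediction."""
    return rule_prediction if rule_prediction is not None else ensemble_prediction
```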
Create a New Python Environment to Avoid Package Version Conflicts, If Needed
python -m venv <envname>
source <envname>/bin/activate
Install the Package
The package Audiotextspeakerchangedetect can be installed via either PyPI or GitHub.
Install via PyPI for the Stable Version of the Package
pip install audiotextspeakerchangedetect
Install via GitHub Repo for the Latest Version of the Package
git lfs install
git clone https://github.com/princeton-ddss/AudioAndTextBasedSpeakerChangeDetection
cd <.../AudioAndTextBasedSpeakerChangeDetection>
pip install .
Download Models Offline to Run Them without Internet Connection
Download Spacy NLP Model by Running Commands below in Terminal
python -m spacy download en_core_web_lg
Download the Llama2 Model by Running the Code below in Python
A new access token can be created by following the HuggingFace instructions.
from huggingface_hub import login
login(token=<your_hf_access_token>)  # replace <your_hf_access_token> with your HuggingFace access token
Download PyAnnote Models using Dropbox Link
To download the Pyannote models, please download the pyannote3.1 folder in this Dropbox Link.
To use the Pyannote models, please replace
Usage
The audio-and-text-based ensemble speaker change detection model can be applied to get speaker change detection results by running a single function: run_ensemble_audio_text_based_speaker_change_detection_model in src/audiotextspeakerchangedetect/main.py.
```
from audiotextspeakerchangedetect.main import run_ensemble_audio_text_based_speaker_change_detection_model

run_ensemble_audio_text_based_speaker_change_detection_model(detection_models, min_speakers, max_speakers, audio_file_input_path, audio_file_input_name, transcription_input_path, transcription_file_input_name, detection_output_path, hf_access_token, llama2_model_path, pyannote_model_path, device, detection_llama2_output_path, temp_output_path, ensemble_voting)
```
Please view the descriptions of the function inputs:
* detection_models: A list of names of the speaker change detection models to run
* min_speakers: The minimum number of speakers in the input audio file
* max_speakers: The maximum number of speakers in the input audio file
* audio_file_input_path: The path containing the input audio file
* audio_file_input_name: The audio file name, including the file extension
* transcription_input_path: The path where the transcription output CSV file is saved
* transcription_file_input_name: The transcription output CSV file name, ending with .csv
* detection_output_path: The path to save the speaker change detection output CSV file
* hf_access_token: HuggingFace access token
* llama2_model_path: The path where the Llama2 model files are saved
* pyannote_model_path: The path where the Pyannote model files are saved
* device: Torch device type to run the model; defaults to None, so a GPU is used automatically if available
* detection_llama2_output_path: The path where a pre-run Llama2 speaker change detection output CSV file is saved, if one exists; defaults to None
* temp_output_path: The path to save the current run's Llama2 speaker change detection output to avoid future rerunning; defaults to None
* ensemble_output_path: The path to save the ensemble detection output CSV file
* ensemble_voting: A list of voting methods used to build the final ensemble model
Please view the sample code for running the function in sample_run.py and sample_run_existing_llama2_output.py in the src/audiotextspeakerchangedetect folder. Please view the detailed descriptions of the function and its inputs inside the Python file src/audiotextspeakerchangedetect/main.py.
Please note that running Llama2-70b requires at least 2 GPUs and 250GB of memory. If the computing resources for running Llama2-70b are not available, please exclude llama2-70b from the detection_models input.
Evaluation
VoxConverse is an audio-visual diarization dataset consisting of over 50 hours of multispeaker clips of human speech, extracted from YouTube videos, usually in a political debate or news segment context to ensure multi-speaker dialogue. The proportion of speaker changes varies widely across the audio files, which makes the dataset effective for evaluating the models' robustness.
Average Coverage, Purity, Precision, and Recall
|           | PyAnnote | Llama2 | Singularity | Majority |
|-----------|----------|--------|-------------|----------|
| Coverage  | 86%      | 45%    | 59%         | 84%      |
| Purity    | 83%      | 89%    | 87%         | 70%      |
| Precision | 23%      | 14%    | 24%         | 32%      |
| Recall    | 19%      | 32%    | 41%         | 19%      |
The AMI Meeting Corpus is a multi-modal dataset consisting of 100 hours of meeting recordings. Around two-thirds of the data was elicited using a scenario in which the participants play different roles in a design team, taking a design project from kick-off to completion over the course of a day. The rest consists of naturally occurring meetings in a range of domains. Unlike the VoxConverse dataset, the AMI dataset is less diverse, as it consists only of meeting recordings. Both the median and average proportion of speaker change are around 78%, and the minimum proportion is above 59%. Thus, the AMI-based evaluation is more applicable for measuring the models' performance in a regular conversational setting.
Average Coverage, Purity, Precision, and Recall
|           | PyAnnote | Llama2 | Singularity | Majority |
|-----------|----------|--------|-------------|----------|
| Coverage  | 89%      | 75%    | 80%         | 92%      |
| Purity    | 60%      | 65%    | 64%         | 46%      |
| Precision | 44%      | 32%    | 40%         | 46%      |
| Recall    | 18%      | 18%    | 25%         | 11%      |
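The precision and recall figures above score predicted change points against reference ones. A toy version of such matching, with a hypothetical `change_point_precision_recall` helper and a tolerance window in seconds, might look like this; the reported numbers come from the full evaluation analysis, not this sketch:

```python
def change_point_precision_recall(pred, true, tolerance=0.5):
    """Match each predicted speaker-change timestamp to an unused true
    change point within `tolerance` seconds, then compute precision and
    recall over the matches. Illustrative only."""
    matched = set()  # indices of true change points already claimed
    tp = 0
    for p in pred:
        for i, t in enumerate(true):
            if i not in matched and abs(p - t) <= tolerance:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    return precision, recall
```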
For the detailed evaluation analysis, please refer to evaluation_analysis.pdf in the main repo folder.
License
Owner
- Name: Princeton DDSS
- Login: princeton-ddss
- Kind: organization
- Location: United States of America
- Website: https://ddss.princeton.edu
- Repositories: 1
- Profile: https://github.com/princeton-ddss
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: 'audiotextspeakerchangedetect '
message: >-
If you use this software, please cite it using the
metadata from this file.
type: software
authors:
- given-names: Junying
family-names: Fang
email: jf3375@princeton.edu
identifiers:
- type: doi
value: 10.5281/zenodo.10712695
repository-code: >-
https://github.com/princeton-ddss/AudioAndTextBasedSpeakerChangeDetection
abstract: >-
Audiotextspeakerchangedetect is a Python Package to Detect
Speaker Change based on Textual Features via LLMs &
Rule-Based NLP and Audio Features via Pyannote & Spectral
Clustering
license: MIT
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Packages
- Total packages: 2
- Total downloads: pypi 56 last-month
- Total dependent packages: 0 (may contain duplicates)
- Total dependent repositories: 0 (may contain duplicates)
- Total versions: 6
- Total maintainers: 1
pypi.org: audiotextspeakerchangedetect
A Package to Detect Speaker Change based on Textual Features via LLMs & Rule-Based NLP and Audio Features via Pyannote & Spectral Clustering
- Homepage: https://github.com/princeton-ddss/AudioAndTextBasedSpeakerChangeDetection
- Documentation: https://audiotextspeakerchangedetect.readthedocs.io/
- License: MIT License
- Latest release: 1.3.0 (published almost 2 years ago)
Rankings
Maintainers (1)
pypi.org: speakerchangedetect
A Package to Detect Speaker Change based on Textual Features via LLMs & Rule-Based NLP and Audio Features via Pyannote & Spectral Clustering
- Homepage: https://github.com/princeton-ddss/AudioAndTextBasedSpeakerChangeDetection
- Documentation: https://speakerchangedetect.readthedocs.io/
- License: MIT License
- Latest release: 0.0.1 (published about 2 years ago)