Speakerbox

Speakerbox: Few-Shot Learning for Speaker Identification with Transformers - Published in JOSS (2023)

https://github.com/councildataproject/speakerbox

Science Score: 100.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
    1 of 5 committers (20.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

audio-classification speaker-id speaker-identification transformers

Keywords from Contributors

frequencies cdp-deployments civic-tech cookiecutter-template government-data local-government open-government mesh

Scientific Fields

Engineering Computer Science - 40% confidence
Last synced: 4 months ago

Repository

Speakerbox: Fine-tune Audio Transformers for speaker identification.

Basic Info
Statistics
  • Stars: 59
  • Watchers: 6
  • Forks: 6
  • Open Issues: 9
  • Releases: 2
Topics
audio-classification speaker-id speaker-identification transformers
Created almost 4 years ago · Last pushed about 1 year ago
Metadata Files
Readme Contributing License Code of conduct Citation Zenodo

README.md

speakerbox

Few-Shot Multi-Recording Speaker Identification Transformer Fine-Tuning and Application


Installation

Stable Release: pip install speakerbox
Development Head: pip install git+https://github.com/CouncilDataProject/speakerbox.git

Documentation

For full package documentation please visit councildataproject.github.io/speakerbox.

Example Usage Video

[Screenshot from the example usage YouTube video]

Link: https://youtu.be/SK2oVqSKPTE

In the example video, we use the Speakerbox library to quickly annotate a dataset of audio clips from the show The West Wing and train a speaker identification model to identify three of the show's characters (President Bartlet, Charlie Young, and Leo McGarry).

Problem

Given a set of multi-speaker recordings:

```
example/
├── 0.wav
├── 1.wav
├── 2.wav
├── 3.wav
├── 4.wav
└── 5.wav
```

Where each recording has some or all of a set of speakers, for example:

  • 0.wav -- contains speakers: A, B, C
  • 1.wav -- contains speakers: A, C
  • 2.wav -- contains speakers: B, C
  • 3.wav -- contains speakers: A, B, C
  • 4.wav -- contains speakers: A, B, C
  • 5.wav -- contains speakers: A, B, C

You want to train a model to classify portions of audio as one of the N known speakers in future recordings not included in your original training set.

f(audio) -> [(start_time, end_time, speaker), (start_time, end_time, speaker), ...]

i.e. f(audio) -> [(2.4, 10.5, "A"), (10.8, 14.1, "D"), (14.8, 22.7, "B"), ...]

The speakerbox library contains methods for both generating datasets for annotation and for utilizing multiple audio annotation schemes to train such a model.

Typical workflow to prepare a speaker identification dataset and fine-tune a new model using tools provided by the Speakerbox library. The user starts with a collection of audio files that include portions of speech from the speakers they want to train a model to identify. The `diarize_and_split_audio` function will create a new directory with the same name as the audio file, diarize the audio file, and finally, sort the audio portions produced from diarization into sub-directories within this new directory. The user should then manually rename each of the produced sub-directories to the correct speaker identifier (i.e. the speaker's name or a unique id) and additionally remove any incorrectly diarized or mislabeled portions of audio. Finally, the user can prepare training, evaluation, and testing datasets (via the `expand_labeled_diarized_audio_dir_to_dataset` and `prepare_dataset` functions) and fine-tune a new speaker identification model (via the `train` function).
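As a condensed sketch, the whole workflow ties together like this, using the functions covered in detail in the sections below (the directory names, Hugging Face token, and seed are placeholders):

```python
from speakerbox import preprocess, train

# 1. Diarize each raw recording; audio clips are sorted into per-speaker
#    sub-directories that you then manually rename and clean up.
preprocess.diarize_and_split_audio("0.wav", hf_token="token-from-hugging-face")

# 2. Expand the cleaned, labeled directories into a single dataset.
dataset = preprocess.expand_labeled_diarized_audio_dir_to_dataset(
    labeled_diarized_audio_dir=["0/", "1/"],  # placeholder recording dirs
)

# 3. Split into training, evaluation, and testing subsets, then fine-tune.
dataset_dict, value_counts = preprocess.prepare_dataset(dataset, seed=60)
train(dataset_dict)
```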

The following table shows model performance results as the dataset size increases:

| dataset_size | mean_accuracy | mean_precision | mean_recall | mean_training_duration_seconds |
|:-------------|--------------:|---------------:|------------:|-------------------------------:|
| 15-minutes   | 0.874 ± 0.029 | 0.881 ± 0.037  | 0.874 ± 0.029 | 101 ± 1 |
| 30-minutes   | 0.929 ± 0.006 | 0.94 ± 0.007   | 0.929 ± 0.006 | 186 ± 3 |
| 60-minutes   | 0.937 ± 0.02  | 0.94 ± 0.017   | 0.937 ± 0.02  | 453 ± 7 |

All results reported are the average of five model training and evaluation trials for each of the different dataset sizes. All models were fine-tuned using an NVIDIA GTX 1070 Ti.

Note: this table can be reproduced in ~1 hour using an NVIDIA GTX 1070 Ti by:

Installing the example data download dependency:

```bash
pip install speakerbox[example_data]
```

Then running the following commands in Python:

```python
from speakerbox.examples import (
    download_preprocessed_example_data,
    train_and_eval_all_example_models,
)

# Download and unpack the preprocessed example data
dataset = download_preprocessed_example_data()

# Train and eval models with different subsets of the data
results = train_and_eval_all_example_models(dataset)
```

Workflow

Diarization

We quickly generate an annotated dataset by first diarizing (or clustering based on the features of speaker audio) portions of larger audio files and splitting each of the clusters into its own directory that you can then manually clean up (by removing incorrectly clustered audio segments).

Notes

  • It is recommended to have each larger audio file named with a unique id that can be used to act as a "recording id".
  • Diarization time depends on machine resources and may take a long time -- one potential recommendation is to run a diarization script overnight and clean up the produced annotations the following day.
  • During this process audio will be duplicated in the form of smaller audio clips -- ensure you have enough space on your machine to complete this process before you begin.
  • Clustering accuracy depends on how many speakers there are, how distinct their voices are, and how often speakers talk over one another.
  • If possible, try to find recordings where speakers have a roughly uniform distribution of speaking durations.

⚠️ To use the diarization portions of speakerbox you need to complete the following steps: ⚠️

  1. Visit hf.co/pyannote/speaker-diarization and accept user conditions.
  2. Visit hf.co/pyannote/segmentation and accept user conditions.
  3. Visit hf.co/settings/tokens to create an access token (only if you had to complete 1.)
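If you prefer not to pass the token to every call, the code comments below note that it can also be read from the `HUGGINGFACE_TOKEN` environment variable; a minimal sketch (the token value is a placeholder):

```python
import os

# Set once per session; diarization calls can then omit hf_token.
os.environ["HUGGINGFACE_TOKEN"] = "token-from-hugging-face"
```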

Diarize a single file:

```python
from speakerbox import preprocess

# The token can also be provided via the `HUGGINGFACE_TOKEN` environment variable.
diarized_and_split_audio_dir = preprocess.diarize_and_split_audio(
    "0.wav",
    hf_token="token-from-hugging-face",
)
```

Diarize all files in a directory:

```python
from pathlib import Path

from speakerbox import preprocess
from tqdm import tqdm

# Iterate over all 'wav' format files in a directory called 'data'
for audio_file in tqdm(list(Path("data").glob("*.wav"))):
    # The token can also be provided via the `HUGGINGFACE_TOKEN` environment variable.
    diarized_and_split_audio_dir = preprocess.diarize_and_split_audio(
        audio_file,
        # Create a new directory to place all created sub-directories within
        storage_dir=f"diarized-audio/{audio_file.stem}",
        hf_token="token-from-hugging-face",
    )
```

Cleaning

Diarization will produce a directory structure organized by unlabeled speakers with the audio clips that were clustered together.

For example, if "0.wav" had three speakers, the produced directory structure may look like the following tree:

```
0/
├── SPEAKER_00
│   ├── 567-12928.wav
│   ├── ...
│   └── 76192-82901.wav
├── SPEAKER_01
│   ├── 34123-38918.wav
│   ├── ...
│   └── 88212-89111.wav
└── SPEAKER_02
    ├── ...
    └── 53998-62821.wav
```

We leave it to you as a user to go through these directories, remove any audio clips that were incorrectly clustered together, and rename the sub-directories to their correct speaker labels. For example, labeled sub-directories may look like the following tree:

```
0/
├── A
│   ├── 567-12928.wav
│   ├── ...
│   └── 76192-82901.wav
├── B
│   ├── 34123-38918.wav
│   ├── ...
│   └── 88212-89111.wav
└── D
    ├── ...
    └── 53998-62821.wav
```
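Renaming can be done by hand in a file manager; if you prefer to script it, here is a hypothetical helper (not part of the speakerbox API), where the label mapping is an assumption you fill in after listening to each sub-directory:

```python
from pathlib import Path

# Hypothetical mapping from diarized cluster names to verified speaker labels,
# filled in after manually reviewing each sub-directory.
SPEAKER_LABELS = {
    "SPEAKER_00": "A",
    "SPEAKER_01": "B",
    "SPEAKER_02": "D",
}

recording_dir = Path("0")
for cluster_name, speaker_label in SPEAKER_LABELS.items():
    cluster_dir = recording_dir / cluster_name
    if cluster_dir.exists():
        cluster_dir.rename(recording_dir / speaker_label)
```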

Notes

  • Most operating systems have an audio playback application to queue an entire directory of audio files as a playlist for playback. This makes it easy to listen to a whole unlabeled sub-directory (i.e. "SPEAKER_00") at a time and pause playback and remove files from the directory which were incorrectly clustered.
  • If any clips have overlapping speakers, it is up to you as a user if you want to remove those clips or keep them and properly label them with the speaker you wish to associate them with.

Training Preparation

Once you have annotated what you think is enough recordings, you can try preparing a dataset for training.

The following functions will prepare the audio for training by:

  1. Finding all labeled audio clips in the provided directories
  2. Chunking all found audio clips into smaller duration clips (parametrizable)
  3. Checking that the provided annotated dataset meets the following conditions (sketched in code below):
    1. There is enough data such that the training, test, and validation subsets all contain different recording ids.
    2. There is enough data such that the training, test, and validation subsets each contain all labels present in the whole dataset.
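As a concrete illustration of those two conditions, here is a minimal sketch (not speakerbox's implementation), assuming each clip is represented as a `(recording_id, label)` pair:

```python
# Minimal sketch of the two dataset conditions above; not speakerbox's
# implementation. Each split is assumed to be a list of (recording_id, label) pairs.
def check_splits(train_clips, test_clips, valid_clips, all_labels):
    splits = {"train": train_clips, "test": test_clips, "valid": valid_clips}

    # Condition 1: the subsets must hold out different recording ids.
    rec_ids = {name: {rec for rec, _ in clips} for name, clips in splits.items()}
    assert not (rec_ids["train"] & rec_ids["test"]), "train/test share recordings"
    assert not (rec_ids["train"] & rec_ids["valid"]), "train/valid share recordings"
    assert not (rec_ids["test"] & rec_ids["valid"]), "test/valid share recordings"

    # Condition 2: every subset must contain every label in the whole dataset.
    for name, clips in splits.items():
        assert {label for _, label in clips} == set(all_labels), f"{name} is missing labels"
```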

Notes

  • During this process audio will be duplicated in the form of smaller audio clips -- ensure you have enough space on your machine to complete this process before you begin.
  • Directory names are used as recording ids during dataset construction.

```python
from speakerbox import preprocess

dataset = preprocess.expand_labeled_diarized_audio_dir_to_dataset(
    labeled_diarized_audio_dir=[
        "0/",  # The cleaned and checked audio clips for recording id 0
        "1/",  # ... recording id 1
        "2/",  # ... recording id 2
        "3/",  # ... recording id 3
        "4/",  # ... recording id 4
        "5/",  # ... recording id 5
    ]
)

dataset_dict, value_counts = preprocess.prepare_dataset(
    dataset,
    # good if you have large variation in number of data points for each label
    equalize_data_within_splits=True,
    # set seed to get a reproducible data split
    seed=60,
)

# You can print the value_counts dataframe to see how many audio clips of each label
# (speaker) are present in each data subset.
value_counts
```

Model Training and Evaluation

Once you have your dataset prepared and available, you can provide it directly to the training function to begin training a new model.

The eval_model function will store a file called results.md with the accuracy, precision, and recall of the model and additionally store a file called validation-confusion.png containing a confusion matrix.

Notes

  • The model (and evaluation metrics) will be stored in a new directory called trained-speakerbox (parametrizable).
  • Training time depends on how much data you have annotated and provided.
  • It is recommended to train with an NVIDIA GPU with CUDA available to speed up the training process.
  • Speakerbox has only been tested on English-language audio, and the base model for fine-tuning was trained on English-language audio. We provide no guarantees as to its effectiveness on non-English-language audio. If you try Speakerbox with non-English-language audio, please let us know!

```python
from speakerbox import train, eval_model

# dataset_dict comes from the previous preparation step
train(dataset_dict)

eval_model(dataset_dict["valid"])
```

Model Inference

Once you have a trained model, you can use it against a new audio file:

```python
from speakerbox import apply

annotation = apply(
    "new-audio.wav",
    "path-to-my-model-directory/",
)
```

The apply function returns a pyannote.core.Annotation.
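To recover the `(start_time, end_time, speaker)` tuples described in the Problem section, you can iterate over the returned annotation; a minimal sketch using pyannote.core's standard `itertracks` API:

```python
# Each track yields a Segment (with .start and .end, in seconds) and a speaker label.
for segment, _track, speaker in annotation.itertracks(yield_label=True):
    print(f"{segment.start:.1f} - {segment.end:.1f}: {speaker}")
```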

Development

See CONTRIBUTING.md for information related to developing the code.

Citation

```bibtex
@article{Brown2023,
  doi = {10.21105/joss.05132},
  url = {https://doi.org/10.21105/joss.05132},
  year = {2023},
  publisher = {The Open Journal},
  volume = {8},
  number = {83},
  pages = {5132},
  author = {Eva Maxfield Brown and To Huynh and Nicholas Weber},
  title = {Speakerbox: Few-Shot Learning for Speaker Identification with Transformers},
  journal = {Journal of Open Source Software},
}
```

MIT License

Owner

  • Name: CouncilDataProject
  • Login: CouncilDataProject
  • Kind: organization

Tools for transparency and accessibility in council action.

JOSS Publication

Speakerbox: Few-Shot Learning for Speaker Identification with Transformers
Published
March 20, 2023
Volume 8, Issue 83, Page 5132
Authors
Eva Maxfield Brown ORCID
University of Washington Information School, University of Washington, Seattle
To Huynh ORCID
University of Washington, Seattle
Nicholas Weber ORCID
University of Washington Information School, University of Washington, Seattle
Editor
Fabian-Robert Stöter ORCID
Tags
speaker identification audio classification machine learning transformers

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Speakerbox: Few-Shot Learning for Speaker Identification
  with Transformers
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Eva
    family-names: Maxfield Brown
    email: evamxb@uw.edu
    affiliation: University of Washington Information School
    orcid: 'https://orcid.org/0000-0003-2564-0373'
  - given-names: To
    family-names: Huynh
    affiliation: University of Washington
    orcid: 'https://orcid.org/0000-0002-9664-3662'
  - given-names: Nicholas
    family-names: Weber
    email: nmweber@uw.edu
    affiliation: University of Washington Information School
    orcid: 'https://orcid.org/0000-0002-6008-3763'
identifiers:
  - type: doi
    value: 10.21105/joss.05132
    description: The JOSS Paper for Speakerbox
repository-code: 'https://github.com/CouncilDataProject/speakerbox'
repository-artifact: 'https://zenodo.org/record/7729994'
abstract: >-
  Automated speaker identification is a modeling challenge
  for research when large-scale corpora, such as audio
  recordings or transcripts, are relied upon for evidence
  (e.g. Journalism, Qualitative Research, Law, etc.). To
  address current difficulties in training speaker
  identification models, we propose Speakerbox: a method for
  few-shot fine-tuning of an audio transformer.
  Specifically, Speakerbox makes multi-recording,
  multi-speaker identification model fine-tuning as simple
  as possible while still fitting an accurate, useful model
  for application. Speakerbox works by ensuring data are
  safely stratified by speaker id and held-out by recording
  id prior to fine-tuning of a pretrained speaker
  identification Transformer on a small number of audio
  examples. We show that with less than an hour of
  audio-recorded input, Speakerbox can fine-tune a
  multi-speaker identification model for use in assisting
  researchers in audio and transcript annotation.
keywords:
  - Python
  - speaker identification
  - audio classification
  - machine learning
  - transformers
license: MIT
version: 1.2.0
preferred-citation:
  type: article
  authors:
  - given-names: Eva
    family-names: Maxfield Brown
    email: evamxb@uw.edu
    affiliation: University of Washington Information School
    orcid: 'https://orcid.org/0000-0003-2564-0373'
  - given-names: To
    family-names: Huynh
    affiliation: University of Washington
    orcid: 'https://orcid.org/0000-0002-9664-3662'
  - given-names: Nicholas
    family-names: Weber
    email: nmweber@uw.edu
    affiliation: University of Washington Information School
    orcid: 'https://orcid.org/0000-0002-6008-3763'
  doi: "10.21105/joss.05132"
  journal: "Journal of Open Source Software"
  start: 5132 # First page number
  end: 5132 # Last page number
  title: "Speakerbox: Few-Shot Learning for Speaker Identification with Transformers"
  issue: 83
  volume: 8
  year: 2023
  url: "https://doi.org/10.21105/joss.05132"
  publisher: "The Open Journal"

GitHub Events

Total
  • Watch event: 8
  • Delete event: 1
  • Issue comment event: 1
  • Pull request event: 2
  • Create event: 1
Last Year
  • Watch event: 8
  • Delete event: 1
  • Issue comment event: 1
  • Pull request event: 2
  • Create event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 71
  • Total Committers: 5
  • Avg Commits per committer: 14.2
  • Development Distribution Score (DDS): 0.31
Past Year
  • Commits: 7
  • Committers: 2
  • Avg Commits per committer: 3.5
  • Development Distribution Score (DDS): 0.429
Top Committers
Name Email Commits
Eva Maxfield Brown e****n@g****m 49
JacksonMaxfield j****n@g****m 15
dependabot[bot] 4****] 4
Nic n****r@u****u 2
Fabian-Robert Stöter f****t 1
Committer Domains (Top 20 + Academic)
uw.edu: 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 16
  • Total pull requests: 15
  • Average time to close issues: about 1 month
  • Average time to close pull requests: about 1 month
  • Total issue authors: 5
  • Total pull request authors: 6
  • Average comments per issue: 1.56
  • Average comments per pull request: 2.87
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 6
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • gregoryfoster (6)
  • evamaxfield (5)
  • NicolasMICAUX (3)
  • DieguJota (1)
  • squeipom (1)
Pull Request Authors
  • dependabot[bot] (10)
  • evamaxfield (4)
  • nniiicc (2)
  • NicolasMICAUX (1)
  • kristopher-smith (1)
  • faroit (1)
Top Labels
Issue Labels
enhancement (8) bug (6) good first issue (1)
Pull Request Labels
dependencies (10) enhancement (2)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 153 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 10
  • Total maintainers: 1
pypi.org: speakerbox

Few-Shot Speaker Identification Model Training and Application

  • Versions: 10
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 153 Last month
Rankings
Dependent packages count: 10.1%
Stargazers count: 13.2%
Forks count: 14.3%
Average: 15.6%
Downloads: 18.9%
Dependent repos count: 21.6%
Maintainers (1)
Last synced: 4 months ago

Dependencies

.github/workflows/ci.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • codecov/codecov-action v3 composite
  • extractions/setup-just v1 composite
  • pypa/gh-action-pypi-publish release/v1 composite
.github/workflows/docs.yml actions
  • JamesIves/github-pages-deploy-action v4 composite
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • extractions/setup-just v1 composite
.github/workflows/paper.yml actions
  • actions/checkout v3 composite
  • actions/upload-artifact v3 composite
  • openjournals/openjournals-draft-action master composite