Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to arxiv.org, zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.1%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: earthspecies
- License: agpl-3.0
- Language: Python
- Default Branch: main
- Size: 1.06 MB
Statistics
- Stars: 28
- Watchers: 10
- Forks: 6
- Open Issues: 7
- Releases: 3
Metadata Files
README.md
Voxaboxen
Voxaboxen is a deep learning framework designed to find the start and stop times of (possibly overlapping) sound events in a recording. We designed it with bioacoustics applications in mind, so it accepts annotations in the form of Raven selection tables.
If you use this software in your research, please cite it.
Read the preprint!
Get the data from the preprint!
Installation
With uv, Voxaboxen can be run using uv run main.py.
Alternatively, install dependencies with pip install -r requirements.txt and run using python main.py.
To use the BEATs encoder, you can obtain the weights from here. To use the Frame-ATST encoder, you can obtain the weights from here. Place these files in the weights directory.
Quick start
Create a directory for your data. Add to it a train_info.csv file with three columns:
- fn: Unique filename associated with each audio file
- audio_fp: Filepath to the audio file in the train set
- selection_table_fp: Filepath to the corresponding Raven selection table
Repeat this for the other folds of your dataset, creating val_info.csv and test_info.csv. Run project setup and model training following the template in the Example Usage below.
Notes:
- Audio will be automatically resampled to 16000 Hz mono; no resampling is necessary prior to training.
- Selection tables are .txt files, with tab-separated columns. We only require the following columns: Begin Time (s), End Time (s), Annotation.
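The layout described above can be sketched in a few lines. This is a hypothetical example (the filenames and the REVI label are made up); it writes a minimal train_info.csv and one matching Raven-style selection table:

```python
import csv

# Minimal train_info.csv with the three required columns.
# Paths and the "rec01" filename are hypothetical.
with open("train_info.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["fn", "audio_fp", "selection_table_fp"])
    writer.writerow(["rec01", "audio/rec01.wav", "annotations/rec01.txt"])

# Selection tables are tab-separated .txt files; only these three
# columns are required.
with open("rec01.txt", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["Begin Time (s)", "End Time (s)", "Annotation"])
    writer.writerow([1.25, 1.80, "REVI"])  # one annotated sound event
```

val_info.csv and test_info.csv follow the same three-column format.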
Example usage:
Get the BEATs weights from the link above. Get the preprocessed Meerkat (MT) dataset:
mkdir datasets/MT_demo
wget https://storage.googleapis.com/esp-public-files/voxaboxen-demo/formatted.zip -P datasets/MT_demo
unzip datasets/MT_demo/formatted.zip -d datasets/MT_demo
wget https://storage.googleapis.com/esp-public-files/voxaboxen-demo/original_readme_and_license.md -P datasets/MT_demo
Project setup:
uv run main.py project-setup --data-dir=datasets/MT_demo/formatted --project-dir=projects/MT_demo_experiment
Train model:
uv run main.py train-model --project-config-fp=projects/MT_demo_experiment/project_config.yaml --name=demo --n-epochs=50 --batch-size=4 --encoder-type=beats --beats-checkpoint-fp=weights/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt --bidirectional
Use trained model to infer annotations:
python main.py inference --model-args-fp=projects/MT_demo_experiment/demo/params.yaml --file-info-for-inference=datasets/MT_demo/formatted/test_info.csv
Reproduce the experiments
Obtain the datasets from here. Place them in the datasets directory.
For some datasets, events are above the 8 kHz Nyquist frequency of the model. To get around this, we use slowed-down versions of the files. To create them, run uv run scripts/make_slowed_version.py.
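The idea behind the slow-down trick: playing samples back at a lower rate divides every frequency by the same factor, pulling high-pitched events below the model's 8 kHz Nyquist limit. The sketch below only computes the resulting parameters; it is an illustration of the principle, not what scripts/make_slowed_version.py actually does (the script and its slow-down factor may differ):

```python
def slowed_params(orig_sr: int, peak_hz: float, factor: int = 2):
    """Return (new sample rate, where a peak frequency lands) after
    slowing playback by `factor`. Hypothetical helper for illustration."""
    return orig_sr // factor, peak_hz / factor

# A 12 kHz call in a 16 kHz recording, slowed by 2x, now sits at 6 kHz,
# inside the model's 0-8 kHz band.
new_sr, new_peak = slowed_params(orig_sr=16000, peak_hz=12000.0, factor=2)
```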
The main experiments from the paper can be reproduced using the shell script scripts/voxaboxen_experiments.sh
Other features
Here are some additional options that can be applied during training:
- Flag --stereo accepts stereo audio. Order of channels matters; used for e.g. speaker diarization.
- Flag --bidirectional predicts the ends of events in addition to the beginnings, and matches starts and ends based on IoU. May improve box regression.
- Flag --segmentation-based switches to a frame-based approach. If used, we recommend setting --rho=1.
- Flag --mixup applies mixup augmentation.
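The IoU matching behind --bidirectional works on 1-D time intervals. A minimal sketch of the overlap measure (this is an illustration of interval IoU in general, not Voxaboxen's actual matching code):

```python
def interval_iou(a, b):
    """Intersection-over-union of two 1-D time intervals (start, end),
    the kind of score used to pair a predicted event start with a
    predicted event end."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```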
Editing Project Config
After running python main.py project-setup, a project_config.yaml file will be created in the project directory you specified. This config file codifies how different labels will be handled by any model within this project. This config file is automatically generated by the project setup script, but you can edit this file to revise how these labels are handled. There are a few things you can edit:
- label_set: This is a list of all the label types that a model will be able to output. It is automatically populated with all the label types that appear in the Annotation column of the selection table. If you want your model to ignore a particular label type, perhaps because there are few events with that label type, you must delete that label type from this list.
- label_mapping: This is a set of key: value pairs. Often, it is useful to group multiple types of labels into one. For example, maybe in your data there are multiple species from the same family, and you would like the model to treat this entire family with one label type. Upon training, Voxaboxen converts each annotation that appears as a key into the label specified by the corresponding value. When modifying label_mapping, you should ensure that each value that appears in label_mapping either also appears in label_set, or is the unknown_label.
- unknown_label: This is set to Unknown by default. Any sound event labeled with the unknown_label will be treated as an event of interest, but the label type of the event will be treated as unknown. This may be desirable when there are vocalizations that are clearly audible, but are difficult for an annotator to identify to species. When the model is trained, it learns to predict a uniform distribution across possible label types whenever it encounters an event with the unknown_label. When the model is evaluated, it is not penalized for predicting the label of events which are annotated with the unknown_label. The unknown_label should not appear in the label_set.
For example, say you annotate your audio with the labels Red-eyed Vireo REVI, Philadelphia Vireo PHVI, and Unsure REVI/PHVI. To reflect your uncertainty about REVI/PHVI, your label_set would include REVI and PHVI, and your label_mapping would include the pairs REVI: REVI, PHVI: PHVI, and REVI/PHVI: Unknown. Alternatively, you could group both types of Vireo together by making your label_set only include Vireo, and your label_mapping include REVI: Vireo, PHVI: Vireo, REVI/PHVI: Vireo.
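The first Vireo configuration can be sketched as a mapping step. The dictionary mirrors the project_config.yaml keys described above, but the mapping logic itself is a hypothetical illustration, not Voxaboxen's own code:

```python
# Config values from the Vireo example above.
label_set = ["REVI", "PHVI"]
label_mapping = {"REVI": "REVI", "PHVI": "PHVI", "REVI/PHVI": "Unknown"}
unknown_label = "Unknown"

def map_annotation(raw):
    """Map a raw Annotation value through label_mapping.
    Returns None for labels deleted from label_set (event ignored)."""
    label = label_mapping.get(raw, raw)
    if label in label_set or label == unknown_label:
        return label
    return None

# "AMRO" is a hypothetical label absent from label_set and label_mapping,
# so it is ignored; "REVI/PHVI" maps to the unknown_label.
mapped = [map_annotation(a) for a in ["REVI", "REVI/PHVI", "AMRO"]]
```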
The name
Voxaboxen is designed to put a box around each vocalization (vox). It also rhymes with Roxaboxen.
Owner
- Name: Earth Species Project
- Login: earthspecies
- Kind: organization
- Email: humans@earthspecies.org
- Website: https://earthspecies.org
- Repositories: 25
- Profile: https://github.com/earthspecies
An open-source collaborative and nonprofit dedicated to decoding animal communication.
Citation (CITATION.cff)
cff-version: 1.2.0
message: If you use this software, please cite both the article from preferred-citation and the software itself.
authors:
- family-names: Mahon
given-names: Louis
- family-names: Hoffman
given-names: Benjamin
- family-names: James
given-names: Logan
- family-names: Cusimano
given-names: Maddie
- family-names: Hagiwara
given-names: Masato
- family-names: Woolley
given-names: Sarah
- family-names: Pietquin
given-names: Olivier
title: Robust detection of overlapping bioacoustic sound events
version: 1.0.0
url: https://arxiv.org/abs/2503.02389
date-released: '2025-05-28'
GitHub Events
Total
- Issues event: 5
- Watch event: 11
- Delete event: 3
- Issue comment event: 9
- Push event: 5
- Pull request event: 9
- Fork event: 2
- Create event: 2
Last Year
- Issues event: 5
- Watch event: 11
- Delete event: 3
- Issue comment event: 9
- Push event: 5
- Pull request event: 9
- Fork event: 2
- Create event: 2
Dependencies
- PyYAML >=6.0
- einops >=0.6.1
- intervaltree >=3.1.0
- librosa >=0.10.0
- matplotlib >=3.7.1
- mir_eval >=0.7
- numpy >=1.24.3
- pandas >=2.0.2
- plumbum >=1.8.2
- scipy >=1.10.1
- seaborn >=0.12.2
- soundfile >=0.12.1
- torch >=2.0.1
- torchaudio >=2.0.1
- tqdm >=4.65.0
- actions/checkout v2 composite
- actions/setup-python v2 composite
- pre-commit/action v3.0.1 composite
- actions/checkout v2 composite
- actions/setup-python v5 composite
- astral-sh/setup-uv v5 composite
- einops >=0.6.1
- intervaltree >=3.1.0
- librosa >=0.10.0
- matplotlib >=3.7.1
- mir-eval >=0.7
- numpy >=1.24.3
- pandas >=2.0.2
- pytorch-lightning >=2.5.1.post0
- pyyaml >=6.0
- scipy >=1.10.1
- seaborn >=0.12.2
- soundfile >=0.12.1
- torch >=2.0.1
- torchaudio >=2.0.1
- torchvision >=0.15.2
- tqdm >=4.67.0
- cfgv ==3.4.0 development
- click ==8.1.7 development
- colorama ==0.4.6 development
- deptry ==0.23.0 development
- distlib ==0.3.9 development
- filelock ==3.18.0 development
- identify ==2.6.10 development
- iniconfig ==2.1.0 development
- isort ==5.13.2 development
- nodeenv ==1.9.1 development
- packaging ==24.2 development
- pathspec ==0.12.1 development
- platformdirs ==4.3.8 development
- pluggy ==1.6.0 development
- pre-commit ==4.2.0 development
- pytest ==7.4.0 development
- pyyaml ==6.0.2 development
- requirements-parser ==0.13.0 development
- ruff ==0.11.2 development
- virtualenv ==20.31.2 development
- yamllint ==1.35.1 development
- 103 dependencies