audio-datasets

GitHub Repository for the Survey Paper on Audio-Language Datasets for Scenes and Events

https://github.com/gljs/audio-datasets

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: ieee.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary

Keywords

audio-datasets

Repository


Basic Info
  • Host: GitHub
  • Owner: GLJS
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 4.86 MB
Statistics
  • Stars: 6
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
audio-datasets
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
  • Readme
  • Citation

README.md

Audio-Language Datasets of Scenes and Events: A Survey

Corresponding paper: https://ieeexplore.ieee.org/abstract/document/10854210/

This repository contains the scripts used to create the survey paper on audio-language datasets. It also includes dataset splits that mitigate overlap between datasets, and a bash script, download.sh, for downloading all the data in one step.

We also share all the audio-dataset text files used in the paper on our HuggingFace dataset repository: https://huggingface.co/datasets/gijs/audio-datasets.

Script files

  • clap.py: Extracts CLAP audio/text embeddings from WebDataset files.

  • datasettofiles.py: Converts audio files to FLAC format with JSON metadata.

  • datasets.py: Functions to load and standardize 40+ audio datasets into a common file/caption/split format.

  • get_diff.py: Identifies overlapping audio files between datasets using CLAP embeddings similarity analysis.

  • get_dogs.py: Checks each dataset's captions for the presence of "dog" and prints the first matching caption.

  • laion-tarify.py: Creates tar archives from audio-text pairs with configurable batch size and data splits.

  • main.py: Calculates audio and text statistics for different datasets.

  • maketarutils.py: Utilities for creating tar archives of dataset files.
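
The core idea behind the overlap check in get_diff.py can be sketched with plain NumPy: L2-normalize the per-clip embeddings so a dot product equals cosine similarity, then flag cross-dataset pairs above a threshold. This is a simplified illustration, not the repository's actual code; the function name, array shapes, and toy data below are made up for the example, with the 0.99 threshold mirroring the 99% similarity splits described later.

```python
import numpy as np

def find_overlaps(emb_a, emb_b, threshold=0.99):
    """Return (i, j) index pairs whose cosine similarity meets the threshold.

    emb_a, emb_b: 2-D arrays of embeddings, one row per audio clip.
    """
    # L2-normalize rows so the matrix product gives cosine similarities.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T
    i, j = np.where(sim >= threshold)
    return list(zip(i.tolist(), j.tolist()))

# Toy example: make the first clip of both "datasets" identical.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(4, 8))
emb_b = rng.normal(size=(3, 8))
emb_b[0] = emb_a[0]
print(find_overlaps(emb_a, emb_b))  # (0, 0) should appear among the flagged pairs
```

Real CLAP embeddings are much higher-dimensional (e.g. 512-d), but the normalize-and-dot pattern is the same.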

Visualization Scripts

  • analysis.py: Performs various analyses on dataset statistics.

  • barplot.py: Creates bar plots for visualizing dataset distributions.

  • calcaudiosetclass_unique.py: Calculates unique classes in AudioSet dataset.

  • clapevaluationheatmap.py and clapevaluationheatmap_combined.py: Generate heatmaps for CLAP embedding similarities.

  • clapevaluationpca.py: Performs PCA analysis on CLAP embeddings.

  • clapevaluationtsne.py and clapevaluationtsne_batchsearch.py: Perform t-SNE visualization of CLAP embeddings.

  • clapevaluationumap.py: Generates GPU-accelerated UMAP plots comparing audio and text embeddings across datasets.

  • count.py: Counts audio and text embeddings per dataset, including remapped LAION-Audio-630k datasets.

  • heatmap_evaluation.py: Generates heatmaps for CLAP embedding similarities.

  • pca_analysis.py: Analyzes correlation between dataset size and embedding variability for CLAP embeddings.
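
As a rough sketch of what a t-SNE script like the ones above does, the snippet below projects stand-in embeddings from two hypothetical datasets into 2-D with scikit-learn. The cluster locations and dimensions are invented for illustration; the repository's scripts would operate on actual CLAP embeddings and add matplotlib/seaborn plotting on top.

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for CLAP embeddings from two datasets (real CLAP uses ~512-d).
rng = np.random.default_rng(42)
embeddings = np.vstack([
    rng.normal(loc=0.0, size=(50, 32)),  # "dataset A"
    rng.normal(loc=3.0, size=(50, 32)),  # "dataset B"
])

# Project to 2-D; perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
print(coords.shape)  # (100, 2)
```

The resulting 2-D coordinates are what gets scattered per dataset in the evaluation plots.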

New splits

When training on one dataset and evaluating on another, the training dataset should not include the IDs listed in the corresponding file in the splits directory (based on 99% similarity; we will also upload a 95% similarity split soon). For example, when training on AudioCaps and evaluating on FAVDBench, one should remove from AudioCaps the audio files listed in audiocaps_in_favdbench.csv. Extra care is needed when the training and evaluation datasets share the same origin.
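
Removing the overlapping IDs could look like the following pandas sketch. The column name "id" and the inline toy frames are assumptions for illustration; in practice one would read the training metadata and the relevant CSV from the splits directory (e.g. splits/audiocaps_in_favdbench.csv) and match on whatever ID column those files actually use.

```python
import pandas as pd

# In practice:
#   train = pd.read_csv("audiocaps_train.csv")
#   overlap = pd.read_csv("splits/audiocaps_in_favdbench.csv")
# Toy frames with a hypothetical "id" column stand in here.
train = pd.DataFrame({"id": ["a", "b", "c", "d"], "caption": ["..."] * 4})
overlap = pd.DataFrame({"id": ["b", "d"]})

# Drop every training row whose id appears in the overlap split.
clean = train[~train["id"].isin(overlap["id"])].reset_index(drop=True)
print(clean["id"].tolist())  # ['a', 'c']
```

The filtered frame can then be written back out as the deduplicated training set.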

Citation

If you use this work in some way, please cite it as follows:

@article{wijngaard2025audio,
  author={Wijngaard, Gijs and Formisano, Elia and Esposito, Michele and Dumontier, Michel},
  journal={IEEE Access},
  title={Audio-Language Datasets of Scenes and Events: A Survey},
  year={2025},
  volume={13},
  number={},
  pages={20328-20360},
  doi={10.1109/ACCESS.2025.3534621}
}

Owner

  • Name: Gijs Wijngaard
  • Login: GlJS
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite as below:"
authors:
  - family-names: Wijngaard
    given-names: Gijs
  - family-names: Formisano
    given-names: Elia
  - family-names: Esposito
    given-names: Michele
  - family-names: Dumontier
    given-names: Michel
title: "Audio-Language Datasets of Scenes and Events: A Survey"
version: 1.0.0
doi: 10.1109/ACCESS.2025.3534621
date-released: '2025-02-07'
preferred-citation:
  authors:
    - family-names: Wijngaard
      given-names: Gijs
    - family-names: Formisano
      given-names: Elia
    - family-names: Esposito
      given-names: Michele
    - family-names: Dumontier
      given-names: Michel
  title: "Audio-Language Datasets of Scenes and Events: A Survey"
  doi: 10.1109/ACCESS.2025.3534621
  type: article
  start: 20328
  end: 20360
  volume: 13
  year: 2025
  journal: IEEE Access
  publisher: IEEE

GitHub Events

Total
  • Watch event: 6
  • Delete event: 1
  • Push event: 6
  • Pull request event: 2
  • Fork event: 1
Last Year
  • Watch event: 6
  • Delete event: 1
  • Push event: 6
  • Pull request event: 2
  • Fork event: 1

Dependencies

requirements.txt pypi
  • accelerate *
  • datasets *
  • librosa *
  • matplotlib *
  • numba *
  • numpy *
  • pandas *
  • polars *
  • pydub *
  • python-dotenv *
  • pyyaml *
  • scikit-learn *
  • scipy *
  • seaborn *
  • soundfile *
  • submitit *
  • torch *
  • torchaudio *
  • tqdm *
  • transformers *
  • webdataset *