audio-datasets

GitHub Repository for the Survey Paper on Audio-Language Datasets for Scenes and Events

https://github.com/gljs/audio-datasets

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: ieee.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary

Keywords

audio-datasets

Repository


Basic Info
  • Host: GitHub
  • Owner: GLJS
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 4.86 MB
Statistics
  • Stars: 6
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
audio-datasets
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
  • Readme
  • Citation

README.md

Audio-Language Datasets of Scenes and Events: A Survey

Corresponding paper: https://ieeexplore.ieee.org/abstract/document/10854210/

This repository contains the scripts used to create the survey paper on audio-language datasets. It also includes dataset splits that mitigate overlap between datasets, and a bash script, download.sh, for downloading all the data in one step.

We also share all the audio-dataset text files used in the paper on our HuggingFace dataset repository: https://huggingface.co/datasets/gijs/audio-datasets.

Script files

  • clap.py: Extracts CLAP audio/text embeddings from WebDataset files.

  • datasettofiles.py: Converts audio files to FLAC format with JSON metadata.

  • datasets.py: Functions to load and standardize 40+ audio datasets into a common file/caption/split format.

  • get_diff.py: Identifies overlapping audio files between datasets using CLAP embeddings similarity analysis.

  • get_dogs.py: Checks each dataset's captions for the presence of "dog" and prints the first matching caption.

  • laion-tarify.py: Creates tar archives from audio-text pairs with configurable batch size and data splits.

  • main.py: Calculates audio and text statistics for different datasets.

  • maketarutils.py: Utilities for creating tar archives of dataset files.
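
The core idea behind the overlap check in get_diff.py can be sketched with plain NumPy: L2-normalize the per-clip embeddings so a dot product equals cosine similarity, then flag cross-dataset pairs above a threshold. This is a simplified illustration, not the repository's actual code; the function name, array shapes, and toy data below are made up for the example, with the 0.99 threshold mirroring the 99% similarity splits described later.

```python
import numpy as np

def find_overlaps(emb_a, emb_b, threshold=0.99):
    """Return (i, j) index pairs whose cosine similarity meets the threshold.

    emb_a, emb_b: 2-D arrays of embeddings, one row per audio clip.
    """
    # L2-normalize rows so the matrix product gives cosine similarities.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T
    i, j = np.where(sim >= threshold)
    return list(zip(i.tolist(), j.tolist()))

# Toy example: make the first clip of both "datasets" identical.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(4, 8))
emb_b = rng.normal(size=(3, 8))
emb_b[0] = emb_a[0]
print(find_overlaps(emb_a, emb_b))  # (0, 0) should appear among the flagged pairs
```

Real CLAP embeddings are much higher-dimensional (e.g. 512-d), but the normalize-and-dot pattern is the same.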

Visualization Scripts

  • analysis.py: Performs various analyses on dataset statistics.

  • barplot.py: Creates bar plots for visualizing dataset distributions.

  • calcaudiosetclass_unique.py: Calculates unique classes in AudioSet dataset.

  • clapevaluationheatmap.py and clapevaluationheatmap_combined.py: Generate heatmaps for CLAP embedding similarities.

  • clapevaluationpca.py: Performs PCA analysis on CLAP embeddings.

  • clapevaluationtsne.py and clapevaluationtsne_batchsearch.py: Perform t-SNE visualization of CLAP embeddings.

  • clapevaluationumap.py: Generates GPU-accelerated UMAP plots comparing audio and text embeddings across datasets.

  • count.py: Counts audio and text embeddings per dataset, including remapped LAION-Audio-630k datasets.

  • heatmap_evaluation.py: Generates heatmaps for CLAP embedding similarities.

  • pca_analysis.py: Analyzes correlation between dataset size and embedding variability for CLAP embeddings.
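
As a rough sketch of what a t-SNE script like the ones above does, the snippet below projects stand-in embeddings from two hypothetical datasets into 2-D with scikit-learn. The cluster locations and dimensions are invented for illustration; the repository's scripts would operate on actual CLAP embeddings and add matplotlib/seaborn plotting on top.

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for CLAP embeddings from two datasets (real CLAP uses ~512-d).
rng = np.random.default_rng(42)
embeddings = np.vstack([
    rng.normal(loc=0.0, size=(50, 32)),  # "dataset A"
    rng.normal(loc=3.0, size=(50, 32)),  # "dataset B"
])

# Project to 2-D; perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
print(coords.shape)  # (100, 2)
```

The resulting 2-D coordinates are what gets scattered per dataset in the evaluation plots.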

New splits

When training on one dataset and evaluating on another, the training dataset should not include the IDs listed in the corresponding file in the splits directory (based on 99% similarity; we will also upload a 95% similarity split soon). For example, when training on AudioCaps and evaluating on FAVDBench, one should remove from AudioCaps the audio files listed in audiocaps_in_favdbench.csv. Extra care is needed when the training and evaluation datasets share the same origin.
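
Removing the overlapping IDs could look like the following pandas sketch. The column name "id" and the inline toy frames are assumptions for illustration; in practice one would read the training metadata and the relevant CSV from the splits directory (e.g. splits/audiocaps_in_favdbench.csv) and match on whatever ID column those files actually use.

```python
import pandas as pd

# In practice:
#   train = pd.read_csv("audiocaps_train.csv")
#   overlap = pd.read_csv("splits/audiocaps_in_favdbench.csv")
# Toy frames with a hypothetical "id" column stand in here.
train = pd.DataFrame({"id": ["a", "b", "c", "d"], "caption": ["..."] * 4})
overlap = pd.DataFrame({"id": ["b", "d"]})

# Drop every training row whose id appears in the overlap split.
clean = train[~train["id"].isin(overlap["id"])].reset_index(drop=True)
print(clean["id"].tolist())  # ['a', 'c']
```

The filtered frame can then be written back out as the deduplicated training set.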

Citation

If you use this work in some way, please cite it as follows:

@article{wijngaard2025audio,
  author={Wijngaard, Gijs and Formisano, Elia and Esposito, Michele and Dumontier, Michel},
  journal={IEEE Access},
  title={Audio-Language Datasets of Scenes and Events: A Survey},
  year={2025},
  volume={13},
  number={},
  pages={20328-20360},
  doi={10.1109/ACCESS.2025.3534621}
}

Owner

  • Name: Gijs Wijngaard
  • Login: GlJS
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite as below:"
authors:
  - family-names: Wijngaard
    given-names: Gijs
  - family-names: Formisano
    given-names: Elia
  - family-names: Esposito
    given-names: Michele
  - family-names: Dumontier
    given-names: Michel
title: "Audio-Language Datasets of Scenes and Events: A Survey"
version: 1.0.0
doi: 10.1109/ACCESS.2025.3534621
date-released: '2025-02-07'
preferred-citation:
  authors:
    - family-names: Wijngaard
      given-names: Gijs
    - family-names: Formisano
      given-names: Elia
    - family-names: Esposito
      given-names: Michele
    - family-names: Dumontier
      given-names: Michel
  title: "Audio-Language Datasets of Scenes and Events: A Survey"
  doi: 10.1109/ACCESS.2025.3534621
  type: article
  start: 20328
  end: 20360
  volume: 13
  year: 2025
  journal: IEEE Access
  publisher: IEEE

GitHub Events

Total
  • Watch event: 6
  • Delete event: 1
  • Push event: 6
  • Pull request event: 2
  • Fork event: 1
Last Year
  • Watch event: 6
  • Delete event: 1
  • Push event: 6
  • Pull request event: 2
  • Fork event: 1

Dependencies

requirements.txt pypi
  • accelerate *
  • datasets *
  • librosa *
  • matplotlib *
  • numba *
  • numpy *
  • pandas *
  • polars *
  • pydub *
  • python-dotenv *
  • pyyaml *
  • scikit-learn *
  • scipy *
  • seaborn *
  • soundfile *
  • submitit *
  • torch *
  • torchaudio *
  • tqdm *
  • transformers *
  • webdataset *