audioset-convnext-inf
Adapting a ConvNeXt model to audio classification on AudioSet
Science Score: 54.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org, zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.8%) to scientific vocabulary
Repository
Adapting a ConvNeXt model to audio classification on AudioSet
Basic Info
- Host: GitHub
- Owner: topel
- License: mit
- Language: Python
- Default Branch: main
- Size: 1.46 MB
Statistics
- Stars: 25
- Watchers: 2
- Forks: 2
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
Adapting a ConvNeXt model to audio classification on AudioSet
In this work, we adapted the computer vision architecture ConvNeXt (Tiny) to perform audio tagging on AudioSet.
In this repo, we provide the PyTorch code to run inference with our best checkpoint, trained on the AudioSet dev subset (balanced + unbalanced subsets). We do not provide training code, but our training pipeline is heavily based on PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. Many thanks to Qiuqiang Kong and colleagues for their excellent open-source work.
Install instructions
conda env create -f environment.yml
The most important modules are:
- pytorch 1.11.0; more recent 1.* versions should work (possibly 2.* as well),
- torchaudio,
- torchlibrosa, needed to generate log-mel spectrograms exactly as in the PANNs code.
Activate the newly created env:
conda activate audio_retrieval
Then either clone this repo and work locally, or pip-install it with:
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
Get a checkpoint
Create a checkpoints directory and place a checkpoint in it.
A checkpoint is available on Zenodo: https://zenodo.org/record/8020843
Download convnext_tiny_471mAP.pth to do audio tagging and embedding extraction.
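A minimal download sketch, assuming Zenodo's usual `/files/<name>` direct-download URL pattern (the exact file URL is not stated in this README):

```python
from pathlib import Path
from urllib.request import urlretrieve

ZENODO_RECORD = "https://zenodo.org/record/8020843"
CKPT_NAME = "convnext_tiny_471mAP.pth"

def ensure_checkpoint(ckpt_dir="checkpoints", download=True):
    """Create the checkpoints directory and fetch the weights if missing."""
    ckpt_path = Path(ckpt_dir) / CKPT_NAME
    ckpt_path.parent.mkdir(exist_ok=True)
    if download and not ckpt_path.exists():
        # The direct-file URL pattern below is an assumption about Zenodo's layout.
        urlretrieve(f"{ZENODO_RECORD}/files/{CKPT_NAME}", ckpt_path)
    return ckpt_path
```

If the direct URL changes, downloading the file manually from the record page works just as well.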
The following results were obtained on the AudioSet test set:
| Metric  | Value |
|---------|-------|
| mAP     | 0.471 |
| AUC     | 0.973 |
| d-prime | 3.071 |
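As a side note, d-prime relates to AUC through the inverse normal CDF, d' = √2 · Φ⁻¹(AUC). AudioSet metrics are computed per class and then averaged, so the headline d-prime (3.071) is not simply d' of the averaged AUC; the stdlib-only sketch below just illustrates the per-class formula:

```python
from math import sqrt
from statistics import NormalDist

def d_prime(auc: float) -> float:
    """d' = sqrt(2) * inverse CDF of the standard normal evaluated at AUC."""
    return sqrt(2) * NormalDist().inv_cdf(auc)

# A single class with AUC 0.973 has d' of roughly 2.72, illustrating why the
# macro-averaged d-prime above differs from d'(macro-averaged AUC).
```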
A second checkpoint is also available, in case you are interested in doing experiments on the AudioCaps dataset (audio captioning and audio-text retrieval).
Audio tagging demo
The script demo_convnext.py provides an example of how to do audio tagging on a single audio file, provided in the audio_samples directory.
It will give the following output:
```
Loaded ckpt from: /gpfswork/rech/djl/uzj43um/audio_retrieval/audioset-convnext-inf/checkpoints/convnext_tiny_471mAP.pth
params: 28222767
Inference on: f62-S-v2swA_200000_210000.wav
logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])
Predicted labels using activity threshold 0.25:
[  0 137 138 139 151 506]
Scene embedding, shape: torch.Size([1, 768])
Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
```
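The "activity threshold" step can be sketched in plain Python: a sigmoid turns each logit into a probability, and classes whose probability exceeds 0.25 are kept. The 5-class logits below are hypothetical stand-ins for AudioSet's 527:

```python
from math import exp

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + exp(-x))

def predicted_labels(logits, threshold=0.25):
    """Indices of classes whose sigmoid probability exceeds the threshold."""
    return [i for i, z in enumerate(logits) if sigmoid(z) > threshold]

# Hypothetical 5-class logits instead of AudioSet's 527:
print(predicted_labels([2.0, -3.0, 0.1, -1.5, -2.0]))  # → [0, 2]
```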
You can map the predicted indexes to their tag names using the file metadata/class_labels_indices.csv:
[ 0 137 138 139 151 506] Speech; Music; Musical instrument; Plucked string instrument; Ukulele; Inside, small room
The ground truth for this recording, as given in audio_samples/f62-S-v2swA_200000_210000_labels.txt, is:
[ 0 137 151] Speech; Music; Ukulele;
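The index-to-name lookup above can be sketched with the stdlib csv module. class_labels_indices.csv follows the standard AudioSet layout (index, mid, display_name); the inline three-row sample below is an illustrative excerpt, not the actual file contents:

```python
import csv
import io

def load_label_map(csv_file):
    """Map class index -> display name from an AudioSet-style CSV."""
    reader = csv.DictReader(csv_file)
    return {int(row["index"]): row["display_name"] for row in reader}

# Illustrative excerpt of metadata/class_labels_indices.csv:
sample = io.StringIO(
    "index,mid,display_name\n"
    "0,/m/09x0r,Speech\n"
    "137,/m/04rlf,Music\n"
    "151,/m/07xzm,Ukulele\n"
)
labels = load_label_map(sample)
print("; ".join(labels[i] for i in [0, 137, 151]))  # → Speech; Music; Ukulele
```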
Additionally, the methods model.forward_scene_embeddings(waveform) and model.forward_frame_embeddings(waveform) return audio scene-level and frame-level embeddings, respectively. Their shapes are printed out in the script example:
- scene embedding: a 768-d vector
- frame-level embedding: 768 × 31 × 7, i.e. 768 "images" of size 31 time frames × 7 frequency coefficients.
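To relate the two shapes: a 768 × 31 × 7 frame-level tensor collapses to a 768-d scene vector once the 31 × 7 time-frequency grid is pooled per channel. This README does not say which pooling the model uses, so the sketch below simply assumes mean pooling, over plain nested lists:

```python
def mean_pool(frames):
    """Average a (C, T, F) nested list over its T x F grid -> length-C vector."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in frames
    ]

# Toy tensor with C=768 channels, each a 31 x 7 grid holding a constant value:
frames = [[[float(c)] * 7 for _ in range(31)] for c in range(768)]
scene = mean_pool(frames)
print(len(scene))  # → 768
```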
Evaluate a checkpoint on the balanced and the test subsets of AudioSet
You can reproduce the aforementioned results with the script evaluate_convnext_on_audioset.py.
An sbatch script is also provided: scripts/5_evaluate_convnext_on_audioset.sbatch
It loads the checkpoint and runs it on a single GPU. It should take a few minutes to produce the metric results in the log file.
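For reference, the mAP reported above is the macro average of per-class average precision over the 527 classes. A stdlib-only sketch of binary average precision for one class (the evaluation script presumably relies on scikit-learn, which appears in the dependency list):

```python
def average_precision(labels, scores):
    """AP: mean of the precision at each true positive, ranked by score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Two positives ranked 1st and 3rd: AP = (1/1 + 2/3) / 2
print(round(average_precision([1, 0, 1], [0.9, 0.8, 0.7]), 4))  # → 0.8333
```

The overall mAP would then be the mean of this quantity across all classes.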
Citation
If you find this work useful, please consider citing our paper, to be presented at INTERSPEECH 2023:
Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., & Masquelier, T. (2023). Adapting a ConvNeXt model to audio classification on AudioSet. arXiv preprint arXiv:2306.00830.
@misc{pellegrini2023adapting,
title={{Adapting a ConvNeXt model to audio classification on AudioSet}},
author={Thomas Pellegrini and Ismail Khalfaoui-Hassani and Etienne Labbé and Timothée Masquelier},
year={2023},
eprint={2306.00830},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
Owner
- Name: Thomas Pellegrini
- Login: topel
- Kind: user
- Location: Toulouse, France
- Company: IRIT
- Website: http://www.irit.fr/~Thomas.Pellegrini/
- Repositories: 38
- Profile: https://github.com/topel
Citation (CITATION.cff)
# -*- coding: utf-8 -*-
cff-version: 1.2.0
message: If you use this code, please consider citing the following paper.
title: audioset-convnext-inf
authors:
- given-names: Thomas
family-names: Pellegrini
affiliation: IRIT
url: https://github.com/topel/audioset-convnext-inf
preferred-citation:
authors:
- family-names: Pellegrini
given-names: Thomas
affiliation: ANITI, IRIT, UPS
orcid: 'https://orcid.org/0000-0001-8984-1399'
- family-names: Khalfaoui-Hassani
  given-names: Ismail
affiliation: ANITI, UPS
- family-names: Labbé
given-names: Etienne
affiliation: IRIT, UPS
orcid: 'https://orcid.org/0000-0002-7219-5463'
- family-names: Masquelier
  given-names: Timothée
affiliation: CerCo
# arxiv citation
doi: "10.48550/arXiv.2306.00830"
month: 6
title: "Adapting a ConvNeXt model to audio classification on AudioSet"
url: "https://doi.org/10.48550/arXiv.2306.00830"
year: 2023
type: proceedings
GitHub Events
Total
- Issues event: 3
- Watch event: 7
- Issue comment event: 2
- Push event: 1
Last Year
- Issues event: 3
- Watch event: 7
- Issue comment event: 2
- Push event: 1
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 2
- Total pull requests: 0
- Average time to close issues: about 2 hours
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 1.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 0
- Average time to close issues: about 2 hours
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 1.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- JNaranjo-Alcazar (2)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- h5py ==3.2.1
- matplotlib ==3.4.2
- numpy ==1.20.1
- scikit-learn ==0.24.2
- scipy ==1.6.3
- torch ==1.11.0
- torchaudio ==0.11.0
- torchlibrosa ==0.0.9
- tqdm ==4.64.1
- _libgcc_mutex 0.1
- _openmp_mutex 4.5
- bzip2 1.0.8
- ca-certificates 2023.7.22
- ld_impl_linux-64 2.40
- libffi 3.4.2
- libgcc-ng 13.1.0
- libgomp 13.1.0
- libnsl 2.0.0
- libsqlite 3.43.0
- libuuid 2.38.1
- libzlib 1.2.13
- ncurses 6.4
- openssl 3.1.2
- pip 23.2.1
- python 3.9.18
- readline 8.2
- setuptools 68.1.2
- tk 8.6.12
- tzdata 2023c
- wheel 0.41.2
- xz 5.2.6