dcase2024-task6-baseline
DCASE2024 Challenge Task 6 baseline system (Automated Audio Captioning)
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references: none found
- ✓ Academic publication links: arxiv.org, zenodo.org
- ○ Committers with academic emails: none found
- ○ Institutional organization owner: none found
- ○ JOSS paper metadata: none found
- ○ Scientific vocabulary similarity: low similarity (18.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Labbeti
- License: MIT
- Language: Python
- Default Branch: main
- Homepage: https://dcase.community/challenge2024/task-automated-audio-captioning
- Size: 308 KB
Statistics
- Stars: 6
- Watchers: 2
- Forks: 3
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
dcase2024-task6-baseline
The main model is composed of a pretrained convolutional encoder to extract audio features and a transformer decoder to generate captions. For more information, please refer to the corresponding DCASE task page.
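As a rough sketch of this encoder-decoder structure, the following PyTorch code is illustrative only: the feature dimension, layer counts, and vocabulary size are assumptions, and only d_model=256 matches the hyperparameter table below; the baseline's actual classes live in this repository.

```python
# Illustrative encoder-decoder captioning sketch, NOT the baseline's code.
# feat_dim, nhead, num_layers and vocab_size are assumptions; d_model=256
# matches the hyperparameter table in this README.
import torch
from torch import nn

class CaptionerSketch(nn.Module):
    def __init__(self, feat_dim: int = 768, d_model: int = 256, vocab_size: int = 5000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)  # map encoder features to d_model
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, feats: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) frame-level features from a frozen encoder
        # tokens: (batch, length) caption token ids generated so far
        memory = self.proj(feats)
        tgt = self.token_emb(tokens)
        return self.lm_head(self.decoder(tgt, memory))  # next-token logits
```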
This repository includes:
- an AAC model trained on the Clotho dataset,
- audio feature extraction using ConvNeXt,
- a system that reaches a 29.6% SPIDEr-FL score on Clotho-eval (development-testing),
- detailed training characteristics in the output (number of parameters, MACs, energy consumption...).
Installation
First, you need to create an environment that contains python>=3.11 and pip. You can use venv, conda, micromamba or another Python environment tool.
Here is an example with micromamba:
```bash
micromamba env create -n env_dcase24 python=3.11 pip -c defaults
micromamba activate env_dcase24
```
Then, you can clone this repository and install it:
```bash
git clone https://github.com/Labbeti/dcase2024-task6-baseline
cd dcase2024-task6-baseline
pip install -e .
pre-commit install
```
You also need to install Java >= 1.8 and <= 1.13 on your machine to compute AAC metrics. If needed, you can override the Java executable path with the environment variable AAC_METRICS_JAVA_PATH.
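For example, you could set that variable before running evaluation; the path below is purely a placeholder for your own Java install (in a shell you would export it instead):

```python
# Placeholder path: point AAC_METRICS_JAVA_PATH at your own Java executable.
# Equivalent to `export AAC_METRICS_JAVA_PATH=...` in a shell before running.
import os

os.environ["AAC_METRICS_JAVA_PATH"] = "/usr/lib/jvm/java-11-openjdk/bin/java"
```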
Usage
Download external data and models, and prepare the dataset
To download, extract and process data, you need to run:
```bash
dcase24t6-prepare
```
By default, the dataset is stored in the ./data directory. It requires approximately 33 GB of disk space.
Train the default model
```bash
dcase24t6-train +expt=baseline
```
By default, the model and results are saved in the directory ./logs/SAVE_NAME, where SAVE_NAME is the script name followed by the start date.
Metrics are computed at the end of the training with the best checkpoint.
Test a pretrained model
```bash
dcase24t6-test resume=./logs/SAVE_NAME
```
or specify each path separately:
```bash
dcase24t6-test resume=null model.checkpoint_path=./logs/SAVE_NAME/checkpoints/MODEL.ckpt tokenizer.path=./logs/SAVE_NAME/tokenizer.json
```
You need to replace SAVE_NAME with the save directory name and MODEL with the checkpoint filename.
If you want to load and test the baseline pretrained weights, you can specify the baseline checkpoint weights:
```bash
dcase24t6-test resume=~/.cache/torch/hub/checkpoints/dcase2024-task6-baseline
```
Inference on a file
If you want to test the baseline model on a single file, you can use the baseline_pipeline function:
```python
import torch

from dcase24t6.nn.hub import baseline_pipeline

sr = 44100
audio = torch.rand(1, sr * 15)  # 15 s of random mono audio as a stand-in

model = baseline_pipeline()
item = {"audio": audio, "sr": sr}
outputs = model(item)
candidate = outputs["candidates"][0]

print(candidate)
```
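To caption a real recording instead of random noise, you could load it with torchaudio; the filename below is a placeholder, and depending on the pipeline's expectations you may need to resample the waveform first:

```python
# The filename is a placeholder; torchaudio.load returns (waveform, sample_rate).
import torchaudio

from dcase24t6.nn.hub import baseline_pipeline

model = baseline_pipeline()
audio, sr = torchaudio.load("my_recording.wav")  # audio: (channels, samples)
outputs = model({"audio": audio, "sr": sr})
print(outputs["candidates"][0])
```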
Code overview
The source code uses PyTorch Lightning extensively for training and Hydra for configuration. It is highly recommended to learn about them if you want to understand this code.
Data preparation (dcase24t6-prepare) has three main steps:
- download external models (ConvNeXt, used to extract audio features),
- download the Clotho dataset using aac-datasets,
- create HDF files containing each Clotho subset, with audio features preprocessed using torchoutil.
Training follows the standard way to create a model with Lightning:
- initialize callbacks, tokenizer, datamodule, and model,
- start fitting the model on the specified datamodule,
- evaluate the model using aac-metrics.
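In plain PyTorch Lightning terms, that workflow looks roughly like the sketch below; the function and its arguments are placeholders rather than this repository's actual entry points, and the trainer values mirror the hyperparameter table further down:

```python
# Rough Lightning workflow sketch; the datamodule and model arguments stand in
# for whatever Hydra instantiates in this repository.
import lightning as L

def run_training(datamodule: L.LightningDataModule, model: L.LightningModule) -> None:
    trainer = L.Trainer(
        max_epochs=400,             # from the hyperparameter table
        accumulate_grad_batches=8,  # gradient accumulation
        gradient_clip_val=1.0,      # gradient clipping
    )
    trainer.fit(model, datamodule=datamodule)                     # training + validation
    trainer.test(model, datamodule=datamodule, ckpt_path="best")  # metrics with the best checkpoint
```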
Model
The model outperforms previous baselines with a SPIDEr-FL score of 29.6% on the Clotho evaluation subset. The captioning model architecture is described in this paper and called CNext-trans. The encoder part (ConvNeXt) is described in more detail in this paper.
The pretrained weights of the AAC model are available on Zenodo: ConvNeXt encoder (BL_AC), Transformer decoder. Both weights are automatically downloaded during dcase24t6-prepare.
Main hyperparameters
| Hyperparameter | Value | Option |
| --- | --- | --- |
| Number of epochs | 400 | trainer.max_epochs |
| Batch size | 64 | datamodule.batch_size |
| Gradient accumulation | 8 | trainer.accumulate_grad_batches |
| Learning rate | 5e-4 | model.lr |
| Weight decay | 2 | model.weight_decay |
| Gradient clipping | 1 | trainer.gradient_clip_val |
| Beam size | 3 | model.beam_size |
| Model dimension size | 256 | model.d_model |
| Label smoothing | 0.2 | model.label_smoothing |
| Mixup alpha | 0.4 | model.mixup_alpha |
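As an illustration of the mixup entry above (alpha = 0.4), feature-level mixup can be written as the sketch below; this is the generic formulation, not necessarily how this repository applies it internally:

```python
# Generic mixup sketch for a batch of feature tensors; illustrative only.
import torch

def mixup(features: torch.Tensor, alpha: float = 0.4) -> torch.Tensor:
    lam = torch.distributions.Beta(alpha, alpha).sample()  # mixing coefficient in (0, 1)
    perm = torch.randperm(features.shape[0])               # shuffled batch indices
    return lam * features + (1.0 - lam) * features[perm]
```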
Detailed results
| Metric | Score on Clotho-eval |
| --- | --- |
| BLEU-1 | 0.5948 |
| BLEU-2 | 0.3924 |
| BLEU-3 | 0.2603 |
| BLEU-4 | 0.1695 |
| METEOR | 0.1897 |
| ROUGE-L | 0.3927 |
| CIDEr-D | 0.4619 |
| SPICE | 0.1335 |
| SPIDEr | 0.2977 |
| SPIDEr-FL | 0.2962 |
| SBERT-sim | 0.5059 |
| FER | 0.0038 |
| FENSE | 0.5040 |
| BERTScore | 0.9766 |
| Vocabulary (words) | 551 |
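Scores like these come from the aac-metrics package used by this repository; below is a hedged sketch of its corpus-level evaluation call (the example captions are invented, and the exact signature should be checked against the installed version):

```python
# Hedged sketch: verify the evaluate() signature against your aac-metrics version.
from aac_metrics import evaluate

candidates = ["a man is speaking while rain falls"]  # invented example caption
mult_references = [["rain falls while a man speaks", "a person talks in the rain"]]
corpus_scores, _ = evaluate(candidates, mult_references)
print(corpus_scores)  # mapping of metric names to scores
```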
Here is also an estimate of the number of parameters and of the multiply-accumulate operations (MACs) during inference for the audio file "Santa Motor.wav":
| Name | Params (M) | MACs (G) |
| --- | --- | --- |
| Encoder | 29.4 | 44.4 |
| Decoder | 11.9 | 4.3 |
| Total | 41.3 | 48.8 |
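The parameter counts (though not the MACs) can be reproduced for any PyTorch module with a small helper like this one:

```python
# Counts trainable parameters of any torch.nn.Module, reported in millions.
import torch

def count_params_m(model: torch.nn.Module) -> float:
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```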
Tips
- **Modify the model.** The model class is located in `src/dcase24t6/models/trans_decoder.py`. It is recommended to create another class and config to keep different model architectures separate. The loss is computed in the method called `training_step`. You can also modify the model architecture in the method called `setup`.
- **Extract different audio features.** For that, you can add a new pre-process function in `src/dcase24t6/pre_processes` and the related config in `src/conf/pre_process`. Then, re-run `dcase24t6-prepare pre_process=YOUR_PROCESS download_clotho=false` to create new HDF files with your own features. To train a new model on these features, you can specify the required HDF files with `dcase24t6-train datamodule.train_hdfs=clotho_dev_YOUR_PROCESS.hdf datamodule.val_hdfs=... datamodule.test_hdfs=... datamodule.predict_hdfs=...`. Depending on the features extracted, some model parameters may need to be modified to handle them.
- **Using as a package.** If you do not want to use the entire codebase but only parts of it, you can install it as a package using:
```bash
pip install git+https://github.com/Labbeti/dcase2024-task6-baseline
```
Then you will be able to import any object from the code, for example `from dcase24t6.models.trans_decoder import TransDecoderModel`. There are also several important dependencies that you can install separately:
- `aac-datasets` to download and load AAC datasets,
- `aac-metrics` to compute AAC metrics,
- `torchoutil[extras]` to pack datasets into HDF files.
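For instance, loading Clotho directly through aac-datasets could look like the sketch below; the constructor arguments follow that package's typical usage, but check its documentation for your installed version:

```python
# Hedged sketch: argument names follow aac-datasets' typical usage; verify
# against the package documentation for your installed version.
from aac_datasets import Clotho

dataset = Clotho("./data", subset="dev", download=False)
item = dataset[0]
audio, captions = item["audio"], item["captions"]
```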
Additional information
- The code was developed for Ubuntu 20.04 and should work on more recent Ubuntu versions and other Linux-based distributions.
- The GPU used is an NVIDIA GeForce RTX 2080 Ti (11 GB VRAM). Training lasts approximately 2h30m in the default setting.
- In this code, Clotho subsets are named according to the Clotho convention, not the DCASE convention. See more information on this page.
See also
- DCASE2023 Audio Captioning baseline
- DCASE2022 Audio Captioning baseline
- DCASE2021 Audio Captioning baseline
- DCASE2020 Audio Captioning baseline
- aac-datasets
- aac-metrics
Contact
Maintainer:
- Étienne Labbé "Labbeti": labbeti.pub@gmail.com
Owner
- Name: Labbeti
- Login: Labbeti
- Kind: user
- Location: Toulouse, France
- Company: IRIT
- Website: labbeti.github.io
- Repositories: 5
- Profile: https://github.com/Labbeti
PhD student at IRIT (Institut de Recherche en Informatique de Toulouse), working mainly on Automated Audio Captioning.
Citation (CITATION.cff)
```yaml
# -*- coding: utf-8 -*-
cff-version: 1.2.0
message: If you use this software, please cite it as below.
title: dcase24t6
authors:
  - given-names: Etienne
    family-names: Labbé
    affiliation: IRIT
    orcid: 'https://orcid.org/0000-0002-7219-5463'
url: https://github.com/Labbeti/dcase2024-task6-baseline
keywords:
  - baseline
  - audio-captioning
  - dcase2024
license: MIT
version: 1.1.0
date-released: '2024-04-19'
preferred-citation:
  authors:
    - family-names: Labbé
      given-names: Etienne
      affiliation: IRIT
      orcid: 'https://orcid.org/0000-0002-7219-5463'
    - family-names: Pellegrini
      given-names: Thomas
      affiliation: IRIT
      orcid: 'https://orcid.org/0000-0001-8984-1399'
    - family-names: Pinquier
      given-names: Julien
      affiliation: IRIT
  doi: "10.48550/arXiv.2309.00454"
  journal: "arxiv preprint"
  title: "CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding"
  type: article
  url: "https://arxiv.org/abs/2309.00454"
  year: 2023
```
GitHub Events
Total
- Watch event: 3
- Pull request event: 2
- Fork event: 3
Last Year
- Watch event: 3
- Pull request event: 2
- Fork event: 3
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 2
- Total pull requests: 0
- Average time to close issues: 3 minutes
- Average time to close pull requests: N/A
- Total issue authors: 2
- Total pull request authors: 0
- Average comments per issue: 0.5
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 0
- Average time to close issues: 3 minutes
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 0
- Average comments per issue: 0.5
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- akhilkumardonka (1)
- seb-son (1)
Pull Request Authors
- mumbert (1)
Dependencies
- actions/checkout v2 composite
- actions/setup-python v4 composite
- PyYAML ==6.0.1
- aac-datasets *
- aac-metrics *
- black ==24.2.0
- codecarbon ==2.3.4
- deepspeed *
- flake8 *
- hydra-colorlog ==1.2.0
- hydra-core ==1.3.2
- ipykernel ==6.29.3
- ipython ==8.22.1
- lightning ==2.2.0
- nltk ==3.8.1
- pre-commit ==3.6.2
- pytest *
- tensorboard ==2.16.2
- tokenizers ==0.15.2
- torch ==2.2.1
- torchlibrosa ==0.1.0
- torchoutil >=0.2.2,<0.3.0