dcase2024-task6-baseline

DCASE2024 Challenge Task 6 baseline system (Automated Audio Captioning)

https://github.com/labbeti/dcase2024-task6-baseline

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org, zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.6%) to scientific vocabulary

Keywords

audio-captioning baseline dcase2024
Last synced: 6 months ago

Repository

DCASE2024 Challenge Task 6 baseline system (Automated Audio Captioning)

Basic Info
Statistics
  • Stars: 6
  • Watchers: 2
  • Forks: 3
  • Open Issues: 0
  • Releases: 2
Topics
audio-captioning baseline dcase2024
Created about 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Changelog License Citation

README.md

dcase2024-task6-baseline

**DCASE2024 Challenge Task 6 baseline system for Automated Audio Captioning (AAC)**

The main model is composed of a pretrained convolutional encoder that extracts audio features and a transformer decoder that generates captions. For more information, please refer to the corresponding DCASE task page.

This repository includes:
- An AAC model trained on the Clotho dataset
- Audio feature extraction using ConvNeXt
- A system reaching a 29.6% SPIDEr-FL score on Clotho-eval (development-testing)
- Detailed training characteristics (number of parameters, MACs, energy consumption...)

Installation

First, you need to create an environment that contains python>=3.11 and pip. You can use venv, conda, micromamba or another Python environment tool.

Here is an example with micromamba:

```bash
micromamba env create -n env_dcase24 python=3.11 pip -c defaults
micromamba activate env_dcase24
```

Then, you can clone this repository and install it:

```bash
git clone https://github.com/Labbeti/dcase2024-task6-baseline
cd dcase2024-task6-baseline
pip install -e .
pre-commit install
```

You also need to install Java >= 1.8 and <= 1.13 on your machine to compute the AAC metrics. If needed, you can override the Java executable path with the environment variable AAC_METRICS_JAVA_PATH.
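For example, to point the metrics at a specific Java installation (the path below is illustrative; use the location of your own Java 8-13 binary):

```shell
# Illustrative path: adjust to wherever your Java 8-13 executable lives
export AAC_METRICS_JAVA_PATH=/usr/lib/jvm/java-11-openjdk-amd64/bin/java
```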

Usage

Download external data, models and prepare

To download, extract and process the data, you need to run:

```bash
dcase24t6-prepare
```

By default, the dataset is stored in the ./data directory. It requires approximately 33 GB of disk space.

Train the default model

```bash
dcase24t6-train +expt=baseline
```

By default, the model and results are saved in the directory ./logs/SAVE_NAME, where SAVE_NAME is the name of the script followed by the start date. Metrics are computed at the end of training with the best checkpoint.

Test a pretrained model

```bash
dcase24t6-test resume=./logs/SAVE_NAME
```

or specify each path separately:

```bash
dcase24t6-test resume=null model.checkpoint_path=./logs/SAVE_NAME/checkpoints/MODEL.ckpt tokenizer.path=./logs/SAVE_NAME/tokenizer.json
```

You need to replace SAVE_NAME with the save directory name and MODEL with the checkpoint filename.

If you want to load and test the baseline pretrained weights, you can specify the baseline checkpoint weights:

```bash
dcase24t6-test resume=~/.cache/torch/hub/checkpoints/dcase2024-task6-baseline
```

Inference on a file

If you want to test the baseline model on a single file, you can use the baseline_pipeline function:

```python
import torch

from dcase24t6.nn.hub import baseline_pipeline

# Dummy 15-second audio clip at 44.1 kHz
sr = 44100
audio = torch.rand(1, sr * 15)

model = baseline_pipeline()
item = {"audio": audio, "sr": sr}
outputs = model(item)
candidate = outputs["candidates"][0]

print(candidate)
```

Code overview

The source code uses PyTorch Lightning extensively for training and Hydra for configuration. It is highly recommended to learn about them if you want to understand this code.

Installation has three main steps:
- Download external models (ConvNeXt, used to extract audio features)
- Download the Clotho dataset using aac-datasets
- Create HDF files containing each Clotho subset with preprocessed audio features using torchoutil

Training follows the standard way to create a model with Lightning:
- Initialize the callbacks, tokenizer, datamodule and model.
- Fit the model on the specified datamodule.
- Evaluate the model using aac-metrics.

Model

The model outperforms previous baselines with a SPIDEr-FL score of 29.6% on the Clotho evaluation subset. The captioning model architecture is described in this paper and called CNext-trans. The encoder part (ConvNeXt) is described in more detail in this paper.

The pretrained weights of the AAC model are available on Zenodo: ConvNeXt encoder (BL_AC), Transformer decoder. Both weights are automatically downloaded during dcase24t6-prepare.

Main hyperparameters

| Hyperparameter | Value | Option |
| --- | --- | --- |
| Number of epochs | 400 | trainer.max_epochs |
| Batch size | 64 | datamodule.batch_size |
| Gradient accumulation | 8 | trainer.accumulate_grad_batches |
| Learning rate | 5e-4 | model.lr |
| Weight decay | 2 | model.weight_decay |
| Gradient clipping | 1 | trainer.gradient_clip_val |
| Beam size | 3 | model.beam_size |
| Model dimension size | 256 | model.d_model |
| Label smoothing | 0.2 | model.label_smoothing |
| Mixup alpha | 0.4 | model.mixup_alpha |
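Since gradients are accumulated over 8 batches of 64 samples each, the effective batch size seen by the optimizer is their product. A quick sanity check:

```python
batch_size = 64              # datamodule.batch_size
accumulate_grad_batches = 8  # trainer.accumulate_grad_batches

# Effective batch size per optimizer step
effective_batch_size = batch_size * accumulate_grad_batches
print(effective_batch_size)  # → 512
```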

Detailed results

| Metric | Score on Clotho-eval |
| --- | --- |
| BLEU-1 | 0.5948 |
| BLEU-2 | 0.3924 |
| BLEU-3 | 0.2603 |
| BLEU-4 | 0.1695 |
| METEOR | 0.1897 |
| ROUGE-L | 0.3927 |
| CIDEr-D | 0.4619 |
| SPICE | 0.1335 |
| SPIDEr | 0.2977 |
| SPIDEr-FL | 0.2962 |
| SBERT-sim | 0.5059 |
| FER | 0.0038 |
| FENSE | 0.5040 |
| BERTScore | 0.9766 |
| Vocabulary (words) | 551 |
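SPIDEr is defined as the mean of CIDEr-D and SPICE, which can be verified directly from the table above:

```python
cider_d = 0.4619
spice = 0.1335

# SPIDEr = (CIDEr-D + SPICE) / 2
spider = (cider_d + spice) / 2
print(round(spider, 4))  # → 0.2977, matching the SPIDEr row
```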

Here is also an estimate of the number of parameters and of the multiply-accumulate operations (MACs) during inference for the audio file "Santa Motor.wav":

| Name | Params (M) | MACs (G) |
| --- | --- | --- |
| Encoder | 29.4 | 44.4 |
| Decoder | 11.9 | 4.3 |
| Total | 41.3 | 48.8 |
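The totals are, up to rounding of the per-module values, the sums of the encoder and decoder rows:

```python
encoder_params, decoder_params = 29.4, 11.9  # in millions (M)
encoder_macs, decoder_macs = 44.4, 4.3       # in billions (G)

total_params = encoder_params + decoder_params
total_macs = encoder_macs + decoder_macs
print(round(total_params, 1))  # → 41.3, as in the table
print(round(total_macs, 1))    # 48.7; the table reports 48.8 because the per-module values are themselves rounded
```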

Tips

  • Modify the model. The model class is located in src/dcase24t6/models/trans_decoder.py. It is recommended to create another class and config to keep different model architectures separate. The loss is computed in the method called training_step. You can also modify the model architecture in the method called setup.

  • Extract different audio features. For that, add a new pre-process function in src/dcase24t6/pre_processes and the related config in src/conf/pre_process. Then, re-run dcase24t6-prepare pre_process=YOUR_PROCESS download_clotho=false to create new HDF files with your own features. To train a new model on these features, specify the required HDF files with dcase24t6-train datamodule.train_hdfs=clotho_dev_YOUR_PROCESS.hdf datamodule.val_hdfs=... datamodule.test_hdfs=... datamodule.predict_hdfs=.... Depending on the features extracted, some model parameters may need to be modified to handle them.

  • Using as a package. If you do not want to use the entire codebase but only parts of it, you can install it as a package using:

```bash
pip install git+https://github.com/Labbeti/dcase2024-task6-baseline
```

Then you will be able to import any object from the code, for example from dcase24t6.models.trans_decoder import TransDecoderModel. There are also several important dependencies that you can install separately:

  • aac-datasets to download and load AAC datasets,
  • aac-metrics to compute AAC metrics,
  • torchoutil[extras] to pack datasets to HDF files.

Additional information

  • The code was made for Ubuntu 20.04 and should work on more recent Ubuntu versions and other Linux-based distributions.
  • The GPU used is an NVIDIA GeForce RTX 2080 Ti (11 GB VRAM). Training lasts approximately 2h30 in the default setting.
  • In this code, Clotho subsets are named according to the Clotho convention, not the DCASE convention. See more information on this page.

See also

Contact

Maintainer:
  • Étienne Labbé "Labbeti": labbeti.pub@gmail.com

Owner

  • Name: Labbeti
  • Login: Labbeti
  • Kind: user
  • Location: Toulouse, France
  • Company: IRIT

PhD student at IRIT (Institut de Recherche en Informatique de Toulouse), working mainly on Automated Audio Captioning.

Citation (CITATION.cff)

# -*- coding: utf-8 -*-

cff-version: 1.2.0
message: If you use this software, please cite it as below.

title: dcase24t6
authors:
  - given-names: Etienne
    family-names: Labbé
    affiliation: IRIT
    orcid: 'https://orcid.org/0000-0002-7219-5463'
url: https://github.com/Labbeti/dcase2024-task6-baseline

keywords:
  - baseline
  - audio-captioning
  - dcase2024

license: MIT
version: 1.1.0
date-released: '2024-04-19'

preferred-citation:
  authors:
    - family-names: Labbé
      given-names: Etienne
      affiliation: IRIT
      orcid: 'https://orcid.org/0000-0002-7219-5463'
    - family-names: Pellegrini
      given-names: Thomas
      affiliation: IRIT
      orcid: 'https://orcid.org/0000-0001-8984-1399'
    - family-names: Pinquier
      given-names: Julien
      affiliation: IRIT

  doi: "10.48550/arXiv.2309.00454"
  journal: "arxiv preprint"
  title: "CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding"
  type: article
  url: "https://arxiv.org/abs/2309.00454"
  year: 2023

GitHub Events

Total
  • Watch event: 3
  • Pull request event: 2
  • Fork event: 3
Last Year
  • Watch event: 3
  • Pull request event: 2
  • Fork event: 3

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 3
  • Total Committers: 1
  • Avg Commits per committer: 3.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 3
  • Committers: 1
  • Avg Commits per committer: 3.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Labbeti e****1@g****m 3

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 2
  • Total pull requests: 0
  • Average time to close issues: 3 minutes
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 0
  • Average comments per issue: 0.5
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: 3 minutes
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 0.5
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • akhilkumardonka (1)
  • seb-son (1)
Pull Request Authors
  • mumbert (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

.github/workflows/test.yaml actions
  • actions/checkout v2 composite
  • actions/setup-python v4 composite
pyproject.toml pypi
requirements.txt pypi
  • PyYAML ==6.0.1
  • aac-datasets *
  • aac-metrics *
  • black ==24.2.0
  • codecarbon ==2.3.4
  • deepspeed *
  • flake8 *
  • hydra-colorlog ==1.2.0
  • hydra-core ==1.3.2
  • ipykernel ==6.29.3
  • ipython ==8.22.1
  • lightning ==2.2.0
  • nltk ==3.8.1
  • pre-commit ==3.6.2
  • pytest *
  • tensorboard ==2.16.2
  • tokenizers ==0.15.2
  • torch ==2.2.1
  • torchlibrosa ==0.1.0
  • torchoutil >=0.2.2,<0.3.0
setup.py pypi