https://github.com/cahya-wirawan/audio-captioning

Audio captioning - DCASE challenge 2023 task 6a

Last synced: 10 months ago · JSON representation

Repository

Audio captioning - DCASE challenge 2023 task 6a

Basic Info

Host: GitHub
Owner: cahya-wirawan
License: mit
Language: Jupyter Notebook
Default Branch: main
Size: 90.1 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Fork of prompteus/audio-captioning

Created over 1 year ago · Last pushed over 1 year ago

https://github.com/cahya-wirawan/audio-captioning/blob/main/

# Audio captioning

This is the official repository for a technical report [A Whisper transformer for audio captioning trained with synthetic captions and transfer learning](https://arxiv.org/abs/2305.09690).

This repository serves to train and evaluate the Whisper model for general audio-scene captioning.
The input is a short audio clip, and the output is a brief text description of what is happening.

You can find our checkpoints [on Huggingface](https://huggingface.co/collections/MU-NLPC/whisper-for-audio-captioning-653fc8f8fd9b567733359593):
- [Whisper tiny](https://huggingface.co/MU-NLPC/whisper-tiny-audio-captioning)
- [Whisper small](https://huggingface.co/MU-NLPC/whisper-small-audio-captioning)
- [Whisper large](https://huggingface.co/MU-NLPC/whisper-large-v2-audio-captioning)

If you find our work useful, cite us as follows:
```
@misc{kadlk2023whisper,
title={A Whisper transformer for audio captioning trained with synthetic captions and transfer learning},
author={Marek Kadlk and Adam Hjek and Jrgen Kieslich and Radosaw Winiecki},
year={2023},
eprint={2305.09690},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
```

## Setting up environment

Start by creating a conda environment:
```shell
git clone --recursive ... # recursive because there is `evaluation_tools` as git submodule
cd audio-captioning
conda create -n malach23 python=3.8
conda activate malach23
pip install -r requirements.txt
pip install -e .
```
If the last line does not work, update your pip. e.g. `pip install --upgrade pip`

After you have the environment ready, run the script inside audiocap/evaluation_tools
```
chmod +x audiocap/evaluation_tools/coco_caption/get_stanford_models.sh
./audiocap/evaluation_tools/coco_caption/get_stanford_models.sh
```
This will download the data necessary for computing evaluation metrics.

## Preparing data

We train on multiple datasets: Audioset (our selected subset), AudioCaps, and finally Clotho.
To make it simple to work with multiple datasets downloaded them to convert them into a file
structure that is as compatible as possible. We call it AudioFolder, because it is inspired
by HuggingFace's AudioFolder or ImageFolder.

While the datasets are not *completely* compatible (e.g. one caption vs multiple captions per
audio clip), AudioFolder structure and python class `audiocap.data.AudioFolder` helps us work
with them in a systematic way. The following sections explain how to get the data and prepare
AudioFolder from them.

### Clotho dataset

Getting the data

```shell
mkdir -p data/clotho_v2.1/audiofolder
```

Download the data from and extract csv into the `data/clotho_v2.1` and audios into `data/clotho_v2.1/audiofolder` folder. Your tree structure should look like this:

```
audio-captioning/
audiocap
...
...
|
data
clotho_v2.1
audiofolder
development
evaluation
test
validation
clotho_captions_development.csv
clotho_captions_evaluation.csv
clotho_captions_validation.csv
clotho_metadata_development.csv
clotho_metadata_evaluation.csv
clotho_metadata_test.csv
clotho_metadata_validation.csv
...
```

Creating AudioFolder

Now, prepare

```shell
python audiocap/prepare_audiofolder.py prepare-clotho-audiofolder data/clotho_v2.1/
```

This will prepare the folder into the format that is easily loadable.

To limit a size of a split (like validation and evaluation), run:

```shell
python audiocap/prepare_audiofolder.py limit-clotho-split data/clotho_v2.1/audiofolder/ validation --limit 200
python audiocap/prepare_audiofolder.py limit-clotho-split data/clotho_v2.1/audiofolder/ evaluation --limit 400
```

This will sample (with a seed) a subset with a desired size and move the remaining examples to the development split.

### Pretraining data

Getting AudioSet

AudioSet is a large multi-label classification dataset. In our repository, we use information from
AudioSet ontology to construct keyword-based synthetic captions. This makes it possible to pretrain a
seq2seq captioning model (like Whisper) on AudioSet using an end-to-end supervised training pipeline.

AudioSet annotations are copied into this repository, but audios must be scraped from youtube.
You can use `scripts/download_audioset.sh` script that will use all cores to download and
convert audios based on youtube ids.

Make the script executable

```shell
chmod +x ./scripts/download_audioset.sh
```

Download the audio files

```shell
SPLIT='train_unbalanced' # run again with 'train_balanced' or 'eval'

mkdir -p logs/download_audioset

./scripts/download_audioset.sh \
"data/audioset_full/csvs/${SPLIT}.csv" \
"data/audioset_full/audios/${SPLIT}/" 2>&1 \
| tee >( sed 's/.*\r//' > "logs/download_audioset/${SPLIT}.txt" )
```

(`sed` is there to delete output lines that just update the progress)

Please note that scraping AudioSet is best-effort only. Videos could be deleted from youtube.
Now, you should select a subset of AudioSet that suits your needs. AudioSet is heavily imbalanced,
with music and speech ocurring in a vast majority of examples. In our case, we selected
around 130k instances that covered as much of the underrepresented classes. However, before we
select the subset, we prepare AudioCaps - a different dataset we use for pretraining. This is
to prevent a leakage between the two datasets because they have audio files in common.

Getting AudioCaps

AudioCaps is a captioning dataset with much more audios than Clotho (but is arguably of a lower quality).

AudioCaps annotations are also part of this repository. Furthermore, AudioCaps is a subset of AudioSet,
so you have all AudioCaps audios prepared once you download AudioSet.

Creating AudioCaps AudioFolder

Run:

```shell
python audiocap/prepare_audiofolder.py prepare-audiocaps-audiofolder \
--audiocaps-path data/audiocaps \
--audioset-path data/audioset_full \
--audio-format mp3
```

This will copy the files from AudioSet, and prepare AudioFolder structure
and annotations with dropped records about audios that were listed inside AudioCaps csvs
but files were missing (unavailable when you scraped AudioSet).

Creating a balanced AudioSet subset

This part is most intricate. We want at the same time
- a diverse subset
- a balanced subset
- a large subset
- no leakeage with AudioCaps

This is difficult and has no optimal solution. Especially balancing a dataset is difficult when each example has multiple labels.
In this repository, there are some utilities help select it. If you want to select your own subset, you can look into `notebooks/select_audioset_subset.ipynb`

However, the subset we selected is also available in this repository in `data/audioset_small`.

Creating AudioSet-small AudioFolder

Run:

```shell
python audiocap/prepare_audiofolder.py prepare-audioset-small-audiofolder \
--audioset-small-path data/audioset_small \
--audioset-full-path data/audioset_full \
--audio-format mp3
```

Congrats. Now you have all three datasets prepared for training.

### Checking corrupted audio files

During training, corrupted audio files (not loadable by librosa) are skipped.
However, if you want to check corrupted files, you can use the `audiocap.data.find_corrupted_audios`.

## Training

We train in two phases. We pretrain on a mixture of AudioCaps and AudioSet small, and
then finetune on Clotho.

We monitor metrics (into wandb) on each dataset separately and also log some predictions
so that one can see the outputs the model generates.

Because we can pretrain using the same audio-to-text objective as we do on finetuning,
we can only have a single configurable training script.

### Pretraining

AudioSet is originally a classification dataset. During training, we convert the labels on the fly
into keyword-based synthetic captions.

```shell
CUDA_VISIBLE_DEVICES="..." python \
audiocap/train_whisper_supervised.py \
--checkpoint-dir-root="./checkpoints" \
--audioset-dir="./data/audioset_small/audiofolder" \
--audiocaps-dir="./data/audiocaps/audiofolder" \
--training-config="./configs/pretrain_1on1_large_config.yaml" \
--wandb-group="pretraining"
```

Argument `--training-config` is the most important - it specifies everything important about training.
We experimented with different setups. you can find the different configs inside `configs/` folder.

### Finetuning

To run finetuning, use the following command:

```shell
CUDA_VISIBLE_DEVICES="..." python \
audiocap/train_whisper_supervised.py \
--checkpoint-dir-root="./checkpoints" \
--clotho-dir="./data/clotho_v2.1/audiofolder" \
--training-config="./configs/finetune_large_config.yaml" \
--load-checkpoint="..." \
--wandb-group="finetuning"
```

`--load-checkpoint` is an optional argument that allows initializing the model with weights from local file.

## Multitask training and inference

To effectively train on multiple datasets, we put a dataset and task identifiers into the captions.

Example:
- **clotho > caption:** Fair kind music is being played at the circus grounds.
- **audiocaps > caption:** The wind is blowing, insects are singing, and rustling occurs
- **audioset > keywords:** boat - water vehicle, motorboat - speedboat, sounds of things, vehicle

The prefix informs the model about the style of caption that is used. During inference, a prefix is
forced to the decoder, which makes the model generate output in a desired style. This is a trick
inspired by multilingual generative language models where the prefix specifies the output language.

## Inference

If you have a trained model, you can run the inference script:

```shell
CUDA_VISIBLE_DEVICES="..." python \
audiocap/predict.py \
--checkpoint path/to_checkpoint \
--data path/to/folder/with/audio/files \
--output-file foo.csv \
--config-file configs/predict_config.yaml \
--take-first-n 10 # optional, for debugging purposes
```

The inference script will generate outputs using the model, print raw outputs into stdout and
clean outputs into a csv file with two columns, `file_name` and `caption_predicted`.
Raw outputs include invisible tokens such as `<|startoftranscript|>`, forced prefix, padding, etc. Clean outputs only contain the content.

Config file specifies both inference hyperparameters (like number of beams) and technical necessities,
such as batch size, fp precision or number of loader processes.

## Licence

For all code in this repository code, licence in LICENSE file applies.
For the files in the `data` directory, specific licences apply:

- AudioSet labels: CC BY 4.0
- source of data:
- AudioSet ontology: CC BY-SA 4.0
- source of data:
- AudioCaps labels: MIT
- source of data:

Owner

Name: Cahya Wirawan
Login: cahya-wirawan
Kind: user
Location: Vienna, Austria

Website: https://www.linkedin.com/in/cahyawirawan/
Twitter: CahyaWr
Repositories: 171
Profile: https://github.com/cahya-wirawan

System engineer, currently working on NLP, CV and Speech Recognition for fun and curiosity

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

https://github.com/cahya-wirawan/audio-captioning

Science Score: 10.0%

Repository

Basic Info

Statistics

https://github.com/cahya-wirawan/audio-captioning/blob/main/

Owner

GitHub Events

Total

Last Year