https://github.com/cahya-wirawan/audio-captioning
Audio captioning - DCASE challenge 2023 task 6a
Basic Info
- Host: GitHub
- Owner: cahya-wirawan
- License: mit
- Language: Jupyter Notebook
- Default Branch: main
- Size: 90.1 MB
Fork of prompteus/audio-captioning
Created over 1 year ago
· Last pushed over 1 year ago
# Audio captioning
This is the official repository for the technical report [A Whisper transformer for audio captioning trained with synthetic captions and transfer learning](https://arxiv.org/abs/2305.09690).
This repository serves to train and evaluate the Whisper model for general audio-scene captioning.
The input is a short audio clip, and the output is a brief text description of what is happening.
You can find our checkpoints [on Huggingface](https://huggingface.co/collections/MU-NLPC/whisper-for-audio-captioning-653fc8f8fd9b567733359593):
- [Whisper tiny](https://huggingface.co/MU-NLPC/whisper-tiny-audio-captioning)
- [Whisper small](https://huggingface.co/MU-NLPC/whisper-small-audio-captioning)
- [Whisper large](https://huggingface.co/MU-NLPC/whisper-large-v2-audio-captioning)
If you find our work useful, cite us as follows:
```
@misc{kadlk2023whisper,
title={A Whisper transformer for audio captioning trained with synthetic captions and transfer learning},
author={Marek Kadlčík and Adam Hájek and Jürgen Kieslich and Radosław Winiecki},
year={2023},
eprint={2305.09690},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
```
## Setting up environment
Start by creating a conda environment:
```shell
git clone --recursive ... # recursive because there is `evaluation_tools` as git submodule
cd audio-captioning
conda create -n malach23 python=3.8
conda activate malach23
pip install -r requirements.txt
pip install -e .
```
If the last line does not work, update pip, e.g. with `pip install --upgrade pip`.
After you have the environment ready, run the script inside `audiocap/evaluation_tools`:
```
chmod +x audiocap/evaluation_tools/coco_caption/get_stanford_models.sh
./audiocap/evaluation_tools/coco_caption/get_stanford_models.sh
```
This will download the data necessary for computing evaluation metrics.
## Preparing data
We train on multiple datasets: Audioset (our selected subset), AudioCaps, and finally Clotho.
To make it simple to work with multiple datasets, we convert them into a file
structure that is as compatible as possible. We call it AudioFolder because it is inspired
by Hugging Face's AudioFolder and ImageFolder.
While the datasets are not *completely* compatible (e.g. one caption vs. multiple captions per
audio clip), the AudioFolder structure and the Python class `audiocap.data.AudioFolder` help us work
with them in a systematic way. The following sections explain how to get the data and prepare
an AudioFolder from each dataset.
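To give a rough idea of the AudioFolder concept, here is a minimal sketch that reads one split consisting of audio files and a metadata file stored next to them. The file name `metadata.jsonl` and the fields `file_name`, `caption`, and `captions` are assumptions borrowed from Hugging Face's folder-based datasets, and `read_split` is a hypothetical helper. The layout produced by `audiocap/prepare_audiofolder.py` and the interface of `audiocap.data.AudioFolder` may differ.

```python
import json
from pathlib import Path


def read_split(split_dir):
    """Return a list of (audio_path, captions) pairs for one split.

    Assumes a hypothetical `metadata.jsonl` with one JSON object per line,
    holding `file_name` and either `caption` (str) or `captions` (list).
    """
    split_dir = Path(split_dir)
    examples = []
    with open(split_dir / "metadata.jsonl", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # Normalize "one caption" and "multiple captions" to a list.
            captions = record.get("captions") or [record["caption"]]
            examples.append((split_dir / record["file_name"], captions))
    return examples
```

Normalizing every record to a list of captions is one way to treat single-caption and multi-caption datasets uniformly.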
### Clotho dataset
#### Getting the data
```shell
mkdir -p data/clotho_v2.1/audiofolder
```
Download the Clotho data, then extract the CSV files into `data/clotho_v2.1` and the audio files into the `data/clotho_v2.1/audiofolder` folder. Your tree structure should look like this:
```
audio-captioning/
├── audiocap
│   └── ...
├── ...
│
└── data
    └── clotho_v2.1
        ├── audiofolder
        │   ├── development
        │   ├── evaluation
        │   ├── test
        │   └── validation
        ├── clotho_captions_development.csv
        ├── clotho_captions_evaluation.csv
        ├── clotho_captions_validation.csv
        ├── clotho_metadata_development.csv
        ├── clotho_metadata_evaluation.csv
        ├── clotho_metadata_test.csv
        ├── clotho_metadata_validation.csv
        └── ...
```
#### Creating AudioFolder
Now, prepare the AudioFolder:
```shell
python audiocap/prepare_audiofolder.py prepare-clotho-audiofolder data/clotho_v2.1/
```
This will convert the folder into a format that is easy to load.
To limit the size of a split (such as validation or evaluation), run:
```shell
python audiocap/prepare_audiofolder.py limit-clotho-split data/clotho_v2.1/audiofolder/ validation --limit 200
python audiocap/prepare_audiofolder.py limit-clotho-split data/clotho_v2.1/audiofolder/ evaluation --limit 400
```
This will sample (with a seed) a subset with a desired size and move the remaining examples to the development split.
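The idea of seeded sampling can be sketched as follows. This is only an illustration: `limit_split` is a hypothetical helper, the seed values are arbitrary, and the actual `limit-clotho-split` command also moves files and updates annotations on disk.

```python
import random


def limit_split(file_names, limit, seed=0):
    """Split `file_names` into (kept, moved) using a reproducible sample."""
    ordered = sorted(file_names)  # fixed order, so the seed alone determines the sample
    rng = random.Random(seed)
    kept = set(rng.sample(ordered, min(limit, len(ordered))))
    moved = [name for name in ordered if name not in kept]
    return sorted(kept), moved


validation = [f"clip_{i:03d}.wav" for i in range(10)]
kept, to_development = limit_split(validation, limit=4, seed=42)
print(len(kept), len(to_development))
```

Using a private `random.Random(seed)` instance keeps the sample reproducible without touching the global random state.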
### Pretraining data
#### Getting AudioSet
AudioSet is a large multi-label classification dataset. In our repository, we use information from
AudioSet ontology to construct keyword-based synthetic captions. This makes it possible to pretrain a
seq2seq captioning model (like Whisper) on AudioSet using an end-to-end supervised training pipeline.
AudioSet annotations are copied into this repository, but the audio files must be scraped from YouTube.
You can use the `scripts/download_audioset.sh` script, which uses all cores to download and
convert audio files based on YouTube IDs.
Make the script executable:
```shell
chmod +x ./scripts/download_audioset.sh
```
Download the audio files:
```shell
SPLIT='train_unbalanced' # run again with 'train_balanced' or 'eval'
mkdir -p logs/download_audioset
./scripts/download_audioset.sh \
"data/audioset_full/csvs/${SPLIT}.csv" \
"data/audioset_full/audios/${SPLIT}/" 2>&1 \
| tee >( sed 's/.*\r//' > "logs/download_audioset/${SPLIT}.txt" )
```
(`sed` is there to delete output lines that just update the progress)
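For readers less familiar with `sed`, the effect of `s/.*\r//` on a single line can be reproduced in Python. This is only an illustration of what the filter does and is not part of the repository; `drop_progress_updates` is a hypothetical name.

```python
import re


def drop_progress_updates(line):
    """Keep only the text after the last carriage return in `line`."""
    # `.*` is greedy, so the match extends to the last `\r` on the line.
    return re.sub(r".*\r", "", line)


print(drop_progress_updates("10%\r50%\r100% done"))  # prints "100% done"
```

Progress bars typically redraw themselves by emitting a carriage return, so only the final state of each line ends up in the log file.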
Please note that scraping AudioSet is best-effort only. Videos may have been deleted from YouTube.
Now, you should select a subset of AudioSet that suits your needs. AudioSet is heavily imbalanced,
with music and speech occurring in a vast majority of examples. In our case, we selected
around 130k instances that cover as many of the underrepresented classes as possible. However, before we
select the subset, we prepare AudioCaps, a different dataset we use for pretraining. This
prevents leakage between the two datasets, because they have audio files in common.
#### Getting AudioCaps
AudioCaps is a captioning dataset with many more audio clips than Clotho (though arguably of lower quality).
AudioCaps annotations are also part of this repository. Furthermore, AudioCaps is a subset of AudioSet,
so you have all AudioCaps audios prepared once you download AudioSet.
#### Creating AudioCaps AudioFolder
Run:
```shell
python audiocap/prepare_audiofolder.py prepare-audiocaps-audiofolder \
--audiocaps-path data/audiocaps \
--audioset-path data/audioset_full \
--audio-format mp3
```
This will copy the files from AudioSet and prepare the AudioFolder structure
and annotations. Records for audios that are listed in the AudioCaps CSVs
but whose files are missing (unavailable when you scraped AudioSet) are dropped.
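Dropping records without a matching audio file can be sketched like this. The record field `youtube_id`, the `<id>.<format>` file naming, and the helper `drop_missing` are assumptions made for illustration; the repository's own logic lives in `audiocap/prepare_audiofolder.py` and may work differently.

```python
from pathlib import Path


def drop_missing(records, audio_dir, audio_format="mp3"):
    """Split records into (kept, dropped) by whether their audio file exists."""
    audio_dir = Path(audio_dir)
    kept, dropped = [], []
    for record in records:
        # Hypothetical naming scheme: one file per YouTube id.
        path = audio_dir / f"{record['youtube_id']}.{audio_format}"
        (kept if path.is_file() else dropped).append(record)
    return kept, dropped
```

Returning the dropped records as well makes it easy to log how many annotations were lost to unavailable videos.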
#### Creating a balanced AudioSet subset
This part is the most intricate. We want, at the same time:
- a diverse subset
- a balanced subset
- a large subset
- no leakage with AudioCaps
This is difficult and has no optimal solution. Balancing a dataset is especially difficult when each example has multiple labels.
This repository contains some utilities that help with the selection. If you want to select your own subset, you can look into `notebooks/select_audioset_subset.ipynb`.
However, the subset we selected is also available in this repository in `data/audioset_small`.
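To illustrate why multi-label balancing is a heuristic problem, here is a simple greedy sketch. It is not the procedure used in the notebook: it repeatedly picks the example whose most frequent label is least represented in the selection so far, and it skips excluded ids (for example, clips that also appear in AudioCaps). The function name and data layout are assumptions, and the quadratic loop is only suitable for small inputs.

```python
from collections import Counter


def select_subset(examples, excluded_ids, target_size):
    """Greedily pick examples whose labels are rare in the selection so far.

    `examples` maps an example id to its set of labels.
    `excluded_ids` are ids that must never be selected.
    """
    candidates = {i: labels for i, labels in examples.items() if i not in excluded_ids}
    label_counts = Counter()
    selected = []
    while candidates and len(selected) < target_size:
        # Lower score is better; ties are broken by id to stay deterministic.
        best = min(
            candidates,
            key=lambda i: (max((label_counts[l] for l in candidates[i]), default=0), i),
        )
        selected.append(best)
        label_counts.update(candidates.pop(best))
    return selected


examples = {
    "a": {"Speech"},
    "b": {"Speech"},
    "c": {"Music"},
    "d": {"Dog"},
    "e": {"Speech", "Music"},
}
print(select_subset(examples, excluded_ids={"d"}, target_size=2))  # prints ['a', 'c']
```

A greedy choice like this trades optimality for simplicity, which matches the observation that the problem has no clean optimal solution.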
#### Creating AudioSet-small AudioFolder
Run:
```shell
python audiocap/prepare_audiofolder.py prepare-audioset-small-audiofolder \
--audioset-small-path data/audioset_small \
--audioset-full-path data/audioset_full \
--audio-format mp3
```
Congrats. Now you have all three datasets prepared for training.
### Checking corrupted audio files
During training, corrupted audio files (not loadable by librosa) are skipped.
However, if you want to check for corrupted files yourself, you can use `audiocap.data.find_corrupted_audios`.
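The general idea of such a check can be sketched as follows. This is not the signature of `audiocap.data.find_corrupted_audios`, which is not documented here; `find_unloadable` is a hypothetical helper that takes the loading function as a parameter.

```python
def find_unloadable(paths, load_fn):
    """Return (path, error) pairs for every file that `load_fn` fails to load."""
    failures = []
    for path in paths:
        try:
            load_fn(path)
        except Exception as error:  # broad on purpose: any loading failure counts
            failures.append((path, repr(error)))
    return failures
```

Passing `librosa.load` as `load_fn` would mirror the loadability criterion mentioned above, assuming librosa is installed.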
## Training
We train in two phases. We pretrain on a mixture of AudioCaps and AudioSet small, and
then finetune on Clotho.
We monitor metrics (into wandb) on each dataset separately and also log some predictions
so that one can see the outputs the model generates.
Because we can pretrain using the same audio-to-text objective as we use for finetuning,
a single configurable training script is enough for both phases.
### Pretraining
AudioSet is originally a classification dataset. During training, we convert the labels on the fly
into keyword-based synthetic captions.
```shell
CUDA_VISIBLE_DEVICES="..." python \
audiocap/train_whisper_supervised.py \
--checkpoint-dir-root="./checkpoints" \
--audioset-dir="./data/audioset_small/audiofolder" \
--audiocaps-dir="./data/audiocaps/audiofolder" \
--training-config="./configs/pretrain_1on1_large_config.yaml" \
--wandb-group="pretraining"
```
The `--training-config` argument is the most important one, as it specifies all the essential training settings.
We experimented with different setups. You can find the different configs inside the `configs/` folder.
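The conversion of AudioSet labels into keyword-based synthetic captions can be sketched as follows. The output format is inferred from the example keyword caption given in this README (lowercase, commas inside a label replaced by ` - `, labels sorted and joined by commas, ontology ancestors included). The function `keyword_caption`, the `parents` mapping, and the small ontology fragment are assumptions for illustration, so the repository's actual implementation may differ.

```python
def keyword_caption(labels, parents):
    """Build a keyword caption from label names and their ontology ancestors.

    `parents` maps a label name to the names of its direct parents.
    """
    names = set()
    stack = list(labels)
    while stack:
        name = stack.pop()
        if name not in names:
            names.add(name)
            stack.extend(parents.get(name, []))
    # Commas inside a label name become " - ", so ", " only separates labels.
    keywords = sorted(name.lower().replace(", ", " - ") for name in names)
    return ", ".join(keywords)


# Illustrative fragment of an ontology (assumed, not read from the repository).
parents = {
    "Motorboat, speedboat": ["Boat, Water vehicle"],
    "Boat, Water vehicle": ["Vehicle"],
    "Vehicle": ["Sounds of things"],
}
print(keyword_caption(["Motorboat, speedboat"], parents))
# prints "boat - water vehicle, motorboat - speedboat, sounds of things, vehicle"
```

Because the captions are derived from labels alone, they can be generated during data loading without storing any extra annotation files.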
### Finetuning
To run finetuning, use the following command:
```shell
CUDA_VISIBLE_DEVICES="..." python \
audiocap/train_whisper_supervised.py \
--checkpoint-dir-root="./checkpoints" \
--clotho-dir="./data/clotho_v2.1/audiofolder" \
--training-config="./configs/finetune_large_config.yaml" \
--load-checkpoint="..." \
--wandb-group="finetuning"
```
`--load-checkpoint` is an optional argument that allows initializing the model with weights from a local file.
## Multitask training and inference
To train effectively on multiple datasets, we put dataset and task identifiers into the captions.
Examples:
- **clotho > caption:** Fair kind music is being played at the circus grounds.
- **audiocaps > caption:** The wind is blowing, insects are singing, and rustling occurs
- **audioset > keywords:** boat - water vehicle, motorboat - speedboat, sounds of things, vehicle
The prefix informs the model about the style of caption that is used. During inference, a prefix is
forced to the decoder, which makes the model generate output in a desired style. This is a trick
inspired by multilingual generative language models where the prefix specifies the output language.
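On the text side, the prefix scheme can be sketched as below. The `dataset > task: caption` format follows the examples above; the helper names `add_prefix` and `split_prefix` are hypothetical. The sketch covers only string handling and does not show how the prefix is forced in the decoder.

```python
def add_prefix(dataset, task, caption):
    """Prepend the dataset and task identifiers to a target caption."""
    return f"{dataset} > {task}: {caption}"


def split_prefix(text):
    """Split text into (prefix, content); prefix is None when none is found."""
    prefix, sep, content = text.partition(": ")
    if sep and " > " in prefix:
        return prefix, content
    return None, text


target = add_prefix("clotho", "caption", "A dog barks in the distance.")
print(target)
print(split_prefix(target))
```

Keeping the prefix as plain text means no new special tokens have to be added to the tokenizer vocabulary.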
## Inference
If you have a trained model, you can run the inference script:
```shell
CUDA_VISIBLE_DEVICES="..." python \
audiocap/predict.py \
--checkpoint path/to_checkpoint \
--data path/to/folder/with/audio/files \
--output-file foo.csv \
--config-file configs/predict_config.yaml \
--take-first-n 10 # optional, for debugging purposes
```
The inference script will generate outputs using the model, print raw outputs into stdout and
clean outputs into a csv file with two columns, `file_name` and `caption_predicted`.
Raw outputs include invisible tokens such as `<|startoftranscript|>`, forced prefix, padding, etc. Clean outputs only contain the content.
Config file specifies both inference hyperparameters (like number of beams) and technical necessities,
such as batch size, fp precision or number of loader processes.
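The difference between raw and clean outputs can be illustrated with a small sketch. The raw string below is made up for the example, and `clean_output` and `write_predictions` are hypothetical helpers rather than functions from `audiocap/predict.py`. Only the two column names are taken from the description above.

```python
import csv
import io
import re

# Matches tokens such as <|startoftranscript|> or <|endoftext|>.
SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")


def clean_output(raw, prefix):
    """Remove special tokens and the forced prefix from a raw model output."""
    text = SPECIAL_TOKEN.sub("", raw).strip()
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text.strip()


def write_predictions(rows, stream):
    """Write (file_name, caption_predicted) rows as CSV to `stream`."""
    writer = csv.writer(stream)
    writer.writerow(["file_name", "caption_predicted"])
    writer.writerows(rows)


raw = "<|startoftranscript|><|notimestamps|>clotho > caption: A dog barks.<|endoftext|>"
buffer = io.StringIO()
write_predictions([("dog.wav", clean_output(raw, "clotho > caption:"))], buffer)
print(buffer.getvalue())
```

Writing through the `csv` module ensures captions that contain commas or quotes are escaped correctly.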
## Licence
For all code in this repository, the licence in the LICENSE file applies.
For the files in the `data` directory, specific licences apply:
- AudioSet labels: CC BY 4.0
- source of data:
- AudioSet ontology: CC BY-SA 4.0
- source of data:
- AudioCaps labels: MIT
- source of data:
Owner
- Name: Cahya Wirawan
- Login: cahya-wirawan
- Kind: user
- Location: Vienna, Austria
- Website: https://www.linkedin.com/in/cahyawirawan/
- Twitter: CahyaWr
- Repositories: 171
- Profile: https://github.com/cahya-wirawan
System engineer, currently working on NLP, CV and Speech Recognition for fun and curiosity