https://github.com/carlosholivan/audiolm-google-torch

Implementation of the AudioLM model by Google in PyTorch

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.3%) to scientific vocabulary

Keywords

audio audio-generation audio-synthesis deep-learning music music-generation sound soundprocessing synthesis vq-vae
Last synced: 5 months ago

Repository

Implementation of the AudioLM model by Google in PyTorch

Basic Info
Statistics
  • Stars: 3
  • Watchers: 2
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Topics
audio audio-generation audio-synthesis deep-learning music music-generation sound soundprocessing synthesis vq-vae
Created about 3 years ago · Last pushed about 3 years ago

https://github.com/carlosholivan/audiolm-google-torch/blob/master/

# AudioLM: a Language Modeling Approach to Audio Generation

AudioLM is a model that generates audio in the waveform domain.

It uses two tokenizers: `Soundstream` to compute the acoustic tokens and `w2v-BERT` to compute the semantic tokens.

### SoundStream: Acoustic Tokens


SoundStream [2] is a state-of-the-art neural audio codec. The model has 3 parts:
- Encoder
- Residual Vector Quantizer (RVQ)
- Decoder

The convolutional encoder takes a single-channel waveform $x \in R^T$, and the convolutional decoder reconstructs $\hat{x} \in R^T$ from the quantized embeddings. The embeddings are discretized using a residual vector quantizer (RVQ) with $Q$ vector quantizers, each with a vocabulary of $N$ symbols.

- Input: waveform at 16 kHz.
- Encoder Embeddings: 50 Hz (×320 reduction).
- Codebook symbols: $Y \in \{1, \dots, N\}^{T_A \times Q}$, where $T_A = T/320$.
- Encoded audio: $enc(x) \in R^{S \times D}$.
- One-hot encoded vectors shape: $S \times D$
- Decoder Embeddings.
- Reconstructed waveform.
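
To make the shapes concrete, here is a minimal sketch of the tokenizer interface; the values of $D$, $Q$ and the codebook size are placeholder assumptions, not this repository's actual configuration, and random tensors stand in for the real encoder:

```python
import torch

T = 16000                   # 1 s of single-channel audio at 16 kHz
M = 2 * 4 * 5 * 8           # total encoder stride = 320
S = T // M                  # 50 encoded frames per second
D = 512                     # assumed embedding dimensionality
Q = 8                       # assumed number of RVQ quantizers
N = 1024                    # assumed codebook size

x = torch.randn(1, T)                # waveform x in R^T
emb = torch.randn(1, S, D)           # stand-in for enc(x) in R^{S x D}
Y = torch.randint(0, N, (1, S, Q))   # codebook symbols Y in {1..N}^{T_A x Q}
print(x.shape, emb.shape, Y.shape)   # (1, 16000) (1, 50, 512) (1, 50, 8)
```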

Parameters:
- Encoder conv. blocks: $B_{enc} = 4$
- Decoder conv. blocks: $B_{dec} = 4$
- Channels: $C_{enc} = C_{dec}$
- Input samples per embedding: $M = 2 \cdot 4 \cdot 5 \cdot 8 = 320$ (the product of the 4 encoder strides: 2, 4, 5, 8).
- Embeddings dimensionality: $D$
- Number of samples in the time domain: $T$
- Number of samples of the encoded audio: $S = T / M$
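
The RVQ quantizes each encoder frame in stages: every codebook quantizes the residual left over by the previous one. Below is a minimal sketch of that step with assumed sizes and randomly initialized codebooks; a real implementation would learn the codebooks (e.g. with EMA updates) and use a straight-through gradient estimator:

```python
import torch

class ResidualVQ(torch.nn.Module):
    """Sketch of an RVQ: Q quantizers, each with N vocabulary symbols."""
    def __init__(self, Q=8, N=1024, D=512):
        super().__init__()
        self.codebooks = torch.nn.Parameter(torch.randn(Q, N, D))

    def forward(self, emb):                    # emb: (B, S, D)
        residual = emb
        quantized = torch.zeros_like(emb)
        indices = []
        for cb in self.codebooks:              # cb: (N, D)
            batched = cb.unsqueeze(0).expand(emb.size(0), -1, -1)  # (B, N, D)
            idx = torch.cdist(residual, batched).argmin(-1)        # (B, S)
            chosen = cb[idx]                   # nearest code per frame: (B, S, D)
            quantized = quantized + chosen
            residual = residual - chosen       # next stage quantizes the residual
            indices.append(idx)
        return quantized, torch.stack(indices, -1)  # (B, S, D) and (B, S, Q)

rvq = ResidualVQ()
quantized, codes = rvq(torch.randn(1, 50, 512))  # codes: (1, 50, 8)
```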



#### Discriminator


The adversarial loss is computed with an STFT-based discriminator. The input to the STFTDiscriminator is the complex-valued STFT of the input waveform (real and imaginary parts), and the output is the logits.
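
A sketch of how such an input can be formed; the FFT and hop sizes here are assumptions, not the values used in the paper or this repository:

```python
import torch

def stft_discriminator_input(wave, n_fft=1024, hop_length=256):
    """Stack the real and imaginary STFT parts as two channels."""
    spec = torch.stft(wave, n_fft, hop_length=hop_length,
                      window=torch.hann_window(n_fft),
                      return_complex=True)             # (B, F, T'), complex
    return torch.stack([spec.real, spec.imag], dim=1)  # (B, 2, F, T')

disc_input = stft_discriminator_input(torch.randn(2, 16000))
print(disc_input.shape)  # (2, 2, 513, 63)
```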


### w2v-BERT: Semantic Tokens


w2v-BERT [3] is a Transformer-based model for learning self-supervised audio representations. It maps an input waveform to a set of linguistic features.
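
In the AudioLM paper, the semantic tokens are obtained by clustering intermediate w2v-BERT embeddings with k-means and taking each frame's centroid index. A minimal sketch of that discretization step, assuming the centroids were fit offline (the sizes below are placeholders):

```python
import torch

def semantic_tokens(features, centroids):
    """features: (B, T_S, D) w2v-BERT embeddings; centroids: (K, D)
    pre-computed k-means centroids. Returns token ids of shape (B, T_S)."""
    batched = centroids.unsqueeze(0).expand(features.size(0), -1, -1)
    return torch.cdist(features, batched).argmin(-1)  # nearest centroid per frame

tokens = semantic_tokens(torch.randn(1, 100, 768), torch.randn(512, 768))
print(tokens.shape)  # (1, 100)
```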


## Training


Random sampling: at each training step, $n_q$ is drawn uniformly from $[1:N_q]$ and only the first quantizers $Q_i, \; i = 1 \dots n_q$ are used, so a single model learns to operate at several bitrates.
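
A one-line sketch of that sampling step (structured quantizer dropout), with $N_q = 8$ as an assumed total:

```python
import torch

N_q = 8  # assumed total number of RVQ quantizers
# Per training step: sample how many quantizers are active and run only
# the first n_q codebooks of the RVQ.
n_q = int(torch.randint(1, N_q + 1, (1,)))
```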


### Inference
Set $n_q$ at inference time to select the desired bitrate.
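
With the sizes assumed above (50 Hz frame rate, $N = 1024$ symbols per codebook), each active quantizer adds $\log_2 N = 10$ bits per frame, so the bitrate scales linearly with $n_q$:

```python
import math

def bitrate_bps(n_q, frame_rate=50, codebook_size=1024):
    """Bits per second as a function of the number of active quantizers."""
    return n_q * frame_rate * math.log2(codebook_size)

print(bitrate_bps(8))  # 8 * 50 * 10 = 4000.0 bps (4 kbps)
```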


## References

[1] [AudioLM: a Language Modeling Approach to Audio Generation](https://arxiv.org/pdf/2209.03143.pdf)

[2] [SoundStream: An End-to-End Neural Audio Codec](https://arxiv.org/pdf/2107.03312.pdf)

[3] [w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training](https://arxiv.org/pdf/2108.06209.pdf)

Owner

  • Name: Carlos Hernández Oliván
  • Login: carlosholivan
  • Kind: user
  • Location: Zaragoza, Spain
  • Company: Universidad de Zaragoza

PhD student researching Machine Learning and Music.
