https://github.com/amazon-science/iwslt-autodub-task

https://github.com/amazon-science/iwslt-autodub-task

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.1%) to scientific vocabulary
Last synced: 9 months ago · JSON representation

Repository

Basic Info
  • Host: GitHub
  • Owner: amazon-science
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 1.14 GB
Statistics
  • Stars: 20
  • Watchers: 11
  • Forks: 10
  • Open Issues: 0
  • Releases: 0
Created over 3 years ago · Last pushed over 2 years ago
Metadata Files
Readme Contributing License Code of conduct

README.md

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Introduction

This repo contains both the data and code to train and run automatic dubbing: translating the speech in a video into a new language such that the new speech is natural when overlayed on the original video.

Our system takes in German videos like this, along with their transcripts (e.g. "Sein Hauptgebiet waren es, romantische und gotische Poesie aus dem Englischen ins Japanische zu bersetzen."):

https://user-images.githubusercontent.com/3534106/217985339-fb31a3a5-7845-4d52-b651-0ab93e426c70.mp4

And produce videos like this, dubbed into English:

https://user-images.githubusercontent.com/3534106/217978682-d74d35b8-3a5f-4e46-82c2-94269e56b3b4.mp4

Setting up the environment

Install Miniconda/Anaconda if needed. For training models, it is assumed that you have at least 1 GPU, with CUDA drivers set up. This has been tested on 1 NVIDIA V100 GPU with CUDA 11.7, on Ubuntu 20.04.

```bash sudo apt install git-lfs awscli ffmpeg build-essential jq git lfs install --skip-repo

Clone this repository (~1.1 GB)

git clone https://github.com/amazon-science/iwslt-autodub-task.git --recursive cd iwslt-autodub-task

Create a conda environment

conda env create --file environment.yml conda activate iwslt-autodub

Download Prism model for evaluation

cd third_party/prism conda create -n prism python=3.7 -y conda activate prism pip install -r requirements.txt # g++ is needed conda deactivate

Downloads Prism model (~1.4 GB)

wget http://data.statmt.org/prism/m39v1.tar tar xf m39v1.tar rm m39v1.tar cd ../.. ```

Download and extract data

Download the CoVoST2 en-de dataset following these steps, or directly follow instructions at https://github.com/facebookresearch/covost#covost-2.

  • First, download Common Voice audio clips and transcripts (English, version 4). Note that after filling out the form you can copy the url to download from the download button and download it with wget.
  • Next, extract validated.tsv from it: bash mkdir covost_tsv tar -xvf en.tar.gz validated.tsv mv validated.tsv covost_tsv/
  • Then extract the required TSV files: bash # Download and split CoVoST2 TSV files pushd covost_tsv wget https://dl.fbaipublicfiles.com/covost/covost_v2.en_de.tsv.tar.gz https://raw.githubusercontent.com/facebookresearch/covost/main/get_covost_splits.py tar -xzf covost_v2.en_de.tsv.tar.gz python3 get_covost_splits.py -v 2 --src-lang en --tgt-lang de --root . --cv-tsv validated.tsv popd You should now have covost_v2.en_de.dev.tsv, covost_v2.en_de.test.tsv, and covost_v2.en_de.train.tsv in the covost_tsv directory.
  • Then extract MFA files: bash mkdir covost_mfa tar -xf data/training/covost2_mfa.tz -C covost_mfa mv covost_mfa/covost2_mfa covost_mfa/data Now, all the json files should be in covost_mfa/data.

Compute the distribution of speech durations

bash python3 get_durations_frequencies.py ./covost_mfa/data

This computes how often each speech duration is observed in our training data, so that we do the binning correctly.

Build dataset

Create the processed datasets for training/evaluation ("text and noised binned segments -> phones and durations" is the provided baseline): ```bash

text -> text. Used to generate references for automatic evaluation of translation quality.

python3 build_datasets.py --en en-text-without-durations --de de-text-without-durations

text and noised binned segments -> phones and durations. For the baseline model.

python3 builddatasets.py --en en-phones-durations --de de-text-noisy-durations --noise-std 0.1 --upsampling 10 --write-segments-to-file `` For full usage options, runbuilddatasets.py -h. For use with factored models, make sure you use the--write-segments-to-file` option, since that will generate some files required for generating the factored data.

Prepare target factor files

For the factored baselines, you need to prepare the datasets in the factored formats and generate the auxiliary factors. bash python3 separate_factors.py -i processed_datasets/de-text-noisy-durations0.1-en-phones-durations -o multi_factored This will generate target factor input files in processed_datasets/de-text-noisy-durations0.1-en-phones-durations/multi_factored/. * *.en.text: Main output containing the original text, with <shift> tokens to account for internal factor shifts so that the factors are conditioned on the main output. * *.en.duration: Main target factor to predict durations. Contains the durations (number of frames) corresponding to each phoneme in *.en.text. * *.en.total_duration_remaining: Auxiliary factor to count down the number of frames remaining in each line. This is calculated from the (noised) segment durations, and counts down by the number of frames generated at each time step. Note that this may not count down to 0 due to the noise added to the segment durations. * *.en.segment_duration_remaining: Auxiliary factor to count down the number of frames remaining in each segment, i.e. until a [pause] token is encountered. Similar to the previous factor, but initialized by the corresponding target segment duration for each segment within a line. * *.en.pauses_remaining: Auxiliary factor that counts down the number of [pause] tokens remaining in a line.

This is what they should look like: ```bash $ head -2 processeddatasets/de-text-noisy-durations0.1-en-phones-durations/multifactored/test.en.{text,duration,totaldurationremaining,segmentdurationremaining,pausesremaining} ==> processeddatasets/de-text-noisy-durations0.1-en-phones-durations/multi_factored/test.en.text <== F AO1 R CH AH0 N AH0 T L IY0 DH AH1 R EY1 T S AH1 V D EH1 TH S IH1 N DH IY0 Y UW0 N AY1 T IH0 D K IH1 NG D AH0 M HH AE1 V R IH0 D UW1 S T sp AY1 D IH1 D HH AE1 V sp T AH0 K AH1 T sp AH0 sp F Y UW1 sp K AO1 R N ER0 Z

==> processeddatasets/de-text-noisy-durations0.1-en-phones-durations/multifactored/test.en.duration <== 0 12 3 8 12 4 5 5 9 8 14 0 5 7 0 7 17 5 5 0 10 10 0 3 12 13 3 0 13 13 0 2 3 0 3 5 5 13 8 6 10 0 5 8 12 6 8 7 0 5 8 7 0 3 12 8 16 12 14 0 24 0 11 0 17 9 15 0 1 4 5 0 9 3 5 0 20 6 7 0 8 7 0 7 5 4 16 0 3 11 13 4 6 26 24 0

==> processeddatasets/de-text-noisy-durations0.1-en-phones-durations/multifactored/test.en.totaldurationremaining <== 413 401 398 390 378 374 369 364 355 347 333 333 328 321 321 314 297 292 287 287 277 267 267 264 252 239 236 236 223 210 210 208 205 205 202 197 192 179 171 165 155 155 150 142 130 124 116 109 109 104 96 89 89 86 74 66 50 38 24 24 0 246 235 235 218 209 194 194 193 189 184 184 175 172 167 167 147 141 134 134 126 119 119 112 107 103 87 87 84 73 60 56 50 24 0 0

==> processeddatasets/de-text-noisy-durations0.1-en-phones-durations/multifactored/test.en.segmentdurationremaining <== 413 401 398 390 378 374 369 364 355 347 333 333 328 321 321 314 297 292 287 287 277 267 267 264 252 239 236 236 223 210 210 208 205 205 202 197 192 179 171 165 155 155 150 142 130 124 116 109 109 104 96 89 89 86 74 66 50 38 24 24 0 246 235 235 218 209 194 194 193 189 184 184 175 172 167 167 147 141 134 134 126 119 119 112 107 103 87 87 84 73 60 56 50 24 0 0

==> processeddatasets/de-text-noisy-durations0.1-en-phones-durations/multifactored/test.en.pauses_remaining <== 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ```

Decode test set and evaluate using provided Sockeye baseline model

NOTE: Test set here does not refer to the specific subsets in data/test/; rather it refers to the full test sets generated from CoVoST2.

We provide a baseline model checkpoint in models/sockeye/trainedbaselines/baselinefactored_noised0.1. This uses a target factor to predict durations and additional target factors to help the model keep track of time. The training segment durations have Gaussian noise (std. dev. 0.1) added to teach the model to be flexible about timing in hopes of striking a balance between speech overlap, speech naturalness, and translation quality. (Note that the speech overlap in real human dubs is only about 70%.)

Before you proceed, in sockeye_scripts/config, set ROOT as the path of this repo. For example, ROOT=~/iwslt-autodub-task.

Before decoding, please make sure you have run the data and factor preparation steps, so that you have at least processed_datasets/de-text-noised-durations0.1-en-phones-durations prepared with the multi_factored subdirectory, processed_datasets/de-text-without-durations-en-text-without-durations for the translation reference text files. If you ran the steps in the previous section, you will have these already.

Decoding using baselinefactorednoised0.1

The input format for decoding is a specific JSON format that can be prepared using: ```bash $ python3 sockeyescripts/decoding/create-json-inputs.py -d processeddatasets/de-text-noisy-durations0.1-en-phones-durations --subset test --output-segment-durations -o processed_datasets/de-text-noisy-durations0.1-en-phones-durations/test.de.json

Check JSON file looks like this

$ head -2 ~/iwslt-autodub-task/processeddatasets/de-text-noisy-durations0.1-en-phones-durations/test.de.json | jq { "text": "Glck@@ licherweise sind die Ster@@ ber@@ aten im Vereinigten Knigreich ges@@ unken <||> ", "targetprefix": "", "targetprefixfactors": [ "0", "413", "413", "0" ], "targetsegmentdurations": [ 413 ], "usetargetprefixallchunks": false } { "text": "Ich musste einige Ab@@ str@@ ich@@ e mach@@ en@@ . <||> ", "targetprefix": "", "targetprefixfactors": [ "0", "246", "246", "0" ], "targetsegmentdurations": [ 246 ], "usetargetprefixall_chunks": false } ```

To decode using baselinefactorednoised0.1, run bash mkdir -p models/sockeye/trained_baselines/baseline_factored_noised0.1/eval sockeye-translate \ -i processed_datasets/de-text-noisy-durations0.1-en-phones-durations/test.de.json \ -o models/sockeye/trained_baselines/baseline_factored_noised0.1/eval/test.en.output \ --models models/sockeye/trained_baselines/baseline_factored_noised0.1/model \ --checkpoints 78 \ -b 5 \ --batch-size 32 \ --chunk-size 20000 \ --output-type translation_with_factors \ --max-output-length 768 \ --force-factors-stepwise frames total_remaining segment_remaining pauses_remaining \ --json-input \ --quiet This should take around 5 minutes on 1 V100 GPU, or around an hour without a GPU.

Evaluate baselinefactorednoised0.1 output

bash ./sockeye_scripts/evaluation/evaluate-factored.sh processed_datasets/de-text-noisy-durations0.1-en-phones-durations/test.en models/sockeye/trained_baselines/baseline_factored_noised0.1/eval/test.en.output

This will print: * Translation quality metrics - BLEU - Prism - COMET * Speech overlap metrics

Reproduce Sockeye baseline models

Scripts are included to reproduce the Sockeye baselines included here. Before launching training for the factored models, you need specially created vocab files which can be generated using bash cd sockeye_scripts/training wget https://raw.githubusercontent.com/Proyag/sockeye/factor-pe/sockeye_contrib/create_seq_vocab.py python create_seq_vocab.py --min-val -4000 --max-val 5000 --output seq_vocab_expanded.json

And now, the training can be launched using bash ./train_factored_noised0.1.sh If you're using >1 GPU, adjust the following settings in the script first ```bash

Set the number of GPUs for distributed training

Adjust BATCHSIZE and UPDATEINTERVAL according to your GPU situation.

For example, if you change N_GPU to 2, you should set update-interval to 8 to have the same effective batch size

NGPU=1 BATCHSIZE=4096 UPDATE_INTERVAL=16 `` The trained models will be inmodels/sockeye`.

Generate test set dubs with Sockeye models

There are German videos for two subsets of the test set in data/test/subset{1,2}. We want to generate English dubbed videos for these.

Extract the test set audio/video in data/test bash pushd data/test tar -xzf subset1/subset1.tgz -C subset1 tar -xzf subset2/subset2.tgz -C subset2 popd

Set up FastSpeech2 (only for the first usage) ```bash sudo apt-get install ffmpeg cd third_party/FastSpeech2

Create separate environment for FastSpeech2 dependencies

conda create -n fastspeech2 python=3.8 -y conda activate fastspeech2 pip install -r requirements.txt pip install torchaudio pip install gdown

Download and extract pretrained model

mkdir -p output/ckpt/LJSpeech output/result/LJSpeech preprocesseddata/LJSpeech/duration cd output/ckpt/LJSpeech gdown https://drive.google.com/uc?id=1r3fYhnblBJ8hDKDSUDtidJ-BN-xAM9pe unzip LJSpeech900000.zip cd ../../../hifigan unzip generator_LJSpeech.pth.tar.zip conda deactivate ```

Generate dubbed videos for the test set subsets using FastSpeech2 bash cd ~/iwslt-autodub-task # Or the path to the repo home, if different python synthesize_speech.py --subset 1 python synthesize_speech.py --subset 2 Corresponding to each file *.X.mov or *.X.mp4 in data/test/subset{1,2}, a dubbed video data/test/subset{1,2}/dubbed/*.X.mp4 will be created.

LICENSE

Scripts are provided under Apache 2.0 License. Full text provided in root directory.

New Dubbing data set is available under CC-BY-4.0 license. Full text provided under data/ directory.

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 1
  • Total pull requests: 10
  • Average time to close issues: 4 months
  • Average time to close pull requests: 5 days
  • Total issue authors: 1
  • Total pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.8
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 2
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: 4 months
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Meyjann (1)
Pull Request Authors
  • Proyag (6)
  • thompsonb (2)
  • dependabot[bot] (2)
Top Labels
Issue Labels
Pull Request Labels
dependencies (2)