master-thesis
One Bit at a Time: Impact of Quantisation on Neural Machine Translation
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.3%) to scientific vocabulary
Keywords
Repository
One Bit at a Time: Impact of Quantisation on Neural Machine Translation
Basic Info
Statistics
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
One Bit at a Time: Impact of Quantisation on Neural Machine Translation
Despite the accuracy of large language models, deploying them still faces practical issues. Besides being memory-demanding, the main issue lies in prediction speed: for generative language models, the time of auto-regressive generation scales with the output length. Another significant limitation of translation models is their domain specificity, determined by the domain of the training data.
Our work investigates the impact of model quantization on these issues. In theory, quantization has the potential to address them through lower bit-width computations, allowing for model compression, speed-up, and regularization incorporated into training. Specifically, we inspect the effect that quantization has on a Transformer neural machine translation model.
In addition to the obtained measurements, the contributions of this work include an implementation of a quantized Transformer and a reusable framework for evaluating the speed, memory requirements, and distributional robustness of generative language models.
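As a concrete illustration of the "lower bit-width computations" mentioned above, here is a minimal sketch of post-training dynamic quantization applied to a MarianMT model with stock PyTorch. The checkpoint name and the choice to quantize only `nn.Linear` layers are illustrative assumptions, not the exact setup used in this thesis.

```python
# Minimal sketch: post-training dynamic quantization of a MarianMT model.
# Checkpoint name and quantized module set are illustrative assumptions.
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"  # assumption: any MarianMT checkpoint
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).eval()

# Dynamic quantization (DQ): weights stored as INT8, activations quantized
# on the fly; only nn.Linear modules are converted here.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

batch = tokenizer(["Quantization trades precision for speed."], return_tensors="pt")
with torch.no_grad():
    output_ids = quantized.generate(**batch)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```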
Work outcome:
- evaluation framework
  - should allow for evaluation of: NMT quality by at least one metric, inference speed, and generating a summary on chosen domains (a minimal evaluation sketch follows this list)
- implementation
  - at least one method of NMT speed-up: quantization, distillation, or adaptive inference
- comparison
  - of quality/speed of the implemented methods on various domains
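A minimal sketch of the kind of evaluation the framework is meant to cover: translation quality via sacrebleu plus a rough samples/second measurement. The `translate_batch` callable is hypothetical; the actual enmt API differs.

```python
# Sketch: evaluate NMT quality (BLEU) and inference speed for any
# translate_batch(list_of_sources) -> list_of_hypotheses callable (hypothetical helper).
import time
import sacrebleu

def evaluate(translate_batch, sources, references, batch_size=8):
    hypotheses, start = [], time.perf_counter()
    for i in range(0, len(sources), batch_size):
        hypotheses.extend(translate_batch(sources[i:i + batch_size]))
    elapsed = time.perf_counter() - start
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    return {"bleu": bleu, "samples_per_second": len(sources) / elapsed}
```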
Repository Structure
- `enmt/modified_transformers_files` contains modified files of the 🤗 Transformers library

| filename | QAT | SQ | DQ | SiLU | ReLU | Embeddings Quant. |
| :------- | :-- | :-- | :-- | :--- | :--- | :---------------- |
| `modeling_marian_ORIG.py` | | | :white_check_mark: | :heavy_check_mark: | :heavy_check_mark: | |
| `modeling_marian_quant_v2.py` | :white_check_mark: | :white_check_mark: | :white_check_mark: | :heavy_check_mark: | | |
| `modeling_marian_quant_v2.1.py` | :white_check_mark: | :white_check_mark: | :white_check_mark: | :heavy_check_mark: | | :heavy_check_mark: |
| `modeling_marian_quant_v3.py` | :white_check_mark: | :white_check_mark: | :white_check_mark: | | :heavy_check_mark: | :heavy_check_mark: |
These files need to be manually replaced in an editable 🤗 Transformers installation (as specified in requirements.txt).
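One possible way to do the replacement, assuming the 🤗 Transformers clone lives next to this repository and was installed with `pip install -e`; both paths below are assumptions about the local directory layout.

```python
# Sketch: overwrite the stock MarianMT modeling file in an editable
# 🤗 Transformers checkout with one of the modified variants.
# Both paths are assumptions about where the clone and this repo live.
import shutil

MODIFIED = "enmt/modified_transformers_files/modeling_marian_quant_v2.py"
TARGET = "../transformers/src/transformers/models/marian/modeling_marian.py"

shutil.copyfile(MODIFIED, TARGET)
```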
Installation notes
PyTorch needs to be installed before installing the rest of requirements.txt.
- tested PyTorch versions: 1.11.0.dev20210929+cu102 and 1.12.0.dev20220208+cu102
There might be some non-Linux-compatible libraries in requirements.txt (e.g. pywin*); just skip them if installation fails on them.
```bash
python3 -m venv venv
source venv/bin/activate

# PyTorch Preview (Nightly) with CUDA 11.3 - some quantization operations are only in nightly
pip install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cu113
pip install -r requirements.txt
```
Reproducing results:
This repo contains all the scripts we used in our experiments. The experiments and their scripts:
Choice of Hyper-Parameters:
- runnerFineTuneFPexposureBiasEuParl*
- runnerFineTuneQATexposureBiasEuParl*
*(Figures: QAT Hyper-Params. | FP Fine-Tuning Hyper-Params.)*
Impact of Pretraining on Domain Robustness and Accuracy
- runnerTrainFPnormalEuParl*
- runnerFineTuneQATfind*
- these scripts need to have proper paths set up (so that QAT can access FP checkpoints)
*(Figures: FP Training | QAT Continuation)*
Regularization Effects of Quantization-Aware Training
- runnerFineTuneFPinf*
- runnerFineTuneQATinf*
*(Figures: In-Domain | Out-of-Domain)*
Comparison of Quantized and Full Precision Models
- runnerCompareSpeed*
- the QAT comparison script needs to point to a QAT-pretrained model
Hardware: AMD Ryzen 5 3600 (CPU), NVIDIA GeForce GTX 1660 Super 6 GB (GPU)

| model | Batch size | In-domain test BLEU | Out-of-domain test BLEU |
|---------|-----------:|--------------------:|------------------------:|
| FP CUDA | 8 | 37.15 | 11.70 |
| FP CPU  | 8 | 37.15 | 11.70 |
| DQ CPU  | 8 | 37.06 | 11.63 |
| QAT CPU | 8 | 37.34 | 11.52 |

| model | Batch size | Samples/second | Speed-up vs. FP CPU | Speed-up vs. CUDA |
|---------|-----------:|---------------:|--------------------:|------------------:|
| FP CUDA | 8 | 16.25 | | |
| FP CPU  | 8 | 2.94  | | 0.18 |
| DQ CPU  | 8 | 5.41  | 1.84 | 0.33 |
| QAT CPU | 8 | 4.90  | 1.67 | 0.30 |

| model | size [MB] | compression |
|-------|----------:|------------:|
| QAT | 173.55 | 0.57 |
| SQ  | 173.55 | 0.57 |
| DQ  | 200.63 | 0.66 |
| FP  | 301.94 | |
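The size/compression numbers above can be reproduced roughly by serializing each model's `state_dict` and comparing file sizes; the sketch below does this for a full-precision model and its dynamically quantized copy. The checkpoint name is an illustrative assumption, and the exact figures may differ from the table.

```python
# Sketch: compare on-disk size of a full-precision model and its
# dynamically quantized copy. Checkpoint name is an assumption.
import os
import tempfile

import torch
from transformers import MarianMTModel

def state_dict_size_mb(model):
    """Size of the serialized state_dict in megabytes."""
    with tempfile.NamedTemporaryFile(suffix=".pt") as f:
        torch.save(model.state_dict(), f.name)
        return os.path.getsize(f.name) / 1e6

fp_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de").eval()
dq_model = torch.quantization.quantize_dynamic(fp_model, {torch.nn.Linear}, dtype=torch.qint8)

fp_mb, dq_mb = state_dict_size_mb(fp_model), state_dict_size_mb(dq_model)
print(f"FP: {fp_mb:.2f} MB  DQ: {dq_mb:.2f} MB  compression: {dq_mb / fp_mb:.2f}")
```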
Running scripts
In our experiments we rely on comet_ml reporting.
You can run any experiment like so:
```bash
COMET_API_KEY=your_comet_api_key CUDA_VISIBLE_DEVICES=1 python runnerFineTuneFPexposureBiasEuParl-1.py
```
Some experiments rely on previously trained models, so the scripts need to be modified to point to them. We advise you to check the paths; in the scripts they are set in two ways (see the sketch after this list):
- set the variable `saved_model` or similar (in some scripts the variable is `checkpoints_dir`)
- set `'output_dir': "pipeline/outputs/path"`
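A hypothetical excerpt of what these settings can look like inside a runner script; the variable names follow the list above, while the concrete paths are placeholders rather than the ones used in the experiments.

```python
# Hypothetical path settings inside a runner script (placeholder values).
saved_model = "pipeline/saved_models/fp_euparl_best"   # some scripts use checkpoints_dir instead
training_args = {
    "output_dir": "pipeline/outputs/path",  # where checkpoints and logs are written
}
```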
Prototypes:
We used these notebooks to understand model quantization and to prototype the enmt framework.
Evaluation of pretrained model
The INT8-quantized model (MarianMT) has nearly the same BLEU score and is 1.7 times faster than in FP.
GLUE Bert Quantization
from: https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html
Loading and quantization of pretrained models
Also contains dataset preprocessing
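A rough sketch of what that preprocessing can look like with 🤗 Datasets and a Marian tokenizer; the dataset, language pair, and maximum sequence length are illustrative assumptions, not necessarily what the notebooks use.

```python
# Sketch: tokenize a parallel corpus for MarianMT fine-tuning/evaluation.
# Dataset, language pair, and max_length are illustrative assumptions.
from datasets import load_dataset
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
raw = load_dataset("wmt16", "de-en", split="validation[:1%]")

def preprocess(batch):
    sources = [pair["en"] for pair in batch["translation"]]
    targets = [pair["de"] for pair in batch["translation"]]
    model_inputs = tokenizer(sources, max_length=128, truncation=True)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)
```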
Owner
- Name: Marek Petrovič
- Login: marekninja
- Kind: user
- Location: Bratislava, Slovakia
- Company: Swiss Re
- Repositories: 14
- Profile: https://github.com/marekninja
AI enthusiast
Citation (citation.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Petrovič"
given-names: "Marek"
orcid: "https://orcid.org/0000-0002-2801-1014"
title: "ENMT- Evaluation Framework for evaluation of quantization impacts on NMT Transformers"
version: 1.0.0
date-released: 2022-05-01
url: "https://github.com/marekninja/master-thesis"
Dependencies
- Jinja2 ==3.0.2
- MarkupSafe ==2.0.1
- Pillow ==8.4.0
- PyYAML ==6.0
- Pygments ==2.10.0
- Send2Trash ==1.8.0
- aiohttp ==3.7.4.post0
- argon2-cffi ==21.1.0
- async-timeout ==3.0.1
- attrs ==21.2.0
- autopep8 ==1.6.0
- backcall ==0.2.0
- bleach ==4.1.0
- certifi ==2021.10.8
- cffi ==1.15.0
- chardet ==4.0.0
- charset-normalizer ==2.0.7
- click ==8.0.3
- colorama ==0.4.4
- comet-ml ==3.19.0
- configobj ==5.0.6
- datasets ==1.18.1
- debugpy ==1.5.0
- decorator ==5.1.0
- defusedxml ==0.7.1
- dill ==0.3.4
- dulwich ==0.20.25
- entrypoints ==0.3
- everett ==2.0.1
- filelock ==3.3.1
- fsspec ==2021.10.1
- huggingface-hub ==0.4.0
- idna ==3.3
- ipykernel ==6.4.1
- ipython ==7.28.0
- ipython-genutils ==0.2.0
- ipywidgets ==7.6.5
- jedi ==0.18.0
- joblib ==1.1.0
- jsonschema ==4.1.0
- jupyter-client ==7.0.6
- jupyter-core ==4.8.1
- jupyterlab-pygments ==0.1.2
- jupyterlab-widgets ==1.0.2
- matplotlib-inline ==0.1.3
- mistune ==0.8.4
- multidict ==5.2.0
- multiprocess ==0.70.12.2
- nbclient ==0.5.4
- nbconvert ==6.2.0
- nbformat ==5.1.3
- nest-asyncio ==1.5.1
- notebook ==6.4.4
- numpy ==1.21.2
- nvidia-ml-py3 ==7.352.0
- packaging ==21.0
- pandas ==1.3.3
- pandocfilters ==1.5.0
- parso ==0.8.2
- pickleshare ==0.7.5
- portalocker ==2.3.2
- prometheus-client ==0.11.0
- prompt-toolkit ==3.0.20
- pyarrow ==5.0.0
- pycodestyle ==2.8.0
- pycparser ==2.20
- pyparsing ==2.4.7
- pyrsistent ==0.18.0
- python-dateutil ==2.8.2
- pytz ==2021.3
- pyzmq ==22.3.0
- regex ==2021.10.8
- requests ==2.26.0
- requests-toolbelt ==0.9.1
- sacrebleu ==2.0.0
- sacremoses ==0.0.46
- semantic-version ==2.8.5
- sentencepiece ==0.1.96
- six ==1.16.0
- tabulate ==0.8.9
- terminado ==0.12.1
- testpath ==0.5.0
- tokenizers ==0.10.3
- toml ==0.10.2
- tornado ==6.1
- tqdm ==4.62.3
- traitlets ==5.1.0
- typing-extensions ==3.10.0.2
- urllib3 ==1.26.7
- wcwidth ==0.2.5
- webencodings ==0.5.1
- websocket-client ==1.2.1
- widgetsnbextension ==3.5.1
- wincertstore ==0.2
- wrapt ==1.12.1
- wurlitzer ==3.0.2
- xxhash ==2.0.2
- yarl ==1.7.0