gerpt2

German small and large versions of GPT2.

https://github.com/bminixhofer/gerpt2

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.8%) to scientific vocabulary

Keywords

common-crawl german gpt2 language-model machine-learning nlp
Last synced: 6 months ago

Repository

German small and large versions of GPT2.

Basic Info
  • Host: GitHub
  • Owner: bminixhofer
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 60.5 KB
Statistics
  • Stars: 20
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Topics
common-crawl german gpt2 language-model machine-learning nlp
Created over 5 years ago · Last pushed almost 4 years ago
Metadata Files
Readme License Citation

README.md

GerPT2

German large and small versions of GPT2:

  • https://huggingface.co/benjamin/gerpt2
  • https://huggingface.co/benjamin/gerpt2-large

See the GPT2 model card for considerations on limitations and bias. See the GPT2 documentation for details on GPT2.

Comparison to dbmdz/german-gpt2

I evaluated both GerPT2-large and the other German GPT2, dbmdz/german-gpt2, on the CC-100 dataset and on the German Wikipedia:

| Model             | CC-100 (PPL) | Wikipedia (PPL) |
|-------------------|--------------|-----------------|
| dbmdz/german-gpt2 | 49.47        | 62.92           |
| GerPT2            | 24.78        | 35.33           |
| GerPT2-large      | 16.08        | 23.26           |

See the script evaluate.py in the GerPT2 GitHub repository for the code.
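For orientation, a perplexity calculation might look like the sketch below. This is not the repository's evaluate.py; it simply averages the token-level loss of a causal LM over a list of strings and exponentiates it, and the example sentence is made up.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "benjamin/gerpt2"  # or "benjamin/gerpt2-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(texts, max_length=512):
    """Token-level perplexity averaged over a list of strings."""
    total_loss, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_length).input_ids
            if ids.size(1) < 2:
                continue  # need at least one next-token prediction
            # passing labels=input_ids makes the model shift them internally
            loss = model(ids, labels=ids).loss
            n = ids.size(1) - 1
            total_loss += loss.item() * n
            total_tokens += n
    return math.exp(total_loss / total_tokens)

print(perplexity(["Der schnelle braune Fuchs springt über den faulen Hund."]))
```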

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("benjamin/gerpt2-large")
model = AutoModelForCausalLM.from_pretrained("benjamin/gerpt2-large")

prompt = ""  # your prompt here

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])
```

Also, two tricks might improve the generated text:

```python
import torch

max_length = 100  # choose a maximum length for the generated text

output = model.generate(
    # during training an EOS token was used to mark the beginning of each text,
    # so it can help to insert it at the start
    torch.tensor(
        [tokenizer.eos_token_id] + tokenizer.encode(prompt)
    ).unsqueeze(0),
    do_sample=True,
    # try setting bad_words_ids=[[0]] to disallow generating an EOS token; without this the model is
    # prone to ending generation early because a significant number of texts from the training corpus
    # are quite short
    bad_words_ids=[[0]],
    max_length=max_length,
)[0]
print(tokenizer.decode(output))
```

Training details

GerPT2-large was trained on the entire German portion of the CC-100 corpus, with weights initialized from the English GPT2 model. Training used:

  • a batch size of 256
  • a OneCycle learning rate schedule with a maximum learning rate of 5e-3
  • the AdamW optimizer with a weight decay of 0.01
  • 2 epochs

Training took roughly 12 days on 8 TPUv3 cores.
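Purely as an illustration of these hyperparameters, a corresponding PyTorch optimizer/scheduler setup could look like the sketch below. The actual run used train.py with configs/tpu_large.json on TPUs; the checkpoint name and steps_per_epoch here are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM

# English GPT2 weights as the starting point (assumed checkpoint name)
model = AutoModelForCausalLM.from_pretrained("gpt2-large")

epochs = 2
steps_per_epoch = 10_000  # placeholder; depends on corpus size at batch size 256

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=5e-3,              # OneCycle maximum learning rate
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
)
```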

To train GerPT2-large, follow these steps. Scripts are located in the GitHub repository:

  1. Download and unzip training data from http://data.statmt.org/cc-100/.
  2. Train a tokenizer using prepare/train_tokenizer.py. As training data for the tokenizer, I used a random subset of 5% of the CC-100 data; a rough sketch of this step is shown after this list.
  3. (Optionally) generate a German input embedding matrix with prepare/generate_aligned_wte.py. This uses a neat trick to semantically map tokens from the English tokenizer to tokens from the German tokenizer using aligned word embeddings, e.g.:

```
ĠMinde -> Ġleast
Ġjed -> Ġwhatsoever
flughafen -> Air
vermittlung -> employment
teilung -> ignment
ĠInterpretation -> Ġinterpretation
Ġimport -> Ġimported
hansa -> irl
genehmigungen -> exempt
ĠAuflist -> Ġlists
Ġverschwunden -> Ġdisappeared
ĠFlyers -> ĠFlyers
Kanal -> Channel
Ġlehr -> Ġteachers
Ġnahelie -> Ġconvenient
gener -> Generally
mitarbeiter -> staff
```

This helped a lot on a trial run I did, although I wasn't able to do a full comparison due to budget and time constraints. To use this WTE matrix, pass it via the wte_path argument to the training script. Credit to this blog post for the idea of initializing GPT2 from English weights. A rough sketch of the alignment idea is also shown after this list.

  4. Tokenize the corpus using prepare/tokenize_text.py. This generates files for train and validation tokens in JSON Lines format.
  5. Run the training script train.py! run.sh shows how this was executed for the full run with config configs/tpu_large.json.
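Step 2 might look roughly like the following sketch, assuming the unpacked CC-100 German split is a single plain-text file (the file name, vocabulary size, and output directory are assumptions, and the real prepare/train_tokenizer.py may differ):

```python
import os
import random

from tokenizers import ByteLevelBPETokenizer

corpus_path = "de.txt"  # assumed name of the unpacked CC-100 German file

def sample_lines(path, rate=0.05):
    """Yield a random ~5% subset of the corpus lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if random.random() < rate:
                yield line

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    sample_lines(corpus_path),
    vocab_size=50_257,                  # GPT2-sized vocabulary (assumed)
    special_tokens=["<|endoftext|>"],
)
os.makedirs("german_tokenizer", exist_ok=True)
tokenizer.save_model("german_tokenizer")
```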
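The aligned-embedding trick from step 3 can be sketched as follows. This is not the actual prepare/generate_aligned_wte.py; it assumes two hypothetical dictionaries, de_vectors and en_vectors, that map tokens of the German and English tokenizers into one shared, aligned word-vector space (e.g. aligned fastText embeddings), and it initializes each German token's input embedding from the most similar English token's GPT2 embedding:

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

en_model = AutoModelForCausalLM.from_pretrained("gpt2")
en_tokenizer = AutoTokenizer.from_pretrained("gpt2")
de_tokenizer = AutoTokenizer.from_pretrained("benjamin/gerpt2")

en_wte = en_model.get_input_embeddings().weight.detach()  # (en_vocab, hidden)

def nearest_english_token(de_vec, en_vectors):
    """English token whose aligned vector is most cosine-similar to de_vec."""
    tokens = list(en_vectors)
    mat = np.stack([en_vectors[t] for t in tokens])
    sims = mat @ de_vec / (np.linalg.norm(mat, axis=1) * np.linalg.norm(de_vec) + 1e-8)
    return tokens[int(sims.argmax())]

def build_german_wte(de_vectors, en_vectors):
    """German WTE matrix initialized from aligned English GPT2 embeddings."""
    # fallback for tokens without an aligned vector: the mean English embedding
    new_wte = en_wte.mean(dim=0).repeat(len(de_tokenizer), 1).clone()
    for de_token, de_vec in de_vectors.items():
        de_id = de_tokenizer.convert_tokens_to_ids(de_token)
        en_id = en_tokenizer.convert_tokens_to_ids(nearest_english_token(de_vec, en_vectors))
        new_wte[de_id] = en_wte[en_id]
    # save with torch.save and pass the resulting file via wte_path to the training script
    return new_wte
```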

License

GerPT2 is licensed under the MIT License.

Citing

Please cite GerPT2 as follows:

```
@misc{Minixhofer_GerPT2_German_large_2020,
  author = {Minixhofer, Benjamin},
  doi = {10.5281/zenodo.5509984},
  month = {12},
  title = {{GerPT2: German large and small versions of GPT2}},
  url = {https://github.com/bminixhofer/gerpt2},
  year = {2020}
}
```

Acknowledgements

Thanks to Hugging Face for awesome tools and infrastructure. Huge thanks to Artus Krohn-Grimberghe at LYTiQ for making this possible by sponsoring the resources used for training.

Owner

  • Name: Benjamin Minixhofer
  • Login: bminixhofer
  • Kind: user
  • Location: Linz, Austria

PhD Student @cambridgeltl

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this model, please cite it as below."
authors:
- family-names: "Minixhofer"
  given-names: "Benjamin"
  orcid: "https://orcid.org/0000-0002-6520-4245"
title: "GerPT2: German large and small versions of GPT2"
version: 1.0.0
doi: 10.5281/zenodo.5509984
date-released: 2020-12-27
url: "https://github.com/bminixhofer/gerpt2"

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0