uniformers

Token-free Language Modeling with ByGPT5 & Friends!

https://github.com/potamides/uniformers

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.6%) to scientific vocabulary

Keywords

generation, language modeling, poetry, token-free, transformers
Last synced: 6 months ago

Repository

Token-free Language Modeling with ByGPT5 & Friends!

Basic Info
  • Host: GitHub
  • Owner: potamides
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 155 KB
Statistics
  • Stars: 11
  • Watchers: 3
  • Forks: 3
  • Open Issues: 0
  • Releases: 1
Topics
generation, language modeling, poetry, token-free, transformers
Created over 3 years ago · Last pushed 7 months ago
Metadata Files
Readme License Citation

README.md

Uniformers
Token-free Language Modeling with ByGPT5 & Friends

ACL Anthology · arXiv · Semantic Scholar · Colab

Uniformers (short for Universal Coded Character Set Transformers) is a library for token-free language modeling. In particular, it contains the reference implementation of ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language Models. ByGPT5 is a token-free decoder-only transformer that excels at character-level tasks such as style-conditioned poetry generation.

  • :scroll: Read our paper on ByGPT5 for details.
  • :feather: An interactive demo for poetry generation is available.
  • :bulb: If you make use of this library in your work please cite it.
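Token-free here means the model consumes raw UTF-8 bytes rather than subword tokens. A minimal sketch of byte-level "tokenization" in the style of ByT5 (the offset of 3 for the reserved pad/eos/unk ids is an assumption borrowed from ByT5's tokenizer, which ByGPT5 builds on; the actual `ByGPT5Tokenizer` handles more details):

```python
# Simplified byte-level "tokenization": every UTF-8 byte becomes a token
# id, shifted by an offset to leave room for special tokens. This mirrors
# how token-free models like ByGPT5 see text; the offset of 3 (pad=0,
# eos=1, unk=2) is an assumption carried over from ByT5.
OFFSET = 3  # ids 0-2 reserved for special tokens

def encode(text: str) -> list[int]:
    """Map a string to byte-level token ids."""
    return [b + OFFSET for b in text.encode("utf-8")]

def decode(ids: list[int]) -> str:
    """Invert encode(), skipping special ids below the offset."""
    return bytes(i - OFFSET for i in ids if i >= OFFSET).decode("utf-8")

print(encode("Ode"))          # → [82, 103, 104] (one id per byte)
print(decode(encode("Odé")))  # round-trips multi-byte characters
```

Because the vocabulary is just 256 bytes plus a few specials, there is no learned tokenizer to ship or go out of date, at the cost of longer sequences per input.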

Installation

If you want to use this project as a library, you can install it as a regular package using pip:

```sh
pip install 'git+https://github.com/potamides/uniformers.git#egg=uniformers'
```

If your goal is to run the included examples (e.g., to reproduce results), clone the repository and install it in editable mode:

```sh
git clone https://github.com/potamides/uniformers
pip install -e uniformers[examples]
```

Usage

Uniformers builds upon the transformers library and can be used very similarly.

```python
from torch import device
from transformers.pipelines.text_generation import TextGenerationPipeline

from uniformers.models.bygpt5 import ByGPT5LMHeadModel, ByGPT5Tokenizer

prompt = "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."

pipeline = TextGenerationPipeline(
    model=ByGPT5LMHeadModel.from_pretrained("nllg/bygpt5-medium-en"),
    tokenizer=ByGPT5Tokenizer.from_pretrained("nllg/bygpt5-medium-en"),
    device=device("cuda:0"),
)

completion = pipeline(
    prompt,
    max_length=1024,
    do_sample=True,
    top_k=40,
    temperature=1.0,
    top_p=0.9,
)

print(completion[0]["generated_text"])
```

Poetry can also be generated easily. For more involved usage examples take a look at the provided [examples](examples).

```python
from torch import device
from transformers.pipelines.text_generation import TextGenerationPipeline

from uniformers.models.bygpt5 import ByGPT5LMHeadModel, ByGPT5Tokenizer
from uniformers.utils import Poetry2Tokens

model = ByGPT5LMHeadModel.from_pretrained("nllg/poetry-bygpt5-base-en")
tokenizer = ByGPT5Tokenizer.from_pretrained("nllg/poetry-bygpt5-base-en")
p2t = Poetry2Tokens(tokenizer)

pipeline = TextGenerationPipeline(
    model=model,
    tokenizer=tokenizer,
    device=device("cuda:0"),
)

styles = (
    tokenizer.bos_token
    + p2t.rhymes2tokens["ABAB"]
    + p2t.meters2tokens["iambus"]
    + p2t.alliterations2tokens["medium"]
)

quatrain = pipeline(
    styles,
    return_full_text=False,
    bad_words_ids=[[id] for id in tokenizer.additional_special_tokens_ids],
    do_sample=True,
    max_length=384,
    top_k=0,
    temperature=0.7,
    top_p=0.9,
)

print(quatrain[0]["generated_text"])
```

Released Model Checkpoints

We have released the following pre-trained ByGPT5 checkpoints on the Hugging Face Model Hub:

| ByGPT5 | Parameters | Language Modeling | Poetry Generation |
|:-------|:-----------|:------------------|:------------------|
| Small  | 73.5M      | English, German   | English, German   |
| Base   | 139.2M     | English, German   | English, German   |
| Medium | 289.1M     | English, German   | English, German   |
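Only two Hub ids are quoted verbatim above (`nllg/bygpt5-medium-en` and `nllg/poetry-bygpt5-base-en`). The hypothetical helper below extrapolates that naming pattern to the other sizes and languages; the extrapolation is an assumption, so verify each id on the Hub before relying on it:

```python
# Hypothetical helper that builds Hugging Face Hub ids for the released
# checkpoints. The pattern is extrapolated from the two ids quoted in the
# README ("nllg/bygpt5-medium-en", "nllg/poetry-bygpt5-base-en") -- an
# assumption, not a documented API.
SIZES = ("small", "base", "medium")
LANGS = ("en", "de")

def checkpoint_id(size: str, lang: str, poetry: bool = False) -> str:
    """Return the (assumed) Hub id for a ByGPT5 checkpoint."""
    if size not in SIZES or lang not in LANGS:
        raise ValueError(f"unknown checkpoint: {size}/{lang}")
    prefix = "poetry-" if poetry else ""
    return f"nllg/{prefix}bygpt5-{size}-{lang}"

print(checkpoint_id("medium", "en"))             # → nllg/bygpt5-medium-en
print(checkpoint_id("base", "en", poetry=True))  # → nllg/poetry-bygpt5-base-en
```

The returned string can then be passed to `ByGPT5LMHeadModel.from_pretrained(...)` as in the usage examples above.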

Released Datasets

By default, this library creates QuaTrain on the fly when needed (which can take some time). Preprocessed versions in both English and German can be found under the repository's releases.

| Dataset  | Language | #Quatrains |
|:---------|:---------|:-----------|
| QuaTrain | English  | 2.7M       |
| QuaTrain | German   | 5.9M       |
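Each QuaTrain entry is a quatrain, i.e. a four-line stanza. Purely as an illustration of that data shape (this is not the library's actual, more involved preprocessing), slicing a poem into quatrains might look like:

```python
# Illustrative only: chop a poem into consecutive four-line windows.
# QuaTrain's real construction pipeline is more involved; this sketch
# merely shows what a "quatrain" record looks like.
def to_quatrains(poem: str) -> list[list[str]]:
    """Split a poem into full four-line stanzas, dropping the remainder."""
    lines = [l for l in poem.splitlines() if l.strip()]
    return [lines[i:i + 4] for i in range(0, len(lines) - 3, 4)]

poem = "\n".join(f"line {n}" for n in range(1, 10))  # 9 non-empty lines
print(len(to_quatrains(poem)))  # → 2 (the ninth line is dropped)
```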

Owner

  • Name: Jonas Belouadi
  • Login: potamides
  • Kind: user

GitHub Events

Total
  • Watch event: 2
  • Issue comment event: 2
  • Push event: 2
  • Pull request event: 2
  • Pull request review comment event: 1
  • Pull request review event: 1
Last Year
  • Watch event: 2
  • Issue comment event: 2
  • Push event: 2
  • Pull request event: 2
  • Pull request review comment event: 1
  • Pull request review event: 1

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 2.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 2.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • nickhnsn (2)
  • dcgaines (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

pyproject.toml pypi
  • datasets ~=2.3.2
  • ipapy @ https://github.com/ionite34/ipapy/archive/4fedf540a68b998ddd982c05f113d40aa4f3f97f.zip
  • libarchive-c ~=4.0
  • lxml ~=4.9.0
  • numpy ~=1.22.4
  • optuna ~=2.10.1
  • sacremoses ~=0.0.53
  • scikit-learn ~=1.1.1
  • tokenizers ~=0.12.1
  • torch ~=1.11.0
  • transformers ==4.20.0
  • zstandard ~=0.17.0
setup.py pypi