uniformers
Token-free Language Modeling with ByGPT5 & Friends!
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (6.6%) to scientific vocabulary
Keywords
Repository
Token-free Language Modeling with ByGPT5 & Friends!
Basic Info
Statistics
- Stars: 11
- Watchers: 3
- Forks: 3
- Open Issues: 0
- Releases: 1
Topics
Metadata Files
README.md
Uniformers
Token-free Language Modeling with ByGPT5 & Friends
Uniformers (short for Universal Coded Character Set Transformers) is a library for token-free language modeling. In particular, it contains the reference implementation of ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language Models. ByGPT5 is a token-free decoder-only transformer that excels at character-level tasks such as style-conditioned poetry generation.
- :scroll: Read our paper on ByGPT5 for details.
- :feather: An interactive demo for poetry generation is available.
- :bulb: If you make use of this library in your work, please cite it.
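To see what "token-free" means in practice, the snippet below inspects the tokenizer directly. This is a minimal sketch, assuming ByGPT5Tokenizer follows ByT5's byte-level scheme (one ID per UTF-8 byte, offset by the special tokens); consult the library source for the exact behavior.

```python
from uniformers.models.bygpt5 import ByGPT5Tokenizer

tokenizer = ByGPT5Tokenizer.from_pretrained("nllg/bygpt5-medium-en")

# One ID per UTF-8 byte (plus a trailing EOS), so sequence length tracks
# the byte length of the input rather than a subword vocabulary.
ids = tokenizer("poetry")["input_ids"]
print(ids)
print(tokenizer.decode(ids))
```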
Installation
If you want to use this project as a library, you can install it as a regular
package using pip:
```sh
pip install 'git+https://github.com/potamides/uniformers.git#egg=uniformers'
```
If your goal is to run the included examples (e.g., to reproduce
results), clone the repository and install it in editable mode:
```sh
git clone https://github.com/potamides/uniformers
pip install -e uniformers[examples]
```
Usage
Uniformers builds upon the transformers library and can be used very similarly.

```python
from torch import device
from transformers.pipelines.text_generation import TextGenerationPipeline

from uniformers.models.bygpt5 import ByGPT5LMHeadModel, ByGPT5Tokenizer

prompt = (
    "In a shocking finding, scientist discovered a herd of unicorns living in a "
    "remote, previously unexplored valley, in the Andes Mountains. Even more "
    "surprising to the researchers was the fact that the unicorns spoke perfect "
    "English."
)
pipeline = TextGenerationPipeline(
    model=ByGPT5LMHeadModel.from_pretrained("nllg/bygpt5-medium-en"),
    tokenizer=ByGPT5Tokenizer.from_pretrained("nllg/bygpt5-medium-en"),
    device=device("cuda:0"),
)
completion = pipeline(
    prompt,
    max_length=1024,
    do_sample=True,
    top_k=40,
    temperature=1.0,
    top_p=0.9,
)
print(completion[0]["generated_text"])
```
Poetry can also be generated easily. For more involved usage examples
take a look at the provided [examples](examples).

```python
from torch import device
from transformers.pipelines.text_generation import TextGenerationPipeline

from uniformers.models.bygpt5 import ByGPT5LMHeadModel, ByGPT5Tokenizer
from uniformers.utils import Poetry2Tokens

model = ByGPT5LMHeadModel.from_pretrained("nllg/poetry-bygpt5-base-en")
tokenizer = ByGPT5Tokenizer.from_pretrained("nllg/poetry-bygpt5-base-en")
p2t = Poetry2Tokens(tokenizer)
pipeline = TextGenerationPipeline(
    model=model,
    tokenizer=tokenizer,
    device=device("cuda:0"),
)

styles = (
    tokenizer.bos_token
    + p2t.rhymes2tokens["ABAB"]
    + p2t.meters2tokens["iambus"]
    + p2t.alliterations2tokens["medium"]
)
quatrain = pipeline(
    styles,
    return_full_text=False,
    bad_words_ids=[[id] for id in tokenizer.additional_special_tokens_ids],
    do_sample=True,
    max_length=384,
    top_k=0,
    temperature=0.7,
    top_p=0.9,
)
print(quatrain[0]["generated_text"])
```
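Note how generation is conditioned here: the prompt is not natural language but a sequence of control tokens for rhyme scheme, meter, and alliteration level produced by `Poetry2Tokens`. Passing `bad_words_ids` blocks the sampler from emitting further special tokens inside the poem, and `return_full_text=False` strips the conditioning prefix from the returned text.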
Released Model Checkpoints
We have released the following checkpoints for pre-trained ByGPT5 language models on the Hugging Face Model Hub:
| ByGPT5 | Parameters | Language Modeling | Poetry Generation |
|:-------|:-----------|:------------------|:------------------|
| Small  | 73.5M      | English, German   | English, German   |
| Base   | 139.2M     | English, German   | English, German   |
| Medium | 289.1M     | English, German   | English, German   |
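All listed sizes should load the same way; the sketch below uses an assumed checkpoint name following the `nllg/bygpt5-{size}-{lang}` pattern of the IDs shown in the usage examples, so verify the exact names on the Hugging Face Hub before use.

```python
from uniformers.models.bygpt5 import ByGPT5LMHeadModel, ByGPT5Tokenizer

# Assumed checkpoint name (nllg/bygpt5-{size}-{lang} pattern); confirm the
# exact ID on the Hugging Face Hub before relying on it.
checkpoint = "nllg/bygpt5-small-de"
model = ByGPT5LMHeadModel.from_pretrained(checkpoint)
tokenizer = ByGPT5Tokenizer.from_pretrained(checkpoint)
```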
Released Datasets
By default, this library creates QuaTrain on the fly when needed (which can take some time). A preprocessed version (both in English and German) can be found under releases.
| Dataset  | Language | #Quatrains |
|:---------|:---------|:-----------|
| QuaTrain | English  | 2.7M       |
| QuaTrain | German   | 5.9M       |
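To skip the on-the-fly preprocessing, the released files can presumably be loaded with the `datasets` library already listed as a dependency. The sketch below assumes the release assets are JSON/JSON Lines and uses a placeholder URL; adjust both to match the actual assets on the releases page.

```python
from datasets import load_dataset

# Placeholder URL: substitute a real asset from the GitHub releases page.
# The "json" builder assumes JSON/JSON Lines data; switch builders if the
# released files use a different format.
url = "https://github.com/potamides/uniformers/releases/download/<tag>/<asset>"
quatrain = load_dataset("json", data_files={"train": url})
print(quatrain["train"][0])
```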
Owner
- Name: Jonas Belouadi
- Login: potamides
- Kind: user
- Repositories: 20
- Profile: https://github.com/potamides
GitHub Events
Total
- Watch event: 2
- Issue comment event: 2
- Push event: 2
- Pull request event: 2
- Pull request review comment event: 1
- Pull request review event: 1
Last Year
- Watch event: 2
- Issue comment event: 2
- Push event: 2
- Pull request event: 2
- Pull request review comment event: 1
- Pull request review event: 1
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 2.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 2.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- nickhnsn (2)
- dcgaines (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- datasets ~=2.3.2
- ipapy @ https://github.com/ionite34/ipapy/archive/4fedf540a68b998ddd982c05f113d40aa4f3f97f.zip
- libarchive-c ~=4.0
- lxml ~=4.9.0
- numpy ~=1.22.4
- optuna ~=2.10.1
- sacremoses ~=0.0.53
- scikit-learn ~=1.1.1
- tokenizers ~=0.12.1
- torch ~=1.11.0
- transformers ==4.20.0
- zstandard ~=0.17.0