uniformers
Token-free Language Modeling with ByGPT5 & Friends!
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (6.6%) to scientific vocabulary
Keywords
Repository
Token-free Language Modeling with ByGPT5 & Friends!
Basic Info
Statistics
- Stars: 11
- Watchers: 3
- Forks: 3
- Open Issues: 0
- Releases: 1
Topics
Metadata Files
README.md
Uniformers
Token-free Language Modeling with ByGPT5 & Friends
Uniformers (short for Universal Coded Character Set Transformers) is a library for token-free language modeling. In particular, it contains the reference implementation of ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language Models. ByGPT5 is a token-free decoder-only transformer that excels at character-level tasks such as style-conditioned poetry generation.
- :scroll: Read our paper on ByGPT5 for details.
- :feather: An interactive demo for poetry generation is available.
- :bulb: If you make use of this library in your work, please cite it.
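To see what "token-free" means in practice, the snippet below inspects the tokenizer directly. This is a minimal sketch, assuming ByGPT5Tokenizer follows ByT5's byte-level scheme (one ID per UTF-8 byte, offset by the special tokens); consult the library source for the exact behavior.

```python
from uniformers.models.bygpt5 import ByGPT5Tokenizer

tokenizer = ByGPT5Tokenizer.from_pretrained("nllg/bygpt5-medium-en")

# One ID per UTF-8 byte (plus a trailing EOS), so sequence length tracks
# the byte length of the input rather than a subword vocabulary.
ids = tokenizer("poetry")["input_ids"]
print(ids)
print(tokenizer.decode(ids))
```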
Installation
If you want to use this project as a library, you can install it as a regular
package using pip:
```sh
pip install 'git+https://github.com/potamides/uniformers.git#egg=uniformers'
```
If your goal is to run the included examples (e.g., to reproduce
results), clone the repository and install it in editable mode:
```sh
git clone https://github.com/potamides/uniformers
pip install -e uniformers[examples]
```
Usage
Uniformers builds upon the transformers library and can be used very similarly.

```python
from torch import device
from transformers.pipelines.text_generation import TextGenerationPipeline

from uniformers.models.bygpt5 import ByGPT5LMHeadModel, ByGPT5Tokenizer

prompt = (
    "In a shocking finding, scientist discovered a herd of unicorns living in a "
    "remote, previously unexplored valley, in the Andes Mountains. Even more "
    "surprising to the researchers was the fact that the unicorns spoke perfect "
    "English."
)
pipeline = TextGenerationPipeline(
    model=ByGPT5LMHeadModel.from_pretrained("nllg/bygpt5-medium-en"),
    tokenizer=ByGPT5Tokenizer.from_pretrained("nllg/bygpt5-medium-en"),
    device=device("cuda:0"),
)
completion = pipeline(
    prompt,
    max_length=1024,
    do_sample=True,
    top_k=40,
    temperature=1.0,
    top_p=0.9,
)
print(completion[0]["generated_text"])
```
Poetry can also be generated easily. For more involved usage examples
take a look at the provided [examples](examples).

```python
from torch import device
from transformers.pipelines.text_generation import TextGenerationPipeline

from uniformers.models.bygpt5 import ByGPT5LMHeadModel, ByGPT5Tokenizer
from uniformers.utils import Poetry2Tokens

model = ByGPT5LMHeadModel.from_pretrained("nllg/poetry-bygpt5-base-en")
tokenizer = ByGPT5Tokenizer.from_pretrained("nllg/poetry-bygpt5-base-en")
p2t = Poetry2Tokens(tokenizer)
pipeline = TextGenerationPipeline(
    model=model,
    tokenizer=tokenizer,
    device=device("cuda:0"),
)

styles = (
    tokenizer.bos_token
    + p2t.rhymes2tokens["ABAB"]
    + p2t.meters2tokens["iambus"]
    + p2t.alliterations2tokens["medium"]
)
quatrain = pipeline(
    styles,
    return_full_text=False,
    bad_words_ids=[[id] for id in tokenizer.additional_special_tokens_ids],
    do_sample=True,
    max_length=384,
    top_k=0,
    temperature=0.7,
    top_p=0.9,
)
print(quatrain[0]["generated_text"])
```
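Note how generation is conditioned here: the prompt is not natural language but a sequence of control tokens for rhyme scheme, meter, and alliteration level produced by `Poetry2Tokens`. Passing `bad_words_ids` blocks the sampler from emitting further special tokens inside the poem, and `return_full_text=False` strips the conditioning prefix from the returned text.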
Released Model Checkpoints
We have released the following checkpoints for pre-trained ByGPT5 language models on the Hugging Face Model Hub:
| ByGPT5 | Parameters | Language Modeling | Poetry Generation |
|:-------|:-----------|:------------------|:------------------|
| Small  | 73.5M      | English, German   | English, German   |
| Base   | 139.2M     | English, German   | English, German   |
| Medium | 289.1M     | English, German   | English, German   |
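All listed sizes should load the same way; the sketch below uses an assumed checkpoint name following the `nllg/bygpt5-{size}-{lang}` pattern of the IDs shown in the usage examples, so verify the exact names on the Hugging Face Hub before use.

```python
from uniformers.models.bygpt5 import ByGPT5LMHeadModel, ByGPT5Tokenizer

# Assumed checkpoint name (nllg/bygpt5-{size}-{lang} pattern); confirm the
# exact ID on the Hugging Face Hub before relying on it.
checkpoint = "nllg/bygpt5-small-de"
model = ByGPT5LMHeadModel.from_pretrained(checkpoint)
tokenizer = ByGPT5Tokenizer.from_pretrained(checkpoint)
```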
Released Datasets
By default, this library creates QuaTrain on the fly when needed (which can take some time). A preprocessed version (both in English and German) can be found under releases.
| Dataset  | Language | #Quatrains |
|:---------|:---------|:-----------|
| QuaTrain | English  | 2.7M       |
| QuaTrain | German   | 5.9M       |
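To skip the on-the-fly preprocessing, the released files can presumably be loaded with the `datasets` library already listed as a dependency. The sketch below assumes the release assets are JSON/JSON Lines and uses a placeholder URL; adjust both to match the actual assets on the releases page.

```python
from datasets import load_dataset

# Placeholder URL: substitute a real asset from the GitHub releases page.
# The "json" builder assumes JSON/JSON Lines data; switch builders if the
# released files use a different format.
url = "https://github.com/potamides/uniformers/releases/download/<tag>/<asset>"
quatrain = load_dataset("json", data_files={"train": url})
print(quatrain["train"][0])
```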
Owner
- Name: Jonas Belouadi
- Login: potamides
- Kind: user
- Repositories: 20
- Profile: https://github.com/potamides
GitHub Events
Total
- Watch event: 2
- Issue comment event: 2
- Push event: 2
- Pull request event: 2
- Pull request review comment event: 1
- Pull request review event: 1
Last Year
- Watch event: 2
- Issue comment event: 2
- Push event: 2
- Pull request event: 2
- Pull request review comment event: 1
- Pull request review event: 1
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 2.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 2.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- nickhnsn (2)
- dcgaines (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- datasets ~=2.3.2
- ipapy @ https://github.com/ionite34/ipapy/archive/4fedf540a68b998ddd982c05f113d40aa4f3f97f.zip
- libarchive-c ~=4.0
- lxml ~=4.9.0
- numpy ~=1.22.4
- optuna ~=2.10.1
- sacremoses ~=0.0.53
- scikit-learn ~=1.1.1
- tokenizers ~=0.12.1
- torch ~=1.11.0
- transformers ==4.20.0
- zstandard ~=0.17.0