Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (16.7%) to scientific vocabulary
Keywords
Repository
Train transformer-based models.
Basic Info
- Host: GitHub
- Owner: LoicGrobol
- License: other
- Language: Python
- Default Branch: main
- Homepage: https://zeldarose.readthedocs.io
- Size: 1.05 MB
Statistics
- Stars: 28
- Watchers: 3
- Forks: 3
- Open Issues: 13
- Releases: 18
Topics
Metadata Files
README.md
Zelda Rose
A straightforward trainer for transformer-based models.
Installation
Simply install with pipx:
```bash
pipx install zeldarose
```
Train MLM models
Here is a short example of training first a tokenizer, then a transformer MLM model:
```bash
TOKENIZERS_PARALLELISM=true zeldarose tokenizer --vocab-size 4096 --out-path local/tokenizer --model-name "my-muppet" tests/fixtures/raw.txt
zeldarose transformer --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt
```
The .txt files are meant to be raw text files, with one sample (e.g. sentence) per line.
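For illustration, a tiny corpus in this format could be created like so (a minimal sketch; the file path is just an example):

```bash
# Create a toy raw-text corpus: one sample (here, one sentence) per line.
mkdir -p local
cat > local/raw.txt <<'EOF'
This is the first training sample.
Here is a second, independent sentence.
EOF
```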
There are other parameters (see `zeldarose transformer --help` for a comprehensive list); the one you are probably most interested in is `--config`, which gives the path to a training config (for which we have `examples/`).
The parameters `--pretrained-model`, `--tokenizer` and `--model-config` are all fed directly to Huggingface's `transformers` and can be pretrained model names or local paths.
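Putting these together, a more complete invocation might look like the sketch below; the config file name and the local checkpoint path are hypothetical placeholders, not files shipped with the project:

```bash
# Hypothetical paths: point --config at one of your training configs
# (see examples/ in the repository) and --pretrained-model at a local checkpoint.
zeldarose transformer \
    --config local/my-training-config.toml \
    --tokenizer local/tokenizer \
    --pretrained-model local/checkpoints/muppet \
    --out-dir local/muppet-continued \
    --val-text tests/fixtures/raw.txt \
    tests/fixtures/raw.txt
```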
Distributed training
This is somewhat tricky; you have several options:
- If you are running in a SLURM cluster, use `--strategy ddp` and invoke via `srun` (see the sketch after this list).
- You might want to preprocess your data first outside of the main compute allocation. The `--profile` option might be abused for that purpose, since it won't run a full training, but will run any data preprocessing you ask for. It might also be beneficial at this step to load a placeholder model such as RoBERTa-minuscule to avoid running out of memory, since the only thing that matters for this preprocessing is the tokenizer.
- Otherwise you have two options:
  - Run with `--strategy ddp_spawn`, which uses `multiprocessing.spawn` to start the process swarm (tested, but possibly slower and more limited, see the `pytorch-lightning` doc).
  - Run with `--strategy ddp` and start with `torch.distributed.launch` with `--use_env` and `--no_python` (untested).
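As a hedged sketch of these launch styles (task and process counts are placeholders, and the argument values are simply reused from the MLM example above, not tested recommendations):

```bash
# Shared arguments, reused from the MLM example above.
args=(--tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased
      --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt)

# On a SLURM cluster: DDP strategy, launched via srun (task count is a placeholder).
srun --ntasks=4 zeldarose transformer --strategy ddp "${args[@]}"

# Otherwise, option 1: multiprocessing.spawn-based launch (tested per the README).
zeldarose transformer --strategy ddp_spawn "${args[@]}"

# Option 2: DDP via torch.distributed.launch (untested per the README;
# the per-node process count is a placeholder).
python -m torch.distributed.launch --nproc_per_node=4 --use_env --no_python \
    zeldarose transformer --strategy ddp "${args[@]}"
```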
Other hints
- Data management relies on 🤗 datasets and uses their cache management system. To run in a clear environment, you might have to check the cache directory pointed to by the `HF_DATASETS_CACHE` environment variable (see the example below).
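For example, a minimal sketch (the cache path is a placeholder) that runs a training with a throwaway cache:

```bash
# Point 🤗 datasets at a disposable cache directory for a clean run.
export HF_DATASETS_CACHE=/tmp/zeldarose-hf-cache
zeldarose transformer --tokenizer local/tokenizer \
    --pretrained-model flaubert/flaubert_small_cased \
    --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt
```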
Inspirations
- https://github.com/shoarora/lmtuners
- https://github.com/huggingface/transformers/blob/243e687be6cd701722cce050005a2181e78a08a8/examples/run_language_modeling.py
Citation
```bibtex
@inproceedings{grobol:hal-04262806,
  TITLE = {{Zelda Rose: a tool for hassle-free training of transformer models}},
  AUTHOR = {Grobol, Lo{\"i}c},
  URL = {https://hal.science/hal-04262806},
  BOOKTITLE = {{3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS)}},
  ADDRESS = {Singapore, Indonesia},
  YEAR = {2023},
  MONTH = Dec,
  PDF = {https://hal.science/hal-04262806/file/Zeldarose_OSS_EMNLP23.pdf},
  HAL_ID = {hal-04262806},
  HAL_VERSION = {v1},
}
```
Owner
- Name: Loïc Grobol
- Login: LoicGrobol
- Kind: user
- Location: Paris, France
- Company: Université Paris Nanterre
- Website: https://lgrobol.eu
- Repositories: 75
- Profile: https://github.com/LoicGrobol
Citation (CITATION.cff)
```yaml
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: Zelda Rose
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Loïc
    family-names: Grobol
    email: loic.grobol@gmail.com
    affiliation: Université Paris Nanterre
    orcid: 'https://orcid.org/0000-0002-4619-7836'
repository-code: 'https://github.com/LoicGrobol/zeldarose'
url: 'https://zeldarose.readthedocs.io'
abstract: >-
  Zelda Rose is a command line interface for pretraining
  transformer-based models. Its purpose is to enable an easy
  start for users interested in training these ubiquitous
  models, but unable or unwilling to engage with more
  comprehensive --- but more complex --- frameworks and the
  complex interactions between libraries for managing
  models, datasets and computations. Training a model
  requires no code on the user's part and produces models
  directly compatible with the HuggingFace ecosystem,
  allowing quick and easy distribution and reuse. Particular
  care is given to lowering the cost of maintainability and
  future-proofing, by making the code as modular as possible
  and taking advantage of third-party libraries to limit
  ad-hoc code to the strict minimum.
keywords:
  - transformers
  - nlp
  - neural networks
license: MIT
```
GitHub Events
Total
- Create event: 20
- Release event: 1
- Issues event: 2
- Delete event: 19
- Member event: 1
- Issue comment event: 24
- Push event: 41
- Pull request review event: 4
- Pull request event: 34
Last Year
- Create event: 20
- Release event: 1
- Issues event: 2
- Delete event: 19
- Member event: 1
- Issue comment event: 24
- Push event: 41
- Pull request review event: 4
- Pull request event: 34
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 19
- Total pull requests: 121
- Average time to close issues: 7 months
- Average time to close pull requests: 24 days
- Total issue authors: 2
- Total pull request authors: 4
- Average comments per issue: 0.53
- Average comments per pull request: 1.0
- Merged pull requests: 53
- Bot issues: 0
- Bot pull requests: 97
Past Year
- Issues: 3
- Pull requests: 37
- Average time to close issues: 3 days
- Average time to close pull requests: about 2 months
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.7
- Merged pull requests: 7
- Bot issues: 0
- Bot pull requests: 33
Top Authors
Issue Authors
- LoicGrobol (16)
- pjox (3)
Pull Request Authors
- dependabot[bot] (130)
- LoicGrobol (22)
- sobamchan (2)
- renovate[bot] (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 106 last-month (pypi)
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 19
- Total maintainers: 1
pypi.org: zeldarose
Train transformer-based models
- Documentation: https://zeldarose.readthedocs.io
- License: MIT
- Latest release: 0.12.0 (published about 1 year ago)
Rankings
Maintainers (1)
Dependencies
- actions/cache v3 composite
- actions/checkout v3 composite
- actions/download-artifact v3 composite
- actions/setup-python v4 composite
- actions/upload-artifact v3 composite
- pypa/gh-action-pypi-publish v1 composite
- ymyzk/run-tox-gh-actions main composite