zeldarose

Train transformer-based models.

https://github.com/loicgrobol/zeldarose

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.7%) to scientific vocabulary

Keywords

bert fine-tuning machine-learning neural-networks nlp pretraining transformers
Last synced: 6 months ago

Repository

Train transformer-based models.

Basic Info
Statistics
  • Stars: 28
  • Watchers: 3
  • Forks: 3
  • Open Issues: 13
  • Releases: 18
Topics
bert fine-tuning machine-learning neural-networks nlp pretraining transformers
Created almost 6 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog Citation

README.md

Zelda Rose


A straightforward trainer for transformer-based models.

Installation

Simply install with pipx:

```bash
pipx install zeldarose
```

Train MLM models

Here is a short example that first trains a tokenizer, then a transformer MLM model:

```bash
TOKENIZERS_PARALLELISM=true zeldarose tokenizer --vocab-size 4096 --out-path local/tokenizer --model-name "my-muppet" tests/fixtures/raw.txt
zeldarose transformer --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt
```

The .txt files are meant to be raw text files, with one sample (e.g. sentence) per line.
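As a sketch of that format, the following creates a tiny corpus at an illustrative path (`/tmp/raw.txt` and the sentences are arbitrary choices, not part of the project):

```shell
# Write a tiny raw-text corpus: one sample (here, one sentence) per line.
# The path /tmp/raw.txt is illustrative only.
cat > /tmp/raw.txt <<'EOF'
The cat sat on the mat.
Transformers are trained on raw text.
One sentence per line keeps samples separate.
EOF

# Each line is one training sample; this corpus has 3 of them.
wc -l < /tmp/raw.txt
```

A file like this can then be passed wherever the commands above expect a `.txt` argument.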

There are other parameters (see zeldarose transformer --help for a comprehensive list); the one you are probably most interested in is --config, which gives the path to a training config (we have examples in examples/).

The parameters --pretrained-model, --tokenizer and --model-config are all fed directly to HuggingFace's transformers and can be pretrained model names or local paths.

Distributed training

This is somewhat tricky; you have several options:

  • If you are running on a SLURM cluster, use --strategy ddp and invoke via srun
    • You might want to preprocess your data first, outside of the main compute allocation. The --profile option might be abused for that purpose, since it won't run a full training but will run any data preprocessing you ask for. It might also be beneficial at this step to load a placeholder model such as RoBERTa-minuscule to avoid running out of memory, since the only thing that matters for this preprocessing is the tokenizer.
  • Otherwise, you have two options:

    • Run with --strategy ddp_spawn, which uses multiprocessing.spawn to start the process swarm (tested, but possibly slower and more limited, see pytorch-lightning doc)
    • Run with --strategy ddp and start with torch.distributed.launch with --use_env and --no_python (untested)
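Under SLURM, the first option above might look like the following job-script sketch. The resource numbers, GPU count, and data paths are assumptions to be adapted to your site, not project defaults:

```bash
#!/bin/bash
#SBATCH --job-name=zeldarose-mlm
#SBATCH --nodes=2               # illustrative: 2 nodes
#SBATCH --ntasks-per-node=4     # typically one task per GPU
#SBATCH --gres=gpu:4            # cluster-specific; adjust to your site

# srun launches one process per task; with --strategy ddp,
# the trainer picks up the SLURM environment for distributed setup.
srun zeldarose transformer \
    --strategy ddp \
    --tokenizer local/tokenizer \
    --pretrained-model flaubert/flaubert_small_cased \
    --out-dir local/muppet \
    --val-text tests/fixtures/raw.txt \
    tests/fixtures/raw.txt
```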

Other hints

  • Data management relies on 🤗 datasets and uses their cache management system. To run in a clean environment, you might have to check the cache directory pointed to by the HF_DATASETS_CACHE environment variable.
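A minimal sketch of isolating that cache for a single run (the directory path is an arbitrary choice):

```shell
# Point 🤗 datasets at a dedicated cache directory so the run starts
# from a known-empty cache. The path is illustrative only.
export HF_DATASETS_CACHE=/tmp/zeldarose-hf-cache
mkdir -p "$HF_DATASETS_CACHE"

# Confirm where the cache now lives.
echo "$HF_DATASETS_CACHE"
```

Any zeldarose command launched from the same shell will then read and write dataset caches under that directory.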


Citation

```bibtex
@inproceedings{grobol:hal-04262806,
  TITLE = {{Zelda Rose: a tool for hassle-free training of transformer models}},
  AUTHOR = {Grobol, Lo{\"i}c},
  URL = {https://hal.science/hal-04262806},
  BOOKTITLE = {{3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS)}},
  ADDRESS = {Singapore, Indonesia},
  YEAR = {2023},
  MONTH = Dec,
  PDF = {https://hal.science/hal-04262806/file/Zeldarose_OSS_EMNLP23.pdf},
  HAL_ID = {hal-04262806},
  HAL_VERSION = {v1},
}
```

Owner

  • Name: Loïc Grobol
  • Login: LoicGrobol
  • Kind: user
  • Location: Paris, France
  • Company: Université Paris Nanterre

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Zelda Rose
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Loïc
    family-names: Grobol
    email: loic.grobol@gmail.com
    affiliation: Université Paris Nanterre
    orcid: 'https://orcid.org/0000-0002-4619-7836'
repository-code: 'https://github.com/LoicGrobol/zeldarose'
url: 'https://zeldarose.readthedocs.io'
abstract: >-
  Zelda Rose is a command line interface for pretraining
  transformer-based models. Its purpose is to enable an easy
  start for users interested in training these ubiquitous
  models, but unable or unwilling to engage with more
  comprehensive --- but more complex --- frameworks and the
  complex interactions between libraries for managing
  models, datasets and computations. Training a model
  requires no code on the user's part and produce models
  directly compatible with the HuggingFace ecosystem,
  allowing quick and easy distribution and reuse. A
  particular care is given to lowering the cost of
  maintainability and future-proofing, by making the code as
  modular as possible and taking advantage of third-party
  libraries to limit ad-hoc code to the strict minimum.
keywords:
  - transformers
  - nlp
  - neural networks
license: MIT

GitHub Events

Total
  • Create event: 20
  • Release event: 1
  • Issues event: 2
  • Delete event: 19
  • Member event: 1
  • Issue comment event: 24
  • Push event: 41
  • Pull request review event: 4
  • Pull request event: 34
Last Year
  • Create event: 20
  • Release event: 1
  • Issues event: 2
  • Delete event: 19
  • Member event: 1
  • Issue comment event: 24
  • Push event: 41
  • Pull request review event: 4
  • Pull request event: 34

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 19
  • Total pull requests: 121
  • Average time to close issues: 7 months
  • Average time to close pull requests: 24 days
  • Total issue authors: 2
  • Total pull request authors: 4
  • Average comments per issue: 0.53
  • Average comments per pull request: 1.0
  • Merged pull requests: 53
  • Bot issues: 0
  • Bot pull requests: 97
Past Year
  • Issues: 3
  • Pull requests: 37
  • Average time to close issues: 3 days
  • Average time to close pull requests: about 2 months
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.7
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 33
Top Authors
Issue Authors
  • LoicGrobol (16)
  • pjox (3)
Pull Request Authors
  • dependabot[bot] (130)
  • LoicGrobol (22)
  • sobamchan (2)
  • renovate[bot] (1)
Top Labels
Issue Labels
enhancement (3) bug (2) needs info (2) documentation (2) test (1)
Pull Request Labels
dependencies (131) python (101) github_actions (19) enhancement (3) bug (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 106 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 19
  • Total maintainers: 1
pypi.org: zeldarose

Train transformer-based models

  • Versions: 19
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 106 Last month
Rankings
Dependent packages count: 10.0%
Stargazers count: 12.0%
Average: 16.4%
Forks count: 19.1%
Downloads: 19.3%
Dependent repos count: 21.8%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/ci.yml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
  • actions/download-artifact v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
  • pypa/gh-action-pypi-publish v1 composite
  • ymyzk/run-tox-gh-actions main composite