zeldarose

Train transformer-based models.

https://github.com/loicgrobol/zeldarose

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.7%) to scientific vocabulary

Keywords

bert fine-tuning machine-learning neural-networks nlp pretraining transformers
Last synced: 6 months ago

Repository

Train transformer-based models.

Basic Info
Statistics
  • Stars: 28
  • Watchers: 3
  • Forks: 3
  • Open Issues: 13
  • Releases: 18
Topics
bert fine-tuning machine-learning neural-networks nlp pretraining transformers
Created almost 6 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog Citation

README.md

Zelda Rose


A straightforward trainer for transformer-based models.

Installation

Simply install with pipx:

```bash
pipx install zeldarose
```

Train MLM models

Here is a short example that first trains a tokenizer, then a transformer MLM model:

```bash
TOKENIZERS_PARALLELISM=true zeldarose tokenizer --vocab-size 4096 --out-path local/tokenizer --model-name "my-muppet" tests/fixtures/raw.txt
zeldarose transformer --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt
```

The .txt files are meant to be raw text files, with one sample (e.g. sentence) per line.
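As a sketch of that format, the following creates a tiny corpus at an illustrative path (`/tmp/raw.txt` and the sentences are arbitrary choices, not part of the project):

```shell
# Write a tiny raw-text corpus: one sample (here, one sentence) per line.
# The path /tmp/raw.txt is illustrative only.
cat > /tmp/raw.txt <<'EOF'
The cat sat on the mat.
Transformers are trained on raw text.
One sentence per line keeps samples separate.
EOF

# Each line is one training sample; this corpus has 3 of them.
wc -l < /tmp/raw.txt
```

A file like this can then be passed wherever the commands above expect a `.txt` argument.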

There are other parameters (see zeldarose transformer --help for a comprehensive list); the one you are probably most interested in is --config, which gives the path to a training config (we have examples in examples/).

The parameters --pretrained-model, --tokenizer and --model-config are all fed directly to HuggingFace's transformers and can be pretrained model names or local paths.

Distributed training

This is somewhat tricky; you have several options:

  • If you are running on a SLURM cluster, use --strategy ddp and invoke via srun
    • You might want to preprocess your data first, outside of the main compute allocation. The --profile option might be abused for that purpose, since it won't run a full training but will run any data preprocessing you ask for. It might also be beneficial at this step to load a placeholder model such as RoBERTa-minuscule to avoid running out of memory, since the only thing that matters for this preprocessing is the tokenizer.
  • Otherwise, you have two options:

    • Run with --strategy ddp_spawn, which uses multiprocessing.spawn to start the process swarm (tested, but possibly slower and more limited, see pytorch-lightning doc)
    • Run with --strategy ddp and start with torch.distributed.launch with --use_env and --no_python (untested)
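Under SLURM, the first option above might look like the following job-script sketch. The resource numbers, GPU count, and data paths are assumptions to be adapted to your site, not project defaults:

```bash
#!/bin/bash
#SBATCH --job-name=zeldarose-mlm
#SBATCH --nodes=2               # illustrative: 2 nodes
#SBATCH --ntasks-per-node=4     # typically one task per GPU
#SBATCH --gres=gpu:4            # cluster-specific; adjust to your site

# srun launches one process per task; with --strategy ddp,
# the trainer picks up the SLURM environment for distributed setup.
srun zeldarose transformer \
    --strategy ddp \
    --tokenizer local/tokenizer \
    --pretrained-model flaubert/flaubert_small_cased \
    --out-dir local/muppet \
    --val-text tests/fixtures/raw.txt \
    tests/fixtures/raw.txt
```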

Other hints

  • Data management relies on 🤗 datasets and uses their cache management system. To run in a clean environment, you might have to check the cache directory pointed to by the HF_DATASETS_CACHE environment variable.
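A minimal sketch of isolating that cache for a single run (the directory path is an arbitrary choice):

```shell
# Point 🤗 datasets at a dedicated cache directory so the run starts
# from a known-empty cache. The path is illustrative only.
export HF_DATASETS_CACHE=/tmp/zeldarose-hf-cache
mkdir -p "$HF_DATASETS_CACHE"

# Confirm where the cache now lives.
echo "$HF_DATASETS_CACHE"
```

Any zeldarose command launched from the same shell will then read and write dataset caches under that directory.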


Citation

```bibtex
@inproceedings{grobol:hal-04262806,
  TITLE = {{Zelda Rose: a tool for hassle-free training of transformer models}},
  AUTHOR = {Grobol, Lo{\"i}c},
  URL = {https://hal.science/hal-04262806},
  BOOKTITLE = {{3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS)}},
  ADDRESS = {Singapore, Indonesia},
  YEAR = {2023},
  MONTH = Dec,
  PDF = {https://hal.science/hal-04262806/file/Zeldarose_OSS_EMNLP23.pdf},
  HAL_ID = {hal-04262806},
  HAL_VERSION = {v1},
}
```

Owner

  • Name: Loïc Grobol
  • Login: LoicGrobol
  • Kind: user
  • Location: Paris, France
  • Company: Université Paris Nanterre

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Zelda Rose
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Loïc
    family-names: Grobol
    email: loic.grobol@gmail.com
    affiliation: Université Paris Nanterre
    orcid: 'https://orcid.org/0000-0002-4619-7836'
repository-code: 'https://github.com/LoicGrobol/zeldarose'
url: 'https://zeldarose.readthedocs.io'
abstract: >-
  Zelda Rose is a command line interface for pretraining
  transformer-based models. Its purpose is to enable an easy
  start for users interested in training these ubiquitous
  models, but unable or unwilling to engage with more
  comprehensive --- but more complex --- frameworks and the
  complex interactions between libraries for managing
  models, datasets and computations. Training a model
  requires no code on the user's part and produce models
  directly compatible with the HuggingFace ecosystem,
  allowing quick and easy distribution and reuse. A
  particular care is given to lowering the cost of
  maintainability and future-proofing, by making the code as
  modular as possible and taking advantage of third-party
  libraries to limit ad-hoc code to the strict minimum.
keywords:
  - transformers
  - nlp
  - neural networks
license: MIT

GitHub Events

Total
  • Create event: 20
  • Release event: 1
  • Issues event: 2
  • Delete event: 19
  • Member event: 1
  • Issue comment event: 24
  • Push event: 41
  • Pull request review event: 4
  • Pull request event: 34
Last Year
  • Create event: 20
  • Release event: 1
  • Issues event: 2
  • Delete event: 19
  • Member event: 1
  • Issue comment event: 24
  • Push event: 41
  • Pull request review event: 4
  • Pull request event: 34

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 19
  • Total pull requests: 121
  • Average time to close issues: 7 months
  • Average time to close pull requests: 24 days
  • Total issue authors: 2
  • Total pull request authors: 4
  • Average comments per issue: 0.53
  • Average comments per pull request: 1.0
  • Merged pull requests: 53
  • Bot issues: 0
  • Bot pull requests: 97
Past Year
  • Issues: 3
  • Pull requests: 37
  • Average time to close issues: 3 days
  • Average time to close pull requests: about 2 months
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.7
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 33
Top Authors
Issue Authors
  • LoicGrobol (16)
  • pjox (3)
Pull Request Authors
  • dependabot[bot] (130)
  • LoicGrobol (22)
  • sobamchan (2)
  • renovate[bot] (1)
Top Labels
Issue Labels
enhancement (3) bug (2) needs info (2) documentation (2) test (1)
Pull Request Labels
dependencies (131) python (101) github_actions (19) enhancement (3) bug (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 106 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 19
  • Total maintainers: 1
pypi.org: zeldarose

Train transformer-based models

  • Versions: 19
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 106 Last month
Rankings
Dependent packages count: 10.0%
Stargazers count: 12.0%
Average: 16.4%
Forks count: 19.1%
Downloads: 19.3%
Dependent repos count: 21.8%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/ci.yml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
  • actions/download-artifact v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
  • pypa/gh-action-pypi-publish v1 composite
  • ymyzk/run-tox-gh-actions main composite