https://github.com/bramvanroy/lt3-2019-transformer-trainer
Transformer trainer for a variety of classification problems that has been used in-house at LT3 for different research topics.
Statistics
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Base LT3 Transformer classifier 2019-2020
!!Code will not be updated!!
This is legacy code that works perfectly fine, but it lacks good documentation and a consistent approach to inference and testing as well as model saving/loading. I recommend using the easily adaptable Trainers over at HuggingFace. This repository was made public to showcase how to allow for a configurable user experience and how to use automatic mixed precision, distributed training, early stopping, and so on.
This code has been used for many different research topics within LT3.
Using it in your own work?
Great! Glad to hear that this repository has been of use to you. If you publish code based on this repository, always include the provided LICENSE file in your copy/modification of the code (and optionally link to this repository from your own). Also, if you produce and publish research results by use of (a modification of) this repository, please place a reference (link) to this repository in your publication.
Requirements
- At least Python 3.6
- PyTorch, preferably with CUDA support. Have a look at the PyTorch Getting Started page on how to install torch for our environment. (You don't need to install torchvision or torchaudio.)
- Apart from PyTorch you also need:
  - pandas
  - sklearn
  - scipy
  - matplotlib
  - transformers
How to use
- The only thing you should ever change is the configuration file `config.json`, particularly `task`, `num_labels`, `multi_label`, `files` and `output_dir`. Don't change any other files unless you know what you are doing. Do not change `defaults.json`: it contains useful default values that are merged with your custom configuration file. If you need a base config file to start from, ask me.
- Your input files should contain three named columns: `id` (a given sentence ID), `text` (the actual text), `label` (the correct label index for the given sentence). By default, the script expects a tab-separated file, but you can change the separator in the config file by adding `"sep": "yourseperator"` to the `training` section.
- Do not train on your laptop. This will take far longer than running it on the server.
- Some parameters accept a list of arguments. This will do a naive hyperparameter search, i.e. all possible combinations will be tried out. For example, you can put `"dropout": [0, 0.2]`, which implies that the script will be run two times with different dropout values. You can mix-and-match, too! E.g. in addition to the dropout list, you can set `"pre_classifier": [null, 2048, 1536]`. All possible combinations will be tried out, so with this combination of dropout and pre-classifier, that will lead to six different runs. All results will be saved to the given `output_dir`. At the end of all runs, the script will tell you which combination was best, both in terms of minimal loss and maximal secondary score (F1 or Pearson).
Multi-label classification
Multi-label classification (and regression) was first added by Luna De Bruyne.
You can use it by setting the option multi_label to true in your config file
under the training options. In such a scenario, and when using categorical classification, it is expected that you
have encoded your categories as 0 and 1. Both for regression and classification, you must set num_labels to the number
of labels per item that you are predicting. So if you have three categories that are encoded as 0/1, e.g. 1,0,1
in your data file, then num_labels should be 3 (not 2).
In terms of data, the columns should still be id, text, and label as above but for multi-label classification,
the labels should be separated by commas in the label column.
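As an illustration of this layout, the sketch below reads a small tab-separated sample and splits the label column into one 0/1 flag per category. The sample rows and the `read_multi_label` helper are made up for this example and are not part of the repository; for regression the values would be parsed as floats instead of ints.

```python
import csv
import io

# Hypothetical two-row sample in the expected layout: tab-separated, with the
# columns id, text and label; labels are comma-separated 0/1 flags.
sample = (
    "id\ttext\tlabel\n"
    "s1\tGreat plot but poor acting\t1,0,1\n"
    "s2\tNothing special\t0,0,0\n"
)


def read_multi_label(handle, sep="\t"):
    """Return a list of (id, text, labels) tuples with labels as ints."""
    reader = csv.DictReader(handle, delimiter=sep)
    rows = []
    for row in reader:
        labels = [int(value) for value in row["label"].split(",")]
        rows.append((row["id"], row["text"], labels))
    return rows


rows = read_multi_label(io.StringIO(sample))
# num_labels is the number of flags per item, not the number of distinct values.
num_labels = len(rows[0][2])
print(num_labels)  # 3
```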
Running the script
Again: do not run this on your laptop.
Because fine-tuning a language model takes a while, you may want to run this in screen mode, meaning that you can
leave the process running on the server without your laptop's terminal being on as well. To start a new, named screen:
```bash
screen -S myscreenname
```
You can leave this screen ('detach') at any time by pressing CTRL+a followed by d. You can resume ('attach') the screen's session
by typing
```bash
screen -rad myscreenname
```
You can completely terminate a screen by typing exit. (But remember that this is the same command as for exiting a
virtual environment. So if you are inside a screen and inside a virtual environment in that screen, you have to exit
twice.)
After installation of the virtual environment, you can run the script on multiple GPUs on Weoh. This is great, because PyTorch scales really well - meaning that training will be a lot faster.
Be aware of other users on the server! You don't want to hog the server while others are working on it, too. You can
check if a GPU is used by typing nvidia-smi. The bottom table shows active processes.
If all GPUs are free, you can run the following code (preferably in your newly created screen session):
```bash
# activate environment in root dir
pipenv shell
# 4 because we have 4 GPUs
python -m torch.distributed.launch --nproc_per_node 4 newsdna_classifier/predict.py
```
However, if one or more GPUs are already being used, you need to tell your terminal that you only wish to "see" a
subset of available devices. This can be done by the CUDA_VISIBLE_DEVICES environment variable. As an example, if
you see that the first GPU is taken (GPU ID=0), then we can only use 1, 2 and 3.
```bash
# 1,2,3 because we don't want to use 0
# 3 because we can only use 3 GPUs
CUDA_VISIBLE_DEVICES=1,2,3 python -m torch.distributed.launch --nproc_per_node 3 newsdna_classifier/predict.py
```
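The GPU selection described above can be sketched as a small helper that assembles the same command from the list of busy GPU IDs. This is only an illustration: `build_launch_command` and its parameters are hypothetical and not part of the repository, and you still have to check `nvidia-smi` yourself to know which GPUs are busy.

```python
def build_launch_command(total_gpus, busy_gpus, script="newsdna_classifier/predict.py"):
    """Return a shell command that launches one process per free GPU."""
    busy = set(busy_gpus)
    free = [str(gpu_id) for gpu_id in range(total_gpus) if gpu_id not in busy]
    if not free:
        raise ValueError("no free GPU available")
    # Expose only the free GPUs and start as many processes as there are GPUs.
    return (
        f"CUDA_VISIBLE_DEVICES={','.join(free)} "
        f"python -m torch.distributed.launch --nproc_per_node {len(free)} {script}"
    )


# GPU 0 is taken on a 4-GPU server, so only 1, 2 and 3 are used.
print(build_launch_command(total_gpus=4, busy_gpus=[0]))
```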
By default, the script will use the CPU. If you want to use the GPU, you explicitly have to pass --local_rank 0, or
similar, e.g.
```bash
# Only make GPU ID #2 visible to the script, which internally then will be rank 0
CUDA_VISIBLE_DEVICES=2 python newsdna_classifier/predict.py --local_rank 0
```
Inference is possible by passing the path to the saved model. In such a case, predictions will be created for the
test file in your config file. These predictions will be saved to predictions.csv by default, but the path can be
changed by setting pred_output_path in the config file under training options.
```bash
python newsdna_classifier/predict.py <your-config-file> --infer data/saved/best-model-chkpnt.pth --local_rank 0
```
Simply testing the performance of a model according to the test file in the config is also easy. Running a
command similar to the one below will compare the outputs of a given model with the correct labels. The
results are printed to the screen but not saved.
```bash
python newsdna_classifier/predict.py <your-config-file> --test data/saved/best-model-chkpnt.pth --local_rank 0
```
Owner
- Name: Bram Vanroy
- Login: BramVanroy
- Kind: user
- Location: Belgium
- Company: @CCL-KULeuven @instituutnederlandsetaal
- Website: https://bramvanroy.github.io/
- Repositories: 29
- Profile: https://github.com/BramVanroy
👋 My name is Bram and I work on natural language processing and machine translation (evaluation) but I also spend a lot of time in this open-source world 🌍
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Bram Vanroy | B****y@U****e | 16 |
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0