zabantu-beta

ZaBantu is a fleet of light-weight Masked Language Models for Southern Bantu Languages

https://github.com/dsfsi/zabantu-beta

Science Score: 18.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.6%) to scientific vocabulary

Keywords

low-resource-languages nlp roberta sotho tshivenda tsonga xlm-roberta zulu
Last synced: 6 months ago

Repository

ZaBantu is a fleet of light-weight Masked Language Models for Southern Bantu Languages

Basic Info
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
low-resource-languages nlp roberta sotho tshivenda tsonga xlm-roberta zulu
Created almost 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Contributing License Citation

README.md

ZaBantu Beta

Training Lite Cross-Lingual Language Models for South African Bantu Languages - Preview

  • This repository aims to provide QuickStart templates for training polyglot (i.e. multilingual) Large Language Models (LLMs) in low-resource settings, with a specific focus on Bantu languages.

  • You can use this repo as a starting point for:

    • Masked Language Modeling (MLM) - see train_masked folder
    • Fine-tuning on semantic downstream tasks - see notebooks folder
    • For example: Named Entity Recognition (NER), Sentiment Analysis, Fake News/Misinformation Detection, Text Generation, etc.
  • Refer to the docs folder for more details or visit the project website

  • You can also try out some trained models on Huggingface
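For a quick smoke test of a published checkpoint, the `transformers` fill-mask pipeline can be used directly. A minimal sketch, assuming PyTorch is installed; the model id below is a hypothetical placeholder, so check the Hugging Face hub for the actual checkpoint names:

```python
# Smoke test a trained ZaBantu-style masked language model.
# NOTE: "dsfsi/zabantu-xlm-roberta" is a placeholder id - look up the real
# checkpoint names on the Hugging Face hub before running.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="dsfsi/zabantu-xlm-roberta")

# XLM-R style models use <mask> as the mask token; replace the text with a
# sentence in one of the target languages.
for prediction in fill_mask("Your sentence here with a <mask> token."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```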

Pre-requisites

  • Ubuntu 20.04 - The project is guaranteed to work on Ubuntu 20.04, but should work on other Linux distributions as well.
  • NVIDIA GPU - for training the Large Language Model (LLM) on a GPU
  • CUDA Toolkit - for GPU acceleration. If you are training on a cloud Data Science VM, this should be pre-installed.
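Before starting a training run, it is worth confirming that the GPU is actually visible to the deep-learning backend. A minimal sketch, assuming PyTorch (the backend `transformers` trains with) is installed:

```python
# Sanity-check that CUDA is set up and a GPU is visible to PyTorch.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA version (compiled against):", torch.version.cuda)
```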

(Recommended) Cloud Data Science VMs

QuickStart

1. Clone the repository

```bash
git clone https://github.com/ndamulelonemakh/zabantu-beta.git
cd zabantu-beta
```


2. Install dependencies

  • 2.1. NVIDIA Drivers and CUDA Toolkit
    • If you have opted not to use a Managed Data Science VM, you will need to manually install NVIDIA drivers and CUDA Toolkit using our utility scripts as follows:
    • SKIP THIS STEP if you are using a VM provided by DSFSI as the drivers are pre-installed.

```bash
bash scripts/nvidia_setup.sh

# On the first run, the script will reboot your machine to load the NVIDIA drivers.
# After rebooting, run the script again to install the CUDA Toolkit:
bash scripts/nvidia_setup.sh

# Reload the .bashrc file to make sure the CUDA Toolkit is in your PATH:
source ~/.bashrc
```

  • 2.2. Python Dependencies
    • Once your NVIDIA dependencies are in order, you can proceed to install the Python-related dependencies using the following commands:

```bash
bash scripts/server_setup.sh

# Reload the .bashrc file to make sure conda and poetry are in your PATH:
source ~/.bashrc
```

Optional: If you intend to use comet.ml and other optional tools, copy the `env.template` file to `.env` (e.g. `cp env.template .env`) and fill in the required fields
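For reference, tools like comet.ml read these secrets from the environment at runtime, typically via `python-dotenv` (already a project dependency). A minimal sketch; `COMET_API_KEY` is used here as an illustrative key, so check `env.template` for the actual field names:

```python
# Load secrets from .env into the process environment.
# COMET_API_KEY is illustrative - see env.template for the real field names.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print("Comet configured:", bool(os.getenv("COMET_API_KEY")))
```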


3. Pre-Train a sample Large Language Model (LLM)

```bash
make train_lite

# This will run a sample training session on a sample dataset under demos/data.
# Depending on how powerful your GPU is, training can take anywhere from a few
# minutes to a few hours.
```

  • If you wish to reproduce all the experiments, you can use `dvc repro`, which will run all the stages defined in the `dvc.yaml` file:
    • Start by downloading the full training set by following the instructions in the Get Data docs
    • Then run the following command from the project root directory:

```bash
dvc repro
```


4. Fine-tune the Pre-Trained LLM on a downstream task

```bash
make fake_news_detection  # TODO
```


5. Evaluate & Visualize the results

```bash
make evaluate_fake_news_detection  # TODO
```


  • Hint: Refer to the Makefile for details on the commands used in the QuickStart guide. You can easily modify the commands to suit your specific use case.
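Since both make targets above are still marked TODO, the sketch below shows what fine-tuning a pre-trained checkpoint for a classification task like fake news detection could look like with the Hugging Face `Trainer`. The checkpoint path, CSV files, and hyperparameters are all hypothetical placeholders, not repo-provided artifacts:

```python
# Hedged sketch: fine-tune a pre-trained ZaBantu checkpoint for binary text
# classification (e.g. fake news detection). All paths below are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "path/to/your/pretrained-zabantu-checkpoint"  # hypothetical
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Expects CSV files with "text" and "label" columns (placeholder paths).
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="outputs/fake-news",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        evaluation_strategy="epoch",
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```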


Project Structure

.
├── LICENSE              <- File that specifies the license under which the project is distributed.
├── README.md            <- This file
├── data/
│   ├── raw/             <- The original, immutable data dump.
│   ├── interim/         <- Intermediate data that has been transformed.
│   └── processed/       <- Clean datasets that can be used out of the box without further processing.

├── demos/               <- Contains demonstration scripts or examples related to the project.
├── notebooks/           <- Contains Jupyter notebooks for running interactive analysis or experiments.

├── docs/                <- Contains detailed documentation files for loading data, provisioning servers, installing dependencies, etc.
├── infra/               <- Quick start Infrastructure as Code (IaC) scripts for deploying a GPU powered VMs in the cloud.

├── configs/             <- YAML configuration files for training different model variants.
├── scripts/             <- Contains utility bash scripts used to consolidate multiple python commands into a single bash command.

├── train_masked/        <- Directory related to training with masked data.
├── tokenize/            <- Directory related to tokenization tasks or scripts.

├── .dvc/                <- Hidden directory used by DVC for storing metadata and cache.
├── .git/                <- Hidden directory used by Git for version control.

├── .dvcignore           <- Specifies files and directories that should be ignored by Data Version Control (DVC).
├── dvc.lock             <- Lock file generated by DVC to ensure reproducibility of the pipeline.
├── params.yaml          <- DVC YAML parameters and hyperparameters used for training the models. Defining parameters here makes it easy to track how they affect the model's performance.
├── dvc.yaml             <- DVC pipeline configuration file that defines the stages and dependencies of the training pipeline

├── .gitignore           <- Specifies files and directories that should be ignored by Git version control.
├── pyproject.toml       <- Configuration file for Python projects that specifies dependencies and build settings.
├── poetry.lock          <- Lock file generated by Poetry package manager to ensure deterministic dependencies.

├── environment.yml      <- Conda environment file that specifies the project's dependencies.
├── Makefile             <- File that contains automation rules and commands for building and managing the project.
├── requirements.txt     <- File that lists the project's Python dependencies synced from the `pyproject.toml` file.

├── env.template         <- A template file for creating a new `.env` file for storing environment secrets
└── .env                 <- User-specific secrets based on the `env.template` file. DO NOT COMMIT TO GIT!!!

Train on your own data

  • Once you have successfully trained the model on the sample dataset, you can proceed to train the model on your own dataset by following these steps:
  • Download your dataset, which is expected to be a list of text files in a folder. Each text file should contain a single sentence per line.
  • The accepted naming convention for the files is somefile.whatever.<language-code>.txt, where <language-code> is the ISO 639-3 code for the language of the text file (see the validation sketch after the command example below).
  • You can optionally include your own custom configs under the configs folder or use the defaults provided.
  • Once you are ready, you can train the model on your own dataset by running a command similar to the one below:

```bash
# first, train your sentencepiece tokenizer
# remember to change any of the parameters to suit your specific use case
/bin/bash scripts/train_sp_tokenizer.sh --input-texts-path somefolder/mydocument.ven.txt \
    --sampled-texts-path data/temp/stagingdir/0 \
    --seed 47 \
    --alpha 0.5 \
    --tokenizer-model-type unigram \
    --vocab-size 70000 \
    --tokenizer-output-dir data/tokenizers/my-awesome-tokenizer-70k

# then, train your model
/bin/bash scripts/train_masked_xlm.sh --config configs/my-custom-or-existing.yml \
    --training_data demos/data \
    --experiment_name myfirst-xlmr-experiment \
    --tokenizer_path data/tokenizers/my-awesome-tokenizer-70k \
    --epochs 5
```
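To sanity-check a corpus against this naming convention before launching a run, a small helper like the following (illustrative only, not part of the repo) can group files by their ISO 639-3 code:

```python
# Group corpus files by the ISO 639-3 code embedded in the file name,
# e.g. news.2020.ven.txt -> "ven". Illustrative helper, not repo code.
from collections import defaultdict
from pathlib import Path

def files_by_language(corpus_dir: str) -> dict[str, list[Path]]:
    groups: defaultdict[str, list[Path]] = defaultdict(list)
    for path in Path(corpus_dir).glob("*.txt"):
        parts = path.name.split(".")
        if len(parts) >= 3 and len(parts[-2]) == 3:  # <name>.<lang-code>.txt
            groups[parts[-2]].append(path)
        else:
            print(f"Skipping {path.name}: no ISO 639-3 code in file name")
    return dict(groups)

for lang, files in files_by_language("somefolder").items():
    print(lang, len(files), "file(s)")
```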

Documentation

  • There are two ways to access the documentation in this repository:

    • Visit the project website
    • Run the following command in your terminal to build the documentation site locally:

```bash
make docs
```

  • If you encounter any issues or have any questions, please feel free to open an issue on the GitHub repository

Contributing

  • We welcome contributions to this project. Please refer to the Contributing Guide for more details on how to contribute.

Citation

```tex
@misc{nemakhavhani-2024-ZabantuBeta,
  title    = {Training Lite Cross-Lingual Language Models for South African Bantu Languages - Preview},
  author   = {Ndamulelo Nemakhavhani and Vukosi Marivate and Jocelyn Mazarura},
  year     = {2024},
  url      = {https://github.com/ndamulelonemakh/zabantu-beta},
  keywords = {NLP, BERT, Low-resource, XLM-R, Bantu}
}
```


Troubleshooting

  • `nvcc: command not found`

    • This indicates that the CUDA Toolkit is not installed or not in your PATH. You can install the CUDA Toolkit manually by following the instructions provided in the scripts/nvidia_setup.sh script.
  • CondaError: Run 'conda init' before 'conda activate'

    • This error occurs when you have not initialized conda in your shell. You can fix this by running the following command:

```bash
conda init bash
```

  • `scripts/server_setup.sh` did not complete successfully

    • Although this is rare, try running the script again to see if it completes successfully. If the error persists, please open an issue on the GitHub repository
    • Another possible solution is to run the commands in the script manually in your terminal.

  • Unable to push/pull to DVC Google Drive remote - file not found

    • This is usually just a permission error
    • Ensure that the service account you are using has the necessary permissions to access the Google Drive folder, i.e. Share the folder with the service account email address as you would with any other Google Drive user.
    • You can also run dvc pull or dvc push with the --verbose flag to get more details on the error.


Made somewhere in 🌍 by N Nemakhavhani❤️

Owner

  • Name: Data Science for Social Impact Research Group @ University of Pretoria
  • Login: dsfsi
  • Kind: organization
  • Email: vukosi.marivate@cs.up.ac.za
  • Location: University of Pretoria, South Africa

We are the Data Science for Social Impact research group at the Computer Science Department, University of Pretoria.

Citation (citation.tex)

@misc{nemakhavhani-2024-ZabantuBeta,
  title   = {Training Lite Cross-Lingual Language Models for South African Bantu Languages - Preview},
  author  = {Ndamulelo Nemakhavhani},
  year    = {2024},
  url     = {https://github.com/ndamulelonemakh/zabantu-beta},
  keywords = {NLP, BERT, Low-resource, XLM-R, Bantu}
}

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 13
  • Total Committers: 1
  • Avg Commits per committer: 13.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 13
  • Committers: 1
  • Avg Commits per committer: 13.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • ndamulelo (n****e@g****m): 13 commits

Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

poetry.lock pypi
  • 233 dependencies
pyproject.toml pypi
  • coverage ^7.4.1 develop
  • flake8 ^7.0.0 develop
  • jupyterlab ^4.0.12 develop
  • pytest ^8.0.0 develop
  • sphinx ^7.2.6 develop
  • click ^8.1.7
  • comet-ml ^3.39.2
  • datasets ^2.18.0
  • dvc ^3.49.0
  • dvc-gdrive ^3.0.1
  • evaluate ^0.4.1
  • huggingface-hub ^0.22.2
  • nltk ^3.8.1
  • numpy ^1.26.3
  • pandas ^2.2.0
  • protobuf ^5.26.1
  • python ^3.11 < 4.0
  • python-dotenv ^1.0.1
  • scikit-learn ^1.4.0
  • sentencepiece ^0.2.0
  • seqeval ^1.2.2
  • tokenizers ^0.15.2
  • transformers ^4.39.2
requirements.txt pypi
  • aiohttp ==3.9.3
  • aiohttp-retry ==2.8.3
  • aiosignal ==1.3.1
  • amqp ==5.2.0
  • annotated-types ==0.6.0
  • antlr4-python3-runtime ==4.9.3
  • appdirs ==1.4.4
  • asyncssh ==2.14.2
  • atpublic ==4.0
  • attrs ==23.2.0
  • billiard ==4.2.0
  • cachetools ==5.3.3
  • celery ==5.3.6
  • certifi ==2024.2.2
  • cffi ==1.16.0
  • charset-normalizer ==3.3.2
  • click ==8.1.7
  • click-didyoumean ==0.3.1
  • click-plugins ==1.1.1
  • click-repl ==0.3.0
  • colorama ==0.4.6
  • comet-ml ==3.39.2
  • configobj ==5.0.8
  • cryptography ==42.0.5
  • datasets ==2.18.0
  • dictdiffer ==0.9.0
  • dill ==0.3.8
  • diskcache ==5.6.3
  • distro ==1.9.0
  • dpath ==2.1.6
  • dulwich ==0.21.7
  • dvc ==3.49.0
  • dvc-data ==3.15.1
  • dvc-gdrive ==3.0.1
  • dvc-http ==2.32.0
  • dvc-objects ==5.1.0
  • dvc-render ==1.0.1
  • dvc-studio-client ==0.20.0
  • dvc-task ==0.4.0
  • entrypoints ==0.4
  • evaluate ==0.4.1
  • everett ==3.1.0
  • filelock ==3.13.3
  • flatten-dict ==0.4.2
  • flufl-lock ==7.1.1
  • frozenlist ==1.4.1
  • fsspec ==2024.2.0
  • funcy ==2.0
  • gitdb ==4.0.11
  • gitpython ==3.1.42
  • google-api-core ==2.8.0
  • google-api-python-client ==2.125.0
  • google-auth ==2.29.0
  • google-auth-httplib2 ==0.2.0
  • googleapis-common-protos ==1.56.1
  • grandalf ==0.8
  • gto ==1.7.1
  • httplib2 ==0.22.0
  • huggingface-hub ==0.22.2
  • hydra-core ==1.3.2
  • idna ==3.6
  • iterative-telemetry ==0.0.8
  • joblib ==1.3.2
  • jsonschema ==4.21.1
  • jsonschema-specifications ==2023.12.1
  • kombu ==5.3.6
  • markdown-it-py ==3.0.0
  • mdurl ==0.1.2
  • multidict ==6.0.5
  • multiprocess ==0.70.16
  • networkx ==3.2.1
  • nltk ==3.8.1
  • numpy ==1.26.3
  • oauth2client ==4.1.3
  • omegaconf ==2.3.0
  • orjson ==3.10.0
  • packaging ==23.2
  • pandas ==2.2.0
  • pathspec ==0.12.1
  • platformdirs ==3.11.0
  • prompt-toolkit ==3.0.43
  • protobuf ==5.26.1
  • psutil ==5.9.8
  • pyarrow ==15.0.2
  • pyarrow-hotfix ==0.6
  • pyasn1 ==0.6.0
  • pyasn1-modules ==0.4.0
  • pycparser ==2.21
  • pydantic ==2.6.4
  • pydantic-core ==2.16.3
  • pydot ==2.0.0
  • pydrive2 ==1.19.0
  • pygit2 ==1.14.1
  • pygments ==2.17.2
  • pygtrie ==2.5.0
  • pyopenssl ==24.1.0
  • pyparsing ==3.1.2
  • python-box ==6.1.0
  • python-dateutil ==2.8.2
  • python-dotenv ==1.0.1
  • pytz ==2024.1
  • pywin32 ==306
  • pyyaml ==6.0.1
  • referencing ==0.33.0
  • regex ==2023.12.25
  • requests ==2.31.0
  • requests-toolbelt ==1.0.0
  • responses ==0.18.0
  • rich ==13.7.1
  • rpds-py ==0.17.1
  • rsa ==4.9
  • ruamel-yaml ==0.18.6
  • ruamel-yaml-clib ==0.2.8
  • safetensors ==0.4.2
  • scikit-learn ==1.4.0
  • scipy ==1.12.0
  • scmrepo ==3.3.1
  • semantic-version ==2.10.0
  • semver ==3.0.2
  • sentencepiece ==0.2.0
  • sentry-sdk ==1.44.0
  • seqeval ==1.2.2
  • setuptools ==69.2.0
  • shortuuid ==1.0.13
  • shtab ==1.7.1
  • simplejson ==3.19.2
  • six ==1.16.0
  • smmap ==5.0.1
  • sqltrie ==0.11.0
  • tabulate ==0.9.0
  • threadpoolctl ==3.2.0
  • tokenizers ==0.15.2
  • tomlkit ==0.12.4
  • tqdm ==4.66.2
  • transformers ==4.39.2
  • typer ==0.11.1
  • typing-extensions ==4.9.0
  • tzdata ==2023.4
  • uritemplate ==4.1.1
  • urllib3 ==2.2.0
  • vine ==5.1.0
  • voluptuous ==0.14.2
  • wcwidth ==0.2.13
  • websocket-client ==1.3.3
  • wrapt ==1.16.0
  • wurlitzer ==3.0.3
  • xxhash ==3.4.1
  • yarl ==1.9.4
  • zc-lockfile ==3.0.post1
setup.py pypi
environment.yml pypi
  • click *
  • python-dotenv *