tamil-llama

A New Tamil Large Language Model (LLM) Based on Llama 2

https://github.com/abhinand5/tamil-llama

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
✓
Academic publication links
Links to: arxiv.org
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (12.0%) to scientific vocabulary

Last synced: 10 months ago · JSON representation ·

Repository

A New Tamil Large Language Model (LLM) Based on Llama 2

Basic Info

Host: GitHub
Owner: abhinand5
License: gpl-3.0
Language: Python
Default Branch: main
Size: 2.04 MB

Statistics

Stars: 302
Watchers: 14
Forks: 44
Open Issues: 6
Releases: 0

Created over 2 years ago · Last pushed about 2 years ago

Metadata Files

Readme License Citation

Tamil-Llama: A Family of LLaMA-based LLMs focused on Tamil Language

Tamil LLaMA Image

Description

This repository contains the code and models for "Tamil-Llama", a project focused on enhancing the performance of language models for the Tamil language. It builds upon the open-source LLaMA model, introducing additional Tamil tokens and employing the LoRA methodology for efficient training. Please read the technical report for more details.

Technical Report: https://arxiv.org/abs/2311.05845

If you appreciate this work and would like to support its continued development, consider buying me a coffee. Your support is invaluable and greatly appreciated.

Updates

Feb 25, 2024

Google's Gemma 2B Model was adapter for Tamil (Experimental Release) based on the same framework with a few changes. More info in this LinkedIn post.

Note: I have migrated to Llama-Factory for pretraining and Axolotl for finetuning.

No expansion in vocab for Gemma as it already has 256k vocab size and minnescule amounts of Tamil tokens.
Continually pretrain on all available Tamil Wikipedia data for 3 epochs.
Finetune on Tamil Alpaca + English Alpaca mix for 5 epochs
Model tops Open LLM Leaderboard for models under 3B params as of Feb 2023.

Download Links:

Jan 23, 2024

For more details, please read the detailed blog post here.

Tamil LLaMA v0.2 models are out. It is a significant upgrade compared to the earlier version.
- Tamil LLaMA is now bilingual, it can fluently respond in both English and Tamil.
- Better tokenizer.
- Better base model.
- Better fine tuning dataset and performance.
- Our models match or betters the performance of Meta's LLaMA 2 is almost all the benchmarks.
Following the same methodology the first ever Telugu and Malayam LLaMA models are also released.

Available Models
Benchmark Scores
Demo
Getting Started
Datasets
Prompting Format
Usage Note
Contributions
License
Citation
Contact

Available Models

| Model | Type | Data | Base Model | # Params | Download Links | |--------------------------|-----------------------------|-------------------|----------------------|------|------------------------------------------------------------------------| | Tamil LLaMA 7B Base | Base model | 12GB | LLaMA 7B | 7B | HF Hub | | Tamil LLaMA 13B Base | Base model | 4GB | LLaMA 13B | 13B | HF Hub | | Tamil LLaMA 7B Instruct | Instruction following model | 145k instructions | Tamil LLaMA 7B Base | 7B | HF Hub | | Tamil LLaMA 13B Instruct | Instruction following model | 145k instructions | Tamil LLaMA 13B Base | 13B | HF Hub |

Quantized Version of Available Models

| Model | Format | Bits | Download Links | |--------------------------|--------|----------------------|------------------------------------------------------------------------------| | Tamil LLaMA 7B Base | GGUF | Q4KM, Q5KM, Q80 | HF Hub | | Tamil LLaMA 13B Base | GGUF | Q4KM, Q5KM, Q80 | HF Hub | | Tamil LLaMA 7B Instruct | GGUF | Q4KM, Q5KM, Q80 | HF Hub | | Tamil LLaMA 13B Instruct | GGUF | Q4KM, Q5KM, Q80 | HF Hub |

Benchmark Scores

Scores are calculated using the HuggingFace Open LLM Leaderboard.

Note: The benchmarks test the model's capabilities in English reasoning, although the Tamil LLaMA models were not trained on quality reasoning tasks in English it shows decent performance across most benchmarks.

| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K | |--------------------------|---------|-------|-----------|-------|------------|------------|-------| | Tamil LLaMA 13B Instruct | 51.59 | 54.52 | 79.35 | 50.37 | 41.22 | 76.56 | 7.51 | | Tamil LLaMA 13B Base | 49.5 | 52.82 | 79.95 | 52.05 | 36.56 | 75.61 | 0 | | Tamil LLaMA 7B Instruct | 45.52 | 48.04 | 70.97 | 39.95 | 41.7 | 70.64 | 1.82 | | Tamil LLaMA 7B Base | 44.52 | 46.67 | 72.85 | 40.95 | 35.93 | 70.72 | 0 |

Demo

Update: There is now a Google Colab demo for Tamil/Telugu/Malayalam LLaMAs part of this project. Click Here to open the Colab Notebook.

A simple interactive demo of Tamil-LLaMA-7B-Instruct-v0.1 is hosted in the HuggingFace Space here -> abhinand/tamil-llama-playground

Tamil LLaMA Image

Getting Started

Using LMStudio:

LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. Linux available, in beta as of 27/11/2023.

Download and Install LM Studio: Begin by downloading LM Studio from the official website.
Locate the Tamil Llama Model: After installation, open LM Studio and use the search bar to find the "Tamil Llama" model. Alternatively, if you have the GGUF model ID, paste it directly into the search bar.
Download the Appropriate Model Variant: Depending on your system's specifications, select the appropriate variant of the Tamil Llama model. Click on the 'Download' button to start the download process.
Import the Preset JSON File: Once the model is downloaded, navigate to the 'Chat' tab in LM Studio. In the settings, find the 'Preset' menu and click on the dropdown. Select "Import Preset From File" and import the preset JSON file located at config/lmstudio/modelconfig.json in the repository.
Select and Load the Model: Click on "Select a model to load" located on the top bar. From the list, choose the Tamil Llama variant that you previously downloaded.
Initiate Conversations with the Model: The Tamil Llama model is now ready to use. You can start engaging in conversations in the chat area of LM Studio.

Using with Ollama:

Verify Ollama Installation: First, ensure that Ollama is correctly installed on your system. If not, install it from the official source.
Download the Modelfile: Access the GitHub repository and download the Modelfile. This file is necessary for setting up the Tamil Llama model in Ollama.
Prepare the Working Directory: Place the downloaded Modelfile and the model's GGUF file in the same directory. To work in this directory, use the cd command in your terminal to change to the appropriate directory.
Download the Tamil Llama Model: Execute the following command in your terminal to download the desired Tamil Llama model from the GitHub repository:

bash curl -L https://huggingface.co/abhinand/tamil-llama-7b-instruct-v0.1-gguf/resolve/main/tamil-llama-7b-v0.1-q8_0.gguf -o tamil-llama.gguf

This command downloads the Tamil Llama model GGUF file and saves it as tamil-llama.gguf in your current directory.

Import and Run the Model in Ollama: After downloading the model, use the following command to create and run the Tamil Llama model in Ollama:

bash ollama create tamil-llama -f Modelfile

This command imports the Tamil Llama model into Ollama and prepares it for use.

Optionally, depending upon your system's capabilities make sure to configure these parameters in the Modelfile too:

PARAMETER num_thread 8 PARAMETER num_gpu 0

For more information regarding the Modelfile's available parameters check out the official docs.

Datasets

The repository includes a Tamil-translated version of the Alpaca dataset and a subset of the OpenOrca dataset, which are used for instruction fine-tuning and evaluation.

Tamil Alpaca: abhinand/tamil-alpaca

Tamil Alpaca Orca: abhinand/tamil-alpaca-orca

Tamil LLaMA Eval: abhinand/tamil-llama-eval

Prompting Format for Instruction Models

Prompt Template Without Input

``` {system_prompt}

Instruction:

{instruction or query}

Response:

{response} ```

Prompt Template With Input

``` {system_prompt}

Instruction:

{instruction or query}

Input:

{input}

Response:

{response} ```

Usage Note

It's important to note that the models have not undergone detoxification. Therefore, while they possess impressive linguistic capabilities, there is a possibility for them to generate content that could be deemed harmful or offensive. We urge users to exercise discretion and supervise the model's outputs closely, especially in public or sensitive applications.

Contributions

We welcome contributions to this project. If you have suggestions or improvements, please open an issue or a pull request.

License

This project is licensed under the GNU GPL v3.0 license - see the LICENSE.md file for details.

IMPORTANT: The GPL 3.0 License is applicable solely to the source code and datasets provided. As this project is a derivative of Meta's LLaMA 2 model, it is subject to the original licensing of LLaMA 2, which cannot be altered. Therefore, for comprehensive details regarding the licensing of the model, please consult the LLAMA2-LICENSE file.

Citation

If you use this model or the Tamil-Llama dataset in your research, please cite:

bibtex @misc{balachandran2023tamilllama, title={Tamil-Llama: A New Tamil Language Model Based on Llama 2}, author={Abhinand Balachandran}, year={2023}, eprint={2311.05845}, archivePrefix={arXiv}, primaryClass={cs.CL} }

Contact

For any queries regarding the codebase or research, please reach out to Abhinand Balachandran at abhinandb.ml@gmail.com.

Owner

Name: Abhinand
Login: abhinand5
Kind: user
Location: Chennai, India

Website: https://blog.abhinandb.com
Twitter: abhinand58
Repositories: 71
Profile: https://github.com/abhinand5

ML Engineer | Kaggle Master | Programmer

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: 'Tamil-Llama: A New Tamil Language Model Based on Llama 2'
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Abhinand
    family-names: Balachandran
    email: abhinandb.ml@gmail.com
    orcid: 'https://orcid.org/0009-0004-9692-8432'
identifiers:
  - type: url
    value: 'https://arxiv.org/abs/2311.05845'
    description: arXiv
repository-code: 'https://github.com/abhinand5/tamil-llama/tree/main'
abstract: >-
  Language modeling has witnessed remarkable advancements in
  recent years, with Large Language Models (LLMs) like
  ChatGPT setting unparalleled benchmarks in human-like text
  generation. However, a prevailing limitation is the
  underrepresentation of languages like Tamil in these
  cutting-edge models, leading to suboptimal performance in
  diverse linguistic contexts. This paper addresses this
  lacuna, enhancing the open-source LLaMA model with an
  addition of 16,000 Tamil tokens, aiming to achieve
  superior text generation and comprehension in the Tamil
  language. We strategically employ the LoRA methodology for
  efficient model training on a comprehensive Tamil corpus,
  ensuring computational feasibility and model robustness.
  Moreover, we introduce a Tamil-translated version of the
  Alpaca dataset and a subset of the OpenOrca dataset
  tailored for instruction fine-tuning. Our results showcase
  significant performance improvements in Tamil text
  generation, with potential implications for the broader
  landscape of LLMs in Indian languages. We further
  underscore our commitment to open research by making our
  models, datasets, and code publicly accessible, fostering
  further innovations in language modeling.
keywords:
  - large language models
  - natural language processing
  - machine learning
  - deep learning
  - llama 2
  - tamil language model
license: GPL-3.0
date-released: '2023-11-12'

GitHub Events

Total

Watch event: 48
Fork event: 10

Last Year

Watch event: 48
Fork event: 10

Issues and Pull Requests

Last synced: about 2 years ago

All Time

Total issues: 10
Total pull requests: 1
Average time to close issues: 28 days
Average time to close pull requests: N/A
Total issue authors: 9
Total pull request authors: 1
Average comments per issue: 2.7
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 10
Pull requests: 1
Average time to close issues: 28 days
Average time to close pull requests: N/A
Issue authors: 9
Pull request authors: 1
Average comments per issue: 2.7
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

Cyberblackstone (2)
winstondcosta (1)
alishafique3 (1)
VishnuPJ (1)
wenhui-huang (1)
kdcyberdude (1)
sazzad1779 (1)
almugabo (1)
SmartManoj (1)
bharathdpv (1)

Pull Request Authors

SmartManoj (2)
ke-lara (1)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

requirements.txt pypi

absl-py =2.0.0=pypi_0
accelerate =0.21.0=pypi_0
aiofiles =23.2.1=pypi_0
aiohttp =3.8.6=pypi_0
aiosignal =1.3.1=pypi_0
albumentations =1.3.1=pypi_0
alembic =1.12.0=pypi_0
altair =5.1.2=pypi_0
anyio =3.7.1=pypi_0
arrow =1.3.0=pypi_0
asttokens =2.4.1=pyhd8ed1ab_0
async-timeout =4.0.3=pypi_0
attrs =23.1.0=pypi_0
autotrain-advanced =0.6.37=pypi_0
backcall =0.2.0=pyh9f0ad1d_0
backports =1.0=pyhd8ed1ab_3
backports.functools_lru_cache =1.6.5=pyhd8ed1ab_0
bitsandbytes =0.40.2=pypi_0
bzip2 =1.0.8=h7b6447c_0
ca-certificates =2023.7.22=hbcca054_0
cachetools =5.3.1=pypi_0
certifi =2023.7.22=pypi_0
charset-normalizer =3.3.0=pypi_0
click =8.1.7=pypi_0
cmaes =0.10.0=pypi_0
codecarbon =2.2.3=pypi_0
colorlog =6.7.0=pypi_0
comm =0.1.4=pyhd8ed1ab_0
contourpy =1.1.1=pypi_0
cycler =0.12.1=pypi_0
datasets =2.14.5=pypi_0
debugpy =1.6.7=py310h6a678d5_0
decorator =5.1.1=pyhd8ed1ab_0
diffusers =0.21.4=pypi_0
dill =0.3.7=pypi_0
docstring-parser =0.15=pypi_0
einops =0.6.1=pypi_0
entrypoints =0.4=pyhd8ed1ab_0
evaluate =0.3.0=pypi_0
exceptiongroup =1.1.3=pyhd8ed1ab_0
executing =2.0.1=pyhd8ed1ab_0
fastapi =0.104.0=pypi_0
ffmpy =0.3.1=pypi_0
filelock =3.12.4=pypi_0
flash-attn =2.3.3=pypi_0
fonttools =4.43.1=pypi_0
frozenlist =1.4.0=pypi_0
fsspec =2023.6.0=pypi_0
fuzzywuzzy =0.18.0=pypi_0
google-auth =2.23.3=pypi_0
google-auth-oauthlib =1.1.0=pypi_0
gradio =3.41.0=pypi_0
gradio-client =0.5.0=pypi_0
greenlet =3.0.0=pypi_0
grpcio =1.59.0=pypi_0
h11 =0.14.0=pypi_0
httpcore =0.18.0=pypi_0
httpx =0.25.0=pypi_0
huggingface-hub =0.17.3=pypi_0
idna =3.4=pypi_0
imageio =2.31.5=pypi_0
importlib-metadata =6.8.0=pypi_0
importlib-resources =6.1.0=pypi_0
inquirerpy =0.3.4=pypi_0
invisible-watermark =0.2.0=pypi_0
ipadic =1.0.0=pypi_0
ipykernel =6.26.0=pyhf8b6a83_0
ipython =8.16.1=pyh0d859eb_0
jedi =0.19.1=pyhd8ed1ab_0
jinja2 =3.1.2=pypi_0
jiwer =3.0.2=pypi_0
joblib =1.3.1=pypi_0
jsonschema =4.19.1=pypi_0
jsonschema-specifications =2023.7.1=pypi_0
jupyter_client =7.3.4=pyhd8ed1ab_0
jupyter_core =5.5.0=py310hff52083_0
kiwisolver =1.4.5=pypi_0
lazy-loader =0.3=pypi_0
ld_impl_linux-64 =2.38=h1181459_1
libffi =3.4.4=h6a678d5_0
libgcc-ng =11.2.0=h1234567_1
libgomp =11.2.0=h1234567_1
libsodium =1.0.18=h36c2ea0_1
libstdcxx-ng =11.2.0=h1234567_1
libuuid =1.41.5=h5eee18b_0
loguru =0.7.0=pypi_0
mako =1.2.4=pypi_0
markdown =3.5=pypi_0
markdown-it-py =3.0.0=pypi_0
markupsafe =2.1.3=pypi_0
matplotlib =3.8.0=pypi_0
matplotlib-inline =0.1.6=pyhd8ed1ab_0
mdurl =0.1.2=pypi_0
mpmath =1.3.0=pypi_0
multidict =6.0.4=pypi_0
multiprocess =0.70.15=pypi_0
ncurses =6.4=h6a678d5_0
nest-asyncio =1.5.8=pyhd8ed1ab_0
networkx =3.2=pypi_0
ninja =1.11.1.1=pypi_0
numpy =1.26.1=pypi_0
nvidia-cublas-cu12 =12.1.3.1=pypi_0
nvidia-cuda-cupti-cu12 =12.1.105=pypi_0
nvidia-cuda-nvrtc-cu12 =12.1.105=pypi_0
nvidia-cuda-runtime-cu12 =12.1.105=pypi_0
nvidia-cudnn-cu12 =8.9.2.26=pypi_0
nvidia-cufft-cu12 =11.0.2.54=pypi_0
nvidia-curand-cu12 =10.3.2.106=pypi_0
nvidia-cusolver-cu12 =11.4.5.107=pypi_0
nvidia-cusparse-cu12 =12.1.0.106=pypi_0
nvidia-nccl-cu12 =2.18.1=pypi_0
nvidia-nvjitlink-cu12 =12.3.52=pypi_0
nvidia-nvtx-cu12 =12.1.105=pypi_0
oauthlib =3.2.2=pypi_0
opencv-python =4.8.1.78=pypi_0
opencv-python-headless =4.8.1.78=pypi_0
openssl =3.0.11=h7f8727e_2
optuna =3.3.0=pypi_0
orjson =3.9.9=pypi_0
packaging =23.1=pypi_0
pandas =2.1.1=pypi_0
parso =0.8.3=pyhd8ed1ab_0
peft =0.4.0=pypi_0
pexpect =4.8.0=pyh1a96a4e_2
pfzy =0.3.4=pypi_0
pickleshare =0.7.5=py_1003
pillow =10.0.0=pypi_0
pip =23.3=py310h06a4308_0
platformdirs =3.11.0=pyhd8ed1ab_0
prompt-toolkit =3.0.39=pyha770c72_0
prompt_toolkit =3.0.39=hd8ed1ab_0
protobuf =4.23.4=pypi_0
psutil =5.9.6=pypi_0
ptyprocess =0.7.0=pyhd3deb0d_0
pure_eval =0.2.2=pyhd8ed1ab_0
py-cpuinfo =9.0.0=pypi_0
pyarrow =13.0.0=pypi_0
pyasn1 =0.5.0=pypi_0
pyasn1-modules =0.3.0=pypi_0
pydantic =1.10.11=pypi_0
pydub =0.25.1=pypi_0
pygments =2.16.1=pyhd8ed1ab_0
pynvml =11.5.0=pypi_0
pyparsing =3.1.1=pypi_0
python =3.10.13=h955ad1f_0
python-dateutil =2.8.2=pyhd8ed1ab_0
python-multipart =0.0.6=pypi_0
python_abi =3.10=2_cp310
pytz =2023.3.post1=pypi_0
pywavelets =1.4.1=pypi_0
pyyaml =6.0.1=pypi_0
pyzmq =25.1.0=py310h6a678d5_0
qudida =0.0.4=pypi_0
rapidfuzz =2.13.7=pypi_0
readline =8.2=h5eee18b_0
referencing =0.30.2=pypi_0
regex =2023.10.3=pypi_0
requests =2.31.0=pypi_0
requests-oauthlib =1.3.1=pypi_0
responses =0.18.0=pypi_0
rich =13.6.0=pypi_0
rpds-py =0.10.6=pypi_0
rsa =4.9=pypi_0
sacremoses =0.0.53=pypi_0
safetensors =0.4.0=pypi_0
scikit-image =0.22.0=pypi_0
scikit-learn =1.3.0=pypi_0
scipy =1.11.3=pypi_0
semantic-version =2.10.0=pypi_0
sentencepiece =0.1.99=pypi_0
setuptools =68.0.0=py310h06a4308_0
shtab =1.6.4=pypi_0
six =1.16.0=pyh6c4a22f_0
sniffio =1.3.0=pypi_0
sqlalchemy =2.0.22=pypi_0
sqlite =3.41.2=h5eee18b_0
stack-data =0.6.3=pypi_0
stack_data =0.6.2=pyhd8ed1ab_0
starlette =0.27.0=pypi_0
sympy =1.12=pypi_0
tensorboard =2.15.0=pypi_0
tensorboard-data-server =0.7.1=pypi_0
threadpoolctl =3.2.0=pypi_0
tifffile =2023.9.26=pypi_0
tiktoken =0.5.1=pypi_0
tk =8.6.12=h1ccaba5_0
tokenizers =0.13.3=pypi_0
toolz =0.12.0=pypi_0
torch =2.1.0=pypi_0
tornado =6.1=py310h5764c6d_3
tqdm =4.65.0=pypi_0
traitlets =5.12.0=pypi_0
transformers =4.31.0=pypi_0
triton =2.1.0=pypi_0
trl =0.4.7=pypi_0
types-python-dateutil =2.8.19.14=pypi_0
typing-extensions =4.8.0=hd8ed1ab_0
typing_extensions =4.8.0=pyha770c72_0
tyro =0.5.10=pypi_0
tzdata =2023.3=pypi_0
urllib3 =2.0.7=pypi_0
uvicorn =0.23.2=pypi_0
wcwidth =0.2.8=pypi_0
websockets =11.0.3=pypi_0
werkzeug =2.3.6=pypi_0
wheel =0.41.2=py310h06a4308_0
xformers =0.0.22.post4=pypi_0
xgboost =1.7.6=pypi_0
xxhash =3.4.1=pypi_0
xz =5.4.2=h5eee18b_0
yarl =1.9.2=pypi_0
zeromq =4.3.4=h2531618_0
zipp =3.17.0=pypi_0
zlib =1.2.13=h5eee18b_0

tamil-llama

Science Score: 54.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

Tamil-Llama: A Family of LLaMA-based LLMs focused on Tamil Language

Description

Updates

Feb 25, 2024

Jan 23, 2024

Table of Contents

Available Models

Quantized Version of Available Models

Benchmark Scores

Demo

Getting Started

Using LMStudio:

Using with Ollama:

Datasets

Prompting Format for Instruction Models

Instruction:

Response:

Instruction:

Input:

Response:

Usage Note

Contributions

License

Citation

Contact

Owner

Citation (CITATION.cff)

GitHub Events

Total

Last Year

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Dependencies