tamil-llama
A New Tamil Large Language Model (LLM) Based on Llama 2
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.0%) to scientific vocabulary
Repository
A New Tamil Large Language Model (LLM) Based on Llama 2
Basic Info
- Host: GitHub
- Owner: abhinand5
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 2.04 MB
Statistics
- Stars: 302
- Watchers: 14
- Forks: 44
- Open Issues: 6
- Releases: 0
Metadata Files
README.md
Tamil-Llama: A Family of LLaMA-based LLMs focused on Tamil Language

Description
This repository contains the code and models for "Tamil-Llama", a project focused on enhancing the performance of language models for the Tamil language. It builds upon the open-source LLaMA model, introducing additional Tamil tokens and employing the LoRA methodology for efficient training. Please read the technical report for more details.
Technical Report: https://arxiv.org/abs/2311.05845
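As a rough illustration of what "additional Tamil tokens" and "LoRA" mean in parameter terms, the sketch below does some back-of-the-envelope arithmetic. The 16,000 added tokens come from the project's abstract; the base vocabulary size (32,000) and hidden size (4,096) are the published LLaMA 2 7B values; the LoRA rank of 64 is purely a hypothetical value chosen for illustration and is not taken from this project. Please consult the technical report for the actual configuration.

```python
# Back-of-the-envelope parameter arithmetic (illustrative values only).
BASE_VOCAB = 32_000      # LLaMA 2 tokenizer vocabulary size
ADDED_TOKENS = 16_000    # Tamil tokens added, per the project's abstract
HIDDEN_SIZE = 4_096      # LLaMA 2 7B hidden dimension
LORA_RANK = 64           # ASSUMPTION: hypothetical rank, not the project's setting


def new_embedding_params(added_tokens: int, hidden_size: int) -> int:
    """Parameters added to the input embedding and output projection matrices."""
    return 2 * added_tokens * hidden_size


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA trains two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


extended_vocab = BASE_VOCAB + ADDED_TOKENS
full_matrix = HIDDEN_SIZE * HIDDEN_SIZE
lora_matrix = lora_trainable_params(HIDDEN_SIZE, HIDDEN_SIZE, LORA_RANK)

print(f"extended vocabulary: {extended_vocab:,} tokens")
print(f"new embedding/output parameters: {new_embedding_params(ADDED_TOKENS, HIDDEN_SIZE):,}")
print(f"one {HIDDEN_SIZE}x{HIDDEN_SIZE} weight matrix: {full_matrix:,} parameters")
print(f"its LoRA adapter at rank {LORA_RANK}: {lora_matrix:,} parameters")
```

The point of the comparison is that a low-rank adapter trains only a small fraction of the parameters of the matrix it modifies, which is what makes this kind of training affordable.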
If you appreciate this work and would like to support its continued development, consider buying me a coffee. Your support is invaluable and greatly appreciated.
Updates
Feb 25, 2024
Google's Gemma 2B model was adapted for Tamil (experimental release) using the same framework with a few changes. More info in this LinkedIn post.
Note: I have migrated to Llama-Factory for pretraining and Axolotl for finetuning.
- No vocabulary expansion for Gemma, as it already has a 256k vocabulary that includes a minuscule number of Tamil tokens.
- Continually pretrained on all available Tamil Wikipedia data for 3 epochs.
- Finetuned on a Tamil Alpaca + English Alpaca mix for 5 epochs.
- The model tops the Open LLM Leaderboard for models under 3B params as of Feb 2024.
Download Links:
Jan 23, 2024
For more details, please read the detailed blog post here.
- Tamil LLaMA v0.2 models are out. They are a significant upgrade over the earlier version.
- Tamil LLaMA is now bilingual: it can respond fluently in both English and Tamil.
- Better tokenizer.
- Better base model.
- Better fine tuning dataset and performance.
- Our models match or better the performance of Meta's LLaMA 2 in almost all the benchmarks.
- Following the same methodology, the first-ever Telugu and Malayalam LLaMA models have also been released.
Table of Contents
- Available Models
- Benchmark Scores
- Demo
- Getting Started
- Datasets
- Prompting Format
- Usage Note
- Contributions
- License
- Citation
- Contact
Available Models
| Model | Type | Data | Base Model | # Params | Download Links |
|--------------------------|-----------------------------|-------------------|----------------------|----------|----------------|
| Tamil LLaMA 7B Base | Base model | 12GB | LLaMA 7B | 7B | HF Hub |
| Tamil LLaMA 13B Base | Base model | 4GB | LLaMA 13B | 13B | HF Hub |
| Tamil LLaMA 7B Instruct | Instruction following model | 145k instructions | Tamil LLaMA 7B Base | 7B | HF Hub |
| Tamil LLaMA 13B Instruct | Instruction following model | 145k instructions | Tamil LLaMA 13B Base | 13B | HF Hub |
Quantized Versions of Available Models
| Model | Format | Bits | Download Links |
|--------------------------|--------|----------------------|----------------|
| Tamil LLaMA 7B Base | GGUF | Q4_K_M, Q5_K_M, Q8_0 | HF Hub |
| Tamil LLaMA 13B Base | GGUF | Q4_K_M, Q5_K_M, Q8_0 | HF Hub |
| Tamil LLaMA 7B Instruct | GGUF | Q4_K_M, Q5_K_M, Q8_0 | HF Hub |
| Tamil LLaMA 13B Instruct | GGUF | Q4_K_M, Q5_K_M, Q8_0 | HF Hub |
Benchmark Scores
Scores are calculated using the HuggingFace Open LLM Leaderboard.
Note: These benchmarks test the models' reasoning capabilities in English. Although the Tamil LLaMA models were not trained on quality English reasoning tasks, they show decent performance across most benchmarks.
| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|--------------------------|---------|-------|-----------|-------|------------|------------|-------|
| Tamil LLaMA 13B Instruct | 51.59 | 54.52 | 79.35 | 50.37 | 41.22 | 76.56 | 7.51 |
| Tamil LLaMA 13B Base | 49.5 | 52.82 | 79.95 | 52.05 | 36.56 | 75.61 | 0 |
| Tamil LLaMA 7B Instruct | 45.52 | 48.04 | 70.97 | 39.95 | 41.7 | 70.64 | 1.82 |
| Tamil LLaMA 7B Base | 44.52 | 46.67 | 72.85 | 40.95 | 35.93 | 70.72 | 0 |
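
The Average column appears to be the plain arithmetic mean of the six benchmark scores. That reading is inferred from the numbers in the table rather than stated anywhere in this README, so treat the short Python check below as a sanity test, not as a description of how the leaderboard computes its score.

```python
from statistics import mean

# Scores copied from the benchmark table, in the order:
# ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K
SCORES = {
    "Tamil LLaMA 13B Instruct": [54.52, 79.35, 50.37, 41.22, 76.56, 7.51],
    "Tamil LLaMA 13B Base": [52.82, 79.95, 52.05, 36.56, 75.61, 0.0],
    "Tamil LLaMA 7B Instruct": [48.04, 70.97, 39.95, 41.7, 70.64, 1.82],
    "Tamil LLaMA 7B Base": [46.67, 72.85, 40.95, 35.93, 70.72, 0.0],
}

# "Average" column as reported in the same table
REPORTED_AVERAGE = {
    "Tamil LLaMA 13B Instruct": 51.59,
    "Tamil LLaMA 13B Base": 49.5,
    "Tamil LLaMA 7B Instruct": 45.52,
    "Tamil LLaMA 7B Base": 44.52,
}


def average(scores: list[float]) -> float:
    """Arithmetic mean rounded to two decimals, like the table."""
    return round(mean(scores), 2)


for model, scores in SCORES.items():
    print(f"{model}: computed {average(scores):.2f}, reported {REPORTED_AVERAGE[model]:.2f}")
```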
Demo
Update: There is now a Google Colab demo for the Tamil/Telugu/Malayalam LLaMA models that are part of this project. Click Here to open the Colab Notebook.
A simple interactive demo of Tamil-LLaMA-7B-Instruct-v0.1 is hosted as a HuggingFace Space: abhinand/tamil-llama-playground

Getting Started
Using LMStudio:
LM Studio is an easy-to-use and powerful local GUI for Windows and macOS (Apple Silicon) with GPU acceleration. A Linux version is available, in beta as of 27/11/2023.
Download and Install LM Studio: Begin by downloading LM Studio from the official website.
Locate the Tamil Llama Model: After installation, open LM Studio and use the search bar to find the "Tamil Llama" model. Alternatively, if you have the GGUF model ID, paste it directly into the search bar.
Download the Appropriate Model Variant: Depending on your system's specifications, select the appropriate variant of the Tamil Llama model. Click on the 'Download' button to start the download process.
Import the Preset JSON File: Once the model is downloaded, navigate to the 'Chat' tab in LM Studio. In the settings, find the 'Preset' menu and click on the dropdown. Select "Import Preset From File" and import the preset JSON file located at config/lmstudio/modelconfig.json in the repository.
Select and Load the Model: Click on "Select a model to load" located on the top bar. From the list, choose the Tamil Llama variant that you previously downloaded.
Initiate Conversations with the Model: The Tamil Llama model is now ready to use. You can start engaging in conversations in the chat area of LM Studio.
Using with Ollama:
Verify Ollama Installation: First, ensure that Ollama is correctly installed on your system. If not, install it from the official source.
Download the Modelfile: Access the GitHub repository and download the Modelfile. This file is necessary for setting up the Tamil Llama model in Ollama.
Prepare the Working Directory: Place the downloaded `Modelfile` and the model's GGUF file in the same directory, then use the `cd` command in your terminal to change to that directory.
Download the Tamil Llama Model: Execute the following command in your terminal to download the desired Tamil Llama model from the Hugging Face Hub:
```bash
curl -L https://huggingface.co/abhinand/tamil-llama-7b-instruct-v0.1-gguf/resolve/main/tamil-llama-7b-v0.1-q8_0.gguf -o tamil-llama.gguf
```
This command downloads the Tamil Llama model GGUF file and saves it as tamil-llama.gguf in your current directory.
Import and Run the Model in Ollama: After downloading the model, use the following command to create the Tamil Llama model in Ollama:
```bash
ollama create tamil-llama -f Modelfile
```
This command imports the Tamil Llama model into Ollama and prepares it for use. Once it completes, `ollama run tamil-llama` starts an interactive session with the model.
Optionally, depending on your system's capabilities, configure these parameters in the Modelfile as well:
```
PARAMETER num_thread 8
PARAMETER num_gpu 0
```
For more information about the Modelfile's available parameters, check out the official docs.
Datasets
The repository includes a Tamil-translated version of the Alpaca dataset and a subset of the OpenOrca dataset, which are used for instruction fine-tuning and evaluation.
Tamil Alpaca: abhinand/tamil-alpaca
Tamil Alpaca Orca: abhinand/tamil-alpaca-orca
Tamil LLaMA Eval: abhinand/tamil-llama-eval
Prompting Format for Instruction Models
Prompt Template Without Input
```
{system_prompt}
Instruction:
{instruction or query}
Response:
{response}
```
Prompt Template With Input
```
{system_prompt}
Instruction:
{instruction or query}
Input:
{input}
Response:
{response}
```
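
A small helper can assemble either template from its parts. The sketch below is one possible way to do it in Python; `build_prompt` is a hypothetical name, and the section labels are taken exactly as they appear in this README, so check the model card for the precise header markers and spacing the models were trained with before relying on it.

```python
from typing import Optional

# Section labels as they appear in the templates in this README.
# ASSUMPTION: the exact markers and spacing should be verified against the model card.
INSTRUCTION_LABEL = "Instruction:"
INPUT_LABEL = "Input:"
RESPONSE_LABEL = "Response:"


def build_prompt(system_prompt: str, instruction: str, input_text: Optional[str] = None) -> str:
    """Fill the template, leaving the response section empty for the model to complete."""
    parts = [system_prompt, INSTRUCTION_LABEL, instruction]
    if input_text:  # the "With Input" variant adds one extra section
        parts += [INPUT_LABEL, input_text]
    parts.append(RESPONSE_LABEL)
    return "\n".join(parts) + "\n"


print(build_prompt("You are a helpful assistant.", "Translate the input to Tamil.", "Good morning"))
```

The prompt ends right after the response label so that the model's generation becomes the `{response}` part of the template.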
Usage Note
It's important to note that the models have not undergone detoxification. Therefore, while they possess impressive linguistic capabilities, there is a possibility for them to generate content that could be deemed harmful or offensive. We urge users to exercise discretion and supervise the model's outputs closely, especially in public or sensitive applications.
Contributions
We welcome contributions to this project. If you have suggestions or improvements, please open an issue or a pull request.
License
This project is licensed under the GNU GPL v3.0 license - see the LICENSE.md file for details.
IMPORTANT: The GPL 3.0 License is applicable solely to the source code and datasets provided. As this project is a derivative of Meta's LLaMA 2 model, it is subject to the original licensing of LLaMA 2, which cannot be altered. Therefore, for comprehensive details regarding the licensing of the model, please consult the LLAMA2-LICENSE file.
Citation
If you use this model or the Tamil-Llama dataset in your research, please cite:
```bibtex
@misc{balachandran2023tamilllama,
      title={Tamil-Llama: A New Tamil Language Model Based on Llama 2},
      author={Abhinand Balachandran},
      year={2023},
      eprint={2311.05845},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Contact
For any queries regarding the codebase or research, please reach out to Abhinand Balachandran at abhinandb.ml@gmail.com.
Owner
- Name: Abhinand
- Login: abhinand5
- Kind: user
- Location: Chennai, India
- Website: https://blog.abhinandb.com
- Twitter: abhinand58
- Repositories: 71
- Profile: https://github.com/abhinand5
ML Engineer | Kaggle Master | Programmer
Citation (CITATION.cff)
```yaml
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: 'Tamil-Llama: A New Tamil Language Model Based on Llama 2'
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Abhinand
    family-names: Balachandran
    email: abhinandb.ml@gmail.com
    orcid: 'https://orcid.org/0009-0004-9692-8432'
identifiers:
  - type: url
    value: 'https://arxiv.org/abs/2311.05845'
    description: arXiv
repository-code: 'https://github.com/abhinand5/tamil-llama/tree/main'
abstract: >-
  Language modeling has witnessed remarkable advancements in
  recent years, with Large Language Models (LLMs) like
  ChatGPT setting unparalleled benchmarks in human-like text
  generation. However, a prevailing limitation is the
  underrepresentation of languages like Tamil in these
  cutting-edge models, leading to suboptimal performance in
  diverse linguistic contexts. This paper addresses this
  lacuna, enhancing the open-source LLaMA model with an
  addition of 16,000 Tamil tokens, aiming to achieve
  superior text generation and comprehension in the Tamil
  language. We strategically employ the LoRA methodology for
  efficient model training on a comprehensive Tamil corpus,
  ensuring computational feasibility and model robustness.
  Moreover, we introduce a Tamil-translated version of the
  Alpaca dataset and a subset of the OpenOrca dataset
  tailored for instruction fine-tuning. Our results showcase
  significant performance improvements in Tamil text
  generation, with potential implications for the broader
  landscape of LLMs in Indian languages. We further
  underscore our commitment to open research by making our
  models, datasets, and code publicly accessible, fostering
  further innovations in language modeling.
keywords:
  - large language models
  - natural language processing
  - machine learning
  - deep learning
  - llama 2
  - tamil language model
license: GPL-3.0
date-released: '2023-11-12'
```
GitHub Events
Total
- Watch event: 48
- Fork event: 10
Last Year
- Watch event: 48
- Fork event: 10
Issues and Pull Requests
Last synced: about 2 years ago
All Time
- Total issues: 10
- Total pull requests: 1
- Average time to close issues: 28 days
- Average time to close pull requests: N/A
- Total issue authors: 9
- Total pull request authors: 1
- Average comments per issue: 2.7
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 10
- Pull requests: 1
- Average time to close issues: 28 days
- Average time to close pull requests: N/A
- Issue authors: 9
- Pull request authors: 1
- Average comments per issue: 2.7
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- Cyberblackstone (2)
- winstondcosta (1)
- alishafique3 (1)
- VishnuPJ (1)
- wenhui-huang (1)
- kdcyberdude (1)
- sazzad1779 (1)
- almugabo (1)
- SmartManoj (1)
- bharathdpv (1)
Pull Request Authors
- SmartManoj (2)
- ke-lara (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- absl-py =2.0.0=pypi_0
- accelerate =0.21.0=pypi_0
- aiofiles =23.2.1=pypi_0
- aiohttp =3.8.6=pypi_0
- aiosignal =1.3.1=pypi_0
- albumentations =1.3.1=pypi_0
- alembic =1.12.0=pypi_0
- altair =5.1.2=pypi_0
- anyio =3.7.1=pypi_0
- arrow =1.3.0=pypi_0
- asttokens =2.4.1=pyhd8ed1ab_0
- async-timeout =4.0.3=pypi_0
- attrs =23.1.0=pypi_0
- autotrain-advanced =0.6.37=pypi_0
- backcall =0.2.0=pyh9f0ad1d_0
- backports =1.0=pyhd8ed1ab_3
- backports.functools_lru_cache =1.6.5=pyhd8ed1ab_0
- bitsandbytes =0.40.2=pypi_0
- bzip2 =1.0.8=h7b6447c_0
- ca-certificates =2023.7.22=hbcca054_0
- cachetools =5.3.1=pypi_0
- certifi =2023.7.22=pypi_0
- charset-normalizer =3.3.0=pypi_0
- click =8.1.7=pypi_0
- cmaes =0.10.0=pypi_0
- codecarbon =2.2.3=pypi_0
- colorlog =6.7.0=pypi_0
- comm =0.1.4=pyhd8ed1ab_0
- contourpy =1.1.1=pypi_0
- cycler =0.12.1=pypi_0
- datasets =2.14.5=pypi_0
- debugpy =1.6.7=py310h6a678d5_0
- decorator =5.1.1=pyhd8ed1ab_0
- diffusers =0.21.4=pypi_0
- dill =0.3.7=pypi_0
- docstring-parser =0.15=pypi_0
- einops =0.6.1=pypi_0
- entrypoints =0.4=pyhd8ed1ab_0
- evaluate =0.3.0=pypi_0
- exceptiongroup =1.1.3=pyhd8ed1ab_0
- executing =2.0.1=pyhd8ed1ab_0
- fastapi =0.104.0=pypi_0
- ffmpy =0.3.1=pypi_0
- filelock =3.12.4=pypi_0
- flash-attn =2.3.3=pypi_0
- fonttools =4.43.1=pypi_0
- frozenlist =1.4.0=pypi_0
- fsspec =2023.6.0=pypi_0
- fuzzywuzzy =0.18.0=pypi_0
- google-auth =2.23.3=pypi_0
- google-auth-oauthlib =1.1.0=pypi_0
- gradio =3.41.0=pypi_0
- gradio-client =0.5.0=pypi_0
- greenlet =3.0.0=pypi_0
- grpcio =1.59.0=pypi_0
- h11 =0.14.0=pypi_0
- httpcore =0.18.0=pypi_0
- httpx =0.25.0=pypi_0
- huggingface-hub =0.17.3=pypi_0
- idna =3.4=pypi_0
- imageio =2.31.5=pypi_0
- importlib-metadata =6.8.0=pypi_0
- importlib-resources =6.1.0=pypi_0
- inquirerpy =0.3.4=pypi_0
- invisible-watermark =0.2.0=pypi_0
- ipadic =1.0.0=pypi_0
- ipykernel =6.26.0=pyhf8b6a83_0
- ipython =8.16.1=pyh0d859eb_0
- jedi =0.19.1=pyhd8ed1ab_0
- jinja2 =3.1.2=pypi_0
- jiwer =3.0.2=pypi_0
- joblib =1.3.1=pypi_0
- jsonschema =4.19.1=pypi_0
- jsonschema-specifications =2023.7.1=pypi_0
- jupyter_client =7.3.4=pyhd8ed1ab_0
- jupyter_core =5.5.0=py310hff52083_0
- kiwisolver =1.4.5=pypi_0
- lazy-loader =0.3=pypi_0
- ld_impl_linux-64 =2.38=h1181459_1
- libffi =3.4.4=h6a678d5_0
- libgcc-ng =11.2.0=h1234567_1
- libgomp =11.2.0=h1234567_1
- libsodium =1.0.18=h36c2ea0_1
- libstdcxx-ng =11.2.0=h1234567_1
- libuuid =1.41.5=h5eee18b_0
- loguru =0.7.0=pypi_0
- mako =1.2.4=pypi_0
- markdown =3.5=pypi_0
- markdown-it-py =3.0.0=pypi_0
- markupsafe =2.1.3=pypi_0
- matplotlib =3.8.0=pypi_0
- matplotlib-inline =0.1.6=pyhd8ed1ab_0
- mdurl =0.1.2=pypi_0
- mpmath =1.3.0=pypi_0
- multidict =6.0.4=pypi_0
- multiprocess =0.70.15=pypi_0
- ncurses =6.4=h6a678d5_0
- nest-asyncio =1.5.8=pyhd8ed1ab_0
- networkx =3.2=pypi_0
- ninja =1.11.1.1=pypi_0
- numpy =1.26.1=pypi_0
- nvidia-cublas-cu12 =12.1.3.1=pypi_0
- nvidia-cuda-cupti-cu12 =12.1.105=pypi_0
- nvidia-cuda-nvrtc-cu12 =12.1.105=pypi_0
- nvidia-cuda-runtime-cu12 =12.1.105=pypi_0
- nvidia-cudnn-cu12 =8.9.2.26=pypi_0
- nvidia-cufft-cu12 =11.0.2.54=pypi_0
- nvidia-curand-cu12 =10.3.2.106=pypi_0
- nvidia-cusolver-cu12 =11.4.5.107=pypi_0
- nvidia-cusparse-cu12 =12.1.0.106=pypi_0
- nvidia-nccl-cu12 =2.18.1=pypi_0
- nvidia-nvjitlink-cu12 =12.3.52=pypi_0
- nvidia-nvtx-cu12 =12.1.105=pypi_0
- oauthlib =3.2.2=pypi_0
- opencv-python =4.8.1.78=pypi_0
- opencv-python-headless =4.8.1.78=pypi_0
- openssl =3.0.11=h7f8727e_2
- optuna =3.3.0=pypi_0
- orjson =3.9.9=pypi_0
- packaging =23.1=pypi_0
- pandas =2.1.1=pypi_0
- parso =0.8.3=pyhd8ed1ab_0
- peft =0.4.0=pypi_0
- pexpect =4.8.0=pyh1a96a4e_2
- pfzy =0.3.4=pypi_0
- pickleshare =0.7.5=py_1003
- pillow =10.0.0=pypi_0
- pip =23.3=py310h06a4308_0
- platformdirs =3.11.0=pyhd8ed1ab_0
- prompt-toolkit =3.0.39=pyha770c72_0
- prompt_toolkit =3.0.39=hd8ed1ab_0
- protobuf =4.23.4=pypi_0
- psutil =5.9.6=pypi_0
- ptyprocess =0.7.0=pyhd3deb0d_0
- pure_eval =0.2.2=pyhd8ed1ab_0
- py-cpuinfo =9.0.0=pypi_0
- pyarrow =13.0.0=pypi_0
- pyasn1 =0.5.0=pypi_0
- pyasn1-modules =0.3.0=pypi_0
- pydantic =1.10.11=pypi_0
- pydub =0.25.1=pypi_0
- pygments =2.16.1=pyhd8ed1ab_0
- pynvml =11.5.0=pypi_0
- pyparsing =3.1.1=pypi_0
- python =3.10.13=h955ad1f_0
- python-dateutil =2.8.2=pyhd8ed1ab_0
- python-multipart =0.0.6=pypi_0
- python_abi =3.10=2_cp310
- pytz =2023.3.post1=pypi_0
- pywavelets =1.4.1=pypi_0
- pyyaml =6.0.1=pypi_0
- pyzmq =25.1.0=py310h6a678d5_0
- qudida =0.0.4=pypi_0
- rapidfuzz =2.13.7=pypi_0
- readline =8.2=h5eee18b_0
- referencing =0.30.2=pypi_0
- regex =2023.10.3=pypi_0
- requests =2.31.0=pypi_0
- requests-oauthlib =1.3.1=pypi_0
- responses =0.18.0=pypi_0
- rich =13.6.0=pypi_0
- rpds-py =0.10.6=pypi_0
- rsa =4.9=pypi_0
- sacremoses =0.0.53=pypi_0
- safetensors =0.4.0=pypi_0
- scikit-image =0.22.0=pypi_0
- scikit-learn =1.3.0=pypi_0
- scipy =1.11.3=pypi_0
- semantic-version =2.10.0=pypi_0
- sentencepiece =0.1.99=pypi_0
- setuptools =68.0.0=py310h06a4308_0
- shtab =1.6.4=pypi_0
- six =1.16.0=pyh6c4a22f_0
- sniffio =1.3.0=pypi_0
- sqlalchemy =2.0.22=pypi_0
- sqlite =3.41.2=h5eee18b_0
- stack-data =0.6.3=pypi_0
- stack_data =0.6.2=pyhd8ed1ab_0
- starlette =0.27.0=pypi_0
- sympy =1.12=pypi_0
- tensorboard =2.15.0=pypi_0
- tensorboard-data-server =0.7.1=pypi_0
- threadpoolctl =3.2.0=pypi_0
- tifffile =2023.9.26=pypi_0
- tiktoken =0.5.1=pypi_0
- tk =8.6.12=h1ccaba5_0
- tokenizers =0.13.3=pypi_0
- toolz =0.12.0=pypi_0
- torch =2.1.0=pypi_0
- tornado =6.1=py310h5764c6d_3
- tqdm =4.65.0=pypi_0
- traitlets =5.12.0=pypi_0
- transformers =4.31.0=pypi_0
- triton =2.1.0=pypi_0
- trl =0.4.7=pypi_0
- types-python-dateutil =2.8.19.14=pypi_0
- typing-extensions =4.8.0=hd8ed1ab_0
- typing_extensions =4.8.0=pyha770c72_0
- tyro =0.5.10=pypi_0
- tzdata =2023.3=pypi_0
- urllib3 =2.0.7=pypi_0
- uvicorn =0.23.2=pypi_0
- wcwidth =0.2.8=pypi_0
- websockets =11.0.3=pypi_0
- werkzeug =2.3.6=pypi_0
- wheel =0.41.2=py310h06a4308_0
- xformers =0.0.22.post4=pypi_0
- xgboost =1.7.6=pypi_0
- xxhash =3.4.1=pypi_0
- xz =5.4.2=h5eee18b_0
- yarl =1.9.2=pypi_0
- zeromq =4.3.4=h2531618_0
- zipp =3.17.0=pypi_0
- zlib =1.2.13=h5eee18b_0
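
Each entry in the dependency list follows the `name =version=build` shape of a conda environment export, where a build string of `pypi_0` marks a package installed through pip. The Python sketch below shows one way to parse such entries and turn the pip-installed ones into `requirements.txt`-style pins. The `Dependency` type and the helper functions are hypothetical names introduced for this example, and only three sample entries from the list are used.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    name: str
    version: str
    build: str

    @property
    def from_pip(self) -> bool:
        # conda marks pip-installed packages with a "pypi_0" build string
        return self.build == "pypi_0"


def parse_dependency(line: str) -> Dependency:
    """Parse one entry such as '- torch =2.1.0=pypi_0'."""
    text = line.strip().lstrip("-").strip()
    name, version, build = (part.strip() for part in text.split("="))
    return Dependency(name, version, build)


def to_requirement(dep: Dependency) -> str:
    """Render a pip-style exact version pin."""
    return f"{dep.name}=={dep.version}"


# Three sample entries copied from the list above
sample = [
    "- torch =2.1.0=pypi_0",
    "- transformers =4.31.0=pypi_0",
    "- python =3.10.13=h955ad1f_0",
]

deps = [parse_dependency(line) for line in sample]
requirements = [to_requirement(dep) for dep in deps if dep.from_pip]
print("\n".join(requirements))
```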
