https://github.com/ai4bharat/vocabadaptation_llm

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (5.3%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: AI4Bharat
Language: Python
Default Branch: main
Size: 3.88 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created almost 3 years ago · Last pushed almost 2 years ago

Metadata Files

Readme

README.md

Vocabulary Adaptation for MPT and BLOOM Models

Tokenizer-embed Pipeline

To train the Indic tokenizer and build the final tokenizer, follow the tokenizer_setup directory.
To evaluate the resulting tokenizer, follow the tokenizer_evaluation directory.
To obtain embeddings using WECHSEL, follow Wechsel_Setup.
To initialize the word embedding layer of the model, follow InitializationWordEmbed.
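
The last step, initializing the word embedding layer for a new vocabulary, can be illustrated with a minimal pure-Python sketch. This is not the repository's code and it is not WECHSEL: it swaps in a simpler stand-in that copies the rows of tokens shared with the source vocabulary and starts every new token at the mean source vector plus small noise. The function name `init_embeddings`, the toy vocabularies, and the noise scale are all assumptions made for illustration.

```python
import random


def init_embeddings(src_vocab, src_emb, tgt_vocab, noise_std=0.01, seed=0):
    """Build an embedding table for tgt_vocab from a source table.

    src_vocab / tgt_vocab: dict mapping token -> row index (hypothetical format).
    src_emb: list of rows (lists of floats), one row per source token.
    """
    dim = len(src_emb[0])
    # Mean source vector: the starting point for tokens with no source match.
    mean = [sum(row[d] for row in src_emb) / len(src_emb) for d in range(dim)]
    rng = random.Random(seed)
    tgt_emb = [None] * len(tgt_vocab)
    for token, idx in tgt_vocab.items():
        if token in src_vocab:
            # Shared token: reuse the trained source row unchanged.
            tgt_emb[idx] = list(src_emb[src_vocab[token]])
        else:
            # New token: mean vector plus small Gaussian noise (an assumption).
            tgt_emb[idx] = [m + rng.gauss(0.0, noise_std) for m in mean]
    return tgt_emb


# Toy data, invented for this example.
src_vocab = {"the": 0, "cat": 1}
src_emb = [[1.0, 0.0], [0.0, 1.0]]
tgt_vocab = {"the": 0, "cat": 1, "नमस्ते": 2}
tgt_emb = init_embeddings(src_vocab, src_emb, tgt_vocab)
```

In this sketch the shared tokens keep their trained vectors and only the new Indic token receives a fresh row. A similarity-based method would replace the mean-plus-noise branch with a weighted combination of source rows.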

Results

Results are available at https://docs.google.com/spreadsheets/d/1npkCffkNyztbPZokK9vis19zvzzT07l-uWnN06aiOeQ/edit#gid=868636088
Meeting notes, the to-do list, observations, and related material are available at https://docs.google.com/document/d/1dOegfXg8v5NBYXlCZgLDnkLBjP1YD6K47kHh5ojd0/edit

File specification

seeddatatest_split.py -> splits the seed dataset into train (90%) and test (10%) sets
mergetrainingseed.py -> merges the training data
tokenizer_specification.py -> reports how two tokenizers are related, such as their intersecting tokens or the average tokenization length per sentence
combine_tokenizer.py -> combines two tokenizers (the one used for the extended version)
train_tokenizer.py -> trains a tokenizer from scratch
MPTinference.py and IndicMPTinference.py -> compute the perplexity score from inference only (no training)
MPTtrain.py and IndicMPTtrain.py -> train the LoRA adapter and the word embedding layer of the model
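
Three of the quantities named above (the 90/10 seed split, the tokenizer comparison, and the perplexity score) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the scripts' actual code: the helper names, the whitespace tokenizer, and the toy inputs are assumptions, and a real run would use the trained tokenizers and per-token log-probabilities produced by the model.

```python
import math
import random


def split_seed(sentences, test_fraction=0.1, seed=0):
    """Shuffle and split sentences into (train, test); 10% test by default."""
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n_test = int(round(len(items) * test_fraction))
    return items[n_test:], items[:n_test]


def shared_tokens(vocab_a, vocab_b):
    """Tokens that appear in both vocabularies."""
    return set(vocab_a) & set(vocab_b)


def avg_tokens_per_sentence(tokenize, sentences):
    """Average number of tokens a tokenizer callable emits per sentence."""
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability over all tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy inputs, invented for this example.
train, test = split_seed([f"sentence {i}" for i in range(20)])
overlap = shared_tokens({"a", "b"}, {"b", "c"})
avg_len = avg_tokens_per_sentence(str.split, ["a b c", "d"])
ppl = perplexity([math.log(0.25)] * 4)
```

A lower average token count per sentence means the tokenizer splits that language's text into fewer pieces. Perplexity here is computed per token, so scores from models with different tokenizers are not directly comparable without normalizing by a shared unit such as words or characters.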

Owner

Name: AI4Bhārat
Login: AI4Bharat
Kind: organization
Email: opensource@ai4bharat.org
Location: India

Website: https://ai4bharat.org
Twitter: AI4Bharat
Repositories: 37
Profile: https://github.com/AI4Bharat

Artificial-Intelligence-For-Bhārat : Building open-source AI solutions for India!

GitHub Events

Total

Watch event: 5
Fork event: 1

Last Year

Watch event: 5
Fork event: 1

Issues and Pull Requests

All Time

Total issues: 1
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
