https://github.com/ai4bharat/vocabadaptation_llm
Science Score: 13.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (5.3%) to scientific vocabulary
Last synced: 9 months ago
Repository
Basic Info
- Host: GitHub
- Owner: AI4Bharat
- Language: Python
- Default Branch: main
- Size: 3.88 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
- Created: almost 3 years ago
- Last pushed: almost 2 years ago
Metadata Files
Readme
README.md
Vocabulary Adaptation for MPT and BLOOM Models
Tokenizer-embed Pipeline
- To train the Indic tokenizer and obtain the final tokenizer, follow the tokenizer_setup directory.
- To evaluate the resulting tokenizer, follow the tokenizer_evaluation directory.
- To obtain embeddings using WECHSEL, follow Wechsel_Setup.
- To initialize the word embedding layer of the model, follow InitializationWordEmbed.
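The last step of the pipeline, initializing the word embedding layer for a new vocabulary, can be illustrated with a minimal sketch. This is not the WECHSEL procedure and not the code in InitializationWordEmbed; it is a much simpler copy-or-mean heuristic, shown only to make the idea concrete. The function name `init_new_embeddings` and its arguments are hypothetical, and the sketch assumes the new vocabulary uses contiguous ids starting at 0.

```python
from statistics import fmean


def init_new_embeddings(old_vocab, old_emb, new_vocab):
    """Build an embedding table for new_vocab from an existing table.

    old_vocab / new_vocab: dict mapping token string -> row index.
    old_emb: list of equal-length float vectors, one per old token.
    Assumption (hypothetical helper): tokens shared by both vocabularies
    keep their old vector; unseen tokens start at the mean old vector.
    """
    dim = len(old_emb[0])
    # Per-dimension mean of the old table, used as a neutral starting point.
    mean_vec = [fmean(row[d] for row in old_emb) for d in range(dim)]

    new_emb = [None] * len(new_vocab)
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            new_emb[new_id] = list(old_emb[old_id])  # copy shared token
        else:
            new_emb[new_id] = list(mean_vec)  # fallback for new token
    return new_emb


old_vocab = {"a": 0, "b": 1}
old_emb = [[1.0, 2.0], [3.0, 4.0]]
new_vocab = {"b": 0, "क": 1, "a": 2}
new_emb = init_new_embeddings(old_vocab, old_emb, new_vocab)
```

Copying shared tokens preserves what the model already knows about them, while the mean vector gives new tokens a starting point that lies inside the existing embedding distribution.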
Results
- Results are available at https://docs.google.com/spreadsheets/d/1npkCffkNyztbPZokK9vis19zvzzT07l-uWnN06aiOeQ/edit#gid=868636088
- Meeting notes, the to-do list, observations, etc. are available at https://docs.google.com/document/d/1dOegfXg8v5NBYXlCZgLDnkLBjP1YD6K47kHh5ojd0/edit
File specification
- seeddatatest_split.py -> splits the seed dataset into train (90%) and test (10%) sets
- mergetrainingseed.py -> merges the training data
- tokenizer_specification.py -> reports how two tokenizers are related, such as their intersecting tokens or the average tokenized length per sentence
- combine_tokenizer.py -> combines two tokenizers (the one used for the extended version)
- train_tokenizer.py -> trains a tokenizer from scratch
- MPTinference.py and IndicMPTinference.py -> compute the perplexity score at inference only (no training)
- MPTtrain.py and IndicMPTtrain.py -> train the LoRA adapter and the word embedding layer of the model
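Several of the scripts listed above perform small, well-defined computations: a 90/10 split, a comparison of two tokenizers, and a perplexity score. The sketch below shows one plausible way to express each in plain Python. It is an illustration, not the repository's code: all function names are hypothetical, vocabularies are assumed to be token-to-id dicts, tokenizers are assumed to be callables returning a list of tokens, and perplexity is assumed to be computed from per-token natural-log probabilities.

```python
import math
import random


def split_seed(sentences, test_fraction=0.1, seed=0):
    """Shuffle and split into (train, test); hypothetical helper."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def vocab_overlap(vocab_a, vocab_b):
    """Tokens present in both vocabularies (dicts of token -> id)."""
    return set(vocab_a) & set(vocab_b)


def avg_tokens_per_sentence(tokenize, sentences):
    """Mean number of tokens a tokenizer produces per sentence."""
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over all tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


sentences = [f"sentence {i}" for i in range(20)]
train, test = split_seed(sentences)
shared = vocab_overlap({"a": 0, "b": 1}, {"b": 0, "c": 1})
avg_len = avg_tokens_per_sentence(str.split, ["a b c", "d"])
ppl = perplexity([-1.0, -1.0])
```

A lower average token count per sentence on Indic text would suggest that the adapted tokenizer segments that text more compactly. Perplexity values are only comparable between models that share a tokenizer, since the per-token average depends on how the text is segmented.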
Owner
- Name: AI4Bhārat
- Login: AI4Bharat
- Kind: organization
- Email: opensource@ai4bharat.org
- Location: India
- Website: https://ai4bharat.org
- Twitter: AI4Bharat
- Repositories: 37
- Profile: https://github.com/AI4Bharat
Artificial-Intelligence-For-Bhārat : Building open-source AI solutions for India!
GitHub Events
Total
- Watch event: 5
- Fork event: 1
Last Year
- Watch event: 5
- Fork event: 1
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 1
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- slivering (1)