https://github.com/ai4bharat/indic-bart
Pre-trained, multilingual sequence-to-sequence models for Indian languages
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: AI4Bharat
- License: MIT
- Language: Python
- Default Branch: main
- Size: 79.1 KB
Statistics
- Stars: 42
- Watchers: 6
- Forks: 5
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
IndicBART
Pre-trained, multilingual sequence-to-sequence models for Indian languages
IndicBART is part of the AI4Bharat tools for Indian languages.
You can read more about IndicBART in the paper (arXiv:2109.02903).
Installation
- Install the YANMTT toolkit and check out its v1.0 release via "git checkout v1.0". Create a fresh conda or virtual environment first to avoid dependency conflicts.
- Download the following:
- **v1** [(Vocabulary)](https://storage.googleapis.com/ai4bharat-indicnlg-public/indic-bart-v1/albert-indicunified64k.zip)
[(Model)](https://storage.googleapis.com/ai4bharat-indicnlg-public/indic-bart-v1/indicbart_model.ckpt)
- Decompress the vocabulary zip:
unzip albert-indicunified64k.zip
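The download and unzip steps above can also be scripted. The following is a minimal sketch using only the Python standard library. The helper names `fetch` and `unpack` are hypothetical and are not part of IndicBART or YANMTT; the URLs are the v1 links listed above.

```python
import urllib.request
import zipfile
from pathlib import Path

# v1 artifact URLs, copied from the download list above.
BASE = "https://storage.googleapis.com/ai4bharat-indicnlg-public/indic-bart-v1"
VOCAB_URL = f"{BASE}/albert-indicunified64k.zip"
MODEL_URL = f"{BASE}/indicbart_model.ckpt"


def fetch(url: str, dest_dir: str = ".") -> Path:
    """Download url into dest_dir unless the file already exists (hypothetical helper)."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def unpack(zip_path, dest_dir: str = ".") -> list:
    """Extract a zip archive and return the names of its members (hypothetical helper)."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Usage (needs network access):
#   unpack(fetch(VOCAB_URL))   # same effect as: unzip albert-indicunified64k.zip
#   fetch(MODEL_URL)           # the checkpoint is used as-is, no extraction needed
```

Skipping files that already exist makes the script safe to re-run after an interrupted setup, at the cost of not detecting a partially downloaded file.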
Finetuning IndicBART for NMT
Sample training corpora
Sample development set
Script conversion
- The Indic side of the data needs to be converted to the Devanagari script. You may use the indic_scriptmap.py script.
- This script depends on Indic NLP Library and Indic NLP Resources which should be manually installed.
- Following this, change the paths in lines 13 and 16 in indic_scriptmap.py.
- Usage: python indic_scriptmap.py <input_file> <output_file> <source_language> <target_language>
- Example: python indic_scriptmap.py input.txt output.txt ta hi
- This will map the script of input.txt from Tamil to Devanagari (the script used for Hindi) and write the result to output.txt.
- The sample data provided above has already been converted to the Devanagari script, so you can use it as is.
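To illustrate what script conversion means here, the sketch below maps Tamil text to Devanagari by shifting Unicode code points, relying on the fact that Unicode lays out the major Indic scripts in parallel 128-code-point blocks. It is a simplified, standard-library-only illustration and not the code of indic_scriptmap.py, which depends on the Indic NLP Library. The function name `tamil_to_devanagari` is hypothetical, and the sketch does not special-case characters that lack a proper counterpart in the target script.

```python
# Start of the Tamil and Devanagari blocks in Unicode; each block spans 0x80 code points.
TAMIL_START = 0x0B80
DEVANAGARI_START = 0x0900
BLOCK_SIZE = 0x80


def tamil_to_devanagari(text: str) -> str:
    """Shift every Tamil code point into the Devanagari block; leave the rest alone."""
    out = []
    for ch in text:
        offset = ord(ch) - TAMIL_START
        if 0 <= offset < BLOCK_SIZE:
            out.append(chr(DEVANAGARI_START + offset))
        else:
            out.append(ch)  # spaces, ASCII digits, Latin text, punctuation
    return "".join(out)


# The word "Tamil" written in Tamil script, printed in Devanagari.
print(tamil_to_devanagari("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"))
```

Because the mapping works per code point, it keeps the token boundaries of the input unchanged, which is why a single shared Devanagari vocabulary can cover several languages after conversion.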
Fine-tuning command
python PATH-TO-YANMTT/train_nmt.py --train_slang hi,bn --train_tlang en,en \
--dev_slang hi,bn --dev_tlang en,en --train_src train.en-hi.hi,train.en-bn.bn \
--train_tgt train.en-hi.en,train.en-bn.en --dev_src dev.hi,dev.bn --dev_tgt dev.en,dev.en \
--model_path model.ft --encoder_layers 6 --decoder_layers 6 --label_smoothing 0.1 \
--dropout 0.1 --attention_dropout 0.1 --activation_dropout 0.1 --encoder_attention_heads 16 \
--decoder_attention_heads 16 --encoder_ffn_dim 4096 --decoder_ffn_dim 4096 \
--d_model 1024 --tokenizer_name_or_path albert-indicunified64k --warmup_steps 16000 \
--weight_decay 0.00001 --lr 0.001 --max_gradient_clip_value 1.0 --dev_batch_size 128 \
--port 22222 --shard_files --hard_truncate_length 256 --pretrained_model indicbart_model.ckpt &> log
At the end of training, you should find the model with the highest BLEU score for a given language pair. This will be model.ft.bestdevbleu.
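The fine-tuning command passes languages and files as comma-separated lists. Judging from the command above, the i-th entries of the language lists and the file lists are meant to describe the same language pair; that alignment is an assumption read off the command, not a documented guarantee. A minimal sketch of a helper that builds these flags from explicit per-pair records, so the lists cannot drift out of order (`Pair` and `paired_flags` are hypothetical names):

```python
from typing import List, NamedTuple


class Pair(NamedTuple):
    """One language pair and its parallel files (hypothetical record type)."""
    src_lang: str
    tgt_lang: str
    src_file: str
    tgt_file: str


def paired_flags(split: str, pairs: List[Pair]) -> List[str]:
    """Return the four comma-joined flags for one split ("train" or "dev")."""
    return [
        f"--{split}_slang", ",".join(p.src_lang for p in pairs),
        f"--{split}_tlang", ",".join(p.tgt_lang for p in pairs),
        f"--{split}_src", ",".join(p.src_file for p in pairs),
        f"--{split}_tgt", ",".join(p.tgt_file for p in pairs),
    ]


train = [
    Pair("hi", "en", "train.en-hi.hi", "train.en-hi.en"),
    Pair("bn", "en", "train.en-bn.bn", "train.en-bn.en"),
]
dev = [
    Pair("hi", "en", "dev.hi", "dev.en"),
    Pair("bn", "en", "dev.bn", "dev.en"),
]

args = paired_flags("train", train) + paired_flags("dev", dev)
print(" ".join(args))
```

The printed flags reproduce the language and file arguments of the command above; the remaining model and optimizer flags would be appended unchanged.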
Decoding command
```
decmod=BEST-CHECKPOINT-NAME

python PATH-TO-YANMTT/decode_nmt.py --model_path $decmod --slang hi --tlang en \
--test_src dev.hi --test_tgt dev.trans --port 23352 --encoder_layers 6 --decoder_layers 6 \
--encoder_attention_heads 16 --decoder_attention_heads 16 --encoder_ffn_dim 4096 \
--decoder_ffn_dim 4096 --d_model 1024 --tokenizer_name_or_path albert-indicunified64k \
--beam_size 4 --length_penalty 0.8
```
Notes:
If you want to use an IndicBART model with language-specific scripts, we provide that variant as well: (Vocabulary) (Model)
If you want to perform additional pre-training of IndicBART or train your own model, follow the instructions in: https://github.com/prajdabre/yanmtt/blob/main/examples/train_mbart_model.sh
For advanced training options, look at the examples in: https://github.com/prajdabre/yanmtt/blob/main/examples
Finetuning IndicBART for Summarization
Sample Corpus
Fine-tuning command
python PATH-TO-YANMTT/train_nmt.py --train_slang hi --train_tlang hi --dev_slang hi --dev_tlang hi \
--train_src train.text.hi --train_tgt train.summary.hi --dev_src dev.text.hi \
--dev_tgt dev.summary.hi --model_path model.ft --encoder_layers 6 --decoder_layers 6 \
--label_smoothing 0.1 --dropout 0.1 --attention_dropout 0.1 --activation_dropout 0.1 \
--encoder_attention_heads 16 --decoder_attention_heads 16 --encoder_ffn_dim 4096 \
--decoder_ffn_dim 4096 --d_model 1024 --tokenizer_name_or_path albert-indicunified64k \
--warmup_steps 16000 --weight_decay 0.00001 --lr 0.0003 --max_gradient_clip_value 1.0 \
--port 22222 --shard_files --hard_truncate_length 512 \
--pretrained_model indicbart_model.ckpt --max_src_length 384 --max_tgt_length 40 \
--is_summarization --dev_batch_size 64 --max_decode_length_multiplier -60 \
--min_decode_length_multiplier -10 --no_repeat_ngram_size 4 --length_penalty 1.0 \
--max_eval_batches 20
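The negative values passed to --max_decode_length_multiplier and --min_decode_length_multiplier deserve a note. The sketch below assumes the convention that a positive multiplier scales the source length while a negative one is taken as an absolute token count; this is an assumption about YANMTT's behavior and should be verified against the option help in train_nmt.py. The function name `decode_length` is hypothetical.

```python
def decode_length(src_len: int, multiplier: float) -> int:
    """Length bound implied by a *_decode_length_multiplier value (assumed convention)."""
    if multiplier < 0:
        return int(-multiplier)        # negative: absolute number of tokens
    return int(src_len * multiplier)   # positive: scaled by the source length


src_len = 384  # matches --max_src_length in the command above

print(decode_length(src_len, -60))  # assumed upper bound on summary length: 60
print(decode_length(src_len, -10))  # assumed lower bound on summary length: 10
```

Under this reading, summaries are constrained to between 10 and 60 tokens regardless of how long the input document is, which suits summarization better than a bound proportional to the source length.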
Decoding command
```
decmod=BEST-CHECKPOINT-NAME

python PATH-TO-YANMTT/decode_nmt.py --model_path $decmod --slang hi --tlang hi \
--test_src dev.text.hi --test_tgt dev.trans --port 23352 --encoder_layers 6 \
--decoder_layers 6 --encoder_attention_heads 16 --decoder_attention_heads 16 \
--encoder_ffn_dim 4096 --decoder_ffn_dim 4096 --d_model 1024 \
--tokenizer_name_or_path albert-indicunified64k --beam_size 4 \
--max_src_length 384 --max_decode_length_multiplier -60 --min_decode_length_multiplier -10 \
--no_repeat_ngram_size 4 --length_penalty 1.0 --hard_truncate_length 512
```
Contributors
- Raj Dabre
- Himani Shrotriya
- Anoop Kunchukuttan
- Ratish Puduppully
- Mitesh M. Khapra
- Pratyush Kumar
Citing
If you use IndicBART, please cite the following paper:
@misc{dabre2021indicbart,
title={IndicBART: A Pre-trained Model for Natural Language Generation of Indic Languages},
author={Raj Dabre and Himani Shrotriya and Anoop Kunchukuttan and Ratish Puduppully and Mitesh M. Khapra and Pratyush Kumar},
year={2021},
eprint={2109.02903},
archivePrefix={arXiv},
booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
publisher={Association for Computational Linguistics},
primaryClass={cs.CL}
}
License
IndicBART is licensed under the MIT License.
Owner
- Name: AI4Bhārat
- Login: AI4Bharat
- Kind: organization
- Email: opensource@ai4bharat.org
- Location: India
- Website: https://ai4bharat.org
- Twitter: AI4Bharat
- Repositories: 37
- Profile: https://github.com/AI4Bharat
Artificial-Intelligence-For-Bhārat : Building open-source AI solutions for India!
GitHub Events
Total
- Watch event: 5
Last Year
- Watch event: 5
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 12
- Total pull requests: 2
- Average time to close issues: 23 days
- Average time to close pull requests: about 1 month
- Total issue authors: 8
- Total pull request authors: 2
- Average comments per issue: 3.42
- Average comments per pull request: 0.5
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ShivprasadSagare (4)
- Aniruddha-JU (2)
- ThanmayJ (1)
- Ayush1702 (1)
- arijitx (1)
- varadhbhatnagar (1)
- santha96 (1)
- tushar117 (1)
Pull Request Authors
- ShivprasadSagare (1)
- Ayush1702 (1)