bangla-bert

Bangla-Bert is a pretrained BERT model for the Bengali language

https://github.com/sagorbrur/bangla-bert

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.5%) to scientific vocabulary

Keywords

bangla bangla-nlp bert lm nlp transformers
Last synced: 6 months ago

Repository

Bangla-Bert is a pretrained BERT model for the Bengali language

Basic Info
Statistics
  • Stars: 80
  • Watchers: 4
  • Forks: 24
  • Open Issues: 0
  • Releases: 0
Topics
bangla bangla-nlp bert lm nlp transformers
Created over 5 years ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md

Bangla BERT Base

It has been a long journey. Here is our Bangla-Bert! It is now available on the Hugging Face model hub.

Bangla-Bert-Base is a pretrained language model for Bengali, trained with masked language modeling as described in BERT and its GitHub repository

NB: If you use this model for any NLP task, please share the evaluation results with us and we will add them here.

Download Model

| Model | TF Version | PyTorch Version | Vocab |
| ----- | ---------- | --------------- | ----- |
| Bangla BERT Base | ----- | Huggingface Hub | Vocab |

Pretrain Corpus Details

Corpus was downloaded from two main sources:

After downloading these corpora, we preprocessed them into BERT format: one sentence per line, with an extra blank line between documents.

```
sentence 1
sentence 2

sentence 1
sentence 2
```
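The document-to-pretraining-format conversion described above can be sketched in plain Python; splitting on the Bengali full stop "।" is a naive stand-in for whatever sentence splitter the authors actually used:

```py
# Convert raw documents to BERT pretraining format:
# one sentence per line, a blank line between documents.
# Splitting on "।" (dari) is a naive Bengali sentence
# splitter, used here only for illustration.

def to_bert_format(documents):
    lines = []
    for doc in documents:
        sentences = [s.strip() for s in doc.split("।") if s.strip()]
        lines.extend(s + " ।" for s in sentences)
        lines.append("")  # blank line separates documents
    return "\n".join(lines)

docs = ["প্রথম বাক্য। দ্বিতীয় বাক্য।", "নতুন দলিলের বাক্য।"]
print(to_bert_format(docs))
```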

Building Vocab

We used the BNLP package to train a Bengali SentencePiece model with a vocab size of 102025. We then preprocessed the output vocab file into BERT format. Our final vocab file is available at https://github.com/sagorbrur/bangla-bert and also on the Hugging Face model hub.
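The "preprocess the output vocab file as BERT format" step might look like the hypothetical sketch below. The special-token ordering and the `▁`-to-`##` piece-marker mapping are common conventions, not the authors' exact script:

```py
# Hypothetical sketch: convert a SentencePiece .vocab file
# (token<TAB>score per line) into a BERT-style vocab.txt list.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def sp_vocab_to_bert(sp_vocab_lines):
    tokens = list(SPECIAL_TOKENS)
    for line in sp_vocab_lines:
        token = line.split("\t")[0]
        if token in ("<unk>", "<s>", "</s>"):  # drop SentencePiece specials
            continue
        # SentencePiece marks word-INITIAL pieces with "▁", while BERT
        # marks word-INTERNAL pieces with "##" — map between the two.
        if token.startswith("▁"):
            tokens.append(token[1:] or token)
        else:
            tokens.append("##" + token)
    return tokens

vocab = sp_vocab_to_bert(["<unk>\t0", "▁আমি\t-3.1", "য\t-4.2"])
print(vocab)
```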

Training Details

  • Bangla-Bert was trained with code provided in Google BERT's github repository (https://github.com/google-research/bert)
  • Currently released model follows bert-base-uncased model architecture (12-layer, 768-hidden, 12-heads, 110M parameters)
  • Total Training Steps: 1 Million
  • The model was trained on a single Google Cloud GPU

Evaluation Results

LM Evaluation Results

After training for 1 million steps, here are the evaluation results.

```
global_step = 1000000
loss = 2.2406516
masked_lm_accuracy = 0.60641736
masked_lm_loss = 2.201459
next_sentence_accuracy = 0.98625
next_sentence_loss = 0.040997364
perplexity = numpy.exp(2.2406516) = 9.393331287442784
Loss for final step: 2.426227
```
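The perplexity figure above is simply the exponential of the reported loss:

```py
import math

# Perplexity of a language model is exp(cross-entropy loss);
# here applied to the reported evaluation loss.
loss = 2.2406516
perplexity = math.exp(loss)
print(round(perplexity, 1))  # ≈ 9.4
```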

Downstream Task Evaluation Results

  • Evaluation on Bengali Classification Benchmark Datasets

Huge thanks to Nick Doiron for providing evaluation results for the classification task. He used the Bengali Classification Benchmark datasets. Compared to Nick's Bengali Electra and multilingual BERT, Bangla BERT Base achieves state-of-the-art results. Here is the evaluation script. Check the comparison between Bangla-BERT and other recent Bengali BERT models here

| Model | Sentiment Analysis | Hate Speech Task | News Topic Task | Average |
| ----- | ------------------ | ---------------- | --------------- | ------- |
| mBERT | 68.15 | 52.32 | 72.27 | 64.25 |
| Bengali Electra | 69.19 | 44.84 | 82.33 | 65.45 |
| Bangla BERT Base | 70.37 | 71.83 | 89.19 | 77.13 |
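Each model's Average column is the plain arithmetic mean of its three task scores:

```py
# Average column = mean of the three task scores per model.
scores = {
    "mBERT": [68.15, 52.32, 72.27],
    "Bengali Electra": [69.19, 44.84, 82.33],
    "Bangla BERT Base": [70.37, 71.83, 89.19],
}
for model, vals in scores.items():
    print(model, round(sum(vals) / len(vals), 2))
```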

We evaluated Bangla-BERT-Base on the Wikiann Bengali NER dataset along with three other benchmark models (mBERT, XLM-R, Indic-BERT).
After training each model for 5 epochs, Bangla-BERT-Base placed third, with mBERT first and XLM-R second.

| Base Pre-trained Model | F1 Score | Accuracy |
| ---------------------- | -------- | -------- |
| mBERT-uncased | 97.11 | 97.68 |
| XLM-R | 96.22 | 97.03 |
| Indic-BERT | 92.66 | 94.74 |
| Bangla-BERT-Base | 95.57 | 97.49 |

All four models were trained with the transformers token-classification notebook. You can find all model evaluation results here

You can also check the paper list below; these papers evaluated this model on their datasets.

NB: If you use this model for any NLP task, please share the evaluation results with us and we will add them here.

Check Bangla BERT Visualize

bertviz

How to Use

Bangla BERT Tokenizer

```py
from transformers import AutoTokenizer, AutoModel

bnbert_tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
text = "আমি বাংলায় গান গাই।"
bnbert_tokenizer.tokenize(text)
# ['আমি', 'বাংলা', '##য', 'গান', 'গাই', '।']
```

MASK Generation

You can use this model directly with a pipeline for masked language modeling:

```py
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("sagorsarker/bangla-bert-base")
tokenizer = BertTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গাই।"):
    print(pred)

# {'sequence': '[CLS] আমি বাংলায গান গাই । [SEP]', 'score': 0.13404667377471924, 'token': 2552, 'token_str': 'গান'}
```

Author

Sagor Sarker

Acknowledgements

  • Thanks to Google TensorFlow Research Cloud (TFRC) for providing the free GPU credits — thank you!
  • Thanks to everyone around us who is always helping us build something for Bengali.

Reference

  • https://github.com/google-research/bert

Citation

If you find this model helpful, please cite this.

```
@misc{Sagor_2020,
  title  = {BanglaBERT: Bengali Mask Language Model for Bengali Language Understanding},
  author = {Sagor Sarker},
  year   = {2020},
  url    = {https://github.com/sagorbrur/bangla-bert}
}
```

Owner

  • Name: Sagor Sarker
  • Login: sagorbrur
  • Kind: user
  • Location: Dhaka, Bangladesh

An enthusiastic NLP/AI/ML practitioner.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Sarker"
  given-names: "Sagor"
title: "bangla-bert"
version: 1.0.0
date-released: 2020-09-08
url: "https://github.com/sagorbrur/bangla-bert"

GitHub Events

Total
  • Watch event: 7
  • Push event: 2
  • Fork event: 1
Last Year
  • Watch event: 7
  • Push event: 2
  • Fork event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 54
  • Total Committers: 1
  • Avg Commits per committer: 54.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 2
  • Committers: 1
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Sagor Sarker s****2@g****m 54

Issues and Pull Requests

Last synced: 7 months ago