bangla-bert

Bangla-Bert is a pretrained BERT model for the Bengali language

https://github.com/sagorbrur/bangla-bert

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.5%) to scientific vocabulary

Keywords

bangla bangla-nlp bert lm nlp transformers
Last synced: 6 months ago

Repository

Bangla-Bert is a pretrained BERT model for the Bengali language

Basic Info
Statistics
  • Stars: 80
  • Watchers: 4
  • Forks: 24
  • Open Issues: 0
  • Releases: 0
Topics
bangla bangla-nlp bert lm nlp transformers
Created over 5 years ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md

Bangla BERT Base

It has been a long journey. Here is our Bangla-Bert! It is now available on the Hugging Face model hub.

Bangla-Bert-Base is a pretrained language model for Bengali, trained with masked language modeling as described in BERT and its GitHub repository

NB: If you use this model for any NLP task, please share the evaluation results with us and we will add them here.

Download Model

| Model | TF Version | PyTorch Version | Vocab |
| ----- | ---------- | --------------- | ----- |
| Bangla BERT Base | ----- | Huggingface Hub | Vocab |

Pretrain Corpus Details

Corpus was downloaded from two main sources:

After downloading these corpora, we preprocessed them into BERT format: one sentence per line, with an extra blank line between documents.

```
sentence 1
sentence 2

sentence 1
sentence 2
```
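The document-to-pretraining-format conversion described above can be sketched in plain Python; splitting on the Bengali full stop "।" is a naive stand-in for whatever sentence splitter the authors actually used:

```py
# Convert raw documents to BERT pretraining format:
# one sentence per line, a blank line between documents.
# Splitting on "।" (dari) is a naive Bengali sentence
# splitter, used here only for illustration.

def to_bert_format(documents):
    lines = []
    for doc in documents:
        sentences = [s.strip() for s in doc.split("।") if s.strip()]
        lines.extend(s + " ।" for s in sentences)
        lines.append("")  # blank line separates documents
    return "\n".join(lines)

docs = ["প্রথম বাক্য। দ্বিতীয় বাক্য।", "নতুন দলিলের বাক্য।"]
print(to_bert_format(docs))
```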

Building Vocab

We used the BNLP package to train a Bengali SentencePiece model with a vocab size of 102025. We then preprocessed the output vocab file into BERT format. Our final vocab file is available at https://github.com/sagorbrur/bangla-bert and also on the Hugging Face model hub.
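The "preprocess the output vocab file as BERT format" step might look like the hypothetical sketch below. The special-token ordering and the `▁`-to-`##` piece-marker mapping are common conventions, not the authors' exact script:

```py
# Hypothetical sketch: convert a SentencePiece .vocab file
# (token<TAB>score per line) into a BERT-style vocab.txt list.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def sp_vocab_to_bert(sp_vocab_lines):
    tokens = list(SPECIAL_TOKENS)
    for line in sp_vocab_lines:
        token = line.split("\t")[0]
        if token in ("<unk>", "<s>", "</s>"):  # drop SentencePiece specials
            continue
        # SentencePiece marks word-INITIAL pieces with "▁", while BERT
        # marks word-INTERNAL pieces with "##" — map between the two.
        if token.startswith("▁"):
            tokens.append(token[1:] or token)
        else:
            tokens.append("##" + token)
    return tokens

vocab = sp_vocab_to_bert(["<unk>\t0", "▁আমি\t-3.1", "য\t-4.2"])
print(vocab)
```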

Training Details

  • Bangla-Bert was trained with code provided in Google BERT's github repository (https://github.com/google-research/bert)
  • Currently released model follows bert-base-uncased model architecture (12-layer, 768-hidden, 12-heads, 110M parameters)
  • Total Training Steps: 1 Million
  • The model was trained on a single Google Cloud GPU

Evaluation Results

LM Evaluation Results

After training for 1 million steps, here are the evaluation results.

```
global_step = 1000000
loss = 2.2406516
masked_lm_accuracy = 0.60641736
masked_lm_loss = 2.201459
next_sentence_accuracy = 0.98625
next_sentence_loss = 0.040997364
perplexity = numpy.exp(2.2406516) = 9.393331287442784
Loss for final step: 2.426227
```
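The perplexity figure above is simply the exponential of the reported loss:

```py
import math

# Perplexity of a language model is exp(cross-entropy loss);
# here applied to the reported evaluation loss.
loss = 2.2406516
perplexity = math.exp(loss)
print(round(perplexity, 1))  # ≈ 9.4
```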

Downstream Task Evaluation Results

  • Evaluation on Bengali Classification Benchmark Datasets

Huge thanks to Nick Doiron for providing evaluation results for the classification task. He used the Bengali Classification Benchmark datasets. Compared to Nick's Bengali Electra and multilingual BERT, Bangla BERT Base achieves state-of-the-art results. Here is the evaluation script. Check the comparison between Bangla-BERT and other recent Bengali BERT models here

| Model | Sentiment Analysis | Hate Speech Task | News Topic Task | Average |
| ----- | ------------------ | ---------------- | --------------- | ------- |
| mBERT | 68.15 | 52.32 | 72.27 | 64.25 |
| Bengali Electra | 69.19 | 44.84 | 82.33 | 65.45 |
| Bangla BERT Base | 70.37 | 71.83 | 89.19 | 77.13 |
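Each model's Average column is the plain arithmetic mean of its three task scores:

```py
# Average column = mean of the three task scores per model.
scores = {
    "mBERT": [68.15, 52.32, 72.27],
    "Bengali Electra": [69.19, 44.84, 82.33],
    "Bangla BERT Base": [70.37, 71.83, 89.19],
}
for model, vals in scores.items():
    print(model, round(sum(vals) / len(vals), 2))
```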

We evaluated Bangla-BERT-Base on the Wikiann Bengali NER dataset along with three other benchmark models (mBERT, XLM-R, Indic-BERT).
After training each model for 5 epochs, Bangla-BERT-Base placed third, with mBERT first and XLM-R second.

| Base Pre-trained Model | F1 Score | Accuracy |
| ---------------------- | -------- | -------- |
| mBERT-uncased | 97.11 | 97.68 |
| XLM-R | 96.22 | 97.03 |
| Indic-BERT | 92.66 | 94.74 |
| Bangla-BERT-Base | 95.57 | 97.49 |

All four models were trained with the transformers token-classification notebook. You can find all model evaluation results here

You can also check the paper list below; these papers evaluated this model on their datasets.

NB: If you use this model for any NLP task, please share the evaluation results with us and we will add them here.

Check Bangla BERT Visualize

bertviz

How to Use

Bangla BERT Tokenizer

```py
from transformers import AutoTokenizer, AutoModel

bnbert_tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
text = "আমি বাংলায় গান গাই।"
bnbert_tokenizer.tokenize(text)
# ['আমি', 'বাংলা', '##য', 'গান', 'গাই', '।']
```

MASK Generation

You can use this model directly with a pipeline for masked language modeling:

```py
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("sagorsarker/bangla-bert-base")
tokenizer = BertTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গাই।"):
    print(pred)

# {'sequence': '[CLS] আমি বাংলায গান গাই । [SEP]', 'score': 0.13404667377471924, 'token': 2552, 'token_str': 'গান'}
```

Author

Sagor Sarker

Acknowledgements

  • Thanks to Google TensorFlow Research Cloud (TFRC) for providing the free GPU credits — thank you!
  • Thanks to everyone around us who is always helping us build something for Bengali.

Reference

  • https://github.com/google-research/bert

Citation

If you find this model helpful, please cite this.

```
@misc{Sagor_2020,
  title  = {BanglaBERT: Bengali Mask Language Model for Bengali Language Understanding},
  author = {Sagor Sarker},
  year   = {2020},
  url    = {https://github.com/sagorbrur/bangla-bert}
}
```

Owner

  • Name: Sagor Sarker
  • Login: sagorbrur
  • Kind: user
  • Location: Dhaka, Bangladesh

An enthusiastic NLP/AI/ML practitioner.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Sarker"
  given-names: "Sagor"
title: "bangla-bert"
version: 1.0.0
date-released: 2020-09-08
url: "https://github.com/sagorbrur/bangla-bert"

GitHub Events

Total
  • Watch event: 7
  • Push event: 2
  • Fork event: 1
Last Year
  • Watch event: 7
  • Push event: 2
  • Fork event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 54
  • Total Committers: 1
  • Avg Commits per committer: 54.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 2
  • Committers: 1
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Sagor Sarker s****2@g****m 54

Issues and Pull Requests

Last synced: 7 months ago