bert-tagalog-pos-tagger
Fine-tuned BERT Tagalog Base Uncased model for Filipino part of speech tagging
Repository
Basic Info
- Host: GitHub
- Owner: syke9p3
- Language: Python
- Default Branch: main
- Homepage: https://huggingface.co/spaces/syke9p3/bert-tagalog-base-uncased-part-of-speech-tagger
- Size: 400 KB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
BERT Tagalog Part of Speech Tagger (BERTTPOST)
Cite this repository
Saya-ang, K., Hamor, M. G., Gozum, D. J., & Mabansag, R. K. Bidirectional Encoder Representation from Transformer Tagalog Part of Speech Tagger [Computer software]. https://github.com/syke9p3/bert-tagalog-pos-tagger

This repository contains the Python training and testing files for fine-tuning the gklmip/bert-tagalog-base-uncased model for Tagalog part-of-speech tagging.
- Developed by: Saya-ang, Kenth G. (@syke9p3) | Gozum, Denise Julianne S. (@Xenoxianne) | Hamor, Mary Grizelle D. (@mnemoria) | Mabansag, Ria Karen B. (@riavx)
- Model type: BERT Tagalog Base Uncased
- Programming Language: Python
- Languages (NLP): Tagalog, Filipino
- Dataset: Sagum et al.'s annotated Tagalog corpora, based on the MGNN tagset convention. The model was trained on 800 sentences and evaluated on 200 sentences.
- Finetuned from model: Jiang et al.'s pre-trained bert-tagalog-base-uncased model
HuggingFace
Try the model: HuggingFace Spaces
Model source code: HuggingFace
Python Libraries
- PyTorch
- Regular Expressions (re)
- Transformers
- scikit-learn (sklearn.metrics)
- Datasets
- tqdm
Dataset and Preprocessing
The model was trained on a corpus of tagged Tagalog sentences. Each word in a sentence is annotated with its corresponding POS tag in the format <TAG word>. To prepare the corpus for training, the following preprocessing steps were performed:
1. Removal of Line Identifier: the line identifier, such as SNT.108970.2066, was removed from each tagged sentence.
2. Symbol Conversion: punctuation symbols such as hyphens, quotes, and commas were converted into special tokens (PMP, PMS, PMC) so that their meaning is preserved during BERT tokenization.
3. Alignment of Tokenization: the BERT-tokenized words were aligned with their corresponding POS tags so that tokenization and tagging stay consistent.
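The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's actual code: the exact line layout, the symbol-to-token mapping in `SYMBOL_TOKENS`, the `toy_wordpiece` stand-in for the BERT tokenizer, and the `IGNORE` label for continuation pieces are all hypothetical. The source names the tokens PMP, PMS, and PMC but does not say which symbol maps to which, and it does not say how continuation pieces are labeled.

```python
import re

# Assumed layout: "SNT.<digits>.<digits>" followed by "<TAG word>" pairs.
LINE_ID = re.compile(r"^\s*SNT\.\d+\.\d+\s*")
TAGGED = re.compile(r"<(\S+)\s+([^<>]+)>")

# Hypothetical mapping, for illustration only.
SYMBOL_TOKENS = {".": "PMP", ",": "PMC", "-": "PMS"}


def strip_line_id(line):
    """Step 1: drop a leading line identifier such as SNT.108970.2066."""
    return LINE_ID.sub("", line)


def parse_tagged(line):
    """Steps 1-2: split a tagged sentence into words and tags,
    replacing known symbols with their special tokens."""
    pairs = TAGGED.findall(strip_line_id(line))
    words = [SYMBOL_TOKENS.get(w.strip(), w.strip()) for _, w in pairs]
    tags = [t for t, _ in pairs]
    return words, tags


def toy_wordpiece(word, max_len=4):
    """Stand-in for a subword tokenizer: splits long words in two pieces."""
    if len(word) <= max_len:
        return [word]
    return [word[:max_len], "##" + word[max_len:]]


def align_tags(words, tags, tokenize, ignore_label="IGNORE"):
    """Step 3: give each word's tag to its first subword piece and mark
    the remaining pieces so they can be skipped when computing the loss."""
    tokens, labels = [], []
    for word, tag in zip(words, tags):
        pieces = tokenize(word) or [word]
        tokens.extend(pieces)
        labels.extend([tag] + [ignore_label] * (len(pieces) - 1))
    return tokens, labels


# Illustrative sentence; the tags are examples, not taken from the corpus.
line = "SNT.108970.2066 <NNC bata> <PMC ,> <VBTS kumain> <PMP .>"
words, tags = parse_tagged(line)
tokens, labels = align_tags(words, tags, toy_wordpiece)
```

Labeling only the first piece of each word is one common convention for token classification; labeling every piece with the word's tag is another, and the repository may use either.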
Training
The BERT Tagalog POS Tagger was trained using the PyTorch library with the following hyperparameters:
| Hyperparameter  | Value |
|-----------------|-------|
| Batch Size      | 8     |
| Training Epochs | 5     |
| Learning Rate   | 2e-5  |
| Optimizer       | Adam  |
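To make the scale of training concrete, the hyperparameters can be collected in a small config object and combined with the 800 training sentences stated earlier. This is a sketch under assumptions: `TrainConfig` and `total_steps` are hypothetical names, and it assumes one training example per sentence with the final partial batch kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters from the table above (field names are hypothetical)."""
    batch_size: int = 8
    epochs: int = 5
    learning_rate: float = 2e-5
    optimizer: str = "Adam"  # in PyTorch: torch.optim.Adam(params, lr=...)


def total_steps(num_sentences, cfg):
    """Number of optimizer updates, assuming one example per sentence."""
    steps_per_epoch = math.ceil(num_sentences / cfg.batch_size)
    return steps_per_epoch * cfg.epochs


cfg = TrainConfig()
steps = total_steps(800, cfg)
print(steps)  # 100 steps per epoch * 5 epochs = 500
```

Under these assumptions the model sees 500 parameter updates in total.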
Inference
Test sentences went through almost the same preprocessing and tokenization steps as in training, except that no POS tags needed to be extracted from the sentence. The trained model was loaded to generate tags for the input sentence, and Gradio was used to provide an interface for displaying the POS tag results.
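Because BERT predicts one label per subword piece, the predictions have to be mapped back to whole words before they are displayed. The sketch below shows one way to do this, assuming the first piece of each word carries its tag. The function name, the `id2tag` mapping, and the sample values are hypothetical; `word_ids` follows the shape returned by the `word_ids()` method of Hugging Face fast tokenizers, where special tokens map to `None`.

```python
def collapse_to_words(word_ids, predicted_ids, id2tag):
    """Keep the prediction of the first subword piece of each word."""
    word_tags = {}
    for word_id, pred in zip(word_ids, predicted_ids):
        # Skip special tokens and continuation pieces of a word already seen.
        if word_id is None or word_id in word_tags:
            continue
        word_tags[word_id] = id2tag[pred]
    return [word_tags[i] for i in sorted(word_tags)]


# Illustrative values only: 4 words, the third split into two pieces.
id2tag = {0: "PAD", 1: "NNC", 2: "PMC", 3: "VBTS", 4: "PMP"}
word_ids = [None, 0, 1, 2, 2, 3, None]
predicted_ids = [0, 1, 2, 3, 1, 4, 0]
result = collapse_to_words(word_ids, predicted_ids, id2tag)
```

Here `result` holds one tag per input word, which is the form a display interface would need.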
Owner
- Login: syke9p3
- Kind: user
- Repositories: 2
- Profile: https://github.com/syke9p3
Citation (citation.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
  Bidirectional Encoder Representation from Transformer
  Tagalog Part of Speech Tagger
message: 'If you use this software, please cite it as below.'
type: software
authors:
  - given-names: Kenth
    family-names: Saya-ang
    email: gargarkenth93@gmail.com
  - given-names: Mary Grizelle
    family-names: Hamor
  - given-names: Denise Julianne
    family-names: Gozum
  - given-names: Ria Karen
    family-names: Mabansag
repository-code: 'https://github.com/syke9p3/bert-tagalog-pos-tagger'
abstract: >-
  This model addresses the need for an efficient and
  accurate Part-of-Speech (POS) tagger for the Filipino
  language, by utilizing the Bidirectional Encoder
  Representations from Transformers (BERT) model’s
  capability for contextual analysis of languages. The
  methodology involved fine-tuning Jiang’s pre-trained
  Tagalog BERT Base Uncased model for POS tagging,
  subsequently, resulting in an impressive accuracy of
  96.4835%, greatly highlighting BERT's ability to capture
  the syntactic structures and contextual nuances found in
  the Filipino language. Nevertheless, concerns about
  potential overfitting arose, limiting the model’s
  generalizability beyond the specific dataset used for
  training and evaluation.
keywords:
  - ' Natural Language Processing '
  - Bidirectional Encoder Representation of Transformers
  - Part-of-Speech Tagging
  - Filipino