bert-tagalog-pos-tagger

Fine-tuned BERT Tagalog Base Uncased model for Filipino part of speech tagging

https://github.com/syke9p3/bert-tagalog-pos-tagger


Repository

Basic Info
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 3 years ago · Last pushed almost 2 years ago

README.md

BERT Tagalog Part of Speech Tagger (BERTTPOST)

Cite this repository

Saya-ang, K., Hamor, M. G., Gozum, D. J., & Mabansag, R. K. Bidirectional Encoder Representation from Transformer Tagalog Part of Speech Tagger [Computer software]. https://github.com/syke9p3/bert-tagalog-pos-tagger

https://github.com/syke9p3/bert-tagalog-pos-tagger/main/BERTTPOST%20Screenshot.jpg?raw=true

This repository contains the training and testing Python files for fine-tuning the gklmip/bert-tagalog-base-uncased model for Tagalog part-of-speech tagging.

  • Developed by: Saya-ang, Kenth G. (@syke9p3) | Gozum, Denise Julianne S. (@Xenoxianne) | Hamor, Mary Grizelle D. (@mnemoria) | Mabansag, Ria Karen B. (@riavx)
  • Model type: BERT Tagalog Base Uncased
  • Programming Language: Python
  • Languages (NLP): Tagalog, Filipino
  • Dataset: Sagum et al.'s annotated Tagalog corpora, based on the MGNN tagset convention. The model was trained on 800 sentences and evaluated on 200 sentences.
  • Fine-tuned from model: Jiang et al.'s pre-trained bert-tagalog-base-uncased model

HuggingFace

Try the model: HuggingFace Spaces

Model source code: HuggingFace

Python Libraries

  1. PyTorch
  2. Regular Expressions
  3. Transformers
  4. SKLearn Metrics
  5. Datasets
  6. tqdm

Dataset and Preprocessing

The corpus consists of tagged sentences in the Tagalog language, with each word annotated with its corresponding POS tag in the format <TAG word>. To prepare the corpus for training, the following preprocessing steps were performed:

  1. Removal of line identifier: the line identifier, such as SNT.108970.2066, was removed from each tagged sentence.
  2. Symbol conversion: certain special symbols such as hyphens, quotes, and commas were converted into special tokens (PMP, PMS, PMC) so that their meaning is preserved during BERT tokenization.
  3. Alignment of tokenization: the BERT-tokenized words and their corresponding POS tags were aligned so that tokenization and tagging stay consistent.
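The three preprocessing steps can be sketched as follows. This is a minimal illustration and not the repository's actual code: the identifier pattern, the bracketed spelling of the special tokens, the symbol-to-token mapping, the sample tags, and the function names `parse_tagged_sentence` and `align_tags` are all assumptions made for the example.

```python
import re

# Assumption: line identifiers look like "SNT.108970.2066" at the start of a line.
LINE_ID = re.compile(r"^\s*SNT\.\d+\.\d+\s*")
# Matches one "<TAG word>" unit: a tag without whitespace, then the word.
TAGGED = re.compile(r"<([^\s<>]+)\s+([^<>]+)>")

# Hypothetical symbol-to-token mapping, for illustration only.
SYMBOL_TOKENS = {".": "[PMP]", ",": "[PMC]", "-": "[PMS]", '"': "[PMS]"}


def parse_tagged_sentence(line):
    """Return (words, tags) extracted from one tagged corpus line."""
    line = LINE_ID.sub("", line)  # step 1: drop the line identifier
    words, tags = [], []
    for tag, word in TAGGED.findall(line):
        word = word.strip()
        # step 2: replace a bare symbol with its special token
        words.append(SYMBOL_TOKENS.get(word, word))
        tags.append(tag)
    return words, tags


def align_tags(words, tags, tokenize):
    """Step 3: repeat each word's tag for every subword piece of that word."""
    tokens, token_tags = [], []
    for word, tag in zip(words, tags):
        pieces = tokenize(word)
        tokens.extend(pieces)
        token_tags.extend([tag] * len(pieces))
    return tokens, token_tags
```

Here `tokenize` is any callable that splits one word into subword pieces; with a real BERT tokenizer the special tokens would also need to be registered so that they are not split further.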

Training

The BERT Tagalog POS Tagger was trained using the PyTorch library with the following hyperparameters:

| Hyperparameter | Value |
|----------------|-------|
| Batch Size | 8 |
| Training Epochs | 5 |
| Learning Rate | 2e-5 |
| Optimizer | Adam |
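To give a sense of the training budget these settings imply, the sketch below collects the hyperparameters in a small config object and counts optimizer steps for the 800 training sentences. The names `TrainConfig` and `total_optimizer_steps` are hypothetical, and the count assumes one optimizer step per batch with no gradient accumulation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values taken from the hyperparameter table.
    batch_size: int = 8
    epochs: int = 5
    learning_rate: float = 2e-5
    optimizer: str = "Adam"


def total_optimizer_steps(num_sentences, cfg):
    """Number of optimizer steps, assuming one step per batch."""
    steps_per_epoch = math.ceil(num_sentences / cfg.batch_size)
    return steps_per_epoch * cfg.epochs


cfg = TrainConfig()
# 800 sentences / 8 per batch = 100 steps per epoch, times 5 epochs
print(total_optimizer_steps(800, cfg))  # 500
```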

Inference

For test sentences, nearly the same preprocessing and tokenization steps as in training were performed, except that no POS tags need to be extracted from the input. The trained model was loaded to generate tags for the input sentence, and Gradio was used to provide an interface for displaying the POS tag results.
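Because BERT predicts one tag per subword token, the predictions have to be mapped back to whole words before they are displayed. The sketch below shows one common way to do this, keeping the prediction of the first subword of each word. It assumes WordPiece-style "##" continuation prefixes; the function name `merge_subword_predictions` and the sample tags are hypothetical, and the repository may handle this step differently.

```python
def merge_subword_predictions(tokens, predicted_tags):
    """Collapse per-subword predictions into (word, tag) pairs.

    The tag predicted for the first subword of a word is kept;
    tags predicted for "##" continuation pieces are ignored.
    """
    results = []
    for token, tag in zip(tokens, predicted_tags):
        if token.startswith("##") and results:
            word, first_tag = results[-1]
            results[-1] = (word + token[2:], first_tag)  # glue the piece back on
        else:
            results.append((token, tag))
    return results


pairs = merge_subword_predictions(
    ["kum", "##ain", "ako", "[PMP]"],
    ["VBTS", "VBTS", "PRS", "PMP"],
)
print(pairs)
```

The resulting list of (word, tag) pairs is the kind of structure an interface layer could render directly.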

Owner

  • Login: syke9p3
  • Kind: user

Citation (citation.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Bidirectional Encoder Representation from Transformer
  Tagalog Part of Speech Tagger
message: 'If you use this software, please cite it as below.'
type: software
authors:
  - given-names: Kenth
    family-names: Saya-ang
    email: gargarkenth93@gmail.com
  - given-names: Mary Grizelle
    family-names: Hamor
  - given-names: Denise Julianne
    family-names: Gozum
  - given-names: Ria Karen
    family-names: Mabansag
repository-code: 'https://github.com/syke9p3/bert-tagalog-pos-tagger'
abstract: >-
  This model addresses the need for an efficient and
  accurate Part-of-Speech (POS) tagger for the Filipino
  language, by utilizing the Bidirectional Encoder
  Representations from Transformers (BERT) model’s
  capability for contextual analysis of languages. The
  methodology involved fine-tuning Jiang’s pre-trained
  Tagalog BERT Base Uncased model for POS tagging,
  subsequently, resulting in an impressive accuracy of
  96.4835%, greatly highlighting BERT's ability to capture
  the syntactic structures and contextual nuances found in
  the Filipino language. Nevertheless, concerns about
  potential overfitting arose, limiting the model’s
  generalizability beyond the specific dataset used for
  training and evaluation. 
keywords:
  - ' Natural Language Processing '
  - Bidirectional Encoder Representation of Transformers
  - Part-of-Speech Tagging
  - Filipino
