bert-sms-classification

My second project in Natural Language Processing (NLP), where I fine-tuned a bert-base-uncased model to classify spam SMS messages. This is a huge improvement over https://github.com/fzn0x/bert-indonesian-english-hate-comments.

https://github.com/fzn0x/bert-sms-classification

Science Score: 31.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.9%) to scientific vocabulary

Scientific Fields

Artificial Intelligence and Machine Learning (Computer Science) - 38% confidence
Last synced: 4 months ago

Repository


Basic Info
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 9 months ago · Last pushed 9 months ago
Metadata Files
Readme · License · Citation

README.md

A fine-tuned bert-base-uncased pre-trained model to classify spam SMS messages.

My second project in Natural Language Processing (NLP), where I fine-tuned a bert-base-uncased model to classify spam SMS messages. This is a huge improvement over https://github.com/fzn0x/bert-indonesian-english-hate-comments.

How to use this model?

```py
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained('fzn0x/bert-spam-classification-model')
model = BertForSequenceClassification.from_pretrained('fzn0x/bert-spam-classification-model')
```

Check scripts/predict.py for a full example (you only need to modify the from_pretrained argument); a rough sketch follows below.
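
For orientation, here is a minimal prediction sketch built on the snippet above. The example message and the label order (0 = ham, 1 = spam) are assumptions on my part; treat scripts/predict.py as the authoritative version.

```py
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('fzn0x/bert-spam-classification-model')
model = BertForSequenceClassification.from_pretrained('fzn0x/bert-spam-classification-model')
model.eval()

# Hypothetical input; any SMS text works here
text = "Congratulations! You won a free prize, click here to claim."
inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Assumption: index 1 = spam, index 0 = ham
pred = logits.argmax(dim=-1).item()
print('spam' if pred == 1 else 'ham')
```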

✅ Install requirements

Install the required dependencies:

```sh
pip install --upgrade pip
pip install -r requirements.txt
```

✅ Add BERT virtual env

Run the commands below:

```sh
# ✅ Create and activate a virtual environment
python -m venv bert-env
source bert-env/bin/activate  # On Windows use: bert-env\Scripts\activate
```

✅ Install CUDA

Check if your GPU supports CUDA:

```sh
nvidia-smi
```

Then:

```sh
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
```

🔧 How to use

  • Check your device and CUDA availability:

```sh
python check_device.py
```

:warning: Using the CPU is not advisable; check your CUDA availability first.
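
For context, a device check like this typically boils down to the following minimal sketch (the actual check_device.py may differ):

```py
import torch

# Pick CUDA when available, otherwise fall back to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
if device.type == 'cuda':
    print(f'GPU: {torch.cuda.get_device_name(0)}')
```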

  • Train the model:

```sh
python scripts/train.py
```

:warning: Remove unneeded checkpoints in models/pretrained after training to save storage.
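
To give a sense of what the training step does, below is a minimal fine-tuning sketch. The CSV column names (v1 = label, v2 = text) and the hyperparameters are assumptions based on the common Kaggle layout of this dataset; scripts/train.py is the authoritative implementation.

```py
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

# Assumed column layout: v1 = label ('ham'/'spam'), v2 = message text
df = pd.read_csv('data/spam.csv', encoding='latin-1')
texts = df['v2'].tolist()
labels = (df['v1'] == 'spam').astype(int).tolist()

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

class SmsDataset(Dataset):
    """Wraps tokenized SMS texts and integer labels for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item['labels'] = torch.tensor(self.labels[i])
        return item

# Hyperparameters here are illustrative, not the project's actual settings
args = TrainingArguments(output_dir='models/pretrained',
                         num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=SmsDataset(texts, labels)).train()
```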

  • Run prediction:

```sh
python scripts/predict.py
```

✅ Dataset location: data/spam.csv. Modify the dataset to enhance the model for your needs; see the sketch below.
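
As a starting point for extending the data, something like the following could append a new labeled message. The v1/v2 column names and latin-1 encoding are assumptions based on the common Kaggle export of this dataset; verify against the actual data/spam.csv.

```py
import pandas as pd

# Load the dataset (encoding assumed from the typical Kaggle export)
df = pd.read_csv('data/spam.csv', encoding='latin-1')
print(df.head())

# Hypothetical extra training example
new_row = pd.DataFrame([{'v1': 'spam', 'v2': 'WINNER!! Claim your free prize now'}])
pd.concat([df, new_row], ignore_index=True).to_csv('data/spam.csv', index=False)
```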

📚 Citations

If you use this repository or its ideas, please cite the following:

See citations.bib for full BibTeX entries.

  • Wolf et al., "Transformers: State-of-the-Art Natural Language Processing", EMNLP 2020 (ACL Anthology).
  • Pedregosa et al., "Scikit-learn: Machine Learning in Python", JMLR 2011.
  • Almeida & Gómez Hidalgo, "SMS Spam Collection v.1", UCI Machine Learning Repository, 2011 (Kaggle).

🧠 Credits and Libraries Used

License and Usage

Licensed under the MIT License.


Leave a ⭐ if you find this project helpful; contributions are welcome.


Owner

  • Name: Fauzan
  • Login: fzn0x
  • Kind: user

Citation (citations.bib)

@inproceedings{wolf-etal-2020-transformers,
  title     = {Transformers: State-of-the-Art Natural Language Processing},
  author    = {Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, Remi and Funtowicz, Morgan and Brew, Jamie},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  month     = oct,
  year      = {2020},
  publisher = {Association for Computational Linguistics},
  pages     = {38--45},
  url       = {https://www.aclweb.org/anthology/2020.emnlp-demos.6}
}

@article{scikit-learn,
  title   = {Scikit-learn: Machine Learning in Python},
  author  = {Pedregosa, Fabian and Varoquaux, Ga{\"e}l and Gramfort, Alexandre and Michel, Vincent and Thirion, Bertrand and Grisel, Olivier and Blondel, Mathieu and Prettenhofer, Peter and Weiss, Ron and Dubourg, Vincent and Vanderplas, Jake and Passos, Alexandre and Cournapeau, David and Brucher, Matthieu and Perrot, Matthieu and Duchesnay, {\'E}douard},
  journal = {Journal of Machine Learning Research},
  volume  = {12},
  pages   = {2825--2830},
  year    = {2011}
}

@misc{smsspamcollection,
  author       = {Tiago A. Almeida and José María Gómez Hidalgo},
  title        = {SMS Spam Collection v.1},
  year         = {2011},
  howpublished = {\url{https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset}},
  note         = {UCI Machine Learning Repository}
}

GitHub Events

Total
  • Watch event: 1
  • Push event: 5
  • Fork event: 1
  • Create event: 2
Last Year
  • Watch event: 1
  • Push event: 5
  • Fork event: 1
  • Create event: 2

Dependencies

requirements.txt pypi
  • nltk ==3.8.1
  • pandas ==2.0.1
  • scikit-learn ==1.2.2
  • torch ==2.6.0
  • transformers ==4.43.0