bert-sms-classification
My second project in Natural Language Processing (NLP), where I fine-tuned a bert-base-uncased model to classify spam SMS. This is a huge improvement over https://github.com/fzn0x/bert-indonesian-english-hate-comments.
Science Score: 31.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.9%) to scientific vocabulary
Scientific Fields
Repository
Basic Info
- Host: GitHub
- Owner: fzn0x
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://open.spotify.com/track/6s8WSX1MxNThrot8ThI6fG?si=ee460386b3e54552
- Size: 208 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
A fine-tuned bert-base-uncased model for classifying spam SMS.
My second project in Natural Language Processing (NLP), where I fine-tuned a bert-base-uncased model to classify spam SMS. This is a huge improvement over https://github.com/fzn0x/bert-indonesian-english-hate-comments.
How to use this model?
```py
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained('fzn0x/bert-spam-classification-model')
model = BertForSequenceClassification.from_pretrained('fzn0x/bert-spam-classification-model')
```
See scripts/predict.py for a full example (you only need to modify the argument of from_pretrained).
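Once the tokenizer and model are loaded, a prediction can be as short as the sketch below. It assumes label index 1 means spam; confirm the mapping against scripts/predict.py:

```py
# Minimal inference sketch; assumes 0 = ham, 1 = spam,
# which may differ from the repo's actual label mapping.
import torch

text = "Congratulations! You won a free prize. Reply WIN to claim."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()
print("spam" if pred == 1 else "ham")
```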
✅ Install requirements
Install the required dependencies:
```sh
pip install --upgrade pip
pip install -r requirements.txt
```
✅ Add BERT virtual env
Create and activate a virtual environment:
```sh
python -m venv bert-env
source bert-env/bin/activate  # On Windows use: bert-env\Scripts\activate
```
✅ Install CUDA
Check if your GPU supports CUDA:
```sh
nvidia-smi
```
Then install the CUDA-enabled PyTorch build:
```sh
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False  # PyTorch CUDA allocator setting
```
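To verify that the CUDA build is picked up, a quick check (independent of the repo's check_device.py) can be run:

```py
# Quick sanity check that PyTorch sees the GPU.
import torch

print(torch.__version__)           # expect a +cu121 build, e.g. 2.6.0+cu121
print(torch.cuda.is_available())   # True if the CUDA setup works
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```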
🔧 How to use
- Check your device and CUDA availability:
```sh
python check_device.py
```
:warning: Using the CPU is not advisable; check your CUDA availability first.
- Train the model (a rough sketch of the training flow follows this list):
```sh
python scripts/train.py
```
:warning: Remove unneeded checkpoints in models/pretrained to save storage after training.
- Run prediction:
```sh
python scripts/predict.py
```
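For orientation, here is a minimal sketch of what fine-tuning bert-base-uncased on the spam data can look like with the Hugging Face Trainer API. It is not the repo's actual scripts/train.py, and it assumes the common Kaggle column layout (v1 = label, v2 = text):

```py
# Hypothetical fine-tuning sketch; scripts/train.py in the repo is authoritative.
# Assumes data/spam.csv uses the Kaggle layout: v1 = label (ham/spam), v2 = text.
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (
    BertForSequenceClassification,
    BertTokenizer,
    Trainer,
    TrainingArguments,
)

class SmsDataset(Dataset):
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

df = pd.read_csv("data/spam.csv", encoding="latin-1")
texts = df["v2"].tolist()
labels = (df["v1"] == "spam").astype(int).tolist()  # 0 = ham, 1 = spam

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="models/pretrained",
        num_train_epochs=2,
        per_device_train_batch_size=16,
    ),
    train_dataset=SmsDataset(texts, labels, tokenizer),
)
trainer.train()
model.save_pretrained("models/fine-tuned")
tokenizer.save_pretrained("models/fine-tuned")
```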
✅ Dataset location: data/spam.csv. Modify the dataset to tailor the model to your needs.
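For example, appending your own labeled messages might look like this (again assuming the Kaggle v1/v2 column layout; the repo's copy may differ):

```py
# Append a custom labeled example to the training data.
# Assumes the Kaggle layout: v1 = label (ham/spam), v2 = message text.
import pandas as pd

df = pd.read_csv("data/spam.csv", encoding="latin-1")
new_row = pd.DataFrame([{"v1": "spam", "v2": "WINNER!! Claim your free prize now"}])
df = pd.concat([df, new_row], ignore_index=True)
df.to_csv("data/spam.csv", index=False)
```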
📚 Citations
If you use this repository or its ideas, please cite the following:
See citations.bib for full BibTeX entries.
- Wolf et al., "Transformers: State-of-the-Art Natural Language Processing", EMNLP 2020 (ACL Anthology).
- Pedregosa et al., "Scikit-learn: Machine Learning in Python", JMLR 2011.
- Almeida & Gómez Hidalgo, "SMS Spam Collection v.1", UCI Machine Learning Repository, 2011 (Kaggle link).
🧠 Credits and Libraries Used
- Hugging Face Transformers – model, tokenizer, and training utilities
- scikit-learn – metrics and preprocessing
- Logging silencing inspired by Hugging Face GitHub discussions
- Dataset from UCI SMS Spam Collection
- Inspiration from Kaggle Notebook by Suyash Khare
License and Usage
Licensed under the MIT License.
Leave a ⭐ if you find this project helpful; contributions are welcome.
Owner
- Name: Fauzan
- Login: fzn0x
- Kind: user
- Repositories: 29
- Profile: https://github.com/fzn0x
Citation (citations.bib)
@inproceedings{wolf-etal-2020-transformers,
title = {Transformers: State-of-the-Art Natural Language Processing},
author = {Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, Remi and Funtowicz, Morgan and Brew, Jamie},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
month = oct,
year = {2020},
publisher = {Association for Computational Linguistics},
pages = {38--45},
url = {https://www.aclweb.org/anthology/2020.emnlp-demos.6}
}
@article{scikit-learn,
title = {Scikit-learn: Machine Learning in Python},
author = {Pedregosa, Fabian and Varoquaux, Ga{\"e}l and Gramfort, Alexandre and Michel, Vincent and Thirion, Bertrand and Grisel, Olivier and Blondel, Mathieu and Prettenhofer, Peter and Weiss, Ron and Dubourg, Vincent and Vanderplas, Jake and Passos, Alexandre and Cournapeau, David and Brucher, Matthieu and Perrot, Matthieu and Duchesnay, {\'E}douard},
journal = {Journal of Machine Learning Research},
volume = {12},
pages = {2825--2830},
year = {2011}
}
@misc{smsspamcollection,
author = {Tiago A. Almeida and José María Gómez Hidalgo},
title = {SMS Spam Collection v.1},
year = {2011},
howpublished = {\url{https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset}},
note = {UCI Machine Learning Repository}
}
GitHub Events
Total
- Watch event: 1
- Push event: 5
- Fork event: 1
- Create event: 2
Last Year
- Watch event: 1
- Push event: 5
- Fork event: 1
- Create event: 2
Dependencies
- nltk ==3.8.1
- pandas ==2.0.1
- scikit-learn ==1.2.2
- torch ==2.6.0
- transformers ==4.43.0