https://github.com/cltl-students/csenge_szabo_topicclassification_clientfeedback_governance_domain
Multi-Label Topic Classification of Client Feedback in the Governance Domain
Master's Degree in "Linguistics: Text Mining", VU Amsterdam, 2023/2024.
Overview
This repository belongs to the Master's Thesis Project "Multi-Label Topic Classification of Client Feedback in the Governance Domain" by Csenge Szabó, supervised by Dr. Ilia Markov and Sandra Blok. The project was carried out in collaboration with MarketResponse, a Dutch software company that specializes in Customer Experience analytics and aims to bridge the gap between customer feedback and business performance.
The thesis focuses on the multi-label topic classification of written client feedback collected through surveys in the governance domain. Multi-label topic classification means that a text instance can be assigned more than one topic label from a predefined set of topics. For this purpose, we compare a traditional machine learning classifier (Support Vector Machines) with a more recent transformer-based model (fine-tuned BERT), a model type that currently shows state-of-the-art performance for the majority of Natural Language Processing tasks. Since the topic labels in the dataset are structured into main topics and corresponding subtopics, we experiment with one-step and two-step classification approaches. The former classifies instances for all topic labels at once, while the latter first predicts main topic labels and then subtopic labels. To address the imbalanced nature of the dataset, we explore two data adaptation and data balancing techniques: (i) undersampling, which reduces the prevalence of overrepresented subtopic classes to the average distribution, and (ii) oversampling, which adds synthetic data for underrepresented subtopic classes generated with a generative large language model, GPT-4. We aim to determine the best approach for multi-label topic classification on the provided dataset using a combination of these approaches.
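The difference between the one-step and two-step setups can be sketched as follows. This is a minimal illustration and not the thesis code: the topic hierarchy, the label names and the keyword-based stand-in classifier are all hypothetical, and the sketch assumes that the second step only considers subtopics of the main topics predicted in the first step, which is one plausible reading of the description above.

```python
from typing import Dict, List, Set

# Hypothetical hierarchy: main topic -> its subtopics (not the real label set).
HIERARCHY: Dict[str, List[str]] = {
    "service": ["service/waiting_time", "service/staff"],
    "website": ["website/usability", "website/information"],
}

# Hypothetical keyword per label, standing in for a trained SVM or BERT model.
KEYWORDS: Dict[str, str] = {
    "service": "desk",
    "website": "site",
    "service/waiting_time": "wait",
    "service/staff": "staff",
    "website/usability": "confusing",
    "website/information": "staff",  # deliberately ambiguous keyword
}


def predict(text: str, candidates: List[str]) -> Set[str]:
    """Toy multi-label predictor: return every candidate whose keyword occurs."""
    lowered = text.lower()
    return {label for label in candidates if KEYWORDS[label] in lowered}


def one_step(text: str) -> Set[str]:
    """Predict main topics and subtopics together from one flat label set."""
    all_labels = list(HIERARCHY) + [s for subs in HIERARCHY.values() for s in subs]
    return predict(text, all_labels)


def two_step(text: str) -> Set[str]:
    """Predict main topics first, then only the subtopics of those main topics."""
    mains = predict(text, list(HIERARCHY))
    subs: Set[str] = set()
    for main in mains:
        subs |= predict(text, HIERARCHY[main])
    return mains | subs


feedback = "The staff at the desk made me wait."
print(sorted(one_step(feedback)))
print(sorted(two_step(feedback)))
```

In this toy setup the one-step prediction includes a subtopic whose main topic was not predicted, while the two-step prediction cannot, because the first step's output limits the candidates of the second step.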
The motivation, methodology, results and discussion of the results can be found in the Thesis Report.
Note: Since the data cannot be shared with third parties, it is not published in this repository. Outputs that give an indication of its content have been hidden.
Project structure
Thesis Project Structure
└───code
│       preprocessing.py
│       data_adaptation.py
│       SVMs_1step.py
│       SVMs_2step.py
│       BERT_1step.ipynb
│       BERT_2step.ipynb
│       statistics.ipynb
│       utils.py
│       error_analysis.py
└───data
│       example_dataset.csv
└───figures
│   └───SVMs_1step
│   └───SVMs_2step
│   └───SVMs_2step_oversampled
│   └───SVMs_2step_undersampled
│   └───BERT_1step
│   └───BERT_2step
│   └───BERT_2step_oversampled
│   └───BERT_2step_undersampled
└───hyperparameters
└───model_predictions
└───models
└───results
│   └───BERT
│   └───SVMs
│   LICENSE
│   README.md
│   requirements.txt
\code
The code folder contains the scripts and notebooks required to reproduce this study. In order to reproduce the experiments, follow the order of the files listed below:
- preprocessing.py removes privacy-sensitive information (names, dates, times, locations, URLs, e-mail addresses) from the dataset. It pre-processes the text by lowercasing and removing stop words, applies the Binary Relevance problem transformation to convert the labels into 0 and 1, and performs a stratified 80-10-10 split (train-validation-test).
- statistics.ipynb generates statistics on the full annotated dataset. It also includes code to inspect the label distribution in the train, validation and test splits.
- data_adaptation.py undersamples the training set for the overrepresented main topics until they reach the average distribution. It can also be used to pre-process the synthetic data generated by GPT-4.
- SVMs_1step.py trains a Support Vector Machines classifier on TF-IDF features, conducts hyper-parameter tuning and predicts the labels on the test set. The script is designed for one-step classification, i.e. main topics and subtopics are learned and predicted in a single step. It can be used with the original, undersampled or oversampled training data; the default is the original training set.
- SVMs_2step.py trains a Support Vector Machines classifier on TF-IDF features, conducts hyper-parameter tuning and predicts the labels on the test set. The script is designed for two-step classification, i.e. the model first predicts main topics and then predicts the corresponding subtopics using the output of the first phase. It can be used with the original, undersampled or oversampled training data; the default is the original training set.
- BERT_1step.ipynb fine-tunes a pre-trained BERT model on the training set, which has to be specified within the notebook by uncommenting certain parts. The default is the original training set. The notebook is designed for one-step classification, i.e. main topics and subtopics are learned and predicted in a single step. The fine-tuned model is then saved to the models folder.
  - This notebook uses the helper functions stored in utils.py.
- BERT_2step.ipynb fine-tunes a pre-trained BERT model on the training set, which has to be specified within the notebook by uncommenting certain parts. The default is the original training set. The notebook is designed for two-step classification, i.e. the model first predicts main topics and then predicts the corresponding subtopics using the output of the first phase. The fine-tuned model is then saved to the models folder.
  - This notebook uses the helper functions stored in utils.py.
- error_analysis.py lets you inspect frequently confused topic labels and look up a specific test instance (by instance ID) to check its gold and predicted labels.
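The text-level pre-processing steps described for preprocessing.py can be sketched with the standard library alone. This is a hedged illustration, not the repository's implementation: the regular expressions, the placeholder tokens, the tiny stop-word list and the function names are assumptions, and the removal of names, dates, times and locations as well as the stratified split are not covered.

```python
import re
from typing import List

# Assumed patterns; the actual script may detect URLs and e-mails differently.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# Tiny illustrative stop-word list (hypothetical, not the list used in the thesis).
STOP_WORDS = {"the", "a", "an", "is", "to", "at", "me", "and"}


def clean(text: str) -> str:
    """Lowercase the text, mask e-mail addresses and URLs, drop stop words."""
    text = text.lower()
    text = EMAIL_RE.sub("[EMAIL]", text)  # mask e-mails before URLs
    text = URL_RE.sub("[URL]", text)
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)


def to_binary(labels_per_instance: List[List[str]], label_set: List[str]) -> List[List[int]]:
    """Turn each instance's topic labels into a 0/1 indicator vector."""
    return [
        [1 if label in labels else 0 for label in label_set]
        for labels in labels_per_instance
    ]


example = "Mail me at jan@example.nl or see https://example.nl/help and the FAQ"
print(clean(example))
print(to_binary([["a", "c"], ["b"]], ["a", "b", "c"]))
```

Each column of the indicator matrix corresponds to one topic label, so every label can be treated as its own binary decision.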
\data
The data folder contains only example_dataset.csv, because the original data cannot be shared under the confidentiality agreement. The CSV file illustrates the structure of the data.
\figures
The figures folder stores the figures, for instance the confusion matrices of each classification approach. It contains one subfolder per approach, each holding the corresponding confusion matrices:
- \BERT_1step
- \BERT_2step
- \BERT_2step_oversampled
- \BERT_2step_undersampled
- \SVMs_1step
- \SVMs_2step
- \SVMs_2step_oversampled
- \SVMs_2step_undersampled
\hyperparameters
The hyperparameters folder stores pickle files that contain the optimal hyper-parameter settings for the one-step and two-step SVM models.
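Storing and reloading such settings with pickle can look roughly like this. The sketch is an assumption about the workflow, not code from the repository: the file name, the parameter names and their values are hypothetical, and a temporary directory stands in for the hyperparameters folder.

```python
import pickle
import tempfile
from pathlib import Path


def save_params(params: dict, path: Path) -> None:
    """Serialize a dictionary of hyper-parameter settings to a pickle file."""
    with path.open("wb") as fh:
        pickle.dump(params, fh)


def load_params(path: Path) -> dict:
    """Read the settings back; only unpickle files from a trusted source."""
    with path.open("rb") as fh:
        return pickle.load(fh)


best_params = {"C": 1.0, "kernel": "linear"}  # hypothetical values

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "svm_1step_params.pkl"  # hypothetical file name
    save_params(best_params, path)
    restored = load_params(path)

print(restored)
```

Reloading the stored settings lets a later run skip hyper-parameter tuning and train directly with the previously selected values.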
\model_predictions
The model_predictions folder contains the model outputs, i.e. CSV files with the feedback statements from the test data, and their predicted labels. Due to the confidentiality restrictions, the files were not uploaded.
\models
The models folder is a location for the trained models. Due to the confidentiality restrictions and the size of the models, the models were not uploaded.
\results
The results folder contains the results, i.e., the classification reports. The reports can be found in \BERT and \SVMs.
* \BERT
  * test_report_1step.csv
  * test_report_2step.csv
  * test_report_2step_undersampled.csv
  * test_report_2step_oversampled.csv
* \SVMs
  * test_report_1step.csv
  * test_report_2step.csv
  * test_report_2step_undersampled.csv
  * test_report_2step_oversampled.csv
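For readers unfamiliar with multi-label classification reports, the sketch below shows how per-label precision, recall and F1 plus a micro average can be computed from 0/1 indicator matrices. It is an illustration only: the exact columns and averages in the repository's CSV reports are not specified here, and the function names and toy data are assumptions.

```python
from typing import Dict, List


def prf(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall and F1 from raw counts, with 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def label_report(
    y_true: List[List[int]], y_pred: List[List[int]], labels: List[str]
) -> Dict[str, Dict[str, float]]:
    """Score each label column separately, then pool all counts for a micro average."""
    report: Dict[str, Dict[str, float]] = {}
    tp_sum = fp_sum = fn_sum = 0
    for j, label in enumerate(labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        report[label] = prf(tp, fp, fn)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    report["micro avg"] = prf(tp_sum, fp_sum, fn_sum)
    return report


# Toy gold and predicted labels for three instances and two hypothetical topics.
gold = [[1, 0], [1, 1], [0, 1]]
pred = [[1, 0], [0, 1], [1, 1]]
report = label_report(gold, pred, ["topic_a", "topic_b"])
for name, scores in report.items():
    print(name, scores)
```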
requirements.txt
The packages required to run the code in this repository (with Python 3.7.8) are listed in the requirements.txt file and can be installed with pip, e.g. pip install -r requirements.txt.
Thesis Report
MA_Thesis_Csenge_Szabo.pdf
The pdf file contains the full Thesis Report.
References
The code for the conventional machine learning approach was partially inspired by Dr. Piek Vossen, Source Code.
The code to fine-tune the BERT model was adapted from Abhishek Kumar Mishra, who used it for multi-label text classification of toxic data.
Owner
- Name: Computational Lexicology & Terminology Lab
- Login: cltl-students
- Kind: organization
- Email: p.t.j.m.vossen@vu.nl
- Location: Amsterdam
- Website: http://www.cltl.nl/teaching/
- Repositories: 13
- Profile: https://github.com/cltl-students
Thesis and student projects @cltl
Dependencies
- autocorrect ==2.6.1
- iterative-stratification ==0.1.7
- matplotlib ==3.5.3
- nltk ==3.8.1
- numpy ==1.21.6
- openpyxl ==3.1.2
- pandas ==1.3.5
- seaborn ==0.12.2
- sklearn ==1.0.2
- skmultilearn ==0.2.0
- spacy ==3.7.4
- torch ==1.13.1
- tqdm ==4.66.2
- transformers ==4.30.2