boardgames-aspect-extraction

Project "What do you like in boardgames?" of NLP Unimi 2023/2024

https://github.com/ubriacopo/boardgames-aspect-extraction

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.5%) to scientific vocabulary

Keywords

aspect-extraction board-game nlp

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: Ubriacopo
Language: Jupyter Notebook
Default Branch: master
Homepage:
Size: 155 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 3
Releases: 0

Created about 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme Citation

Project #4 of the NLP: What do you like in boardgames?

For this project I chose to develop proposal number 4, as I have a personal interest in its domain: boardgames.

All proposals are present in the repository under resources/

How to run

To run the solution, first install the dependencies listed in the requirements file.
Then install the spaCy models:

python -m spacy download en_core_web_md
python -m spacy download en_core_web_sm
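
Before opening the notebooks, it can help to confirm that the models are actually installed. The snippet below is a minimal sketch that is not part of the repository: it relies on the fact that spaCy models are installed as importable Python packages, and the helper name `missing_packages` is hypothetical.

```python
import importlib.util

# spaCy model package names; adjust the list to the models you need.
REQUIRED_MODELS = ["en_core_web_sm", "en_core_web_md"]


def missing_packages(names):
    """Return the names that cannot be found as importable packages."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Print the download command for every model that is not installed yet.
for name in missing_packages(REQUIRED_MODELS):
    print(f"Missing spaCy model: run `python -m spacy download {name}`")
```

If nothing is printed, both models were found in the current environment.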

No script has been written as I believed notebooks to be better at guiding the thought process.

First, run main/dataset/bgg_corpus_service.ipynb, or download the dataset directly from: https://www.kaggle.com/datasets/jacopofichera/bgg-scrapped-reviews

To run preprocessing, open main/dataset/pre_processing.ipynb. It generates several pre-processed datasets from the original one.

For LDA, run main/lda/final_model.ipynb to train with the best hyperparameter configuration found. The trained model, a Gensim LdaMulticore instance that can be reloaded, is saved under \output in the same directory.
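
Reloading the saved model could look roughly like the sketch below. It assumes Gensim is installed; the helper names `load_lda` and `top_words` are hypothetical, and the model path must be replaced with the file the notebook actually writes, since its name is not stated here.

```python
def load_lda(model_path):
    """Reload a saved Gensim LdaMulticore model from disk."""
    # Imported lazily so this module can be imported without Gensim installed.
    from gensim.models import LdaMulticore

    return LdaMulticore.load(model_path)


def top_words(lda, num_topics=5, num_words=10):
    """Return (topic_id, [(word, weight), ...]) pairs for a few topics."""
    return lda.show_topics(num_topics=num_topics, num_words=num_words, formatted=False)
```

A typical use would be to call `load_lda` with the path of the saved model file and pass the result to `top_words` to inspect what each topic captures.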

For ABAE the process is the same, but in the abae folder. It produces more files: one for the word embeddings model, one for the initialization of the aspect weight matrix before training, and one for the Keras model instance. To load and manipulate the model, refer to the ABAEManager class, which provides methods depending on the output needed (classification or loss evaluation).

Inference is left to be done by hand, but using class #todo you can save it as part of the model output definition, so that it can be reloaded and used with the #todo class to infer the correct labels.

References

My reference paper, I think:

Paper: https://aclanthology.org/P17-1036.pdf
Repo: https://github.com/ruidan/Unsupervised-Aspect-Extraction/blob/master/code/train.py

Another useful reference for an in-depth application:

https://www.kaggle.com/code/nkitgupta/aspect-based-sentiment-analysis
It explains the whole process well, with nice insights on emoji and Unicode normalization.

Approach?

In an unsupervised paradigm for aspect extraction, you don't rely on labeled data. Instead, you can use clustering and topic modeling techniques to identify and extract aspects. Here is how you can approach it:

Data Collection and Preprocessing:
    Collect Data: Gather a large corpus of text related to your domain.
    Preprocess Text: Tokenize the text, remove stop words, and perform other cleaning steps.
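
The preprocessing step can be sketched in plain Python as follows. This is only an illustration, not the pipeline used in the notebooks: the stop-word list is a tiny made-up sample and the example reviews are invented.

```python
import re

# Tiny illustrative stop-word list (an assumption; real pipelines use a fuller one).
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "in", "but", "this", "was", "i"}

# Alphabetic runs, optionally followed by an apostrophe suffix ("don't").
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def preprocess(text):
    """Lowercase, tokenize on alphabetic runs, and drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


# Invented example reviews.
reviews = [
    "The artwork is gorgeous, but the rulebook is confusing.",
    "Setup takes an hour and the box is huge!",
]
corpus = [preprocess(review) for review in reviews]
```

Each entry of `corpus` is then a list of content words that a topic model or an embedding model can consume.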

Text Representation:
    Word Embeddings: Use pre-trained embeddings like Word2Vec, GloVe, or contextual embeddings like BERT embeddings.
    Document Embeddings: Represent each document as a vector, for instance by averaging word embeddings or using sentence embeddings from models like Sentence-BERT.
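
Averaging word embeddings into a document vector can be sketched as below. The 3-dimensional vectors are made-up values for illustration; in practice they would come from a trained model such as Word2Vec or GloVe. The function name `document_vector` is hypothetical.

```python
# Toy 3-dimensional embeddings (made-up values for illustration only).
EMBEDDINGS = {
    "artwork": [1.0, 0.0, 0.0],
    "rulebook": [0.0, 1.0, 0.0],
    "confusing": [0.0, 1.0, 1.0],
}


def document_vector(tokens, embeddings, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together.
    return [sum(component) / len(vectors) for component in zip(*vectors)]


doc = document_vector(["rulebook", "confusing", "unknown"], EMBEDDINGS)
```

Tokens without an embedding are simply skipped, which is one common choice; another is to map them to a shared unknown-word vector.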

Aspect Extraction Techniques: ABAE, LDA
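
To give an idea of what ABAE does differently from plain averaging, the sketch below follows the attention step described in the reference paper: each word is scored against the sentence average through a matrix M, the scores are passed through a softmax, and the sentence embedding is the weighted sum of the word vectors. It is a simplified illustration, not the repository's Keras implementation; M is learned during training in the real model and defaults to the identity here as an assumption.

```python
import math


def attention_sentence_embedding(word_vectors, M=None):
    """Return (attention weights, attention-weighted sentence embedding)."""
    dim = len(word_vectors[0])
    n = len(word_vectors)
    # y_s: plain average of the word vectors in the sentence.
    y = [sum(v[k] for v in word_vectors) / n for k in range(dim)]
    # M . y_s (M is trained in ABAE; identity is assumed when M is None).
    if M is None:
        My = y
    else:
        My = [sum(M[r][c] * y[c] for c in range(dim)) for r in range(dim)]
    # d_i = e_i^T . M . y_s: relevance of each word to the sentence as a whole.
    scores = [sum(v[k] * My[k] for k in range(dim)) for v in word_vectors]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # z_s: weighted sum of the word vectors.
    z = [sum(w * v[k] for w, v in zip(weights, word_vectors)) for k in range(dim)]
    return weights, z


# Two words point in the same direction, one is an outlier.
weights, z = attention_sentence_embedding([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In this toy input the two aligned words receive a higher weight than the outlier, which is the effect the attention mechanism is meant to have on words unrelated to the sentence's aspect.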

Project Setup and Installation

python -m spacy download en_core_web_trf

Owner

Name: Jacopo Fichera
Login: Ubriacopo
Kind: user
Location: Bergamo, Italy

Repositories: 1
Profile: https://github.com/Ubriacopo

SW dev @ Team Quality Srl / CS Major Student @ UNIMI

GitHub Events

Total

Push event: 5

Last Year

Push event: 5

Dependencies

requirements.txt pypi

jupyter *
pandas *
