xsnts-analyzer

Application that scrapes Twitter/X data, performs Polish-language text normalization and lemmatization, builds topic models with MALLET, and runs sentiment analysis for downstream analytics. Created as a project for a master's thesis.

https://github.com/bgnatowski/xsnts-analyzer

Keywords

lda-topic-modeling lemmatization mallet polish-nlp sentiment-analysis spring-boot-3 text-normalization twitter-scraper

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: bgnatowski
License: mit
Language: Java
Default Branch: main
Homepage:
Size: 30.8 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed 7 months ago

Metadata Files

License Citation

https://github.com/bgnatowski/XSNTS-Analyzer/blob/main/

# XSNTS-Analyzer by Bartosz Gnatowski

### EN

XSNTS-Analyzer (X.com Social Network Topic-Sentiment Analyzer) is an application developed for a master's thesis at the Cracow University of Economics.

**Title:**
* EN: *Implementation of an Application for X.com Data Analysis: Topic Modeling and Sentiment Analysis in Public Opinion Research*
* PL (original): *Implementacja aplikacji do analizy danych z platformy X.com: analiza tematyki i sentymentu w badaniu opinii publicznej*

**Author:** Bartosz Gnatowski\
**Thesis Advisor:** dr hab. inż. Janusz Morajda\
**College:** College of Management and Quality Sciences\
**Institute:** Institute of Informatics, Accounting and Controlling\
**Field of Study:** Applied Informatics\
**Specialization:** Intelligent Systems

## Table of Contents

1. [Project Overview](#project-overview)
2. [Key Features](#key-features)
3. [Technology Stack](#technology-stack)
4. [Prerequisites](#prerequisites)
5. [Quick Start](#quick-start)
    * [Running in an IDE](#running-in-an-ide)
    * [Running with Docker Compose](#running-with-docker-compose)
6. [REST API](#rest-api)
7. [Directory Structure](#directory-structure)
8. [Sample Use-Cases](#sample-use-cases)
9. [Roadmap](#roadmap)
10. [License](#license)

## Project Overview

XSNTS-Analyzer is a **back-end microservice** for scraping, cleansing and analysing content published on **X.com** (formerly Twitter).

Main goals:

* Scheduled, automated harvesting of tweets and database storage (Selenium + ADSPower Global).
* Selection and normalization of Polish-language data.
* Automatic **grouping** of tweets into documents (e.g., by hashtag or time window).
* Training **LDA models (MALLET)** to identify key topics.
* Assigning **sentiment** to tweets (simple Polish lexicon or language model).
* Exporting results to **CSV** for further exploration in statistical tools.
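
The grouping step can be pictured with a minimal sketch of hashtag pooling, in which every tweet is appended to one pooled document per hashtag it contains. The class name, method name and hashtag pattern below are illustrative assumptions, not the project's actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative hashtag pooling; not the project's actual implementation. */
final class HashtagPooling {

    // Assumed hashtag shape: '#' followed by Unicode letters, digits or underscores.
    private static final Pattern HASHTAG = Pattern.compile("#[\\p{L}\\p{N}_]+");

    /** Groups tweet texts into one pooled document per lower-cased hashtag. */
    static Map<String, List<String>> poolByHashtag(List<String> tweets) {
        Map<String, List<String>> documents = new LinkedHashMap<>();
        for (String tweet : tweets) {
            // Collect distinct tags first so a repeated hashtag adds the tweet only once.
            Set<String> tags = new LinkedHashSet<>();
            Matcher matcher = HASHTAG.matcher(tweet);
            while (matcher.find()) {
                tags.add(matcher.group().toLowerCase(Locale.ROOT));
            }
            for (String tag : tags) {
                documents.computeIfAbsent(tag, key -> new ArrayList<>()).add(tweet);
            }
        }
        return documents;
    }

    public static void main(String[] args) {
        List<String> tweets = List.of(
                "Debata trwa #Sejm #polityka",
                "Glosowanie jutro #sejm",
                "Tweet bez hashtagu");
        poolByHashtag(tweets).forEach(
                (tag, pooled) -> System.out.println(tag + " -> " + pooled.size() + " tweet(s)"));
    }
}
```

Tweets without any hashtag are simply left out of the result here; a real pipeline might instead fall back to time-window pooling for them.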


## Key Features

| Module | Description |
| :-- | :-- |
| **Scraper** | Fetches public tweets (Selenium + ADSPower Global). |
| **Normalization** | Normalization, @mention anonymization, tokenization, Polish lemmatization. |
| **Topic Modeling** | MALLET `ParallelTopicModel` (LDA, α/β optimized every 10 iterations). |
| **Sentiment** | Rule-based Polish lexicon (SlownikWydzwieku) or model `tabularisai/multilingual-sentiment-analysis`. |
| **Export** | Data export to CSV in `output/csv/`. |
| **REST API** | Endpoints controlling the entire pipeline. |
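
To make the Normalization and Sentiment rows more concrete, here is a rough sketch of mention anonymization, tokenization and rule-based lexicon scoring. The class, the `@anonymized` placeholder, the regular expressions and the four-word lexicon are all hypothetical stand-ins; the project itself uses Morfologik for lemmatization and a full Polish lexicon, neither of which is reproduced here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative normalization and lexicon scoring; names and lexicon entries are hypothetical. */
final class LexiconSentimentSketch {

    // Tiny stand-in lexicon (word -> polarity), used only for this example.
    private static final Map<String, Integer> LEXICON = Map.of(
            "dobry", 1, "super", 1, "slaby", -1, "fatalny", -1);

    /** Lower-cases the text, drops URLs, masks @mentions and splits into tokens. */
    static List<String> normalize(String tweet) {
        String text = tweet.toLowerCase(Locale.ROOT)
                .replaceAll("https?://\\S+", " ")       // remove links
                .replaceAll("@\\w+", " @anonymized ");  // anonymize user mentions
        return Arrays.stream(text.split("[^\\p{L}\\p{N}@_]+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    /** Sums the polarity of known tokens and maps the total to a label. */
    static String classify(List<String> tokens) {
        int score = tokens.stream()
                .mapToInt(token -> LEXICON.getOrDefault(token, 0))
                .sum();
        if (score > 0) {
            return "POSITIVE";
        }
        return score < 0 ? "NEGATIVE" : "NEUTRAL";
    }

    public static void main(String[] args) {
        List<String> tokens = normalize("@Jan_K To dobry pomysl! https://t.co/abc");
        System.out.println(tokens + " => " + classify(tokens));
    }
}
```

Because the sketch skips lemmatization, inflected forms such as "dobrego" would not match the lexicon; that gap is exactly what the lemmatization step is meant to close.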

## Technology Stack

| Layer | Libraries / Versions |
| :-- | :-- |
| Language | **Java 17** |
| Framework | Spring Boot 3.4.1 (`starter-web`, `starter-data-jpa`, `starter-webflux`) |
| Database | PostgreSQL 15 (Docker container) |
| ORM | Hibernate 6 / Spring Data JPA |
| Scraping | Selenium 4.27, JSoup 1.16, Apache HttpClient 4.5 |
| NLP | MALLET 2.0.8, Morfologik 2.1.9, Lingua 1.2.2, Python service with model `tabularisai/multilingual-sentiment-analysis` |
| Statistics | Apache Commons-Math3 3.6.1 |
| Other | Lombok 1.18.36, MapStruct 1.5.5 |
| Build | **Maven** + `spring-boot-maven-plugin` |
| Containerisation | Docker / Docker Compose |

Kafka libraries are present in the **POM** but currently **inactive** (placeholder for future tweet streaming).

## Prerequisites

* **JDK 17+**
* **Docker 20.10+**
* At least 4 GB RAM


## Quick Start

### Running in an IDE

* e.g., IntelliJ IDEA Community


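### Running with Docker Compose

Requires Docker and AdsPower Global to be installed, and the properties in `docker-compose.yml` to be filled in.

1. Clone the repository and start the application with Docker Compose: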
```bash
git clone 
cd xsnts-analyzer
docker compose up -d --build
# The application will be available at http://localhost:8080
```
2. Copy `tweets_anonymized.csv` to the Docker container (from the project root directory):
```bash
docker cp ./db-dump/tweets_anonymized.csv scrapper-db:/tmp/tweets_anonymized.csv 
```
3. Enter the PostgreSQL container:
```bash
docker exec -it scrapper-db bash 
```
4. Start the PostgreSQL client:
```bash
psql -U postgres_scrapper -d scrapper_db
```
5. Load the CSV file into the database:
```sql
COPY tweet(id, username, content, link, like_count, repost_count, comment_count, views, media_links, post_date, creation_date, update_date, needs_refresh)
FROM '/tmp/tweets_anonymized.csv' WITH CSV HEADER;
```
6. Exit the PostgreSQL client, then run `exit` again to leave the container shell:
```bash
exit
```
7. Use the API endpoints as described in [XSNTS-Analyzer.postman_collection.json](XSNTS-Analyzer.postman_collection.json)


## REST API

```
GET    /api/export/processed
GET    /api/export/sentiment
GET    /api/export/topic-results/{modelId}
GET    /api/export/topic-sentiment/{modelId}

DELETE /api/processing/cleanup-empty
GET    /api/processing/empty-count
GET    /api/processing/empty-records
POST   /api/processing/process-all
GET    /api/processing/stats

POST   /api/sentiment/analyze-all
DELETE /api/sentiment

POST   /api/topic-modeling/lda/train
GET    /api/topic-modeling/models
GET    /api/topic-modeling/models/{modelId}
```

*Full Swagger documentation will be available at `/swagger-ui.html` (TBD).*

## Directory Structure

```
xsnts-analyzer/
├── src/main/java/pl/bgnat/master/xsnts
│   ├── config/           # global configuration
│   ├── exporter/         # CSV export
│   ├── kafka/            # Kafka config
│   ├── scrapper/         # scraper module
│   ├── normalization/    # normalization module
│   ├── sentiment/        # sentiment analysis module
│   └── topicmodeling/    # topic-model module
├── sentiment-hf          # Python service exposing sentiment model
├── docker-compose.yml
├── .gitignore
├── LICENSE
├── THIRD_PARTY_LICENSES
├── NOTICE
├── README.md
└── pom.xml
```


## Sample Use-Cases

1. **Fetch and process all tweets**
```
POST /api/processing/process-all
```

2. **Train an LDA model (10 topics, lemmatized, no mentions)**
```
POST /api/topic-modeling/lda/train
```

```json
{
  "tokenStrategy": "lemmatized",
  "topicModel": "LDA",
  "isUseBigrams": false,
  "numberOfTopics": 10,
  "poolingStrategy": "hashtag",
  "minDocumentSize": 10,
  "maxIterations": 3000,
  "modelName": "LDA_lemmatized_hashtag_v1_2025",
  "startDate": "2025-01-01T00:00:00",
  "endDate": "2025-12-31T23:59:59",
  "skipMentions": true
}
```

3. **Assign sentiment to every tweet**
```
POST /api/sentiment/analyze-all
```

```json
{
  "tokenStrategy": "LEMMATIZED",
  "sentimentModelStrategy": "STANDARD"
}
```

4. **Collect topic-sentiment data**
```
GET /api/sentiment/{{modelId}}/stats
```

5. **Export results**
```
GET /api/export/topic-sentiment/{{modelId}}
```
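
The requests above can also be sent from code rather than Postman. The sketch below builds the LDA training call from use-case 2 with the JDK's `java.net.http` client; the class name is hypothetical, and the base URL assumes the default `http://localhost:8080` from the Quick Start section.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Illustrative client for the training endpoint; the base URL is an assumption. */
final class TrainLdaRequestSketch {

    // Assumed local deployment, as described in the Quick Start section.
    static final String BASE_URL = "http://localhost:8080";

    /** Builds a POST request for /api/topic-modeling/lda/train with a JSON body. */
    static HttpRequest buildTrainRequest(String jsonBody) {
        return HttpRequest.newBuilder()
                .uri(URI.create(BASE_URL + "/api/topic-modeling/lda/train"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Same payload as in use-case 2.
        String body = """
                {
                  "tokenStrategy": "lemmatized",
                  "topicModel": "LDA",
                  "isUseBigrams": false,
                  "numberOfTopics": 10,
                  "poolingStrategy": "hashtag",
                  "minDocumentSize": 10,
                  "maxIterations": 3000,
                  "modelName": "LDA_lemmatized_hashtag_v1_2025",
                  "startDate": "2025-01-01T00:00:00",
                  "endDate": "2025-12-31T23:59:59",
                  "skipMentions": true
                }
                """;
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildTrainRequest(body), HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Running `main` requires the application to be up; the shape of the response body is not documented here, so the sketch only prints it.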


## Roadmap

* Integrate **Kafka Streams** for automated tweet updates
* Front-end UI


## License

Source code released under the **MIT License**.
All names, logos and trademarks of **X.com** belong to their respective owners.

Owner

Name: Bartosz Gnatowski
Login: bgnatowski
Kind: user
Location: Kraków, Poland

Repositories: 2
Profile: https://github.com/bgnatowski

Citation (CITATION.cff)

@misc{tabularisai_2025,
    author       = { tabularisai and Samuel Gyamfi and Vadim Borisov and Richard H. Schreiber },
    title        = { multilingual-sentiment-analysis (Revision 69afb83) },
    year         = 2025,
    url          = { https://huggingface.co/tabularisai/multilingual-sentiment-analysis },
    doi          = { 10.57967/hf/5968 },
    publisher    = { Hugging Face }
}

GitHub Events

Total

Push event: 9
Pull request event: 2
Create event: 1

Last Year

Push event: 9
Pull request event: 2
Create event: 1

Issues and Pull Requests

Last synced: 7 months ago

All Time

Total issues: 0
Total pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: less than a minute
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: less than a minute
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0


xsnts-analyzer

Science Score: 44.0%
