xsnts-analyzer

Application that scrapes Twitter/X data, performs Polish-language text normalization and lemmatization, builds topic models with MALLET, and runs sentiment analysis for downstream analytics. Created as a project for Master Thesis.

https://github.com/bgnatowski/xsnts-analyzer

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (4.2%) to scientific vocabulary

Keywords

lda-topic-modeling lemmatization mallet polish-nlp sentiment-analysis spring-boot-3 text-normalization twitter-scraper
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: bgnatowski
  • License: mit
  • Language: Java
  • Default Branch: main
  • Homepage:
  • Size: 30.8 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 1 year ago · Last pushed 7 months ago
Metadata Files
License Citation

https://github.com/bgnatowski/XSNTS-Analyzer/blob/main/

# XSNTS-Analyzer by Bartosz Gnatowski

### PL ([English version below > click me](#en))

XSNTS-Analyzer (X.com Social Network Topic-Sentiment Analyzer) is an application developed as part of a master's thesis project at the Cracow University of Economics.

**Title:**
* PL: *Implementacja aplikacji do analizy danych z platformy X.com: analiza tematyki i sentymentu w badaniu opinii publicznej*
* EN: *Design and Implementation of an X.com Data-Analysis Application: Topic Modelling and Sentiment Analysis for Public-Opinion Research*

**Author:** Bartosz Gnatowski\
**Thesis Advisor:** dr hab. inż. Janusz Morajda\
**College:** College of Management and Quality Sciences\
**Institute:** Institute of Informatics, Accounting and Controlling\
**Field of Study:** Applied Informatics\
**Specialization:** Intelligent Systems

## Table of Contents
1. [Project Overview](#project-overview)
2. [Key Features](#key-features)
3. [Technology Stack](#technology-stack)
4. [Prerequisites](#prerequisites)
5. [Quick Start](#quick-start)
    * [Running in an IDE](#running-in-an-ide)
    * [Running with Docker Compose](#running-with-docker-compose)
6. [REST API](#rest-api)
7. [Directory Structure](#directory-structure)
8. [Sample Use-Cases](#sample-use-cases)
9. [Roadmap](#roadmap)
10. [License](#license)

## Project Overview

XSNTS-Analyzer is a **back-end** microservice for collecting, cleaning, and analyzing content published on **X.com** (formerly Twitter).
The application aims to provide:
* scheduled, automated collection of tweets with storage in a database (Selenium + ADSPower Global),
* selection and normalization of Polish-language data,
* automatic **grouping** of tweets into documents (e.g., by hashtag or time window),
* training of **LDA (MALLET)** models to identify key topics,
* assignment of **sentiment** to tweets (simple Polish lexicon or language model),
* export of results to **CSV** files for further exploration in statistical tools.

## Key Features

| Module | Description |
| :-- | :-- |
| **Scraper** | Fetches public tweets (Selenium + ADSPower Global). |
| **Normalization** | Normalization, @mention anonymization, tokenization, Polish lemmatization. |
| **Topic modeling** | MALLET `ParallelTopicModel` (LDA, α/β optimized every 10 iterations). |
| **Sentiment** | Rule-based Polish lexicon (SlownikWydzwieku) or the `tabularisai/multilingual-sentiment-analysis` model. |
| **Export** | Data export to CSV in `output/csv/`. |
| **REST API** | Endpoints for controlling the entire pipeline. |

## Technology Stack
| Layer | Libraries / Versions |
|-------|----------------------|
| Language | **Java 17** |
| Framework | Spring Boot 3.4.1 (`starter-web`, `starter-data-jpa`, `starter-webflux`) |
| Database | PostgreSQL 15 (Docker container) |
| ORM | Hibernate 6 / Spring Data JPA |
| Scraping | Selenium 4.27 + JSoup 1.16 + Apache HttpClient 4.5 |
| NLP | MALLET 2.0.8, Morfologik 2.1.9, Lingua 1.2.2, Python service running in Docker with the model `tabularisai/multilingual-sentiment-analysis` |
| Statistics | Apache Commons-Math3 3.6.1 |
| Other | Lombok 1.18.36, MapStruct 1.5.5 |
| Build | **Maven** + `spring-boot-maven-plugin` |
| Containerization | Docker / Docker Compose |

The POM also contains **Kafka** libraries, which are currently **inactive** (a placeholder for future tweet streaming).

## Prerequisites

* **JDK 17+**
* **Docker 20.10+**
* At least 4 GB RAM

## Quick Start

### Running in an IDE
* e.g., IntelliJ IDEA Community

### Running with Docker Compose

Requires Docker and AdsPower Global to be installed, and the properties in `docker-compose.yml` to be filled in.

1. Clone the repository and start the application with Docker Compose:
```bash
git clone 
cd xsnts-analyzer
docker compose up -d --build
# The application will be available at http://localhost:8080
```
2. Copy `tweets_anonymized.csv` into the Docker container (from the project root directory):
```bash
docker cp ./db-dump/tweets_anonymized.csv scrapper-db:/tmp/tweets_anonymized.csv 
```
3. Enter the PostgreSQL container:
```bash
docker exec -it scrapper-db bash 
```
4. Start the PostgreSQL client:
```bash
psql -U postgres_scrapper -d scrapper_db
```
5. Load the file:
```sql
COPY tweet(id, username, content, link, like_count, repost_count, comment_count, views, media_links, post_date, creation_date, update_date, needs_refresh)
FROM '/tmp/tweets_anonymized.csv' WITH CSV HEADER;
```
6. Exit:
```bash 
exit
```
7. Use the endpoints from [XSNTS-Analyzer.postman_collection.json](XSNTS-Analyzer.postman_collection.json)

## REST API

```
GET    /api/export/processed
GET    /api/export/sentiment
GET    /api/export/topic-results/{modelId}
GET    /api/export/topic-sentiment/{modelId}

DELETE /api/processing/cleanup-empty
GET    /api/processing/empty-count
GET    /api/processing/empty-records
POST   /api/processing/process-all
GET    /api/processing/stats

POST   /api/sentiment/analyze-all
DELETE /api/sentiment

POST   /api/topic-modeling/lda/train
GET    /api/topic-modeling/models
GET    /api/topic-modeling/models/{modelId}

```

_Full Swagger documentation will be available at `/swagger-ui.html` (TBD)._

## Directory Structure

```
xsnts-analyzer/
├── src/main/java/pl/bgnat/master/xsnts
│   ├── config/           # global configuration
│   ├── exporter/         # CSV export
│   ├── kafka/            # Kafka configuration
│   ├── scrapper/         # scraper module
│   ├── normalization/    # normalization module
│   ├── sentiment/        # sentiment analysis module
│   └── topicmodeling/    # topic modeling module
├── sentiment-hf          # application exposing an endpoint for the sentiment-analysis language model
├── docker-compose.yml
├── .gitignore
├── LICENSE
├── THIRD_PARTY_LICENSES
├── NOTICE
├── README.md
└── pom.xml
```

## Sample Use-Cases

1. **Fetch and process all tweets**

```
POST /api/processing/process-all
```

2. **Train an LDA model (10 topics, lemmatized, no mentions)**

```POST /api/topic-modeling/lda/train```

``` json
{
  "tokenStrategy": "lemmatized",
  "topicModel": "LDA",
  "isUseBigrams": false,
  "numberOfTopics": 10,
  "poolingStrategy": "hashtag",
  "minDocumentSize": 10,
  "maxIterations": 3000,
  "modelName": "LDA_lemmatized_hashtag_v1_2025",
  "startDate": "2025-01-01T00:00:00",
  "endDate": "2025-12-31T23:59:59",
  "skipMentions": true
}
```

3. **Assign sentiment to every tweet**

```
POST /api/sentiment/analyze-all
```
``` json
{
    "tokenStrategy": "LEMMATIZED",
    "sentimentModelStrategy": "STANDARD"
} 
```

4. **Collect topic-sentiment data**
```GET /api/sentiment/{{modelId}}/stats```

5. **Export results**
```GET /api/export/topic-sentiment/{{modelId}}```

## Roadmap

* Integrate **Kafka Streams** for automated tweet updates
* Front-end UI

## License

Source code released under the **MIT License**.
All names, logos and trademarks of **X.com** belong to their respective owners.

### EN

XSNTS-Analyzer (X.com Social Network Topic-Sentiment Analyzer) is an application developed for a master's thesis at the Cracow University of Economics.

**Title:**
* EN: *Implementation of an Application for X.com Data Analysis: Topic Modeling and Sentiment Analysis in Public Opinion Research*
* PL (org): *Implementacja aplikacji do analizy danych z platformy X.com: analiza tematyki i sentymentu w badaniu opinii publicznej*

**Author:** Bartosz Gnatowski\
**Thesis Advisor:** dr hab. inż. Janusz Morajda\
**College:** College of Management and Quality Sciences\
**Institute:** Institute of Informatics, Accounting and Controlling\
**Field of Study:** Applied Informatics\
**Specialization:** Intelligent Systems

## Table of Contents

1. [Project Overview](#project-overview)
2. [Key Features](#key-features)
3. [Technology Stack](#technology-stack)
4. [Prerequisites](#prerequisites)
5. [Quick Start](#quick-start)
    * [Running in an IDE](#running-in-an-ide)
    * [Running with Docker Compose](#running-with-docker-compose)
6. [REST API](#rest-api)
7. [Directory Structure](#directory-structure)
8. [Sample Use-Cases](#sample-use-cases)
9. [Roadmap](#roadmap)
10. [License](#license)

## Project Overview

XSNTS-Analyzer is a **back-end microservice** for scraping, cleaning, and analyzing content published on **X.com** (formerly Twitter).

Main goals:

* Scheduled, automated harvesting of tweets and database storage (Selenium + ADSPower Global).
* Selection and normalization of Polish-language data.
* Automatic **grouping** of tweets into documents (e.g., by hashtag or time window).
* Training **LDA models (MALLET)** to identify key topics.
* Assigning **sentiment** to tweets (simple Polish lexicon or language model).
* Exporting results to **CSV** for further exploration in statistical tools.
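
The grouping step above can be illustrated with a minimal sketch of hashtag pooling: every tweet is attached to one pseudo-document per hashtag it contains. The class and method names are hypothetical and not taken from the project, and tweets without hashtags are simply skipped here, which may differ from how the application handles them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not from the project): pools tweets into one document per hashtag. */
final class HashtagPooling {

    // Unicode-aware character classes so hashtags with Polish letters also match.
    private static final Pattern HASHTAG = Pattern.compile("#[\\p{L}\\p{N}_]+");

    /** Returns a map from lower-cased hashtag to the tweets that mention it. */
    static Map<String, List<String>> poolByHashtag(List<String> tweets) {
        Map<String, List<String>> documents = new LinkedHashMap<>();
        for (String tweet : tweets) {
            // Collect distinct tags first so a repeated tag adds the tweet only once.
            Set<String> tags = new LinkedHashSet<>();
            Matcher matcher = HASHTAG.matcher(tweet);
            while (matcher.find()) {
                tags.add(matcher.group().toLowerCase(Locale.ROOT));
            }
            for (String tag : tags) {
                documents.computeIfAbsent(tag, key -> new ArrayList<>()).add(tweet);
            }
        }
        return documents;
    }

    public static void main(String[] args) {
        List<String> tweets = List.of(
                "Debata dzisiaj #Wybory",
                "Wyniki #wybory #Sejm #sejm",
                "tweet bez hashtagu");
        poolByHashtag(tweets).forEach((tag, docs) -> System.out.println(tag + " -> " + docs.size()));
    }
}
```

Pooling short tweets into longer documents gives LDA more word co-occurrences per document to work with than single tweets would.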


## Key Features

| Module | Description |
| :-- | :-- |
| **Scraper** | Fetches public tweets (Selenium + ADSPower Global). |
| **Normalization** | Normalization, @mention anonymization, tokenization, Polish lemmatization. |
| **Topic Modeling** | MALLET `ParallelTopicModel` (LDA, α/β optimized every 10 iterations). |
| **Sentiment** | Rule-based Polish lexicon (SlownikWydzwieku) or model `tabularisai/multilingual-sentiment-analysis`. |
| **Export** | Data export to CSV in `output/csv/`. |
| **REST API** | Endpoints controlling the entire pipeline. |
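
As a rough illustration of the normalization row, the sketch below strips URLs, replaces @mentions with a placeholder, lower-cases the text, and splits it into tokens. The class name, the `@user` placeholder, and the regular expressions are assumptions made for this example; lemmatization (done with Morfologik in the project) is deliberately left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical normalizer (not the project's class): mention anonymization and tokenization. */
final class TweetNormalizer {

    private static final Pattern URL = Pattern.compile("https?://\\S+");
    // Handles are matched as ASCII letters, digits and underscores.
    private static final Pattern MENTION = Pattern.compile("@\\w+");
    // Any run of characters that is not a letter, digit, underscore, '#' or '@' separates tokens.
    private static final Pattern SEPARATOR = Pattern.compile("[^\\p{L}\\p{N}_#@]+");

    /** Replaces every @mention with the assumed placeholder "@user". */
    static String anonymizeMentions(String text) {
        return MENTION.matcher(text).replaceAll("@user");
    }

    /** Removes URLs, anonymizes mentions, lower-cases and splits into tokens. */
    static List<String> tokenize(String text) {
        String cleaned = URL.matcher(text).replaceAll(" ");
        cleaned = anonymizeMentions(cleaned).toLowerCase(Locale.forLanguageTag("pl-PL"));
        return Arrays.stream(SEPARATOR.split(cleaned))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("@Jan_Kowalski Super debata! https://t.co/abc #Wybory2025"));
    }
}
```

Hashtags are kept as single tokens in this sketch so that a later pooling step can still find them.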

## Technology Stack

| Layer | Libraries / Versions |
| :-- | :-- |
| Language | **Java 17** |
| Framework | Spring Boot 3.4.1 (`starter-web`, `starter-data-jpa`, `starter-webflux`) |
| Database | PostgreSQL 15 (Docker container) |
| ORM | Hibernate 6 / Spring Data JPA |
| Scraping | Selenium 4.27, JSoup 1.16, Apache HttpClient 4.5 |
| NLP | MALLET 2.0.8, Morfologik 2.1.9, Lingua 1.2.2, Python service with model `tabularisai/multilingual-sentiment-analysis` |
| Statistics | Apache Commons-Math3 3.6.1 |
| Other | Lombok 1.18.36, MapStruct 1.5.5 |
| Build | **Maven** + `spring-boot-maven-plugin` |
| Containerisation | Docker / Docker Compose |

Kafka libraries are present in the **POM** but currently **inactive** (placeholder for future tweet streaming).

## Prerequisites

* **JDK 17+**
* **Docker 20.10+**
* At least 4 GB RAM


## Quick Start

### Running in an IDE

* e.g., IntelliJ IDEA Community


### Running with Docker Compose

1. Clone the repository and start the application with Docker Compose:
```bash
git clone 
cd xsnts-analyzer
docker compose up -d --build
# The application will be available at http://localhost:8080
```
2. Copy `tweets_anonymized.csv` to the Docker container (from the project root directory):
```bash
docker cp ./db-dump/tweets_anonymized.csv scrapper-db:/tmp/tweets_anonymized.csv 
```
3. Enter the PostgreSQL container:
```bash
docker exec -it scrapper-db bash 
```
4. Start the PostgreSQL client:
```bash
psql -U postgres_scrapper -d scrapper_db
```
5. Load the CSV file into the database:
```sql
COPY tweet(id, username, content, link, like_count, repost_count, comment_count, views, media_links, post_date, creation_date, update_date, needs_refresh)
FROM '/tmp/tweets_anonymized.csv' WITH CSV HEADER;
```
6. Exit the container shell:
```bash 
exit
```
7. Use the API endpoints as described in [XSNTS-Analyzer.postman_collection.json](XSNTS-Analyzer.postman_collection.json)


## REST API

```
GET    /api/export/processed
GET    /api/export/sentiment
GET    /api/export/topic-results/{modelId}
GET    /api/export/topic-sentiment/{modelId}

DELETE /api/processing/cleanup-empty
GET    /api/processing/empty-count
GET    /api/processing/empty-records
POST   /api/processing/process-all
GET    /api/processing/stats

POST   /api/sentiment/analyze-all
DELETE /api/sentiment

POST   /api/topic-modeling/lda/train
GET    /api/topic-modeling/models
GET    /api/topic-modeling/models/{modelId}
```

*Full Swagger documentation will be available at `/swagger-ui.html` (TBD).*

## Directory Structure

```
xsnts-analyzer/
├── src/main/java/pl/bgnat/master/xsnts
│   ├── config/           # global configuration
│   ├── exporter/         # CSV export
│   ├── kafka/            # Kafka config
│   ├── scrapper/         # scraper module
│   ├── normalization/    # normalization module
│   ├── sentiment/        # sentiment analysis module
│   └── topicmodeling/    # topic-model module
├── sentiment-hf          # Python service exposing sentiment model
├── docker-compose.yml
├── .gitignore
├── LICENSE
├── THIRD_PARTY_LICENSES
├── NOTICE
├── README.md
└── pom.xml
```


## Sample Use-Cases

1. **Fetch and process all tweets**
```
POST /api/processing/process-all
```

2. **Train an LDA model (10 topics, lemmatized, no mentions)**
```
POST /api/topic-modeling/lda/train
```

```json
{
  "tokenStrategy": "lemmatized",
  "topicModel": "LDA",
  "isUseBigrams": false,
  "numberOfTopics": 10,
  "poolingStrategy": "hashtag",
  "minDocumentSize": 10,
  "maxIterations": 3000,
  "modelName": "LDA_lemmatized_hashtag_v1_2025",
  "startDate": "2025-01-01T00:00:00",
  "endDate": "2025-12-31T23:59:59",
  "skipMentions": true
}
```
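
For callers that prefer code over Postman, the training request could be issued with the JDK's built-in `java.net.http` client, roughly as sketched below. The base URL `http://localhost:8080` comes from the Quick Start; the class and method names are hypothetical, and the shape of the response is not documented here, so it is treated as an opaque string.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical client sketch for POST /api/topic-modeling/lda/train. */
final class LdaTrainClient {

    // Same payload as in the README example above.
    static final String SAMPLE_BODY = """
            {
              "tokenStrategy": "lemmatized",
              "topicModel": "LDA",
              "isUseBigrams": false,
              "numberOfTopics": 10,
              "poolingStrategy": "hashtag",
              "minDocumentSize": 10,
              "maxIterations": 3000,
              "modelName": "LDA_lemmatized_hashtag_v1_2025",
              "startDate": "2025-01-01T00:00:00",
              "endDate": "2025-12-31T23:59:59",
              "skipMentions": true
            }
            """;

    /** Builds the request without sending it, so it can be inspected or reused. */
    static HttpRequest buildTrainRequest(String baseUrl, String jsonBody) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/topic-modeling/lda/train"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    /** Sends the request and returns the raw response body; requires the service to be running. */
    static String train(String baseUrl, String jsonBody) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildTrainRequest(baseUrl, jsonBody), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

With the application running, `LdaTrainClient.train("http://localhost:8080", LdaTrainClient.SAMPLE_BODY)` would submit the example payload. With 3000 iterations the call may take a while, so a real client would likely want a timeout.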

3. **Assign sentiment to every tweet**
```
POST /api/sentiment/analyze-all
```

```json
{
  "tokenStrategy": "LEMMATIZED",
  "sentimentModelStrategy": "STANDARD"
}
```
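
The README lists a rule-based Polish lexicon as one of the two sentiment options, but does not say whether `STANDARD` selects it. Assuming a lexicon maps lemmas to integer polarities, the idea behind such a scorer can be sketched as follows. The class name and lexicon entries are toy values invented for this example and are not taken from SlownikWydzwieku.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical rule-based scorer; not the project's actual sentiment implementation. */
final class LexiconSentiment {

    enum Label { POSITIVE, NEUTRAL, NEGATIVE }

    // Toy entries for illustration only; they are NOT taken from SlownikWydzwieku.
    static final Map<String, Integer> TOY_LEXICON = Map.of(
            "dobry", 1,      // "good"
            "sukces", 1,     // "success"
            "problem", -1,   // "problem"
            "skandal", -1);  // "scandal"

    /** Sums the polarity of every lemma found in the lexicon; unknown lemmas count as 0. */
    static int score(List<String> lemmas, Map<String, Integer> lexicon) {
        int total = 0;
        for (String lemma : lemmas) {
            total += lexicon.getOrDefault(lemma, 0);
        }
        return total;
    }

    /** Maps the summed score to a label by its sign. */
    static Label classify(List<String> lemmas, Map<String, Integer> lexicon) {
        int total = score(lemmas, lexicon);
        if (total > 0) {
            return Label.POSITIVE;
        }
        if (total < 0) {
            return Label.NEGATIVE;
        }
        return Label.NEUTRAL;
    }

    public static void main(String[] args) {
        List<String> lemmas = List.of("sukces", "mimo", "problem", "dobry");
        System.out.println(classify(lemmas, TOY_LEXICON));
    }
}
```

Scoring lemmas rather than raw tokens lets one lexicon entry cover every inflected form, which helps in a highly inflected language such as Polish.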

4. **Collect topic-sentiment data**
```
GET /api/sentiment/{{modelId}}/stats
```

5. **Export results**
```
GET /api/export/topic-sentiment/{{modelId}}
```
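
The response format of the topic-sentiment endpoints is not described here. As a sketch of the kind of aggregation they imply, the example below assumes each tweet has already been assigned a topic id and a numeric sentiment score, and computes the mean sentiment per topic. The record and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical aggregation sketch: mean sentiment score per topic. */
final class TopicSentimentStats {

    /** Assumed shape of a scored tweet: dominant topic id and a numeric sentiment score. */
    record ScoredTweet(int topicId, double sentiment) {}

    /** Groups tweets by topic id and averages their sentiment; topics are sorted by id. */
    static Map<Integer, Double> meanSentimentByTopic(List<ScoredTweet> tweets) {
        return tweets.stream().collect(Collectors.groupingBy(
                ScoredTweet::topicId,
                TreeMap::new,
                Collectors.averagingDouble(ScoredTweet::sentiment)));
    }

    public static void main(String[] args) {
        List<ScoredTweet> tweets = List.of(
                new ScoredTweet(0, 1.0),
                new ScoredTweet(0, -1.0),
                new ScoredTweet(1, 0.5));
        meanSentimentByTopic(tweets)
                .forEach((topic, mean) -> System.out.println("topic " + topic + ": " + mean));
    }
}
```

A per-topic table of this kind is the sort of output that would then be written to CSV for further analysis in statistical tools.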


## Roadmap

* Integrate **Kafka Streams** for automated tweet updates
* Front-end UI


## License

Source code released under the **MIT License**.
All names, logos and trademarks of **X.com** belong to their respective owners.

Owner

  • Name: Bartosz Gnatowski
  • Login: bgnatowski
  • Kind: user
  • Location: Kraków, Poland

Citation (CITATION.cff)

@misc{tabularisai_2025,
    author       = { tabularisai and Samuel Gyamfi and Vadim Borisov and Richard H. Schreiber },
    title        = { multilingual-sentiment-analysis (Revision 69afb83) },
    year         = 2025,
    url          = { https://huggingface.co/tabularisai/multilingual-sentiment-analysis },
    doi          = { 10.57967/hf/5968 },
    publisher    = { Hugging Face }
}

GitHub Events

Total
  • Push event: 9
  • Pull request event: 2
  • Create event: 1
Last Year
  • Push event: 9
  • Pull request event: 2
  • Create event: 1

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • bgnatowski (2)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

pom.xml maven
  • org.postgresql:postgresql
  • org.projectlombok:lombok
  • org.springframework.boot:spring-boot-starter-data-jpa
  • org.springframework.boot:spring-boot-starter-web
  • org.springframework.boot:spring-boot-starter-test test
sentiment-hf/Dockerfile docker
  • python 3.11-slim build
sentiment-hf/requirements.txt pypi
  • fastapi ==0.111
  • torch >=2.0
  • transformers ==4.41
  • uvicorn ==0.29