xsnts-analyzer
Application that scrapes Twitter/X data, performs Polish-language text normalization and lemmatization, builds topic models with MALLET, and runs sentiment analysis for downstream analytics. Created as a project for Master Thesis.
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (4.2%) to scientific vocabulary
Keywords
lda-topic-modeling
lemmatization
mallet
polish-nlp
sentiment-analysis
spring-boot-3
text-normalization
twitter-scraper
Last synced: 6 months ago
Repository
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
lda-topic-modeling
lemmatization
mallet
polish-nlp
sentiment-analysis
spring-boot-3
text-normalization
twitter-scraper
Created about 1 year ago · Last pushed 7 months ago
Metadata Files
License
Citation
https://github.com/bgnatowski/XSNTS-Analyzer/blob/main/
# XSNTS-Analyzer by Bartosz Gnatowski
### PL ([English version below > click me](#en))
XSNTS-Analyzer (X.com Social Network Topic-Sentiment Analyzer) is an application built as a master's thesis project at the Cracow University of Economics.
**Title:**
* PL: *Implementacja aplikacji do analizy danych z platformy X.com: analiza tematyki i sentymentu w badaniu opinii publicznej*
* EN: *Design and Implementation of an X.com Data-Analysis Application: Topic Modelling and Sentiment Analysis for Public-Opinion Research*
**Author:** Bartosz Gnatowski\
**Thesis advisor:** dr hab. inż. Janusz Morajda\
**College:** College of Management and Quality Sciences\
**Institute:** Institute of Informatics, Accounting and Controlling\
**Field of study:** Applied Informatics\
**Specialization:** Intelligent Systems
## Table of Contents
1. [Project Overview](#project-overview)
2. [Key Features](#key-features)
3. [Technology Stack](#technology-stack)
4. [Prerequisites](#prerequisites)
5. [Quick Start](#quick-start)
    * [Running in an IDE](#running-in-an-ide)
    * [Running with Docker Compose](#running-with-docker-compose)
6. [REST API](#rest-api)
7. [Directory Structure](#directory-structure)
8. [Sample Use-Cases](#sample-use-cases)
9. [Roadmap](#roadmap)
10. [License](#license)
## Project Overview
XSNTS-Analyzer is a **back-end** microservice for collecting, cleaning, and analysing content published on **X.com** (formerly Twitter).
The application aims to:
* collect tweets automatically on a schedule and store them in a database (Selenium + ADSPower Global),
* select and normalize Polish-language data,
* automatically **group** tweets into documents (e.g., by hashtag or time window),
* train **LDA (MALLET)** models to identify key topics,
* assign **sentiment** to tweets (a simple Polish lexicon or a language model),
* export results to **CSV** files for further exploration in statistical tools.
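Polish-language selection can be illustrated with Lingua, which is listed in the technology stack below; the following is a minimal, standalone sketch assuming language detection is used to filter tweets before normalization, not the project's actual code:

```java
import com.github.pemistahl.lingua.api.Language;
import com.github.pemistahl.lingua.api.LanguageDetector;
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;

import java.util.List;

public class PolishFilterSketch {
    public static void main(String[] args) {
        // Restricting the detector to a few expected languages keeps detection fast and accurate.
        LanguageDetector detector = LanguageDetectorBuilder
                .fromLanguages(Language.POLISH, Language.ENGLISH, Language.GERMAN)
                .build();

        List<String> tweets = List.of(
                "Dziś piękna pogoda w Krakowie",
                "Great weather in Krakow today");

        // Keep only tweets detected as Polish before normalization and lemmatization.
        tweets.stream()
                .filter(tweet -> detector.detectLanguageOf(tweet) == Language.POLISH)
                .forEach(System.out::println);
    }
}
```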
## Key Features
| Module | Description |
| :-- | :-- |
| **Scraper** | Fetches public tweets (Selenium + ADSPower Global). |
| **Normalization** | Normalization, @mention anonymization, tokenization, Polish lemmatization. |
| **Topic modeling** | MALLET `ParallelTopicModel` (LDA; hyperparameters optimized every 10 iterations). |
| **Sentiment** | Rule-based Polish lexicon (SlownikWydzwieku) or the `tabularisai/multilingual-sentiment-analysis` model. |
| **Export** | Exports data as CSV files to `output/csv/`. |
| **REST API** | Endpoints controlling the entire pipeline. |
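Polish lemmatization in the Normalization module is backed by Morfologik (see the technology stack below); a minimal sketch of dictionary-based lemma lookup, shown as a standalone illustration rather than the module's real implementation:

```java
import morfologik.stemming.WordData;
import morfologik.stemming.polish.PolishStemmer;

import java.util.List;

public class LemmatizationSketch {
    public static void main(String[] args) {
        // PolishStemmer ships with the Polish dictionary bundled in the Morfologik artifacts.
        PolishStemmer stemmer = new PolishStemmer();

        for (String token : List.of("kotami", "analizy", "tematów")) {
            List<WordData> entries = stemmer.lookup(token);
            // A token may have several dictionary entries; take the first lemma, or keep the token as-is.
            String lemma = entries.isEmpty() ? token : entries.get(0).getStem().toString();
            System.out.println(token + " -> " + lemma);
        }
    }
}
```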
## Technology Stack
| Layer | Libraries / Versions |
| :-- | :-- |
| Language | **Java 17** |
| Framework | Spring Boot 3.4.1 (`starter-web`, `starter-data-jpa`, `starter-webflux`) |
| Database | PostgreSQL 15 (Docker container) |
| ORM | Hibernate 6 / Spring Data JPA |
| Scraping | Selenium 4.27, JSoup 1.16, Apache HttpClient 4.5 |
| NLP | MALLET 2.0.8, Morfologik 2.1.9, Lingua 1.2.2, Dockerized Python service with the `tabularisai/multilingual-sentiment-analysis` model |
| Statistics | Apache Commons-Math3 3.6.1 |
| Other | Lombok 1.18.36, MapStruct 1.5.5 |
| Build | **Maven** + `spring-boot-maven-plugin` |
| Containerization | Docker / Docker Compose |

The POM also contains **Kafka** libraries that are currently **inactive** (a placeholder for future tweet streaming).
## Prerequisites
* **JDK 17+**
* **Docker 20.10+**
* At least 4 GB RAM
## Quick Start
### Running in an IDE
* e.g., IntelliJ IDEA Community
### Running with Docker Compose
*(Requires Docker and AdsPower Global to be installed and the properties in `docker-compose.yml` filled in.)*
1. Clone the repository and start the application with Docker Compose:
```bash
git clone https://github.com/bgnatowski/XSNTS-Analyzer.git xsnts-analyzer
cd xsnts-analyzer
docker compose up -d --build
# The application will be available at http://localhost:8080
```
2. Copy `tweets_anonymized.csv` into the database container (from the project root directory):
```bash
docker cp ./db-dump/tweets_anonymized.csv scrapper-db:/tmp/tweets_anonymized.csv
```
3. Enter the PostgreSQL container:
```bash
docker exec -it scrapper-db bash
```
4. Start the psql client:
```bash
psql -U postgres_scrapper -d scrapper_db
```
5. Load the CSV file into the database:
```sql
COPY tweet(id, username, content, link, like_count, repost_count, comment_count, views, media_links, post_date, creation_date, update_date, needs_refresh)
FROM '/tmp/tweets_anonymized.csv' WITH CSV HEADER;
```
6. Exit the container shell:
```bash
exit
```
7. Use the API endpoints described in [XSNTS-Analyzer.postman_collection.json](XSNTS-Analyzer.postman_collection.json).
## REST API
```
GET /api/export/processed
GET /api/export/sentiment
GET /api/export/topic-results/{modelId}
GET /api/export/topic-sentiment/{modelId}
DELETE /api/processing/cleanup-empty
GET /api/processing/empty-count
GET /api/processing/empty-records
POST /api/processing/process-all
GET /api/processing/stats
POST /api/sentiment/analyze-all
DELETE /api/sentiment
POST /api/topic-modeling/lda/train
GET /api/topic-modeling/models
GET /api/topic-modeling/models/{modelId}
```
*Full Swagger documentation will be available at `/swagger-ui.html` (TBD).*
## Directory Structure
```
xsnts-analyzer/
src/main/java/pl/bgnat/master/xsnts
config/ # global configuration
exporter/ # CSV export
kafka/ # Kafka configuration
scrapper/ # scraper module
normalization/ # normalization module
sentiment/ # sentiment analysis module
topicmodeling/ # topic modeling module
sentiment-hf # Python service exposing the sentiment-analysis model endpoint
docker-compose.yml
.gitignore
LICENSE
THIRD_PARTY_LICENSES
NOTICE
README.md
pom.xml
```
## Sample Use-Cases
1. **Fetch and process all tweets**
```
POST /api/processing/process-all
```
2. **Train an LDA model (10 topics, lemmatized, no mentions)**
```
POST /api/topic-modeling/lda/train
```
```json
{
"tokenStrategy": "lemmatized",
"topicModel": "LDA",
"isUseBigrams": false,
"numberOfTopics": 10,
"poolingStrategy": "hashtag",
"minDocumentSize": 10,
"maxIterations": 3000,
"modelName": "LDA_lemmatized_hashtag_v1_2025",
"startDate": "2025-01-01T00:00:00",
"endDate": "2025-12-31T23:59:59",
"skipMentions": true
}
```
3. **Assign sentiment to every tweet**
```
POST /api/sentiment/analyze-all
```
``` json
{
"tokenStrategy": "LEMMATIZED",
"sentimentModelStrategy": "STANDARD"
}
```
4. **Collect topic-sentiment statistics for a model**
```
GET /api/sentiment/{{modelId}}/stats
```
5. **Export the results**
```
GET /api/export/topic-sentiment/{{modelId}}
```
## Roadmap
* Integrate **Kafka Streams** for automated tweet updates
* Front-end UI
## License
The source code is released under the **MIT License**.
All names, logos, and trademarks of the **X.com** platform belong to their respective owners.
### EN
XSNTS-Analyzer (X.com Social Network Topic-Sentiment Analyzer) is an application developed for a master's thesis at the Cracow University of Economics.
**Title:**
* EN: *Implementation of an Application for X.com Data Analysis: Topic Modeling and Sentiment Analysis in Public Opinion Research*
* PL (org): *Implementacja aplikacji do analizy danych z platformy X.com: analiza tematyki i sentymentu w badaniu opinii publicznej*
**Author:** Bartosz Gnatowski\
**Thesis Advisor:** dr hab. inż. Janusz Morajda\
**College:** College of Management and Quality Sciences\
**Institute:** Institute of Informatics, Accounting and Controlling\
**Field of Study:** Applied Informatics\
**Specialization:** Intelligent Systems
## Table of Contents
1. [Project Overview](#project-overview)
2. [Key Features](#key-features)
3. [Technology Stack](#technology-stack)
4. [Prerequisites](#prerequisites)
5. [Quick Start](#quick-start)
* [Running in an IDE](#running-in-an-ide)
* [Running with Docker Compose](#running-with-docker-compose)
6. [REST API](#rest-api)
7. [Directory Structure](#directory-structure)
8. [Sample Use-Cases](#sample-use-cases)
9. [Roadmap](#roadmap)
10. [License](#license)
## Project Overview
XSNTS-Analyzer is a **back-end microservice** for scraping, cleansing and analysing content published on **X.com** (formerly Twitter).
Main goals:
* Scheduled, automated harvesting of tweets and database storage (Selenium + ADSPower Global).
* Selection and normalization of Polish-language data.
* Automatic **grouping** of tweets into documents (e.g., by hashtag or time window).
* Training **LDA models (MALLET)** to identify key topics.
* Assigning **sentiment** to tweets (simple Polish lexicon or language model).
* Exporting results to **CSV** for further exploration in statistical tools.
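As an illustration of the grouping step above (the `"poolingStrategy": "hashtag"` option used later in the sample requests), the sketch below concatenates tweets sharing a hashtag into one pseudo-document; class and method names are illustrative, not the project's own:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class HashtagPoolingSketch {
    private static final Pattern HASHTAG = Pattern.compile("#(\\p{L}[\\p{L}\\p{N}_]*)");

    // Groups tweets into pseudo-documents keyed by hashtag; a tweet with two hashtags lands in both pools.
    static Map<String, String> poolByHashtag(List<String> tweets) {
        Map<String, List<String>> pools = new HashMap<>();
        for (String tweet : tweets) {
            Matcher matcher = HASHTAG.matcher(tweet);
            while (matcher.find()) {
                pools.computeIfAbsent(matcher.group(1).toLowerCase(), key -> new ArrayList<>()).add(tweet);
            }
        }
        return pools.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey, entry -> String.join(" ", entry.getValue())));
    }

    public static void main(String[] args) {
        List<String> tweets = List.of("Świetny mecz! #sport", "Wyniki sondażu #polityka", "Trening rano #sport");
        poolByHashtag(tweets).forEach((tag, doc) -> System.out.println(tag + ": " + doc));
    }
}
```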
## Key Features
| Module | Description |
| :-- | :-- |
| **Scraper** | Fetches public tweets (Selenium + ADSPower Global). |
| **Normalization** | Normalization, @mention anonymization, tokenization, Polish lemmatization. |
| **Topic Modeling** | MALLET `ParallelTopicModel` (LDA; hyperparameters optimized every 10 iterations). |
| **Sentiment** | Rule-based Polish lexicon (SlownikWydzwieku) or model `tabularisai/multilingual-sentiment-analysis`. |
| **Export** | Exports data as CSV files to `output/csv/`. |
| **REST API** | Endpoints controlling the entire pipeline. |
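For the @mention anonymization listed in the Normalization row, a regex replacement is enough to show the idea; the `@user` placeholder is an assumption, not necessarily the token the module emits:

```java
import java.util.regex.Pattern;

public class MentionAnonymizerSketch {
    // X.com handles: 1-15 characters of letters, digits or underscore after '@'.
    private static final Pattern MENTION = Pattern.compile("@\\w{1,15}");

    // Replaces every handle with a neutral placeholder so no account names reach the analysis layer.
    static String anonymizeMentions(String tweet) {
        return MENTION.matcher(tweet).replaceAll("@user");
    }

    public static void main(String[] args) {
        System.out.println(anonymizeMentions("@jan_kowalski dzięki za link! cc @anna123"));
        // -> "@user dzięki za link! cc @user"
    }
}
```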
## Technology Stack
| Layer | Libraries / Versions |
| :-- | :-- |
| Language | **Java 17** |
| Framework | Spring Boot 3.4.1 (`starter-web`, `starter-data-jpa`, `starter-webflux`) |
| Database | PostgreSQL 15 (Docker container) |
| ORM | Hibernate 6 / Spring Data JPA |
| Scraping | Selenium 4.27, JSoup 1.16, Apache HttpClient 4.5 |
| NLP | MALLET 2.0.8, Morfologik 2.1.9, Lingua 1.2.2, Python service with model `tabularisai/multilingual-sentiment-analysis` |
| Statistics | Apache Commons-Math3 3.6.1 |
| Other | Lombok 1.18.36, MapStruct 1.5.5 |
| Build | **Maven** + `spring-boot-maven-plugin` |
| Containerisation | Docker / Docker Compose |
Kafka libraries are present in the **POM** but currently **inactive** (placeholder for future tweet streaming).
## Prerequisites
* **JDK 17+**
* **Docker 20.10+**
* At least 4 GB RAM
## Quick Start
### Running in an IDE
* e.g., IntelliJ IDEA Community
### Running with Docker Compose
*(Requires Docker and AdsPower Global to be installed and the properties in `docker-compose.yml` filled in.)*
1. Clone the repository and start the application with Docker Compose:
```bash
git clone https://github.com/bgnatowski/XSNTS-Analyzer.git xsnts-analyzer
cd xsnts-analyzer
docker compose up -d --build
# The application will be available at http://localhost:8080
```
2. Copy `tweets_anonymized.csv` to the Docker container (from the project root directory):
```bash
docker cp ./db-dump/tweets_anonymized.csv scrapper-db:/tmp/tweets_anonymized.csv
```
3. Enter the PostgreSQL container:
```bash
docker exec -it scrapper-db bash
```
4. Start the PostgreSQL client:
```bash
psql -U postgres_scrapper -d scrapper_db
```
5. Load the CSV file into the database (a JDBC-based alternative is sketched after this list):
```sql
COPY tweet(id, username, content, link, like_count, repost_count, comment_count, views, media_links, post_date, creation_date, update_date, needs_refresh)
FROM '/tmp/tweets_anonymized.csv' WITH CSV HEADER;
```
6. Exit the container shell:
```bash
exit
```
7. Use the API endpoints as described in [XSNTS-Analyzer.postman_collection.json](XSNTS-Analyzer.postman_collection.json)
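If you prefer to load the dump from code rather than the psql client, the PostgreSQL JDBC driver (already a dependency in `pom.xml`) exposes the same COPY mechanism; a hedged sketch, where the host port and password are assumptions to be matched against your `docker-compose.yml`:

```java
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;

public class CsvLoaderSketch {
    public static void main(String[] args) throws Exception {
        // Host port and password are assumptions; use the values from your docker-compose.yml.
        String url = "jdbc:postgresql://localhost:5432/scrapper_db";
        try (Connection conn = DriverManager.getConnection(url, "postgres_scrapper", "CHANGE_ME");
             Reader csv = new FileReader("db-dump/tweets_anonymized.csv")) {

            CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();
            long rows = copyManager.copyIn(
                    "COPY tweet(id, username, content, link, like_count, repost_count, comment_count, "
                    + "views, media_links, post_date, creation_date, update_date, needs_refresh) "
                    + "FROM STDIN WITH CSV HEADER", csv);
            System.out.println("Loaded rows: " + rows);
        }
    }
}
```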
## REST API
```
GET /api/export/processed
GET /api/export/sentiment
GET /api/export/topic-results/{modelId}
GET /api/export/topic-sentiment/{modelId}
DELETE /api/processing/cleanup-empty
GET /api/processing/empty-count
GET /api/processing/empty-records
POST /api/processing/process-all
GET /api/processing/stats
POST /api/sentiment/analyze-all
DELETE /api/sentiment
POST /api/topic-modeling/lda/train
GET /api/topic-modeling/models
GET /api/topic-modeling/models/{modelId}
```
*Full Swagger documentation will be available at `/swagger-ui.html` (TBD).*
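The endpoints can be driven from the Postman collection, curl, or plain code; below is a minimal Java 17 sketch that triggers preprocessing and then reads the processing statistics, assuming the application runs at `http://localhost:8080` as in the Quick Start (whether `process-all` expects a request body is not documented here, so none is sent):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PipelineClientSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String baseUrl = "http://localhost:8080";

        // Kick off normalization/lemmatization of all stored tweets.
        HttpRequest processAll = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/processing/process-all"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        System.out.println(client.send(processAll, HttpResponse.BodyHandlers.ofString()).body());

        // Read the processing statistics.
        HttpRequest stats = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/processing/stats"))
                .GET()
                .build();
        System.out.println(client.send(stats, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```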
## Directory Structure
```
xsnts-analyzer/
src/main/java/pl/bgnat/master/xsnts
config/ # global configuration
exporter/ # CSV export
kafka/ # Kafka config
scrapper/ # scraper module
normalization/ # normalization module
sentiment/ # sentiment analysis module
topicmodeling/ # topic-model module
sentiment-hf # Python service exposing sentiment model
docker-compose.yml
.gitignore
LICENSE
THIRD_PARTY_LICENSES
NOTICE
README.md
pom.xml
```
## Sample Use-Cases
1. **Fetch and process all tweets**
```
POST /api/processing/process-all
```
2. **Train an LDA model (10 topics, lemmatized, no mentions)** (see the MALLET sketch after this list)
```
POST /api/topic-modeling/lda/train
```
```json
{
"tokenStrategy": "lemmatized",
"topicModel": "LDA",
"isUseBigrams": false,
"numberOfTopics": 10,
"poolingStrategy": "hashtag",
"minDocumentSize": 10,
"maxIterations": 3000,
"modelName": "LDA_lemmatized_hashtag_v1_2025",
"startDate": "2025-01-01T00:00:00",
"endDate": "2025-12-31T23:59:59",
"skipMentions": true
}
```
3. **Assign sentiment to every tweet** (a lexicon-scoring sketch follows this list)
```
POST /api/sentiment/analyze-all
```
```json
{
"tokenStrategy": "LEMMATIZED",
"sentimentModelStrategy": "STANDARD"
}
```
4. **Collect topic-sentiment data**
```
GET /api/sentiment/{{modelId}}/stats
```
5. **Export results**
```
GET /api/export/topic-sentiment/{{modelId}}
```
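Use case 2 ultimately drives MALLET's `ParallelTopicModel` (see the technology stack). The standalone sketch below mirrors the request parameters (10 topics, 3000 iterations); it illustrates the library call, not the service's actual training code:

```java
import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.regex.Pattern;

public class LdaTrainingSketch {
    public static void main(String[] args) throws Exception {
        // Tokenize on letter sequences and map tokens to feature indices.
        ArrayList<Pipe> pipeList = new ArrayList<>();
        pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
        pipeList.add(new TokenSequence2FeatureSequence());
        InstanceList instances = new InstanceList(new SerialPipes(pipeList));

        // Each instance is one pooled document, e.g. all lemmatized tweets sharing a hashtag.
        instances.addThruPipe(new Instance("wybory sejm głosowanie kampania", null, "polityka", null));
        instances.addThruPipe(new Instance("mecz bramka trener liga", null, "sport", null));

        ParallelTopicModel lda = new ParallelTopicModel(10, 1.0, 0.01); // numberOfTopics from the request
        lda.addInstances(instances);
        lda.setNumThreads(4);
        lda.setOptimizeInterval(10);  // hyperparameter optimization every 10 iterations
        lda.setNumIterations(3000);   // maxIterations from the request
        lda.estimate();

        // Print the top words of each topic.
        Object[][] topWords = lda.getTopWords(10);
        for (int topic = 0; topic < topWords.length; topic++) {
            System.out.println("Topic " + topic + ": " + Arrays.toString(topWords[topic]));
        }
    }
}
```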
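The `STANDARD` strategy in use case 3 refers to the rule-based Polish lexicon (SlownikWydzwieku). The simplified sketch below shows lexicon scoring; the word polarities and thresholds are invented for illustration, the real lexicon and cut-offs come from the project:

```java
import java.util.List;
import java.util.Map;

public class LexiconSentimentSketch {
    // Tiny stand-in for the real lexicon: lemma -> polarity in [-1, 1].
    private static final Map<String, Double> LEXICON = Map.of(
            "dobry", 1.0, "świetny", 1.0, "zły", -1.0, "fatalny", -1.0);

    // Sums the polarity of known lemmas and maps the total to a coarse label.
    static String score(List<String> lemmas) {
        double total = lemmas.stream().mapToDouble(lemma -> LEXICON.getOrDefault(lemma, 0.0)).sum();
        if (total > 0) return "POSITIVE";
        if (total < 0) return "NEGATIVE";
        return "NEUTRAL";
    }

    public static void main(String[] args) {
        System.out.println(score(List.of("świetny", "mecz")));   // POSITIVE
        System.out.println(score(List.of("fatalny", "dzień")));  // NEGATIVE
    }
}
```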
## Roadmap
* Integrate **Kafka Streams** for automated tweet updates
* Front-end UI
## License
Source code released under the **MIT License**.
All names, logos and trademarks of **X.com** belong to their respective owners.
Owner
- Name: Bartosz Gnatowski
- Login: bgnatowski
- Kind: user
- Location: Kraków, Poland
- Repositories: 2
- Profile: https://github.com/bgnatowski
Citation (CITATION.cff)
```bibtex
@misc{tabularisai_2025,
  author    = {tabularisai and Samuel Gyamfi and Vadim Borisov and Richard H. Schreiber},
  title     = {multilingual-sentiment-analysis (Revision 69afb83)},
  year      = 2025,
  url       = {https://huggingface.co/tabularisai/multilingual-sentiment-analysis},
  doi       = {10.57967/hf/5968},
  publisher = {Hugging Face}
}
```
GitHub Events
Total
- Push event: 9
- Pull request event: 2
- Create event: 1
Last Year
- Push event: 9
- Pull request event: 2
- Create event: 1
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 0
- Total pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- bgnatowski (2)
Dependencies
pom.xml
maven
- org.postgresql:postgresql
- org.projectlombok:lombok
- org.springframework.boot:spring-boot-starter-data-jpa
- org.springframework.boot:spring-boot-starter-web
- org.springframework.boot:spring-boot-starter-test (test scope)
sentiment-hf/Dockerfile
docker
- python:3.11-slim (build stage)
sentiment-hf/requirements.txt
pypi
- fastapi ==0.111
- torch >=2.0
- transformers ==4.41
- uvicorn ==0.29