anvay: A Web-based Tool for Interpretive Topic Modelling in Bengali

anvay: A Web-based Tool for Interpretive Topic Modelling in Bengali - Published in JOSS (2026)

https://github.com/vinayakdasgupta/anvay

Science Score: 87.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 4 DOI reference(s) in README and JOSS metadata
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
✓ JOSS paper metadata: Published in Journal of Open Source Software

Keywords

bengali digital-humanities flask gensim lda text-analysis topic-modelling

Last synced: 5 months ago

Repository

anvay is a Flask-based Bengali text processing and topic modeling tool that uses Latent Dirichlet Allocation (LDA) to extract topics from uploaded text files.

Basic Info

Host: GitHub
Owner: vinayakdasgupta
License: MIT
Language: HTML
Default Branch: main
Homepage:
Size: 22.9 MB

Statistics

Stars: 4
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 2

Created over 1 year ago · Last pushed 7 months ago

Metadata Files

Readme Changelog Contributing License

anvay: A Bengali Topic Modelling Dashboard

https://github.com/user-attachments/assets/75327a2f-27fb-467a-8ebf-e1585a97e0ec

anvay is a web-based topic modelling interface built for exploring, analysing, and interpreting large corpora of Bengali text. Developed with a focus on literary and historical materials, anvay offers users fine-grained control over preprocessing options and presents results in a structured, interactive interface designed for both researchers and students. The application is modular, interpretable, and lightweight, making it suitable for public deployment and pedagogical use.

Overview

anvay takes plain-text .txt files in Bengali, performs preprocessing (tokenisation, stemming, stopword removal, frequency filtering, n-gram construction), builds a Latent Dirichlet Allocation (LDA) topic model using Gensim, and visualises the results across multiple tabs with topic-wise document insights.

The interface is designed to foreground interpretability over complexity: there is no reliance on neural networks, transformer embeddings, or LLMs. Every transformation is documented and controlled by the user.
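
The early preprocessing stages named in this overview (tokenisation, stopword removal, n-gram construction) can be illustrated with a minimal, standard-library-only sketch. It is not anvay's code: the function names, the two-word stopword set, and the regular expression are assumptions made for illustration.

```python
import re

# Hypothetical two-word stopword set; anvay ships its own Bengali list.
STOPWORDS = {"এবং", "ও"}


def tokenise(text):
    """Return maximal runs of characters from the Bengali Unicode block."""
    return re.findall(r"[\u0980-\u09FF]+", text)


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def add_ngrams(tokens, n=2):
    """Append contiguous n-grams, joined with '_', after the unigrams."""
    grams = ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens + grams


doc = "আমি বই এবং খাতা কিনি।"
tokens = remove_stopwords(tokenise(doc))
print(add_ngrams(tokens, n=2))
```

In this sketch n-grams are built after stopword removal, so a bigram can span a removed word. Whether anvay orders the steps this way is not stated in the README.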

Release 1.1.1

Clustering

Enhanced hierarchical clustering with BERTopic-style merged-cluster keyword tooltips.

Documentation

Updated documentation to explain that Heatmap and Bar Chart visualise the same topic–word weight matrix.

Visualisation

Added top-word hover tooltips across all visualisations for clearer topic interpretation.
Standardised the global topic colour scheme across all charts; marker colours are assigned automatically by golden-ratio hue jumping combined with lightness alternation.
Reduced number of displayed terms in plots to prevent hidden tick labels; added hover-based x-axis details where needed.
Unified Plotly font styling using Roboto/Noto Bengali; reduced margins for a cleaner layout.
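
The golden-ratio hue jumping and lightness alternation mentioned in the notes above can be sketched roughly as follows. This is an illustration of the general technique, not anvay's implementation: the function name, the saturation value, and the two lightness levels are assumptions.

```python
import colorsys

# Fractional part of the golden ratio; stepping the hue by this amount
# spreads successive colours evenly around the colour wheel.
GOLDEN_RATIO_CONJUGATE = 0.618033988749895


def topic_colours(n_topics, start_hue=0.0, saturation=0.65):
    """Return n_topics hex colours with well-separated hues."""
    colours = []
    hue = start_hue
    for i in range(n_topics):
        # Alternate between a darker and a lighter shade (assumed values).
        lightness = 0.45 if i % 2 == 0 else 0.62
        r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
        colours.append(
            "#{:02x}{:02x}{:02x}".format(
                round(r * 255), round(g * 255), round(b * 255)
            )
        )
        hue = (hue + GOLDEN_RATIO_CONJUGATE) % 1.0
    return colours


print(topic_colours(6))
```

Because each colour depends only on its index, topic k receives the same colour in every chart regardless of how many topics are plotted, which is one way to obtain a consistent global scheme.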

Quality-Of-Life

Clarified Topic Evolution axis (document upload order) and added filenames to hover output.
Added missing loading spinner to indicate processing during analysis.

Notes

Hovering on Plotly legends is unfortunately not supported; tooltips are therefore provided directly on the plots.

These changes significantly improve clarity, consistency, and user experience in the visualisation interface.

Features

Upload & Preprocessing

Upload up to 800 UTF-8 encoded .txt files at once (maximum total size: 100MB)
Corpus size and token thresholds enforced to ensure browser responsiveness
Preprocessing controls include:
- Standard + custom Bengali stopwords
- no_below and no_above frequency thresholds
- Top-N% most frequent tokens filter
- N-gram selection: unigrams, bigrams, trigrams
- Dictionary-based stemming
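
The frequency controls listed above can be sketched in plain Python. The `no_below` and `no_above` semantics here follow Gensim's `Dictionary.filter_extremes` (an absolute document count and a fraction of the corpus, respectively). The reading of the Top-N% filter as "drop the N% of remaining vocabulary with the highest corpus-wide counts" is an assumption, as are all function and parameter names.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each token, how many documents contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def filter_vocabulary(docs, no_below=2, no_above=0.5, drop_top_percent=0.0):
    """Filter tokenised documents by document frequency and top-N% rank."""
    df = document_frequencies(docs)
    n_docs = len(docs)
    # Keep tokens found in at least `no_below` documents and in no more
    # than a fraction `no_above` of all documents.
    keep = {t for t, n in df.items() if n >= no_below and n / n_docs <= no_above}
    if drop_top_percent > 0:
        # Assumed interpretation: rank surviving tokens by corpus-wide
        # count and drop the top N% of that vocabulary.
        tf = Counter(t for tokens in docs for t in tokens if t in keep)
        n_drop = int(len(tf) * drop_top_percent / 100)
        keep -= {t for t, _ in tf.most_common(n_drop)}
    return [[t for t in tokens if t in keep] for tokens in docs]


docs = [["নদী", "জল", "নৌকা"], ["নদী", "জল"], ["নদী", "পাহাড়"], ["জল", "মেঘ"]]
print(filter_vocabulary(docs, no_below=2, no_above=0.75))
```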

Dictionary-Based Stemming (v1.1.0)

Replaces earlier rule-based suffix stripping
Offers better semantic interpretability and topic-word clarity
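
As a rough illustration of the idea, dictionary-based stemming amounts to a lookup from an inflected form to a base form, with unknown words passed through unchanged. The table below is a hypothetical three-entry example; anvay's actual dictionary and function names are not reproduced here.

```python
# Hypothetical lookup table: inflected form -> base form.
# These entries are illustrative only.
STEM_TABLE = {
    "ছেলেরা": "ছেলে",
    "বইগুলো": "বই",
    "নদীতে": "নদী",
}


def stem(token, table=STEM_TABLE):
    """Return the dictionary base form, or the token itself if unknown."""
    return table.get(token, token)


def stem_tokens(tokens, table=STEM_TABLE):
    """Apply the lookup to every token in a document."""
    return [stem(t, table) for t in tokens]


print(stem_tokens(["ছেলেরা", "নদীতে", "আকাশ"]))
```

Unlike suffix stripping, a lookup never produces a truncated non-word, which is one plausible reason for the improved topic-word clarity noted above.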

Topic Modelling

Gensim's LdaMulticore implementation for fast, multicore topic modelling
Tunable parameters:
- Number of topics
- Passes and iterations
- Alpha and Eta priors
- Chunk size
- Minimum probability threshold
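
The tunable parameters above correspond to keyword arguments of Gensim's `LdaMulticore`. The sketch below shows one way they might be collected and passed through. The helper names `lda_kwargs` and `train_lda` and the default values are assumptions for illustration, not anvay's own. `train_lda` requires Gensim to be installed; `lda_kwargs` does not.

```python
def lda_kwargs(num_topics=10, passes=10, iterations=50, alpha="symmetric",
               eta=None, chunksize=2000, minimum_probability=0.01,
               workers=None):
    """Validate user-supplied settings and return LdaMulticore keyword args."""
    if num_topics < 1:
        raise ValueError("num_topics must be at least 1")
    if alpha == "auto":
        # LdaMulticore does not implement auto-tuned alpha; plain LdaModel does.
        raise ValueError("alpha='auto' is not supported by LdaMulticore")
    return dict(
        num_topics=num_topics,
        passes=passes,
        iterations=iterations,
        alpha=alpha,
        eta=eta,
        chunksize=chunksize,
        minimum_probability=minimum_probability,
        workers=workers,
    )


def train_lda(tokenised_docs, **settings):
    """Build a dictionary and bag-of-words corpus, then fit the model."""
    # Imported lazily so the validation helper works without Gensim.
    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    dictionary = Dictionary(tokenised_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenised_docs]
    model = LdaMulticore(corpus=corpus, id2word=dictionary,
                         **lda_kwargs(**settings))
    return model, dictionary, corpus


print(lda_kwargs(num_topics=5, passes=20))
```

On platforms that start worker processes by spawning (Windows, and macOS by default), a call to `train_lda` from a script should sit under an `if __name__ == "__main__":` guard.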

Visualisations (Tabbed UI)

Visualisations Tab: Bar chart, scatter plot, pie chart, heatmap, topic-word network graph
Report Tab: Training summary, top tokens, topic prevalence, representative documents
Downloads Tab: Export results as CSV and TXT
Guide Tab: Step-by-step interpretive instructions

Topic-Document Drilldown

Per-topic list of most representative documents
Context-aware sentence preview
Topic label and confidence indicator
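
A per-topic list of representative documents can be derived from the document-topic probabilities by ranking documents within each topic. The sketch below assumes a dense list-of-lists matrix and treats the topic probability as the confidence value; both choices, and the function name, are illustrative assumptions.

```python
def representative_documents(doc_topics, filenames, top_n=3):
    """Rank documents within each topic by their topic probability.

    doc_topics: one list of topic probabilities per document.
    filenames:  document names, in the same order as doc_topics.
    Returns {topic_index: [(filename, probability), ...]}.
    """
    n_topics = len(doc_topics[0]) if doc_topics else 0
    result = {}
    for k in range(n_topics):
        ranked = sorted(
            range(len(doc_topics)),
            key=lambda d: doc_topics[d][k],
            reverse=True,
        )
        result[k] = [(filenames[d], doc_topics[d][k]) for d in ranked[:top_n]]
    return result


doc_topics = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
files = ["a.txt", "b.txt", "c.txt"]
print(representative_documents(doc_topics, files, top_n=2))
```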

Design Principles

Mobile-friendly and responsive layout

Documentation

anvay includes a fully integrated documentation panel accessible from the interface itself. The documentation is designed not merely as technical reference, but as a pedagogical aid that walks users through each stage of the topic modelling process — from corpus preparation and parameter selection to result interpretation. It explains preprocessing choices (e.g. stopword filtering, n-gram selection, stemming) in clear language, and provides visual examples and tooltips to guide first-time users. The documentation also includes a walkthrough of a sample run, highlighting what users can expect from the model outputs. Importantly, the documentation assumes no prior knowledge of machine learning, making anvay accessible to students, scholars, and corpus curators working with Bengali texts.

Screenshots

Upload interface screenshot

Documentation interface screenshot

Visualization interface screenshot

Report interface screenshot

Technical Stack

Backend: Python (Flask), Gensim, NLTK, NetworkX, Scikit-learn
Frontend: Bootstrap, jQuery, Plotly, Bokeh, Seaborn
Deployment: Designed to be hosted on a university or personal server (e.g., via Gunicorn)

Installation

anvay has been tested with Python 3.9-3.11 and Gensim 4.3.x.
Two installation methods are provided:

Standard installation (virtual environment)
Docker-based installation

Option 1: Standard installation (virtual environment)

This method installs anvay directly on your system using a Python virtual environment.

Prerequisites

Python 3.9-3.11
Git
pip (Python package installer)

Step-by-step instructions

Clone the repository:

```bash
git clone https://github.com/vinayakdasgupta/anvay.git
cd anvay
```

Create and activate a virtual environment (recommended):

```bash
python -m venv venv
source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Run the application:

```bash
python app.py
```

Access the web interface:

http://localhost:5000

You can now upload .txt files and begin exploring topics.

Option 2: Docker-based installation (recommended for reproducibility)

Prerequisites

Docker (Docker Desktop on Windows/macOS)

Step-by-step instructions

Clone the repository:

```bash
git clone https://github.com/vinayakdasgupta/anvay.git
cd anvay
```

Build the Docker image:

```bash
docker build -t anvay .
```

Run the container:

```bash
docker run -p 5000:5000 anvay
```

Access the web interface:

http://localhost:5000

You can now upload .txt files and begin exploring topics.

How to Cite anvay

If you use anvay in academic work, please cite it as follows:

Das Gupta, Vinayak. anvay: A Web-based Tool for Interpretive Topic Modelling in Bengali.
Zenodo. https://doi.org/10.5281/zenodo.18186215

Once a DOI or formal publication is available, this should be replaced with the appropriate citation.

Referenced Datasets and Libraries

The following tools, datasets, and libraries are used in anvay and should be cited as appropriate:

Lemmatization Dataset

```bibtex
@inproceedings{chakrabarty-etal-2017-context,
  title = "Context Sensitive Lemmatization Using Two Successive Bidirectional Gated Recurrent Networks",
  author = "Chakrabarty, Abhisek and Pandit, Onkar Arun and Garain, Utpal",
  editor = "Barzilay, Regina and Kan, Min-Yen",
  booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  month = jul,
  year = "2017",
  address = "Vancouver, Canada",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/P17-1136/",
  doi = "10.18653/v1/P17-1136",
  pages = "1481--1491"
}

@article{alam2021review,
  title = {A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models},
  author = {Alam, Firoj and Hasan, Md Arid and Alam, Tanvir and Khan, Akib and Tajrin, Janntatul and Khan, Naira and Chowdhury, Shammur Absar},
  journal = {arXiv preprint arXiv:2107.03844},
  year = {2021}
}
```

Gensim

```bibtex
@inproceedings{rehurek_lrec,
  author = {Řehůřek, Radim and Sojka, Petr},
  title = {Software Framework for Topic Modelling with Large Corpora},
  booktitle = {Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks},
  pages = {45--50},
  year = {2010},
  publisher = {ELRA},
  address = {Valletta, Malta}
}
```

NLTK

```bibtex
@book{bird2009natural,
  author = {Bird, Steven and Klein, Ewan and Loper, Edward},
  title = {Natural Language Processing with Python},
  year = {2009},
  publisher = {O'Reilly Media, Inc.}
}
```

NetworkX

```bibtex
@inproceedings{hagberg2008exploring,
  author = {Hagberg, Aric A. and Schult, Daniel A. and Swart, Pieter J.},
  title = {Exploring Network Structure, Dynamics, and Function using NetworkX},
  booktitle = {Proceedings of the 7th Python in Science Conference (SciPy2008)},
  pages = {11--15},
  year = {2008}
}
```

Scikit-learn

```bibtex
@article{pedregosa2011scikit,
  author = {Pedregosa, Fabian et al.},
  title = {Scikit-learn: Machine Learning in Python},
  journal = {Journal of Machine Learning Research},
  volume = {12},
  pages = {2825--2830},
  year = {2011}
}
```

Plotly

Plotly Technologies Inc. (2015). Collaborative data science. Montreal, QC. https://plot.ly

Acknowledgements

anvay draws on multiple open-source projects:

- Gensim – topic modelling
- NLTK – stopword filtering
- Plotly, Seaborn, Matplotlib – visualisation
- NetworkX – topic-word graph
- Scikit-learn – PCA and clustering
- Flask – web application framework

License

anvay is released under the MIT License.

Contact

Vinayak Das Gupta
Shiv Nadar University [https://vinayakdasgupta.com]

For questions, suggestions, or scholarly collaborations, please open an issue or contact via GitHub.

Owner

Login: vinayakdasgupta
Kind: user

Repositories: 2
Profile: https://github.com/vinayakdasgupta

JOSS Publication

anvay: A Web-based Tool for Interpretive Topic Modelling in Bengali

Published

February 17, 2026

DOI

10.21105/joss.08641

Volume 11, Issue 118, Page 8641

Authors

Vinayak Das Gupta

Shiv Nadar Institution of Eminence

Editor

Abhishek Tiwari

GitHub Events

Total

Release event: 1
Issues event: 2
Watch event: 4
Issue comment event: 2
Public event: 1
Push event: 52
Create event: 2

Last Year

Release event: 1
Issues event: 2
Watch event: 4
Issue comment event: 2
Push event: 39

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 4
Total pull requests: 0
Average time to close issues: about 1 month
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 3.75
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 4
Pull requests: 0
Average time to close issues: about 1 month
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 3.75
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

x-tabdeveloping (4)

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Dependencies

requirements.txt pypi

MarkupSafe ==2.1.5
flask ==3.1.0
gensim ==4.3.3
itsdangerous ==2.2.0
jinja2 ==3.1.4
nltk ==3.9.1
numpy ==2.0.2
pyLDAvis ==3.4.1
six ==1.15.0
werkzeug ==3.1.3
