anvay: A Web-based Tool for Interpretive Topic Modelling in Bengali
anvay: A Web-based Tool for Interpretive Topic Modelling in Bengali - Published in JOSS (2026)
✓JOSS paper metadata
Published in Journal of Open Source Software
Repository
anvay is a Flask-based Bengali text processing and topic modeling tool that uses Latent Dirichlet Allocation (LDA) to extract topics from uploaded text files.
Basic Info
Statistics
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
anvay: A Bengali Topic Modelling Dashboard
https://github.com/user-attachments/assets/75327a2f-27fb-467a-8ebf-e1585a97e0ec
anvay is a web-based topic modelling interface built for exploring, analysing, and interpreting large corpora of Bengali text. Developed with a focus on literary and historical materials, anvay offers users fine-grained control over preprocessing options and presents results in a structured, interactive interface designed for both researchers and students. The application is modular, interpretable, and lightweight, making it suitable for public deployment and pedagogical use.
Overview
anvay takes plain-text .txt files in Bengali, performs preprocessing (tokenisation, stemming, stopword removal, frequency filtering, n-gram construction), builds a Latent Dirichlet Allocation (LDA) topic model using Gensim, and visualises the results across multiple tabs with topic-wise document insights.
The interface is designed to foreground interpretability over complexity: there is no reliance on neural networks, transformer embeddings, or LLMs. Every transformation is documented and controlled by the user.
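The early preprocessing stages can be illustrated with a minimal sketch. This is not anvay's own code: the function names, the regular expression, and the two-word stopword list are assumptions made for illustration, and stemming and frequency filtering are left out.

```python
import re

# Hypothetical two-word stopword list; anvay ships its own, much longer one.
STOPWORDS = {"এবং", "ও"}

# Characters in the Bengali Unicode block (U+0980 to U+09FF) form a token.
# The danda (U+0964) lies outside this block, so it acts as a separator.
TOKEN_RE = re.compile(r"[\u0980-\u09FF]+")


def tokenise(text):
    """Split raw text into Bengali-script tokens, dropping punctuation and Latin text."""
    return TOKEN_RE.findall(text)


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def add_ngrams(tokens, n=2):
    """Return the unigrams followed by all n-grams, joined with an underscore."""
    ngrams = ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens + ngrams


tokens = add_ngrams(remove_stopwords(tokenise("আমি বই কিনি এবং গান শুনি।")))
```

Because stopwords are removed before n-grams are built in this sketch, a bigram can span a position where a stopword used to be. Whether anvay orders the steps this way is not stated here.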
Release 1.1.1
Clustering
- Enhanced hierarchical clustering with BERTopic-style merged-cluster keyword tooltips.
Documentation
- Updated documentation to explain that Heatmap and Bar Chart visualise the same topic–word weight matrix.
Visualisation
- Added top-word hover tooltips across all visualisations for clearer topic interpretation.
- Standardised global topic colour scheme across all charts. Marker colours are assigned automatically by golden-ratio hue stepping combined with alternating lightness.
- Reduced number of displayed terms in plots to prevent hidden tick labels; added hover-based x-axis details where needed.
- Unified Plotly font styling using Roboto/Noto Bengali; reduced margins for a cleaner layout.
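The golden-ratio colour assignment mentioned above can be sketched as follows. This is a generic illustration of the technique using Python's standard `colorsys` module. The function name, saturation, and the two lightness values are assumptions, not values taken from anvay.

```python
import colorsys

# Fractional part of the golden ratio; stepping the hue by this amount
# spreads successive colours widely around the colour wheel.
GOLDEN_RATIO_CONJUGATE = 0.618033988749895


def topic_colours(n_topics, start_hue=0.0, saturation=0.65):
    """Return n_topics hex colours using golden-ratio hue steps and alternating lightness."""
    colours = []
    hue = start_hue
    for i in range(n_topics):
        # Alternate between a darker and a lighter shade (assumed values).
        lightness = 0.45 if i % 2 == 0 else 0.65
        r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
        colours.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
        hue = (hue + GOLDEN_RATIO_CONJUGATE) % 1.0
    return colours


palette = topic_colours(10)
```

The palette is deterministic, so a given topic index maps to the same colour in every chart.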
Quality-Of-Life
- Clarified Topic Evolution axis (document upload order) and added filenames to hover output.
- Added missing loading spinner to indicate processing during analysis.
Notes
- Hovering on Plotly legends is unfortunately not supported; tooltips are therefore provided directly on the plots.
These changes significantly improve clarity, consistency, and user experience in the visualisation interface.
Features
Upload & Preprocessing
- Upload up to 800 UTF-8 encoded .txt files at once (maximum total size: 100MB)
- Corpus size and token thresholds enforced to ensure browser responsiveness
- Preprocessing controls include:
- Standard + custom Bengali stopwords
- no_below and no_above frequency thresholds
- Top-N% most frequent tokens filter
- N-gram selection: unigrams, bigrams, trigrams
- Dictionary-based stemming
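The no_below and no_above thresholds share their names with the parameters of Gensim's `Dictionary.filter_extremes`: a token is kept only if it occurs in at least no_below documents and in no more than a no_above fraction of all documents. The plain-Python sketch below reimplements that idea for illustration. The function name and the sample corpus are hypothetical, and this is not anvay's code.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    a `no_above` fraction of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    keep = {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}
    return [[tok for tok in doc if tok in keep] for doc in docs]


docs = [
    ["নদী", "জল", "গ্রাম"],
    ["নদী", "জল", "শহর"],
    ["নদী", "গ্রাম"],
    ["নদী", "আকাশ"],
]
filtered = filter_by_document_frequency(docs, no_below=2, no_above=0.5)
```

In this sample, "নদী" appears in all four documents and is dropped as too common, while "শহর" and "আকাশ" appear in only one document each and are dropped as too rare.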
Dictionary-Based Stemming (v1.1.0)
- Replaces earlier rule-based suffix stripping
- Offers better semantic interpretability and topic-word clarity
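Dictionary-based stemming can be pictured as a lookup from an inflected surface form to a base form. The sketch below assumes a simple mapping. The three entries and the function name are illustrative, and the format of anvay's actual dictionary may differ.

```python
# Hypothetical lookup table mapping inflected forms to a base form.
STEM_LOOKUP = {
    "বইগুলো": "বই",
    "ছেলেরা": "ছেলে",
    "নদীতে": "নদী",
}


def stem(tokens, lookup=STEM_LOOKUP):
    """Replace each token with its dictionary base form; unknown tokens pass through unchanged."""
    return [lookup.get(tok, tok) for tok in tokens]


stemmed = stem(["বইগুলো", "নদীতে", "আকাশ"])
```

Unlike rule-based suffix stripping, a lookup never truncates a word it does not know, which is one reason the resulting topic words tend to stay readable.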
Topic Modelling
- Gensim's LdaMulticore implementation for fast, multicore topic modelling
- Tunable parameters:
- Number of topics
- Passes and iterations
- Alpha and Eta priors
- Chunk size
- Minimum probability threshold
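The tunable parameters above correspond to keyword arguments of Gensim's `LdaMulticore`. The sketch below shows one way to collect and validate them before training. The `LdaSettings` class, its default values, and the `train` helper are hypothetical and are not part of anvay.

```python
from dataclasses import dataclass, asdict


@dataclass
class LdaSettings:
    """Hypothetical container for the tunable parameters listed above."""
    num_topics: int = 10
    passes: int = 10
    iterations: int = 50
    alpha: str = "symmetric"
    eta: str = "symmetric"
    chunksize: int = 2000
    minimum_probability: float = 0.01

    def to_kwargs(self):
        """Validate the settings and return them as keyword arguments."""
        if self.num_topics < 1:
            raise ValueError("num_topics must be positive")
        if not 0.0 <= self.minimum_probability <= 1.0:
            raise ValueError("minimum_probability must lie in [0, 1]")
        return asdict(self)


def train(corpus, dictionary, settings):
    """Train a model; gensim is imported lazily so the settings class works without it."""
    from gensim.models import LdaMulticore
    return LdaMulticore(corpus=corpus, id2word=dictionary, **settings.to_kwargs())


kwargs = LdaSettings(num_topics=5, passes=20).to_kwargs()
```

Here `corpus` is expected to be a bag-of-words corpus and `dictionary` a Gensim `Dictionary` built from the preprocessed tokens.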
Visualisations (Tabbed UI)
- Visualisations Tab: Bar chart, scatter plot, pie chart, heatmap, topic-word network graph
- Report Tab: Training summary, top tokens, topic prevalence, representative documents
- Downloads Tab: Export results as CSV and TXT
- Guide Tab: Step-by-step interpretive instructions
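A CSV export of topic words, as offered in the Downloads tab, might look like the sketch below. The column layout and the function name are assumptions, since the exact export format is not described here.

```python
import csv
import io


def topics_to_csv(topics):
    """Serialise {topic_id: [(word, weight), ...]} into CSV text, one row per topic-word pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["topic", "rank", "word", "weight"])
    for topic_id in sorted(topics):
        for rank, (word, weight) in enumerate(topics[topic_id], start=1):
            writer.writerow([topic_id, rank, word, f"{weight:.4f}"])
    return buffer.getvalue()


csv_text = topics_to_csv({0: [("নদী", 0.12), ("জল", 0.08)]})
```

When saving such text to disk, writing it with the `utf-8-sig` encoding usually helps spreadsheet applications detect the Bengali script correctly.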
Topic-Document Drilldown
- Per-topic list of most representative documents
- Context-aware sentence preview
- Topic label and confidence indicator
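The per-topic list of representative documents can be derived from the document-topic distributions. The sketch below assumes each document is described by a list of (topic id, probability) pairs, the shape returned by Gensim's `get_document_topics`, and treats the probability as the confidence value. The function name and sample data are hypothetical.

```python
def representative_documents(doc_topics, filenames, top_n=3):
    """Map each topic id to its top_n (filename, probability) pairs, highest probability first."""
    by_topic = {}
    for name, distribution in zip(filenames, doc_topics):
        for topic_id, prob in distribution:
            by_topic.setdefault(topic_id, []).append((name, prob))
    return {
        topic_id: sorted(pairs, key=lambda pair: pair[1], reverse=True)[:top_n]
        for topic_id, pairs in by_topic.items()
    }


doc_topics = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.2), (1, 0.8)],
    [(0, 0.6), (1, 0.4)],
]
ranking = representative_documents(doc_topics, ["a.txt", "b.txt", "c.txt"], top_n=2)
```

With this sample input, topic 0 is best represented by a.txt and then c.txt, and topic 1 by b.txt and then c.txt.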
Design Principles
- Mobile-friendly and responsive layout
Documentation
anvay includes a fully integrated documentation panel accessible from the interface itself. The documentation is designed not merely as technical reference, but as a pedagogical aid that walks users through each stage of the topic modelling process — from corpus preparation and parameter selection to result interpretation. It explains preprocessing choices (e.g. stopword filtering, n-gram selection, stemming) in clear language, and provides visual examples and tooltips to guide first-time users. The documentation also includes a walkthrough of a sample run, highlighting what users can expect from the model outputs. Importantly, the documentation assumes no prior knowledge of machine learning, making anvay accessible to students, scholars, and corpus curators working with Bengali texts.
Screenshots
Upload interface screenshot
Documentation interface screenshot
Visualization interface screenshot
Report interface screenshot
Technical Stack
- Backend: Python (Flask), Gensim, NLTK, NetworkX, Scikit-learn
- Frontend: Bootstrap, jQuery, Plotly, Bokeh, Seaborn
- Deployment: Designed to be hosted on a university or personal server (e.g., via Gunicorn)
Installation
anvay has been tested with Python 3.9-3.11 and Gensim 4.3.x.
Two installation methods are provided:
- Standard installation (virtual environment)
- Docker-based installation
Option 1: Standard installation (virtual environment)
This method installs anvay directly on your system using a Python virtual environment.
Prerequisites
- Python 3.9-3.11
- Git
- pip (Python package installer)
Step-by-step instructions
Clone the repository:
```bash
git clone https://github.com/vinayakdasgupta/anvay.git
cd anvay
```
Create and activate a virtual environment (recommended):
```bash
python -m venv venv
source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows
```
Install dependencies:
```bash
pip install -r requirements.txt
```
Run the application:
```bash
python app.py
```
Access the web interface:
http://localhost:5000
You can now upload .txt files and begin exploring topics.
Option 2: Docker-based installation (recommended for reproducibility)
Prerequisites
- Docker (Docker Desktop on Windows/macOS)
Step-by-step instructions
Clone the repository:
```bash
git clone https://github.com/vinayakdasgupta/anvay.git
cd anvay
```
Build the Docker image:
```bash
docker build -t anvay .
```
Run the container:
```bash
docker run -p 5000:5000 anvay
```
Access the web interface:
http://localhost:5000
You can now upload .txt files and begin exploring topics.
How to Cite anvay
If you use anvay in academic work, please cite it as follows:
Das Gupta, Vinayak. anvay: A Web-based Tool for Interpretive Topic Modelling in Bengali.
Zenodo. https://doi.org/10.5281/zenodo.18186215
Once a DOI or formal publication is available, this should be replaced with the appropriate citation.
Referenced Datasets and Libraries
The following tools, datasets, and libraries are used in anvay and should be cited as appropriate:
Lemmatization Dataset
```bibtex
@inproceedings{chakrabarty-etal-2017-context,
  title = "Context Sensitive Lemmatization Using Two Successive Bidirectional Gated Recurrent Networks",
  author = "Chakrabarty, Abhisek and Pandit, Onkar Arun and Garain, Utpal",
  editor = "Barzilay, Regina and Kan, Min-Yen",
  booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  month = jul,
  year = "2017",
  address = "Vancouver, Canada",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/P17-1136/",
  doi = "10.18653/v1/P17-1136",
  pages = "1481--1491"
}

@article{alam2021review,
  title = {A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models},
  author = {Alam, Firoj and Hasan, Md Arid and Alam, Tanvir and Khan, Akib and Tajrin, Janntatul and Khan, Naira and Chowdhury, Shammur Absar},
  journal = {arXiv preprint arXiv:2107.03844},
  year = {2021}
}
```
Gensim
```bibtex
@inproceedings{rehurek_lrec,
  author = {Řehůřek, Radim and Sojka, Petr},
  title = {Software Framework for Topic Modelling with Large Corpora},
  booktitle = {Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks},
  pages = {45--50},
  year = {2010},
  publisher = {ELRA},
  address = {Valletta, Malta},
}
```
NLTK
```bibtex
@book{bird2009natural,
  author = {Bird, Steven and Klein, Ewan and Loper, Edward},
  title = {Natural Language Processing with Python},
  year = {2009},
  publisher = {O'Reilly Media, Inc.}
}
```
NetworkX
```bibtex
@inproceedings{hagberg2008exploring,
  author = {Hagberg, Aric A. and Schult, Daniel A. and Swart, Pieter J.},
  title = {Exploring Network Structure, Dynamics, and Function using NetworkX},
  booktitle = {Proceedings of the 7th Python in Science Conference (SciPy2008)},
  pages = {11--15},
  year = {2008}
}
```
Scikit-learn
```bibtex
@article{pedregosa2011scikit,
  author = {Pedregosa, Fabian and others},
  title = {Scikit-learn: Machine Learning in Python},
  journal = {Journal of Machine Learning Research},
  volume = {12},
  pages = {2825--2830},
  year = {2011}
}
```
Plotly
Plotly Technologies Inc. (2015). Collaborative data science. Montreal, QC. https://plot.ly
Acknowledgements
anvay draws on multiple open-source projects:
- Gensim – topic modelling
- NLTK – stopword filtering
- Plotly, Seaborn, Matplotlib – visualisation
- NetworkX – topic-word graph
- Scikit-learn – PCA and clustering
- Flask – web application framework
License
anvay is released under the MIT License.
Contact
Vinayak Das Gupta
Shiv Nadar University
https://vinayakdasgupta.com
For questions, suggestions, or scholarly collaborations, please open an issue or contact via GitHub.
Owner
- Login: vinayakdasgupta
- Kind: user
- Repositories: 2
- Profile: https://github.com/vinayakdasgupta
JOSS Publication
anvay: A Web-based Tool for Interpretive Topic Modelling in Bengali
Tags
topic modelling bengali language natural language processing digital humanities interpretability pedagogy
GitHub Events
Total
- Release event: 1
- Issues event: 2
- Watch event: 4
- Issue comment event: 2
- Public event: 1
- Push event: 52
- Create event: 2
Last Year
- Release event: 1
- Issues event: 2
- Watch event: 4
- Issue comment event: 2
- Push event: 39
Issues and Pull Requests
Last synced: about 1 month ago
All Time
- Total issues: 4
- Total pull requests: 0
- Average time to close issues: about 1 month
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 3.75
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 4
- Pull requests: 0
- Average time to close issues: about 1 month
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 3.75
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- x-tabdeveloping (4)
Dependencies
- MarkupSafe ==2.1.5
- flask ==3.1.0
- gensim ==4.3.3
- itsdangerous ==2.2.0
- jinja2 ==3.1.4
- nltk ==3.9.1
- numpy ==2.0.2
- pyLDAvis ==3.4.1
- six ==1.15.0
- werkzeug ==3.1.3
