anvay: A Web-based Tool for Interpretive Topic Modelling in Bengali

anvay: A Web-based Tool for Interpretive Topic Modelling in Bengali - Published in JOSS (2026)

https://github.com/vinayakdasgupta/anvay

Science Score: 87.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

bengali digital-humanities flask gensim lda text-analysis topic-modelling
Last synced: 3 days ago

Repository

anvay is a Flask-based Bengali text processing and topic modeling tool that uses Latent Dirichlet Allocation (LDA) to extract topics from uploaded text files.

Basic Info
  • Host: GitHub
  • Owner: vinayakdasgupta
  • License: mit
  • Language: HTML
  • Default Branch: main
  • Homepage:
  • Size: 22.9 MB
Statistics
  • Stars: 4
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created about 1 year ago · Last pushed 2 months ago
Metadata Files
Readme Changelog Contributing License

README.md

anvay: A Bengali Topic Modelling Dashboard

https://github.com/user-attachments/assets/75327a2f-27fb-467a-8ebf-e1585a97e0ec

anvay is a web-based topic modelling interface built for exploring, analysing, and interpreting large corpora of Bengali text. Developed with a focus on literary and historical materials, anvay offers users fine-grained control over preprocessing options and presents results in a structured, interactive interface designed for both researchers and students. The application is modular, interpretable, and lightweight, making it suitable for public deployment and pedagogical use.

Overview

anvay takes plain-text .txt files in Bengali, performs preprocessing (tokenisation, stemming, stopword removal, frequency filtering, n-gram construction), builds a Latent Dirichlet Allocation (LDA) topic model using Gensim, and visualises the results across multiple tabs with topic-wise document insights.

The interface is designed to foreground interpretability over complexity: there is no reliance on neural networks, transformer embeddings, or LLMs. Every transformation is documented and controlled by the user.
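The first stages of the pipeline described above can be sketched with the standard library alone. This is an illustration, not anvay's implementation: the regex tokeniser, the three-word stopword set, and all function names are assumptions made for the example.

```python
import re

# Assumption: a token is a run of characters from the Bengali Unicode block
# (U+0980 to U+09FF). anvay's own tokeniser may behave differently.
BENGALI_TOKEN = re.compile(r"[\u0980-\u09FF]+")

# Hypothetical, deliberately tiny stopword set for illustration only.
STOPWORDS = {"এবং", "ও", "এই"}


def tokenise(text):
    """Return Bengali-script tokens in order of appearance."""
    return BENGALI_TOKEN.findall(text)


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]


def add_bigrams(tokens):
    """Return the unigrams followed by adjacent-pair bigrams joined with '_'."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def preprocess(text):
    """Tokenise, remove stopwords, then add bigrams (in that order)."""
    return add_bigrams(remove_stopwords(tokenise(text)))
```

Because bigrams are built after stopword removal in this sketch, a bigram can join two words that were separated by a stopword in the original sentence. Whether that is desirable depends on the corpus.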


Release 1.1.1

Clustering

  • Enhanced hierarchical clustering with BERTopic-style merged-cluster keyword tooltips.

Documentation

  • Updated documentation to explain that Heatmap and Bar Chart visualise the same topic–word weight matrix.

Visualisation

  • Added top-word hover tooltips across all visualisations for clearer topic interpretation.
  • Standardised global topic colour scheme across all charts. Marker colours are assigned automatically by golden-ratio hue stepping combined with lightness alternation.
  • Reduced number of displayed terms in plots to prevent hidden tick labels; added hover-based x-axis details where needed.
  • Unified Plotly font styling using Roboto/Noto Bengali; reduced margins for a cleaner layout.
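The golden-ratio colour assignment mentioned in the list above can be sketched as follows. This is one possible reading of the technique rather than anvay's actual code: the function name, the saturation value, and the two lightness levels are assumptions.

```python
import colorsys

# Fractional part of the golden ratio; stepping the hue by this amount
# spreads successive colours evenly around the colour wheel.
GOLDEN_RATIO_CONJUGATE = 0.618033988749895


def topic_colours(n_topics, start_hue=0.0, saturation=0.65,
                  lightness_levels=(0.45, 0.60)):
    """Return n_topics hex colours with golden-ratio hue steps and alternating lightness."""
    colours = []
    hue = start_hue
    for i in range(n_topics):
        # Alternate between the lightness levels so that neighbouring
        # topics differ in brightness as well as in hue.
        lightness = lightness_levels[i % len(lightness_levels)]
        r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
        colours.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
        hue = (hue + GOLDEN_RATIO_CONJUGATE) % 1.0
    return colours
```

The sequence is deterministic, so topic 3 receives the same colour in every chart as long as all charts index into the same list.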

Quality-Of-Life

  • Clarified Topic Evolution axis (document upload order) and added filenames to hover output.
  • Added missing loading spinner to indicate processing during analysis.

Notes

  • Hovering on Plotly legends is unfortunately not supported; tooltips are therefore provided directly on the plots.

These changes significantly improve clarity, consistency, and user experience in the visualisation interface.

Features

Upload & Preprocessing

  • Upload up to 800 UTF-8 encoded .txt files at once (maximum total size: 100 MB)
  • Corpus size and token thresholds enforced to ensure browser responsiveness
  • Preprocessing controls include:
    • Standard + custom Bengali stopwords
    • no_below and no_above frequency thresholds
    • Top-N% most frequent tokens filter
    • N-gram selection: unigrams, bigrams, trigrams
    • Dictionary-based stemming
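To make the no_below and no_above thresholds concrete, here is a small standard-library sketch of document-frequency filtering. It mirrors the general idea behind Gensim's `Dictionary.filter_extremes`, but it is not that function, and whether anvay applies the thresholds in exactly this way is an assumption.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents.

    docs: list of token lists, one per document.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document

    max_docs = no_above * len(docs)
    keep = {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}
    return [[tok for tok in doc if tok in keep] for doc in docs]
```

With four documents, `no_below=2` and `no_above=0.75`, a token that occurs in all four documents is dropped as too common, and a token that occurs in only one is dropped as too rare.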

Dictionary-Based Stemming (v1.1.0)

  • Replaces earlier rule-based suffix stripping
  • Offers better semantic interpretability and topic-word clarity
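The difference between the two approaches can be illustrated with a short sketch. The lookup table, the suffix list, and both function names are hypothetical; anvay's actual dictionary and the earlier rule set are not reproduced here.

```python
# Hypothetical lookup table: inflected form -> dictionary form.
LEMMA_LOOKUP = {
    "বইগুলি": "বই",
    "ছেলেরা": "ছেলে",
}


def stem_tokens(tokens, lookup=LEMMA_LOOKUP):
    """Dictionary-based stemming: unknown tokens pass through unchanged."""
    return [lookup.get(tok, tok) for tok in tokens]


def strip_suffix(token, suffixes=("গুলি", "রা")):
    """Naive rule-based suffix stripping, shown only for contrast."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token
```

In this toy example the rule-based stripper truncates "তারা" to "তা" because the word happens to end in a listed suffix, whereas the dictionary lookup leaves it intact since it has no entry for it. Output tokens from a lookup are always real word forms, which is what makes topic words easier to read.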

Topic Modelling

  • Gensim's LdaMulticore implementation for fast, multicore topic modelling
  • Tunable parameters:
    • Number of topics
    • Passes and iterations
    • Alpha and Eta priors
    • Chunk size
    • Minimum probability threshold
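One way to keep these parameters together is a small settings object that validates values and produces keyword arguments for Gensim. The class below is a hypothetical sketch: the field names match `LdaMulticore`'s keyword arguments, but the default values are illustrative and are not claimed to be anvay's.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class LdaSettings:
    """Hypothetical container for the tunable parameters listed above."""
    num_topics: int = 10
    passes: int = 10
    iterations: int = 50
    alpha: object = "symmetric"  # "symmetric", "asymmetric", a float, or an array
    eta: object = None           # None leaves Gensim's default prior in place
    chunksize: int = 2000
    minimum_probability: float = 0.01

    def validate(self):
        """Raise ValueError on obviously invalid values; return self otherwise."""
        for name in ("num_topics", "passes", "iterations", "chunksize"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be a positive integer")
        if not 0.0 <= self.minimum_probability <= 1.0:
            raise ValueError("minimum_probability must be between 0 and 1")
        return self

    def to_kwargs(self):
        """Return the validated settings as a plain dict of keyword arguments."""
        return asdict(self.validate())


# Assumed usage, given a Gensim corpus and dictionary built elsewhere:
#   from gensim.models import LdaMulticore
#   model = LdaMulticore(corpus=corpus, id2word=dictionary,
#                        **LdaSettings(num_topics=8).to_kwargs())
```

Note that `LdaMulticore` does not accept `alpha="auto"`; learned asymmetric priors are only available in Gensim's single-core `LdaModel`.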

Visualisations (Tabbed UI)

  • Visualisations Tab: Bar chart, scatter plot, pie chart, heatmap, topic-word network graph
  • Report Tab: Training summary, top tokens, topic prevalence, representative documents
  • Downloads Tab: Export results as CSV and TXT
  • Guide Tab: Step-by-step interpretive instructions
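As an illustration of what a CSV export of topic words might look like, the sketch below serialises a topic-word mapping with the standard `csv` module. The column names and the input structure are assumptions; the files produced by the Downloads Tab may be laid out differently.

```python
import csv
import io


def topics_to_csv(topic_words):
    """Serialise {topic_id: [(word, weight), ...]} as CSV text,
    one row per topic-word pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["topic", "rank", "word", "weight"])
    for topic_id in sorted(topic_words):
        for rank, (word, weight) in enumerate(topic_words[topic_id], start=1):
            writer.writerow([topic_id, rank, word, f"{weight:.4f}"])
    return buffer.getvalue()
```

When the text is written to disk for use in spreadsheet software, saving it with the `utf-8-sig` encoding usually helps Bengali characters display correctly.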

Topic-Document Drilldown

  • Per-topic list of most representative documents
  • Context-aware sentence preview
  • Topic label and confidence indicator
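The drilldown can be thought of as a ranking over per-document topic distributions. The sketch below assumes each document is represented as a `{topic_id: probability}` dict and treats the probability of the dominant topic as the confidence indicator. Both are assumptions for illustration, and the function names are hypothetical.

```python
def representative_documents(doc_topics, filenames, top_n=3):
    """For each topic, return the top_n (filename, probability) pairs,
    highest probability first.

    doc_topics: list of {topic_id: probability} dicts, one per document.
    filenames:  list of document names in the same order.
    """
    by_topic = {}
    for name, dist in zip(filenames, doc_topics):
        for topic_id, prob in dist.items():
            by_topic.setdefault(topic_id, []).append((name, prob))

    return {
        topic_id: sorted(pairs, key=lambda p: (-p[1], p[0]))[:top_n]
        for topic_id, pairs in by_topic.items()
    }


def dominant_topic(dist):
    """Return (topic_id, probability) for the most probable topic of one document."""
    return max(dist.items(), key=lambda kv: kv[1])
```

Ties in probability are broken alphabetically by filename here so that the ordering is stable between runs.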

Design Principles

  • Mobile-friendly and responsive layout

Documentation

anvay includes a fully integrated documentation panel accessible from the interface itself. The documentation is designed not merely as technical reference, but as a pedagogical aid that walks users through each stage of the topic modelling process — from corpus preparation and parameter selection to result interpretation. It explains preprocessing choices (e.g. stopword filtering, n-gram selection, stemming) in clear language, and provides visual examples and tooltips to guide first-time users. The documentation also includes a walkthrough of a sample run, highlighting what users can expect from the model outputs. Importantly, the documentation assumes no prior knowledge of machine learning, making anvay accessible to students, scholars, and corpus curators working with Bengali texts.


Screenshots

  • Upload interface
  • Documentation interface
  • Visualisation interface
  • Report interface

Technical Stack

  • Backend: Python (Flask), Gensim, NLTK, NetworkX, Scikit-learn
  • Frontend: Bootstrap, jQuery, Plotly, Bokeh, Seaborn
  • Deployment: Designed to be hosted on a university or personal server (e.g., via Gunicorn)

Installation

anvay has been tested with Python 3.9-3.11 and Gensim 4.3.x.
Two installation methods are provided:

  • Standard installation (virtual environment)
  • Docker-based installation

Option 1: Standard installation (virtual environment)

This method installs anvay directly on your system using a Python virtual environment.

Prerequisites

  • Python 3.9-3.11
  • Git
  • pip (Python package installer)

Step-by-step instructions

Clone the repository:

```bash
git clone https://github.com/vinayakdasgupta/anvay.git
cd anvay
```

Create and activate a virtual environment (recommended):

```bash
python -m venv venv
source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Run the application:

```bash
python app.py
```

Access the web interface:

http://localhost:5000

You can now upload .txt files and begin exploring topics.


Option 2: Docker-based installation (recommended for reproducibility)

Prerequisites

  • Docker (Docker Desktop on Windows/macOS)

Step-by-step instructions

Clone the repository:

```bash
git clone https://github.com/vinayakdasgupta/anvay.git
cd anvay
```

Build the Docker image:

```bash
docker build -t anvay .
```

Run the container:

```bash
docker run -p 5000:5000 anvay
```

Access the web interface:

http://localhost:5000

You can now upload .txt files and begin exploring topics.


How to Cite anvay

If you use anvay in academic work, please cite it as follows:

Das Gupta, Vinayak. anvay: A Web-based Tool for Interpretive Topic Modelling in Bengali.
Zenodo. https://doi.org/10.5281/zenodo.18186215

anvay is also described in a paper published in the Journal of Open Source Software (2026); where a formal publication is preferred, cite that paper.


Referenced Datasets and Libraries

The following tools, datasets, and libraries are used in anvay and should be cited as appropriate:

Lemmatization Dataset

```bibtex
@inproceedings{chakrabarty-etal-2017-context,
  title     = "Context Sensitive Lemmatization Using Two Successive Bidirectional Gated Recurrent Networks",
  author    = "Chakrabarty, Abhisek and Pandit, Onkar Arun and Garain, Utpal",
  editor    = "Barzilay, Regina and Kan, Min-Yen",
  booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  month     = jul,
  year      = "2017",
  address   = "Vancouver, Canada",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/P17-1136/",
  doi       = "10.18653/v1/P17-1136",
  pages     = "1481--1491"
}

@article{alam2021review,
  title   = {A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models},
  author  = {Alam, Firoj and Hasan, Md Arid and Alam, Tanvir and Khan, Akib and Tajrin, Janntatul and Khan, Naira and Chowdhury, Shammur Absar},
  journal = {arXiv preprint arXiv:2107.03844},
  year    = {2021}
}
```

Gensim

```bibtex
@inproceedings{rehurek_lrec,
  author    = {Řehůřek, Radim and Sojka, Petr},
  title     = {Software Framework for Topic Modelling with Large Corpora},
  booktitle = {Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks},
  pages     = {45--50},
  year      = {2010},
  publisher = {ELRA},
  address   = {Valletta, Malta},
}
```

NLTK

```bibtex
@book{bird2009natural,
  author    = {Bird, Steven and Klein, Ewan and Loper, Edward},
  title     = {Natural Language Processing with Python},
  year      = {2009},
  publisher = {O'Reilly Media, Inc.}
}
```

NetworkX

```bibtex
@inproceedings{hagberg2008exploring,
  author    = {Hagberg, Aric A. and Schult, Daniel A. and Swart, Pieter J.},
  title     = {Exploring Network Structure, Dynamics, and Function using NetworkX},
  booktitle = {Proceedings of the 7th Python in Science Conference (SciPy2008)},
  pages     = {11--15},
  year      = {2008}
}
```

Scikit-learn

```bibtex
@article{pedregosa2011scikit,
  author  = {Pedregosa, Fabian et al.},
  title   = {Scikit-learn: Machine Learning in Python},
  journal = {Journal of Machine Learning Research},
  volume  = {12},
  pages   = {2825--2830},
  year    = {2011}
}
```

Plotly

Plotly Technologies Inc. (2015). Collaborative data science. Montreal, QC. https://plot.ly


Acknowledgements

anvay draws on multiple open-source projects:

  • Gensim – topic modelling
  • NLTK – stopword filtering
  • Plotly, Seaborn, Matplotlib – visualisation
  • NetworkX – topic-word graph
  • Scikit-learn – PCA and clustering
  • Flask – web application framework


License

anvay is released under the MIT License.


Contact

Vinayak Das Gupta
Shiv Nadar University
https://vinayakdasgupta.com

For questions, suggestions, or scholarly collaborations, please open an issue or contact via GitHub.

Owner

  • Login: vinayakdasgupta
  • Kind: user

JOSS Publication

anvay: A Web-based Tool for Interpretive Topic Modelling in Bengali
Published
February 17, 2026
Volume 11, Issue 118, Page 8641
Authors
Vinayak Das Gupta
Shiv Nadar Institution of Eminence
Editor
Abhishek Tiwari
Tags
topic modelling bengali language natural language processing digital humanities interpretability pedagogy

GitHub Events

Total
  • Release event: 1
  • Issues event: 2
  • Watch event: 4
  • Issue comment event: 2
  • Public event: 1
  • Push event: 52
  • Create event: 2
Last Year
  • Release event: 1
  • Issues event: 2
  • Watch event: 4
  • Issue comment event: 2
  • Push event: 39

Issues and Pull Requests

Last synced: about 1 month ago

All Time
  • Total issues: 4
  • Total pull requests: 0
  • Average time to close issues: about 1 month
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 3.75
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 0
  • Average time to close issues: about 1 month
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 3.75
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • x-tabdeveloping (4)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.txt pypi
  • MarkupSafe ==2.1.5
  • flask ==3.1.0
  • gensim ==4.3.3
  • itsdangerous ==2.2.0
  • jinja2 ==3.1.4
  • nltk ==3.9.1
  • numpy ==2.0.2
  • pyLDAvis ==3.4.1
  • six ==1.15.0
  • werkzeug ==3.1.3