Turftopic

Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers - Published in JOSS (2025)

https://github.com/x-tabdeveloping/turftopic

Science Score: 98.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: arxiv.org, joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

contextual llm topic-modeling transformers

Scientific Fields

Mathematics Computer Science - 63% confidence
Last synced: 4 months ago

Repository

Robust and fast topic models with sentence-transformers.

Basic Info
Statistics
  • Stars: 80
  • Watchers: 3
  • Forks: 8
  • Open Issues: 11
  • Releases: 7
Topics
contextual llm topic-modeling transformers
Created about 2 years ago · Last pushed 6 months ago
Metadata Files
Readme Contributing License Code of conduct Citation

README.md


Topic modeling is your turf too.
Contextual topic models with representations from transformers.

Features

| | |
| - | - |
| SOTA Transformer-based Topic Models | :compass: S³, :key: KeyNMF, :gem: GMM, Clustering Models (BERTopic and Top2Vec), Autoencoding models (ZeroShotTM and CombinedTM), FASTopic |
| Models for all Scenarios | :chart_with_upwards_trend: Dynamic, :ocean: Online, :herb: Seeded, :evergreen_tree: Hierarchical, and :camera: Multimodal topic modeling |
| Easy Interpretation | :bookmark_tabs: Pretty Printing, :bar_chart: Interactive Figures, :art: topicwizard compatible |
| Topic Naming | :robot: LLM-based, N-gram Retrieval, :wave: Manual |
| Informative Topic Descriptions | :key: Keyphrases, Noun-phrases, Lemmatization, Stemming |

Basics

For more details on a particular topic, you can consult our documentation page:

| | | |
| - | - | - |
| :house: Build and Train Topic Models | :art: Explore, Interpret and Visualize your Models | :wrench: Modify and Fine-tune Topic Models |
| :pushpin: Choose the Right Model for your Use-Case | :chart_with_upwards_trend: Explore Topics Changing over Time | :newspaper: Use Phrases or Lemmas for Topic Models |
| :ocean: Extract Topics from a Stream of Documents | :evergreen_tree: Find Hierarchical Order in Topics | :whale: Name Topics with Large Language Models |

Installation

Turftopic can be installed from PyPI.

```bash
pip install turftopic
```

If you intend to use CTMs, make sure to install the package with Pyro as an optional dependency.

```bash
pip install "turftopic[pyro-ppl]"
```

If you want to use clustering models like BERTopic or Top2Vec, install:

```bash
pip install "turftopic[umap-learn]"
```
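Before importing a model family that relies on an optional dependency, it can help to check whether the corresponding extra is installed. The sketch below uses only the standard library. The mapping from extras to import names (`pyro` for `pyro-ppl`, `umap` for `umap-learn`) is an assumption based on the usual import names of those packages, and `missing_extras` is a hypothetical helper, not part of Turftopic.

```python
from importlib.util import find_spec

# Assumed mapping from the extras named above to the top-level module
# each one provides; adjust it if your environment differs.
EXTRA_MODULES = {
    "pyro-ppl": "pyro",
    "umap-learn": "umap",
}


def missing_extras(modules: dict[str, str] = EXTRA_MODULES) -> list[str]:
    """Return the extras whose module cannot be found (hypothetical helper)."""
    return [extra for extra, module in modules.items() if find_spec(module) is None]


# Print an install hint for every extra that is not available.
for extra in missing_extras():
    print(f'Missing optional dependency: pip install "turftopic[{extra}]"')
```

Checking with `find_spec` avoids actually importing heavy packages just to see whether they are present.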

Fitting a Model

Turftopic's models follow the scikit-learn API conventions, and as such they are quite easy to use if you are familiar with scikit-learn workflows.

Here's an example of how to use KeyNMF, one of our models, on the 20 Newsgroups dataset from scikit-learn.

If you are using a Mac, you might have to install the required SSL certificates on your system in order to be able to download the dataset.

```python
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
)
corpus: list[str] = newsgroups.data
print(len(corpus))  # 18846
```

Turftopic also comes with interpretation tools that make it easy to display and understand your results.

```python
from turftopic import KeyNMF

model = KeyNMF(20)
documenttopicmatrix = model.fit_transform(corpus)
```

Interpreting Models

Turftopic comes with a number of pretty printing utilities for interpreting the models.

To see the most important words for each topic, use the print_topics() method.

```python
model.print_topics()
```

| Topic ID | Top 10 Words |
| -------- | ------------ |
| 0 | armenians, armenian, armenia, turks, turkish, genocide, azerbaijan, soviet, turkey, azerbaijani |
| 1 | sale, price, shipping, offer, sell, prices, interested, 00, games, selling |
| 2 | christians, christian, bible, christianity, church, god, scripture, faith, jesus, sin |
| 3 | encryption, chip, clipper, nsa, security, secure, privacy, encrypted, crypto, cryptography |
| | .... |
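Conceptually, a table like this can be derived from a topic-by-term importance matrix by sorting each row and keeping the best-scoring terms. The following is a minimal sketch with made-up scores and a toy vocabulary. It is not Turftopic's implementation, and `top_words` is a hypothetical helper.

```python
def top_words(importances, vocab, k=3):
    """Return the k highest-scoring terms per topic, best first."""
    result = []
    for row in importances:
        # Rank vocabulary indices by their importance in this topic.
        ranked = sorted(range(len(vocab)), key=lambda j: row[j], reverse=True)
        result.append([vocab[j] for j in ranked[:k]])
    return result


# Toy data: two topics over a six-term vocabulary (values are invented).
vocab = ["bible", "chip", "church", "encryption", "nsa", "god"]
importances = [
    [0.9, 0.0, 0.7, 0.0, 0.1, 0.8],
    [0.0, 0.8, 0.0, 0.9, 0.6, 0.0],
]

for topic_id, words in enumerate(top_words(importances, vocab)):
    print(topic_id, ", ".join(words))
```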

```python
# Print highest ranking documents for topic 0
model.print_representative_documents(0, corpus, documenttopicmatrix)
```

| Document | Score |
| -------- | ----- |
| Poor 'Poly'. I see you're preparing the groundwork for yet another retreat from your... | 0.40 |
| Then you must be living in an alternate universe. Where were they? An Appeal to Mankind During the... | 0.40 |
| It is 'Serdar', 'kocaoglan'. Just love it. Well, it could be your head wasn't screwed on just right... | 0.39 |

```python
model.print_topic_distribution(
    "I think guns should definitely banned from all public institutions, such as schools."
)
```

| Topic name | Score |
| ---------- | ----- |
| 7_gun_guns_firearms_weapons | 0.05 |
| 17_mail_address_email_send | 0.00 |
| 3_encryption_chip_clipper_nsa | 0.00 |
| 19_baseball_pitching_pitcher_hitter | 0.00 |
| 11_graphics_software_program_3d | 0.00 |
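The topic names in this table appear to combine the topic ID with the first four top words: topic 3, whose top words start with encryption, chip, clipper, nsa, shows up as `3_encryption_chip_clipper_nsa`. The sketch below reproduces that pattern, assuming this is the naming convention. `default_topic_name` is a hypothetical helper, not a Turftopic function.

```python
def default_topic_name(topic_id: int, top_terms: list[str], n_words: int = 4) -> str:
    """Join the topic ID and its first n_words top terms with underscores."""
    return "_".join([str(topic_id), *top_terms[:n_words]])


# Top words for topic 3, taken from the print_topics() table above.
terms = ["encryption", "chip", "clipper", "nsa", "security", "secure"]
name = default_topic_name(3, terms)
print(name)  # 3_encryption_chip_clipper_nsa
```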

Automated Topic Naming

Turftopic now allows you to automatically assign human readable names to topics using LLMs or n-gram retrieval!

You will need to pip install "turftopic[openai]" for this to work.

```python
from turftopic import KeyNMF
from turftopic.namers import OpenAITopicNamer

model = KeyNMF(10).fit(corpus)

namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)
model.print_topics()
```

| Topic ID | Topic Name | Highest Ranking |
| - | - | - |
| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
| | ... | |

Vectorizers Module

You can use a set of custom vectorizers for topic modeling over phrases, as well as lemmata and stems.

You will need to pip install "turftopic[spacy]" for this to work.

```python
from turftopic import BERTopic
from turftopic.vectorizers.spacy import NounPhraseCountVectorizer

model = BERTopic(
    n_components=10,
    vectorizer=NounPhraseCountVectorizer("en_core_web_sm"),
)
model.fit(corpus)
model.print_topics()
```

| Topic ID | Highest Ranking |
| - | - |
| | ... |
| 3 | fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism |
| 4 | religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index |
| | ... |

Visualization

Turftopic comes with a number of visualization and pretty printing utilities for specific models and specific contexts, such as hierarchical or dynamic topic modelling. You will find an overview of these in the Interpreting and Visualizing Models section of our documentation.

```bash
pip install "turftopic[datamapplot, openai]"
```

```python
from turftopic import ClusteringTopicModel
from turftopic.namers import OpenAITopicNamer

model = ClusteringTopicModel(feature_importance="centroid").fit(corpus)

namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)

fig = model.plot_clusters_datamapplot()
fig.show()
```

In addition, Turftopic is natively supported in topicwizard, an interactive topic model visualization library that is compatible with all models from Turftopic.

```bash
pip install "turftopic[topic-wizard]"
```

By far the easiest way to visualize your models for interpretation is to launch the topicwizard web app.

```python
import topicwizard

topicwizard.visualize(corpus, model=model)
```

Screenshot of the topicwizard Web Application

Alternatively you can use the Figures API in topicwizard for individual HTML figures.

References

  • Kardos, M., Kostkan, J., Vermillet, A., Nielbo, K., Enevoldsen, K., & Rocca, R. (2024, June 13). $S^3$ - Semantic Signal separation. arXiv.org. https://arxiv.org/abs/2406.09556
  • Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., & Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. ArXiv Preprint ArXiv:2405.17978.
  • Grootendorst, M. (2022, March 11). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv.org. https://arxiv.org/abs/2203.05794
  • Angelov, D. (2020, August 19). Top2VEC: Distributed representations of topics. arXiv.org. https://arxiv.org/abs/2008.09470
  • Bianchi, F., Terragni, S., & Hovy, D. (2020, April 8). Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. arXiv.org. https://arxiv.org/abs/2004.03974
  • Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 1676–1683). Association for Computational Linguistics.
  • Kristensen-McLachlan, R. D., Hicke, R. M. M., Kardos, M., & Thunø, M. (2024, October 16). Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media. arXiv.org. https://arxiv.org/abs/2410.12791

Owner

  • Name: Márton Kardos
  • Login: x-tabdeveloping
  • Kind: user
  • Location: Aarhus, Denmark
  • Company: Center for Humanities Computing

JOSS Publication

Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers
Published
July 03, 2025
Volume 10, Issue 111, Page 8183
Authors
Márton Kardos ORCID
Center for Humanities Computing, Aarhus University, Denmark
Kenneth C. Enevoldsen ORCID
Center for Humanities Computing, Aarhus University, Denmark
Jan Kostkan ORCID
Center for Humanities Computing, Aarhus University, Denmark
Ross Deans Kristensen-McLachlan ORCID
Center for Humanities Computing, Aarhus University, Denmark, Department of Linguistics, Cognitive Science, and Semiotics, Aarhus University, Denmark
Roberta Rocca ORCID
Interacting Minds Center, Aarhus University, Denmark
Editor
Abhishek Tiwari ORCID
Tags
topic modelling sentence-transformers embeddings

Citation (citation.cff)

cff-version: "1.2.0"
authors:
- family-names: Kardos
  given-names: Márton
  orcid: "https://orcid.org/0000-0001-9652-4498"
- family-names: Enevoldsen
  given-names: Kenneth C.
  orcid: "https://orcid.org/0000-0001-8733-0966"
- family-names: Kostkan
  given-names: Jan
  orcid: "https://orcid.org/0000-0002-9707-7121"
- family-names: Kristensen-McLachlan
  given-names: Ross Deans
  orcid: "https://orcid.org/0000-0001-8714-1911"
- family-names: Rocca
  given-names: Roberta
  orcid: "https://orcid.org/0000-0001-9017-8088"
doi: 10.5281/zenodo.15688293
message: If you use this software, please cite our article in the
  Journal of Open Source Software.
preferred-citation:
  authors:
  - family-names: Kardos
    given-names: Márton
    orcid: "https://orcid.org/0000-0001-9652-4498"
  - family-names: Enevoldsen
    given-names: Kenneth C.
    orcid: "https://orcid.org/0000-0001-8733-0966"
  - family-names: Kostkan
    given-names: Jan
    orcid: "https://orcid.org/0000-0002-9707-7121"
  - family-names: Kristensen-McLachlan
    given-names: Ross Deans
    orcid: "https://orcid.org/0000-0001-8714-1911"
  - family-names: Rocca
    given-names: Roberta
    orcid: "https://orcid.org/0000-0001-9017-8088"
  date-published: 2025-07-03
  doi: 10.21105/joss.08183
  issn: 2475-9066
  issue: 111
  journal: Journal of Open Source Software
  publisher:
    name: Open Journals
  start: 8183
  title: "Turftopic: Topic Modelling with Contextual Representations
    from Sentence Transformers"
  type: article
  url: "https://joss.theoj.org/papers/10.21105/joss.08183"
  volume: 10
title: "Turftopic: Topic Modelling with Contextual Representations from
  Sentence Transformers"

GitHub Events

Total
  • Create event: 21
  • Release event: 3
  • Issues event: 20
  • Watch event: 49
  • Delete event: 1
  • Member event: 1
  • Issue comment event: 37
  • Push event: 155
  • Pull request review event: 88
  • Pull request review comment event: 138
  • Pull request event: 47
  • Fork event: 3
Last Year
  • Create event: 21
  • Release event: 3
  • Issues event: 20
  • Watch event: 49
  • Delete event: 1
  • Member event: 1
  • Issue comment event: 37
  • Push event: 155
  • Pull request review event: 88
  • Pull request review comment event: 138
  • Pull request event: 47
  • Fork event: 3

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 478
  • Total Committers: 5
  • Avg Commits per committer: 95.6
  • Development Distribution Score (DDS): 0.056
Past Year
  • Commits: 331
  • Committers: 2
  • Avg Commits per committer: 165.5
  • Development Distribution Score (DDS): 0.003
Top Committers
Name Email Commits
Márton Kardos p****3@g****m 451
rbroc r****c@g****m 18
supplyandcommand 4****d 7
jankounchained 4****d 1
Richard Bellamy r****y@p****m 1
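
The all-time summary figures can be reproduced from the per-committer counts in this table. The sketch below assumes the Development Distribution Score (DDS) is defined as one minus the top committer's share of all commits. That definition is an assumption, though it matches the reported all-time value of 0.056.

```python
# Commit counts copied from the Top Committers table.
commits = {
    "Márton Kardos": 451,
    "rbroc": 18,
    "supplyandcommand": 7,
    "jankounchained": 1,
    "Richard Bellamy": 1,
}


def average_commits(counts: dict[str, int]) -> float:
    """Mean number of commits per committer."""
    return sum(counts.values()) / len(counts)


def dds(counts: dict[str, int]) -> float:
    """Assumed DDS: 1 - (commits by top committer / total commits)."""
    return 1 - max(counts.values()) / sum(counts.values())


print(sum(commits.values()))               # 478
print(round(average_commits(commits), 1))  # 95.6
print(round(dds(commits), 3))              # 0.056
```

A DDS close to zero indicates that nearly all commits come from a single contributor.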
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 41
  • Total pull requests: 76
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 4 days
  • Total issue authors: 11
  • Total pull request authors: 7
  • Average comments per issue: 1.56
  • Average comments per pull request: 0.55
  • Merged pull requests: 67
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 14
  • Pull requests: 40
  • Average time to close issues: 7 days
  • Average time to close pull requests: 4 days
  • Issue authors: 8
  • Pull request authors: 4
  • Average comments per issue: 1.21
  • Average comments per pull request: 0.53
  • Merged pull requests: 33
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • x-tabdeveloping (18)
  • jankounchained (8)
  • rbroc (4)
  • PetrKorab (3)
  • DobromirM (2)
  • Jemoka (2)
  • mjaniec2013 (1)
  • mattguida (1)
  • ahgraber (1)
  • miscodisco (1)
  • awlassche (1)
Pull Request Authors
  • x-tabdeveloping (90)
  • rbroc (10)
  • jankounchained (8)
  • rbellamy (2)
  • mhmaguire (1)
  • PetrKorab (1)
  • abhishektiwari (1)
Top Labels
Issue Labels
enhancement (6) bug (3) good first issue (2) documentation (2) question (1) not planned (1)
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 2,082 last-month
  • Total dependent packages: 1
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 53
  • Total maintainers: 1
proxy.golang.org: github.com/x-tabdeveloping/turftopic
  • Versions: 6
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 6.5%
Average: 6.7%
Dependent repos count: 7.0%
Last synced: 4 months ago
pypi.org: turftopic

Topic modeling with contextual representations from sentence transformers.

  • Versions: 47
  • Dependent Packages: 1
  • Dependent Repositories: 0
  • Downloads: 2,082 Last month
Rankings
Dependent packages count: 10.1%
Average: 38.6%
Dependent repos count: 67.1%
Maintainers (1)
Last synced: 4 months ago

Dependencies

pyproject.toml pypi
  • numpy ^1.23.0
  • pyro-ppl ^1.8.0
  • python ^3.9
  • scikit-learn ^1.2.0
  • scipy ^1.10.0
  • sentence-transformers ^2.2.0
  • torch ^2.1.0