tweetopic

Blazing fast topic modelling for short texts.

https://github.com/centre-for-humanities-computing/tweetopic

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.6%) to scientific vocabulary

Keywords

dirichlet-process-mixtures dmm gibbs-sampling gsdmm machine-learning mcmc nlp python scikit-learn topic-modeling tweet tweet-analysis visualization

Keywords from Contributors

spacy-extension transformers
Last synced: 6 months ago

Repository


Basic Info
Statistics
  • Stars: 32
  • Watchers: 0
  • Forks: 4
  • Open Issues: 9
  • Releases: 1
Created over 3 years ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

tweetopic

:zap: Blazing Fast topic modelling over short texts in Python


Features

  • Fast :zap:
  • Scalable :collision:
  • High consistency and coherence :dart:
  • High quality topics :fire:
  • Easy visualization and inspection :eyes:
  • Full scikit-learn compatibility :nut_and_bolt:

New in version 0.4.0

You can now pass random_state to topic models to make your results reproducible.

```python
from tweetopic import DMM

model = DMM(10, random_state=42)
```

Installation

Install from PyPI:

```bash
pip install tweetopic
```

Usage (documentation)

Train a topic model on a corpus of short texts:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

from tweetopic import DMM

# Creating a vectorizer for extracting the document-term matrix from the
# text corpus.
vectorizer = CountVectorizer(min_df=15, max_df=0.1)

# Creating a Dirichlet Multinomial Mixture Model with 30 components
dmm = DMM(n_components=30, n_iterations=100, alpha=0.1, beta=0.1)

# Creating topic pipeline
pipeline = Pipeline([
    ("vectorizer", vectorizer),
    ("dmm", dmm),
])
```
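For intuition about what `alpha` and `beta` control, the following is a minimal pure-Python sketch of the collapsed Gibbs sampler for Dirichlet Multinomial Mixtures as described by Yin and Wang (2014). It is not tweetopic's implementation. The function names `cluster_log_weights` and `gsdmm_sketch`, the `seed` argument, and the use of pre-tokenised documents are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def cluster_log_weights(doc, doc_counts, word_counts, sizes,
                        n_docs, vocab_size, alpha, beta):
    """Unnormalised log-probability of each cluster for a held-out document."""
    n_clusters = len(doc_counts)
    log_weights = []
    for z in range(n_clusters):
        # Popularity term: clusters that already hold many documents attract more.
        logp = math.log(doc_counts[z] + alpha)
        logp -= math.log(n_docs - 1 + n_clusters * alpha)
        # Similarity term: clusters sharing words with the document attract more.
        for word, freq in Counter(doc).items():
            for j in range(freq):
                logp += math.log(word_counts[z][word] + beta + j)
        for i in range(len(doc)):
            logp -= math.log(sizes[z] + vocab_size * beta + i)
        log_weights.append(logp)
    return log_weights


def gsdmm_sketch(docs, n_components, n_iterations, alpha=0.1, beta=0.1, seed=0):
    """Cluster tokenised documents; returns one cluster label per document."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_counts = [0] * n_components  # documents per cluster
    sizes = [0] * n_components       # tokens per cluster
    word_counts = [Counter() for _ in range(n_components)]

    # Start from a random assignment of documents to clusters.
    labels = [rng.randrange(n_components) for _ in docs]
    for doc, z in zip(docs, labels):
        doc_counts[z] += 1
        sizes[z] += len(doc)
        word_counts[z].update(doc)

    for _ in range(n_iterations):
        for d, doc in enumerate(docs):
            # Remove the document from its current cluster before resampling.
            z = labels[d]
            doc_counts[z] -= 1
            sizes[z] -= len(doc)
            word_counts[z].subtract(doc)

            log_w = cluster_log_weights(doc, doc_counts, word_counts, sizes,
                                        len(docs), vocab_size, alpha, beta)
            # Subtract the maximum before exponentiating for numerical stability.
            peak = max(log_w)
            weights = [math.exp(lw - peak) for lw in log_w]
            z = rng.choices(range(n_components), weights=weights)[0]

            # Add the document to the newly sampled cluster.
            labels[d] = z
            doc_counts[z] += 1
            sizes[z] += len(doc)
            word_counts[z].update(doc)
    return labels
```

In this sketch a larger `alpha` makes empty or small clusters more attractive, while a larger `beta` makes a document more willing to join a cluster that shares few of its words. Because the sampler is random, fixing the seed is what makes repeated runs return the same labels.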

You may fit the model with a stream of short texts:

```python
pipeline.fit(texts)
```
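A stream here can be any iterable of strings, since scikit-learn's `CountVectorizer` accepts an iterable of raw documents. As a minimal sketch, assuming the corpus is stored as one text per line in a UTF-8 file, a generator can yield the texts lazily. The helper name `stream_texts` and the file layout are hypothetical.

```python
from pathlib import Path
from typing import Iterator


def stream_texts(path: str) -> Iterator[str]:
    """Yield one non-empty, stripped text per line of a UTF-8 file."""
    with Path(path).open(encoding="utf-8") as lines:
        for line in lines:
            text = line.strip()
            if text:
                yield text
```

With a hypothetical file `tweets.txt`, the call would be `pipeline.fit(stream_texts("tweets.txt"))`. A generator can only be consumed once, so create a new one (or materialise a list) if the corpus is needed again later.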

To investigate the internal structure of topics and their relations to words and individual documents, we recommend using topicwizard.

Install it from PyPI:

```bash
pip install topic-wizard
```

Then visualize your topic model:

```python
import topicwizard

topicwizard.visualize(pipeline=pipeline, corpus=texts)
```


References

  • Yin, J., & Wang, J. (2014). A Dirichlet Multinomial Mixture Model-Based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 233–242). Association for Computing Machinery.

Owner

  • Name: Center for Humanities Computing Aarhus
  • Login: centre-for-humanities-computing
  • Kind: organization
  • Email: chcaa@cas.au.dk
  • Location: Aarhus, Denmark

GitHub Events

Total
  • Watch event: 4
  • Issue comment event: 4
  • Push event: 3
  • Fork event: 1
Last Year
  • Watch event: 4
  • Issue comment event: 4
  • Push event: 3
  • Fork event: 1

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 115
  • Total Committers: 3
  • Avg Commits per committer: 38.333
  • Development Distribution Score (DDS): 0.148
Past Year
  • Commits: 9
  • Committers: 1
  • Avg Commits per committer: 9.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Márton Kardos p****3@g****m 98
Kenneth Enevoldsen k****n@g****m 15
pre-commit-ci[bot] 6****] 2

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 23
  • Total pull requests: 13
  • Average time to close issues: 14 days
  • Average time to close pull requests: about 2 hours
  • Total issue authors: 5
  • Total pull request authors: 4
  • Average comments per issue: 1.09
  • Average comments per pull request: 0.0
  • Merged pull requests: 9
  • Bot issues: 0
  • Bot pull requests: 4
Past Year
  • Issues: 4
  • Pull requests: 0
  • Average time to close issues: 1 day
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 2.5
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • x-tabdeveloping (6)
  • KennethEnevoldsen (5)
  • niemim (1)
  • rdkm89 (1)
  • Decade-rider (1)
Pull Request Authors
  • KennethEnevoldsen (4)
  • x-tabdeveloping (2)
  • dependabot[bot] (1)
  • pre-commit-ci[bot] (1)
Top Labels
Issue Labels
bug (2)
Pull Request Labels
dependencies (1) github_actions (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 93 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 11
  • Total maintainers: 1
pypi.org: tweetopic

Topic modelling over short texts

  • Versions: 11
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 93 Last month
Rankings
Dependent packages count: 6.6%
Downloads: 14.5%
Average: 17.3%
Dependent repos count: 30.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/static.yaml actions
  • actions/checkout v3 composite
  • actions/configure-pages v2 composite
  • actions/deploy-pages v1 composite
  • actions/upload-pages-artifact v1 composite
pyproject.toml pypi
  • deprecated >=1.2.0
  • joblib >=1.1.0
  • numba >=0.56.0
  • numpy >=1.19,<1.24.0
  • python >=3.8.0
  • scikit-learn >=1.1.1,<1.3.0
  • tqdm >=4.64.0