tweetopic

Blazing fast topic modelling for short texts.

https://github.com/centre-for-humanities-computing/tweetopic

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: found
✓ .zenodo.json file: found
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (12.6%) to scientific vocabulary

Keywords

dirichlet-process-mixtures dmm gibbs-sampling gsdmm machine-learning mcmc nlp python scikit-learn topic-modeling tweet tweet-analysis visualization

Keywords from Contributors

spacy-extension transformers

Last synced: 6 months ago

Repository

Blazing fast topic modelling for short texts.

Basic Info

Host: GitHub
Owner: centre-for-humanities-computing
License: mit
Language: Python
Default Branch: main
Homepage: https://centre-for-humanities-computing.github.io/tweetopic/
Size: 2.2 MB

Statistics

Stars: 32
Watchers: 0
Forks: 4
Open Issues: 9
Releases: 1

Topics

dirichlet-process-mixtures dmm gibbs-sampling gsdmm machine-learning mcmc nlp python scikit-learn topic-modeling tweet tweet-analysis visualization

Created over 3 years ago · Last pushed 8 months ago

Metadata Files

Readme License Citation

README.md

tweetopic

:zap: Blazing Fast topic modelling over short texts in Python

Features

Fast :zap:
Scalable :collision:
High consistency and coherence :dart:
High quality topics :fire:
Easy visualization and inspection :eyes:
Full scikit-learn compatibility :nut_and_bolt:

New in version 0.4.0

You can now pass random_state to topic models to make your results reproducible.

```python
from tweetopic import DMM

model = DMM(10, random_state=42)
```
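The need for a seed comes from how this kind of model is fitted: the project's topic list names Gibbs sampling, which assigns documents to clusters through random draws. The block below is a toy, pure-Python sketch of a collapsed Gibbs sampler for a Dirichlet Multinomial Mixture, following the update rule in Yin & Wang (2014). It is not tweetopic's implementation; the function name `gsdmm` and its input format (documents as lists of integer word ids) are assumptions made for illustration. It only demonstrates that fixing the seed fixes the result.

```python
import math
import random


def gsdmm(docs, n_components, n_iterations=30, alpha=0.1, beta=0.1, random_state=None):
    """Toy collapsed Gibbs sampler (hypothetical helper, not tweetopic's API).

    docs: list of documents, each a non-empty list of integer word ids.
    Returns one cluster label per document.
    """
    rng = random.Random(random_state)
    vocab_size = max(w for doc in docs for w in doc) + 1
    doc_count = [0] * n_components   # documents currently in each cluster
    word_total = [0] * n_components  # total words currently in each cluster
    word_count = [[0] * vocab_size for _ in range(n_components)]

    def move(doc, z, sign):
        """Add (sign=+1) or remove (sign=-1) a document from cluster z."""
        doc_count[z] += sign
        word_total[z] += sign * len(doc)
        for w in doc:
            word_count[z][w] += sign

    # Random initial assignment of every document to a cluster.
    labels = [rng.randrange(n_components) for _ in docs]
    for doc, z in zip(docs, labels):
        move(doc, z, +1)

    for _ in range(n_iterations):
        for d, doc in enumerate(docs):
            move(doc, labels[d], -1)  # take the document out of its cluster
            log_p = []
            for z in range(n_components):
                # Unnormalized log-probability of cluster z for this document.
                lp = math.log(doc_count[z] + alpha)
                seen = {}
                for i, w in enumerate(doc):
                    j = seen.get(w, 0)  # earlier occurrences of w in this doc
                    lp += math.log(word_count[z][w] + beta + j)
                    lp -= math.log(word_total[z] + vocab_size * beta + i)
                    seen[w] = j + 1
                log_p.append(lp)
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            labels[d] = rng.choices(range(n_components), weights=weights)[0]
            move(doc, labels[d], +1)  # put it into the sampled cluster
    return labels


toy_docs = [[0, 1, 2], [0, 1, 1], [3, 4, 5], [4, 5, 5]]
first = gsdmm(toy_docs, n_components=3, random_state=42)
second = gsdmm(toy_docs, n_components=3, random_state=42)
print(first == second)  # True: same seed, same sequence of random draws
```

Without a seed, two runs on the same corpus may label documents differently, which is why a fixed `random_state` helps when results need to be compared or shared.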

Installation

Install from PyPI:

```bash
pip install tweetopic
```

Usage (documentation)

Train a topic model on a corpus of short texts:

```python
from tweetopic import DMM
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Creating a vectorizer for extracting document-term matrix from the
# text corpus.
vectorizer = CountVectorizer(min_df=15, max_df=0.1)

# Creating a Dirichlet Multinomial Mixture Model with 30 components
dmm = DMM(n_components=30, n_iterations=100, alpha=0.1, beta=0.1)

# Creating topic pipeline
pipeline = Pipeline([
    ("vectorizer", vectorizer),
    ("dmm", dmm),
])
```

You may fit the model with a stream of short texts:

```python
pipeline.fit(texts)
```
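For a quick textual check of a fitted model, the highest-weighted terms of each topic can be listed. The helper below is a minimal sketch: `top_words` is a hypothetical function, not part of tweetopic, and it assumes only a topic-term weight matrix with one row per topic plus a vocabulary in matching column order. It is shown on toy data so that it runs without a trained model.

```python
def top_words(components, vocabulary, n_words=10):
    """Return the n_words highest-weighted terms for each topic row."""
    topics = []
    for weights in components:
        # Column indices sorted from highest to lowest weight.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics.append([vocabulary[i] for i in ranked[:n_words]])
    return topics


# Toy topic-term matrix: 2 topics over a 4-word vocabulary.
toy_components = [
    [5, 0, 2, 1],
    [0, 4, 1, 3],
]
toy_vocabulary = ["cat", "stock", "dog", "market"]
print(top_words(toy_components, toy_vocabulary, n_words=2))
# [['cat', 'dog'], ['stock', 'market']]
```

With the fitted pipeline, the inputs would plausibly be `pipeline.named_steps["dmm"].components_` and `pipeline.named_steps["vectorizer"].get_feature_names_out()`. This assumes the model exposes a scikit-learn-style `components_` attribute, so check the tweetopic documentation before relying on it.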

To investigate the internal structure of topics and their relations to words and individual documents, we recommend using topicwizard.

Install it from PyPI:

```bash
pip install topic-wizard
```

Then visualize your topic model:

```python
import topicwizard

topicwizard.visualize(pipeline=pipeline, corpus=texts)
```

topicwizard visualization

References

Yin, J., & Wang, J. (2014). A Dirichlet Multinomial Mixture Model-Based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 233–242). Association for Computing Machinery.

Owner

Name: Center for Humanities Computing Aarhus
Login: centre-for-humanities-computing
Kind: organization
Email: chcaa@cas.au.dk
Location: Aarhus, Denmark

Website: https://chc.au.dk/
Repositories: 130
Profile: https://github.com/centre-for-humanities-computing

GitHub Events

Total

Watch event: 4
Issue comment event: 4
Push event: 3
Fork event: 1

Last Year

Watch event: 4
Issue comment event: 4
Push event: 3
Fork event: 1

Committers

Last synced: about 1 year ago

All Time

Total Commits: 115
Total Committers: 3
Avg Commits per committer: 38.333
Development Distribution Score (DDS): 0.148

Past Year

Commits: 9
Committers: 1
Avg Commits per committer: 9.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Márton Kardos	p**3@g**m	98
Kenneth Enevoldsen	k**n@g**m	15
pre-commit-ci[bot]	6****]	2

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 23
Total pull requests: 13
Average time to close issues: 14 days
Average time to close pull requests: about 2 hours
Total issue authors: 5
Total pull request authors: 4
Average comments per issue: 1.09
Average comments per pull request: 0.0
Merged pull requests: 9
Bot issues: 0
Bot pull requests: 4

Past Year

Issues: 4
Pull requests: 0
Average time to close issues: 1 day
Average time to close pull requests: N/A
Issue authors: 2
Pull request authors: 0
Average comments per issue: 2.5
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

x-tabdeveloping (6)
KennethEnevoldsen (5)
niemim (1)
rdkm89 (1)
Decade-rider (1)

Pull Request Authors

KennethEnevoldsen (4)
x-tabdeveloping (2)
dependabot[bot] (1)
pre-commit-ci[bot] (1)

Top Labels

Issue Labels

bug (2)

Pull Request Labels

dependencies (1)
github_actions (1)

Packages

Total packages: 1
Total downloads:
- pypi 93 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 11
Total maintainers: 1

pypi.org: tweetopic

Topic modelling over short texts

Documentation: https://tweetopic.readthedocs.io/
License: MIT
Latest release: 0.4.0
published over 1 year ago

Versions: 11
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 93 Last month

Rankings

Dependent packages count: 6.6%

Downloads: 14.5%

Average: 17.3%

Dependent repos count: 30.6%

Maintainers (1)

drkardosdrur

Last synced: 6 months ago

Dependencies

.github/workflows/static.yaml actions

actions/checkout v3 composite
actions/configure-pages v2 composite
actions/deploy-pages v1 composite
actions/upload-pages-artifact v1 composite

pyproject.toml pypi

deprecated >=1.2.0
joblib >=1.1.0
numba >=0.56.0
numpy >=1.19,<1.24.0
python >=3.8.0
scikit-learn >=1.1.1,<1.3.0
tqdm >=4.64.0