tweetopic
Blazing fast topic modelling for short texts.
https://github.com/centre-for-humanities-computing/tweetopic
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: centre-for-humanities-computing
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://centre-for-humanities-computing.github.io/tweetopic/
- Size: 2.2 MB
Statistics
- Stars: 32
- Watchers: 0
- Forks: 4
- Open Issues: 9
- Releases: 1
Metadata Files
README.md
tweetopic
:zap: Blazing Fast topic modelling over short texts in Python
Features
- Fast :zap:
- Scalable :collision:
- High consistency and coherence :dart:
- High quality topics :fire:
- Easy visualization and inspection :eyes:
- Full scikit-learn compatibility :nut_and_bolt:
New in version 0.4.0
You can now pass random_state to topic models to make your results reproducible.
```python
from tweetopic import DMM

model = DMM(10, random_state=42)
```
Installation
Install from PyPI:
```bash
pip install tweetopic
```
Usage (documentation)
Train a topic model on a corpus of short texts:
```python
from tweetopic import DMM
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Creating a vectorizer for extracting document-term matrix from the
# text corpus.
vectorizer = CountVectorizer(min_df=15, max_df=0.1)

# Creating a Dirichlet Multinomial Mixture Model with 30 components
dmm = DMM(n_components=30, n_iterations=100, alpha=0.1, beta=0.1)

# Creating topic pipeline
pipeline = Pipeline([
    ("vectorizer", vectorizer),
    ("dmm", dmm),
])
```
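The `min_df` and `max_df` arguments of the vectorizer control which terms enter the document-term matrix: integers are read as absolute document counts and floats as proportions of the corpus. The following is a minimal pure-Python sketch of that filtering rule, not scikit-learn's implementation. The helper name `filter_vocabulary` is hypothetical, and tokenization is simplified to lowercasing and whitespace splitting, which differs from `CountVectorizer`'s default token pattern.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Integers are treated as absolute document counts and floats as
    proportions of the corpus, mirroring CountVectorizer's convention.
    """
    n_docs = len(docs)
    # Count in how many documents each term occurs (not how often overall).
    doc_freq = Counter(term for doc in docs for term in set(doc.lower().split()))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)


docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the fish swam",
]
vocab = filter_vocabulary(docs, min_df=2, max_df=0.75)
print(vocab)  # ['sat']
```

Here "the" is dropped because it appears in every document (above the 75% ceiling), and the terms that occur in only one document fall below the floor of two. With `min_df=15, max_df=0.1` as in the pipeline above, a term must occur in at least 15 documents and in at most 10% of the corpus.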
You may fit the model with a stream of short texts:
```python
pipeline.fit(texts)
```
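To give a feel for what fitting a Dirichlet Multinomial Mixture involves, here is a rough pure-Python sketch of the collapsed Gibbs sampler described by Yin and Wang (2014), listed under References. It is not tweetopic's implementation, and the function name `gsdmm` is hypothetical. The parameter names are borrowed from the `DMM` constructor only for readability. The sketch takes pre-tokenized documents and returns one cluster label per document.

```python
import math
import random
from collections import Counter


def gsdmm(docs, n_components, n_iterations=30, alpha=0.1, beta=0.1, random_state=None):
    """Collapsed Gibbs sampling for a Dirichlet Multinomial Mixture (sketch).

    docs is a list of token lists; returns one cluster label per document.
    """
    rng = random.Random(random_state)
    vocab_size = len({w for doc in docs for w in doc})
    doc_counts = [Counter(doc) for doc in docs]
    # Sufficient statistics kept per cluster.
    cluster_docs = [0] * n_components    # number of documents in each cluster
    cluster_words = [0] * n_components   # number of tokens in each cluster
    cluster_word_counts = [Counter() for _ in range(n_components)]

    def move(d, z, sign):
        """Add (sign=+1) or remove (sign=-1) document d from cluster z."""
        cluster_docs[z] += sign
        cluster_words[z] += sign * len(docs[d])
        for w, c in doc_counts[d].items():
            cluster_word_counts[z][w] += sign * c

    # Start from a random assignment of documents to clusters.
    labels = [rng.randrange(n_components) for _ in docs]
    for d, z in enumerate(labels):
        move(d, z, +1)

    for _ in range(n_iterations):
        for d in range(len(docs)):
            move(d, labels[d], -1)
            # Log of the unnormalised conditional for each candidate cluster;
            # factors that do not depend on the cluster are omitted.
            log_p = []
            for z in range(n_components):
                lp = math.log(cluster_docs[z] + alpha)
                for w, c in doc_counts[d].items():
                    for j in range(c):
                        lp += math.log(cluster_word_counts[z][w] + beta + j)
                for i in range(len(docs[d])):
                    lp -= math.log(cluster_words[z] + vocab_size * beta + i)
                log_p.append(lp)
            # Shift by the maximum for numerical stability, then sample.
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            labels[d] = rng.choices(range(n_components), weights=weights)[0]
            move(d, labels[d], +1)
    return labels


docs = [
    "cat dog pet".split(),
    "dog pet vet".split(),
    "stock market trade".split(),
    "market trade bank".split(),
]
labels = gsdmm(docs, n_components=3, random_state=42)
print(labels)
```

Because all randomness flows through a single seeded generator, calling the function twice with the same `random_state` yields the same labels, which is the kind of reproducibility the `random_state` argument is meant to provide.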
To investigate the internal structure of topics and their relations to words and individual documents, we recommend using topicwizard.
Install it from PyPI:
```bash
pip install topic-wizard
```
Then visualize your topic model:
```python
import topicwizard

topicwizard.visualize(pipeline=pipeline, corpus=texts)
```
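For a quick textual check without a visualization tool, the highest-weighted terms of each topic can be listed directly. The sketch below uses a hypothetical helper, `top_words`, and toy numbers in place of a fitted model. It assumes only a topic-term weight matrix with one row per topic and a vocabulary aligned with its columns.

```python
def top_words(components, vocabulary, n=3):
    """Return the n highest-weighted terms for every topic.

    components: one row of term weights per topic.
    vocabulary: term names, aligned with the columns of components.
    """
    result = []
    for row in components:
        # Rank column indices by descending weight within this topic.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n]])
    return result


# Toy topic-term weights standing in for a fitted model's output.
vocabulary = ["bank", "cat", "dog", "market", "vet"]
components = [
    [0.1, 4.0, 5.0, 0.2, 3.0],
    [6.0, 0.1, 0.3, 7.0, 0.2],
]
for topic, words in enumerate(top_words(components, vocabulary)):
    print(topic, words)
# 0 ['dog', 'cat', 'vet']
# 1 ['market', 'bank', 'dog']
```

With the pipeline above, the vocabulary would come from `pipeline.named_steps["vectorizer"].get_feature_names_out()`. The weight matrix is assumed to be available as a `components_` attribute on the fitted model, as is conventional for scikit-learn topic models; check the tweetopic documentation before relying on that name.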

References
- Yin, J., & Wang, J. (2014). A Dirichlet Multinomial Mixture Model-Based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 233–242). Association for Computing Machinery.
Owner
- Name: Center for Humanities Computing Aarhus
- Login: centre-for-humanities-computing
- Kind: organization
- Email: chcaa@cas.au.dk
- Location: Aarhus, Denmark
- Website: https://chc.au.dk/
- Repositories: 130
- Profile: https://github.com/centre-for-humanities-computing
GitHub Events
Total
- Watch event: 4
- Issue comment event: 4
- Push event: 3
- Fork event: 1
Last Year
- Watch event: 4
- Issue comment event: 4
- Push event: 3
- Fork event: 1
Committers
Last synced: about 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Márton Kardos | p****3@g****m | 98 |
| Kenneth Enevoldsen | k****n@g****m | 15 |
| pre-commit-ci[bot] | 6****] | 2 |
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 23
- Total pull requests: 13
- Average time to close issues: 14 days
- Average time to close pull requests: about 2 hours
- Total issue authors: 5
- Total pull request authors: 4
- Average comments per issue: 1.09
- Average comments per pull request: 0.0
- Merged pull requests: 9
- Bot issues: 0
- Bot pull requests: 4
Past Year
- Issues: 4
- Pull requests: 0
- Average time to close issues: 1 day
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 0
- Average comments per issue: 2.5
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- x-tabdeveloping (6)
- KennethEnevoldsen (5)
- niemim (1)
- rdkm89 (1)
- Decade-rider (1)
Pull Request Authors
- KennethEnevoldsen (4)
- x-tabdeveloping (2)
- dependabot[bot] (1)
- pre-commit-ci[bot] (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 93 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 11
- Total maintainers: 1
pypi.org: tweetopic
Topic modelling over short texts
- Documentation: https://tweetopic.readthedocs.io/
- License: MIT
- Latest release: 0.4.0 (published over 1 year ago)
Rankings
Maintainers (1)
Dependencies
- actions/checkout v3 composite
- actions/configure-pages v2 composite
- actions/deploy-pages v1 composite
- actions/upload-pages-artifact v1 composite
- deprecated >=1.2.0
- joblib >=1.1.0
- numba >=0.56.0
- numpy >=1.19,<1.24.0
- python >=3.8.0
- scikit-learn >=1.1.1,<1.3.0
- tqdm >=4.64.0