ldaPrototype

ldaPrototype: A method in R to get a Prototype of multiple Latent Dirichlet Allocations - Published in JOSS (2020)

https://github.com/jonasrieger/ldaprototype

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 15 DOI reference(s) in README and JOSS metadata
✓ Academic publication links: Links to joss.theoj.org, zenodo.org
✓ Committers with academic emails: 1 of 1 committers (100.0%) from academic institutions
○ Institutional organization owner
✓ JOSS paper metadata: Published in Journal of Open Source Software

Keywords

latent-dirichlet-allocation lda model-selection modelselection reliability text-mining textdata topic-model topic-models topic-similarities topicmodeling topicmodelling

Scientific Fields

Engineering Computer Science - 60% confidence

Last synced: 6 months ago

Repository

Determine a Prototype from a number of runs of Latent Dirichlet Allocation.

Basic Info

Host: GitHub
Owner: JonasRieger
License: gpl-3.0
Language: R
Default Branch: master
Homepage:
Size: 799 KB

Statistics

Stars: 8
Watchers: 2
Forks: 1
Open Issues: 3
Releases: 5

Topics

latent-dirichlet-allocation lda model-selection modelselection reliability text-mining textdata topic-model topic-models topic-similarities topicmodeling topicmodelling

Created almost 7 years ago · Last pushed about 3 years ago

Metadata Files

Readme License Code of conduct

ldaPrototype

Prototype of Multiple Latent Dirichlet Allocation Runs

Determine a prototype from a number of runs of Latent Dirichlet Allocation (LDA) by measuring their similarities with S-CLOP (Similarity of multiple sets by Clustering with Local Pruning): the procedure selects the LDA run with the highest mean pairwise S-CLOP similarity to all other runs. Each LDA run is specified by its assignments, which lead to estimators for the distribution parameters. Repeated runs lead to different results, which we counter by choosing the most representative LDA run as the prototype.

Citation

Please cite the JOSS paper using the BibTeX entry
```
@article{<placeholder>,
  title = {{ldaPrototype}: A method in {R} to get a Prototype of multiple Latent Dirichlet Allocations},
  author = {Jonas Rieger},
  journal = {Journal of Open Source Software},
  year = {2020},
  volume = {5},
  number = {51},
  pages = {2181},
  doi = {10.21105/joss.02181},
  url = {https://doi.org/10.21105/joss.02181}
}
```
which is also obtained by the call citation("ldaPrototype").

References (related to the methodology)

Rieger, J., Jentsch, C. & Rahnenführer, J.: LDAPrototype: A Model Selection Algorithm to Improve Reliability of Latent Dirichlet Allocation. preprint
Rieger, J. (2020). ldaPrototype: A method in R to get a Prototype of multiple Latent Dirichlet Allocations. Journal of Open Source Software, 5(51), 2181.
Rieger, J., Rahnenführer, J. & Jentsch, C. (2020). Improving Latent Dirichlet Allocation: On Reliability of the Novel Method LDAPrototype. Natural Language Processing and Information Systems, NLDB 2020. LNCS 12089, pp. 118-125.

Please also have a look at this short overview of topic modeling in R: Wiedemann, G. (2022). The World of Topic Modeling in R. M&K Medien & Kommunikationswissenschaft, 70(3), pp. 286-291.

Related Software

tm is useful for preprocessing text data.
lda offers a fast implementation of the Latent Dirichlet Allocation and is used by ldaPrototype.
quanteda is a framework for "Quantitative Analysis of Textual Data".
stm is a framework for Structural Topic Models.
tosca is a framework for statistical methods in content analysis including visualizations and validation techniques. It is also useful for managing and manipulating text data to a structure requested by ldaPrototype.
topicmodels is another framework for various topic models based on the Latent Dirichlet Allocation and Correlated Topics Models.
ldatuning is a framework for finding the optimal number of topics using various metrics.

Contribution

This R package is licensed under the GPLv3. For bug reports (lack of documentation, misleading or wrong documentation, unexpected behaviour, ...) and feature requests please use the issue tracker. Pull requests are welcome and will be included at the discretion of the author.

Installation

```{R}
install.packages("ldaPrototype")
```
For the development version use devtools:
```{R}
devtools::install_github("JonasRieger/ldaPrototype")
```

(Quick Start) Example

Load the package and the example dataset from Reuters consisting of 91 articles. tosca::LDAprep can be used to manipulate text data to the format requested by ldaPrototype.
```{R}
library("ldaPrototype")
data(reuters_docs)
data(reuters_vocab)
```
Run the shortcut function to create an LDAPrototype object. It consists of the LDAPrototype of 4 LDA runs (with specified seeds) with 10 topics each. The LDA selected by the algorithm can be retrieved using getPrototype or getLDA.
```{R}
res = LDAPrototype(docs = reuters_docs, vocabLDA = reuters_vocab, n = 4, K = 10, seeds = 1:4)
proto = getPrototype(res) #= getLDA(res)
```
The same result can also be achieved by executing the following lines of code in several steps, which can be useful for interim evaluations.
```{R}
reps = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, seeds = 1:4)
topics = mergeTopics(reps, vocab = reuters_vocab)
jacc = jaccardTopics(topics)
sclop = SCLOP.pairwise(jacc)
res2 = getPrototype(reps, sclop = sclop)

proto2 = getPrototype(res2) #= getLDA(res2)

identical(res, res2)
```
There is also the option to use similarity measures other than the Jaccard coefficient. Currently, the measures cosine similarity (cosineTopics), Jensen-Shannon divergence (jsTopics) and rank-biased overlap (rboTopics) are implemented in addition to the standard Jaccard coefficient (jaccardTopics).
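To illustrate what an alternative similarity measure computes, here is a minimal base R sketch of cosine similarity between topics. It is not the package's cosineTopics implementation: the toy counts, the object names and the assumed layout (rows are words, columns are topics) are all hypothetical.

```{R}
# Hypothetical word counts: rows are words, columns are topics.
topics = cbind(t1 = c(10, 0, 5), t2 = c(0, 8, 4), t3 = c(20, 0, 10))
rownames(topics) = c("oil", "bank", "rate")

# Cosine similarity: dot product of two topic vectors divided by the
# product of their Euclidean norms.
norms = sqrt(colSums(topics^2))
cosine = crossprod(topics) / outer(norms, norms)

cosine["t1", "t2"]  # topics sharing only one word
cosine["t1", "t3"]  # t3 is a multiple of t1, so the similarity is 1
```

Because cosine similarity ignores the overall scale of a topic, t1 and t3 are treated as identical here even though their absolute counts differ.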

To get an overview of the workflow, the associated functions and getters for each type of object, the following call is helpful:
```{R}
?`ldaPrototype-package`
```

(Slightly more detailed) Example

Similar to the quick start example, the shortcut of one single call is again compared with the step-by-step procedure. We model 5 LDAs with K = 12 topics, hyperparameters alpha = eta = 0.1 and seeds 1:5. We want to calculate the log likelihoods for the 20 iterations after 5 burn-in iterations and topic similarities should be based on atLeast = 3 words (see Step 3 below). In addition, we want to keep all interim calculations, which would be discarded by default to save memory space.
```{R}
res = LDAPrototype(docs = reuters_docs, vocabLDA = reuters_vocab,
  n = 5, K = 12, alpha = 0.1, eta = 0.1, compute.log.likelihood = TRUE,
  burnin = 5, num.iterations = 20, atLeast = 3, seeds = 1:5,
  keepLDAs = TRUE, keepSims = TRUE, keepTopics = TRUE)
```
Based on res we can have a look at several getter functions:
```{R}
getID(res)
getPrototypeID(res)

getParam(res)
getParam(getLDA(res))

getLDA(res, all = TRUE)
getLDA(res)

est = getEstimators(getLDA(res))
est$phi[,1:3]
est$theta[,1:3]
getLog.likelihoods(getLDA(res))

getSCLOP(res)
getSimilarity(res)[1:5, 1:5]
tosca::topWords(getTopics(getLDA(res)), 5)
```

Step 1: LDA Replications

In the first step we simply run the LDA procedure five times with the given parameters. This can also be done with the support of batchtools, using LDABatch instead of LDARep, or of parallelMap, by setting the pm.backend and (optionally) ncpus argument(s).
```{R}
reps = LDARep(docs = reuters_docs, vocab = reuters_vocab,
  n = 5, K = 12, alpha = 0.1, eta = 0.1, compute.log.likelihood = TRUE,
  burnin = 5, num.iterations = 20, seeds = 1:5)
```

Step 2: Merging Topic Matrices of Replications

The topic matrices of all replications are merged and reduced to the vocabulary given in vocab. By default the vocabulary of the first topic matrix is used, which is a simplification for the case that all LDAs contain the same vocabulary set.
```{R}
topics = mergeTopics(reps, vocab = reuters_vocab)
```
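The idea of merging can be sketched in base R without the package. This is not mergeTopics itself: the two toy runs, the object names and the output layout (rows are words, columns are topics labelled by run) are assumptions made for illustration.

```{R}
# Hypothetical topic-word count matrices of two runs (rows: topics, columns: words).
run1 = matrix(c(5, 0, 2,
                1, 4, 0), nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("oil", "bank", "rate")))
run2 = matrix(c(3, 6, 1,
                7, 0, 2), nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("bank", "oil", "trade")))
runs = list(run1 = run1, run2 = run2)

# Reduce every run to a common vocabulary and bind the topics column-wise.
vocab = c("oil", "bank")
merged = do.call(cbind, lapply(names(runs), function(id) {
  m = t(runs[[id]][, vocab, drop = FALSE])        # words x topics
  colnames(m) = paste0(id, ".", seq_len(ncol(m))) # label topics by run
  m
}))
merged
```

Selecting columns by word name, rather than by position, keeps the rows aligned even when the runs store their vocabularies in a different order.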

Step 3: Topic Similarities

We use the merged topic matrix to calculate pairwise topic similarities using the Jaccard coefficient with parameters adjusting the consideration of words. A word is taken as relevant for a topic if its count passes thresholds given by limit.rel and limit.abs. A word is considered for calculation of similarities if it is relevant for the topic or if it belongs to the (atLeast =) 3 most common words in the corresponding topic. Alternatively, the similarities can also be calculated considering the cosine similarity (cosineTopics), Jensen-Shannon divergence (jsTopics - parameter epsilon to ensure computability) or rank-biased overlap (rboTopics - parameter k for maximum depth of evaluation and p as weighting parameter).
```{R}
jacc = jaccardTopics(topics, limit.rel = 1/500, limit.abs = 10, atLeast = 3)
getSimilarity(jacc)[1:3, 1:3]
```
We can check the number of relevant and considered words using the ad-hoc getter. The difference between n1 and n2 can become larger than (atLeast =) 3 if there are ties in the count of words, which is negligible for large sample sizes.
```{R}
n1 = getRelevantWords(jacc)
n2 = getConsideredWords(jacc)
(n2-n1)[n2-n1 != 0]
```
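The thresholding described above can be sketched in base R. This is a simplified illustration, not the code of jaccardTopics: it assumes that a word must exceed both thresholds to be relevant, that limit.rel is applied to the topic's total count, and that ties at the atLeast cut-off are all kept. The helper considered_words and the toy counts are hypothetical.

```{R}
# Hypothetical word counts: rows are words, columns are topics.
counts = matrix(c(50, 30,  5,  0, 1,
                  45,  0, 20, 12, 2,
                   0, 40, 25,  3, 0), nrow = 5,
                dimnames = list(c("oil", "price", "bank", "rate", "trade"),
                                c("t1", "t2", "t3")))

# A word is considered if it is relevant or among the at_least most common words.
considered_words = function(x, limit_rel, limit_abs, at_least) {
  relevant = x > limit_abs & x > limit_rel * sum(x)
  top = rank(-x, ties.method = "min") <= at_least  # ties are all kept
  relevant | top
}

sets = 1 * apply(counts, 2, considered_words,
                 limit_rel = 0.1, limit_abs = 10, at_least = 2)

# Jaccard coefficient: size of the intersection divided by size of the union.
inter = crossprod(sets)
sizes = colSums(sets)
jaccard = inter / (outer(sizes, sizes, "+") - inter)
jaccard
```

In this toy example t1 and t2 share only the word "oil" out of four considered words in total, so their Jaccard coefficient is 0.25.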

Step 3.1: Representation of Topic Similarities as Dendrogram

It is possible to represent the calculated pairwise topic similarities as a dendrogram using dendTopics and related plot options.
```{R}
dend = dendTopics(jacc)
plot(dend)
```
The S-CLOP algorithm results in a pruning state of the dendrogram, which can be retrieved by calling pruneSCLOP. By default each topic is colored according to the LDA run it belongs to; the cluster membership can also be visualized by colors or by vertical lines with freely chosen parameters.
```{R}
pruned = pruneSCLOP(dend)
plot(dend, pruned)
plot(dend, pruning = pruned, pruning.par = list(type = "both", lty = 1, lwd = 2, col = "red"))
```
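How a similarity matrix becomes a dendrogram can be sketched with base R's stats functions. The sketch assumes that similarities are turned into distances as 1 - similarity and clustered with complete linkage; whether dendTopics uses exactly these settings is an assumption. The toy matrix and names are hypothetical, and cutree is a plain global cut, not the local pruning of S-CLOP.

```{R}
# Hypothetical pairwise similarities of four topics from two runs.
topic_ids = c("run1.1", "run2.1", "run1.2", "run2.2")
sim = matrix(c(1.0, 0.8, 0.1, 0.2,
               0.8, 1.0, 0.2, 0.1,
               0.1, 0.2, 1.0, 0.7,
               0.2, 0.1, 0.7, 1.0), nrow = 4, byrow = TRUE,
             dimnames = list(topic_ids, topic_ids))

# Convert similarities to distances and cluster hierarchically.
clust = hclust(as.dist(1 - sim), method = "complete")
dend = as.dendrogram(clust)

# A simple global cut into two clusters (S-CLOP prunes locally instead).
groups = cutree(clust, k = 2)
groups
# plot(dend) would draw the tree.
```

Here each cluster contains exactly one topic from each run, which is the situation S-CLOP rewards: similar runs produce topics that pair up across runs.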

Step 4: Pairwise LDA Model Similarities (S-CLOP)

For determination of the LDAPrototype the pairwise S-CLOP similarities of the 5 LDA runs are needed.
```{R}
sclop = SCLOP.pairwise(jacc)
```

Step 5: Determine LDAPrototype

In the last step the LDAPrototype itself is determined by maximizing the mean pairwise S-CLOP per LDA.
```{R}
res2 = getPrototype(reps, sclop = sclop)
```
There are several possibilities for using shortcut functions to summarize steps of the procedure. For example, we can determine the LDAPrototype after Step 1:
```{R}
res3 = getPrototype(reps, atLeast = 3)
```
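The selection rule itself is simple enough to sketch in base R. The similarity values below are made up, and the sketch assumes a full symmetric matrix with ones on the diagonal, which may differ from the structure SCLOP.pairwise actually returns.

```{R}
# Hypothetical pairwise S-CLOP similarities of four LDA runs.
run_ids = paste0("run", 1:4)
sclop_toy = matrix(c(1.00, 0.80, 0.70, 0.60,
                     0.80, 1.00, 0.90, 0.75,
                     0.70, 0.90, 1.00, 0.65,
                     0.60, 0.75, 0.65, 1.00), nrow = 4, byrow = TRUE,
                   dimnames = list(run_ids, run_ids))

# Mean similarity of each run to all other runs (the diagonal is excluded).
n = nrow(sclop_toy)
mean_sim = (rowSums(sclop_toy) - diag(sclop_toy)) / (n - 1)

# The prototype is the run with the highest mean pairwise similarity.
prototype = names(which.max(mean_sim))
prototype
```

In this toy example run2 is selected, because it is on average closest to the three other runs.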

Owner

Name: Jonas Rieger
Login: JonasRieger
Kind: user
Location: Germany
Company: TU Dortmund University

Website: https://jonasrieger.github.io/
Twitter: Rieger94
Repositories: 6
Profile: https://github.com/JonasRieger

Statistician | Data Scientist in NLP

JOSS Publication

ldaPrototype: A method in R to get a Prototype of multiple Latent Dirichlet Allocations

Published

July 16, 2020

DOI

10.21105/joss.02181

Volume 5, Issue 51, Page 2181

Authors

Jonas Rieger

TU Dortmund University

Editor

Karthik Ram

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Committers

Last synced: 7 months ago

All Time

Total Commits: 357
Total Committers: 1
Avg Commits per committer: 357.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Jonas Rieger	j**r@t**e	357

Committer Domains (Top 20 + Academic)

tu-dortmund.de: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 11
Total pull requests: 1
Average time to close issues: 20 days
Average time to close pull requests: 2 minutes
Total issue authors: 5
Total pull request authors: 1
Average comments per issue: 1.0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

bstewart (3)
JonasRieger (3)
TommyJones (3)
HenrikBengtsson (1)
hadley (1)

Pull Request Authors

JonasRieger (1)

Top Labels

Issue Labels

usability (4) bug (4) formalities (4) wontfix (1) help wanted (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- cran 380 last-month
Total docker downloads: 16

Total dependent packages: 1
Total dependent repositories: 1
Total versions: 4
Total maintainers: 1

cran.r-project.org: ldaPrototype

Prototype of Multiple Latent Dirichlet Allocation Runs

Homepage: https://github.com/JonasRieger/ldaPrototype
Documentation: http://cran.r-project.org/web/packages/ldaPrototype/ldaPrototype.pdf
License: GPL (≥ 3)
Latest release: 0.3.1
published over 4 years ago

Versions: 4
Dependent Packages: 1
Dependent Repositories: 1
Downloads: 380 Last month
Docker Downloads: 16

Rankings

Dependent packages count: 18.1%

Stargazers count: 19.3%

Forks count: 21.0%

Average: 23.4%

Dependent repos count: 23.9%

Downloads: 34.5%

Maintainers (1)

jonas.rieger@tu-dortmund.de

Last synced: 6 months ago

Dependencies

DESCRIPTION cran

R >= 3.5.0 depends
batchtools >= 0.9.11 imports
checkmate >= 1.8.5 imports
colorspace >= 1.4 imports
data.table >= 1.11.2 imports
dendextend * imports
fs >= 1.2.0 imports
future * imports
lda >= 1.4.2 imports
parallelMap * imports
progress >= 1.1.1 imports
stats * imports
utils * imports
RColorBrewer >= 1.1 suggests
covr * suggests
testthat * suggests
tosca * suggests

.github/workflows/R-CMD-check.yaml actions

actions/cache v2 composite
actions/checkout v2 composite
actions/upload-artifact main composite
r-lib/actions/setup-pandoc v1 composite
r-lib/actions/setup-r v1 composite

.github/workflows/covr.yaml actions

actions/checkout v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite
