tidylda

tidylda: An R Package for Latent Dirichlet Allocation Using 'tidyverse' Conventions - Published in JOSS (2024)

https://github.com/tommyjones/tidylda

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: arxiv.org, joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software
Last synced: 6 months ago

Repository

Implements an algorithm for Latent Dirichlet Allocation using style conventions from the [tidyverse](https://style.tidyverse.org/) and [tidymodels](https://tidymodels.github.io/model-implementation-principles/index.html).

Basic Info
  • Host: GitHub
  • Owner: TommyJones
  • License: other
  • Language: R
  • Default Branch: main
  • Size: 51.2 MB
Statistics
  • Stars: 42
  • Watchers: 5
  • Forks: 3
  • Open Issues: 10
  • Releases: 5
Created about 6 years ago · Last pushed about 1 year ago
Metadata Files
Readme Changelog License Code of conduct

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# tidylda 


[![DOI](https://joss.theoj.org/papers/10.21105/joss.06800/status.svg)](https://doi.org/10.21105/joss.06800)
[![Codecov test coverage](https://codecov.io/gh/TommyJones/tidylda/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tommyjones/tidylda/branch/main)
[![R-CMD-check](https://github.com/TommyJones/tidylda/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/TommyJones/tidylda/actions/workflows/R-CMD-check.yaml)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)



Latent Dirichlet Allocation Using 'tidyverse' Conventions

`tidylda` implements an algorithm for Latent Dirichlet Allocation using style conventions from the [tidyverse](https://style.tidyverse.org/) and [tidymodels](https://tidymodels.github.io/model-implementation-principles/). 
    
In addition, this implementation of LDA allows you to:

* use asymmetric prior parameters alpha and eta
* use a matrix prior parameter, eta, to seed topics into a model
* use a previously-trained model as a prior for a new model
* apply LDA in a transfer-learning paradigm, updating a model's parameters with additional data (or additional iterations)
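
To illustrate the first two points, asymmetric and matrix priors can be built with base R before being passed to `tidylda()`. This is a hedged sketch: the vocabulary, topic count, and prior values below are invented for illustration and are not drawn from the package's documentation.

``` r
# Illustrative only: shapes and values are invented, not taken from tidylda.
vocab <- c("gene", "protein", "cell", "model", "data")
k <- 3  # number of topics

# asymmetric alpha: give topic 1 more prior weight in document-topic distributions
alpha <- c(0.5, 0.1, 0.1)

# matrix eta: seed topic 1 toward biology terms by upweighting their prior mass
eta <- matrix(0.05, nrow = k, ncol = length(vocab),
              dimnames = list(NULL, vocab))
eta[1, c("gene", "protein", "cell")] <- 1

# a prior like this could then be passed along the lines of
# tidylda(data = dtm, k = k, alpha = alpha, eta = eta, iterations = 200)
```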


## Installation

You can install the latest CRAN release with:

``` r
install.packages("tidylda")
```


You can install the development version from [GitHub](https://github.com/) with:

``` r
install.packages("remotes")

remotes::install_github("tommyjones/tidylda")
```

For a list of dependencies see the DESCRIPTION file.

# Getting started

This package is still in its early stages of development; however, some basic functionality is demonstrated below. Here, we will use the `tidytext` package to create a document term matrix, fit a topic model, predict topics of unseen documents, and update the model with those new documents.

`tidylda` uses the following naming conventions for topic models:

* `theta` is a matrix whose rows are distributions of topics over documents, or P(topic|document)
* `beta` is a matrix whose rows are distributions of tokens over topics, or P(token|topic)
* `lambda` is a matrix whose rows are distributions of topics over tokens, or P(topic|token). `lambda` is useful for making predictions with a computationally simple and efficient dot product, and it may be interesting to analyze in its own right.
* `alpha` is the prior that tunes `theta`
* `eta` is the prior that tunes `beta`
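
To make the relationship between `beta` and `lambda` concrete, here is a small base-R sketch using toy numbers (not tidylda internals; the matrix orientation is chosen for the sketch and may differ from the package's own layout). It derives P(topic|token) from P(token|topic) via Bayes' rule and uses the result for a dot-product prediction:

``` r
# Illustrative only: a tiny two-topic, three-token model with invented numbers.
beta <- rbind(
  c(0.7, 0.2, 0.1),  # P(token | topic 1)
  c(0.1, 0.3, 0.6)   # P(token | topic 2)
)
p_topic <- c(0.5, 0.5)  # marginal topic probabilities

# Bayes' rule: P(topic | token) = P(token | topic) * P(topic) / P(token)
joint  <- beta * p_topic             # scale each topic's row by its probability
lambda <- t(joint) / colSums(joint)  # tokens in rows, topics in columns

rowSums(lambda)  # each row is a distribution over topics, summing to 1

# "dot" prediction: normalized token counts times lambda gives P(topic | document)
x <- c(3, 1, 0)  # token counts for a new document
theta_new <- as.vector((x / sum(x)) %*% lambda)
```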

## Example

```{r example}
library(tidytext)
library(dplyr)
library(ggplot2)
library(tidyr)
library(tidylda)
library(Matrix)

### Initial set up ---
# load some documents
docs <- nih_sample 

# tokenize using tidytext's unnest_tokens
tidy_docs <- docs %>% 
  select(APPLICATION_ID, ABSTRACT_TEXT) %>% 
  unnest_tokens(output = word, 
                input = ABSTRACT_TEXT,
                stopwords = stop_words$word,
                token = "ngrams",
                n_min = 1, n = 2) %>% 
  count(APPLICATION_ID, word) %>% 
  filter(n > 1) # keep words/bigrams appearing more than once per document, rather than per corpus

tidy_docs <- tidy_docs %>% # remove tokens that are just numbers
  filter(!stringr::str_detect(word, "^[0-9]+$"))

# append observation level data 
colnames(tidy_docs)[1:2] <- c("document", "term")


# turn a tidy tbl into a sparse dgCMatrix 
# note tidylda has support for several document term matrix formats
d <- tidy_docs %>% 
  cast_sparse(document, term, n)

# let's split the documents into two groups to demonstrate predictions and updates
d1 <- d[1:50, ]

d2 <- d[51:nrow(d), ]

# make sure we have different vocabulary for each data set to simulate the "real world"
# where you get new tokens coming in over time
d1 <- d1[, colSums(d1) > 0]

d2 <- d2[, colSums(d2) > 0]

### fit an initial model and inspect it ----
set.seed(123)

lda <- tidylda(
  data = d1,
  k = 10,
  iterations = 200, 
  burnin = 175,
  alpha = 0.1, # also accepts vector inputs
  eta = 0.05, # also accepts vector or matrix inputs
  optimize_alpha = FALSE, # experimental
  calc_likelihood = TRUE,
  calc_r2 = TRUE, # see https://arxiv.org/abs/1911.11061
  return_data = FALSE
)

# did the model converge?
# formal test statistics exist for this, but visually it should look like "yes"
qplot(x = iteration, y = log_likelihood, data = lda$log_likelihood, geom = "line") + 
    ggtitle("Checking model convergence")

# look at the model overall
glance(lda)

print(lda)

# it comes with its own summary matrix that's printed out with print(), above
lda$summary


# inspect the individual matrices
tidy_theta <- tidy(lda, matrix = "theta")

tidy_theta

tidy_beta <- tidy(lda, matrix = "beta")

tidy_beta

tidy_lambda <- tidy(lda, matrix = "lambda")

tidy_lambda

# append observation-level data
augmented_docs <- augment(lda, data = tidy_docs)

augmented_docs

### predictions on held out data ---
# two methods: Gibbs sampling is cleaner and more technically correct in the Bayesian sense
p_gibbs <- predict(lda, new_data = d2[1, ], iterations = 100, burnin = 75)

# dot is faster, less prone to error (e.g. underflow), noisier, and frequentist
p_dot <- predict(lda, new_data = d2[1, ], method = "dot")

# pull both together into a plot to compare
tibble(topic = 1:ncol(p_gibbs), gibbs = p_gibbs[1,], dot = p_dot[1, ]) %>%
  pivot_longer(cols = gibbs:dot, names_to = "type") %>%
  ggplot() + 
  geom_bar(mapping = aes(x = topic, y = value, group = type, fill = type), 
           stat = "identity", position="dodge") +
  scale_x_continuous(breaks = 1:10, labels = 1:10) + 
  ggtitle("Gibbs predictions vs. dot product predictions")

### Augment as an implicit prediction using the 'dot' method ----
# Aggregating over terms results in a distribution of topics over documents
# roughly equivalent to using the "dot" method of predictions.
augment_predict <- 
  augment(lda, tidy_docs, "prob") %>%
  group_by(document) %>% 
  select(-c(document, term)) %>% 
  summarise_all(function(x) sum(x, na.rm = TRUE))

# reformat for easy plotting
augment_predict <- 
  as_tibble(t(augment_predict[, -c(1,2)]), .name_repair = "minimal")

colnames(augment_predict) <- unique(tidy_docs$document)

augment_predict$topic <- 1:nrow(augment_predict) %>% as.factor()

compare_mat <- 
  augment_predict %>%
  select(
    topic,
    augment = matches(rownames(d2)[1])
  ) %>%
  mutate(
    augment = augment / sum(augment), # normalize to sum to 1
    dot = p_dot[1, ]
  ) %>%
  pivot_longer(cols = c(augment, dot))

ggplot(compare_mat) + 
  geom_bar(aes(y = value, x = topic, group = name, fill = name), 
           stat = "identity", position = "dodge") +
  labs(title = "Prediction using 'augment' vs 'predict(..., method = \"dot\")'")

# Not shown: aggregating over documents results in recovering the "tidy" lambda.

### updating the model ----
# now that you have new documents, maybe you want to fold them into the model?
lda2 <- refit(
  object = lda, 
  new_data = d, # save the trouble of manually combining d1 and d2 by just using d
  iterations = 200, 
  burnin = 175,
  calc_likelihood = TRUE,
  calc_r2 = TRUE
)

# we can do similar analyses
# did the model converge?
qplot(x = iteration, y = log_likelihood, data = lda2$log_likelihood, geom = "line") +
  ggtitle("Checking model convergence")

# look at the model overall
glance(lda2)

print(lda2)


# how does that compare to the old model?
print(lda)
```

There are several vignettes available in [/vignettes](https://github.com/TommyJones/tidylda/tree/main/vignettes). They can be compiled using `knitr`, or you can view [PDF versions on CRAN](https://CRAN.R-project.org/package=tidylda).

See NEWS.md for a changelog, including changes from the CRAN release to the development version on GitHub.

See the "Issues" tab on GitHub to see planned features as well as bug fixes.

# Contributions

If you would like to contribute to this package, please start by opening an issue on GitHub. Direct contributions via pull requests are discouraged unless you have been invited to make them. 

If you have any suggestions for additional functionality, changes to functionality, changes to arguments, or other aspects of the API, please let me know by opening an issue on GitHub or sending me an email: jones.thos.w at gmail.com.

## Code of Conduct

Please note that the tidylda project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.

Owner

  • Name: Tommy Jones
  • Login: TommyJones
  • Kind: user
  • Location: Washington DC

Technology | Statistics | Machine Learning.

JOSS Publication

tidylda: An R Package for Latent Dirichlet Allocation Using 'tidyverse' Conventions
Published
July 25, 2024
Volume 9, Issue 99, Page 6800
Authors
Tommy Jones ORCID
Foundation, USA
Editor
Kanishka B. Narayan ORCID
Tags
topic models LDA natural language processing tidy data

GitHub Events

Total
  • Issues event: 1
  • Watch event: 1
  • Push event: 2
Last Year
  • Issues event: 1
  • Watch event: 1
  • Push event: 2

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 490
  • Total Committers: 4
  • Avg Commits per committer: 122.5
  • Development Distribution Score (DDS): 0.01
Past Year
  • Commits: 2
  • Committers: 2
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.5
Top Committers
Name Email Commits
Tommy Jones j****w@g****m 485
Brendan Knapp b****p@g****m 3
Tommy Jones t****y@T****l 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 66
  • Total pull requests: 12
  • Average time to close issues: 4 months
  • Average time to close pull requests: 13 minutes
  • Total issue authors: 4
  • Total pull request authors: 1
  • Average comments per issue: 2.06
  • Average comments per pull request: 0.25
  • Merged pull requests: 12
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • TommyJones (60)
  • maximelenormand (3)
  • hassaniazi (2)
  • harryahlas (1)
Pull Request Authors
  • TommyJones (12)
Top Labels
Issue Labels
CRAN (20) enhancement (18) bug (8) documentation (4) administrative (4) wontfix (1) duplicate (1) help wanted (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 642 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 5
  • Total maintainers: 1
cran.r-project.org: tidylda

Latent Dirichlet Allocation Using 'tidyverse' Conventions

  • Versions: 5
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 642 Last month
Rankings
Stargazers count: 8.0%
Forks count: 14.9%
Average: 26.9%
Dependent packages count: 29.8%
Dependent repos count: 35.5%
Downloads: 46.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.5.0 depends
  • Matrix * imports
  • Rcpp >= 1.0.2 imports
  • dplyr * imports
  • generics * imports
  • gtools * imports
  • methods * imports
  • mvrsquared >= 0.1.0 imports
  • rlang * imports
  • stats * imports
  • stringr * imports
  • tibble * imports
  • tidyr * imports
  • tidytext * imports
  • covr * suggests
  • knitr * suggests
  • parallel * suggests
  • quanteda * suggests
  • spelling * suggests
  • testthat * suggests
  • tm * suggests
.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v3 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/test-coverage.yaml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • r-lib/actions/setup-pandoc v1 composite
  • r-lib/actions/setup-r v1 composite