tidylo

Weighted tidy log odds ratio ⚖️

https://github.com/juliasilge/tidylo

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
✓ DOI references: found 2 DOI reference(s) in README
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (17.6%) to scientific vocabulary

Keywords

empirical-bayes log-odds-ratio r tidy-data tidyverse weighted-log-odds

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: juliasilge
License: other
Language: R
Default Branch: main
Homepage: https://juliasilge.github.io/tidylo/
Size: 4.12 MB

Statistics

Stars: 96
Watchers: 7
Forks: 4
Open Issues: 0
Releases: 0

Created over 6 years ago · Last pushed almost 4 years ago

Metadata Files

Readme License Code of conduct

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.path = "man/figures/README-",
    out.width = "100%"
)
suppressPackageStartupMessages(library(ggplot2))
theme_set(theme_light())
```

# tidylo: Weighted Tidy Log Odds Ratio ⚖️


[![CRAN status](https://www.r-pkg.org/badges/version/tidylo)](https://CRAN.R-project.org/package=tidylo)
[![R-CMD-check](https://github.com/juliasilge/tidylo/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/juliasilge/tidylo/actions/workflows/check-standard.yaml)
[![Codecov test coverage](https://codecov.io/gh/juliasilge/tidylo/branch/main/graph/badge.svg)](https://app.codecov.io/gh/juliasilge/tidylo?branch=main)
[![lifecycle](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html)


**Authors:** [Julia Silge](https://juliasilge.com/), [Alex Hayes](https://www.alexpghayes.com/), [Tyler Schnoebelen](https://www.letslanguage.org/)

**License:** [MIT](https://opensource.org/licenses/MIT)

How can we measure how the usage or frequency of some **feature**, such as words, differs across some group or **set**, such as documents? One option is the log odds ratio, but on its own it does not account for sampling variability: we haven't counted every feature the same number of times, so how do we know which differences are meaningful?

Enter the **weighted log odds**, which tidylo implements using tidy data principles. In particular, tidylo uses the method outlined in [Monroe, Colaresi, and Quinn (2008)](https://doi.org/10.1093/pan/mpn018) to weight the log odds ratio by a prior. By default, the prior is estimated from the data itself (an empirical Bayes approach), but an uninformative prior is also available.
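To make the weighting concrete, here is a minimal base R sketch of the two-group z-score in the spirit of Monroe, Colaresi, and Quinn (2008). It is an illustration under stated assumptions, not tidylo's internal code: the function name `weighted_log_odds_sketch()`, the toy counts, and the choice of pooled counts as the prior are all hypothetical.

```r
# Hypothetical helper (not part of tidylo): z-scored log odds ratio of each
# feature in group i versus group j, smoothed by prior pseudo-counts `alpha`.
weighted_log_odds_sketch <- function(y_i, y_j, alpha) {
    n_i <- sum(y_i)          # total count in group i
    n_j <- sum(y_j)          # total count in group j
    alpha_0 <- sum(alpha)    # total prior pseudo-count

    # Log odds of each feature within each group, after adding the prior
    log_odds_i <- log((y_i + alpha) / (n_i + alpha_0 - y_i - alpha))
    log_odds_j <- log((y_j + alpha) / (n_j + alpha_0 - y_j - alpha))

    delta <- log_odds_i - log_odds_j

    # Approximate variance of the difference; rarely counted features
    # get a larger variance and therefore a smaller z-score
    variance <- 1 / (y_i + alpha) + 1 / (y_j + alpha)

    delta / sqrt(variance)
}

# Toy counts (made up for illustration)
y_a <- c(the = 100, emma = 30, elinor = 0)
y_b <- c(the = 120, emma = 0, elinor = 25)

# Assumption: use the pooled counts as an empirical-Bayes-style prior
prior <- y_a + y_b

z <- weighted_log_odds_sketch(y_a, y_b, prior)
z
```

In this toy example, the common feature `the` ends up with a z-score closer to zero than the group-specific features, which is the behavior the weighting is meant to produce.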

## Installation

You can install the released version of tidylo from [CRAN](https://CRAN.R-project.org) with:

```{r, eval=FALSE}
install.packages("tidylo")
```


Or you can install the development version from GitHub with [devtools](https://devtools.r-lib.org/):

```{r, eval=FALSE}
# install.packages("devtools")
devtools::install_github("juliasilge/tidylo")
```

## Example

Using weighted log odds is a great approach for text analysis when we want to measure how word usage differs across a set of documents. Let's explore the [six published, completed novels of Jane Austen](https://github.com/juliasilge/janeaustenr) and use the [tidytext](https://github.com/juliasilge/tidytext) package to count up the bigrams (sequences of two adjacent words) in each novel. This weighted log odds approach would work equally well for single words.

```{r}
library(dplyr)
library(janeaustenr)
library(tidytext)

tidy_bigrams <- austen_books() %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    filter(!is.na(bigram))

bigram_counts <- tidy_bigrams %>%
    count(book, bigram, sort = TRUE)

bigram_counts
```
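As noted above, the same approach works for single words. A minimal sketch of the equivalent counting step, assuming the default word tokenization of `unnest_tokens()`; the object name `word_counts` is ours:

```r
library(dplyr)
library(janeaustenr)
library(tidytext)

# Count single words per novel instead of bigrams;
# unnest_tokens() tokenizes into words by default
word_counts <- austen_books() %>%
    unnest_tokens(word, text) %>%
    count(book, word, sort = TRUE)

word_counts
```

The result has the same shape as `bigram_counts` (one row per book and feature, with a count `n`), so it can be used in the same way downstream.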

Now let's use the `bind_log_odds()` function from the tidylo package to find the weighted log odds for each bigram. The weighted log odds computed by this function are also [z-scores](https://en.wikipedia.org/wiki/Standard_score) for the log odds; this quantity is useful for comparing frequencies across categories or sets, but its relationship to an odds ratio is not straightforward after the weighting.

What are the bigrams with the highest weighted log odds for these books?

```{r}
library(tidylo)

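# Arguments are the set (book), the feature (bigram), and the count column (n)
bigram_log_odds <- bigram_counts %>%
    bind_log_odds(book, bigram, n)

# Sort from highest to lowest weighted log odds
bigram_log_odds %>%
    arrange(desc(log_odds_weighted))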
```
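The introduction mentions that an uninformative prior is also available. A minimal sketch of how that might look, assuming the `uninformative` and `unweighted` arguments of `bind_log_odds()` as documented for tidylo 0.2.0; the toy counts and the name `toy_log_odds` are made up for illustration:

```r
library(dplyr)
library(tidylo)

# Toy counts (hypothetical), in the same shape as bigram_counts
toy_counts <- tibble(
    book = c("A", "A", "A", "B", "B", "B"),
    bigram = c("of the", "miss emma", "mr knightley",
               "of the", "miss elinor", "colonel brandon"),
    n = c(100L, 30L, 20L, 120L, 25L, 15L)
)

# uninformative = TRUE swaps the empirical Bayes prior for an
# uninformative one; unweighted = TRUE additionally returns the
# unweighted log odds alongside the weighted values
toy_log_odds <- toy_counts %>%
    bind_log_odds(book, bigram, n, uninformative = TRUE, unweighted = TRUE)

toy_log_odds %>%
    arrange(desc(log_odds_weighted))
```

Comparing this output with the default call is one way to see how much the choice of prior changes the ranking for a given dataset.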

The bigrams more likely to come from each book, compared to the others, involve proper nouns. We can make a visualization as well.

```{r bigram_plot, fig.width=10, fig.height=7}
library(ggplot2)

bigram_log_odds %>%
    group_by(book) %>%
    slice_max(log_odds_weighted, n = 10) %>%
    ungroup() %>%
    mutate(bigram = reorder(bigram, log_odds_weighted)) %>%
    ggplot(aes(log_odds_weighted, bigram, fill = book)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(vars(book), scales = "free") +
    labs(y = NULL)
```

### Community Guidelines

This project is released with a
[Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html).
By contributing to this project, you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support [here](https://github.com/juliasilge/tidylo/issues).

Owner

Name: Julia Silge
Login: juliasilge
Kind: user
Location: Salt Lake City, UT
Company: @posit-pbc

Website: https://juliasilge.com/
Repositories: 22
Profile: https://github.com/juliasilge

Data science and MLOps with #rstats, text mining, 💖

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Committers

Last synced: over 1 year ago

All Time

Total Commits: 65
Total Committers: 4
Avg Commits per committer: 16.25
Development Distribution Score (DDS): 0.123

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Julia Silge	j**e@g**m	57
alex hayes	a**s@g**m	5
TylerSchnoebelen	t**n@h**m	2
Maëlle Salmon	m**n@y**e	1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 6
Total pull requests: 3
Average time to close issues: 6 months
Average time to close pull requests: 1 day
Total issue authors: 4
Total pull request authors: 3
Average comments per issue: 2.33
Average comments per pull request: 1.67
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

juliasilge (3)
dhicks (1)
NoelMatinez (1)
lyons7 (1)

Pull Request Authors

maelle (1)
alexpghayes (1)
juliasilge (1)

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads: 410 last month (CRAN)

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 2
Total maintainers: 1

cran.r-project.org: tidylo

Weighted Tidy Log Odds Ratio

Homepage: https://juliasilge.github.io/tidylo/
Documentation: http://cran.r-project.org/web/packages/tidylo/tidylo.pdf
License: MIT + file LICENSE
Latest release: 0.2.0
published almost 4 years ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 410 Last month

Rankings

Stargazers count: 4.3%

Forks count: 14.3%

Average: 18.7%

Downloads: 22.8%

Dependent repos count: 24.3%

Dependent packages count: 27.9%

Maintainers (1)

julia.silge@gmail.com

Last synced: 6 months ago

Dependencies

DESCRIPTION cran

dplyr * imports
rlang * imports
covr * suggests
ggplot2 * suggests
janeaustenr * suggests
knitr * suggests
rmarkdown * suggests
stringr * suggests
testthat >= 3.0.0 suggests
tidytext * suggests
