tidylo

Weighted tidy log odds ratio ⚖️

https://github.com/juliasilge/tidylo

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.6%) to scientific vocabulary

Keywords

empirical-bayes log-odds-ratio r tidy-data tidyverse weighted-log-odds
Last synced: 6 months ago

Repository

Statistics
  • Stars: 96
  • Watchers: 7
  • Forks: 4
  • Open Issues: 0
  • Releases: 0
Created over 6 years ago · Last pushed almost 4 years ago
Metadata Files
  • Readme
  • License
  • Code of conduct

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.path = "man/figures/README-",
    out.width = "100%"
)
suppressPackageStartupMessages(library(ggplot2))
theme_set(theme_light())
```

# tidylo: Weighted Tidy Log Odds Ratio ⚖️


[![CRAN status](https://www.r-pkg.org/badges/version/tidylo)](https://CRAN.R-project.org/package=tidylo)
[![R-CMD-check](https://github.com/juliasilge/tidylo/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/juliasilge/tidylo/actions/workflows/check-standard.yaml)
[![Codecov test coverage](https://codecov.io/gh/juliasilge/tidylo/branch/main/graph/badge.svg)](https://app.codecov.io/gh/juliasilge/tidylo?branch=main)
[![lifecycle](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html)


**Authors:** [Julia Silge](https://juliasilge.com/), [Alex Hayes](https://www.alexpghayes.com/), [Tyler Schnoebelen](https://www.letslanguage.org/)
**License:** [MIT](https://opensource.org/licenses/MIT)

How can we measure how the usage or frequency of some **feature**, such as words, differs across some group or **set**, such as documents? One option is to use the log odds ratio, but the log odds ratio alone does not account for sampling variability; we haven't counted every feature the same number of times so how do we know which differences are meaningful?

Enter the **weighted log odds**, which tidylo provides an implementation for, using tidy data principles. In particular, here we use the method outlined in [Monroe, Colaresi, and Quinn (2008)](https://doi.org/10.1093/pan/mpn018) to weight the log odds ratio by a prior. By default, the prior is estimated from the data itself, an empirical Bayes approach, but an uninformative prior is also available.

## Installation

You can install the released version of tidylo from [CRAN](https://CRAN.R-project.org) with:

```{r eval=FALSE}
install.packages("tidylo")
```

Or you can install the development version from GitHub with [devtools](https://devtools.r-lib.org/):

```{r, eval=FALSE}
# install.packages("devtools")
devtools::install_github("juliasilge/tidylo")
```

## Example

Using weighted log odds is a great approach for text analysis when we want to measure how word usage differs across a set of documents. Let's explore the [six published, completed novels of Jane Austen](https://github.com/juliasilge/janeaustenr) and use the [tidytext](https://github.com/juliasilge/tidytext) package to count up the bigrams (sequences of two adjacent words) in each novel. This weighted log odds approach would work equally well for single words.
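
As background for the example, the quantity being computed can be sketched in the notation of Monroe, Colaresi, and Quinn (2008). The symbols are introduced here for illustration and are not taken from the package: $y_w^{(i)}$ is the count of feature $w$ in set $i$, $n^{(i)}$ is the total count of all features in set $i$, $\alpha_w$ is the prior pseudo-count for feature $w$, $\alpha_0 = \sum_w \alpha_w$, and $j$ denotes the comparison group (for instance, the other sets pooled together). This is a sketch of the published method under those assumptions, and tidylo's internal computation may differ in its details.

$$
\begin{aligned}
\Omega_w^{(i)} &= \frac{y_w^{(i)} + \alpha_w}{n^{(i)} + \alpha_0 - y_w^{(i)} - \alpha_w} \\
\delta_w^{(i-j)} &= \log \Omega_w^{(i)} - \log \Omega_w^{(j)} \\
\sigma^2\left(\delta_w^{(i-j)}\right) &\approx \frac{1}{y_w^{(i)} + \alpha_w} + \frac{1}{y_w^{(j)} + \alpha_w} \\
\zeta_w^{(i-j)} &= \frac{\delta_w^{(i-j)}}{\sqrt{\sigma^2\left(\delta_w^{(i-j)}\right)}}
\end{aligned}
$$

Here $\delta$ is the log odds ratio smoothed by the prior, and $\zeta$ divides it by its approximate standard deviation. A feature seen only a handful of times has a large variance, so its $\zeta$ is pulled toward zero even when its raw log odds ratio is extreme. With an empirical Bayes prior, $\alpha_w$ would be estimated from the data; with an uninformative prior, every feature would receive the same constant.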
```{r}
library(dplyr)
library(janeaustenr)
library(tidytext)

tidy_bigrams <- austen_books() %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    filter(!is.na(bigram))

bigram_counts <- tidy_bigrams %>%
    count(book, bigram, sort = TRUE)

bigram_counts
```

Now let's use the `bind_log_odds()` function from the tidylo package to find the weighted log odds for each bigram. The weighted log odds computed by this function are also [z-scores](https://en.wikipedia.org/wiki/Standard_score) for the log odds; this quantity is useful for comparing frequencies across categories or sets but its relationship to an odds ratio is not straightforward after the weighting.

What are the bigrams with the highest weighted log odds for these books?

```{r}
library(tidylo)

bigram_log_odds <- bigram_counts %>%
    bind_log_odds(book, bigram, n)

bigram_log_odds %>%
    arrange(-log_odds_weighted)
```

The bigrams more likely to come from each book, compared to the others, involve proper nouns. We can make a visualization as well.

```{r bigram_plot, fig.width=10, fig.height=7}
library(ggplot2)

bigram_log_odds %>%
    group_by(book) %>%
    slice_max(log_odds_weighted, n = 10) %>%
    ungroup() %>%
    mutate(bigram = reorder(bigram, log_odds_weighted)) %>%
    ggplot(aes(log_odds_weighted, bigram, fill = book)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(vars(book), scales = "free") +
    labs(y = NULL)
```

### Community Guidelines

This project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support [here](https://github.com/juliasilge/tidylo/issues).

Owner

  • Name: Julia Silge
  • Login: juliasilge
  • Kind: user
  • Location: Salt Lake City, UT
  • Company: @posit-pbc

Data science and MLOps with #rstats, text mining, 💖

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 65
  • Total Committers: 4
  • Avg Commits per committer: 16.25
  • Development Distribution Score (DDS): 0.123
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Julia Silge j****e@g****m 57
alex hayes a****s@g****m 5
TylerSchnoebelen t****n@h****m 2
Maëlle Salmon m****n@y****e 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 6
  • Total pull requests: 3
  • Average time to close issues: 6 months
  • Average time to close pull requests: 1 day
  • Total issue authors: 4
  • Total pull request authors: 3
  • Average comments per issue: 2.33
  • Average comments per pull request: 1.67
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • juliasilge (3)
  • dhicks (1)
  • NoelMatinez (1)
  • lyons7 (1)
Pull Request Authors
  • maelle (1)
  • alexpghayes (1)
  • juliasilge (1)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 410 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 2
  • Total maintainers: 1
cran.r-project.org: tidylo

Weighted Tidy Log Odds Ratio

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 410 Last month
Rankings
Stargazers count: 4.3%
Forks count: 14.3%
Average: 18.7%
Downloads: 22.8%
Dependent repos count: 24.3%
Dependent packages count: 27.9%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • dplyr * imports
  • rlang * imports
  • covr * suggests
  • ggplot2 * suggests
  • janeaustenr * suggests
  • knitr * suggests
  • rmarkdown * suggests
  • stringr * suggests
  • testthat >= 3.0.0 suggests
  • tidytext * suggests