tidytext

tidytext: Text Mining and Analysis Using Tidy Data Principles in R - Published in JOSS (2016)

https://github.com/juliasilge/tidytext

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 5 DOI reference(s) in README and JOSS metadata
✓ Academic publication links: Links to joss.theoj.org, zenodo.org
✓ Committers with academic emails: 3 of 33 committers (9.1%) from academic institutions
○ Institutional organization owner
✓ JOSS paper metadata: Published in Journal of Open Source Software

Keywords

natural-language-processing r text-mining tidy-data tidyverse

Keywords from Contributors

tokenizer correlation statistical-analysis similarity-measures information-theory

Last synced: 6 months ago

Repository

Text mining using tidy tools :sparkles::page_facing_up::sparkles:

Basic Info

Host: GitHub
Owner: juliasilge
License: other
Language: R
Default Branch: main
Homepage: https://juliasilge.github.io/tidytext/
Size: 130 MB

Statistics

Stars: 1,193
Watchers: 63
Forks: 182
Open Issues: 9
Releases: 23

Topics

natural-language-processing r text-mining tidy-data tidyverse

Created almost 10 years ago · Last pushed 7 months ago

Metadata Files

Readme Changelog License

README.Rmd

---
output: github_document
---




```{r}
#| include = FALSE
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  message = FALSE
)
suppressPackageStartupMessages(library(ggplot2))
theme_set(theme_light())
```

# tidytext: Text mining using tidy tools 

**Authors:** [Julia Silge](https://juliasilge.com/), [David Robinson](http://varianceexplained.org/)

**License:** [MIT](https://opensource.org/licenses/MIT)


[![R-CMD-check](https://github.com/juliasilge/tidytext/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/juliasilge/tidytext/actions/workflows/R-CMD-check.yaml)
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/tidytext)](https://cran.r-project.org/package=tidytext)
[![Codecov test coverage](https://codecov.io/gh/juliasilge/tidytext/branch/main/graph/badge.svg)](https://app.codecov.io/gh/juliasilge/tidytext?branch=main)
[![DOI](https://zenodo.org/badge/22224/juliasilge/tidytext.svg)](https://zenodo.org/badge/latestdoi/22224/juliasilge/tidytext)
[![JOSS](https://joss.theoj.org/papers/10.21105/joss.00037/status.svg)](https://joss.theoj.org/papers/10.21105/joss.00037)
[![Downloads](https://cranlogs.r-pkg.org/badges/tidytext)](https://CRAN.R-project.org/package=tidytext)
[![Total Downloads](https://cranlogs.r-pkg.org/badges/grand-total/tidytext?color=orange)](https://CRAN.R-project.org/package=tidytext)



Using [tidy data principles](https://doi.org/10.18637/jss.v059.i10) can make many text mining tasks easier, more effective, and consistent with tools already in wide use. Much of the infrastructure needed for text mining with tidy data frames already exists in packages like [dplyr](https://cran.r-project.org/package=dplyr), [broom](https://cran.r-project.org/package=broom), [tidyr](https://cran.r-project.org/package=tidyr), and [ggplot2](https://cran.r-project.org/package=ggplot2). In this package, we provide functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. Check out [our book](https://www.tidytextmining.com/) to learn more about text mining using tidy data principles.

### Installation

You can install this package from CRAN:

```{r}
#| eval = FALSE
install.packages("tidytext")
```


Or you can install the development version from GitHub with [remotes](https://github.com/r-lib/remotes):

```{r}
#| eval = FALSE
library(remotes)
install_github("juliasilge/tidytext")
```

### Tidy text mining example: the `unnest_tokens` function

The novels of Jane Austen can be so tidy! Let's use the text of Jane Austen's six completed, published novels from the [janeaustenr](https://cran.r-project.org/package=janeaustenr) package, and transform them to a tidy format. janeaustenr provides them in a one-row-per-line format:

```{r}
library(janeaustenr)
library(dplyr)

original_books <- austen_books() |>
  group_by(book) |>
  mutate(line = row_number()) |>
  ungroup()

original_books
```

To work with this as a tidy dataset, we need to restructure it into a **one-token-per-row** format. The `unnest_tokens()` function converts a data frame with a text column into one with one token per row:

```{r}
library(tidytext)
tidy_books <- original_books |>
  unnest_tokens(word, text)

tidy_books
```

This function uses the [tokenizers](https://docs.ropensci.org/tokenizers/) package to separate each line into words. The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.
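As a minimal sketch of one of those other options, the snippet below tokenizes into bigrams by passing `token = "ngrams"` and `n = 2` to `unnest_tokens()`. The `toy` data frame is made up for illustration, and the sketch assumes a recent tidytext release in which text is not collapsed across rows unless `collapse` is set.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line input, not taken from janeaustenr
toy <- tibble(
  line = 1:2,
  text = c("It is a truth universally acknowledged", "that a single man")
)

# Each row of the result holds one pair of consecutive words from a line
bigrams <- toy |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigrams
```

Because each line is tokenized on its own here, no bigram spans the break between the two lines.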

Now that the data is in a one-word-per-row format, we can manipulate it with tidy tools like dplyr. We can remove stop words (available via the function `get_stopwords()`) with an `anti_join()`.

```{r}
tidy_books <- tidy_books |>
  anti_join(get_stopwords(), by = "word")
```
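The same anti-join pattern works with any data frame that has a `word` column, so a project-specific stop word list can be supplied by hand. The following is a small sketch under that assumption; `words` and `custom_stops` are hypothetical names and data, not part of the package.

```r
library(dplyr)
library(tidytext)

# Tokenize a made-up sentence into one word per row
words <- tibble(text = "the quick brown fox jumps over the lazy dog") |>
  unnest_tokens(word, text)

# A hand-made stop word list (hypothetical, for illustration only)
custom_stops <- tibble(word = c("the", "over"))

# Keep only the rows whose word does not appear in the stop word list
kept <- words |>
  anti_join(custom_stops, by = "word")

kept
```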

We can also use `count()` to find the most common words in all the books as a whole.

```{r}
tidy_books |>
  count(word, sort = TRUE)
```

Sentiment analysis can be implemented as an inner join. Several sentiment lexicons are available via the `get_sentiments()` function. Let's examine how sentiment changes across each novel: we find a sentiment score for each word using the Bing lexicon, then count the number of positive and negative words in defined sections (here, blocks of 80 lines) of each novel.

```{r}
#| fig.width = 8,
#| fig.height = 10
library(tidyr)
get_sentiments("bing")

janeaustensentiment <- tidy_books |>
  inner_join(
    get_sentiments("bing"),
    by = "word",
    relationship = "many-to-many"
  ) |>
  count(book, index = line %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(sentiment = positive - negative)

janeaustensentiment
```
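To make the mechanics of that pipeline easier to follow, here is a toy version of the same join, count, and pivot steps. The words, line numbers, and `toy_lexicon` are invented for illustration and are not drawn from the Bing lexicon; integer division by 80 assigns each line to a section index.

```r
library(dplyr)
library(tidyr)

# Hypothetical tokenized text: one word per row, with its line number
toy_words <- tibble(
  line = c(1, 2, 85, 90, 161),
  word = c("happy", "sad", "happy", "good", "bad")
)

# Hypothetical lexicon with the same shape as a word/sentiment table
toy_lexicon <- tibble(
  word = c("happy", "good", "sad", "bad"),
  sentiment = c("positive", "positive", "negative", "negative")
)

toy_scores <- toy_words |>
  inner_join(toy_lexicon, by = "word") |>
  # line %/% 80 maps lines 0-79 to section 0, 80-159 to section 1, and so on
  count(index = line %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(sentiment = positive - negative)

toy_scores
```

Each row of `toy_scores` is one 80-line section, with the net sentiment computed as positive minus negative word counts.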

Now we can plot these sentiment scores across the plot trajectory of each novel.

```{r}
#| fig.width = 7,
#| fig.height = 7,
#| fig.alt = "Sentiment scores across the trajectories of Jane Austen's six published novels",
#| warning = FALSE
library(ggplot2)

ggplot(janeaustensentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(book), ncol = 2, scales = "free_x")
```

For more examples of text mining using tidy data frames, see the tidytext vignette.

### Tidying document term matrices

Some existing text mining datasets are in the form of a DocumentTermMatrix class (from the tm package). For example, consider the corpus of 2246 Associated Press articles included in the topicmodels package.

```{r}
library(tm)
data("AssociatedPress", package = "topicmodels")
AssociatedPress
```

If we want to analyze this with tidy tools, we first need to transform it into a one-row-per-term data frame with the `tidy()` function. (For more on the `tidy()` verb, see the [broom package](https://broom.tidymodels.org/).)

```{r}
tidy(AssociatedPress)
```

We could find the most negative documents:

```{r}
ap_sentiments <- tidy(AssociatedPress) |>
  inner_join(get_sentiments("bing"), by = c(term = "word")) |>
  count(document, sentiment, wt = count) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(sentiment = positive - negative) |>
  arrange(sentiment)
```

Or we can join the Austen and AP datasets and compare the frequencies of each word:

```{r}
#| fig.height = 8,
#| fig.width = 8,
#| fig.alt = 'Scatterplot for word frequencies in Jane Austen vs. AP news articles. Some words like "cried" are only common in Jane Austen, some words like "national" are only common in AP articles, and some words like "time" are common in both.'
comparison <- tidy(AssociatedPress) |>
  count(word = term) |>
  rename(AP = n) |>
  inner_join(count(tidy_books, word), by = "word") |>
  rename(Austen = n) |>
  mutate(
    AP = AP / sum(AP),
    Austen = Austen / sum(Austen)
  )


comparison

library(scales)
ggplot(comparison, aes(AP, Austen)) +
  geom_point(alpha = 0.5) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1, hjust = 1) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")
```

For more examples of working with objects from other text mining packages using tidy data principles, see the [vignette](https://juliasilge.github.io/tidytext/articles/tidying_casting.html) on converting to and from document term matrices.
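As a rough sketch of the opposite direction (casting a tidy table into a matrix), the snippet below uses `cast_sparse()` from tidytext to build a sparse matrix from a one-row-per-term table. The `word_counts` data is made up for illustration, and only the Matrix-backed cast is shown here, not the tm or quanteda formats.

```r
library(dplyr)
library(tidytext)

# Hypothetical tidy counts: one row per document-term pair
word_counts <- tibble(
  document = c("a", "a", "b"),
  term = c("tidy", "text", "tidy"),
  n = c(2, 1, 3)
)

# Rows become documents, columns become terms, cell values come from n
m <- word_counts |>
  cast_sparse(document, term, n)

dim(m)
```

Document-term pairs that are absent from the tidy table become zero entries in the resulting matrix.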

### Community Guidelines

This project is released with a [Contributor Code of Conduct](https://github.com/juliasilge/tidytext/blob/main/CONDUCT.md). By participating in this project you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support [here](https://github.com/juliasilge/tidytext/issues).

Owner

Name: Julia Silge
Login: juliasilge
Kind: user
Location: Salt Lake City, UT
Company: @posit-pbc

Website: https://juliasilge.com/
Repositories: 22
Profile: https://github.com/juliasilge

Data science and MLOps with #rstats, text mining, 💖

JOSS Publication

tidytext: Text Mining and Analysis Using Tidy Data Principles in R

Published

July 11, 2016

DOI

10.21105/joss.00037

Volume 1, Issue 3, Page 37

Authors

Julia Silge

Datassist

David Robinson

Stack Overflow

Editor

Arfon Smith

GitHub Events

Total

Issues event: 5
Watch event: 24
Delete event: 1
Issue comment event: 20
Push event: 6
Pull request event: 3
Fork event: 4
Create event: 2

Last Year

Issues event: 5
Watch event: 24
Delete event: 1
Issue comment event: 20
Push event: 6
Pull request event: 3
Fork event: 4
Create event: 2

Committers

Last synced: 7 months ago

All Time

Total Commits: 716
Total Committers: 33
Avg Commits per committer: 21.697
Development Distribution Score (DDS): 0.268

Past Year

Commits: 3
Committers: 1
Avg Commits per committer: 3.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Julia Silge	j**e@g**m	524
Dave Robinson	d**n@s**m	54
dgrtwo	d**o@p**u	38
Colin	c**n@t**r	23
Julia Silge	j**e@s**m	17
Oliver Keyes	i**s@g**m	7
Kenneth Benoit	k**t@l**k	7
Timothy Mastny	t**y@g**m	6
Emil Hvitfeldt	e**t@g**m	6
Jeff Erickson	j**f@e**o	3
Jim Hester	j**r@g**m	3
kanishkamisra	m**e@g**m	3
David Robinson	a**d@g**m	2
Lionel Henry	l**y@g**m	2
Luis de Sousa	l**d@s**a	2
aedobbyn	a**1@g**m	2
Dan Lependorf	d**f@t**m	1
Dave Childers	c**e@g**m	1
Erwan Le Pennec	l**c@g**m	1
James Keirstead	j**d@g**m	1
Jenny Bryan	j**n@g**m	1
Jonathan Völkle	3****e	1
Lincoln Mullen	l**n@l**m	1
Michael Chirico	m**4@g**m	1
Ramnath Vaidyanathan	r**a@g**m	1
Seth Berry	s**y@n**u	1
Vincent Arel-Bundock	v**k@u**a	1
Y. Yu	5****e	1
jonmcalder	j**r@g**m	1
olivroy	5****y	1
and 3 more...

Committer Domains (Top 20 + Academic)

stackoverflow.com: 2
umontreal.ca: 1
nd.edu: 1
lincolnmullen.com: 1
theathletic.com: 1
syeop.co.za: 1
erick.so: 1
lse.ac.uk: 1
thinkr.fr: 1
princeton.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 93
Total pull requests: 21
Average time to close issues: about 1 month
Average time to close pull requests: 4 days
Total issue authors: 61
Total pull request authors: 10
Average comments per issue: 3.78
Average comments per pull request: 2.19
Merged pull requests: 18
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 3
Pull requests: 3
Average time to close issues: N/A
Average time to close pull requests: 19 minutes
Issue authors: 3
Pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

juliasilge (16)
dgrtwo (8)
TheOne000 (5)
nabsiddiqui (3)
petereckley (3)
MichaelChirico (2)
Ironholds (2)
kjmobile (1)
yli74 (1)
jirkalewandowski (1)
1danjordan (1)
kbenoit (1)
ariespirgel (1)
dan-reznik (1)
twedl (1)

Pull Request Authors

juliasilge (10)
kbenoit (3)
olivroy (2)
seankross (1)
AmeliaMN (1)
jimhester (1)
jonathanvoelkle (1)
arfon (1)
davechilders (1)
MichaelChirico (1)

Top Labels

Issue Labels

feature (2)

Pull Request Labels

Packages

Total packages: 3
Total downloads:
- cran 42,128 last-month
Total docker downloads: 142,547

Total dependent packages: 71
(may contain duplicates)
Total dependent repositories: 195
(may contain duplicates)
Total versions: 51
Total maintainers: 1

cran.r-project.org: tidytext

Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools

Homepage: https://juliasilge.github.io/tidytext/
Documentation: http://cran.r-project.org/web/packages/tidytext/tidytext.pdf
License: MIT + file LICENSE
Latest release: 0.4.3
published 7 months ago

Versions: 15
Dependent Packages: 66
Dependent Repositories: 194
Downloads: 42,128 Last month
Docker Downloads: 142,547

Rankings

Stargazers count: 0.2%

Forks count: 0.3%

Average: 1.3%

Dependent repos count: 1.3%

Dependent packages count: 1.4%

Downloads: 2.0%

Docker downloads count: 2.3%

Maintainers (1)

julia.silge@gmail.com

Last synced: 6 months ago

proxy.golang.org: github.com/juliasilge/tidytext

Documentation: https://pkg.go.dev/github.com/juliasilge/tidytext#section-documentation
License: other
Latest release: v0.4.3
published 7 months ago

Versions: 22
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 5.5%

Average: 5.6%

Dependent repos count: 5.8%

Last synced: 6 months ago

conda-forge.org: r-tidytext

Homepage: http://github.com/juliasilge/tidytext
License: MIT
Latest release: 0.3.4
published over 3 years ago

Versions: 14
Dependent Packages: 5
Dependent Repositories: 1

Rankings

Dependent packages count: 10.4%

Stargazers count: 11.9%

Forks count: 13.0%

Average: 14.8%

Dependent repos count: 24.1%

Last synced: 6 months ago

Dependencies

DESCRIPTION cran

R >= 2.10 depends
Matrix * imports
dplyr * imports
generics * imports
hunspell * imports
janeaustenr * imports
lifecycle * imports
methods * imports
purrr >= 0.1.1 imports
rlang >= 0.4.10 imports
stringr * imports
tibble * imports
tokenizers * imports
vctrs * imports
NLP * suggests
broom * suggests
covr * suggests
data.table * suggests
ggplot2 * suggests
knitr * suggests
mallet * suggests
quanteda * suggests
readr * suggests
reshape2 * suggests
rmarkdown * suggests
scales * suggests
stm * suggests
stopwords * suggests
testthat >= 2.1.0 suggests
textdata * suggests
tidyr * suggests
tm * suggests
topicmodels * suggests
vdiffr * suggests
wordcloud * suggests

.github/workflows/R-CMD-check-hard.yaml actions

actions/checkout v2 composite
r-lib/actions/check-r-package v2 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/R-CMD-check.yaml actions

actions/checkout v2 composite
r-lib/actions/check-r-package v2 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/lock.yaml actions

dessant/lock-threads v2 composite

.github/workflows/pkgdown.yaml actions

JamesIves/github-pages-deploy-action 4.1.4 composite
actions/checkout v2 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/pr-commands.yaml actions

actions/checkout v2 composite
r-lib/actions/pr-fetch v2 composite
r-lib/actions/pr-push v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/test-coverage.yaml actions

actions/checkout v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite
