textdata
Download, parse, store, and load text datasets instead of storing them in packages
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: found 3 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (18.5%) to scientific vocabulary
Keywords
r
rstats
text-datasets
Keywords from Contributors
setup
book
bookdown
package-creation
documentation-tool
tidy-data
tidyverse
Last synced: 10 months ago
Repository
Download, parse, store, and load text datasets instead of storing them in packages
Basic Info
- Host: GitHub
- Owner: EmilHvitfeldt
- License: other
- Language: R
- Default Branch: main
- Homepage: https://emilhvitfeldt.github.io/textdata/
- Size: 14.4 MB
Statistics
- Stars: 78
- Watchers: 8
- Forks: 11
- Open Issues: 12
- Releases: 8
Topics
r
rstats
text-datasets
Created over 7 years ago · Last pushed about 2 years ago
Metadata Files
Readme
Changelog
License
Code of conduct
README.Rmd
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```
# textdata
[R-CMD-check](https://github.com/EmilHvitfeldt/textdata/actions/workflows/R-CMD-check.yaml)
[CRAN status](https://CRAN.R-project.org/package=textdata)
[CRAN downloads](https://cran.r-project.org/package=textdata)
[DOI](https://doi.org/10.5281/zenodo.3244433)
[Codecov test coverage](https://app.codecov.io/gh/EmilHvitfeldt/textdata?branch=main)
[Lifecycle](https://lifecycle.r-lib.org/articles/stages.html)
The goal of textdata is to provide easy access to text-related datasets without bundling them inside a package. Some text datasets are too large to store within an R package, or are licensed in a way that prevents them from being included in an OSS-licensed package. Instead, this package provides a framework to download, parse, and store the datasets on disk and load them when needed.
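The download-once, load-from-disk pattern described above can be sketched in a few lines of base R. This is an illustrative sketch only, not textdata's actual implementation: `load_cached()`, its arguments, the `.rds` cache layout, and the toy data are all hypothetical names and values chosen for the example.

```r
# Illustrative sketch of a download-parse-store-load cache.
# `load_cached()` and its arguments are hypothetical, not part of textdata.
load_cached <- function(name, dir, fetch) {
  path <- file.path(dir, paste0(name, ".rds"))

  # Later uses: the parsed dataset is already on disk, so just read it.
  if (file.exists(path)) {
    return(readRDS(path))
  }

  # First use: obtain and parse the data, then store it for next time.
  data <- fetch()
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  saveRDS(data, path)
  data
}

# Usage with a stand-in "download" that counts how often it runs.
cache_dir <- file.path(tempdir(), "textdata-sketch")
unlink(cache_dir, recursive = TRUE)  # start from an empty cache
n_fetches <- 0
fake_fetch <- function() {
  n_fetches <<- n_fetches + 1
  data.frame(word = c("good", "bad"), value = c(3, -3))  # made-up toy rows
}

first <- load_cached("toy_lexicon", cache_dir, fake_fetch)   # runs fake_fetch()
second <- load_cached("toy_lexicon", cache_dir, fake_fetch)  # reads from disk
```

Keeping the stored file outside the package is what lets the package itself stay small and free of the datasets' license terms; only the code that knows how to fetch and parse each dataset ships with it.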
## Installation
You can install the released version of textdata from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("textdata")
```
And the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("remotes")
remotes::install_github("EmilHvitfeldt/textdata")
```
## Example
The first time you use one of the functions for accessing an included text dataset, such as `lexicon_afinn()` or `dataset_sentence_polarity()`, the function will prompt you to agree that you understand the dataset's license or terms of use and then download the dataset to your computer.

After the first use, each time you use a function like `lexicon_afinn()`, the function will load the dataset from disk.
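A minimal usage sketch of that workflow follows. It assumes textdata is installed and that the code runs in an interactive session, since the license prompt needs a response. The `can_run` guard is a choice made for this example, not something the package requires.

```r
# Guarded so the sketch is skipped when textdata is missing or the session
# cannot answer the license prompt.
can_run <- requireNamespace("textdata", quietly = TRUE) && interactive()

if (can_run) {
  # First call: asks you to accept the terms, then downloads and stores the data.
  afinn <- textdata::lexicon_afinn()

  # Later calls: load the stored copy from disk without downloading again.
  afinn <- textdata::lexicon_afinn()
  print(head(afinn))
}
```

In a script or other non-interactive context the block is skipped, so it is safest to run the function once by hand before relying on it elsewhere.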
## Included text datasets
The datasets currently accessible through textdata are:
| Dataset | Function |
| --------------------------------------------------------------- | ----------------------------- |
| v1.0 sentence polarity dataset | `dataset_sentence_polarity()` |
| AFINN-111 sentiment lexicon | `lexicon_afinn()` |
| Hu and Liu's opinion lexicon | `lexicon_bing()` |
| NRC word-emotion association lexicon | `lexicon_nrc()` |
| NRC Emotion Intensity Lexicon | `lexicon_nrc_eil()` |
| The NRC Valence, Arousal, and Dominance Lexicon | `lexicon_nrc_vad()` |
| Loughran and McDonald's opinion lexicon for financial documents | `lexicon_loughran()` |
| AG's News | `dataset_ag_news()` |
| DBpedia ontology | `dataset_dbpedia()` |
| Trec-6 and Trec-50 | `dataset_trec()` |
| IMDb Large Movie Review Dataset | `dataset_imdb()` |
| Stanford NLP GloVe pre-trained word vectors | `embedding_glove6b()` |
| | `embedding_glove27b()` |
| | `embedding_glove42b()` |
| | `embedding_glove840b()` |
Check out each function's documentation for detailed information (including citations) for the relevant dataset.
## Community Guidelines
Note that this project is released with a
[Contributor Code of Conduct](https://github.com/EmilHvitfeldt/textdata/blob/main/CODE_OF_CONDUCT.md).
By contributing to this project, you agree to abide by its terms.
Feedback, bug reports (and fixes!), and feature requests are welcome; file
issues or seek support [here](https://github.com/EmilHvitfeldt/textdata/issues).
For details on how to add a new dataset to this package, check out the vignette!
Owner
- Name: Emil Hvitfeldt
- Login: EmilHvitfeldt
- Kind: user
- Location: California
- Company: @posit-dev
- Website: https://www.emilhvitfeldt.com/
- Repositories: 27
- Profile: https://github.com/EmilHvitfeldt
All things @tidymodels
GitHub Events
Total
- Watch event: 3
Last Year
- Watch event: 3
Committers
Last synced: about 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| EmilHvitfeldt | e****t@g****m | 124 |
| Julia Silge | j****e@g****m | 5 |
| olivroy | 5****y | 1 |
| Jon Harmon | j****k@g****m | 1 |
| James Clawson | j****n | 1 |
| Ellis Valentiner | e****r | 1 |
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 51
- Total pull requests: 8
- Average time to close issues: 19 days
- Average time to close pull requests: about 12 hours
- Total issue authors: 13
- Total pull request authors: 6
- Average comments per issue: 0.75
- Average comments per pull request: 2.25
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- EmilHvitfeldt (38)
- sbw78 (1)
- sarahmnalifa (1)
- nujcharee (1)
- randomgambit (1)
- sjentsch (1)
- grantdick (1)
- richierocks (1)
- jmclawson (1)
- KyleOfCanada (1)
- jonthegeek (1)
- Ocete (1)
- ebridge2 (1)
Pull Request Authors
- juliasilge (3)
- jmclawson (2)
- olivroy (2)
- jonthegeek (1)
- EmilHvitfeldt (1)
- ellisvalentiner (1)
Top Labels
Issue Labels
- enhancement (1)
- reprex (1)
Pull Request Labels
Packages
- Total packages: 1
- Total downloads:
  - cran: 5,783 last month
- Total docker downloads: 1,367
- Total dependent packages: 3
- Total dependent repositories: 24
- Total versions: 9
- Total maintainers: 1
cran.r-project.org: textdata
Download and Load Various Text Datasets
- Homepage: https://emilhvitfeldt.github.io/textdata/
- Documentation: http://cran.r-project.org/web/packages/textdata/textdata.pdf
- License: MIT + file LICENSE
- Latest release: 0.4.5 (published about 2 years ago)
Rankings
Downloads: 4.9%
Stargazers count: 5.0%
Dependent repos count: 5.6%
Forks count: 6.8%
Average: 9.7%
Dependent packages count: 10.9%
Docker downloads count: 25.0%
Maintainers (1)
Last synced: 10 months ago
Dependencies
DESCRIPTION
cran
- fs * imports
- rappdirs * imports
- readr * imports
- tibble * imports
- covr * suggests
- knitr * suggests
- rmarkdown * suggests
- testthat >= 2.1.0 suggests
.github/workflows/R-CMD-check.yaml
actions
- actions/checkout v2 composite
- r-lib/actions/check-r-package v2 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml
actions
- JamesIves/github-pages-deploy-action 4.1.4 composite
- actions/checkout v2 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pr-commands.yaml
actions
- actions/checkout v2 composite
- r-lib/actions/pr-fetch master composite
- r-lib/actions/pr-push master composite
- r-lib/actions/setup-r master composite