textdata

Download, parse, store, and load text datasets instead of storing them in packages

https://github.com/emilhvitfeldt/textdata

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.5%) to scientific vocabulary

Keywords

r rstats text-datasets

Keywords from Contributors

setup book bookdown package-creation documentation-tool tidy-data tidyverse
Last synced: 10 months ago

Repository


Basic Info
Statistics
  • Stars: 78
  • Watchers: 8
  • Forks: 11
  • Open Issues: 12
  • Releases: 8
Topics
r rstats text-datasets
Created over 7 years ago · Last pushed about 2 years ago
Metadata Files
Readme Changelog License Code of conduct

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
```

# textdata 


[![R-CMD-check](https://github.com/EmilHvitfeldt/textdata/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/EmilHvitfeldt/textdata/actions/workflows/R-CMD-check.yaml)
[![CRAN status](https://www.r-pkg.org/badges/version/textdata)](https://CRAN.R-project.org/package=textdata)
[![Downloads](http://cranlogs.r-pkg.org/badges/textdata)](https://cran.r-project.org/package=textdata)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3244433.svg)](https://doi.org/10.5281/zenodo.3244433)
[![Codecov test coverage](https://codecov.io/gh/EmilHvitfeldt/textdata/branch/main/graph/badge.svg)](https://app.codecov.io/gh/EmilHvitfeldt/textdata?branch=main)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html)


The goal of textdata is to provide easy access to text-related datasets without bundling them inside a package. Some text datasets are too large to store within an R package, and others are licensed in a way that prevents them from being included in an OSS-licensed package. Instead, this package provides a framework to download, parse, and store the datasets on disk and load them when needed.
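
The download-once, load-from-disk idea can be sketched in base R. This is an illustration only, not how textdata is implemented: `load_cached()`, `fake_fetch()`, and the choice of `.rds` files in a temporary directory are all hypothetical and exist only for this example.

```r
# Hypothetical helper, not part of textdata: fetch a dataset once, then
# reuse the copy stored on disk.
load_cached <- function(name, fetch, cache_dir) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(cache_dir, paste0(name, ".rds"))

  if (!file.exists(path)) {
    # First use: "download" and parse the data, then store the parsed result.
    data <- fetch()
    saveRDS(data, path)
  }

  # Every call, including the first, loads the stored copy from disk.
  readRDS(path)
}

# Stand-in for a real download; it counts how often it is called.
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  data.frame(word = c("good", "bad"), value = c(3, -3))
}

cache_dir <- tempfile("textdata-sketch-")
first <- load_cached("toy_lexicon", fake_fetch, cache_dir)
second <- load_cached("toy_lexicon", fake_fetch, cache_dir)

calls                     # the fetch function ran only once
identical(first, second)  # both calls return the same stored data
```

Because the data lives outside the package, neither its size nor its license is inherited by the package that provides the loader.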

## Installation

You can install the released version of textdata from [CRAN](https://CRAN.R-project.org) with:

``` r
install.packages("textdata")
```

And the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("remotes")
remotes::install_github("EmilHvitfeldt/textdata")
```

## Example

The first time you use one of the functions for accessing an included text dataset, such as `lexicon_afinn()` or `dataset_sentence_polarity()`, the function will prompt you to agree that you understand the dataset's license or terms of use and then download the dataset to your computer.

![](man/figures/textdata_demo.gif)

After the first use, each time you use a function like `lexicon_afinn()`, the function will load the dataset from disk.
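
A minimal sketch of that workflow, assuming textdata is installed and the session is interactive so the license prompt can be answered. The guard around the calls is an addition for this example and is not something the package requires.

```r
# Guarded so the snippet does nothing when textdata is missing or when no
# one is available to answer the license prompt.
if (requireNamespace("textdata", quietly = TRUE) && interactive()) {
  # First call: shows the license prompt, then downloads and stores the data.
  afinn <- textdata::lexicon_afinn()

  # Later calls: load the stored copy from disk.
  afinn <- textdata::lexicon_afinn()
  head(afinn)
}
```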

## Included text datasets

As of today, the datasets included in textdata are:

| Dataset                                                         | Function                      |
| --------------------------------------------------------------- | ----------------------------- |
| v1.0 sentence polarity dataset                                  | `dataset_sentence_polarity()` |
| AFINN-111 sentiment lexicon                                     | `lexicon_afinn()`             |
| Hu and Liu's opinion lexicon                                    | `lexicon_bing()`              |
| NRC word-emotion association lexicon                            | `lexicon_nrc()`               |
| NRC Emotion Intensity Lexicon                                   | `lexicon_nrc_eil()`           |
| The NRC Valence, Arousal, and Dominance Lexicon                 | `lexicon_nrc_vad()`           |
| Loughran and McDonald's opinion lexicon for financial documents | `lexicon_loughran()`          |
| AG's News                                                       | `dataset_ag_news()`           |
| DBpedia ontology                                                | `dataset_dbpedia()`           |
| Trec-6 and Trec-50                                              | `dataset_trec()`              |
| IMDb Large Movie Review Dataset                                 | `dataset_imdb()`              |
| Stanford NLP GloVe pre-trained word vectors                     | `embedding_glove6b()`         |
|                                                                 | `embedding_glove27b()`        |
|                                                                 | `embedding_glove42b()`        |
|                                                                 | `embedding_glove840b()`       |

Check out each function's documentation for detailed information (including citations) for the relevant dataset.

## Community Guidelines

Note that this project is released with a
[Contributor Code of Conduct](https://github.com/EmilHvitfeldt/textdata/blob/main/CODE_OF_CONDUCT.md).
By contributing to this project, you agree to abide by its terms. 
Feedback, bug reports (and fixes!), and feature requests are welcome; file 
issues or seek support [here](https://github.com/EmilHvitfeldt/textdata/issues).
For details on how to add a new dataset to this package, check out the vignette!

Owner

  • Name: Emil Hvitfeldt
  • Login: EmilHvitfeldt
  • Kind: user
  • Location: California
  • Company: @posit-dev

All things @tidymodels

GitHub Events

Total
  • Watch event: 3
Last Year
  • Watch event: 3

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 133
  • Total Committers: 6
  • Avg Commits per committer: 22.167
  • Development Distribution Score (DDS): 0.068
Past Year
  • Commits: 8
  • Committers: 1
  • Avg Commits per committer: 8.0
  • Development Distribution Score (DDS): 0.0
Top Committers
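| Name             | Email         | Commits |
| ---------------- | ------------- | ------- |
| EmilHvitfeldt    | e****t@g****m | 124     |
| Julia Silge      | j****e@g****m | 5       |
| olivroy          | 5****y        | 1       |
| Jon Harmon       | j****k@g****m | 1       |
| James Clawson    | j****n        | 1       |
| Ellis Valentiner | e****r        | 1       |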

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 51
  • Total pull requests: 8
  • Average time to close issues: 19 days
  • Average time to close pull requests: about 12 hours
  • Total issue authors: 13
  • Total pull request authors: 6
  • Average comments per issue: 0.75
  • Average comments per pull request: 2.25
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • EmilHvitfeldt (38)
  • sbw78 (1)
  • sarahmnalifa (1)
  • nujcharee (1)
  • randomgambit (1)
  • sjentsch (1)
  • grantdick (1)
  • richierocks (1)
  • jmclawson (1)
  • KyleOfCanada (1)
  • jonthegeek (1)
  • Ocete (1)
  • ebridge2 (1)
Pull Request Authors
  • juliasilge (3)
  • jmclawson (2)
  • olivroy (2)
  • jonthegeek (1)
  • EmilHvitfeldt (1)
  • ellisvalentiner (1)
Top Labels
Issue Labels
enhancement (1) reprex (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 5,783 last-month
  • Total docker downloads: 1,367
  • Total dependent packages: 3
  • Total dependent repositories: 24
  • Total versions: 9
  • Total maintainers: 1
cran.r-project.org: textdata

Download and Load Various Text Datasets

  • Versions: 9
  • Dependent Packages: 3
  • Dependent Repositories: 24
  • Downloads: 5,783 Last month
  • Docker Downloads: 1,367
Rankings
Downloads: 4.9%
Stargazers count: 5.0%
Dependent repos count: 5.6%
Forks count: 6.8%
Average: 9.7%
Dependent packages count: 10.9%
Docker downloads count: 25.0%
Maintainers (1)
Last synced: 10 months ago

Dependencies

DESCRIPTION cran
  • fs * imports
  • rappdirs * imports
  • readr * imports
  • tibble * imports
  • covr * suggests
  • knitr * suggests
  • rmarkdown * suggests
  • testthat >= 2.1.0 suggests
.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v2 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml actions
  • JamesIves/github-pages-deploy-action 4.1.4 composite
  • actions/checkout v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pr-commands.yaml actions
  • actions/checkout v2 composite
  • r-lib/actions/pr-fetch master composite
  • r-lib/actions/pr-push master composite
  • r-lib/actions/setup-r master composite