readtext

an R package for reading text files

https://github.com/quanteda/readtext

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    2 of 12 committers (16.7%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (19.0%) to scientific vocabulary

Keywords

encoding quanteda r text

Keywords from Contributors

corpus text-analytics sentiment-analysis lsa text-analysis excel
Last synced: 6 months ago

Repository

Statistics
  • Stars: 120
  • Watchers: 13
  • Forks: 26
  • Open Issues: 33
  • Releases: 8
Topics
encoding quanteda r text
Created over 9 years ago · Last pushed 7 months ago
Metadata Files
Readme Changelog

README.Rmd

---
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "images/"
)
```
```{r echo=FALSE, results="hide", message=FALSE}
library("badger")
```

# readtext: Import and handling for plain and formatted text files


[![CRAN Version](https://www.r-pkg.org/badges/version/readtext)](https://CRAN.R-project.org/package=readtext)
`r badge_devel("quanteda/readtext", "royalblue")`
[![Downloads](https://cranlogs.r-pkg.org/badges/readtext)](https://CRAN.R-project.org/package=readtext)
[![Total Downloads](https://cranlogs.r-pkg.org/badges/grand-total/readtext?color=orange)](https://CRAN.R-project.org/package=readtext)
[![R-CMD-check](https://github.com/quanteda/readtext/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/quanteda/readtext/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/quanteda/readtext/branch/master/graph/badge.svg)](https://app.codecov.io/gh/quanteda/readtext?branch=master)



[1]: https://codecov.io/gh/quanteda/readtext/branch/master

An R package for reading text files in all their various formats, by Ken Benoit, Adam Obeng, Paul Nulty, Aki Matsuo, Kohei Watanabe, and Stefan Müller.

## Introduction

**readtext** is a one-function package that does exactly what it says on the tin: it reads files containing text, along with any associated document-level metadata, which we call "docvars", for document variables. Plain text files do not have docvars, but other formats such as .csv, .tab, .xml, and .json files usually do.

**readtext** accepts filemasks, so that you can specify a pattern to load multiple texts, and these texts can even be of multiple types.  **readtext** is smart enough to process them correctly, returning a data.frame with a primary field "text" containing a character vector of the texts, and additional columns of the data.frame as found in the document variables from the source files.
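As a minimal sketch of the behaviour described above, the following writes two small plain-text files to a temporary folder and loads both with a single glob pattern. The folder name `readtext_demo` and the file names are made up for this illustration; only `readtext()` itself comes from the package.

```r
library("readtext")

# Hypothetical demo folder and files, created only for this illustration
demo_dir <- file.path(tempdir(), "readtext_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("This is the first document.", file.path(demo_dir, "doc1.txt"))
writeLines("This is the second document.", file.path(demo_dir, "doc2.txt"))

# One filemask ("glob") pattern matches both files
rt <- readtext(file.path(demo_dir, "*.txt"))

# The result is a data.frame with one row per file and the texts in "text"
names(rt)
nrow(rt)
is.character(rt$text)
```

Plain text files carry no docvars, so the only columns to expect here are the document identifier and the text itself.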

As encoding can also be a challenging issue for those reading in texts, we include functions for diagnosing encodings on a file-by-file basis, and allow you to specify vectorized input encodings to read in file types with individually set (and different) encodings.  (All encoding functions are handled by the **stringi** package.)
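The sketch below illustrates one way to combine diagnosis with vectorized input encodings. It writes the same string to two temporary files (hypothetical names) in UTF-8 and Latin-1, asks `stringi::stri_enc_detect()` for its guesses about the raw bytes of one file, and then passes one encoding per file to `readtext()`. It assumes that the encodings are matched to the files in the order in which the file paths are supplied. Detection on a string this short is only a guess, so treat the diagnostic output as a hint rather than a verdict.

```r
library("readtext")

# The same short string, written to disk in two different encodings
# (file names are hypothetical, chosen for this illustration)
txt <- "caf\u00e9 na\u00efve r\u00e9sum\u00e9"
f_utf8   <- file.path(tempdir(), "demo_utf8.txt")
f_latin1 <- file.path(tempdir(), "demo_latin1.txt")
writeBin(iconv(txt, from = "UTF-8", to = "UTF-8",  toRaw = TRUE)[[1]], f_utf8)
writeBin(iconv(txt, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]], f_latin1)

# Diagnose: ask stringi for its best guesses about the raw bytes of one file
raw_bytes <- readBin(f_latin1, what = "raw", n = file.size(f_latin1))
guesses <- stringi::stri_enc_detect(raw_bytes)[[1]]
head(guesses)

# Read both files, supplying one input encoding per file
rt <- readtext(c(f_utf8, f_latin1), encoding = c("UTF-8", "ISO-8859-1"))
rt$text
```

If the encodings are declared correctly, both rows of `rt$text` should display the same accented string.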

## How to Install


1.  From CRAN

    ```{r, eval = FALSE}
    install.packages("readtext")
    ```

2.  From GitHub, if you want the latest development version.

    ```{r, eval = FALSE}
    # the remotes package is required to install readtext from GitHub
    remotes::install_github("quanteda/readtext") 
    ```

Linux note: Some system dependencies may not be installed by default on Linux systems. On Debian/Ubuntu, try installing them by running this command at the command line:

```{bash, eval = FALSE}
sudo apt-get install libpoppler-cpp-dev   # for pdftools
```

## Demonstration: Reading one or more text files

**readtext** supports plain text files (.txt), data in some form of JavaScript Object Notation (.json), comma- or tab-separated values (.csv, .tab, .tsv), XML documents (.xml), as well as PDF, Microsoft Word formatted files and other document formats (.pdf, .doc, .docx, .odt, .rtf). **readtext** also handles multiple files and file types using, for instance, a "glob" expression, files from a URL, or an archive file (.zip, .tar, .tar.gz, .tar.bz).

The file formats are determined automatically from the filename extensions. If a file has no extension, or its extension is unknown, **readtext** will assume that it is plain text. The following command, for instance, will load all of the files from the subdirectory `txt/UDHR/`:

```{r}
library("readtext")
# get the data directory from readtext
DATA_DIR <- system.file("extdata/", package = "readtext")

# read in all files from a folder
readtext(paste0(DATA_DIR, "/txt/UDHR/*"))
```

For files that contain multiple documents, such as comma-separated value (.csv) files, you will need to specify the name of the column containing the texts, using the `text_field` argument:

```{r}
# read in comma-separated values and specify text field
readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
```

For a more complete demonstration, see the package [vignette](https://readtext.quanteda.io/articles/readtext_vignette.html).

## Inter-operability with other packages

### With **quanteda**

**readtext** was originally developed in early versions of the [**quanteda**](https://github.com/quanteda/quanteda) package for the quantitative analysis of textual data. Because **quanteda**'s corpus constructor recognizes the data.frame format returned by `readtext()`, it can construct a corpus directly from a readtext object, preserving all docvars and other metadata.

```{r}
library("quanteda")
# read in comma-separated values with readtext
rt_csv <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
# create quanteda corpus
corpus_csv <- corpus(rt_csv)
summary(corpus_csv, 5)
```

### Text Interchange Format compatibility

Because **readtext** returns a data.frame that is formatted as per the corpus structure of the [Text Interchange Format](https://github.com/ropenscilabs/tif), it can easily be used by other packages that accept a corpus in data.frame format.

If you only want a named `character` object, **readtext** also defines an `as.character()` method that takes its data.frame and returns just the named character vector of texts, conforming to the TIF definition of the character version of a corpus.
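A minimal sketch of that conversion, using the UDHR example files that ship with the package. The variable names `data_dir`, `rt`, and `txts` are illustrative, and the names of the resulting vector are expected to be the document identifiers.

```r
library("readtext")

# Locate the example files shipped with readtext
data_dir <- system.file("extdata", package = "readtext")

# Read the UDHR plain-text files into a readtext data.frame
rt <- readtext(file.path(data_dir, "txt", "UDHR", "*"))

# Reduce it to a named character vector of texts
txts <- as.character(rt)
head(names(txts))
length(txts) == nrow(rt)
```

A vector in this form can be passed to any function that expects plain character input, at the cost of dropping the docvars.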

Owner

  • Name: Quanteda Initiative
  • Login: quanteda
  • Kind: organization
  • Location: London, UK

GitHub Events

Total
  • Issues event: 1
  • Watch event: 3
  • Issue comment event: 3
  • Push event: 5
  • Pull request event: 5
  • Fork event: 1
Last Year
  • Issues event: 1
  • Watch event: 3
  • Issue comment event: 3
  • Push event: 5
  • Pull request event: 5
  • Fork event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 427
  • Total Committers: 12
  • Avg Commits per committer: 35.583
  • Development Distribution Score (DDS): 0.532
Past Year
  • Commits: 3
  • Committers: 2
  • Avg Commits per committer: 1.5
  • Development Distribution Score (DDS): 0.333
Top Committers
| Name | Email | Commits |
|---|---|---|
| Kenneth Benoit | k****t@l****k | 200 |
| Adam Obeng | g****b@b****m | 110 |
| amatsuo | m****a@g****m | 46 |
| Kohei Watanabe | w****i@g****m | 28 |
| Kenneth Benoit | k****t@K****l | 26 |
| Stefan Müller | m****s@t****e | 10 |
| chainsawriot | c****y@g****m | 2 |
| pnulty | p****y@g****m | 1 |
| olivroy | 5****y | 1 |
| Tom Nicholls | g****b@t****k | 1 |
| Jirka Lewandowski | j****i@p****e | 1 |
| JBGruber | j****1@r****k | 1 |

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 72
  • Total pull requests: 37
  • Average time to close issues: 6 months
  • Average time to close pull requests: about 1 month
  • Total issue authors: 35
  • Total pull request authors: 12
  • Average comments per issue: 1.83
  • Average comments per pull request: 1.19
  • Merged pull requests: 31
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: 26 days
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 3.0
  • Average comments per pull request: 0.33
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • kbenoit (23)
  • koheiw (6)
  • adamobeng (6)
  • stefan-mueller (4)
  • krystian8207 (2)
  • ogorodriguez (2)
  • cownr10r (1)
  • sagarkumar16 (1)
  • dkholo (1)
  • jslapin (1)
  • Astelix (1)
  • ElCarlitos (1)
  • gcpoole (1)
  • leeper (1)
  • SinaOzdemir (1)
Pull Request Authors
  • kbenoit (8)
  • amatsuo (6)
  • chainsawriot (6)
  • adamobeng (6)
  • koheiw (3)
  • olivroy (3)
  • gcpoole (2)
  • stefan-mueller (2)
  • JBGruber (1)
  • pnulty (1)
  • pmyteh (1)
  • jirkalewandowski (1)
Top Labels
Issue Labels
  • enhancement (8)
  • bug (5)
  • Documentation (4)
  • Difficulty: Hard (3)
  • Difficulty: Medium (3)
  • pre-CRAN (3)
  • question (1)
  • Difficulty: Easy (1)
  • performance (1)
Pull Request Labels
  • bug (1)

Packages

  • Total packages: 2
  • Total downloads:
    • cran 4,138 last-month
  • Total docker downloads: 89,332
  • Total dependent packages: 7
    (may contain duplicates)
  • Total dependent repositories: 25
    (may contain duplicates)
  • Total versions: 14
  • Total maintainers: 1
cran.r-project.org: readtext

Import and Handling for Plain and Formatted Text Files

  • Versions: 12
  • Dependent Packages: 7
  • Dependent Repositories: 24
  • Downloads: 4,138 Last month
  • Docker Downloads: 89,332
Rankings
Forks count: 2.8%
Stargazers count: 3.6%
Dependent repos count: 5.6%
Downloads: 5.9%
Dependent packages count: 7.3%
Average: 7.8%
Docker downloads count: 21.8%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: r-readtext
  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 1
Rankings
Dependent repos count: 24.3%
Stargazers count: 32.8%
Forks count: 33.1%
Average: 35.5%
Dependent packages count: 51.6%
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.6 depends
  • antiword * imports
  • data.table * imports
  • digest * imports
  • httr * imports
  • jsonlite >= 0.9.10 imports
  • pdftools * imports
  • readODS >= 1.7.0 imports
  • readxl * imports
  • streamR * imports
  • stringi * imports
  • striprtf * imports
  • tibble * imports
  • utils * imports
  • xml2 * imports
  • knitr * suggests
  • pkgload * suggests
  • quanteda >= 3.0 suggests
  • rmarkdown * suggests
  • testthat * suggests
.github/workflows/R-CMD-check.yaml actions
  • actions/cache v1 composite
  • actions/checkout v2 composite
  • actions/upload-artifact main composite
  • r-lib/actions/setup-pandoc master composite
  • r-lib/actions/setup-r master composite
.github/workflows/test-coverage.yaml actions
  • actions/cache v1 composite
  • actions/checkout v2 composite
  • r-lib/actions/setup-pandoc master composite
  • r-lib/actions/setup-r master composite