readtext

an R package for reading text files

https://github.com/quanteda/readtext

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 2 of 12 committers (16.7%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (19.0%) to scientific vocabulary

Keywords

encoding quanteda r text

Keywords from Contributors

corpus text-analytics sentiment-analysis lsa text-analysis excel

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: quanteda
Language: R
Default Branch: master
Homepage: https://readtext.quanteda.io
Size: 16.5 MB

Statistics

Stars: 120
Watchers: 13
Forks: 26
Open Issues: 33
Releases: 8

Topics

encoding quanteda r text

Created over 9 years ago · Last pushed 7 months ago

Metadata Files

Readme Changelog

README.Rmd

---
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "images/"
)
```
```{r echo=FALSE, results="hide", message=FALSE}
library("badger")
```

# readtext: Import and handling for plain and formatted text files


[![CRAN Version](https://www.r-pkg.org/badges/version/readtext)](https://CRAN.R-project.org/package=readtext)
`r badge_devel("quanteda/readtext", "royalblue")`
[![Downloads](https://cranlogs.r-pkg.org/badges/readtext)](https://CRAN.R-project.org/package=readtext)
[![Total Downloads](https://cranlogs.r-pkg.org/badges/grand-total/readtext?color=orange)](https://CRAN.R-project.org/package=readtext)
[![R-CMD-check](https://github.com/quanteda/readtext/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/quanteda/readtext/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/quanteda/readtext/branch/master/graph/badge.svg)](https://app.codecov.io/gh/quanteda/readtext?branch=master)



[1]: https://codecov.io/gh/quanteda/readtext/branch/master

An R package for reading text files in all their various formats, by Ken Benoit, Adam Obeng, Paul Nulty, Aki Matsuo, Kohei Watanabe, and Stefan Müller.

## Introduction

**readtext** is a one-function package that does exactly what it says on the tin: It reads files containing text, along with any associated document-level metadata, which we call "docvars", for document variables.  Plain text files do not have docvars, but other forms such as .csv, .tab, .xml, and .json files usually do.  

**readtext** accepts filemasks, so that you can specify a pattern to load multiple texts, and these texts can even be of multiple types.  **readtext** is smart enough to process them correctly, returning a data.frame with a primary field "text" containing a character vector of the texts, and additional columns of the data.frame as found in the document variables from the source files.

As encoding can also be a challenging issue for those reading in texts, we include functions for diagnosing encodings on a file-by-file basis, and allow you to specify vectorized input encodings to read in file types with individually set (and different) encodings.  (All encoding functions are handled by the **stringi** package.)
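
As a rough illustration of the encoding workflow described above, the following sketch writes the same sentence to two temporary files in different encodings, asks **stringi** for its best guesses, and then passes one encoding per file to `readtext()`. The file names and the sample sentence are made up for the demonstration, and `stringi::stri_enc_detect()` is called directly as a stand-in for the package's own diagnostic helpers. Detection on short texts is only a guess, so treat its output as a hint rather than a verdict.

```{r}
library("readtext")

# Hypothetical demo folder inside the session's temporary directory
tmp_dir <- file.path(tempdir(), "readtext_encoding_demo")
dir.create(tmp_dir, showWarnings = FALSE)

# "\u" escapes keep this chunk ASCII-only; R stores the string as UTF-8
txt <- "Caf\u00e9 na\u00efve se\u00f1or \u00fcber"

# Write the same sentence once as ISO-8859-1 and once as UTF-8
files <- file.path(tmp_dir, c("a_latin1.txt", "b_utf8.txt"))
writeBin(iconv(txt, from = "UTF-8", to = "ISO-8859-1", toRaw = TRUE)[[1]], files[1])
writeBin(charToRaw(txt), files[2])

# Diagnose each file from its raw bytes (a ranked list of guesses per file)
raw_bytes <- lapply(files, function(f) readBin(f, what = "raw", n = file.size(f)))
guesses <- stringi::stri_enc_detect(raw_bytes)
lapply(guesses, head, 3)

# Read both files, supplying one input encoding per file
rt <- readtext(files, encoding = c("ISO-8859-1", "UTF-8"))
rt$text
```

If the encodings were specified correctly, both rows should contain the same text, since the texts are converted to UTF-8 on input.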

## How to Install


1.  From CRAN

    ```{r, eval = FALSE}
    install.packages("readtext")
    ```

2.  From GitHub, if you want the latest development version.

    ```{r, eval = FALSE}
    # the remotes package is required to install readtext from GitHub
    remotes::install_github("quanteda/readtext") 
    ```

Linux note: Some system dependencies may not be installed by default on Linux systems. On Debian/Ubuntu, try installing them by running the following at the command line:

```{bash, eval = FALSE}
sudo apt-get install libpoppler-cpp-dev   # for pdftools
```

## Demonstration: Reading one or more text files

**readtext** supports plain text files (.txt), data in some form of JavaScript Object Notation (.json), comma- or tab-separated values (.csv, .tab, .tsv), XML documents (.xml), as well as PDF, Microsoft Word, and other document formats (.pdf, .doc, .docx, .odt, .rtf). **readtext** also handles multiple files and file types, specified for instance with a "glob" expression, as well as files from a URL or inside an archive file (.zip, .tar, .tar.gz, .tar.bz).

The file formats are determined automatically from the filename extensions.  If a file has no extension, or its extension is unknown, **readtext** will assume that it is plain text.  The following command, for instance, will load all of the files from the subdirectory `txt/UDHR/`:

```{r}
library("readtext")
# get the data directory from readtext
DATA_DIR <- system.file("extdata/", package = "readtext")

# read in all files from a folder
readtext(paste0(DATA_DIR, "/txt/UDHR/*"))
```
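
To make the extension rule concrete, here is a minimal sketch that reads one file with a `.txt` extension and one with no extension at all through a single glob pattern. The folder and file names are invented for the example. Depending on the verbosity setting, **readtext** may emit a warning for the file without an extension before treating it as plain text.

```{r}
library("readtext")

# Hypothetical demo folder inside the session's temporary directory
tmp_dir <- file.path(tempdir(), "readtext_ext_demo")
dir.create(tmp_dir, showWarnings = FALSE)

# One file with a recognized extension, one with no extension
writeLines("A note saved as plain text.", file.path(tmp_dir, "notes.txt"))
writeLines("A note saved without a file extension.", file.path(tmp_dir, "notes"))

# The glob matches both files; each becomes one row of the result
rt <- readtext(file.path(tmp_dir, "*"))
rt$doc_id
nrow(rt)
```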

For files that contain multiple documents, such as comma-separated-value documents, you will need to specify the column name containing the texts, using the `text_field` argument:

```{r}
# read in comma-separated values and specify text field
readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
```
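
The same idea can be tried on a small file of your own. The sketch below writes a two-row CSV to a temporary location and reads it back, so that each row becomes one document and the remaining columns become docvars. The column names `speaker`, `year`, and `speech` are made up for this illustration.

```{r}
library("readtext")

# A tiny, made-up CSV: one row per document
csv_path <- file.path(tempdir(), "speeches_demo.csv")
speeches <- data.frame(
  speaker = c("A", "B"),
  year = c(2001L, 2005L),
  speech = c("First example speech.", "Second example speech."),
  stringsAsFactors = FALSE
)
write.csv(speeches, csv_path, row.names = FALSE)

# "speech" holds the texts; the other columns are kept as docvars
rt <- readtext(csv_path, text_field = "speech")
names(rt)
rt$text
```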

For a more complete demonstration, see the package [vignette](https://readtext.quanteda.io/articles/readtext_vignette.html).

## Inter-operability with other packages

### With **quanteda**

**readtext** was originally developed in early versions of the [**quanteda**](https://github.com/quanteda/quanteda) package for the quantitative analysis of textual data. Because **quanteda**'s corpus constructor recognizes the data.frame format returned by `readtext()`, it can construct a corpus directly from a readtext object, preserving all docvars and other meta-data.

```{r}
library("quanteda")
# read in comma-separated values with readtext
rt_csv <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
# create quanteda corpus
corpus_csv <- corpus(rt_csv)
summary(corpus_csv, 5)
```

### Text Interchange Format compatibility

Because **readtext** returns a data.frame that is formatted as per the corpus structure of the [Text Interchange Format](https://github.com/ropenscilabs/tif), it can easily be used by other packages that accept a corpus in data.frame format.

If you only want a named `character` object, **readtext** also defines an `as.character()` method that inputs its data.frame and returns just the named character vector of texts, conforming to the TIF definition of the character version of a corpus.
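
A short sketch of both forms, using the UDHR sample files that ship with the package. It assumes that the names of the character vector mirror the `doc_id` column, as the TIF character format requires.

```{r}
library("readtext")

# Sample texts bundled with readtext
DATA_DIR <- system.file("extdata/", package = "readtext")
rt <- readtext(paste0(DATA_DIR, "/txt/UDHR/*"))

# data.frame form: the first two columns are doc_id and text
names(rt)[1:2]

# character form: a named vector, one element per document
txts <- as.character(rt)
is.character(txts)
head(names(txts), 3)
```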

Owner

Name: Quanteda Initiative
Login: quanteda
Kind: organization
Location: London, UK

Website: https://quanteda.org
Repositories: 21
Profile: https://github.com/quanteda

GitHub Events

Total

Issues event: 1
Watch event: 3
Issue comment event: 3
Push event: 5
Pull request event: 5
Fork event: 1

Last Year

Issues event: 1
Watch event: 3
Issue comment event: 3
Push event: 5
Pull request event: 5
Fork event: 1

Committers

Last synced: 9 months ago

All Time

Total Commits: 427
Total Committers: 12
Avg Commits per committer: 35.583
Development Distribution Score (DDS): 0.532

Past Year

Commits: 3
Committers: 2
Avg Commits per committer: 1.5
Development Distribution Score (DDS): 0.333

Top Committers

Name	Email	Commits
Kenneth Benoit	k**t@l**k	200
Adam Obeng	g**b@b**m	110
amatsuo	m**a@g**m	46
Kohei Watanabe	w**i@g**m	28
Kenneth Benoit	k**t@K**l	26
Stefan Müller	m**s@t**e	10
chainsawriot	c**y@g**m	2
pnulty	p**y@g**m	1
olivroy	5****y	1
Tom Nicholls	g**b@t**k	1
Jirka Lewandowski	j**i@p**e	1
JBGruber	j**1@r**k	1

Committer Domains (Top 20 + Academic)

research.gla.ac.uk: 1, posteo.de: 1, tomnicholls.me.uk: 1, tcd.ie: 1, binaryeagle.com: 1, lse.ac.uk: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 72
Total pull requests: 37
Average time to close issues: 6 months
Average time to close pull requests: about 1 month
Total issue authors: 35
Total pull request authors: 12
Average comments per issue: 1.83
Average comments per pull request: 1.19
Merged pull requests: 31
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 3
Average time to close issues: N/A
Average time to close pull requests: 26 days
Issue authors: 1
Pull request authors: 2
Average comments per issue: 3.0
Average comments per pull request: 0.33
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

kbenoit (23)
koheiw (6)
adamobeng (6)
stefan-mueller (4)
krystian8207 (2)
ogorodriguez (2)
cownr10r (1)
sagarkumar16 (1)
dkholo (1)
jslapin (1)
Astelix (1)
ElCarlitos (1)
gcpoole (1)
leeper (1)
SinaOzdemir (1)

Pull Request Authors

kbenoit (8)
amatsuo (6)
chainsawriot (6)
adamobeng (6)
koheiw (3)
olivroy (3)
gcpoole (2)
stefan-mueller (2)
JBGruber (1)
pnulty (1)
pmyteh (1)
jirkalewandowski (1)

Top Labels

Issue Labels

enhancement (8), bug (5), Documentation (4), Difficulty: Hard (3), Difficulty: Medium (3), pre-CRAN (3), question (1), Difficulty: Easy (1), performance (1)

Pull Request Labels

bug (1)

Packages

Total packages: 2
Total downloads:
- cran 4,138 last-month
Total docker downloads: 89,332

Total dependent packages: 7
(may contain duplicates)
Total dependent repositories: 25
(may contain duplicates)
Total versions: 14
Total maintainers: 1

cran.r-project.org: readtext

Import and Handling for Plain and Formatted Text Files

Homepage: https://readtext.quanteda.io/
Documentation: http://cran.r-project.org/web/packages/readtext/readtext.pdf
License: GPL-3
Latest release: 0.92.1
published 7 months ago

Versions: 12
Dependent Packages: 7
Dependent Repositories: 24
Downloads: 4,138 Last month
Docker Downloads: 89,332

Rankings

Forks count: 2.8%

Stargazers count: 3.6%

Dependent repos count: 5.6%

Downloads: 5.9%

Dependent packages count: 7.3%

Average: 7.8%

Docker downloads count: 21.8%

Maintainers (1)

kbenoit@lse.ac.uk

Last synced: 6 months ago

conda-forge.org: r-readtext

Homepage: https://github.com/quanteda/readtext
License: GPL-3.0-only
Latest release: 0.81
published over 4 years ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 1

Rankings

Dependent repos count: 24.3%

Stargazers count: 32.8%

Forks count: 33.1%

Average: 35.5%

Dependent packages count: 51.6%

Last synced: 6 months ago

Dependencies

DESCRIPTION cran

R >= 3.6 depends
antiword * imports
data.table * imports
digest * imports
httr * imports
jsonlite >= 0.9.10 imports
pdftools * imports
readODS >= 1.7.0 imports
readxl * imports
streamR * imports
stringi * imports
striprtf * imports
tibble * imports
utils * imports
xml2 * imports
knitr * suggests
pkgload * suggests
quanteda >= 3.0 suggests
rmarkdown * suggests
testthat * suggests

.github/workflows/R-CMD-check.yaml actions

actions/cache v1 composite
actions/checkout v2 composite
actions/upload-artifact main composite
r-lib/actions/setup-pandoc master composite
r-lib/actions/setup-r master composite

.github/workflows/test-coverage.yaml actions

actions/cache v1 composite
actions/checkout v2 composite
r-lib/actions/setup-pandoc master composite
r-lib/actions/setup-r master composite