Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 2 of 12 committers (16.7%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (19.0%) to scientific vocabulary
Keywords
encoding
quanteda
r
text
Keywords from Contributors
corpus
text-analytics
sentiment-analysis
lsa
text-analysis
excel
Last synced: 6 months ago
Repository
an R package for reading text files
Basic Info
- Host: GitHub
- Owner: quanteda
- Language: R
- Default Branch: master
- Homepage: https://readtext.quanteda.io
- Size: 16.5 MB
Statistics
- Stars: 120
- Watchers: 13
- Forks: 26
- Open Issues: 33
- Releases: 8
Created over 9 years ago · Last pushed 7 months ago
Metadata Files
Readme
Changelog
README.Rmd
---
output: github_document
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "##",
fig.path = "images/"
)
```
```{r echo=FALSE, results="hide", message=FALSE}
library("badger")
```
# readtext: Import and handling for plain and formatted text files
`r badge_devel("quanteda/readtext", "royalblue")`
An R package for reading text files in all their various formats, by Ken Benoit, Adam Obeng, Paul Nulty, Aki Matsuo, Kohei Watanabe, and Stefan Müller.
## Introduction
**readtext** is a one-function package that does exactly what it says on the tin: It reads files containing text, along with any associated document-level metadata, which we call "docvars", for document variables. Plain text files do not have docvars, but other forms such as .csv, .tab, .xml, and .json files usually do.
**readtext** accepts filemasks, so that you can specify a pattern to load multiple texts, and these texts can even be of multiple types. **readtext** is smart enough to process them correctly, returning a data.frame with a primary field "text" containing a character vector of the texts, and additional columns of the data.frame as found in the document variables from the source files.
As encoding can also be a challenging issue for those reading in texts, we include functions for diagnosing encodings on a file-by-file basis, and allow you to specify vectorized input encodings to read in file types with individually set (and different) encodings. (All encoding functions are handled by the **stringi** package.)
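As a minimal sketch of this workflow, the example below writes a small Latin-1 file to a temporary directory, asks **stringi** to guess its encoding, and then passes the encoding to `readtext()` through its `encoding` argument. The file name, its contents, and the object names are invented for illustration, and the detection step returns guesses ranked by confidence rather than a definitive answer.

```{r}
library("readtext")

# create an illustrative Latin-1 encoded file in a temporary directory
# (the file name and its contents are made up for this example)
tmp <- file.path(tempdir(), "latin1_example.txt")
bytes_latin1 <- iconv("caf\u00e9 cr\u00e8me\n", from = "UTF-8", to = "latin1",
                      toRaw = TRUE)[[1]]
writeBin(bytes_latin1, tmp)

# ask stringi for its best guesses at the encoding of the raw bytes
stringi::stri_enc_detect(readBin(tmp, what = "raw", n = file.size(tmp)))[[1]]

# declare the input encoding so that the text is converted on import
rt_latin1 <- readtext(tmp, encoding = "ISO-8859-1")
rt_latin1$text
```

Very short inputs such as this one give the detector little to work with, so it is worth checking the imported text by eye before relying on a guessed encoding.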
## How to Install
1. From CRAN
```{r, eval = FALSE}
install.packages("readtext")
```
2. From GitHub, if you want the latest development version.
```{r, eval = FALSE}
# the remotes package is required to install readtext from GitHub
remotes::install_github("quanteda/readtext")
```
Linux note: Some system dependencies may not be installed by default on Linux systems. On Debian/Ubuntu, try installing them by running this command at the command line:
```{bash, eval = FALSE}
sudo apt-get install libpoppler-cpp-dev # for pdftools
```
## Demonstration: Reading one or more text files
**readtext** supports plain text files (.txt), data in some form of JavaScript Object Notation (.json), comma- or tab-separated values (.csv, .tab, .tsv), XML documents (.xml), as well as PDF, Microsoft Word formatted files and other document formats (.pdf, .doc, .docx, .odt, .rtf). **readtext** also handles multiple files and file types using, for instance, a "glob" expression, and can read files from a URL or an archive file (.zip, .tar, .tar.gz, .tar.bz).
The file formats are determined automatically by the filename extensions. If a file has no extension, or its extension is unknown, **readtext** will assume that it is plain text. The following command, for instance, will load in all of the files from the subdirectory `txt/UDHR/`:
```{r}
library("readtext")
# get the data directory from readtext
DATA_DIR <- system.file("extdata/", package = "readtext")
# read in all files from a folder
readtext(paste0(DATA_DIR, "/txt/UDHR/*"))
```
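To make the return value concrete, here is a short sketch that inspects the object produced by the same call, assuming the example data shipped with the package. The name `rt_udhr` is chosen only for this illustration.

```{r}
library("readtext")

# locate the example data installed with readtext
DATA_DIR <- system.file("extdata/", package = "readtext")

# read every file in the folder into a single object
rt_udhr <- readtext(paste0(DATA_DIR, "/txt/UDHR/*"))

# the result is a data.frame with one row per document
class(rt_udhr)
names(rt_udhr)

# document identifiers are taken from the file names
head(rt_udhr$doc_id, 3)
```

Because the result is a data.frame, the usual tools such as `nrow()`, `subset()`, and `[` indexing apply to it.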
For files that contain multiple documents, such as comma-separated-value documents, you will need to specify the column name containing the texts, using the `text_field` argument:
```{r}
# read in comma-separated values and specify text field
readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
```
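The sketch below shows how the remaining columns travel along as docvars, using a tiny CSV written to a temporary directory. The file name, column names, and values are invented for this example and are not part of the package data.

```{r}
library("readtext")

# build a small illustrative CSV; the column names and values are invented
tmp_csv <- file.path(tempdir(), "speeches.csv")
write.csv(
  data.frame(
    texts = c("First example document.", "Second example document."),
    Year = c(2001, 2005),
    Speaker = c("A", "B")
  ),
  file = tmp_csv,
  row.names = FALSE
)

# each row becomes one document; the other columns are kept as docvars
rt_small <- readtext(tmp_csv, text_field = "texts")
names(rt_small)
rt_small$Year
```

The column named in `text_field` is returned as `text`, while `Year` and `Speaker` stay alongside it as ordinary data.frame columns.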
For a more complete demonstration, see the package [vignette](https://readtext.quanteda.io/articles/readtext_vignette.html).
## Inter-operability with other packages
### With **quanteda**
**readtext** was originally developed in early versions of the [**quanteda**](https://github.com/quanteda/quanteda) package for the quantitative analysis of textual data. Because **quanteda**'s corpus constructor recognizes the data.frame format returned by `readtext()`, it can construct a corpus directly from a readtext object, preserving all docvars and other meta-data.
```{r}
library("quanteda")
# read in comma-separated values with readtext
rt_csv <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
# create quanteda corpus
corpus_csv <- corpus(rt_csv)
summary(corpus_csv, 5)
```
### Text Interchange Format compatibility
Because **readtext** returns a data.frame that is formatted as per the corpus structure of the [Text Interchange Format](https://github.com/ropenscilabs/tif), it can easily be used by other packages that can accept a corpus in data.frame format.
If you only want a named `character` object, **readtext** also defines an `as.character()` method that inputs its data.frame and returns just the named character vector of texts, conforming to the TIF definition of the character version of a corpus.
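A brief sketch of that conversion, assuming the example data shipped with the package; the object names `rt_udhr` and `txts` are chosen only for this illustration.

```{r}
library("readtext")

# read the example UDHR files shipped with the package
DATA_DIR <- system.file("extdata/", package = "readtext")
rt_udhr <- readtext(paste0(DATA_DIR, "/txt/UDHR/*"))

# drop the data.frame structure and keep only the texts
txts <- as.character(rt_udhr)

# the document identifiers are carried over as the element names
is.character(txts)
head(names(txts), 3)
```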
Owner
- Name: Quanteda Initiative
- Login: quanteda
- Kind: organization
- Location: London, UK
- Website: https://quanteda.org
- Repositories: 21
- Profile: https://github.com/quanteda
GitHub Events
Total
- Issues event: 1
- Watch event: 3
- Issue comment event: 3
- Push event: 5
- Pull request event: 5
- Fork event: 1
Last Year
- Issues event: 1
- Watch event: 3
- Issue comment event: 3
- Push event: 5
- Pull request event: 5
- Fork event: 1
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Kenneth Benoit | k****t@l****k | 200 |
| Adam Obeng | g****b@b****m | 110 |
| amatsuo | m****a@g****m | 46 |
| Kohei Watanabe | w****i@g****m | 28 |
| Kenneth Benoit | k****t@K****l | 26 |
| Stefan Müller | m****s@t****e | 10 |
| chainsawriot | c****y@g****m | 2 |
| pnulty | p****y@g****m | 1 |
| olivroy | 5****y | 1 |
| Tom Nicholls | g****b@t****k | 1 |
| Jirka Lewandowski | j****i@p****e | 1 |
| JBGruber | j****1@r****k | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 72
- Total pull requests: 37
- Average time to close issues: 6 months
- Average time to close pull requests: about 1 month
- Total issue authors: 35
- Total pull request authors: 12
- Average comments per issue: 1.83
- Average comments per pull request: 1.19
- Merged pull requests: 31
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: 26 days
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 3.0
- Average comments per pull request: 0.33
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- kbenoit (23)
- koheiw (6)
- adamobeng (6)
- stefan-mueller (4)
- krystian8207 (2)
- ogorodriguez (2)
- cownr10r (1)
- sagarkumar16 (1)
- dkholo (1)
- jslapin (1)
- Astelix (1)
- ElCarlitos (1)
- gcpoole (1)
- leeper (1)
- SinaOzdemir (1)
Pull Request Authors
- kbenoit (8)
- amatsuo (6)
- chainsawriot (6)
- adamobeng (6)
- koheiw (3)
- olivroy (3)
- gcpoole (2)
- stefan-mueller (2)
- JBGruber (1)
- pnulty (1)
- pmyteh (1)
- jirkalewandowski (1)
Top Labels
Issue Labels
enhancement (8)
bug (5)
Documentation (4)
Difficulty: Hard (3)
Difficulty: Medium (3)
pre-CRAN (3)
question (1)
Difficulty: Easy (1)
performance (1)
Pull Request Labels
bug (1)
Packages
- Total packages: 2
- Total downloads:
  - cran: 4,138 last month
- Total docker downloads: 89,332
- Total dependent packages: 7 (may contain duplicates)
- Total dependent repositories: 25 (may contain duplicates)
- Total versions: 14
- Total maintainers: 1
cran.r-project.org: readtext
Import and Handling for Plain and Formatted Text Files
- Homepage: https://readtext.quanteda.io/
- Documentation: http://cran.r-project.org/web/packages/readtext/readtext.pdf
- License: GPL-3
- Latest release: 0.92.1 (published 7 months ago)
Rankings
Forks count: 2.8%
Stargazers count: 3.6%
Dependent repos count: 5.6%
Downloads: 5.9%
Dependent packages count: 7.3%
Average: 7.8%
Docker downloads count: 21.8%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: r-readtext
- Homepage: https://github.com/quanteda/readtext
- License: GPL-3.0-only
- Latest release: 0.81 (published over 4 years ago)
Rankings
Dependent repos count: 24.3%
Stargazers count: 32.8%
Forks count: 33.1%
Average: 35.5%
Dependent packages count: 51.6%
Last synced: 6 months ago
Dependencies
DESCRIPTION
cran
- R >= 3.6 depends
- antiword * imports
- data.table * imports
- digest * imports
- httr * imports
- jsonlite >= 0.9.10 imports
- pdftools * imports
- readODS >= 1.7.0 imports
- readxl * imports
- streamR * imports
- stringi * imports
- striprtf * imports
- tibble * imports
- utils * imports
- xml2 * imports
- knitr * suggests
- pkgload * suggests
- quanteda >= 3.0 suggests
- rmarkdown * suggests
- testthat * suggests
.github/workflows/R-CMD-check.yaml
actions
- actions/cache v1 composite
- actions/checkout v2 composite
- actions/upload-artifact main composite
- r-lib/actions/setup-pandoc master composite
- r-lib/actions/setup-r master composite
.github/workflows/test-coverage.yaml
actions
- actions/cache v1 composite
- actions/checkout v2 composite
- r-lib/actions/setup-pandoc master composite
- r-lib/actions/setup-r master composite