quanteda

quanteda: An R package for the quantitative analysis of textual data - Published in JOSS (2018)

https://github.com/quanteda/quanteda

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org, zenodo.org
  • Committers with academic emails
    6 of 44 committers (13.6%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

corpus natural-language-processing quanteda r text-analytics

Keywords from Contributors

twitter reproducibility
Last synced: 6 months ago

Repository

An R package for the Quantitative Analysis of Textual Data

Basic Info
  • Host: GitHub
  • Owner: quanteda
  • License: gpl-3.0
  • Language: R
  • Default Branch: master
  • Homepage: https://quanteda.io
  • Size: 751 MB
Statistics
  • Stars: 863
  • Watchers: 52
  • Forks: 188
  • Open Issues: 62
  • Releases: 43
Created over 13 years ago · Last pushed 7 months ago
Metadata Files
Readme Changelog License

README.Rmd

---
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
 warning = FALSE,
 collapse = TRUE,
 comment = "##",
 fig.path = "man/images/"
)
```
```{r echo=FALSE, results="hide", message=FALSE}
library("badger")
```

[![quanteda: quantitative analysis of textual data](https://cdn.rawgit.com/quanteda/quanteda/master/images/quanteda_logo.svg)](http://quanteda.io)


[![CRAN Version](https://www.r-pkg.org/badges/version/quanteda)](https://CRAN.R-project.org/package=quanteda)
`r badge_devel("quanteda/quanteda", "royalblue")`
[![Downloads](https://cranlogs.r-pkg.org/badges/quanteda)](https://CRAN.R-project.org/package=quanteda)
[![Total Downloads](https://cranlogs.r-pkg.org/badges/grand-total/quanteda?color=orange)](https://CRAN.R-project.org/package=quanteda)
[![R-CMD-check](https://github.com/quanteda/quanteda/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/quanteda/quanteda/actions/workflows/check-standard.yaml)
[![codecov](https://codecov.io/gh/quanteda/quanteda/branch/master/graph/badge.svg)](https://app.codecov.io/gh/quanteda/quanteda) [![DOI](https://zenodo.org/badge/5424649.svg)](https://zenodo.org/badge/latestdoi/5424649)
[![DOI](http://joss.theoj.org/papers/10.21105/joss.00774/status.svg)](https://doi.org/10.21105/joss.00774)


## About


**quanteda** is an R package for managing and analyzing text, created and maintained by [Kenneth Benoit](https://kenbenoit.net) and [Kohei Watanabe](https://blog.koheiw.net/). Its creation was funded by the European Research Council grant ERC-2011-StG 283794-QUANTESS and its continued development is supported by the [Quanteda Initiative CIC](https://quanteda.org).

For more details, see https://quanteda.io.

## **quanteda** version 4

**quanteda** 4.0 is a major release that improves functionality and performance and further improves function consistency by removing previously deprecated functions. It also includes significant new tokeniser rules that make the default tokeniser smarter than ever, with new Unicode and ICU-compliant rules enabling it to work more consistently with even more languages.

We describe these significant changes more fully in:

- an [article about the new external pointer tokens objects](https://quanteda.io/articles/pkgdown/tokens_xptr.html);
- an [article showing performance benchmarks](https://quanteda.io/articles/pkgdown/benchmarks_xptr.html) for the new external pointer tokens objects, as well as some of the tokeniser improvements in v4; and
- the [changelog for v4](https://github.com/quanteda/quanteda/blob/master/NEWS.md#quanteda-40), a full listing of the changes, improvements, and deprecations in v4.

## The **quanteda** family of packages

We completed the trend of splitting **quanteda** into modular packages with the release of v3. The quanteda family of packages includes the following:

- [**quanteda**](https://github.com/quanteda/quanteda): contains all of the core natural language processing and textual data management functions
- [**quanteda.textmodels**](https://github.com/quanteda/quanteda.textmodels): contains all of the text models and supporting functions, namely the `textmodel_*()` functions. This was split from the main package with the v2 release
- [**quanteda.textstats**](https://github.com/quanteda/quanteda.textstats): statistics for textual data, namely the `textstat_*()` functions, split with the v3 release
- [**quanteda.textplots**](https://github.com/quanteda/quanteda.textplots): plots for textual data, namely the `textplot_*()` functions, split with the v3 release

We are working on additional package releases, available in the meantime from our GitHub pages:

- [**quanteda.sentiment**](https://github.com/quanteda/quanteda.sentiment): Functions and lexicons for sentiment analysis using dictionaries
- [**quanteda.tidy**](https://github.com/quanteda/quanteda.tidy): Extensions for manipulating document variables in core **quanteda** objects using your favourite **tidyverse** functions

and more to come.

## How To...

### Install (binaries) from CRAN

Install the normal way from CRAN, using your R GUI or:
```{r eval = FALSE}
install.packages("quanteda") 
```

**(New for quanteda v4.0)** For Linux users: Because all installations on Linux are compiled from source, Linux users first need to install the Intel oneAPI Threading Building Blocks (TBB) library, which is used for parallel computing. Without it, installation will not work.

To install TBB on Linux:

```{bash eval = FALSE}
# Fedora, CentOS, RHEL
sudo yum install tbb-devel

# Debian and Ubuntu
sudo apt install libtbb-dev
```
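
The two commands above differ only by distribution family, so a setup script could choose between them using the `ID` field of `/etc/os-release`. The sketch below is illustrative only: the function name `tbb_install_cmd` is hypothetical and not part of **quanteda**, the mapping covers just the distributions listed above, and the function prints the command rather than running it.

```bash
# Hypothetical helper: map a distribution ID (the ID field of
# /etc/os-release) to the TBB install command listed above.
tbb_install_cmd() {
  case "$1" in
    fedora|centos|rhel) echo "sudo yum install tbb-devel" ;;
    debian|ubuntu)      echo "sudo apt install libtbb-dev" ;;
    *)
      echo "no TBB install command known for '$1'" >&2
      return 1
      ;;
  esac
}

# Look up the current system. The subshell keeps the os-release
# variables out of the calling shell.
if [ -r /etc/os-release ]; then
  distro_id=$(. /etc/os-release && echo "${ID:-}")
  tbb_install_cmd "$distro_id" || true
fi
```

Calling `tbb_install_cmd ubuntu` prints the `apt` command, and an unrecognised ID returns a non-zero status so a calling script can fall back to manual instructions.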

### Compile from source (macOS and Windows)

Building the development version compiles some C++ and Fortran source code, so you will need the appropriate compilers installed.

You will also need to install TBB:

**macOS:**

First, you will need to install the Xcode command line tools.

```{bash eval = FALSE}
xcode-select --install
```

Then, after installing [Homebrew](https://brew.sh), install the TBB libraries and the pkg-config utility:

```{bash eval = FALSE}
brew install tbb pkg-config
```

Finally, you will need to install [gfortran](https://github.com/fxcoudert/gfortran-for-macOS/releases).
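
Before building from source, it can help to confirm that the tools from the Homebrew and gfortran steps are on the `PATH`. A minimal sketch, assuming they are exposed under the command names `brew`, `pkg-config`, and `gfortran`; the helper `missing_tools` is hypothetical, only reports what it cannot find, and does not cover the Xcode command line tools.

```bash
# Hypothetical helper: print each named command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Tools from the Homebrew and gfortran steps above.
missing=$(missing_tools brew pkg-config gfortran)
if [ -n "$missing" ]; then
  # $missing is left unquoted so the names print on one line.
  echo "Missing build tools:" $missing
else
  echo "All build tools found"
fi
```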

**Windows:**

Install [RTools](https://cran.r-project.org/bin/windows/Rtools/), which includes the TBB libraries.



### Enable parallelisation


**quanteda** takes advantage of parallel computing through the [TBB (Threading Building Blocks) library](https://en.wikipedia.org/wiki/Threading_Building_Blocks) to speed up computations. This guide provides step-by-step instructions for setting up your system to use **quanteda** with parallel capabilities on Windows, macOS, and Linux.


**Windows:**

Download and install RTools from the [RTools download page](https://cran.r-project.org/bin/windows/Rtools/).


**macOS:**

1. **Install Xcode Command Line Tools**
   - Type the following command in the terminal:
     ```bash
     xcode-select --install
     ```
     
2. **Install Homebrew**
   - If Homebrew is not installed, run:
     ```bash
     /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
     ```

3. **Install TBB and pkg-config**
   - After installing Homebrew, run:
     ```bash
     brew install tbb pkg-config
     ```

4. **Install gfortran**
   - gfortran is required for compiling Fortran code. It is included in Homebrew's `gcc` formula:
     ```bash
     brew install gcc
     ```

**Linux:**

Install TBB:

- For Fedora, CentOS, RHEL:
  ```bash
  sudo yum install tbb-devel
  ```
- For Debian and Ubuntu:
  ```bash
  sudo apt install libtbb-dev
  ```

More details are provided in the [quanteda documentation](http://quanteda.io/articles/pkgdown/parallelisation.html).
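
After following the steps above, one quick check is whether `pkg-config` can locate the TBB development files. This is a sketch under an assumption: it expects the installed TBB to register a pkg-config module named `tbb`, which may not hold for every distribution or TBB version. The helper name `tbb_status` is hypothetical and not part of **quanteda**.

```bash
# Hypothetical helper: report whether pkg-config can see a module
# named "tbb" (assumed module name) and, if so, its version.
tbb_status() {
  if ! command -v pkg-config >/dev/null 2>&1; then
    echo "pkg-config not found"
  elif pkg-config --exists tbb; then
    echo "TBB found: version $(pkg-config --modversion tbb)"
  else
    echo "TBB not found by pkg-config"
  fi
}

tbb_status
```

A "not found" result does not by itself mean the build will fail, since the libraries may be installed somewhere `pkg-config` does not search.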


### Use **quanteda**

See the [quick start guide](https://quanteda.io/articles/quickstart.html) to learn how to use **quanteda**.

### Get Help

* Read our documentation at https://quanteda.io.
* Check out the [**quanteda** cheatsheet](https://github.com/quanteda/quanteda/blob/master/tests/cheatsheet/quanteda-cheatsheet.pdf).
* Submit a question on the **quanteda** channel on Stack Overflow (questions tagged "quanteda").
* See our [tutorial site](https://tutorials.quanteda.io/).

### Cite the package

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. (2018) "[quanteda: An R package for the quantitative analysis of textual data](https://www.theoj.org/joss-papers/joss.00774/10.21105.joss.00774.pdf)". _Journal of Open Source Software_ 3(30), 774. [https://doi.org/10.21105/joss.00774](https://doi.org/10.21105/joss.00774).

For a BibTeX entry, use the output from `citation(package = "quanteda")`.

### Leave Feedback

If you like **quanteda**, please consider leaving [feedback or a testimonial here](https://github.com/quanteda/quanteda/issues/461).

### Contribute

Contributions in the form of feedback, comments, code, and bug reports are most welcome. How to contribute:

* Fork the source code, modify, and issue a [pull request](https://help.github.com/articles/creating-a-pull-request-from-a-fork/) through the [project GitHub page](https://github.com/quanteda/quanteda). See our [Contributor Code of Conduct](https://github.com/quanteda/quanteda/blob/master/CONDUCT.md) and the all-important **quanteda** [Style Guide](https://github.com/quanteda/quanteda/wiki/Style-guide).
* Issues, bug reports, and wish lists: [File a GitHub issue](https://github.com/quanteda/quanteda/issues).
* Contact [the maintainer](mailto:kbenoit@lse.ac.uk) by email.

Owner

  • Name: Quanteda Initiative
  • Login: quanteda
  • Kind: organization
  • Location: London, UK

JOSS Publication

quanteda: An R package for the quantitative analysis of textual data
Published
October 06, 2018
Volume 3, Issue 30, Page 774
Authors
Kenneth Benoit ORCID
Department of Methodology, London School of Economics and Political Science
Kohei Watanabe ORCID
Department of Methodology, London School of Economics and Political Science
Haiyan Wang ORCID
De Beers Inc.
Paul Nulty ORCID
Centre for Research in Arts, Social Science and Humanities, University of Cambridge
Adam Obeng ORCID
Facebook, Inc. (work conducted at the Department of Methodology, London School of Economics and Political Science)
Stefan Müller ORCID
Department of Political Science, Trinity College Dublin
Akitaka Matsuo ORCID
Department of Methodology, London School of Economics and Political Science
Editor
Arfon Smith ORCID
Tags
text mining natural language processing

GitHub Events

Total
  • Create event: 23
  • Release event: 2
  • Issues event: 30
  • Watch event: 24
  • Delete event: 16
  • Issue comment event: 108
  • Push event: 122
  • Pull request review comment event: 9
  • Pull request review event: 41
  • Pull request event: 41
  • Fork event: 2
Last Year
  • Create event: 23
  • Release event: 2
  • Issues event: 30
  • Watch event: 24
  • Delete event: 16
  • Issue comment event: 108
  • Push event: 122
  • Pull request review comment event: 9
  • Pull request review event: 41
  • Pull request event: 41
  • Fork event: 2

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 9,207
  • Total Committers: 44
  • Avg Commits per committer: 209.25
  • Development Distribution Score (DDS): 0.559
Past Year
  • Commits: 201
  • Committers: 6
  • Avg Commits per committer: 33.5
  • Development Distribution Score (DDS): 0.204
Top Committers
Name Email Commits
Kohei Watanabe w****i@g****m 4,060
Kenneth Benoit k****t@l****k 3,316
HaiyanLW w****1@g****m 488
Paul Nulty p****y@g****m 440
Adam Obeng g****b@b****m 289
Kenneth Benoit k****t@K****l 214
Stefan Müller m****s@t****e 165
Paul Nulty p****l@p****) 59
jiongweilua j****a@l****k 47
amatsuo m****a@g****m 33
Christian Mueller c****s@p****h 11
Benjamin Lauderdale b****e@l****k 9
jiongweilua l****i@L****e 8
odlmarce d****r@g****m 7
conjugateprior c****r@g****m 5
Roger Bivand R****d@g****m 5
José Tomás Atria j****a@g****m 5
Pablo Barberá p****a@n****u 4
Lua Jiong Wei l****i@L****l 4
Tyler Rinker t****r@g****m 3
mpadge m****m@e****m 3
Christopher Gandrud c****d@g****m 3
Michael Chirico m****4@g****m 2
chainsawriot c****y@g****m 2
Iñaki Úcar i****r@f****g 2
Alec L. Robitaille r****c@g****m 2
Benoit k****t@L****T 2
Kohei Watanabe k****w@K****l 2
Paul p****l@p****) 2
andrea rota a@x****u 1
and 14 more...

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 122
  • Total pull requests: 46
  • Average time to close issues: 6 months
  • Average time to close pull requests: 16 days
  • Total issue authors: 31
  • Total pull request authors: 7
  • Average comments per issue: 1.93
  • Average comments per pull request: 2.33
  • Merged pull requests: 34
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 20
  • Pull requests: 39
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 12 days
  • Issue authors: 7
  • Pull request authors: 2
  • Average comments per issue: 0.75
  • Average comments per pull request: 2.44
  • Merged pull requests: 29
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • kbenoit (52)
  • koheiw (24)
  • pnulty (20)
  • andreskull (5)
  • zachmayer (4)
  • jtyun (2)
  • chmue (2)
  • bfisseler (2)
  • Leo-Send (2)
  • michellerh330 (2)
  • angeloklin (2)
  • peterejkemp (2)
  • TheKashe (1)
  • chrishanretty (1)
  • rafaelrpaiva (1)
Pull Request Authors
  • koheiw (67)
  • kbenoit (19)
  • rsbivand (3)
  • stefan-mueller (2)
  • chmue (2)
  • rimonim (2)
  • Enchufa2 (1)
  • trinker (1)
  • robitalec (1)
  • pjsio (1)
  • pablobarbera (1)
Top Labels
Issue Labels
enhancement (19) bug (8) CRAN (6) design (4) documentation (4) infrastructure (3) URGENT (3) dfm (2) robustness (2) tokens (1) collocations (1) textplot (1) hashing (1) question (1) compatibility (1) help wanted (1)
Pull Request Labels
CRAN (3) URGENT (3) help wanted (1) documentation (1)

Packages

  • Total packages: 1
  • Total downloads: unknown
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 37
proxy.golang.org: github.com/quanteda/quanteda
  • Versions: 37
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.5%
Average: 5.7%
Dependent repos count: 5.9%
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.5.0 depends
  • methods * depends
  • Matrix >= 1.2 imports
  • Rcpp >= 0.12.12 imports
  • RcppParallel * imports
  • SnowballC * imports
  • fastmatch * imports
  • magrittr * imports
  • stopwords * imports
  • stringi * imports
  • xml2 * imports
  • yaml * imports
  • RColorBrewer * suggests
  • dplyr * suggests
  • formatR * suggests
  • ggplot2 * suggests
  • jsonlite * suggests
  • knitr * suggests
  • lda * suggests
  • lsa * suggests
  • purrr * suggests
  • quanteda * suggests
  • quanteda.textmodels * suggests
  • quanteda.textplots * suggests
  • quanteda.textstats * suggests
  • rmarkdown * suggests
  • slam * suggests
  • spacyr * suggests
  • spelling * suggests
  • stm * suggests
  • testthat * suggests
  • text2vec * suggests
  • tibble * suggests
  • tidytext * suggests
  • tm >= 0.6 suggests
  • tokenizers * suggests
  • topicmodels * suggests
  • xtable * suggests
.github/workflows/test-coverage.yaml actions
  • actions/checkout v3 composite
  • actions/upload-artifact v3 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/check-standard.yaml actions
  • actions/checkout v3 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite