grafzahl

🧛 fine-tuning Transformers for text data from within R

https://github.com/gesistsa/grafzahl

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ✓
    CITATION.cff file
    Found CITATION.cff file
  • ✓
    codemeta.json file
    Found codemeta.json file
  • ✓
    .zenodo.json file
    Found .zenodo.json file
  • ✓
    DOI references
    Found 4 DOI reference(s) in README
  • ○
    Academic publication links
  • ○
    Academic email domains
  • ○
    Institutional organization owner
  • ○
    JOSS paper metadata
  • ○
    Scientific vocabulary similarity
    Low similarity (18.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Statistics
  • Stars: 42
  • Watchers: 4
  • Forks: 3
  • Open Issues: 5
  • Releases: 3
Created over 3 years ago · Last pushed 6 months ago

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# grafzahl 


[![CRAN status](https://www.r-pkg.org/badges/version/grafzahl)](https://CRAN.R-project.org/package=grafzahl)
[![R-CMD-check](https://github.com/gesistsa/grafzahl/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/gesistsa/grafzahl/actions/workflows/R-CMD-check.yaml)


The goal of grafzahl (**G**racious **R** **A**nalytical **F**ramework for **Z**appy **A**nalysis of **H**uman **L**anguages [^1]) is to duct tape the [quanteda](https://github.com/quanteda/quanteda) ecosystem to modern [Transformer-based text classification models](https://simpletransformers.ai/), e.g. BERT and RoBERTa. The model object looks and feels like the textmodel S3 object from the package [quanteda.textmodels](https://github.com/quanteda/quanteda.textmodels).

If you don't know what I am talking about, don't worry: this package is gracious. You don't need to know a lot about Transformers to use it. See the examples below.

Please cite this software as:

Chan, C. (2023). [grafzahl: fine-tuning Transformers for text data from within R](paper/grafzahl_sp.pdf). *Computational Communication Research* 5(1): 76-84. [https://doi.org/10.5117/CCR2023.1.003.CHAN](https://doi.org/10.5117/CCR2023.1.003.CHAN)

## Installation: Local environment

Install the CRAN version

```r
install.packages("grafzahl")
```

After that, you need to set up your conda environment:

```r
require(grafzahl)
setup_grafzahl(cuda = TRUE) ## if you have GPU(s)
```

## On remote environments, e.g. Google Colab

On Google Colab, you need to enable non-Conda mode

```r
install.packages("grafzahl")
require(grafzahl)
use_nonconda()
```

Please refer to the vignette.

## Usage

Suppose you have a bunch of tweets in the quanteda corpus format, and the corpus has exactly one docvar that denotes the labels you want to predict. The data is from [this repository](https://github.com/pablobarbera/incivility-sage-open) (Theocharis et al., 2020).

```{r, echo = FALSE, message = FALSE}
devtools::load_all()
```

```{r}
unciviltweets
```
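
If your own data is not yet a corpus, a minimal sketch of building one with exactly one docvar might look like this. The texts, the labels and the docvar name `uncivil` are invented for illustration; only `corpus()`, `docvars()` and `ndoc()` come from quanteda.

```r
library(quanteda)

## Toy data, invented for illustration only
tweets <- data.frame(
  text = c("Thank you for your hard work!",
           "You are a disgrace and a liar.",
           "Looking forward to the town hall tonight.",
           "Shut up, nobody wants to hear you."),
  uncivil = c(0, 1, 0, 1)
)

## `text_field` tells quanteda which column holds the texts;
## every remaining column (here only `uncivil`) becomes a docvar
my_corpus <- corpus(tweets, text_field = "text")

ndoc(my_corpus)           ## number of documents
names(docvars(my_corpus)) ## the single docvar holding the labels
```

A corpus built this way has the same shape as `unciviltweets`: one text per document and a single docvar carrying the label.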

In order to train a Transformer model, please select the `model_name` from [Hugging Face's list](https://huggingface.co/models). The table below lists some common choices. Most of the time, providing `model_name` is sufficient; there is no need to provide `model_type`.

Suppose you want to train a Transformer model using "bertweet" (Nguyen et al., 2020) because it matches your domain of usage. By default, the model is saved in the `output` subdirectory of the current working directory. You can change this location with the `output_dir` parameter.

```r
model <- grafzahl(unciviltweets, model_type = "bertweet", model_name = "vinai/bertweet-base")
### If you are a hardcore quanteda user:
## model <- textmodel_transformer(unciviltweets,
##                                model_type = "bertweet", model_name = "vinai/bertweet-base")
```
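
As a sketch of the `output_dir` parameter, the call below writes the fine-tuned model to a directory of your choice instead of `output`. The directory name is hypothetical, and creating it beforehand is a precaution rather than a documented requirement.

```r
library(grafzahl)

## Hypothetical location; any writable directory should do
model_dir <- file.path("models", "bertweet-uncivil")

## Create the directory up front (a precaution, not a documented requirement)
dir.create(model_dir, recursive = TRUE, showWarnings = FALSE)

## Same call as above, but the model files go to `model_dir`
model <- grafzahl(unciviltweets,
                  model_type = "bertweet",
                  model_name = "vinai/bertweet-base",
                  output_dir = model_dir)
```

Keeping one directory per model makes it easier to train several models side by side without overwriting earlier results.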

Make predictions:

```r
predict(model)
```
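
To classify texts the model has not seen, something along these lines should work. This sketch assumes that the `predict()` method accepts a `newdata` argument holding a quanteda corpus, which this README does not show; the two example texts are invented.

```r
library(quanteda)
library(grafzahl)

## Invented, unlabelled texts for illustration only
new_tweets <- corpus(c("Great speech today, well done!",
                       "What an idiot, resign already."))

## `newdata` is an assumption; check the package documentation
## for the exact interface of the predict method
preds <- predict(model, newdata = new_tweets)
preds
```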

That is it.

## Extended examples

Several extended examples are also available.

| Example                                         | File                                           |
|-------------------------------------------------|------------------------------------------------|
| van Atteveldt et al. (2021)                     | [paper/vanatteveldt.md](paper/vanatteveldt.md) |
| Dobbrick et al. (2021)                          | [paper/dobbrick.md](paper/dobbrick.md)         |
| Theocharis et al. (2020)                        | [paper/theocharis.md](paper/theocharis.md)     |
| OffensEval-TR (2020)                            | [paper/coltekin.md](paper/coltekin.md)         |
| Amharic News Text classification Dataset (2021) | [paper/azime.md](paper/azime.md)               |

## Some common choices of `model_name`

| Your data         | model_type | model_name                         |
|-------------------|------------|------------------------------------|
| English tweets    | bertweet   | vinai/bertweet-base                |
| Lightweight       | mobilebert | google/mobilebert-uncased          |
|                   | distilbert | distilbert-base-uncased            |
| Long Text         | longformer | allenai/longformer-base-4096       |
|                   | bigbird    | google/bigbird-roberta-base        |
| English (General) | bert       | bert-base-uncased                  |
|                   | bert       | bert-base-cased                    |
|                   | electra    | google/electra-small-discriminator |
|                   | roberta    | roberta-base                       |
| Multilingual      | xlm        | xlm-mlm-17-1280                    |
|                   | xlm        | xlm-mlm-100-1280                   |
|                   | bert       | bert-base-multilingual-cased       |
|                   | xlmroberta | xlm-roberta-base                   |
|                   | xlmroberta | xlm-roberta-large                  |

## References

1. Theocharis, Y., Barberá, P., Fazekas, Z., & Popa, S. A. (2020). The dynamics of political incivility on Twitter. Sage Open, 10(2), 2158244020919447.
2. Nguyen, D. Q., Vu, T., & Nguyen, A. T. (2020). BERTweet: A pre-trained language model for English Tweets. arXiv preprint arXiv:2005.10200.

---
[^1]: Yes, I totally made up the meaningless long name. Actually, it is the German name of the *Sesame Street* character [Count von Count](https://de.wikipedia.org/wiki/Sesamstra%C3%9Fe#Graf_Zahl), meaning "Count (the noble title) Number". And it seems to be compulsory to name absolutely everything related to Transformers after *Sesame Street* characters.

Owner

  • Name: Transparent Social Analytics
  • Login: gesistsa
  • Kind: organization
  • Location: Germany

Open Science Tools maintained by Transparent Social Analytics Team, GESIS

Citation (CITATION.cff)

# --------------------------------------------
# CITATION file created with {cffr} R package
# See also: https://docs.ropensci.org/cffr/
# --------------------------------------------
 
cff-version: 1.2.0
message: 'To cite package "grafzahl" in publications use:'
type: software
license: GPL-3.0-or-later
title: 'grafzahl: Supervised Machine Learning for Textual Data Using Transformers
  and ''Quanteda'''
version: 0.0.11
doi: 10.5117/CCR2023.1.003.CHAN
identifiers:
- type: doi
  value: 10.32614/CRAN.package.grafzahl
abstract: 'Duct tape the ''quanteda'' ecosystem (Benoit et al., 2018) <https://doi.org/10.21105/joss.00774>
  to modern Transformer-based text classification models (Wolf et al., 2020) <https://doi.org/10.18653/v1/2020.emnlp-demos.6>,
  in order to facilitate supervised machine learning for textual data. This package
  mimics the behaviors of ''quanteda.textmodels'' and provides a function to setup
  the ''Python'' environment to use the pretrained models from ''Hugging Face'' <https://huggingface.co/>.
  More information: <https://doi.org/10.5117/CCR2023.1.003.CHAN>.'
authors:
- family-names: Chan
  given-names: Chung-hong
  email: chainsawtiney@gmail.com
  orcid: https://orcid.org/0000-0002-6232-7530
preferred-citation:
  type: article
  title: 'grafzahl: fine-tuning Transformers for text data from within R.'
  authors:
  - family-names: Chan
    given-names: Chung-hong
    email: chainsawtiney@gmail.com
    orcid: https://orcid.org/0000-0002-6232-7530
  journal: Computational Communication Research
  doi: 10.5117/CCR2023.1.003.CHAN
  volume: '5'
  issue: '1'
  year: '2023'
  start: 76-84
repository: https://CRAN.R-project.org/package=grafzahl
repository-code: https://github.com/gesistsa/grafzahl
url: https://gesistsa.github.io/grafzahl/
contact:
- family-names: Chan
  given-names: Chung-hong
  email: chainsawtiney@gmail.com
  orcid: https://orcid.org/0000-0002-6232-7530
references:
- type: software
  title: knitr
  abstract: 'knitr: A General-Purpose Package for Dynamic Report Generation in R'
  notes: Suggests
  url: https://yihui.org/knitr/
  repository: https://CRAN.R-project.org/package=knitr
  authors:
  - family-names: Xie
    given-names: Yihui
    email: xie@yihui.name
    orcid: https://orcid.org/0000-0003-0645-5666
  year: '2024'
  doi: 10.32614/CRAN.package.knitr
- type: software
  title: rmarkdown
  abstract: 'rmarkdown: Dynamic Documents for R'
  notes: Suggests
  url: https://pkgs.rstudio.com/rmarkdown/
  repository: https://CRAN.R-project.org/package=rmarkdown
  authors:
  - family-names: Allaire
    given-names: JJ
    email: jj@posit.co
  - family-names: Xie
    given-names: Yihui
    email: xie@yihui.name
    orcid: https://orcid.org/0000-0003-0645-5666
  - family-names: Dervieux
    given-names: Christophe
    email: cderv@posit.co
    orcid: https://orcid.org/0000-0003-4474-2498
  - family-names: McPherson
    given-names: Jonathan
    email: jonathan@posit.co
  - family-names: Luraschi
    given-names: Javier
  - family-names: Ushey
    given-names: Kevin
    email: kevin@posit.co
  - family-names: Atkins
    given-names: Aron
    email: aron@posit.co
  - family-names: Wickham
    given-names: Hadley
    email: hadley@posit.co
  - family-names: Cheng
    given-names: Joe
    email: joe@posit.co
  - family-names: Chang
    given-names: Winston
    email: winston@posit.co
  - family-names: Iannone
    given-names: Richard
    email: rich@posit.co
    orcid: https://orcid.org/0000-0003-3925-190X
  year: '2024'
  doi: 10.32614/CRAN.package.rmarkdown
- type: software
  title: testthat
  abstract: 'testthat: Unit Testing for R'
  notes: Suggests
  url: https://testthat.r-lib.org
  repository: https://CRAN.R-project.org/package=testthat
  authors:
  - family-names: Wickham
    given-names: Hadley
    email: hadley@posit.co
  year: '2024'
  doi: 10.32614/CRAN.package.testthat
  version: '>= 3.0.0'
- type: software
  title: withr
  abstract: 'withr: Run Code ''With'' Temporarily Modified Global State'
  notes: Suggests
  url: https://withr.r-lib.org
  repository: https://CRAN.R-project.org/package=withr
  authors:
  - family-names: Hester
    given-names: Jim
  - family-names: Henry
    given-names: Lionel
    email: lionel@posit.co
  - family-names: Müller
    given-names: Kirill
    email: krlmlr+r@mailbox.org
  - family-names: Ushey
    given-names: Kevin
    email: kevinushey@gmail.com
  - family-names: Wickham
    given-names: Hadley
    email: hadley@posit.co
  - family-names: Chang
    given-names: Winston
  year: '2024'
  doi: 10.32614/CRAN.package.withr
- type: software
  title: jsonlite
  abstract: 'jsonlite: A Simple and Robust JSON Parser and Generator for R'
  notes: Imports
  url: https://jeroen.r-universe.dev/jsonlite
  repository: https://CRAN.R-project.org/package=jsonlite
  authors:
  - family-names: Ooms
    given-names: Jeroen
    email: jeroenooms@gmail.com
    orcid: https://orcid.org/0000-0002-4035-0289
  year: '2024'
  doi: 10.32614/CRAN.package.jsonlite
- type: software
  title: lime
  abstract: 'lime: Local Interpretable Model-Agnostic Explanations'
  notes: Imports
  url: https://lime.data-imaginist.com
  repository: https://CRAN.R-project.org/package=lime
  authors:
  - family-names: Hvitfeldt
    given-names: Emil
    email: emilhhvitfeldt@gmail.com
    orcid: https://orcid.org/0000-0002-0679-1945
  - family-names: Pedersen
    given-names: Thomas Lin
    email: thomasp85@gmail.com
    orcid: https://orcid.org/0000-0002-5147-4711
  - family-names: Benesty
    given-names: Michaël
    email: michael@benesty.fr
  year: '2024'
  doi: 10.32614/CRAN.package.lime
- type: software
  title: quanteda
  abstract: 'quanteda: Quantitative Analysis of Textual Data'
  notes: Imports
  url: https://quanteda.io
  repository: https://CRAN.R-project.org/package=quanteda
  authors:
  - family-names: Benoit
    given-names: Kenneth
    email: kbenoit@lse.ac.uk
    orcid: https://orcid.org/0000-0002-0797-564X
  - family-names: Watanabe
    given-names: Kohei
    email: watanabe.kohei@gmail.com
    orcid: https://orcid.org/0000-0001-6519-5265
  - family-names: Wang
    given-names: Haiyan
    email: whyinsa@yahoo.com
    orcid: https://orcid.org/0000-0003-4992-4311
  - family-names: Nulty
    given-names: Paul
    email: paul.nulty@gmail.com
    orcid: https://orcid.org/0000-0002-7214-4666
  - family-names: Obeng
    given-names: Adam
    email: quanteda@binaryeagle.com
    orcid: https://orcid.org/0000-0002-2906-4775
  - family-names: Müller
    given-names: Stefan
    email: stefan.mueller@ucd.ie
    orcid: https://orcid.org/0000-0002-6315-4125
  - family-names: Matsuo
    given-names: Akitaka
    email: a.matsuo@essex.ac.uk
    orcid: https://orcid.org/0000-0002-3323-6330
  - family-names: Lowe
    given-names: William
    email: lowe@hertie-school.org
    orcid: https://orcid.org/0000-0002-1549-6163
  year: '2024'
  doi: 10.32614/CRAN.package.quanteda
- type: software
  title: reticulate
  abstract: 'reticulate: Interface to ''Python'''
  notes: Imports
  url: https://rstudio.github.io/reticulate/
  repository: https://CRAN.R-project.org/package=reticulate
  authors:
  - family-names: Ushey
    given-names: Kevin
    email: kevin@posit.co
  - family-names: Allaire
    given-names: JJ
    email: jj@posit.co
  - family-names: Tang
    given-names: Yuan
    email: terrytangyuan@gmail.com
    orcid: https://orcid.org/0000-0001-5243-233X
  year: '2024'
  doi: 10.32614/CRAN.package.reticulate
- type: software
  title: utils
  abstract: 'R: A Language and Environment for Statistical Computing'
  notes: Imports
  authors:
  - name: R Core Team
  institution:
    name: R Foundation for Statistical Computing
    address: Vienna, Austria
  year: '2024'
- type: software
  title: stats
  abstract: 'R: A Language and Environment for Statistical Computing'
  notes: Imports
  authors:
  - name: R Core Team
  institution:
    name: R Foundation for Statistical Computing
    address: Vienna, Austria
  year: '2024'
- type: software
  title: 'R: A Language and Environment for Statistical Computing'
  notes: Depends
  url: https://www.R-project.org/
  authors:
  - name: R Core Team
  institution:
    name: R Foundation for Statistical Computing
    address: Vienna, Austria
  year: '2024'
  version: '>= 3.5'

GitHub Events

Total
  • Issues event: 5
  • Watch event: 1
  • Delete event: 2
  • Issue comment event: 1
  • Push event: 11
  • Pull request event: 9
  • Fork event: 1
  • Create event: 4
Last Year
  • Issues event: 5
  • Watch event: 1
  • Delete event: 2
  • Issue comment event: 1
  • Push event: 11
  • Pull request event: 9
  • Fork event: 1
  • Create event: 4

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 28
  • Total pull requests: 16
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 1 day
  • Total issue authors: 8
  • Total pull request authors: 3
  • Average comments per issue: 1.21
  • Average comments per pull request: 0.31
  • Merged pull requests: 14
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 8
  • Average time to close issues: 8 days
  • Average time to close pull requests: 2 days
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.13
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • chainsawriot (21)
  • LuigiC72 (1)
  • ureber (1)
  • barracuda156 (1)
  • bachl (1)
  • cbpuschmann (1)
  • rgaiacs (1)
  • tweedmann (1)
Pull Request Authors
  • chainsawriot (10)
  • ArthurMuehl (4)
  • bachl (2)
Top Labels
Issue Labels
v0.1 (3) bug (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 316 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 4
  • Total maintainers: 1
cran.r-project.org: grafzahl

Supervised Machine Learning for Textual Data Using Transformers and 'Quanteda'

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 316 Last month
Rankings
Stargazers count: 9.8%
Forks count: 28.8%
Dependent packages count: 29.8%
Dependent repos count: 35.5%
Average: 36.4%
Downloads: 78.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.5 depends
  • jsonlite * imports
  • lime * imports
  • quanteda * imports
  • reticulate * imports
  • stats * imports
  • utils * imports
  • quanteda.textmodels * suggests
  • testthat >= 3.0.0 suggests
  • withr * suggests