grafzahl

🧛 fine-tuning Transformers for text data from within R

https://github.com/gesistsa/grafzahl

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: gesistsa
License: gpl-3.0
Language: R
Default Branch: v0.1
Homepage: https://gesistsa.github.io/grafzahl/
Size: 5.04 MB

Statistics

Stars: 42
Watchers: 4
Forks: 3
Open Issues: 5
Releases: 3

Created almost 4 years ago · Last pushed 10 months ago

Metadata Files

Readme License Citation

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# grafzahl 


[![CRAN status](https://www.r-pkg.org/badges/version/grafzahl)](https://CRAN.R-project.org/package=grafzahl)
[![R-CMD-check](https://github.com/gesistsa/grafzahl/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/gesistsa/grafzahl/actions/workflows/R-CMD-check.yaml)


The goal of grafzahl (**G**racious **R** **A**nalytical **F**ramework for **Z**appy **A**nalysis of **H**uman **L**anguages [^1]) is to duct tape the [quanteda](https://github.com/quanteda/quanteda) ecosystem to modern [Transformer-based text classification models](https://simpletransformers.ai/), e.g. BERT, RoBERTa, etc. The model object looks and feels like the textmodel S3 object from the package [quanteda.textmodels](https://github.com/quanteda/quanteda.textmodels).

If you don't know what I am talking about, don't worry, this package is gracious. You don't need to know a lot about Transformers to use this package. See the examples below.

Please cite this software as:

Chan, C. (2023). [grafzahl: fine-tuning Transformers for text data from within R](paper/grafzahl_sp.pdf). *Computational Communication Research* 5(1): 76-84. [https://doi.org/10.5117/CCR2023.1.003.CHAN](https://doi.org/10.5117/CCR2023.1.003.CHAN)

## Installation: Local environment

Install the CRAN version

```r
install.packages("grafzahl")
```

After that, you need to set up your conda environment

```r
require(grafzahl)
setup_grafzahl(cuda = TRUE) ## if you have GPU(s)
```

## On remote environments, e.g. Google Colab

On Google Colab, you need to enable non-Conda mode

```r
install.packages("grafzahl")
require(grafzahl)
use_nonconda()
```

Please refer to the vignette.

## Usage

Suppose you have a bunch of tweets in the quanteda corpus format, and the corpus has exactly one docvar that denotes the labels you want to predict. The data is from [this repository](https://github.com/pablobarbera/incivility-sage-open) (Theocharis et al., 2020).

```{r, echo = FALSE, message = FALSE}
devtools::load_all()
```

```{r}
unciviltweets
```
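
If your own data is not yet in this shape, a corpus with a single label docvar could be built roughly as follows. This is a minimal sketch: the texts, the labels, and the names `toy`, `uncivil`, and `toy_corpus` are made up for illustration and are not part of grafzahl.

```r
library(quanteda)

## Hypothetical toy data: the texts and labels are made up for illustration.
toy <- data.frame(
  text = c("What a thoughtful reply, thank you.",
           "You are a complete idiot.",
           "Interesting point about the budget.",
           "Shut up, nobody asked you."),
  uncivil = c(0, 1, 0, 1)
)

## corpus() on a data.frame takes the text from `text_field`;
## the remaining columns become docvars.
toy_corpus <- corpus(toy, text_field = "text")

## Check that exactly one docvar (the label) is attached.
docvars(toy_corpus)
```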

To train a Transformer model, select a `model_name` from [Hugging Face's list](https://huggingface.co/models). The table below lists some common choices. In most cases, providing `model_name` is sufficient; there is no need to provide `model_type`.

Suppose you want to train a Transformer model using "bertweet" (Nguyen et al., 2020) because it matches your domain of usage. By default, the model is saved in the `output` directory under the current working directory. You can change this location with the `output_dir` parameter.

```r
model <- grafzahl(unciviltweets, model_type = "bertweet", model_name = "vinai/bertweet-base")
### If you are a hardcore quanteda user:
## model <- textmodel_transformer(unciviltweets,
##                                model_type = "bertweet", model_name = "vinai/bertweet-base")
```

Make predictions

```r
predict(model)
```
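
To get a rough sense of how good the predictions are, you could cross-tabulate them against the known labels. The sketch below uses base R only and assumes that `predict()` gives one predicted label per document; the vectors `truth` and `predicted` are hypothetical stand-ins, not real output.

```r
## Hypothetical vectors standing in for the known labels and for the
## output of predict(model); both are made up for illustration.
truth     <- c(0, 1, 0, 1, 1, 0)
predicted <- c(0, 1, 1, 1, 0, 0)

## Confusion matrix: rows are true labels, columns are predicted labels.
confusion <- table(truth = truth, predicted = predicted)
confusion

## Accuracy is the share of documents on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```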

That is it.

## Extended examples

Several extended examples are also available.

| Examples                                        | file                                           |
|-------------------------------------------------|------------------------------------------------|
| van Atteveldt et al. (2021)                     | [paper/vanatteveldt.md](paper/vanatteveldt.md) |
| Dobbrick et al. (2021)                          | [paper/dobbrick.md](paper/dobbrick.md)         |
| Theocharis et al. (2020)                        | [paper/theocharis.md](paper/theocharis.md)     |
| OffensEval-TR (2020)                            | [paper/coltekin.md](paper/coltekin.md)         |
| Amharic News Text classification Dataset (2021) | [paper/azime.md](paper/azime.md)               |

## Some common choices of `model_name`

| Your data         | model_type | model_name                         |
|-------------------|------------|------------------------------------|
| English tweets    | bertweet   | vinai/bertweet-base                |
| Lightweight       | mobilebert | google/mobilebert-uncased          |
|                   | distilbert | distilbert-base-uncased            |
| Long Text         | longformer | allenai/longformer-base-4096       |
|                   | bigbird    | google/bigbird-roberta-base        |
| English (General) | bert       | bert-base-uncased                  |
|                   | bert       | bert-base-cased                    |
|                   | electra    | google/electra-small-discriminator |
|                   | roberta    | roberta-base                       |
| Multilingual      | xlm        | xlm-mlm-17-1280                    |
|                   | xlm        | xlm-mlm-100-1280                   |
|                   | bert       | bert-base-multilingual-cased       |
|                   | xlmroberta | xlm-roberta-base                   |
|                   | xlmroberta | xlm-roberta-large                  |

# References

1. Theocharis, Y., Barberá, P., Fazekas, Z., & Popa, S. A. (2020). The dynamics of political incivility on Twitter. Sage Open, 10(2), 2158244020919447.
2. Nguyen, D. Q., Vu, T., & Nguyen, A. T. (2020). BERTweet: A pre-trained language model for English Tweets. arXiv preprint arXiv:2005.10200.

---
[^1]: Yes, I totally made up the meaningless long name. Actually, it is the German name of the *Sesame Street* character [Count von Count](https://de.wikipedia.org/wiki/Sesamstra%C3%9Fe#Graf_Zahl), meaning "Count (the noble title) Number". And it seems to be compulsory to name absolutely everything related to Transformers after Sesame Street characters.

Owner

Name: Transparent Social Analytics
Login: gesistsa
Kind: organization
Location: Germany

Repositories: 2
Profile: https://github.com/gesistsa

Open Science Tools maintained by Transparent Social Analytics Team, GESIS

Citation (CITATION.cff)

# --------------------------------------------
# CITATION file created with {cffr} R package
# See also: https://docs.ropensci.org/cffr/
# --------------------------------------------
 
cff-version: 1.2.0
message: 'To cite package "grafzahl" in publications use:'
type: software
license: GPL-3.0-or-later
title: 'grafzahl: Supervised Machine Learning for Textual Data Using Transformers
  and ''Quanteda'''
version: 0.0.11
doi: 10.5117/CCR2023.1.003.CHAN
identifiers:
- type: doi
  value: 10.32614/CRAN.package.grafzahl
abstract: 'Duct tape the ''quanteda'' ecosystem (Benoit et al., 2018) <https://doi.org/10.21105/joss.00774>
  to modern Transformer-based text classification models (Wolf et al., 2020) <https://doi.org/10.18653/v1/2020.emnlp-demos.6>,
  in order to facilitate supervised machine learning for textual data. This package
  mimics the behaviors of ''quanteda.textmodels'' and provides a function to setup
  the ''Python'' environment to use the pretrained models from ''Hugging Face'' <https://huggingface.co/>.
  More information: <https://doi.org/10.5117/CCR2023.1.003.CHAN>.'
authors:
- family-names: Chan
  given-names: Chung-hong
  email: chainsawtiney@gmail.com
  orcid: https://orcid.org/0000-0002-6232-7530
preferred-citation:
  type: article
  title: 'grafzahl: fine-tuning Transformers for text data from within R.'
  authors:
  - family-names: Chan
    given-names: Chung-hong
    email: chainsawtiney@gmail.com
    orcid: https://orcid.org/0000-0002-6232-7530
  journal: Computational Communication Research
  doi: 10.5117/CCR2023.1.003.CHAN
  volume: '5'
  issue: '1'
  year: '2023'
  start: 76-84
repository: https://CRAN.R-project.org/package=grafzahl
repository-code: https://github.com/gesistsa/grafzahl
url: https://gesistsa.github.io/grafzahl/
contact:
- family-names: Chan
  given-names: Chung-hong
  email: chainsawtiney@gmail.com
  orcid: https://orcid.org/0000-0002-6232-7530
references:
- type: software
  title: knitr
  abstract: 'knitr: A General-Purpose Package for Dynamic Report Generation in R'
  notes: Suggests
  url: https://yihui.org/knitr/
  repository: https://CRAN.R-project.org/package=knitr
  authors:
  - family-names: Xie
    given-names: Yihui
    email: xie@yihui.name
    orcid: https://orcid.org/0000-0003-0645-5666
  year: '2024'
  doi: 10.32614/CRAN.package.knitr
- type: software
  title: rmarkdown
  abstract: 'rmarkdown: Dynamic Documents for R'
  notes: Suggests
  url: https://pkgs.rstudio.com/rmarkdown/
  repository: https://CRAN.R-project.org/package=rmarkdown
  authors:
  - family-names: Allaire
    given-names: JJ
    email: jj@posit.co
  - family-names: Xie
    given-names: Yihui
    email: xie@yihui.name
    orcid: https://orcid.org/0000-0003-0645-5666
  - family-names: Dervieux
    given-names: Christophe
    email: cderv@posit.co
    orcid: https://orcid.org/0000-0003-4474-2498
  - family-names: McPherson
    given-names: Jonathan
    email: jonathan@posit.co
  - family-names: Luraschi
    given-names: Javier
  - family-names: Ushey
    given-names: Kevin
    email: kevin@posit.co
  - family-names: Atkins
    given-names: Aron
    email: aron@posit.co
  - family-names: Wickham
    given-names: Hadley
    email: hadley@posit.co
  - family-names: Cheng
    given-names: Joe
    email: joe@posit.co
  - family-names: Chang
    given-names: Winston
    email: winston@posit.co
  - family-names: Iannone
    given-names: Richard
    email: rich@posit.co
    orcid: https://orcid.org/0000-0003-3925-190X
  year: '2024'
  doi: 10.32614/CRAN.package.rmarkdown
- type: software
  title: testthat
  abstract: 'testthat: Unit Testing for R'
  notes: Suggests
  url: https://testthat.r-lib.org
  repository: https://CRAN.R-project.org/package=testthat
  authors:
  - family-names: Wickham
    given-names: Hadley
    email: hadley@posit.co
  year: '2024'
  doi: 10.32614/CRAN.package.testthat
  version: '>= 3.0.0'
- type: software
  title: withr
  abstract: 'withr: Run Code ''With'' Temporarily Modified Global State'
  notes: Suggests
  url: https://withr.r-lib.org
  repository: https://CRAN.R-project.org/package=withr
  authors:
  - family-names: Hester
    given-names: Jim
  - family-names: Henry
    given-names: Lionel
    email: lionel@posit.co
  - family-names: Müller
    given-names: Kirill
    email: krlmlr+r@mailbox.org
  - family-names: Ushey
    given-names: Kevin
    email: kevinushey@gmail.com
  - family-names: Wickham
    given-names: Hadley
    email: hadley@posit.co
  - family-names: Chang
    given-names: Winston
  year: '2024'
  doi: 10.32614/CRAN.package.withr
- type: software
  title: jsonlite
  abstract: 'jsonlite: A Simple and Robust JSON Parser and Generator for R'
  notes: Imports
  url: https://jeroen.r-universe.dev/jsonlite
  repository: https://CRAN.R-project.org/package=jsonlite
  authors:
  - family-names: Ooms
    given-names: Jeroen
    email: jeroenooms@gmail.com
    orcid: https://orcid.org/0000-0002-4035-0289
  year: '2024'
  doi: 10.32614/CRAN.package.jsonlite
- type: software
  title: lime
  abstract: 'lime: Local Interpretable Model-Agnostic Explanations'
  notes: Imports
  url: https://lime.data-imaginist.com
  repository: https://CRAN.R-project.org/package=lime
  authors:
  - family-names: Hvitfeldt
    given-names: Emil
    email: emilhhvitfeldt@gmail.com
    orcid: https://orcid.org/0000-0002-0679-1945
  - family-names: Pedersen
    given-names: Thomas Lin
    email: thomasp85@gmail.com
    orcid: https://orcid.org/0000-0002-5147-4711
  - family-names: Benesty
    given-names: Michaël
    email: michael@benesty.fr
  year: '2024'
  doi: 10.32614/CRAN.package.lime
- type: software
  title: quanteda
  abstract: 'quanteda: Quantitative Analysis of Textual Data'
  notes: Imports
  url: https://quanteda.io
  repository: https://CRAN.R-project.org/package=quanteda
  authors:
  - family-names: Benoit
    given-names: Kenneth
    email: kbenoit@lse.ac.uk
    orcid: https://orcid.org/0000-0002-0797-564X
  - family-names: Watanabe
    given-names: Kohei
    email: watanabe.kohei@gmail.com
    orcid: https://orcid.org/0000-0001-6519-5265
  - family-names: Wang
    given-names: Haiyan
    email: whyinsa@yahoo.com
    orcid: https://orcid.org/0000-0003-4992-4311
  - family-names: Nulty
    given-names: Paul
    email: paul.nulty@gmail.com
    orcid: https://orcid.org/0000-0002-7214-4666
  - family-names: Obeng
    given-names: Adam
    email: quanteda@binaryeagle.com
    orcid: https://orcid.org/0000-0002-2906-4775
  - family-names: Müller
    given-names: Stefan
    email: stefan.mueller@ucd.ie
    orcid: https://orcid.org/0000-0002-6315-4125
  - family-names: Matsuo
    given-names: Akitaka
    email: a.matsuo@essex.ac.uk
    orcid: https://orcid.org/0000-0002-3323-6330
  - family-names: Lowe
    given-names: William
    email: lowe@hertie-school.org
    orcid: https://orcid.org/0000-0002-1549-6163
  year: '2024'
  doi: 10.32614/CRAN.package.quanteda
- type: software
  title: reticulate
  abstract: 'reticulate: Interface to ''Python'''
  notes: Imports
  url: https://rstudio.github.io/reticulate/
  repository: https://CRAN.R-project.org/package=reticulate
  authors:
  - family-names: Ushey
    given-names: Kevin
    email: kevin@posit.co
  - family-names: Allaire
    given-names: JJ
    email: jj@posit.co
  - family-names: Tang
    given-names: Yuan
    email: terrytangyuan@gmail.com
    orcid: https://orcid.org/0000-0001-5243-233X
  year: '2024'
  doi: 10.32614/CRAN.package.reticulate
- type: software
  title: utils
  abstract: 'R: A Language and Environment for Statistical Computing'
  notes: Imports
  authors:
  - name: R Core Team
  institution:
    name: R Foundation for Statistical Computing
    address: Vienna, Austria
  year: '2024'
- type: software
  title: stats
  abstract: 'R: A Language and Environment for Statistical Computing'
  notes: Imports
  authors:
  - name: R Core Team
  institution:
    name: R Foundation for Statistical Computing
    address: Vienna, Austria
  year: '2024'
- type: software
  title: 'R: A Language and Environment for Statistical Computing'
  notes: Depends
  url: https://www.R-project.org/
  authors:
  - name: R Core Team
  institution:
    name: R Foundation for Statistical Computing
    address: Vienna, Austria
  year: '2024'
  version: '>= 3.5'

GitHub Events

Total

Issues event: 5
Watch event: 1
Delete event: 2
Issue comment event: 1
Push event: 11
Pull request event: 9
Fork event: 1
Create event: 4

Last Year

Issues event: 5
Watch event: 1
Delete event: 2
Issue comment event: 1
Push event: 11
Pull request event: 9
Fork event: 1
Create event: 4

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 28
Total pull requests: 16
Average time to close issues: about 1 month
Average time to close pull requests: 1 day
Total issue authors: 8
Total pull request authors: 3
Average comments per issue: 1.21
Average comments per pull request: 0.31
Merged pull requests: 14
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 8
Average time to close issues: 8 days
Average time to close pull requests: 2 days
Issue authors: 2
Pull request authors: 2
Average comments per issue: 1.0
Average comments per pull request: 0.13
Merged pull requests: 7
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

chainsawriot (21)
LuigiC72 (1)
ureber (1)
barracuda156 (1)
bachl (1)
cbpuschmann (1)
rgaiacs (1)
tweedmann (1)

Pull Request Authors

chainsawriot (10)
ArthurMuehl (4)
bachl (2)

Top Labels

Issue Labels

v0.1 (3) bug (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- cran 316 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 4
Total maintainers: 1

cran.r-project.org: grafzahl

Supervised Machine Learning for Textual Data Using Transformers and 'Quanteda'

Homepage: https://gesistsa.github.io/grafzahl/
Documentation: http://cran.r-project.org/web/packages/grafzahl/grafzahl.pdf
License: GPL (≥ 3)
Latest release: 0.0.12
published about 1 year ago

Versions: 4
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 316 Last month

Rankings

Stargazers count: 9.8%

Forks count: 28.8%

Dependent packages count: 29.8%

Dependent repos count: 35.5%

Average: 36.4%

Downloads: 78.1%

Maintainers (1)

chainsawtiney@gmail.com