grafzahl
🧛 fine-tuning Transformers for text data from within R
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 4 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (18.7%) to scientific vocabulary
Last synced: 6 months ago
Repository
🧛 fine-tuning Transformers for text data from within R
Basic Info
- Host: GitHub
- Owner: gesistsa
- License: gpl-3.0
- Language: R
- Default Branch: v0.1
- Homepage: https://gesistsa.github.io/grafzahl/
- Size: 5.04 MB
Statistics
- Stars: 42
- Watchers: 4
- Forks: 3
- Open Issues: 5
- Releases: 3
Created over 3 years ago
· Last pushed 6 months ago
Metadata Files
Readme
License
Citation
README.Rmd
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# grafzahl
[CRAN](https://CRAN.R-project.org/package=grafzahl)
[R-CMD-check](https://github.com/gesistsa/grafzahl/actions/workflows/R-CMD-check.yaml)
The goal of grafzahl (**G**racious **R** **A**nalytical **F**ramework for **Z**appy **A**nalysis of **H**uman **L**anguages [^1]) is to duct tape the [quanteda](https://github.com/quanteda/quanteda) ecosystem to modern [Transformer-based text classification models](https://simpletransformers.ai/), e.g. BERT, RoBERTa, etc. The model object looks and feels like the textmodel S3 object from the package [quanteda.textmodels](https://github.com/quanteda/quanteda.textmodels).
If you don't know what I am talking about, don't worry, this package is gracious. You don't need to know a lot about Transformers to use this package. See the examples below.
Please cite this software as:
Chan, C. (2023). [grafzahl: fine-tuning Transformers for text data from within R](paper/grafzahl_sp.pdf). *Computational Communication Research* 5(1): 76-84. [https://doi.org/10.5117/CCR2023.1.003.CHAN](https://doi.org/10.5117/CCR2023.1.003.CHAN)
## Installation: Local environment
Install the CRAN version
```r
install.packages("grafzahl")
```
After that, you need to set up your conda environment
```r
require(grafzahl)
setup_grafzahl(cuda = TRUE) ## if you have GPU(s)
```
## On remote environments, e.g. Google Colab
On Google Colab, you need to enable non-Conda mode
```r
install.packages("grafzahl")
require(grafzahl)
use_nonconda()
```
Please refer to the vignette.
## Usage
Suppose you have a bunch of tweets in the quanteda corpus format, and the corpus has exactly one docvar that denotes the labels you want to predict. The data is from [this repository](https://github.com/pablobarbera/incivility-sage-open) (Theocharis et al., 2020).
```{r, echo = FALSE, message = FALSE}
devtools::load_all()
```
```{r}
unciviltweets
```
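If your own data is not yet in this shape, the following is a minimal sketch of building a quanteda corpus with exactly one docvar. The texts, the labels, and the docvar name `label` are made-up placeholders for illustration; they are not taken from `unciviltweets`.

```r
library(quanteda)

# Toy texts and labels: illustrative placeholders, not real data.
texts <- c("You are wonderful", "You are an idiot", "Thanks for the help")
labels <- c(0, 1, 0)

# Attach the labels as the single document-level variable (docvar).
toy_corpus <- corpus(texts, docvars = data.frame(label = labels))

# Check the shape: one docvar, one label per document.
n_docvars <- ncol(docvars(toy_corpus))
n_docs <- ndoc(toy_corpus)
```

A corpus built this way should be usable wherever `unciviltweets` is used in the examples that follow, assuming the labels are coded the way your task requires.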
To train a Transformer model, select a `model_name` from [Hugging Face's list](https://huggingface.co/models). The table below lists some common choices. In most cases, providing `model_name` is sufficient and there is no need to provide `model_type`.
Suppose you want to train a Transformer model using "bertweet" (Nguyen et al., 2020) because it matches your domain of usage. By default, the model is saved in the `output` directory under the current directory. You can save it elsewhere using the `output_dir` parameter.
```r
model <- grafzahl(unciviltweets, model_type = "bertweet", model_name = "vinai/bertweet-base")
### If you are a hardcore quanteda user:
## model <- textmodel_transformer(unciviltweets,
## model_type = "bertweet", model_name = "vinai/bertweet-base")
```
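To keep models from different runs apart, the `output_dir` parameter mentioned above can be set explicitly. The sketch below wraps the call in a small helper; the helper name `train_bertweet` and the path `models/bertweet` are hypothetical and only serve as an illustration.

```r
# Hypothetical wrapper: fine-tunes BERTweet and saves the model in `dir`.
# `train_bertweet` and the default path are illustrative names only.
train_bertweet <- function(x, dir = file.path("models", "bertweet")) {
  grafzahl::grafzahl(x,
                     model_type = "bertweet",
                     model_name = "vinai/bertweet-base",
                     output_dir = dir)
}

# Usage (requires the Python environment set up during installation):
# model <- train_bertweet(unciviltweets)
```

Defining the helper does not start any training; that only happens when it is called with a corpus.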
Make predictions
```r
predict(model)
```
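Once you have predictions, you will usually want to compare them with the known labels. The following base R sketch shows one way to do that. The two vectors are hypothetical stand-ins for the true labels and for the output of `predict()`; the exact shape of that output should be checked against the package documentation.

```r
# Hypothetical vectors standing in for true labels and for predict() output.
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)

# Cross-tabulate predictions against the true labels.
confusion <- table(predicted = predicted, actual = actual)

# Accuracy: share of documents on the diagonal of the table.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Precision and recall for the positive class ("1").
precision <- confusion["1", "1"] / sum(confusion["1", ])
recall    <- confusion["1", "1"] / sum(confusion[, "1"])
```

For an honest estimate of performance, the comparison should be made on documents that were not used for training.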
That is it.
## Extended examples
Several extended examples are also available.
| Examples | file |
|-------------------------------------------------|------------------------------------------------|
| van Atteveldt et al. (2021) | [paper/vanatteveldt.md](paper/vanatteveldt.md) |
| Dobbrick et al. (2021) | [paper/dobbrick.md](paper/dobbrick.md) |
| Theocharis et al. (2020) | [paper/theocharis.md](paper/theocharis.md) |
| OffensEval-TR (2020) | [paper/coltekin.md](paper/coltekin.md) |
| Amharic News Text classification Dataset (2021) | [paper/azime.md](paper/azime.md) |
## Some common choices of `model_name`
| Your data | model_type | model_name |
|-------------------|------------|------------------------------------|
| English tweets | bertweet | vinai/bertweet-base |
| Lightweight | mobilebert | google/mobilebert-uncased |
| | distilbert | distilbert-base-uncased |
| Long Text | longformer | allenai/longformer-base-4096 |
| | bigbird | google/bigbird-roberta-base |
| English (General) | bert | bert-base-uncased |
| | bert | bert-base-cased |
| | electra | google/electra-small-discriminator |
| | roberta | roberta-base |
| Multilingual | xlm | xlm-mlm-17-1280 |
| | xlm | xlm-mlm-100-1280 |
| | bert | bert-base-multilingual-cased |
| | xlmroberta | xlm-roberta-base |
| | xlmroberta | xlm-roberta-large |
# References
1. Theocharis, Y., Barberá, P., Fazekas, Z., & Popa, S. A. (2020). The dynamics of political incivility on Twitter. Sage Open, 10(2), 2158244020919447.
2. Nguyen, D. Q., Vu, T., & Nguyen, A. T. (2020). BERTweet: A pre-trained language model for English Tweets. arXiv preprint arXiv:2005.10200.
---
[^1]: Yes, I totally made up the meaningless long name. Actually, it is the German name of the *Sesame Street* character [Count von Count](https://de.wikipedia.org/wiki/Sesamstra%C3%9Fe#Graf_Zahl), meaning "Count (the noble title) Number". And it seems to be compulsory to name absolutely everything related to Transformers after Sesame Street characters.
Owner
- Name: Transparent Social Analytics
- Login: gesistsa
- Kind: organization
- Location: Germany
- Repositories: 2
- Profile: https://github.com/gesistsa
Open Science Tools maintained by Transparent Social Analytics Team, GESIS
Citation (CITATION.cff)
# --------------------------------------------
# CITATION file created with {cffr} R package
# See also: https://docs.ropensci.org/cffr/
# --------------------------------------------
cff-version: 1.2.0
message: 'To cite package "grafzahl" in publications use:'
type: software
license: GPL-3.0-or-later
title: 'grafzahl: Supervised Machine Learning for Textual Data Using Transformers
and ''Quanteda'''
version: 0.0.11
doi: 10.5117/CCR2023.1.003.CHAN
identifiers:
- type: doi
value: 10.32614/CRAN.package.grafzahl
abstract: 'Duct tape the ''quanteda'' ecosystem (Benoit et al., 2018) <https://doi.org/10.21105/joss.00774>
to modern Transformer-based text classification models (Wolf et al., 2020) <https://doi.org/10.18653/v1/2020.emnlp-demos.6>,
in order to facilitate supervised machine learning for textual data. This package
mimics the behaviors of ''quanteda.textmodels'' and provides a function to setup
the ''Python'' environment to use the pretrained models from ''Hugging Face'' <https://huggingface.co/>.
More information: <https://doi.org/10.5117/CCR2023.1.003.CHAN>.'
authors:
- family-names: Chan
given-names: Chung-hong
email: chainsawtiney@gmail.com
orcid: https://orcid.org/0000-0002-6232-7530
preferred-citation:
type: article
title: 'grafzahl: fine-tuning Transformers for text data from within R.'
authors:
- family-names: Chan
given-names: Chung-hong
email: chainsawtiney@gmail.com
orcid: https://orcid.org/0000-0002-6232-7530
journal: Computational Communication Research
doi: 10.5117/CCR2023.1.003.CHAN
volume: '5'
issue: '1'
year: '2023'
start: 76-84
repository: https://CRAN.R-project.org/package=grafzahl
repository-code: https://github.com/gesistsa/grafzahl
url: https://gesistsa.github.io/grafzahl/
contact:
- family-names: Chan
given-names: Chung-hong
email: chainsawtiney@gmail.com
orcid: https://orcid.org/0000-0002-6232-7530
references:
- type: software
title: knitr
abstract: 'knitr: A General-Purpose Package for Dynamic Report Generation in R'
notes: Suggests
url: https://yihui.org/knitr/
repository: https://CRAN.R-project.org/package=knitr
authors:
- family-names: Xie
given-names: Yihui
email: xie@yihui.name
orcid: https://orcid.org/0000-0003-0645-5666
year: '2024'
doi: 10.32614/CRAN.package.knitr
- type: software
title: rmarkdown
abstract: 'rmarkdown: Dynamic Documents for R'
notes: Suggests
url: https://pkgs.rstudio.com/rmarkdown/
repository: https://CRAN.R-project.org/package=rmarkdown
authors:
- family-names: Allaire
given-names: JJ
email: jj@posit.co
- family-names: Xie
given-names: Yihui
email: xie@yihui.name
orcid: https://orcid.org/0000-0003-0645-5666
- family-names: Dervieux
given-names: Christophe
email: cderv@posit.co
orcid: https://orcid.org/0000-0003-4474-2498
- family-names: McPherson
given-names: Jonathan
email: jonathan@posit.co
- family-names: Luraschi
given-names: Javier
- family-names: Ushey
given-names: Kevin
email: kevin@posit.co
- family-names: Atkins
given-names: Aron
email: aron@posit.co
- family-names: Wickham
given-names: Hadley
email: hadley@posit.co
- family-names: Cheng
given-names: Joe
email: joe@posit.co
- family-names: Chang
given-names: Winston
email: winston@posit.co
- family-names: Iannone
given-names: Richard
email: rich@posit.co
orcid: https://orcid.org/0000-0003-3925-190X
year: '2024'
doi: 10.32614/CRAN.package.rmarkdown
- type: software
title: testthat
abstract: 'testthat: Unit Testing for R'
notes: Suggests
url: https://testthat.r-lib.org
repository: https://CRAN.R-project.org/package=testthat
authors:
- family-names: Wickham
given-names: Hadley
email: hadley@posit.co
year: '2024'
doi: 10.32614/CRAN.package.testthat
version: '>= 3.0.0'
- type: software
title: withr
abstract: 'withr: Run Code ''With'' Temporarily Modified Global State'
notes: Suggests
url: https://withr.r-lib.org
repository: https://CRAN.R-project.org/package=withr
authors:
- family-names: Hester
given-names: Jim
- family-names: Henry
given-names: Lionel
email: lionel@posit.co
- family-names: Müller
given-names: Kirill
email: krlmlr+r@mailbox.org
- family-names: Ushey
given-names: Kevin
email: kevinushey@gmail.com
- family-names: Wickham
given-names: Hadley
email: hadley@posit.co
- family-names: Chang
given-names: Winston
year: '2024'
doi: 10.32614/CRAN.package.withr
- type: software
title: jsonlite
abstract: 'jsonlite: A Simple and Robust JSON Parser and Generator for R'
notes: Imports
url: https://jeroen.r-universe.dev/jsonlite
repository: https://CRAN.R-project.org/package=jsonlite
authors:
- family-names: Ooms
given-names: Jeroen
email: jeroenooms@gmail.com
orcid: https://orcid.org/0000-0002-4035-0289
year: '2024'
doi: 10.32614/CRAN.package.jsonlite
- type: software
title: lime
abstract: 'lime: Local Interpretable Model-Agnostic Explanations'
notes: Imports
url: https://lime.data-imaginist.com
repository: https://CRAN.R-project.org/package=lime
authors:
- family-names: Hvitfeldt
given-names: Emil
email: emilhhvitfeldt@gmail.com
orcid: https://orcid.org/0000-0002-0679-1945
- family-names: Pedersen
given-names: Thomas Lin
email: thomasp85@gmail.com
orcid: https://orcid.org/0000-0002-5147-4711
- family-names: Benesty
given-names: Michaël
email: michael@benesty.fr
year: '2024'
doi: 10.32614/CRAN.package.lime
- type: software
title: quanteda
abstract: 'quanteda: Quantitative Analysis of Textual Data'
notes: Imports
url: https://quanteda.io
repository: https://CRAN.R-project.org/package=quanteda
authors:
- family-names: Benoit
given-names: Kenneth
email: kbenoit@lse.ac.uk
orcid: https://orcid.org/0000-0002-0797-564X
- family-names: Watanabe
given-names: Kohei
email: watanabe.kohei@gmail.com
orcid: https://orcid.org/0000-0001-6519-5265
- family-names: Wang
given-names: Haiyan
email: whyinsa@yahoo.com
orcid: https://orcid.org/0000-0003-4992-4311
- family-names: Nulty
given-names: Paul
email: paul.nulty@gmail.com
orcid: https://orcid.org/0000-0002-7214-4666
- family-names: Obeng
given-names: Adam
email: quanteda@binaryeagle.com
orcid: https://orcid.org/0000-0002-2906-4775
- family-names: Müller
given-names: Stefan
email: stefan.mueller@ucd.ie
orcid: https://orcid.org/0000-0002-6315-4125
- family-names: Matsuo
given-names: Akitaka
email: a.matsuo@essex.ac.uk
orcid: https://orcid.org/0000-0002-3323-6330
- family-names: Lowe
given-names: William
email: lowe@hertie-school.org
orcid: https://orcid.org/0000-0002-1549-6163
year: '2024'
doi: 10.32614/CRAN.package.quanteda
- type: software
title: reticulate
abstract: 'reticulate: Interface to ''Python'''
notes: Imports
url: https://rstudio.github.io/reticulate/
repository: https://CRAN.R-project.org/package=reticulate
authors:
- family-names: Ushey
given-names: Kevin
email: kevin@posit.co
- family-names: Allaire
given-names: JJ
email: jj@posit.co
- family-names: Tang
given-names: Yuan
email: terrytangyuan@gmail.com
orcid: https://orcid.org/0000-0001-5243-233X
year: '2024'
doi: 10.32614/CRAN.package.reticulate
- type: software
title: utils
abstract: 'R: A Language and Environment for Statistical Computing'
notes: Imports
authors:
- name: R Core Team
institution:
name: R Foundation for Statistical Computing
address: Vienna, Austria
year: '2024'
- type: software
title: stats
abstract: 'R: A Language and Environment for Statistical Computing'
notes: Imports
authors:
- name: R Core Team
institution:
name: R Foundation for Statistical Computing
address: Vienna, Austria
year: '2024'
- type: software
title: 'R: A Language and Environment for Statistical Computing'
notes: Depends
url: https://www.R-project.org/
authors:
- name: R Core Team
institution:
name: R Foundation for Statistical Computing
address: Vienna, Austria
year: '2024'
version: '>= 3.5'
GitHub Events
Total
- Issues event: 5
- Watch event: 1
- Delete event: 2
- Issue comment event: 1
- Push event: 11
- Pull request event: 9
- Fork event: 1
- Create event: 4
Last Year
- Issues event: 5
- Watch event: 1
- Delete event: 2
- Issue comment event: 1
- Push event: 11
- Pull request event: 9
- Fork event: 1
- Create event: 4
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 28
- Total pull requests: 16
- Average time to close issues: about 1 month
- Average time to close pull requests: 1 day
- Total issue authors: 8
- Total pull request authors: 3
- Average comments per issue: 1.21
- Average comments per pull request: 0.31
- Merged pull requests: 14
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 8
- Average time to close issues: 8 days
- Average time to close pull requests: 2 days
- Issue authors: 2
- Pull request authors: 2
- Average comments per issue: 1.0
- Average comments per pull request: 0.13
- Merged pull requests: 7
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- chainsawriot (21)
- LuigiC72 (1)
- ureber (1)
- barracuda156 (1)
- bachl (1)
- cbpuschmann (1)
- rgaiacs (1)
- tweedmann (1)
Pull Request Authors
- chainsawriot (10)
- ArthurMuehl (4)
- bachl (2)
Top Labels
Issue Labels
v0.1 (3)
bug (1)
Pull Request Labels
Packages
- Total packages: 1
- Total downloads:
  - cran: 316 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 4
- Total maintainers: 1
cran.r-project.org: grafzahl
Supervised Machine Learning for Textual Data Using Transformers and 'Quanteda'
- Homepage: https://gesistsa.github.io/grafzahl/
- Documentation: http://cran.r-project.org/web/packages/grafzahl/grafzahl.pdf
- License: GPL (≥ 3)
- Latest release: 0.0.12 (published 8 months ago)
Rankings
Stargazers count: 9.8%
Forks count: 28.8%
Dependent packages count: 29.8%
Dependent repos count: 35.5%
Average: 36.4%
Downloads: 78.1%
Maintainers (1)
Last synced: 6 months ago
Dependencies
DESCRIPTION
cran
- R >= 3.5 depends
- jsonlite * imports
- lime * imports
- quanteda * imports
- reticulate * imports
- stats * imports
- utils * imports
- quanteda.textmodels * suggests
- testthat >= 3.0.0 suggests
- withr * suggests