text

Using Transformers from HuggingFace in R

https://github.com/oscarkjell/text

Science Score: 46.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org, pubmed.ncbi, ncbi.nlm.nih.gov
✓ Committers with academic emails: 3 of 21 committers (14.3%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (20.1%) to scientific vocabulary

Keywords

deep-learning machine-learning nlp transformers

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: OscarKjell
Language: R
Default Branch: master
Homepage: https://r-text.org
Size: 37.8 MB

Statistics

Stars: 153
Watchers: 9
Forks: 31
Open Issues: 7
Releases: 0

Created about 6 years ago · Last pushed 6 months ago

Metadata Files

Readme Changelog

README.Rmd

---
output: github_document #rmarkdown::html_vignette # #rmarkdown::html_vignette
---




```{r}
#| echo: false
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
```

# text 


[![CRAN Status](https://www.r-pkg.org/badges/version/text)](https://CRAN.R-project.org/package=text)
[![Github build status](https://github.com/oscarkjell/text/workflows/R-CMD-check/badge.svg)](https://github.com/oscarkjell/text/actions)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![Lifecycle: maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html#maturing-1)
[![CRAN Downloads](https://cranlogs.r-pkg.org/badges/grand-total/text)](https://CRAN.R-project.org/package=text)
[![codecov](https://codecov.io/gh/oscarkjell/text/branch/master/graph/badge.svg?)](https://app.codecov.io/gh/oscarkjell/text)



## Overview
An R package for analyzing natural language with transformer-based large language models. The `text` package is part of the *R Language Analysis Suite*, which comprises `talk`, `text` and `topics`.

+ [`talk`](https://www.r-talk.org/) transforms voice recordings into text, audio features, or embeddings.

+ [`text`](https://www.r-text.org/) provides many language tasks such as converting digital text into word embeddings.

+ [`topics`](https://www.r-topics.org/) visualizes language patterns into topics to generate psychological insights.

`talk` and `text` offer access to Large Language Models from Hugging Face.

 

 
![](man/figures/talk_text_topics.svg){width=50%}



The *R Language Analysis Suite* is created through a collaboration between psychology and computer science to address research needs and to provide state-of-the-art techniques. The suite is continuously tested on Ubuntu, macOS and Windows using the latest stable R version.

The *text*-package has two main objectives:


* First, to serve R-users as a *point solution* for transforming text to state-of-the-art word embeddings that are ready to be used for downstream tasks. The package provides a user-friendly link to language models based on transformers from [Hugging Face](https://huggingface.co/).


* Second, to serve as an *end-to-end solution* that provides state-of-the-art AI techniques tailored for social and behavioral scientists.


Please reference our tutorial article when using the `text` package: [The text-package: An R-package for Analyzing and Visualizing Human Language Using Natural Language Processing and Deep Learning](https://pubmed.ncbi.nlm.nih.gov/37126041/).



### Short installation guide
Most users only need to run the installation code below. 
If you run into problems or want alternative setups, please see the [Extended Installation Guide](https://www.r-text.org/articles/ext_install_guide.html).

For the text package to work, you first install it in R and then set up the Python packages it requires. 

1. Install the text package (at the moment, the second step only works with the development version of text from GitHub).

[GitHub](https://github.com/) development version:

``` r
# install.packages("devtools")
devtools::install_github("oscarkjell/text")
```

[CRAN](https://CRAN.R-project.org/package=text) version:

``` r
install.packages("text")
```

2. Install and initialize the Python packages required by text:

``` r
library(text)

# Install the Python packages required by text in a conda environment (with defaults).
textrpp_install()

# Initialize the installed conda environment.
# save_profile = TRUE saves the settings so that you don't have to run textrpp_initialize() after restarting R. 
textrpp_initialize(save_profile = TRUE)
```
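
After both steps have run, it can be useful to confirm that R actually sees the Python modules. The sketch below is not part of text: `check_python_modules()` is a hypothetical helper built on `reticulate::py_module_available()`, and the module names `torch` and `transformers` are assumptions about what the environment should contain.

```r
# Hypothetical helper (not part of the text package): report which Python
# modules can be imported from the currently active Python environment.
# `is_available` defaults to reticulate's checker and can be swapped out.
check_python_modules <- function(modules,
                                 is_available = reticulate::py_module_available) {
  status <- vapply(modules, is_available, logical(1))
  data.frame(
    module = modules,
    available = unname(status),
    stringsAsFactors = FALSE
  )
}

# Example call (assumed module names; run after textrpp_initialize()):
# check_python_modules(c("torch", "transformers"))
```

If a module is reported as unavailable, the Extended Installation Guide linked above is the place to look for alternatives.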


### Point solution for transforming text to embeddings
Recent advances in NLP research have resulted in improved representations of human language (i.e., language models). These language models have produced large performance gains in tasks related to understanding human language. The text package makes these state-of-the-art models easily accessible through an interface to [HuggingFace](https://huggingface.co/docs/transformers/index) in Python.

*Text* provides access to many contemporary state-of-the-art language models that use deep learning to model word order and context. Multilingual language models can also represent several languages; for example, multilingual BERT covers *104 different languages*. 

*Table 1. Some of the available language models*
``` {r HuggingFface_tabble_short, echo=FALSE, results='asis'}
library(magrittr)

Models <- c("'bert-base-uncased'",
            "'roberta-base'",
            "'distilbert-base-cased'",
            "'bert-base-multilingual-cased'",
            "'xlm-roberta-large'"
            )

References <- c("[Devlin et al. 2019](https://aclanthology.org/N19-1423/)",
                "[Liu et al. 2019](https://arxiv.org/abs/1907.11692)",
                "[Sanh et al., 2019](https://arxiv.org/abs/1910.01108)",
                "[Devlin et al. 2019](https://aclanthology.org/N19-1423/)",
                "[Liu et al](https://arxiv.org/pdf/1907.11692)"
                )

Layers <- c("12",
            "12", 
            "6",
            "12",
            "24")

Language <- c("English",
              "English", 
              "English",
              "[104 top languages at Wikipedia](https://meta.wikimedia.org/wiki/List_of_Wikipedias)",
              "[100 languages](https://huggingface.co/docs/transformers/multilingual)")

Dimensions <- c("768", 
                "768", 
                "768", 
                "768", 
                "1024")

Tables_short <- tibble::tibble(Models, References, Layers, Dimensions, Language)

knitr::kable(Tables_short, caption="", bootstrap_options = c("hover"), full_width = T)
```
  
See [HuggingFace](https://huggingface.co/models/) for a more comprehensive list of models. 


The `textEmbed()` function is the main embedding function in text. It outputs contextualized embeddings for tokens (i.e., an embedding for each word instance in each text) and for texts (i.e., a single embedding per text, obtained by aggregating all token embeddings of that text).

```{r short_word_embedding_example, eval = FALSE, warning=FALSE, message=FALSE}
library(text)
# Transform the text data to BERT word embeddings

# Example text
texts <- c("I feel great!")

# Defaults
embeddings <- textEmbed(texts)
embeddings
```
See [Get Started](https://www.r-text.org/articles/text.html) for more information. 
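
To make the step from token embeddings to a text embedding concrete, here is a minimal base-R sketch. It assumes mean aggregation and uses made-up numbers rather than model output; `aggregate_tokens()` is a hypothetical helper for illustration, not a function from text.

```r
# Toy token embeddings: 3 tokens x 4 dimensions (made-up values, not model output).
token_embeddings <- matrix(
  c(0.2, 0.4, 0.6, 0.8,
    0.0, 0.2, 0.4, 0.6,
    0.4, 0.6, 0.8, 1.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("i", "feel", "great"), paste0("Dim", 1:4))
)

# Hypothetical helper: collapse the token rows into a single text embedding
# by applying `fun` (here the mean) to each dimension.
aggregate_tokens <- function(tokens, fun = mean) {
  apply(tokens, 2, fun)
}

# One value per dimension, i.e. one embedding for the whole text.
text_embedding <- aggregate_tokens(token_embeddings)
text_embedding
```

Passing another function, such as `min` or `max`, gives a different aggregate over the same token embeddings.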

### Language Analysis Tasks
The package also gives access to many language analysis tasks, such as `textClassify()`, `textGeneration()`, and `textTranslate()`.

```{r language_analysis_task_examples, eval = FALSE, warning=FALSE, message=FALSE}
library(text)

# Generate text from the prompt "I am happy to"
generated_text <- textGeneration("I am happy to",
                                 model = "gpt2")
generated_text
```

For a full list of the language analysis tasks supported in text, see the [References](https://www.r-text.org/reference/index.html).

### An end-to-end package
*Text* also provides functions to analyze word embeddings with well-tested machine learning algorithms and statistics. The focus is on analyzing and visualizing texts and their relation to other texts or numerical variables. For example, the `textTrain()` function examines how well the word embeddings from a text can predict a numeric or categorical variable. Other functions plot statistically significant words in the word embedding space.

```{r DPP_plot, message=FALSE, warning=FALSE}
library(text) 
# Use the example data included in the package (DP_projections_HILS_SWLS_100),
# which have been pre-processed with the textProjectionData function.
plot_projection <- textProjectionPlot(
  word_data = DP_projections_HILS_SWLS_100,
  y_axes = TRUE,
  title_top = " Supervised Bicentroid Projection of Harmony in life words",
  x_axes_label = "Low vs. High HILS score",
  y_axes_label = "Low vs. High SWLS score",
  position_jitter_hight = 0.5,
  position_jitter_width = 0.8
)
plot_projection$final_plot

```
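
The idea of examining how well embeddings predict a numeric variable can be sketched in base R. This is an illustration under stated assumptions, not the implementation of `textTrain()`: the embeddings and outcome are simulated, ordinary least squares via `lm()` is swapped in as the model, and plain 10-fold cross-validation is used to obtain out-of-sample predictions.

```r
set.seed(1)

# Simulated data (assumption): 100 texts with 5-dimensional "embeddings"
# and a numeric outcome that depends on the first three dimensions.
n <- 100
d <- 5
embeddings <- matrix(rnorm(n * d), nrow = n,
                     dimnames = list(NULL, paste0("Dim", 1:d)))
outcome <- as.numeric(embeddings %*% c(1, -0.5, 0.25, 0, 0) + rnorm(n, sd = 0.5))
dat <- data.frame(y = outcome, embeddings)

# Assign each text to one of k folds.
k <- 10
folds <- sample(rep(seq_len(k), length.out = n))

# For each fold, fit on the remaining folds and predict the held-out texts.
predicted <- numeric(n)
for (i in seq_len(k)) {
  held_out <- folds == i
  fit <- lm(y ~ ., data = dat[!held_out, ])
  predicted[held_out] <- predict(fit, newdata = dat[held_out, ])
}

# Correlation between out-of-sample predictions and observed scores.
cv_correlation <- cor(predicted, outcome)
cv_correlation
```

Because every prediction comes from a model that never saw that text, the correlation reflects predictive accuracy rather than in-sample fit.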


### Featured Bluesky Post

```{r, echo = FALSE, results = 'asis'}
cat('

Version 1.3 of the #r-text package is now available from #CRAN. 

This new version makes it easier to apply pre-trained language assessments from the #LBAM-library (r-text.org/articles/LBA...).

#mlsky #PsychSciSky #Statistics #PsychSciSky #StatsSky #NLP

— Oscar Kjell (@oscarkjell.bsky.social) Dec 22, 2024 at 9:48


')

```

Owner

Name: Oscar Kjell
Login: OscarKjell
Kind: user
Location: Sweden

Website: https://oscarkjell.se
Twitter: OscarKjell
Repositories: 1
Profile: https://github.com/OscarKjell

GitHub Events

Total

Issues event: 24
Watch event: 21
Delete event: 9
Issue comment event: 36
Push event: 236
Pull request review comment event: 3
Pull request review event: 4
Pull request event: 23
Fork event: 1
Create event: 11

Last Year

Issues event: 24
Watch event: 21
Delete event: 9
Issue comment event: 36
Push event: 236
Pull request review comment event: 3
Pull request review event: 4
Pull request event: 23
Fork event: 1
Create event: 11

Committers

Last synced: over 1 year ago

All Time

Total Commits: 1,442
Total Committers: 21
Avg Commits per committer: 68.667
Development Distribution Score (DDS): 0.379

Past Year

Commits: 358
Committers: 12
Avg Commits per committer: 29.833
Development Distribution Score (DDS): 0.589

Top Committers

Name	Email	Commits
Oscar Kjell	o**l@h**m	896
Oscar Kjell	o**l@O**l	194
moomoofarm1	4****1	149
CarlViggo	c**o@i**m	77
Oscar Kjell	o**l@O**l	45
Salvatore Giorgi	s**i@g**m	19
Salvatore Giorgi	s**i@g**m	19
LeonAckermann	l**n@g**m	10
Adithya V Ganesan	v**n@g**m	9
Mingcen Wei (sAy)	4****i	4
Matt Cowgill	m**l@g**m	3
andy	h**s@c**u	3
AugustNilsson	6****n	3
Daniel Hamngren	d**l@w**m	2
Humbert Costas	h**s@g**m	2
Andrej Pawluczenko	a**o@g**m	2
George Ostrouchov	g**c@u**u	1
Teun van den Brand	t**d@g**m	1
oskarbang	7****g	1
Dustin Stoltz	6****z	1
Vasudha Varadarajan	v**n@c**u	1

Committer Domains (Top 20 + Academic)

cronus.cs.stonybrook.edu: 1
utk.edu: 1
worddiagnostics.com: 1
cs.stonybrook.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 71
Total pull requests: 160
Average time to close issues: 4 months
Average time to close pull requests: 2 days
Total issue authors: 46
Total pull request authors: 21
Average comments per issue: 2.83
Average comments per pull request: 0.11
Merged pull requests: 127
Bot issues: 0
Bot pull requests: 1

Past Year

Issues: 12
Pull requests: 27
Average time to close issues: 21 days
Average time to close pull requests: about 14 hours
Issue authors: 11
Pull request authors: 6
Average comments per issue: 1.92
Average comments per pull request: 0.07
Merged pull requests: 20
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

moomoofarm1 (6)
LuigiC72 (5)
tressoldi (4)
sebsilas (4)
scm1210 (3)
lilchow (3)
massimoaria (2)
lingdoc (2)
NewUser36 (2)
adamramey (2)
cenotechnology (1)
nelsonlrdsantos (1)
MattCowgill (1)
maria-pro (1)
promothesh (1)

Pull Request Authors

CarlViggo (86)
OscarKjell (32)
moomoofarm1 (27)
adithya8 (19)
LeonAckermann (19)
soni-n (13)
Marwolaeth (3)
dustinstoltz (2)
MattCowgill (2)
AugustNilsson (2)
teunbrand (2)
mingcenwei (2)
vasevarad (2)
sjgiorgi (2)
michaelgrund (2)

Top Labels

Issue Labels

Pull Request Labels

dependencies (1)

Packages

Total packages: 3
Total downloads:
- cran 1,893 last-month

Total dependent packages: 2
(may contain duplicates)
Total dependent repositories: 1
(may contain duplicates)
Total versions: 36
Total maintainers: 1

proxy.golang.org: github.com/oscarkjell/text

Documentation: https://pkg.go.dev/github.com/oscarkjell/text#section-documentation
Latest release: v1.4.0
published 11 months ago

Versions: 11
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 5.5%

Average: 5.6%

Dependent repos count: 5.8%

Last synced: 6 months ago

proxy.golang.org: github.com/OscarKjell/text

Documentation: https://pkg.go.dev/github.com/OscarKjell/text#section-documentation
Latest release: v1.4.0
published 11 months ago

Versions: 11
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 5.5%

Average: 5.6%

Dependent repos count: 5.8%

Last synced: 6 months ago

cran.r-project.org: text

Analyses of Text using Transformers Models from HuggingFace, Natural Language Processing and Machine Learning

Homepage: https://r-text.org/
Documentation: http://cran.r-project.org/web/packages/text/text.pdf
License: GPL-3
Latest release: 1.7.0
published 6 months ago

Versions: 14
Dependent Packages: 2
Dependent Repositories: 1
Downloads: 1,893 Last month

Rankings

Stargazers count: 3.6%

Forks count: 3.8%

Average: 11.2%

Dependent packages count: 13.6%

Dependent repos count: 23.8%

Maintainers (1)

oscar.kjell@psy.lu.se

Last synced: 6 months ago

Dependencies

.github/workflows/System specific installation WithPy.yaml actions

actions/cache v1 composite
actions/checkout v2 composite
goanpeca/setup-miniconda v1 composite
r-lib/actions/setup-pandoc v2-branch composite
r-lib/actions/setup-r v2-branch composite

.github/workflows/Virtual-Environment-Test.yaml actions

actions/cache v1 composite
actions/checkout v2 composite
actions/setup-python v2 composite
r-lib/actions/setup-pandoc v2-branch composite
r-lib/actions/setup-r v2-branch composite

.github/workflows/dont run/not now in use/New.yaml actions

actions/checkout v2 composite
goanpeca/setup-miniconda v1 composite
r-lib/actions/setup-pandoc master composite
r-lib/actions/setup-r master composite

.github/workflows/dont run/not now in use/System specific installation NoPy.yaml actions

actions/cache v1 composite
actions/checkout v2 composite
r-lib/actions/setup-pandoc master composite
r-lib/actions/setup-r master composite

.github/workflows/test-coverage-RCMD.yaml actions

actions/cache v1 composite
actions/checkout v2 composite
actions/setup-python v2 composite
goanpeca/setup-miniconda v1 composite
r-lib/actions/setup-pandoc v2-branch composite
r-lib/actions/setup-r v2-branch composite

.github/workflows/test-coverage.yaml actions

actions/cache v1 composite
actions/checkout v2 composite
actions/setup-python v2 composite
goanpeca/setup-miniconda v1 composite
r-lib/actions/setup-pandoc v2-branch composite
r-lib/actions/setup-r v2-branch composite

DESCRIPTION cran

R >= 4.00 depends
cowplot * imports
dplyr * imports
furrr * imports
future * imports
ggplot2 * imports
ggrepel * imports
magrittr * imports
overlapping * imports
parsnip * imports
purrr * imports
recipes * imports
reticulate * imports
rlang * imports
rsample * imports
stringi * imports
tibble * imports
tidyr * imports
tune * imports
workflows * imports
yardstick * imports
covr * suggests
glmnet * suggests
knitr * suggests
randomForest * suggests
ranger * suggests
rio * suggests
rmarkdown * suggests
testthat * suggests
utils * suggests
xml2 * suggests