pangoling

An R package for estimating the log-probabilities of words in a given context using transformer models.

https://github.com/ropensci/pangoling

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.8%) to scientific vocabulary

Keywords

nlp psycholinguistics r r-package rstats transformers
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 12
  • Watchers: 1
  • Forks: 0
  • Open Issues: 5
  • Releases: 2
Topics
nlp psycholinguistics r r-package rstats transformers
Created over 3 years ago · Last pushed 10 months ago
Metadata Files
Readme Changelog Contributing License Codemeta

README.Rmd

---
output: github_document
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# pangoling 


[![Codecov test coverage](https://codecov.io/gh/ropensci/pangoling/branch/main/graph/badge.svg)](https://app.codecov.io/gh/ropensci/pangoling?branch=main)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-green.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
 [![R-CMD-check](https://github.com/ropensci/pangoling/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/ropensci/pangoling/actions/workflows/R-CMD-check.yaml)
[![Project Status: active](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![DOI](https://zenodo.org/badge/497831295.svg)](https://zenodo.org/badge/latestdoi/497831295)
[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/575_status.svg)](https://github.com/ropensci/software-review/issues/575)
[![CRAN status](https://www.r-pkg.org/badges/version/pangoling)](https://CRAN.R-project.org/package=pangoling)
[![metacran downloads](https://cranlogs.r-pkg.org/badges/grand-total/pangoling)](https://cran.r-project.org/package=pangoling)




`pangoling`^[The logo of the package was created with [Stable Diffusion](https://huggingface.co/spaces/stabilityai/stable-diffusion) and the R package [hexSticker](https://github.com/GuangchuangYu/hexSticker).] is an R package for
estimating the predictability of words in a given context using transformer
models. The package provides an interface to pre-trained transformer
models (such as GPT-2 or BERT) for obtaining word probabilities, which are
often used as predictors in psycholinguistic studies. It is aimed at
researchers in psycholinguistics who want to use transformer models in their
work.
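
As a rough sketch of the quantity involved (the notation here is introduced for illustration and is not taken from the package documentation): for a causal model such as GPT-2, the log-probability of the word $w_i$ is conditioned on the words that precede it, and by the chain rule these values add up to the log-probability of the whole sentence:

$$
\log P(w_1, \dots, w_n) = \sum_{i=1}^{n} \log P(w_i \mid w_1, \dots, w_{i-1})
$$

For a masked model such as BERT, the conditioning context would instead include words on both sides of the target word.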

The package is mostly a wrapper around the Python package [`transformers`](https://pypi.org/project/transformers/), designed to process data in a convenient format.


## Important! Limitations and bias

The training data of the most popular models (such as GPT-2) have not been released, so they cannot be inspected. It is clear that the data contain a lot of unfiltered content from the internet, which is far from neutral. See, for example, the out-of-scope use cases in the [OpenAI team's model card for GPT-2](https://github.com/openai/gpt-2/blob/master/model_card.md#out-of-scope-use-cases) (the same should hold for many other models) and the [limitations and bias section for GPT-2 on the Hugging Face website](https://huggingface.co/gpt2).

## Installation

To install the latest CRAN version of `pangoling` use:

```{r, eval = FALSE}
install.packages("pangoling")
```

To install the latest version from GitHub (through the rOpenSci R-universe repository), use:

```{r, eval = FALSE}
install.packages("pangoling", repos = "https://ropensci.r-universe.dev")
```

The `install_py_pangoling()` function installs the Python packages that `pangoling` needs, using the `reticulate` package to manage the Python environment. This needs to be done only once.

```{r, eval = FALSE}
install_py_pangoling()
```
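
To confirm that the Python side is reachable from R, something like the following sketch may help. It assumes that the relevant Python modules are `transformers` and `torch` (the latter is an assumption, not stated in this README), and `check_py_modules()` is a hypothetical helper, not a `pangoling` function. It relies only on `reticulate::py_module_available()`.

```r
# Hypothetical helper (not part of pangoling): report whether the Python
# modules that pangoling is assumed to rely on can be imported from R.
check_py_modules <- function(modules = c("transformers", "torch")) {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    # reticulate itself is missing, so nothing can be checked
    return(setNames(rep(NA, length(modules)), modules))
  }
  # One TRUE/FALSE per module, named after the module
  vapply(modules, reticulate::py_module_available, logical(1))
}

check_py_modules()
```

A `FALSE` entry would suggest that the installation step above has not been run yet in the active Python environment.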

## Example

This basic example shows how to get log-probabilities of words in a dataset:

```{r, message = FALSE}
library(pangoling)
library(tidytable) # fast alternative to dplyr
```

Given a (toy) dataset where sentences are organized with one word or short phrase in each row:

```{r, cache = TRUE}
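# Two example sentences
sentences <- c("The apple doesn't fall far from the tree.",
               "Don't judge a book by its cover.")
# Split each sentence at spaces and stack the words one per row;
# `sent_n` records which sentence each word came from
(df_sent <- strsplit(x = sentences, split = " ") |>
  map_dfr(.f =  ~ data.frame(word = .x), .id = "sent_n"))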
```

One can get the log-transformed probability of each word based on GPT-2 as follows:

```{r, cache = TRUE}
df_sent <- df_sent |>
  mutate(lp = causal_words_pred(word, by = sent_n))
df_sent
```
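
Log-probabilities like these are often converted to surprisal before being used as predictors. The sketch below uses base R only and made-up probabilities (not real model output), and it assumes the log-probabilities are natural logarithms. The names `df_example`, `surprisal_bits`, and `sentence_lp` are illustrative.

```r
# Illustrative values only, not real model output: natural-log probabilities
# for the words of two short sentences.
df_example <- data.frame(
  sent_n = c(1, 1, 1, 2, 2),
  lp     = log(c(0.5, 0.25, 0.125, 0.5, 0.5))
)

# Surprisal in bits is the negative log-probability expressed in base 2.
df_example$surprisal_bits <- -df_example$lp / log(2)

# By the chain rule, summing word log-probabilities within a sentence gives
# the log-probability of the whole sentence.
sentence_lp <- tapply(df_example$lp, df_example$sent_n, sum)

df_example
sentence_lp
```

The same conversion could be applied to a column such as `lp` in `df_sent`, provided the base of the logarithm is known.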


## How to cite

```{r, comment = NA }
citation("pangoling")
```

## How to contribute

See the [Contributing guidelines](.github/CONTRIBUTING.md). 


## Code of conduct

Please note that this package is released with a [Contributor
Code of Conduct](https://ropensci.org/code-of-conduct/). 
By contributing to this project, you agree to abide by its terms.

## See also

Another R package that acts as a wrapper for [`transformers`](https://pypi.org/project/transformers/) is [`text`](https://r-text.org//). However, `text` is more general, and its focus
is on Natural Language Processing and Machine Learning.

Owner

  • Name: rOpenSci
  • Login: ropensci
  • Kind: organization
  • Email: info@ropensci.org
  • Location: Berkeley, CA

CodeMeta (codemeta.json)

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "identifier": "pangoling",
  "description": "Provides access to word predictability estimates using large language models (LLMs) based on 'transformer' architectures via integration with the 'Hugging Face' ecosystem <https://huggingface.co/>. The package interfaces with pre-trained neural networks and supports both causal/auto-regressive LLMs (e.g., 'GPT-2') and masked/bidirectional LLMs (e.g., 'BERT') to compute the probability of words, phrases, or tokens given their linguistic context. For details on GPT-2 and causal models, see Radford et al. (2019) <https://storage.prod.researchhub.com/uploads/papers/2020/06/01/language-models.pdf>, for details on BERT and masked models, see Devlin et al. (2019) <doi:10.48550/arXiv.1810.04805>. By enabling a straightforward estimation of word predictability, the package facilitates research in psycholinguistics, computational linguistics, and natural language processing (NLP).",
  "name": "pangoling: Access to Large Language Model Predictions",
  "relatedLink": [
    "https://docs.ropensci.org/pangoling/",
    "https://CRAN.R-project.org/package=pangoling"
  ],
  "codeRepository": "https://github.com/ropensci/pangoling",
  "issueTracker": "https://github.com/ropensci/pangoling/issues",
  "license": "https://spdx.org/licenses/MIT",
  "version": "1.0.3",
  "programmingLanguage": {
    "@type": "ComputerLanguage",
    "name": "R",
    "url": "https://r-project.org"
  },
  "runtimePlatform": "R version 4.4.1 (2024-06-14)",
  "provider": {
    "@id": "https://cran.r-project.org",
    "@type": "Organization",
    "name": "Comprehensive R Archive Network (CRAN)",
    "url": "https://cran.r-project.org"
  },
  "author": [
    {
      "@type": "Person",
      "givenName": "Bruno",
      "familyName": "Nicenboim",
      "email": "b.nicenboim@tilburguniversity.edu",
      "@id": "https://orcid.org/0000-0002-5176-3943"
    }
  ],
  "contributor": [
    {
      "@type": "Person",
      "givenName": "Chris",
      "familyName": "Emmerly"
    },
    {
      "@type": "Person",
      "givenName": "Giovanni",
      "familyName": "Cassani"
    }
  ],
  "maintainer": [
    {
      "@type": "Person",
      "givenName": "Bruno",
      "familyName": "Nicenboim",
      "email": "b.nicenboim@tilburguniversity.edu",
      "@id": "https://orcid.org/0000-0002-5176-3943"
    }
  ],
  "softwareSuggestions": [
    {
      "@type": "SoftwareApplication",
      "identifier": "brms",
      "name": "brms",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=brms"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "knitr",
      "name": "knitr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=knitr"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "parallel",
      "name": "parallel"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "rmarkdown",
      "name": "rmarkdown",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=rmarkdown"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "spelling",
      "name": "spelling",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=spelling"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "testthat",
      "name": "testthat",
      "version": ">= 3.0.0",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=testthat"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "tictoc",
      "name": "tictoc",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=tictoc"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "covr",
      "name": "covr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=covr"
    }
  ],
  "softwareRequirements": {
    "1": {
      "@type": "SoftwareApplication",
      "identifier": "R",
      "name": "R",
      "version": ">= 4.1.0"
    },
    "2": {
      "@type": "SoftwareApplication",
      "identifier": "cachem",
      "name": "cachem",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=cachem"
    },
    "3": {
      "@type": "SoftwareApplication",
      "identifier": "data.table",
      "name": "data.table",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=data.table"
    },
    "4": {
      "@type": "SoftwareApplication",
      "identifier": "memoise",
      "name": "memoise",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=memoise"
    },
    "5": {
      "@type": "SoftwareApplication",
      "identifier": "reticulate",
      "name": "reticulate",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=reticulate"
    },
    "6": {
      "@type": "SoftwareApplication",
      "identifier": "rstudioapi",
      "name": "rstudioapi",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=rstudioapi"
    },
    "7": {
      "@type": "SoftwareApplication",
      "identifier": "stats",
      "name": "stats"
    },
    "8": {
      "@type": "SoftwareApplication",
      "identifier": "tidyselect",
      "name": "tidyselect",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=tidyselect"
    },
    "9": {
      "@type": "SoftwareApplication",
      "identifier": "tidytable",
      "name": "tidytable",
      "version": ">= 0.7.2",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=tidytable"
    },
    "10": {
      "@type": "SoftwareApplication",
      "identifier": "utils",
      "name": "utils"
    },
    "SystemRequirements": null
  },
  "fileSize": "632.634KB",
  "citation": [
    {
      "@type": "SoftwareSourceCode",
      "datePublished": "2025",
      "author": [
        {
          "@type": "Person",
          "givenName": "Bruno",
          "familyName": "Nicenboim"
        }
      ],
      "name": "{pangoling}: {Access} to large language model predictions in {R}",
      "identifier": "10.5281/zenodo.7637526",
      "url": "https://github.com/ropensci/pangoling",
      "description": "R package version 1.0.3",
      "@id": "https://doi.org/10.5281/zenodo.7637526",
      "sameAs": "https://doi.org/10.5281/zenodo.7637526"
    },
    {
      "@type": "CreativeWork",
      "datePublished": "2020",
      "author": [
        {
          "@type": "Person",
          "givenName": "Thomas",
          "familyName": "Wolf"
        },
        {
          "@type": "Person",
          "givenName": "Lysandre",
          "familyName": "Debut"
        },
        {
          "@type": "Person",
          "givenName": "Victor",
          "familyName": "Sanh"
        },
        {
          "@type": "Person",
          "givenName": "Julien",
          "familyName": "Chaumond"
        },
        {
          "@type": "Person",
          "givenName": "Clement",
          "familyName": "Delangue"
        },
        {
          "@type": "Person",
          "givenName": "Anthony",
          "familyName": "Moi"
        },
        {
          "@type": "Person",
          "givenName": "Pierric",
          "familyName": "Cistac"
        },
        {
          "@type": "Person",
          "givenName": "Tim",
          "familyName": "Rault"
        },
        {
          "@type": "Person",
          "givenName": "Rémi",
          "familyName": "Louf"
        },
        {
          "@type": "Person",
          "givenName": "Morgan",
          "familyName": "Funtowicz"
        },
        {
          "@type": "Person",
          "givenName": "Joe",
          "familyName": "Davison"
        },
        {
          "@type": "Person",
          "givenName": "Sam",
          "familyName": "Shleifer"
        },
        {
          "@type": "Person",
          "givenName": "Patrick",
          "familyName": "von Platen"
        },
        {
          "@type": "Person",
          "givenName": "Clara",
          "familyName": "Ma"
        },
        {
          "@type": "Person",
          "givenName": "Yacine",
          "familyName": "Jernite"
        },
        {
          "@type": "Person",
          "givenName": "Julien",
          "familyName": "Plu"
        },
        {
          "@type": "Person",
          "givenName": "Canwen",
          "familyName": "Xu"
        },
        {
          "@type": "Person",
          "givenName": "Teven",
          "familyName": "Le Scao"
        },
        {
          "@type": "Person",
          "givenName": "Sylvain",
          "familyName": "Gugger"
        },
        {
          "@type": "Person",
          "givenName": "Mariama",
          "familyName": "Drame"
        },
        {
          "@type": "Person",
          "givenName": "Quentin",
          "familyName": "Lhoest"
        },
        {
          "@type": "Person",
          "givenName": [
            "Alexander",
            "M."
          ],
          "familyName": "Rush"
        }
      ],
      "name": "{HuggingFace's Transformers}: State-of-the-art Natural Language Processing",
      "url": "https://arxiv.org/abs/1910.03771"
    }
  ],
  "releaseNotes": "https://github.com/ropensci/pangoling/blob/master/NEWS.md",
  "readme": "https://github.com/ropensci/pangoling/blob/main/README.md",
  "contIntegration": [
    "https://app.codecov.io/gh/ropensci/pangoling?branch=main",
    "https://github.com/ropensci/pangoling/actions/workflows/R-CMD-check.yaml"
  ],
  "developmentStatus": [
    "https://lifecycle.r-lib.org/articles/stages.html#stable",
    "https://www.repostatus.org/#active"
  ],
  "review": {
    "@type": "Review",
    "url": "https://github.com/ropensci/software-review/issues/575",
    "provider": "https://ropensci.org"
  },
  "keywords": [
    "nlp",
    "psycholinguistics",
    "r",
    "r-package",
    "rstats",
    "transformers"
  ]
}

GitHub Events

Total
  • Create event: 1
  • Release event: 1
  • Issues event: 20
  • Watch event: 6
  • Issue comment event: 4
  • Push event: 37
  • Pull request event: 16
  • Fork event: 1
Last Year
  • Create event: 1
  • Release event: 1
  • Issues event: 20
  • Watch event: 6
  • Issue comment event: 4
  • Push event: 37
  • Pull request event: 16
  • Fork event: 1

Issues and Pull Requests

All Time
  • Total issues: 13
  • Total pull requests: 22
  • Average time to close issues: 4 months
  • Average time to close pull requests: about 4 hours
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 0.46
  • Average comments per pull request: 0.0
  • Merged pull requests: 19
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 12
  • Pull requests: 22
  • Average time to close issues: 16 days
  • Average time to close pull requests: about 4 hours
  • Issue authors: 2
  • Pull request authors: 1
  • Average comments per issue: 0.5
  • Average comments per pull request: 0.0
  • Merged pull requests: 19
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • bnicenboim (12)
  • tmalsburg (1)
Pull Request Authors
  • bnicenboim (21)
Top Labels
Issue Labels
CRAN (4) enhancement (4) documentation (2)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 539 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 1
  • Total maintainers: 1
cran.r-project.org: pangoling

Access to Large Language Model Predictions

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 539 Last month
Rankings
Dependent packages count: 26.8%
Dependent repos count: 33.0%
Average: 48.9%
Downloads: 86.8%

Dependencies

.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml actions
  • JamesIves/github-pages-deploy-action v4.4.1 composite
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/test-coverage.yaml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
DESCRIPTION cran
  • R >= 2.10 depends
  • cachem * imports
  • data.table * imports
  • memoise * imports
  • reticulate * imports
  • tidyselect * imports
  • tidytable >= 0.7.2 imports
  • utils * imports
  • covr * suggests
  • knitr * suggests
  • rmarkdown * suggests
  • spelling * suggests
  • testthat >= 3.0.0 suggests
  • tictoc * suggests