surprisal

A unified interface for computing surprisal (log probabilities) from language models! Supports neural, symbolic, and black-box API models.

https://github.com/aalok-sathe/surprisal

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 1 of 3 committers (33.3%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.4%) to scientific vocabulary

Keywords

gpt language-modeling large-language-models log-likelihood next-word-prediction surprisal

Last synced: 10 months ago

Repository


Basic Info

Host: GitHub
Owner: aalok-sathe
License: mit
Language: Python
Default Branch: main
Homepage: https://aalok-sathe.github.io/surprisal/
Size: 648 KB

Statistics

Stars: 40
Watchers: 3
Forks: 10
Open Issues: 2
Releases: 3

Topics

gpt language-modeling large-language-models log-likelihood next-word-prediction surprisal

Created almost 4 years ago · Last pushed 11 months ago

Metadata Files

Readme License Citation

surprisal

Compute surprisal from language models!

surprisal supports most Causal Language Models (GPT2- and Llama-like models) from Huggingface or local checkpoint, petals distributed models, as well as KenLM-based N-gram language models using the KenLM Python interface.

Masked Language Models (BERT-like models) are in the pipeline and will be supported at a future time (see #9).

Docs

Visit https://aalok-sathe.github.io/surprisal/surprisal.html.

Usage

The snippet below computes per-token surprisals for a list of sentences:

```python
from surprisal import AutoHuggingFaceModel, KenLMModel

sentences = [
    "The cat is on the mat",
    "The cat is on the hat",
    "The cat is on the pizza",
    "The pizza is on the mat",
    "I told you that the cat is on the mat",
    "I told you the cat is on the mat",
]

m = AutoHuggingFaceModel.from_pretrained('gpt2')
m.to('cuda')  # optionally move your model to GPU!

k = KenLMModel(model_path='./literature.arpa')

for result in m.surprise(sentences):
    print(result)
for result in k.surprise(sentences):
    print(result)
```

and produces output of this sort (`gpt2`):

```
    The    Ġcat     Ġis     Ġon    Ġthe    Ġmat
  3.276   9.222   2.463   4.145   0.961   7.237
    The    Ġcat     Ġis     Ġon    Ġthe    Ġhat
  3.276   9.222   2.463   4.145   0.961   9.955
    The    Ġcat     Ġis     Ġon    Ġthe  Ġpizza
  3.276   9.222   2.463   4.145   0.961   8.212
    The  Ġpizza     Ġis     Ġon    Ġthe    Ġmat
  3.276  10.860   3.212   4.910   0.985   8.379
      I   Ġtold    Ġyou   Ġthat    Ġthe    Ġcat     Ġis     Ġon    Ġthe    Ġmat
  3.998   6.856   0.619   2.443   2.711   7.955   2.596   4.804   1.139   6.946
      I   Ġtold    Ġyou    Ġthe    Ġcat     Ġis     Ġon    Ġthe    Ġmat
  3.998   6.856   0.619   4.115   7.612   3.031   4.817   1.233   7.033
```
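For readers new to the quantity: the surprisal of a token is its negative log probability given the preceding context, so less probable tokens score higher. The sketch below uses made-up probabilities (not model outputs) and the natural logarithm. Which log base the library reports is not stated here, so treat the base as an assumption.

```python
import math

# Hypothetical conditional probabilities p(token | context), invented for illustration.
token_probs = {"The": 0.04, "cat": 0.0001, "is": 0.09}


def surprisal(p: float) -> float:
    """Negative natural-log probability of an event with probability p."""
    return -math.log(p)


per_token = {tok: surprisal(p) for tok, p in token_probs.items()}

# By the chain rule, the joint probability of the sequence is the product of
# the conditionals, so its surprisal is the sum of the per-token surprisals.
total = sum(per_token.values())
joint = math.prod(token_probs.values())

for tok, s in per_token.items():
    print(f"{tok:>5} {s:.3f}")
print(f"total {total:.3f}")
```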

extracting surprisal over a substring

A surprisal object can be aggregated over a subset of tokens that best match a span of words or characters. Word boundaries are inherited from the model's standard tokenizer, and may not be consistent across models, so using character spans when slicing is the default and recommended option. Surprisals are in log space, and therefore added over tokens during aggregation. For example:

```python
>>> [s] = m.surprise("The cat is on the mat")
>>> s[3:6, "word"]
12.343366384506226
Ġon Ġthe Ġmat
>>> s[3:6, "char"]
9.222099304199219
Ġcat
>>> s[3:6]
9.222099304199219
Ġcat
```
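The two slices can be reproduced by hand from the rounded `gpt2` values in the Usage output. This is only a sketch of the idea, not the library's implementation: the helper names are hypothetical, `Ġ` is assumed to stand for a single leading space, each token is assumed to be exactly one word, and a token is selected whenever it overlaps the character span (the library's own "best match" rule may differ).

```python
import math

# Tokens and rounded surprisals copied from the gpt2 output for "The cat is on the mat".
tokens = ["The", "Ġcat", "Ġis", "Ġon", "Ġthe", "Ġmat"]
surprisals = [3.276, 9.222, 2.463, 4.145, 0.961, 7.237]


def char_offsets(toks):
    """Return (start, end) character offsets per token, reading 'Ġ' as one space."""
    offsets, pos = [], 0
    for tok in toks:
        width = len(tok.replace("Ġ", " "))
        offsets.append((pos, pos + width))
        pos += width
    return offsets


def aggregate_words(start, stop):
    """Sum surprisals of tokens start..stop-1 (assumes one token per word)."""
    return sum(surprisals[start:stop]), tokens[start:stop]


def aggregate_chars(start, stop):
    """Sum surprisals of every token that overlaps the character span [start, stop)."""
    picked = [i for i, (a, b) in enumerate(char_offsets(tokens)) if a < stop and b > start]
    return sum(surprisals[i] for i in picked), [tokens[i] for i in picked]


print(aggregate_words(3, 6))  # tokens Ġon, Ġthe, Ġmat
print(aggregate_chars(3, 6))  # characters 3..5 are " ca", which falls inside Ġcat
```

Up to rounding of the printed values, the word slice sums to about 12.343 and the character slice to 9.222, matching the session above.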

You can use Surprisal.lineplot() to visualize the surprisals:

```python
from matplotlib import pyplot as plt

f, a = None, None
for result in m.surprise(sentences):
    f, a = result.lineplot(f, a)

plt.show()
```

surprisal has a minimal CLI:

```bash
python -m surprisal -m distilgpt2 "I went to the train station today."
       I    Ġwent      Ġto     Ġthe   Ġtrain Ġstation   Ġtoday        .
   4.984    5.729    0.812    1.723    7.317    0.497    4.600    2.528

python -m surprisal -m distilgpt2 "I went to the space station today."
       I    Ġwent      Ġto     Ġthe   Ġspace Ġstation   Ġtoday        .
   4.984    5.729    0.812    1.723    8.425    0.707    5.182    2.574
```
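To make the contrast between the two CLI runs easier to read, the sketch below pairs the printed tokens with their values and reports the per-position difference. It works only on the numbers copied from the output above; the `pair` helper is hypothetical and not part of the package.

```python
# Tokens and values copied from the two distilgpt2 runs above.
tokens_a = "I Ġwent Ġto Ġthe Ġtrain Ġstation Ġtoday .".split()
values_a = [4.984, 5.729, 0.812, 1.723, 7.317, 0.497, 4.600, 2.528]
tokens_b = "I Ġwent Ġto Ġthe Ġspace Ġstation Ġtoday .".split()
values_b = [4.984, 5.729, 0.812, 1.723, 8.425, 0.707, 5.182, 2.574]


def pair(tokens, values):
    """Zip tokens with surprisals, refusing mismatched lengths."""
    if len(tokens) != len(values):
        raise ValueError("token and value counts differ")
    return list(zip(tokens, values))


rows = list(zip(pair(tokens_a, values_a), pair(tokens_b, values_b)))
diffs = [vb - va for (_, va), (_, vb) in rows]

for ((ta, va), (tb, vb)), d in zip(rows, diffs):
    marker = "*" if ta != tb else " "  # flag the position where the sentences differ
    print(f"{marker} {ta:>9} {va:6.3f}   {tb:>9} {vb:6.3f}   {d:+.3f}")
```

The first four differences are zero because a causal model conditions only on the preceding tokens, which are identical in both sentences; the values diverge from the changed word onward.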

Installing

Because surprisal is used by people from different communities for different purposes, by default, core dependencies related to language modeling are marked optional. Depending on your use case, install surprisal with the appropriate extras.

Installing from GitHub release or PyPI (latest stable release)

Use a command like `pip install surprisal[optional]` or `pip install git+https://github.com/aalok-sathe/surprisal.git[optional]`, replacing `[optional]` with whatever optional support you need. For multiple optional extras, use a comma-separated list:

```bash
pip install surprisal[kenlm,transformers]
```

Possible options include: transformers, kenlm, petals.

If you use uv for your existing project, use the -E option to add surprisal together with the desired optional dependencies:

```bash
uv add surprisal[transformers,kenlm]
```
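Since the language-modeling dependencies are optional, it can help to check which backends are importable in the current environment before loading a model. The sketch below is an illustration only: it assumes each extra installs a top-level module of the same name (`transformers`, `kenlm`, `petals`), and `available_backends` is a hypothetical helper, not part of surprisal.

```python
from importlib.util import find_spec

# Assumption: the extras listed above map to importable modules of the same name.
EXTRAS = ("transformers", "kenlm", "petals")


def available_backends(names=EXTRAS):
    """Map each module name to True if it can be found on the import path."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, ok in available_backends().items():
        print(f"{name:<12} {'installed' if ok else 'missing'}")
```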

Installing from source (bleeding edge)

The -e flag allows an editable install, so you can make changes to surprisal.

```bash
git clone https://github.com/aalok-sathe/surprisal.git
cd surprisal
pip install -e ".[transformers]"
```

Acknowledgments

Inspired by the now-inactive lm-scorer; thanks to folks from CPLlab and EvLab for comments and help.

License

Owner

Name: Aalok | आलोक
Login: aalok-sathe
Kind: user
Location: Cambridge, MA
Company: @MIT Brain & Cognitive Sciences

Website: https://aalok-sathe.gitlab.io
Twitter: aloxatel
Repositories: 16
Profile: https://github.com/aalok-sathe

interested in computation, cognition, and language. currently RA@ Evlab @mit BCS.

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: surprisal
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Aalok
    family-names: Sathe
    affiliation: MIT
    orcid: 'https://orcid.org/0000-0002-5248-7557'
identifiers:
  - type: url
    value: 'https://github.com/aalok-sathe/surprisal/'
repository-code: 'https://github.com/aalok-sathe/surprisal/'
abstract: >-
  Compute surprisal from language models!


  surprisal supports most Causal Language Models
  (GPT2- and GPTneo-like models) from Huggingface or
  local checkpoint, as well as GPT3 models from
  OpenAI using their API!


  Masked Language Models (BERT-like models) are in
  the pipeline and will be supported at a future
  time.
keywords:
  - 'nlp, language, surprisal, psycholinguistics'
license: MIT

GitHub Events

Total

Issues event: 6
Watch event: 10
Delete event: 1
Issue comment event: 6
Push event: 14
Pull request event: 3
Fork event: 4
Create event: 2

Last Year

Issues event: 6
Watch event: 10
Delete event: 1
Issue comment event: 6
Push event: 14
Pull request event: 3
Fork event: 4
Create event: 2

Committers

Last synced: over 3 years ago

All Time

Total Commits: 66
Total Committers: 3
Avg Commits per committer: 22.0
Development Distribution Score (DDS): 0.121

Top Committers

Name	Email	Commits
aalok-sathe	a**e@m**u	58
Aalok | आलोक	1**e@u**m	4
Stephan Meylan	m**n@g**m	4
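The summary figures above are consistent with the commit counts in the table. The sketch below recomputes them, assuming the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is not stated on this page.

```python
# Commit counts copied from the Top Committers table.
commits = {"aalok-sathe": 58, "Aalok | आलोक": 4, "Stephan Meylan": 4}

total = sum(commits.values())
average = total / len(commits)

# Assumed definition: DDS = 1 - (commits by top committer / total commits).
dds = 1 - max(commits.values()) / total

print(f"Total commits: {total}")
print(f"Avg commits per committer: {average:.1f}")
print(f"DDS: {dds:.3f}")
```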

Committer Domains (Top 20 + Academic)

mit.edu: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 15
Total pull requests: 12
Average time to close issues: 10 months
Average time to close pull requests: 4 months
Total issue authors: 5
Total pull request authors: 2
Average comments per issue: 1.2
Average comments per pull request: 0.58
Merged pull requests: 9
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: 8 months
Issue authors: 2
Pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.5
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

aalok-sathe (9)
jennhu (2)
lpetley (1)
YashasviMantha (1)

Pull Request Authors

aalok-sathe (12)
smeylan (1)

Top Labels

Issue Labels

enhancement (6) documentation (2) good first issue (1)

Pull Request Labels

help wanted (1)

Packages

Total packages: 1
Total downloads:
- pypi 190 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 13
Total maintainers: 1

pypi.org: surprisal

A package to conveniently compute surprisals for text sequences and subsequences

Documentation: https://surprisal.readthedocs.io/
License: MIT
Latest release: 0.1.7
published 11 months ago

Versions: 13
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 190 Last month

Rankings

Dependent packages count: 6.6%

Average: 21.0%

Downloads: 21.1%

Forks count: 23.2%

Stargazers count: 23.3%

Dependent repos count: 30.6%

Maintainers (1)

aalok-sathe

Last synced: 11 months ago

Dependencies

poetry.lock pypi

appnope 0.1.3 develop
argon2-cffi 21.3.0 develop
argon2-cffi-bindings 21.2.0 develop
asttokens 2.0.5 develop
attrs 22.1.0 develop
backcall 0.2.0 develop
beautifulsoup4 4.11.1 develop
black 22.6.0 develop
bleach 5.0.1 develop
cffi 1.15.1 develop
click 8.1.3 develop
debugpy 1.6.2 develop
decorator 5.1.1 develop
defusedxml 0.7.1 develop
entrypoints 0.4 develop
executing 0.9.1 develop
fastjsonschema 2.16.1 develop
ipykernel 6.15.1 develop
ipython 8.4.0 develop
ipython-genutils 0.2.0 develop
ipywidgets 7.7.1 develop
jedi 0.18.1 develop
jinja2 3.1.2 develop
jsonschema 4.8.0 develop
jupyter-client 7.3.4 develop
jupyter-core 4.11.1 develop
jupyterlab-pygments 0.2.2 develop
jupyterlab-widgets 1.1.1 develop
markupsafe 2.1.1 develop
matplotlib-inline 0.1.3 develop
mistune 0.8.4 develop
mypy-extensions 0.4.3 develop
nbclient 0.6.6 develop
nbconvert 6.5.0 develop
nbformat 5.4.0 develop
nest-asyncio 1.5.5 develop
notebook 6.4.12 develop
pandocfilters 1.5.0 develop
parso 0.8.3 develop
pathspec 0.9.0 develop
pexpect 4.8.0 develop
pickleshare 0.7.5 develop
platformdirs 2.5.2 develop
prometheus-client 0.14.1 develop
prompt-toolkit 3.0.30 develop
psutil 5.9.1 develop
ptyprocess 0.7.0 develop
pure-eval 0.2.2 develop
py 1.11.0 develop
pycparser 2.21 develop
pygments 2.12.0 develop
pyrsistent 0.18.1 develop
pywin32 304 develop
pywinpty 2.0.6 develop
pyzmq 23.2.0 develop
scipy 1.6.1 develop
seaborn 0.11.2 develop
send2trash 1.8.0 develop
soupsieve 2.3.2.post1 develop
stack-data 0.3.0 develop
terminado 0.15.0 develop
tinycss2 1.1.1 develop
tornado 6.2 develop
traitlets 5.3.0 develop
wcwidth 0.2.5 develop
webencodings 0.5.1 develop
widgetsnbextension 3.6.1 develop
certifi 2022.6.15
charset-normalizer 2.1.0
colorama 0.4.5
cycler 0.11.0
filelock 3.7.1
fonttools 4.34.4
huggingface-hub 0.8.1
idna 3.3
kiwisolver 1.4.4
matplotlib 3.5.2
numpy 1.23.1
packaging 21.3
pandas 1.4.3
pillow 9.2.0
plotext 5.0.2
pyparsing 3.0.9
python-dateutil 2.8.2
pytz 2022.1
pyyaml 6.0
regex 2022.7.25
requests 2.28.1
setuptools-scm 7.0.5
six 1.16.0
tokenizers 0.12.1
tomli 2.0.1
torch 1.12.0
tqdm 4.64.0
transformers 4.21.0
typing-extensions 4.3.0
urllib3 1.26.11

pyproject.toml pypi

black ^22.6.0 develop
ipykernel ^6.15.1 develop
ipython ^8.4.0 develop
ipywidgets ^7.7.1 develop
seaborn ^0.11.2 develop
matplotlib ^3.5.2
numpy ^1.23.1
pandas ^1.4.3
plotext ^5.0.2
python ^3.9
torch ^1.12.0
transformers ^4.20.1

.github/workflows/pylint.yml actions

actions/checkout v3 composite
actions/setup-python v3 composite

.github/workflows/python-publish.yml actions

actions/checkout v3 composite
actions/setup-python v3 composite
pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite

.github/workflows/docs.yml actions

actions/checkout v4 composite
actions/deploy-pages v2 composite
actions/setup-python v4 composite
actions/upload-pages-artifact v2 composite
dawidd6/action-download-artifact v2 composite
