surprisal
A unified interface for computing surprisal (log probabilities) from language models! Supports neural, symbolic, and black-box API models.
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 3 committers (33.3%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: aalok-sathe
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://aalok-sathe.github.io/surprisal/
- Size: 648 KB
Statistics
- Stars: 40
- Watchers: 3
- Forks: 10
- Open Issues: 2
- Releases: 3
Metadata Files
README.md
surprisal
Compute surprisal from language models!
surprisal supports most Causal Language Models (GPT2- and Llama-like models) from Hugging Face or a local checkpoint, petals distributed models,
as well as KenLM-based N-gram language models using the KenLM Python interface.
Masked Language Models (BERT-like models) are not yet supported; support is planned (see #9).
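For orientation, surprisal is commonly defined as the negative log probability of a token given its context. The sketch below illustrates that definition with made-up probabilities. The helper name `surprisal_from_prob` and the choice of logarithm base are assumptions for illustration and are not part of the library's API.

```python
import math


def surprisal_from_prob(prob, base=math.e):
    """Return -log_base(prob) for a probability in (0, 1]."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log(prob) / math.log(base)


# Made-up conditional probabilities for three tokens (illustration only).
probs = [0.5, 0.25, 0.125]
per_token = [surprisal_from_prob(p, base=2) for p in probs]

# Log probabilities add along a sequence, so surprisals add as well.
total = sum(per_token)
print(per_token, total)
```

Less probable tokens receive larger values, which is why unexpected continuations stand out in per-token listings.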
Docs 
Visit https://aalok-sathe.github.io/surprisal/surprisal.html.
Usage
The snippet below computes per-token surprisals for a list of sentences:
```python
from surprisal import AutoHuggingFaceModel, KenLMModel

sentences = [
    "The cat is on the mat",
    "The cat is on the hat",
    "The cat is on the pizza",
    "The pizza is on the mat",
    "I told you that the cat is on the mat",
    "I told you the cat is on the mat",
]

m = AutoHuggingFaceModel.from_pretrained('gpt2')
m.to('cuda')  # optionally move your model to GPU!

k = KenLMModel(model_path='./literature.arpa')

for result in m.surprise(sentences):
    print(result)
for result in k.surprise(sentences):
    print(result)
```
and produces output of this sort (`gpt2`):
```
The Ġcat Ġis Ġon Ġthe Ġmat
3.276 9.222 2.463 4.145 0.961 7.237
The Ġcat Ġis Ġon Ġthe Ġhat
3.276 9.222 2.463 4.145 0.961 9.955
The Ġcat Ġis Ġon Ġthe Ġpizza
3.276 9.222 2.463 4.145 0.961 8.212
The Ġpizza Ġis Ġon Ġthe Ġmat
3.276 10.860 3.212 4.910 0.985 8.379
I Ġtold Ġyou Ġthat Ġthe Ġcat Ġis Ġon Ġthe Ġmat
3.998 6.856 0.619 2.443 2.711 7.955 2.596 4.804 1.139 6.946
I Ġtold Ġyou Ġthe Ġcat Ġis Ġon Ġthe Ġmat
3.998 6.856 0.619 4.115 7.612 3.031 4.817 1.233 7.033
```
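Because per-token surprisals add, a sentence-level score can be obtained by summing them. The sketch below does this by hand with values copied from the `gpt2` output above rather than by calling the library, so the variable names are illustrative only.

```python
# Per-token surprisals copied from the gpt2 output above.
outputs = {
    "The cat is on the mat": [3.276, 9.222, 2.463, 4.145, 0.961, 7.237],
    "The cat is on the hat": [3.276, 9.222, 2.463, 4.145, 0.961, 9.955],
    "The cat is on the pizza": [3.276, 9.222, 2.463, 4.145, 0.961, 8.212],
}

# Sum the token-level values to get one score per sentence.
totals = {sentence: sum(values) for sentence, values in outputs.items()}

# The sentence with the lowest total is the least surprising overall.
least_surprising = min(totals, key=totals.get)

for sentence, total in totals.items():
    print(f"{total:7.3f}  {sentence}")
print("lowest total:", least_surprising)
```

The three sentences share their first five tokens, so the totals differ only through the final word.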
extracting surprisal over a substring
A surprisal object can be aggregated over a subset of tokens that best match a span of words or characters. Word boundaries are inherited from the model's standard tokenizer and may not be consistent across models, so character spans are the default and recommended option when slicing. Surprisals are in log space and are therefore added over tokens during aggregation. For example:
```python
>>> [s] = m.surprise("The cat is on the mat")
>>> s[3:6, "word"]
12.343366384506226
Ġon Ġthe Ġmat
>>> s[3:6, "char"]
9.222099304199219
Ġcat
>>> s[3:6]
9.222099304199219
Ġcat
```
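To make the two slicing modes concrete, here is a minimal standalone sketch of span-based aggregation. It does not use the library: the helpers `char_spans`, `sum_over_chars`, and `sum_over_words` are hypothetical, and the matching rules (any character overlap, and a leading `Ġ` marking a new word) are simplifying assumptions rather than the library's actual matching logic. Token strings and values are copied from the `gpt2` output shown earlier.

```python
# Token strings and surprisal values copied from the gpt2 output above.
tokens = ["The", "Ġcat", "Ġis", "Ġon", "Ġthe", "Ġmat"]
values = [3.276, 9.222, 2.463, 4.145, 0.961, 7.237]


def char_spans(tokens):
    """Assign each token a (start, end) character range; 'Ġ' counts as one space."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def sum_over_chars(tokens, values, start, stop):
    """Sum surprisal over tokens whose character range overlaps [start, stop)."""
    total = 0.0
    for (tok_start, tok_end), value in zip(char_spans(tokens), values):
        if tok_start < stop and tok_end > start:
            total += value
    return total


def sum_over_words(tokens, values, start, stop):
    """Sum surprisal over tokens belonging to words start..stop-1.

    Assumed rule: the first token, and any token beginning with 'Ġ',
    starts a new word.
    """
    total, word_index = 0.0, -1
    for tok, value in zip(tokens, values):
        if word_index < 0 or tok.startswith("Ġ"):
            word_index += 1
        if start <= word_index < stop:
            total += value
    return total


char_total = sum_over_chars(tokens, values, 3, 6)
word_total = sum_over_words(tokens, values, 3, 6)
print(round(char_total, 3), round(word_total, 3))
```

Under these assumptions the character slice selects only `Ġcat` and the word slice selects `Ġon Ġthe Ġmat`, which agrees with the example above up to the rounding of the displayed values.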
You can use Surprisal.lineplot() to visualize the surprisals:
```python
from matplotlib import pyplot as plt

f, a = None, None
for result in m.surprise(sentences):
    f, a = result.lineplot(f, a)

plt.show()
```

surprisal has a minimal CLI:
```bash
$ python -m surprisal -m distilgpt2 "I went to the train station today."
I Ġwent Ġto Ġthe Ġtrain Ġstation Ġtoday .
4.984 5.729 0.812 1.723 7.317 0.497 4.600 2.528

$ python -m surprisal -m distilgpt2 "I went to the space station today."
I Ġwent Ġto Ġthe Ġspace Ġstation Ġtoday .
4.984 5.729 0.812 1.723 8.425 0.707 5.182 2.574
```
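If you capture the CLI output, the two runs can be compared token by token. The sketch below is a hypothetical post-processing step, not part of surprisal: `parse_cli_output` assumes the output is exactly one whitespace-separated line of tokens followed by one line of values, as displayed above.

```python
def parse_cli_output(text):
    """Split a two-line token/value listing into (token, surprisal) pairs."""
    token_line, value_line = text.strip().splitlines()
    tokens = token_line.split()
    values = [float(v) for v in value_line.split()]
    if len(tokens) != len(values):
        raise ValueError("token and value counts differ")
    return list(zip(tokens, values))


# Output text copied from the two CLI runs above.
train = parse_cli_output(
    "I Ġwent Ġto Ġthe Ġtrain Ġstation Ġtoday .\n"
    "4.984 5.729 0.812 1.723 7.317 0.497 4.600 2.528"
)
space = parse_cli_output(
    "I Ġwent Ġto Ġthe Ġspace Ġstation Ġtoday .\n"
    "4.984 5.729 0.812 1.723 8.425 0.707 5.182 2.574"
)

# Report only the positions where the two runs differ.
for (tok_a, val_a), (tok_b, val_b) in zip(train, space):
    if val_a != val_b:
        print(f"{tok_a:>9} {val_a:6.3f} | {tok_b:>9} {val_b:6.3f} | diff {val_b - val_a:+.3f}")
```

The shared prefix `I went to the` has identical values in both runs, so differences only appear from the fifth token onward.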
Installing
Because surprisal is used by people from different communities for different
purposes, by default, core dependencies related to language modeling are marked
optional. Depending on your use case, install surprisal with the appropriate
extras.
Installing from GitHub release or PyPI (latest stable release)
Use a command like `pip install "surprisal[optional]"` or `pip install "surprisal[optional] @ git+https://github.com/aalok-sathe/surprisal.git"`, replacing `optional` with whatever optional support you need.
For multiple optional extras, use a comma-separated list:
```bash
pip install "surprisal[kenlm,transformers]"
```
Possible options include: transformers, kenlm, petals.
If you use uv for your existing project, use the -E option to add
surprisal together with the desired optional dependencies:
```bash
uv add "surprisal[transformers,kenlm]"
```
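After installing with extras, it can be useful to confirm which optional backends are importable in the current environment. The sketch below uses only the standard library. It assumes the import names match the extras names (`transformers`, `kenlm`, `petals`), and the helper `available_backends` is hypothetical, not part of surprisal.

```python
from importlib.util import find_spec

# Assumed import names for the optional extras listed above.
OPTIONAL_BACKENDS = ("transformers", "kenlm", "petals")


def available_backends(names=OPTIONAL_BACKENDS):
    """Map each backend name to True if it can be imported, else False."""
    return {name: find_spec(name) is not None for name in names}


status = available_backends()
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

A backend reported as missing suggests reinstalling surprisal with the corresponding extra.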
Installing from source (bleeding edge)
The -e flag allows an editable install, so you can make changes to surprisal.
```bash
git clone https://github.com/aalok-sathe/surprisal.git
cd surprisal
pip install -e ".[transformers]"
```
Acknowledgments
Inspired by the now-inactive lm-scorer; thanks to
folks from CPLlab and EvLab for comments and help.
License
MIT License. (C) 2022-25, contributors.
Owner
- Name: Aalok | आलोक
- Login: aalok-sathe
- Kind: user
- Location: Cambridge, MA
- Company: @MIT Brain & Cognitive Sciences
- Website: https://aalok-sathe.gitlab.io
- Twitter: aloxatel
- Repositories: 16
- Profile: https://github.com/aalok-sathe
interested in computation, cognition, and language. currently RA@ Evlab @mit BCS.
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: surprisal
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Aalok
    family-names: Sathe
    affiliation: MIT
    orcid: 'https://orcid.org/0000-0002-5248-7557'
identifiers:
  - type: url
    value: 'https://github.com/aalok-sathe/surprisal/'
repository-code: 'https://github.com/aalok-sathe/surprisal/'
abstract: >-
  Compute surprisal from language models!
  surprisal supports most Causal Language Models
  (GPT2- and GPTneo-like models) from Huggingface or
  local checkpoint, as well as GPT3 models from
  OpenAI using their API!
  Masked Language Models (BERT-like models) are in
  the pipeline and will be supported at a future
  time.
keywords:
  - 'nlp, language, surprisal, psycholinguistics'
license: MIT
GitHub Events
Total
- Issues event: 6
- Watch event: 10
- Delete event: 1
- Issue comment event: 6
- Push event: 14
- Pull request event: 3
- Fork event: 4
- Create event: 2
Last Year
- Issues event: 6
- Watch event: 10
- Delete event: 1
- Issue comment event: 6
- Push event: 14
- Pull request event: 3
- Fork event: 4
- Create event: 2
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 66
- Total Committers: 3
- Avg Commits per committer: 22.0
- Development Distribution Score (DDS): 0.121
Top Committers
| Name | Email | Commits |
|---|---|---|
| aalok-sathe | a****e@m****u | 58 |
| Aalok \| आलोक | 1****e@u****m | 4 |
| Stephan Meylan | m****n@g****m | 4 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 15
- Total pull requests: 12
- Average time to close issues: 10 months
- Average time to close pull requests: 4 months
- Total issue authors: 5
- Total pull request authors: 2
- Average comments per issue: 1.2
- Average comments per pull request: 0.58
- Merged pull requests: 9
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 8 months
- Issue authors: 2
- Pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.5
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- aalok-sathe (9)
- jennhu (2)
- lpetley (1)
- YashasviMantha (1)
Pull Request Authors
- aalok-sathe (12)
- smeylan (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 190 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 13
- Total maintainers: 1
pypi.org: surprisal
A package to conveniently compute surprisals for text sequences and subsequences
- Documentation: https://surprisal.readthedocs.io/
- License: MIT
- Latest release: 0.1.7 (published 7 months ago)
Dependencies
- appnope 0.1.3 develop
- argon2-cffi 21.3.0 develop
- argon2-cffi-bindings 21.2.0 develop
- asttokens 2.0.5 develop
- attrs 22.1.0 develop
- backcall 0.2.0 develop
- beautifulsoup4 4.11.1 develop
- black 22.6.0 develop
- bleach 5.0.1 develop
- cffi 1.15.1 develop
- click 8.1.3 develop
- debugpy 1.6.2 develop
- decorator 5.1.1 develop
- defusedxml 0.7.1 develop
- entrypoints 0.4 develop
- executing 0.9.1 develop
- fastjsonschema 2.16.1 develop
- ipykernel 6.15.1 develop
- ipython 8.4.0 develop
- ipython-genutils 0.2.0 develop
- ipywidgets 7.7.1 develop
- jedi 0.18.1 develop
- jinja2 3.1.2 develop
- jsonschema 4.8.0 develop
- jupyter-client 7.3.4 develop
- jupyter-core 4.11.1 develop
- jupyterlab-pygments 0.2.2 develop
- jupyterlab-widgets 1.1.1 develop
- markupsafe 2.1.1 develop
- matplotlib-inline 0.1.3 develop
- mistune 0.8.4 develop
- mypy-extensions 0.4.3 develop
- nbclient 0.6.6 develop
- nbconvert 6.5.0 develop
- nbformat 5.4.0 develop
- nest-asyncio 1.5.5 develop
- notebook 6.4.12 develop
- pandocfilters 1.5.0 develop
- parso 0.8.3 develop
- pathspec 0.9.0 develop
- pexpect 4.8.0 develop
- pickleshare 0.7.5 develop
- platformdirs 2.5.2 develop
- prometheus-client 0.14.1 develop
- prompt-toolkit 3.0.30 develop
- psutil 5.9.1 develop
- ptyprocess 0.7.0 develop
- pure-eval 0.2.2 develop
- py 1.11.0 develop
- pycparser 2.21 develop
- pygments 2.12.0 develop
- pyrsistent 0.18.1 develop
- pywin32 304 develop
- pywinpty 2.0.6 develop
- pyzmq 23.2.0 develop
- scipy 1.6.1 develop
- seaborn 0.11.2 develop
- send2trash 1.8.0 develop
- soupsieve 2.3.2.post1 develop
- stack-data 0.3.0 develop
- terminado 0.15.0 develop
- tinycss2 1.1.1 develop
- tornado 6.2 develop
- traitlets 5.3.0 develop
- wcwidth 0.2.5 develop
- webencodings 0.5.1 develop
- widgetsnbextension 3.6.1 develop
- certifi 2022.6.15
- charset-normalizer 2.1.0
- colorama 0.4.5
- cycler 0.11.0
- filelock 3.7.1
- fonttools 4.34.4
- huggingface-hub 0.8.1
- idna 3.3
- kiwisolver 1.4.4
- matplotlib 3.5.2
- numpy 1.23.1
- packaging 21.3
- pandas 1.4.3
- pillow 9.2.0
- plotext 5.0.2
- pyparsing 3.0.9
- python-dateutil 2.8.2
- pytz 2022.1
- pyyaml 6.0
- regex 2022.7.25
- requests 2.28.1
- setuptools-scm 7.0.5
- six 1.16.0
- tokenizers 0.12.1
- tomli 2.0.1
- torch 1.12.0
- tqdm 4.64.0
- transformers 4.21.0
- typing-extensions 4.3.0
- urllib3 1.26.11
- black ^22.6.0 develop
- ipykernel ^6.15.1 develop
- ipython ^8.4.0 develop
- ipywidgets ^7.7.1 develop
- seaborn ^0.11.2 develop
- matplotlib ^3.5.2
- numpy ^1.23.1
- pandas ^1.4.3
- plotext ^5.0.2
- python ^3.9
- torch ^1.12.0
- transformers ^4.20.1
- actions/checkout v3 composite
- actions/setup-python v3 composite
- actions/checkout v3 composite
- actions/setup-python v3 composite
- pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
- actions/checkout v4 composite
- actions/deploy-pages v2 composite
- actions/setup-python v4 composite
- actions/upload-pages-artifact v2 composite
- dawidd6/action-download-artifact v2 composite