Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.4%) to scientific vocabulary
Last synced: 7 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: abetlen
  • License: MIT
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 58.6 KB
Statistics
  • Stars: 35
  • Watchers: 4
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created about 3 years ago · Last pushed almost 3 years ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

Program Constrained Language Model Sampling

This project investigates applying program constraints to pre-trained language models to improve their ability to generate structured text.

Language models can accurately predict the next token in a sequence. Unfortunately, the generated text is not necessarily structured in a way that other programs can use. For example, a language model that extracts JSON objects from unstructured text may fail to produce a valid JSON object because it adds trailing commas or introduces comments.

We propose a simple method to constrain sampling: an external program decides which tokens in the vocabulary are valid at each point in the sequence, and the language model is forced to sample only from those valid tokens.
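A single constrained sampling step could look roughly like the sketch below. It is an illustration under stated assumptions, not this project's code: the function name `constrained_sample`, the dictionary-of-logits representation, and the `is_valid_prefix` callback are all hypothetical. Tokens that would make the output an invalid prefix are dropped, and the remaining logits are renormalised with a softmax before sampling.

```python
import math
import random
from typing import Callable, Dict


def constrained_sample(
    logits: Dict[str, float],
    prefix: str,
    is_valid_prefix: Callable[[str], bool],
    rng: random.Random,
) -> str:
    """Sample one token, considering only tokens the checker accepts."""
    # Keep a token only if appending it still leaves a valid output prefix.
    allowed = {
        token: logit
        for token, logit in logits.items()
        if is_valid_prefix(prefix + token)
    }
    if not allowed:
        raise ValueError("the checker rejected every token in the vocabulary")
    # Softmax over the surviving logits (shifted by the max for stability).
    top = max(allowed.values())
    weights = [math.exp(logit - top) for logit in allowed.values()]
    return rng.choices(list(allowed), weights=weights, k=1)[0]
```

For example, with logits `{"1": 0.5, "a": 3.0, "7": 0.1}`, prefix `"4"`, and `str.isdigit` as the checker, the token `"a"` can never be sampled even though the model scores it highest.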

Overview

Constrained Language Model Sampling

  • Language Model: Any pre-trained language model that outputs a discrete probability distribution over the next token.
  • Prefix Checker: A program that takes a sequence of tokens as input and decides whether the sequence is a valid output or a prefix of a valid output.
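The two components can be sketched as small interfaces joined by a decoding loop. Everything here is an assumption made for illustration: the `Status` enum, the `LanguageModel` and `PrefixChecker` protocols, the toy classes, and the greedy `generate` loop are hypothetical names, not this project's API. The toy checker accepts one or more digits followed by a semicolon.

```python
from enum import Enum
from typing import Dict, List, Protocol, Sequence


class Status(Enum):
    INVALID = 0   # no valid output starts with this sequence
    PREFIX = 1    # not yet a valid output, but can be extended into one
    COMPLETE = 2  # a valid output


class LanguageModel(Protocol):
    def next_token_scores(self, tokens: Sequence[str]) -> Dict[str, float]: ...


class PrefixChecker(Protocol):
    def check(self, tokens: Sequence[str]) -> Status: ...


class ToyModel:
    """Stand-in model that always prefers the token "a"."""

    def next_token_scores(self, tokens: Sequence[str]) -> Dict[str, float]:
        return {"a": 3.0, ";": 2.0, "1": 1.0}


class DigitsThenSemicolon:
    """Toy checker: valid outputs are one or more digits, then ';'."""

    def check(self, tokens: Sequence[str]) -> Status:
        text = "".join(tokens)
        if text == "" or text.isdigit():
            return Status.PREFIX
        if text.endswith(";") and text[:-1].isdigit():
            return Status.COMPLETE
        return Status.INVALID


def generate(model: LanguageModel, checker: PrefixChecker,
             max_tokens: int = 16) -> List[str]:
    """Greedy decoding restricted to tokens the checker accepts."""
    tokens: List[str] = []
    for _ in range(max_tokens):
        scores = model.next_token_scores(tokens)
        # Try tokens from highest to lowest score; take the first valid one.
        for token in sorted(scores, key=scores.get, reverse=True):
            status = checker.check(tokens + [token])
            if status is not Status.INVALID:
                tokens.append(token)
                break
        else:
            raise ValueError("no valid continuation")
        if status is Status.COMPLETE:
            break
    return tokens
```

With these toy classes, `generate(ToyModel(), DigitsThenSemicolon())` skips the model's preferred token `"a"` at every step and returns `["1", ";"]`. Greedy selection is used only to keep the example deterministic; sampling from the renormalised distribution fits the same loop.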

Related Work

Supported Language Models

  • [x] llama via llama.cpp
  • [ ] rwkv via rwkv.cpp

Supported Prefix Checkers

  • [x] JSON: Extracts JSON objects from unstructured text.
  • [ ] JSON Schema: Extracts JSON objects from unstructured text that match a given JSON schema.
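To make the idea of a JSON prefix checker concrete, here is a minimal hand-written sketch. It is not the checker used in this repository, and all names (`is_json_prefix`, `Incomplete`, `Invalid`, and the helper functions) are hypothetical. It is also narrower than the checker described above: it only tests whether a whole string can be extended into one valid JSON value, and does not extract objects from surrounding text. As a simplification, the hex digits after `\u` are not validated.

```python
class Incomplete(Exception): """The input ended where more characters were required."""

class Invalid(Exception): """A character appeared that JSON does not allow at that point."""

WS, DIGITS = " \t\n\r", "0123456789"

def _peek(s, i):
    if i >= len(s):
        raise Incomplete
    return s[i]

def _ws(s, i):
    while i < len(s) and s[i] in WS:
        i += 1
    return i

def _digits(s, i):
    if _peek(s, i) not in DIGITS:
        raise Invalid
    while i < len(s) and s[i] in DIGITS:
        i += 1
    return i

def _string(s, i):
    i += 1  # skip the opening quote
    while True:
        ch = _peek(s, i)
        if ch == '"':
            return i + 1
        if ch == "\\":
            if _peek(s, i + 1) not in '"\\/bfnrtu':  # \u hex digits unchecked
                raise Invalid
            i += 2
        elif ch < " ":
            raise Invalid  # raw control characters are not allowed
        else:
            i += 1

def _number(s, i):
    if s[i] == "-":
        i += 1
    i = i + 1 if _peek(s, i) == "0" else _digits(s, i)
    if i < len(s) and s[i] == ".":
        i = _digits(s, i + 1)
    if i < len(s) and s[i] in "eE":
        i += 1
        if _peek(s, i) in "+-":
            i += 1
        i = _digits(s, i)
    return i

def _container(s, i):
    is_object = s[i] == "{"
    close = "}" if is_object else "]"
    i = _ws(s, i + 1)
    if _peek(s, i) == close:
        return i + 1
    while True:
        if is_object:  # parse the '"key":' part of a member
            i = _ws(s, i)
            if _peek(s, i) != '"':
                raise Invalid
            i = _ws(s, _string(s, i))
            if _peek(s, i) != ":":
                raise Invalid
            i += 1
        i = _ws(s, _value(s, i))
        ch = _peek(s, i)
        if ch == close:
            return i + 1
        if ch != ",":
            raise Invalid  # e.g. a comment or a missing comma
        i += 1

def _value(s, i):
    i = _ws(s, i)
    ch = _peek(s, i)
    if ch in "{[":
        return _container(s, i)
    if ch == '"':
        return _string(s, i)
    if ch == "-" or ch in DIGITS:
        return _number(s, i)
    for word in ("true", "false", "null"):
        chunk = s[i:i + len(word)]
        if word.startswith(chunk):
            if chunk != word:
                raise Incomplete
            return i + len(word)
    raise Invalid

def is_json_prefix(text):
    """Return True if text can be extended into one valid JSON value."""
    try:
        end = _value(text, 0)
    except Incomplete:
        return True
    except Invalid:
        return False
    return text[end:].strip(WS) == ""  # only whitespace may follow
```

Under this sketch, `is_json_prefix('{"a": [1, 2')` is true because the text can still be completed, while `is_json_prefix('{"a": 1,}')` and `is_json_prefix('[1, 2] // c')` are false, which rules out the trailing-comma and comment failures mentioned earlier.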

Evaluations and Results

TODO

Owner

  • Name: Andrei
  • Login: abetlen
  • Kind: user
  • Location: San Francisco, California

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Program Constrained Language Model Sampling
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Andrei
    family-names: Betlen
    email: abetlen@gmail.com
repository-code: >-
  https://github.com/abetlen/program-constrained-language-model-sampling
license: MIT

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 7
  • Total Committers: 1
  • Avg Commits per committer: 7.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Andrei Betlen (a****n@g****m): 7 commits

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

poetry.lock pypi
  • Pygments 2.14.0 develop
  • appnope 0.1.3 develop
  • asttokens 2.2.1 develop
  • backcall 0.2.0 develop
  • cffi 1.15.1 develop
  • colorama 0.4.6 develop
  • comm 0.1.2 develop
  • debugpy 1.6.6 develop
  • decorator 5.1.1 develop
  • executing 1.2.0 develop
  • importlib-metadata 6.0.0 develop
  • ipykernel 6.21.3 develop
  • ipython 8.11.0 develop
  • jedi 0.18.2 develop
  • jupyter-client 8.0.3 develop
  • jupyter-core 5.2.0 develop
  • matplotlib-inline 0.1.6 develop
  • nest-asyncio 1.5.6 develop
  • packaging 23.0 develop
  • parso 0.8.3 develop
  • pexpect 4.8.0 develop
  • pickleshare 0.7.5 develop
  • platformdirs 3.1.1 develop
  • prompt-toolkit 3.0.38 develop
  • psutil 5.9.4 develop
  • ptyprocess 0.7.0 develop
  • pure-eval 0.2.2 develop
  • pycparser 2.21 develop
  • python-dateutil 2.8.2 develop
  • pywin32 305 develop
  • pyzmq 25.0.0 develop
  • six 1.16.0 develop
  • stack-data 0.6.2 develop
  • tornado 6.2 develop
  • traitlets 5.9.0 develop
  • wcwidth 0.2.6 develop
  • attrs 22.2.0
  • importlib-resources 5.12.0
  • jsonschema 4.17.3
  • llama-cpp-python 0.1.22
  • numpy 1.24.2
  • pkgutil_resolve_name 1.3.10
  • pyrsistent 0.19.3
  • tree-sitter 0.20.1
  • typing-extensions 4.5.0
  • zipp 3.15.0
pyproject.toml pypi
  • ipykernel ^6.21.3 develop
  • jsonschema ^4.17.3
  • llama-cpp-python ^0.1.22
  • numpy ^1.24.2
  • python ^3.8
  • tree-sitter ^0.20.1