judy

Judy is a Python library and framework to evaluate the text-generation capabilities of Large Language Models (LLMs) using a Judge LLM.

https://github.com/tnt-hoopsnake/judy

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.0%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: TNT-Hoopsnake
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 697 KB
Statistics
  • Stars: 6
  • Watchers: 0
  • Forks: 0
  • Open Issues: 2
  • Releases: 4
Created over 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme · Contributing · License · Citation

README.md

Judy

Judy is a Python library and framework to evaluate the text-generation capabilities of Large Language Models (LLMs) using a Judge LLM.

Judy allows users to evaluate LLMs using a competent Judge LLM (such as GPT-4). Users can choose from a set of predefined scenarios sourced from recent research, or design their own. A scenario is a specific test designed to evaluate a particular aspect of an LLM, and consists of:

  • Dataset: A source dataset from which prompts are generated to evaluate models.
  • Task: A task to evaluate models on. Tasks for judge evaluations have been carefully designed by researchers to assess certain aspects of LLMs.
  • Metric: The metric(s) to use when evaluating the responses from a task, for example accuracy or level of detail.
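To make the scenario structure concrete, here is a purely illustrative Python sketch; the class and field names are assumptions for exposition, not Judy's actual API:

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Illustrative model of a scenario: a dataset, a task, and metrics.

    This mirrors the three components described above; it is not Judy's
    internal representation.
    """
    dataset: str                  # source dataset used to generate prompts
    task: str                     # what the evaluated model is asked to do
    metrics: list[str] = field(default_factory=list)  # judged qualities

# A hypothetical scenario judging summaries for accuracy and level of detail.
scenario = Scenario(
    dataset="xsum",
    task="summarization",
    metrics=["accuracy", "level_of_detail"],
)
print(scenario)
```

The point is simply that a scenario bundles where prompts come from, what the model must do, and which qualities the judge scores.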

Framework Overview

Judy has been inspired by techniques used in research including HELM [1] and LLM-as-a-judge [2].


  • [1] Holistic Evaluation of Language Models - https://arxiv.org/abs/2211.09110
  • [2] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena - https://arxiv.org/abs/2306.05685

Installation

Use the package manager pip to install Judy. Note: Judy requires Python >= 3.10.

```bash
pip install judyeval
```

Alternate Installation

You can also install Judy directly from this git repo:

```bash
pip install git+https://github.com/TNT-Hoopsnake/judy
```

Getting Started

Setup configs

Judy uses three configuration files during evaluation. Only the run config is strictly necessary to begin with:

  • Dataset Config: Defines all of the datasets available to use in the evaluation run, how to download them and which class to use to format them. You don't have to worry about specifying this config unless you plan on adding new datasets. Judy will automatically use the example dataset config here unless you specify an alternate one using --dataset-config.
  • Evaluation Config: Defines all of the tasks and the metrics used to evaluate them. It also restricts which datasets and metrics can be used for each task. You don't have to worry about specifying this config unless you plan on adding new tasks or metrics. Judy will automatically use the example eval config here unless you specify an alternate one using --eval-config.
  • Run Config: Defines all of the settings to use for your evaluation run. A copy of these settings (with sensitive details redacted) is stored as metadata alongside the evaluation results. An example run config is provided here.
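As a rough illustration only — the field names below are guesses for exposition, not Judy's actual schema; refer to the example run config linked above for the real format:

```yaml
# Hypothetical run config sketch. Every key here is illustrative.
judge: gpt-4            # the Judge LLM used to score responses
models:
  - name: my-model
    api_type: OPENAI    # or HUGGINGFACE
scenarios:
  - summarization
output: ./results
```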

Setup model(s) to evaluate

Ensure you have API access to the models you wish to evaluate. We currently support two API formats:

  • OPENAI: The OpenAI API ChatCompletion endpoint (ref)
  • HUGGINGFACE: The HuggingFace Hosted Inference API (ref)

If you are hosting models locally, you can use a package like LocalAI to expose an OpenAI-compatible REST API that Judy can use.
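As a sketch of what an OpenAI-style ChatCompletion request body looks like (the model name and prompt below are placeholders, and this helper is illustrative, not part of Judy):

```python
import json


def build_chat_request(model: str, prompt: str, temperature: float = 0.0) -> dict:
    """Build the JSON body for an OpenAI-style /v1/chat/completions call.

    Any OpenAI-compatible server (e.g. a locally hosted model behind
    LocalAI) accepts this shape.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

# Example: a request body for a hypothetical locally hosted model.
body = build_chat_request("local-model", "Summarise the plot of Hamlet in one sentence.")
print(json.dumps(body, indent=2))
```

A low temperature is typical for evaluation runs, since judged comparisons benefit from reproducible outputs.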

Judy Commands

A CLI is provided for viewing and editing Judy config files:

```bash
judy config
```

Run an evaluation as follows:

```bash
judy run --run-config run_config.yml --name disinfo-test --output ./results
```

After running an evaluation, you can serve a web app for viewing the results:

```bash
judy serve -r ./results
```

Web App Screenshots

The web app allows you to view your evaluation results.

Screenshots: Overview · App Runs · Raw Results

Roadmap

Features

  • [x] Core framework
  • [x] Web app - to view evaluation results
  • [ ] Add perturbations - the ability to modify input datasets with typos, synonyms, etc.
  • [ ] Add adaptations - the ability to use different prompting techniques, such as Chain of Thought.

Scenarios

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate. Check out the contribution guide for more details.

Citation - BibTeX

```bibtex
@software{Hutchinson_Judy_-_LLM_2024,
  author = {Hutchinson, Linden and Raghavan, Rahul},
  month = feb,
  title = {{Judy - LLM Evaluator}},
  url = {https://github.com/TNT-Hoopsnake/judy},
  version = {2.0.0},
  year = {2024}
}
```

Owner

  • Name: TNT-Hoopsnake
  • Login: TNT-Hoopsnake
  • Kind: organization

Citation (CITATION.cff)

```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Hutchinson"
  given-names: "Linden"
- family-names: "Raghavan"
  given-names: "Rahul"
title: "Judy - LLM Evaluator"
version: 2.0.0
date-released: 2024-02-01
url: "https://github.com/TNT-Hoopsnake/judy"
```
