nutcracker

Large Model Evaluation Experiments

https://github.com/brucewlee/nutcracker

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 1 committers (100.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary
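
How these indicators are combined into the 54.0% figure is not documented on this page; the sketch below is a purely illustrative rendering of one plausible scheme (an unweighted average over indicator values), with hypothetical indicator names, not the index's actual formula.

```python
# Purely illustrative: one plausible way to turn boolean/fractional indicators into
# a single percentage. The real scoring method behind "Science Score" is not
# documented here; indicator names and the equal weighting are assumptions.
def science_score(indicators: dict[str, float]) -> float:
    """Average the indicator values (each in 0.0-1.0) and express as a percentage."""
    return 100.0 * sum(indicators.values()) / len(indicators)

print(science_score({
    "citation_cff": 1.0,             # CITATION.cff found
    "codemeta_json": 1.0,            # codemeta.json found
    "zenodo_json": 1.0,              # .zenodo.json found
    "doi_references": 0.0,
    "publication_links": 0.0,
    "academic_committers": 1.0,      # 1 of 1 committers academic
    "institutional_owner": 0.0,
    "joss_metadata": 0.0,
    "scientific_vocabulary": 0.107,  # low vocabulary similarity
}))  # prints an illustrative value, not necessarily 54.0
```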

Keywords

large-language-models llm llm-evaluation llmops
Last synced: 6 months ago

Repository

Large Model Evaluation Experiments

Basic Info
  • Host: GitHub
  • Owner: brucewlee
  • License: other
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 431 KB
Statistics
  • Stars: 7
  • Watchers: 0
  • Forks: 1
  • Open Issues: 0
  • Releases: 23
Topics
large-language-models llm llm-evaluation llmops
Created about 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Nutcracker - Large Model Evaluation

Nutcracker is currently highly experimental

Nutcracker is like LM-Eval-Harness but without the PyTorch madness. Use it to evaluate LLMs served through APIs.

https://github.com/brucewlee/nutcracker/assets/54278520/151403fc-217c-486c-8de6-489af25789ce


Installation

Route 1. PyPI

Install Nutcracker
```bash
pip install nutcracker-py
```

Download Nutcracker DB
```bash
git clone https://github.com/evaluation-tools/nutcracker-db
```

Route 2. GitHub

Install Nutcracker
```bash
git clone https://github.com/evaluation-tools/nutcracker
pip install -e nutcracker
```

Download Nutcracker DB
```bash
git clone https://github.com/evaluation-tools/nutcracker-db
```

Check the Nutcracker DB readme page for all implemented tasks, or list them locally as sketched below.
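
A minimal sketch for inspecting what the cloned DB contains, assuming the clone keeps one entry per task under nutcracker-db/db (a layout assumption; the DB's readme remains the authoritative list):

```python
# List the contents of the cloned Nutcracker DB. Assumes one entry per task
# under nutcracker-db/db; check the nutcracker-db readme for the official list.
from pathlib import Path

db_directory = Path("nutcracker-db/db")
for entry in sorted(db_directory.iterdir()):
    print(entry.name)
```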


QuickStart

Case Study: Evaluate (Any) LLM API on TruthfulQA (Script)

STEP 1: Define Model
  • Define a simple model class with a "respond(self, user_prompt)" function.
  • We will use OpenAI here, but any API can be evaluated as long as a "respond(self, user_prompt)" function exists that returns the LLM response as a string. Get creative (Hugging Face API, Anthropic API, Replicate API, Ollama, etc.).

```python
from openai import OpenAI
import os, logging, sys

logging.basicConfig(level=logging.INFO)
logging.getLogger('httpx').setLevel(logging.CRITICAL)
os.environ["OPENAI_API_KEY"] = ""
client = OpenAI()

class ChatGPT:
    def __init__(self):
        self.model = "gpt-3.5-turbo"

    def respond(self, user_prompt):
        response_data = None
        while response_data is None:
            try:
                completion = client.chat.completions.create(
                    model=self.model,
                    messages=[
                        {"role": "user", "content": f"{user_prompt}"}
                    ],
                    timeout=15,
                )
                response_data = completion.choices[0].message.content
                break
            except KeyboardInterrupt:
                sys.exit()
            except Exception:
                # retry on timeouts and transient API errors
                print("Request timed out, retrying...")
        return response_data
```
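
Before wiring this into an evaluation, a quick smoke test of the wrapper can catch configuration issues early; this sketch only assumes the ChatGPT class above and a valid key in OPENAI_API_KEY.

```python
# Sanity-check the wrapper defined above (assumes OPENAI_API_KEY has been filled in).
model = ChatGPT()
print(model.respond("Reply with the single word: pong"))
```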

STEP 2: Run Evaluation

```python
from nutcracker.data import Task, Pile
from nutcracker.runs import Schema
from nutcracker.evaluator import MCQEvaluator, generate_report

# this db_directory value should work off-the-shelf if you cloned both repositories in the same directory
truthfulqa = Task.load_from_db(task_name='truthfulqa-mc1', db_directory='nutcracker-db/db')

# sample 20 for demo
truthfulqa.sample(20, in_place=True)

# running this experiment updates each instance's model_response property in the truthfulqa data object with ChatGPT responses
experiment = Schema(model=ChatGPT(), data=truthfulqa)
experiment.run()

# running this evaluation updates each instance's response_correct property in the truthfulqa data object with evaluations
evaluation = MCQEvaluator(data=truthfulqa)
evaluation.run()

for i in range(len(truthfulqa)):
    print(truthfulqa[i].user_prompt)
    print(truthfulqa[i].model_response)
    print(truthfulqa[i].correct_options)
    print(truthfulqa[i].response_correct)
    print()

print(generate_report(truthfulqa, save_path='accuracy_report.txt'))
```


Case Study: Task vs. Pile? Evaluating LLaMA on MMLU (Script)

STEP 1: Understand the basis of Nutcracker
  • Despite the field's long history of model evaluation, we have not reached a clear consensus on what a "benchmark" is (is MMLU a "benchmark"? Is the Hugging Face Open LLM Leaderboard a "benchmark"?).
  • Instead of using the word "benchmark", Nutcracker divides the data structure into Instance, Task, and Pile (See blog post: HERE).
  • Nutcracker DB is constructed at the Task level, but you can call multiple Tasks together at the Pile level (see the sketch below).
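
As a rough sketch of the distinction, using only the loading calls that appear elsewhere in this README (exact signatures may differ between versions):

```python
from nutcracker.data import Task, Pile

# A Task is a single dataset, addressed by its task name in the DB.
truthfulqa = Task.load_from_db(task_name='truthfulqa-mc1', db_directory='nutcracker-db/db')

# A Pile groups multiple Tasks (here, the MMLU subtasks) behind one handle,
# so sampling, running, and evaluating operate across all of them at once.
mmlu = Pile.load_from_db('mmlu', 'nutcracker-db/db')

# Both behave like sequences of Instances.
print(len(truthfulqa), len(mmlu))
```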

STEP 2: Define Model
  • Since we've tried the OpenAI API above, let's now try a Hugging Face Inference Endpoint. Most open-source models are accessible through this option. (See blog post: HERE)

```python
import requests

class LLaMA:
    def __init__(self):
        self.API_URL = "https://xxxx.us-east-1.aws.endpoints.huggingface.cloud"

    def query(self, payload):
        headers = {
            "Accept": "application/json",
            "Authorization": "Bearer hf_XXXXX",
            "Content-Type": "application/json"
        }
        response = requests.post(self.API_URL, headers=headers, json=payload)
        return response.json()

    def respond(self, user_prompt):
        output = self.query({
            "inputs": f"<s>[INST] <<SYS>> You are a helpful assistant. You keep your answers short. <</SYS>> {user_prompt}",
        })
        return output[0]['generated_text']
```

STEP 3: Load Data

```python
from nutcracker.data import Pile
import logging

logging.basicConfig(level=logging.INFO)

mmlu = Pile.load_from_db('mmlu', 'nutcracker-db/db')
```

STEP 4: Run Experiment (Retrieve Model Responses)
  • Running the experiment updates each instance's model_response attribute within the data object, which is the mmlu Pile in this case.
  • You can save the data object at any step of the evaluation. Let's save it this time so we don't have to repeat the API requests if anything goes wrong.

```python
from nutcracker.runs import Schema

mmlu.sample(n=1000, in_place=True)

experiment = Schema(model=LLaMA(), data=mmlu)
experiment.run()
mmlu.save_to_file('mmlu-llama.pkl')
```
  • You can load the saved object and check how the model responded.

```python
loaded_mmlu = Pile.load_from_file('mmlu-llama.pkl')

for i in range(len(loaded_mmlu)):
    print("\n\n\n---\n")
    print("Prompt:")
    print(loaded_mmlu[i].user_prompt)
    print("\nResponses:")
    print(loaded_mmlu[i].model_response)
```

STEP 5: Run Evaluation
  • LLMs often don't respond with an immediately recognizable option letter like A, B, C, or D.
  • Therefore, Nutcracker supports an intent-matching feature (requires an OpenAI API key) that parses the model response and maps it to a discrete label, but let's disable it for now and proceed with our evaluation (a sketch with intent matching enabled follows the example below).
  • We recommend intent matching for almost all use cases. We will publish detailed research on this later.

```python
from nutcracker.evaluator import MCQEvaluator, generate_report

evaluation = MCQEvaluator(data=loaded_mmlu, disable_intent_matching=True)
evaluation.run()
print(generate_report(loaded_mmlu, save_path='accuracy_report.txt'))
```
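
For comparison, a hedged sketch of the same evaluation with intent matching left on, assuming the feature is enabled by default when disable_intent_matching is omitted and that an OpenAI API key is available in the environment:

```python
# Sketch: same evaluation with intent matching enabled. Assumes the feature is on
# by default when disable_intent_matching is omitted, and that OPENAI_API_KEY is set.
import os
os.environ["OPENAI_API_KEY"] = ""  # fill in your key

evaluation = MCQEvaluator(data=loaded_mmlu)
evaluation.run()
print(generate_report(loaded_mmlu, save_path='accuracy_report_intent_matched.txt'))
```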

https://github.com/brucewlee/nutcracker/assets/54278520/6deb5362-fd48-470e-9964-c794425811d9


Tutorials

Owner

  • Name: Bruce W. Lee (이웅성)
  • Login: brucewlee
  • Kind: user
  • Location: Philadelphia, PA
  • Company: University of Pennsylvania

Research Scientist - NLP

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Lee"
  given-names: "Bruce W."
  orcid: "https://orcid.org/0000-0003-0921-7794"
title: "Nutcracker"
version: 10.5281/zenodo.10688939
doi: 10.5281/zenodo.10680579
date-released: 2024.02.21
url: "https://github.com/brucewlee/nutcracker"


Committers

Last synced: 8 months ago

All Time
  • Total Commits: 76
  • Total Committers: 1
  • Avg Commits per committer: 76.0
  • Development Distribution Score (DDS): 0.0 (see the worked example below)
Past Year
  • Commits: 2
  • Committers: 1
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Bruce Lee (b****s@s****u): 76 commits
Committer Domains (Top 20 + Academic)
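
The DDS values above are consistent with the common formulation of the score as 1 minus the top committer's share of commits (an assumed definition; the page does not state one), which a short worked example makes concrete:

```python
# Worked example under the assumed definition: DDS = 1 - (top committer's commits / total).
commit_counts = {"Bruce Lee": 76}  # from the committer list above
total = sum(commit_counts.values())
dds = 1 - max(commit_counts.values()) / total
print(dds)  # 0.0 -- every commit comes from a single committer
```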

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

setup.py pypi
  • openai >=1.10.0
  • pyyaml >=6.0.1
.github/workflows/pipy_release.yaml actions
  • actions/checkout v2 composite
  • actions/create-release v1 composite
  • actions/setup-python v2 composite
  • actions/upload-release-asset v1 composite
  • pypa/gh-action-pypi-publish master composite