nutcracker

Large Model Evaluation Experiments

https://github.com/brucewlee/nutcracker

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 1 committers (100.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary
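
How these indicators are combined into the 54.0% figure is not documented on this page; the sketch below is a purely illustrative rendering of one plausible scheme (an unweighted average over indicator values), with hypothetical indicator names, not the index's actual formula.

```python
# Purely illustrative: one plausible way to turn boolean/fractional indicators into
# a single percentage. The real scoring method behind "Science Score" is not
# documented here; indicator names and the equal weighting are assumptions.
def science_score(indicators: dict[str, float]) -> float:
    """Average the indicator values (each in 0.0-1.0) and express as a percentage."""
    return 100.0 * sum(indicators.values()) / len(indicators)

print(science_score({
    "citation_cff": 1.0,             # CITATION.cff found
    "codemeta_json": 1.0,            # codemeta.json found
    "zenodo_json": 1.0,              # .zenodo.json found
    "doi_references": 0.0,
    "publication_links": 0.0,
    "academic_committers": 1.0,      # 1 of 1 committers academic
    "institutional_owner": 0.0,
    "joss_metadata": 0.0,
    "scientific_vocabulary": 0.107,  # low vocabulary similarity
}))  # prints an illustrative value, not necessarily 54.0
```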

Keywords

large-language-models llm llm-evaluation llmops
Last synced: 6 months ago

Repository

Large Model Evaluation Experiments

Basic Info
  • Host: GitHub
  • Owner: brucewlee
  • License: other
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 431 KB
Statistics
  • Stars: 7
  • Watchers: 0
  • Forks: 1
  • Open Issues: 0
  • Releases: 23
Topics
large-language-models llm llm-evaluation llmops
Created about 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Nutcracker - Large Model Evaluation

Nutcracker is currently highly experimental

Nutcracker is like LM-Eval-Harness but without the PyTorch madness. Use it to evaluate LLMs served through APIs.

https://github.com/brucewlee/nutcracker/assets/54278520/151403fc-217c-486c-8de6-489af25789ce


Installation

Route 1. PyPI

Install Nutcracker
```bash
pip install nutcracker-py
```

Download Nutcracker DB
```bash
git clone https://github.com/evaluation-tools/nutcracker-db
```

Route 2. GitHub

Install Nutcracker
```bash
git clone https://github.com/evaluation-tools/nutcracker
pip install -e nutcracker
```

Download Nutcracker DB
```bash
git clone https://github.com/evaluation-tools/nutcracker-db
```

Check the Nutcracker DB readme page for all implemented tasks, or list them locally as sketched below.
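
A minimal sketch for inspecting what the cloned DB contains, assuming the clone keeps one entry per task under nutcracker-db/db (a layout assumption; the DB's readme remains the authoritative list):

```python
# List the contents of the cloned Nutcracker DB. Assumes one entry per task
# under nutcracker-db/db; check the nutcracker-db readme for the official list.
from pathlib import Path

db_directory = Path("nutcracker-db/db")
for entry in sorted(db_directory.iterdir()):
    print(entry.name)
```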


QuickStart

Case Study: Evaluate (Any) LLM API on TruthfulQA (Script)

STEP 1: Define Model
  • Define a simple model class with a "respond(self, user_prompt)" function.
  • We will use OpenAI here, but any API can be evaluated as long as a "respond(self, user_prompt)" function exists that returns the LLM response as a string. Get creative (Hugging Face API, Anthropic API, Replicate API, Ollama, etc.).

```python
from openai import OpenAI
import os, logging, sys

logging.basicConfig(level=logging.INFO)
logging.getLogger('httpx').setLevel(logging.CRITICAL)
os.environ["OPENAI_API_KEY"] = ""
client = OpenAI()

class ChatGPT:
    def __init__(self):
        self.model = "gpt-3.5-turbo"

    def respond(self, user_prompt):
        response_data = None
        while response_data is None:
            try:
                completion = client.chat.completions.create(
                    model=self.model,
                    messages=[
                        {"role": "user", "content": f"{user_prompt}"}
                    ],
                    timeout=15,
                )
                response_data = completion.choices[0].message.content
                break
            except KeyboardInterrupt:
                sys.exit()
            except Exception:
                # retry on timeouts and transient API errors
                print("Request timed out, retrying...")
        return response_data
```
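
Before wiring this into an evaluation, a quick smoke test of the wrapper can catch configuration issues early; this sketch only assumes the ChatGPT class above and a valid key in OPENAI_API_KEY.

```python
# Sanity-check the wrapper defined above (assumes OPENAI_API_KEY has been filled in).
model = ChatGPT()
print(model.respond("Reply with the single word: pong"))
```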

STEP 2: Run Evaluation

```python
from nutcracker.data import Task, Pile
from nutcracker.runs import Schema
from nutcracker.evaluator import MCQEvaluator, generate_report

# this db_directory value should work off-the-shelf if you cloned both repositories in the same directory
truthfulqa = Task.load_from_db(task_name='truthfulqa-mc1', db_directory='nutcracker-db/db')

# sample 20 for demo
truthfulqa.sample(20, in_place=True)

# running this experiment updates each instance's model_response property in the truthfulqa data object with ChatGPT responses
experiment = Schema(model=ChatGPT(), data=truthfulqa)
experiment.run()

# running this evaluation updates each instance's response_correct property in the truthfulqa data object with evaluations
evaluation = MCQEvaluator(data=truthfulqa)
evaluation.run()

for i in range(len(truthfulqa)):
    print(truthfulqa[i].user_prompt)
    print(truthfulqa[i].model_response)
    print(truthfulqa[i].correct_options)
    print(truthfulqa[i].response_correct)
    print()

print(generate_report(truthfulqa, save_path='accuracy_report.txt'))
```


Case Study: Task vs. Pile? Evaluating LLaMA on MMLU (Script)

STEP 1: Understand the basis of Nutcracker
  • Despite the field's long history of model evaluation, we have not reached a clear consensus on what a "benchmark" is (is MMLU a "benchmark"? Is the Hugging Face Open LLM Leaderboard a "benchmark"?).
  • Instead of using the word "benchmark", Nutcracker divides the data structure into Instance, Task, and Pile (See blog post: HERE).
  • Nutcracker DB is constructed at the Task level, but you can call multiple Tasks together at the Pile level (see the sketch below).
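
As a rough sketch of the distinction, using only the loading calls that appear elsewhere in this README (exact signatures may differ between versions):

```python
from nutcracker.data import Task, Pile

# A Task is a single dataset, addressed by its task name in the DB.
truthfulqa = Task.load_from_db(task_name='truthfulqa-mc1', db_directory='nutcracker-db/db')

# A Pile groups multiple Tasks (here, the MMLU subtasks) behind one handle,
# so sampling, running, and evaluating operate across all of them at once.
mmlu = Pile.load_from_db('mmlu', 'nutcracker-db/db')

# Both behave like sequences of Instances.
print(len(truthfulqa), len(mmlu))
```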

STEP 2: Define Model
  • Since we've tried the OpenAI API above, let's now try a Hugging Face Inference Endpoint. Most open-source models are accessible through this option. (See blog post: HERE)

```python
import requests

class LLaMA:
    def __init__(self):
        self.API_URL = "https://xxxx.us-east-1.aws.endpoints.huggingface.cloud"

    def query(self, payload):
        headers = {
            "Accept": "application/json",
            "Authorization": "Bearer hf_XXXXX",
            "Content-Type": "application/json"
        }
        response = requests.post(self.API_URL, headers=headers, json=payload)
        return response.json()

    def respond(self, user_prompt):
        output = self.query({
            "inputs": f"<s>[INST] <<SYS>> You are a helpful assistant. You keep your answers short. <</SYS>> {user_prompt}",
        })
        return output[0]['generated_text']
```

STEP 3: Load Data

```python
from nutcracker.data import Pile
import logging

logging.basicConfig(level=logging.INFO)

mmlu = Pile.load_from_db('mmlu', 'nutcracker-db/db')
```

STEP 4: Run Experiment (Retrieve Model Responses)
  • Running the experiment updates each instance's model_response attribute within the data object, which is the mmlu Pile in this case.
  • You can save the data object at any step of the evaluation. Let's save it this time so we don't have to repeat the API requests if anything goes wrong.

```python
from nutcracker.runs import Schema

mmlu.sample(n=1000, in_place=True)

experiment = Schema(model=LLaMA(), data=mmlu)
experiment.run()
mmlu.save_to_file('mmlu-llama.pkl')
```
  • You can load the saved object and check how the model responded.

```python
loaded_mmlu = Pile.load_from_file('mmlu-llama.pkl')

for i in range(len(loaded_mmlu)):
    print("\n\n\n---\n")
    print("Prompt:")
    print(loaded_mmlu[i].user_prompt)
    print("\nResponses:")
    print(loaded_mmlu[i].model_response)
```

STEP 5: Run Evaluation
  • LLMs often don't respond with an immediately recognizable option letter like A, B, C, or D.
  • Therefore, Nutcracker supports an intent-matching feature (requires an OpenAI API key) that parses the model response and maps it to a discrete label, but let's disable it for now and proceed with our evaluation (a sketch with intent matching enabled follows the example below).
  • We recommend intent matching for almost all use cases. We will publish detailed research on this later.

```python
from nutcracker.evaluator import MCQEvaluator, generate_report

evaluation = MCQEvaluator(data=loaded_mmlu, disable_intent_matching=True)
evaluation.run()
print(generate_report(loaded_mmlu, save_path='accuracy_report.txt'))
```
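
For comparison, a hedged sketch of the same evaluation with intent matching left on, assuming the feature is enabled by default when disable_intent_matching is omitted and that an OpenAI API key is available in the environment:

```python
# Sketch: same evaluation with intent matching enabled. Assumes the feature is on
# by default when disable_intent_matching is omitted, and that OPENAI_API_KEY is set.
import os
os.environ["OPENAI_API_KEY"] = ""  # fill in your key

evaluation = MCQEvaluator(data=loaded_mmlu)
evaluation.run()
print(generate_report(loaded_mmlu, save_path='accuracy_report_intent_matched.txt'))
```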

https://github.com/brucewlee/nutcracker/assets/54278520/6deb5362-fd48-470e-9964-c794425811d9


Tutorials

Owner

  • Name: Bruce W. Lee (이웅성)
  • Login: brucewlee
  • Kind: user
  • Location: Philadelphia, PA
  • Company: University of Pennsylvania

Research Scientist - NLP

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Lee"
  given-names: "Bruce W."
  orcid: "https://orcid.org/0000-0003-0921-7794"
title: "Nutcracker"
version: 10.5281/zenodo.10688939
doi: 10.5281/zenodo.10680579
date-released: 2024.02.21
url: "https://github.com/brucewlee/nutcracker"


Committers

Last synced: 8 months ago

All Time
  • Total Commits: 76
  • Total Committers: 1
  • Avg Commits per committer: 76.0
  • Development Distribution Score (DDS): 0.0 (see the worked example below)
Past Year
  • Commits: 2
  • Committers: 1
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Bruce Lee (b****s@s****u): 76 commits
Committer Domains (Top 20 + Academic)
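
The DDS values above are consistent with the common formulation of the score as 1 minus the top committer's share of commits (an assumed definition; the page does not state one), which a short worked example makes concrete:

```python
# Worked example under the assumed definition: DDS = 1 - (top committer's commits / total).
commit_counts = {"Bruce Lee": 76}  # from the committer list above
total = sum(commit_counts.values())
dds = 1 - max(commit_counts.values()) / total
print(dds)  # 0.0 -- every commit comes from a single committer
```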

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

setup.py pypi
  • openai >=1.10.0
  • pyyaml >=6.0.1
.github/workflows/pipy_release.yaml actions
  • actions/checkout v2 composite
  • actions/create-release v1 composite
  • actions/setup-python v2 composite
  • actions/upload-release-asset v1 composite
  • pypa/gh-action-pypi-publish master composite