Science Score: 54.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 1 committers (100.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.7%) to scientific vocabulary
Keywords
Repository
Large Model Evaluation Experiments
Basic Info
Statistics
- Stars: 7
- Watchers: 0
- Forks: 1
- Open Issues: 0
- Releases: 23
Topics
Metadata Files
README.md
Nutcracker - Large Model Evaluation
Nutcracker is currently highly experimental
Nutcracker is like LM-Eval-Harness but without the PyTorch madness. Use it to evaluate LLMs served through APIs.
https://github.com/brucewlee/nutcracker/assets/54278520/151403fc-217c-486c-8de6-489af25789ce
Installation
Route 1. PyPI
Install Nutcracker
```bash
pip install nutcracker-py
```
Download Nutcracker DB
```bash
git clone https://github.com/evaluation-tools/nutcracker-db
```
Route 2. GitHub
Install Nutcracker
```bash
git clone https://github.com/evaluation-tools/nutcracker
pip install -e nutcracker
```
Download Nutcracker DB
```bash
git clone https://github.com/evaluation-tools/nutcracker-db
```
Check the Nutcracker DB readme for the full list of implemented tasks.
QuickStart
Case Study: Evaluate (Any) LLM API on TruthfulQA (Script)
STEP 1: Define Model
- Define a simple model class with a "respond(self, user_prompt)" function.
- We will use OpenAI here, but really, any API can be evaluated as long as a "respond(self, user_prompt)" function that returns the LLM response as a string exists. Get creative (HuggingFace API, Anthropic API, Replicate API, Ollama, etc.).

```python
from openai import OpenAI
import os, logging, sys

logging.basicConfig(level=logging.INFO)
logging.getLogger('httpx').setLevel(logging.CRITICAL)
os.environ["OPENAI_API_KEY"] = ""
client = OpenAI()

class ChatGPT:
    def __init__(self):
        self.model = "gpt-3.5-turbo"

    def respond(self, user_prompt):
        response_data = None
        while response_data is None:
            try:
                completion = client.chat.completions.create(
                    model=self.model,
                    messages=[
                        {"role": "user", "content": f"{user_prompt}"}
                    ],
                    timeout=15,
                )
                response_data = completion.choices[0].message.content
                break
            except KeyboardInterrupt:
                sys.exit()
            except Exception:
                print("Request timed out, retrying...")
        return response_data
```
STEP 2: Run Evaluation
```python
from nutcracker.data import Task, Pile
from nutcracker.runs import Schema
from nutcracker.evaluator import MCQEvaluator, generate_report

# this db_directory value should work off-the-shelf if you cloned both repositories in the same directory
truthfulqa = Task.load_from_db(task_name='truthfulqa-mc1', db_directory='nutcracker-db/db')

# sample 20 instances for the demo
truthfulqa.sample(20, in_place=True)

# running this experiment updates each instance's model_response property in the truthfulqa data object with ChatGPT responses
experiment = Schema(model=ChatGPT(), data=truthfulqa)
experiment.run()

# running this evaluation updates each instance's response_correct property in the truthfulqa data object with evaluations
evaluation = MCQEvaluator(data=truthfulqa)
evaluation.run()

for i in range(len(truthfulqa)):
    print(truthfulqa[i].user_prompt)
    print(truthfulqa[i].model_response)
    print(truthfulqa[i].correct_options)
    print(truthfulqa[i].response_correct)
    print()

print(generate_report(truthfulqa, save_path='accuracy_report.txt'))
```
Case Study: Task vs. Pile? Evaluating LLaMA on MMLU (Script)
STEP 1: Understand the basis of Nutcracker
- Despite our lengthy history of model evaluation, my understanding is that the field has not reached a clear consensus on what a "benchmark" is (Is MMLU a "benchmark"? Is the Hugging Face Open LLM Leaderboard a "benchmark"?).
- Instead of using the word benchmark, Nutcracker divides the data structure into Instance, Task, and Pile (See blog post: HERE).
- Nutcracker DB is constructed at the Task level, but you can call multiple Tasks together at the Pile level, as in the sketch below.
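As a quick sketch of the difference, here is how the two levels are loaded using the same calls that appear elsewhere in this README (assuming nutcracker-db is cloned alongside your working directory, as in the installation steps):

```python
from nutcracker.data import Task, Pile

# a Task is a single dataset pulled from the local Nutcracker DB checkout
truthfulqa = Task.load_from_db(task_name='truthfulqa-mc1', db_directory='nutcracker-db/db')

# a Pile bundles multiple related Tasks and is loaded the same way
mmlu = Pile.load_from_db('mmlu', 'nutcracker-db/db')
```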
STEP 2: Define Model
- Since we've tried the OpenAI API above, let's now try a Hugging Face Inference Endpoint. Most open-source models are accessible through this option. (See blog post: HERE)
```python
import requests

class LLaMA:
    def __init__(self):
        self.API_URL = "https://xxxx.us-east-1.aws.endpoints.huggingface.cloud"

    def query(self, payload):
        headers = {
            "Accept": "application/json",
            "Authorization": "Bearer hf_XXXXX",
            "Content-Type": "application/json"
        }
        response = requests.post(self.API_URL, headers=headers, json=payload)
        return response.json()

    def respond(self, user_prompt):
        output = self.query({
            "inputs": f"<s>[INST] <<SYS>> You are a helpful assistant. You keep your answers short. <</SYS>> {user_prompt}",
        })
        return output[0]['generated_text']
```
STEP 3: Load Data
```python
from nutcracker.data import Pile
import logging

logging.basicConfig(level=logging.INFO)

mmlu = Pile.load_from_db('mmlu', 'nutcracker-db/db')
```
STEP 4: Run Experiment (Retrieve Model Responses)
- Running the experiment updates each instance's model_response attribute within the data object, which is the mmlu Pile in this case.
- You can save the data object at any step of the evaluation. Let's save it this time so we don't have to repeat the API requests if anything goes wrong.
```python
from nutcracker.runs import Schema

mmlu.sample(n=1000, in_place=True)

experiment = Schema(model=LLaMA(), data=mmlu)
experiment.run()

mmlu.save_to_file('mmlu-llama.pkl')
```
- You can load the saved file and check how the model responded.
```python
loaded_mmlu = Pile.load_from_file('mmlu-llama.pkl')

for i in range(len(loaded_mmlu)):
    print("\n\n\n---\n")
    print("Prompt:")
    print(loaded_mmlu[i].user_prompt)
    print("\nResponses:")
    print(loaded_mmlu[i].model_response)
```
STEP 5: Run Evaluation
- LLMs often don't respond with immediately recognizable letters like A, B, C, or D.
- Therefore, Nutcracker supports an intent-matching feature (requires an OpenAI API key) that parses the model response to match discrete labels, but let's disable it for now and proceed with our evaluation.
- We recommend using intent-matching for almost all use cases (see the sketch after the code block below). We will publish detailed research on this later.
```python
from nutcracker.evaluator import MCQEvaluator, generate_report

evaluation = MCQEvaluator(data=loaded_mmlu, disable_intent_matching=True)
evaluation.run()

print(generate_report(loaded_mmlu, save_path='accuracy_report.txt'))
```
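For reference, a minimal sketch with intent matching left enabled might look like the following; it assumes the feature is active whenever `disable_intent_matching` is not set and that the OpenAI key is read from the `OPENAI_API_KEY` environment variable, as in the QuickStart above:

```python
import os

# assumption: intent matching runs by default and needs an OpenAI key in the environment
os.environ["OPENAI_API_KEY"] = ""  # fill in your key

evaluation = MCQEvaluator(data=loaded_mmlu)  # disable_intent_matching not set
evaluation.run()

print(generate_report(loaded_mmlu, save_path='accuracy_report.txt'))
```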
https://github.com/brucewlee/nutcracker/assets/54278520/6deb5362-fd48-470e-9964-c794425811d9
Tutorials
- Evaluating on HuggingFace Inference Endpoints -> HERE / Medium
- Understanding Instance-Task-Pile -> HERE / Medium
Owner
- Name: Bruce W. Lee (이웅성)
- Login: brucewlee
- Kind: user
- Location: Philadelphia, PA
- Company: University of Pennsylvania
- Website: brucewlee.github.io
- Repositories: 3
- Profile: https://github.com/brucewlee
- Bio: Research Scientist - NLP
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Lee"
    given-names: "Bruce W."
    orcid: "https://orcid.org/0000-0003-0921-7794"
title: "Nutcracker"
version: 10.5281/zenodo.10688939
doi: 10.5281/zenodo.10680579
date-released: 2024.02.21
url: "https://github.com/brucewlee/nutcracker"
GitHub Events
Total
Last Year
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Bruce Lee | b****s@s****u | 76 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- openai >=1.10.0
- pyyaml >=6.0.1
- actions/checkout v2 composite
- actions/create-release v1 composite
- actions/setup-python v2 composite
- actions/upload-release-asset v1 composite
- pypa/gh-action-pypi-publish master composite