https://github.com/amazon-science/prefeval

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
✓
Academic publication links
Links to: arxiv.org
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (6.6%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: amazon-science
License: other
Language: Python
Default Branch: main
Size: 25.9 MB

Statistics

Stars: 15
Watchers: 2
Forks: 2
Open Issues: 6
Releases: 0

Created over 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme License

PrefEval Benchmark: Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs

| Website | Paper | Data |

mainfigure

🏆Performance Leaderboard on Subset Tasks🏆

Ranked by performance in the Reminder (10 Turns) column. This table presents the performance results for the topic: Travel-Restaurants. (With explicit preference and generation task)

| Model | Zero-shot (10 Turns) | Reminder (10 Turns) | Zero-shot (300 Turns) | Reminder (300 Turns) | |--------------------|----------------------|-------------------------|-----------------------|----------------------| | o1-preview | 0.50 | 0.98 | 0.14 | 0.98 | | GPT-4o | 0.07 | 0.98 | 0.05 | 0.23 | | Claude-3-Sonnet | 0.05 | 0.96 | 0.04 | 0.36 | | Gemini-1.5-Pro | 0.07 | 0.91 | 0.09 | 0.05 | | Mistral-8x7B | 0.08 | 0.84 | - | - | | Mistral-7B | 0.03 | 0.75 | - | - | | Claude-3-Haiku | 0.05 | 0.68 | 0.02 | 0.02 | | Llama3-8B | 0.00 | 0.57 | - | - | | Claude-3.5-Sonnet| 0.07 | 0.45 | 0.02 | 0.02 | | Llama3-70B | 0.11 | 0.37 | - | - |

Dataset Location

The preference evaluation dataset is located in the benchmark_dataset directory.

Data Format

The dataset is provided in json format and contains the following attributes: 1. Explicit Preference. ``` { "preference": [string] The user's stated preference that the LLM should follow. "question": [string] The user's query related to the preference, where a generic response to this question is highly likely to violate the preference. "explanation": [string] A 1-sentence explanation of why answering this question in a preference-following way is challenging. }

2. Implicit Preference - Choice-based Conversation { "preference": [string] The user's explicit preference that the LLM should follow. "question": [string] The user's query related to the preference, where a generic response to this question is highly likely to violate the preference. "explanation": [string] A 1-sentence explanation of why answering this question in a preference-following way is challenging. "implicitquery": [string] A secondary query that offers further insight into the user’s preference, where the assistant provides multiple options. "options": [list] A set of options that the assistant presents in response to the user's implicit query, some of which align with and others that violate the user’s implied preference. "conversation": { "query": [string] ImplicitQuery, "assistantoptions": [string] The assistant's presenting multiple options, some aligned and some misaligned with the user's preference, "userselection": [string] The user's choice or rejection of certain options. "assistantacknowledgment": [string] The assistant's recognition of the user’s choice. }, "alignedop": [string] The option that aligns with the user’s preference. } ``` 3. Implicit Preference - Persona-driven Conversation

{ "preference": [string] The user's explicit preference that the LLM should follow. "question": [string] The user's query related to the preference, where a generic response to this question is highly likely to violate the preference. "explanation": [string] A 1-sentence explanation of why answering this question in a preference-following way is challenging. "persona": [string] The assigned persona guiding the conversation, e.g., "a retired postal worker enjoying his golden years.", "conversation": { "turn1": { "user": [string], "assistant": [string] }, "turn2": { "user": [string], "assistant": [string] }, ..., "turnN": { "user": [string], "assistant": [string] } }, }

Benchmarking on PrefEval

Environment Setup

Create a conda environment:

conda create -n prefeval python=3.10 -y conda activate prefeval

Install the required dependencies:

pip install -r requirements.txt

Set up AWS credentials for calling Bedrock API. - Follow the instruction here to install aws cli. - Run the following command and enter your aws credentials: AWS Access Key ID and AWS Secret Access Key aws configure

Example Usages:

The following scripts demonstrate how to benchmark various scenarios. You can flexibly modify the arguments within these scripts to assess different topics, preference styles, and inter-turn conversation numbers to create varying task difficulties.

Example 1: Benchmark Generation Tasks

cd example_scripts

Benchmark Claude 3 Haiku with zero-shot on explicit preferences, using 3 inter-turns for the travel restaurant topic: bash run_and_eval_explicit.sh
Benchmark Claude 3 Haiku with zero-shot on implicit preferences, using persona-based preferences and 2 inter-turns: bash run_and_eval_implicit.sh

Example 2: Benchmark Classification Tasks

Benchmark classification tasks on all topics with explicit/implicit preferences, using Claude 3 Haiku with zero-shot and 0 inter-turns: bash run_mcq_task.sh

Example 3: Test 5 baselines methods

Test 5 baseline methods on explicit preferences: zero-shot, reminder, chain-of-thought, RAG, self-critic.

bash run_and_eval_explicit_baselines.sh

Note: All benchmarking results will be saved in the benchmark_results/ directory.

SFT Code

Code and instructions for SFT (Supervised Fine-Tuning) are located in the SFT/ directory.

Benchmark preference and query pair generation:

We provides code for generating preference-query pairs. While our final benchmark dataset includes extensive human filtering and iterative labeling, we provide the initial sampling code for reproducibility.

cd benchmark_dataset python claude_generate_preferences_questions.py

Owner

Name: Amazon Science
Login: amazon-science
Kind: organization

Website: https://amazon.science
Twitter: AmazonScience
Repositories: 80
Profile: https://github.com/amazon-science

GitHub Events

Total

Issues event: 10
Watch event: 22
Delete event: 4
Issue comment event: 15
Push event: 6
Public event: 1
Pull request event: 15
Fork event: 4
Create event: 8

Last Year

Issues event: 10
Watch event: 22
Delete event: 4
Issue comment event: 15
Push event: 6
Public event: 1
Pull request event: 15
Fork event: 4
Create event: 8

Dependencies

requirements.txt pypi

Jinja2 ==3.1.5
accelerate ==0.34.2
aiohappyeyeballs ==2.4.0
aiohttp ==3.10.11
aiosignal ==1.3.1
annotated-types ==0.7.0
anyio ==4.6.0
attrs ==24.2.0
beautifulsoup4 ==4.12.3
bitsandbytes ==0.44.1
boto3 ==1.35.23
botocore ==1.35.23
bs4 ==0.0.2
cachetools ==5.5.0
click ==8.1.7
cloudpickle ==3.1.0
datasets ==3.0.0
dill ==0.3.8
diskcache ==5.6.3
einops ==0.8.0
exceptiongroup ==1.2.2
fastapi ==0.115.2
filelock ==3.16.1
fonttools ==4.54.1
fsspec ==2024.6.1
gguf ==0.10.0
google-ai-generativelanguage ==0.6.10
google-api-core ==2.20.0
google-api-python-client ==2.147.0
google-auth ==2.35.0
google-auth-httplib2 ==0.2.0
google-generativeai ==0.8.2
googleapis-common-protos ==1.65.0
grpcio ==1.66.1
grpcio-status ==1.66.1
h11 ==0.14.0
httpcore ==1.0.5
httplib2 ==0.22.0
httpx ==0.27.2
huggingface-hub ==0.25.0
idna ==3.7
importlib_metadata ==8.5.0
joblib ==1.4.2
jsonschema ==4.23.0
kaleido ==0.2.1
kiwisolver ==1.4.7
lark ==1.2.2
matplotlib ==3.9.2
mpmath ==1.3.0
msgpack ==1.1.0
msgspec ==0.18.6
nest-asyncio ==1.6.0
networkx ==3.3
numba ==0.60.0
numpy ==1.26.4
openai ==1.49.0
opencv-python-headless ==4.10.0.84
outlines ==0.0.46
pandas ==2.2.2
peft ==0.13.2
pillow ==10.4.0
plotly ==5.24.1
prometheus-fastapi-instrumentator ==7.0.0
prometheus_client ==0.21.0
proto-plus ==1.24.0
protobuf ==5.28.2
psutil ==6.0.0
pycountry ==24.6.1
pydantic ==2.9.2
python-dateutil ==2.9.0.post0
python-dotenv ==1.0.1
pytz ==2024.2
ray ==2.37.0
regex ==2024.9.11
requests ==2.32.3
safetensors ==0.4.5
scikit-learn ==1.5.2
scipy ==1.14.1
seaborn ==0.13.2
sentencepiece ==0.2.0
six ==1.16.0
sniffio ==1.3.1
soupsieve ==2.6
starlette ==0.40.0
tenacity ==9.0.0
threadpoolctl ==3.5.0
tiktoken ==0.7.0
tokenizers ==0.20.1
torch ==2.4.0
torchvision ==0.19.0
tqdm ==4.66.5
transformers ==4.45.2
triton ==3.0.0
typing_extensions ==4.12.2
tzdata ==2024.1
uritemplate ==4.1.1
urllib3 ==2.2.2
uvicorn ==0.32.0
uvloop ==0.21.0
vllm ==0.6.3
watchfiles ==0.24.0
websockets ==13.1
xformers ==0.0.27.post2
xxhash ==3.5.0
yarl ==1.11.1
zipp ==3.20.2
zstandard ==0.22.0

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science