https://github.com/amazon-science/prefeval

https://github.com/amazon-science/prefeval

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.6%) to scientific vocabulary
Last synced: 10 months ago · JSON representation

Repository

Basic Info
  • Host: GitHub
  • Owner: amazon-science
  • License: other
  • Language: Python
  • Default Branch: main
  • Size: 25.9 MB
Statistics
  • Stars: 15
  • Watchers: 2
  • Forks: 2
  • Open Issues: 6
  • Releases: 0
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme License

README.md

PrefEval Benchmark: Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs

| Website | Paper | Data |

mainfigure


🏆Performance Leaderboard on Subset Tasks🏆

Ranked by performance in the Reminder (10 Turns) column. This table presents the performance results for the topic: Travel-Restaurants. (With explicit preference and generation task)

| Model | Zero-shot (10 Turns) | Reminder (10 Turns) | Zero-shot (300 Turns) | Reminder (300 Turns) | |--------------------|----------------------|-------------------------|-----------------------|----------------------| | o1-preview | 0.50 | 0.98 | 0.14 | 0.98 | | GPT-4o | 0.07 | 0.98 | 0.05 | 0.23 | | Claude-3-Sonnet | 0.05 | 0.96 | 0.04 | 0.36 | | Gemini-1.5-Pro | 0.07 | 0.91 | 0.09 | 0.05 | | Mistral-8x7B | 0.08 | 0.84 | - | - | | Mistral-7B | 0.03 | 0.75 | - | - | | Claude-3-Haiku | 0.05 | 0.68 | 0.02 | 0.02 | | Llama3-8B | 0.00 | 0.57 | - | - | | Claude-3.5-Sonnet| 0.07 | 0.45 | 0.02 | 0.02 | | Llama3-70B | 0.11 | 0.37 | - | - |


Dataset Location

The preference evaluation dataset is located in the benchmark_dataset directory.

Data Format

The dataset is provided in json format and contains the following attributes: 1. Explicit Preference. ``` { "preference": [string] The user's stated preference that the LLM should follow. "question": [string] The user's query related to the preference, where a generic response to this question is highly likely to violate the preference. "explanation": [string] A 1-sentence explanation of why answering this question in a preference-following way is challenging. }

2. Implicit Preference - Choice-based Conversation { "preference": [string] The user's explicit preference that the LLM should follow. "question": [string] The user's query related to the preference, where a generic response to this question is highly likely to violate the preference. "explanation": [string] A 1-sentence explanation of why answering this question in a preference-following way is challenging. "implicitquery": [string] A secondary query that offers further insight into the user’s preference, where the assistant provides multiple options. "options": [list] A set of options that the assistant presents in response to the user's implicit query, some of which align with and others that violate the user’s implied preference. "conversation": { "query": [string] ImplicitQuery, "assistantoptions": [string] The assistant's presenting multiple options, some aligned and some misaligned with the user's preference, "userselection": [string] The user's choice or rejection of certain options. "assistantacknowledgment": [string] The assistant's recognition of the user’s choice. }, "alignedop": [string] The option that aligns with the user’s preference. } ``` 3. Implicit Preference - Persona-driven Conversation

{ "preference": [string] The user's explicit preference that the LLM should follow. "question": [string] The user's query related to the preference, where a generic response to this question is highly likely to violate the preference. "explanation": [string] A 1-sentence explanation of why answering this question in a preference-following way is challenging. "persona": [string] The assigned persona guiding the conversation, e.g., "a retired postal worker enjoying his golden years.", "conversation": { "turn1": { "user": [string], "assistant": [string] }, "turn2": { "user": [string], "assistant": [string] }, ..., "turnN": { "user": [string], "assistant": [string] } }, }

Benchmarking on PrefEval

Environment Setup

Create a conda environment:

conda create -n prefeval python=3.10 -y conda activate prefeval

Install the required dependencies:

pip install -r requirements.txt

Set up AWS credentials for calling Bedrock API. - Follow the instruction here to install aws cli. - Run the following command and enter your aws credentials: AWS Access Key ID and AWS Secret Access Key aws configure

Example Usages:

The following scripts demonstrate how to benchmark various scenarios. You can flexibly modify the arguments within these scripts to assess different topics, preference styles, and inter-turn conversation numbers to create varying task difficulties.

Example 1: Benchmark Generation Tasks

cd example_scripts

  1. Benchmark Claude 3 Haiku with zero-shot on explicit preferences, using 3 inter-turns for the travel restaurant topic: bash run_and_eval_explicit.sh
  2. Benchmark Claude 3 Haiku with zero-shot on implicit preferences, using persona-based preferences and 2 inter-turns: bash run_and_eval_implicit.sh

Example 2: Benchmark Classification Tasks

  1. Benchmark classification tasks on all topics with explicit/implicit preferences, using Claude 3 Haiku with zero-shot and 0 inter-turns: bash run_mcq_task.sh

Example 3: Test 5 baselines methods

  1. Test 5 baseline methods on explicit preferences: zero-shot, reminder, chain-of-thought, RAG, self-critic.

bash run_and_eval_explicit_baselines.sh

Note: All benchmarking results will be saved in the benchmark_results/ directory.


SFT Code

Code and instructions for SFT (Supervised Fine-Tuning) are located in the SFT/ directory.


Benchmark preference and query pair generation:

We provides code for generating preference-query pairs. While our final benchmark dataset includes extensive human filtering and iterative labeling, we provide the initial sampling code for reproducibility.

cd benchmark_dataset python claude_generate_preferences_questions.py

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Issues event: 10
  • Watch event: 22
  • Delete event: 4
  • Issue comment event: 15
  • Push event: 6
  • Public event: 1
  • Pull request event: 15
  • Fork event: 4
  • Create event: 8
Last Year
  • Issues event: 10
  • Watch event: 22
  • Delete event: 4
  • Issue comment event: 15
  • Push event: 6
  • Public event: 1
  • Pull request event: 15
  • Fork event: 4
  • Create event: 8

Dependencies

requirements.txt pypi
  • Jinja2 ==3.1.5
  • accelerate ==0.34.2
  • aiohappyeyeballs ==2.4.0
  • aiohttp ==3.10.11
  • aiosignal ==1.3.1
  • annotated-types ==0.7.0
  • anyio ==4.6.0
  • attrs ==24.2.0
  • beautifulsoup4 ==4.12.3
  • bitsandbytes ==0.44.1
  • boto3 ==1.35.23
  • botocore ==1.35.23
  • bs4 ==0.0.2
  • cachetools ==5.5.0
  • click ==8.1.7
  • cloudpickle ==3.1.0
  • datasets ==3.0.0
  • dill ==0.3.8
  • diskcache ==5.6.3
  • einops ==0.8.0
  • exceptiongroup ==1.2.2
  • fastapi ==0.115.2
  • filelock ==3.16.1
  • fonttools ==4.54.1
  • fsspec ==2024.6.1
  • gguf ==0.10.0
  • google-ai-generativelanguage ==0.6.10
  • google-api-core ==2.20.0
  • google-api-python-client ==2.147.0
  • google-auth ==2.35.0
  • google-auth-httplib2 ==0.2.0
  • google-generativeai ==0.8.2
  • googleapis-common-protos ==1.65.0
  • grpcio ==1.66.1
  • grpcio-status ==1.66.1
  • h11 ==0.14.0
  • httpcore ==1.0.5
  • httplib2 ==0.22.0
  • httpx ==0.27.2
  • huggingface-hub ==0.25.0
  • idna ==3.7
  • importlib_metadata ==8.5.0
  • joblib ==1.4.2
  • jsonschema ==4.23.0
  • kaleido ==0.2.1
  • kiwisolver ==1.4.7
  • lark ==1.2.2
  • matplotlib ==3.9.2
  • mpmath ==1.3.0
  • msgpack ==1.1.0
  • msgspec ==0.18.6
  • nest-asyncio ==1.6.0
  • networkx ==3.3
  • numba ==0.60.0
  • numpy ==1.26.4
  • openai ==1.49.0
  • opencv-python-headless ==4.10.0.84
  • outlines ==0.0.46
  • pandas ==2.2.2
  • peft ==0.13.2
  • pillow ==10.4.0
  • plotly ==5.24.1
  • prometheus-fastapi-instrumentator ==7.0.0
  • prometheus_client ==0.21.0
  • proto-plus ==1.24.0
  • protobuf ==5.28.2
  • psutil ==6.0.0
  • pycountry ==24.6.1
  • pydantic ==2.9.2
  • python-dateutil ==2.9.0.post0
  • python-dotenv ==1.0.1
  • pytz ==2024.2
  • ray ==2.37.0
  • regex ==2024.9.11
  • requests ==2.32.3
  • safetensors ==0.4.5
  • scikit-learn ==1.5.2
  • scipy ==1.14.1
  • seaborn ==0.13.2
  • sentencepiece ==0.2.0
  • six ==1.16.0
  • sniffio ==1.3.1
  • soupsieve ==2.6
  • starlette ==0.40.0
  • tenacity ==9.0.0
  • threadpoolctl ==3.5.0
  • tiktoken ==0.7.0
  • tokenizers ==0.20.1
  • torch ==2.4.0
  • torchvision ==0.19.0
  • tqdm ==4.66.5
  • transformers ==4.45.2
  • triton ==3.0.0
  • typing_extensions ==4.12.2
  • tzdata ==2024.1
  • uritemplate ==4.1.1
  • urllib3 ==2.2.2
  • uvicorn ==0.32.0
  • uvloop ==0.21.0
  • vllm ==0.6.3
  • watchfiles ==0.24.0
  • websockets ==13.1
  • xformers ==0.0.27.post2
  • xxhash ==3.5.0
  • yarl ==1.11.1
  • zipp ==3.20.2
  • zstandard ==0.22.0