https://github.com/amazon-science/prefeval
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
✓Academic publication links
Links to: arxiv.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (6.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: amazon-science
- License: other
- Language: Python
- Default Branch: main
- Size: 25.9 MB
Statistics
- Stars: 15
- Watchers: 2
- Forks: 2
- Open Issues: 6
- Releases: 0
Metadata Files
README.md
PrefEval Benchmark: Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs
🏆Performance Leaderboard on Subset Tasks🏆
Ranked by performance in the Reminder (10 Turns) column. This table presents the performance results for the topic: Travel-Restaurants. (With explicit preference and generation task)
| Model | Zero-shot (10 Turns) | Reminder (10 Turns) | Zero-shot (300 Turns) | Reminder (300 Turns) | |--------------------|----------------------|-------------------------|-----------------------|----------------------| | o1-preview | 0.50 | 0.98 | 0.14 | 0.98 | | GPT-4o | 0.07 | 0.98 | 0.05 | 0.23 | | Claude-3-Sonnet | 0.05 | 0.96 | 0.04 | 0.36 | | Gemini-1.5-Pro | 0.07 | 0.91 | 0.09 | 0.05 | | Mistral-8x7B | 0.08 | 0.84 | - | - | | Mistral-7B | 0.03 | 0.75 | - | - | | Claude-3-Haiku | 0.05 | 0.68 | 0.02 | 0.02 | | Llama3-8B | 0.00 | 0.57 | - | - | | Claude-3.5-Sonnet| 0.07 | 0.45 | 0.02 | 0.02 | | Llama3-70B | 0.11 | 0.37 | - | - |
Dataset Location
The preference evaluation dataset is located in the benchmark_dataset directory.
Data Format
The dataset is provided in json format and contains the following attributes: 1. Explicit Preference. ``` { "preference": [string] The user's stated preference that the LLM should follow. "question": [string] The user's query related to the preference, where a generic response to this question is highly likely to violate the preference. "explanation": [string] A 1-sentence explanation of why answering this question in a preference-following way is challenging. }
2. Implicit Preference - Choice-based Conversation
{
"preference": [string] The user's explicit preference that the LLM should follow.
"question": [string] The user's query related to the preference, where a generic response to this question is highly likely to violate the preference.
"explanation": [string] A 1-sentence explanation of why answering this question in a preference-following way is challenging.
"implicitquery": [string] A secondary query that offers further insight into the user’s preference, where the assistant provides multiple options.
"options": [list] A set of options that the assistant presents in response to the user's implicit query, some of which align with and others that violate the user’s implied preference.
"conversation": {
"query": [string] ImplicitQuery,
"assistantoptions": [string] The assistant's presenting multiple options, some aligned and some misaligned with the user's preference,
"userselection": [string] The user's choice or rejection of certain options.
"assistantacknowledgment": [string] The assistant's recognition of the user’s choice.
},
"alignedop": [string] The option that aligns with the user’s preference.
}
```
3. Implicit Preference - Persona-driven Conversation
{
"preference": [string] The user's explicit preference that the LLM should follow.
"question": [string] The user's query related to the preference, where a generic response to this question is highly likely to violate the preference.
"explanation": [string] A 1-sentence explanation of why answering this question in a preference-following way is challenging.
"persona": [string] The assigned persona guiding the conversation, e.g., "a retired postal worker enjoying his golden years.",
"conversation": {
"turn1": { "user": [string], "assistant": [string] },
"turn2": { "user": [string], "assistant": [string] },
...,
"turnN": { "user": [string], "assistant": [string] }
},
}
Benchmarking on PrefEval
Environment Setup
Create a conda environment:
conda create -n prefeval python=3.10 -y
conda activate prefeval
Install the required dependencies:
pip install -r requirements.txt
Set up AWS credentials for calling Bedrock API.
- Follow the instruction here to install aws cli.
- Run the following command and enter your aws credentials: AWS Access Key ID and AWS Secret Access Key
aws configure
Example Usages:
The following scripts demonstrate how to benchmark various scenarios. You can flexibly modify the arguments within these scripts to assess different topics, preference styles, and inter-turn conversation numbers to create varying task difficulties.
Example 1: Benchmark Generation Tasks
cd example_scripts
- Benchmark Claude 3 Haiku with zero-shot on explicit preferences, using 3 inter-turns for the travel restaurant topic:
bash run_and_eval_explicit.sh - Benchmark Claude 3 Haiku with zero-shot on implicit preferences, using persona-based preferences and 2 inter-turns:
bash run_and_eval_implicit.sh
Example 2: Benchmark Classification Tasks
- Benchmark classification tasks on all topics with explicit/implicit preferences, using Claude 3 Haiku with zero-shot and 0 inter-turns:
bash run_mcq_task.sh
Example 3: Test 5 baselines methods
- Test 5 baseline methods on explicit preferences: zero-shot, reminder, chain-of-thought, RAG, self-critic.
bash run_and_eval_explicit_baselines.sh
Note: All benchmarking results will be saved in the benchmark_results/ directory.
SFT Code
Code and instructions for SFT (Supervised Fine-Tuning) are located in the SFT/ directory.
Benchmark preference and query pair generation:
We provides code for generating preference-query pairs. While our final benchmark dataset includes extensive human filtering and iterative labeling, we provide the initial sampling code for reproducibility.
cd benchmark_dataset
python claude_generate_preferences_questions.py
Owner
- Name: Amazon Science
- Login: amazon-science
- Kind: organization
- Website: https://amazon.science
- Twitter: AmazonScience
- Repositories: 80
- Profile: https://github.com/amazon-science
GitHub Events
Total
- Issues event: 10
- Watch event: 22
- Delete event: 4
- Issue comment event: 15
- Push event: 6
- Public event: 1
- Pull request event: 15
- Fork event: 4
- Create event: 8
Last Year
- Issues event: 10
- Watch event: 22
- Delete event: 4
- Issue comment event: 15
- Push event: 6
- Public event: 1
- Pull request event: 15
- Fork event: 4
- Create event: 8
Dependencies
- Jinja2 ==3.1.5
- accelerate ==0.34.2
- aiohappyeyeballs ==2.4.0
- aiohttp ==3.10.11
- aiosignal ==1.3.1
- annotated-types ==0.7.0
- anyio ==4.6.0
- attrs ==24.2.0
- beautifulsoup4 ==4.12.3
- bitsandbytes ==0.44.1
- boto3 ==1.35.23
- botocore ==1.35.23
- bs4 ==0.0.2
- cachetools ==5.5.0
- click ==8.1.7
- cloudpickle ==3.1.0
- datasets ==3.0.0
- dill ==0.3.8
- diskcache ==5.6.3
- einops ==0.8.0
- exceptiongroup ==1.2.2
- fastapi ==0.115.2
- filelock ==3.16.1
- fonttools ==4.54.1
- fsspec ==2024.6.1
- gguf ==0.10.0
- google-ai-generativelanguage ==0.6.10
- google-api-core ==2.20.0
- google-api-python-client ==2.147.0
- google-auth ==2.35.0
- google-auth-httplib2 ==0.2.0
- google-generativeai ==0.8.2
- googleapis-common-protos ==1.65.0
- grpcio ==1.66.1
- grpcio-status ==1.66.1
- h11 ==0.14.0
- httpcore ==1.0.5
- httplib2 ==0.22.0
- httpx ==0.27.2
- huggingface-hub ==0.25.0
- idna ==3.7
- importlib_metadata ==8.5.0
- joblib ==1.4.2
- jsonschema ==4.23.0
- kaleido ==0.2.1
- kiwisolver ==1.4.7
- lark ==1.2.2
- matplotlib ==3.9.2
- mpmath ==1.3.0
- msgpack ==1.1.0
- msgspec ==0.18.6
- nest-asyncio ==1.6.0
- networkx ==3.3
- numba ==0.60.0
- numpy ==1.26.4
- openai ==1.49.0
- opencv-python-headless ==4.10.0.84
- outlines ==0.0.46
- pandas ==2.2.2
- peft ==0.13.2
- pillow ==10.4.0
- plotly ==5.24.1
- prometheus-fastapi-instrumentator ==7.0.0
- prometheus_client ==0.21.0
- proto-plus ==1.24.0
- protobuf ==5.28.2
- psutil ==6.0.0
- pycountry ==24.6.1
- pydantic ==2.9.2
- python-dateutil ==2.9.0.post0
- python-dotenv ==1.0.1
- pytz ==2024.2
- ray ==2.37.0
- regex ==2024.9.11
- requests ==2.32.3
- safetensors ==0.4.5
- scikit-learn ==1.5.2
- scipy ==1.14.1
- seaborn ==0.13.2
- sentencepiece ==0.2.0
- six ==1.16.0
- sniffio ==1.3.1
- soupsieve ==2.6
- starlette ==0.40.0
- tenacity ==9.0.0
- threadpoolctl ==3.5.0
- tiktoken ==0.7.0
- tokenizers ==0.20.1
- torch ==2.4.0
- torchvision ==0.19.0
- tqdm ==4.66.5
- transformers ==4.45.2
- triton ==3.0.0
- typing_extensions ==4.12.2
- tzdata ==2024.1
- uritemplate ==4.1.1
- urllib3 ==2.2.2
- uvicorn ==0.32.0
- uvloop ==0.21.0
- vllm ==0.6.3
- watchfiles ==0.24.0
- websockets ==13.1
- xformers ==0.0.27.post2
- xxhash ==3.5.0
- yarl ==1.11.1
- zipp ==3.20.2
- zstandard ==0.22.0