unified-prompt-selection
[TACL 2024] Improving Probability-based Prompt Selection Through Unified Evaluation and Analysis
Improving Probability-based Prompt Selection Through Unified Evaluation and Analysis
Official code for the paper Improving Probability-based Prompt Selection Through Unified Evaluation and Analysis accepted at TACL 2024.
You can read the summary of our paper in this Twitter Thread.
Use the following to cite our paper:
```bibtex
@article{yang2024improving,
  title={Improving Probability-based Prompt Selection Through Unified Evaluation and Analysis},
  author={Yang, Sohee and Kim, Jonghyeon and Jang, Joel and Ye, Seonghyeon and Lee, Hyunji and Seo, Minjoon},
  journal={TACL},
  year={2024}
}
```
This repository provides a set of tools for easy utilization and unified evaluation of different probability-based prompt selection methods.
If you're interested in reproducing the experimental results mentioned in the paper, please refer to the Reproduction section.
Features
- Extraction of the language model's output probability necessary for prompt selection score calculation.
- Probability-based prompt selection methods.
- Calibration methods.
- Task-wise prompt selection and Instance-wise prompt selection.
- Custom prompt addition that conforms to the template format of promptsource.
Installation
Install from source. Using a Python virtual environment or a Conda environment is recommended.
```bash
git clone git@github.com:soheeyang/unified-prompt-selection.git
cd unified-prompt-selection
pip install -r requirements.txt
```
Python version 3.9+ is required.
Overview

The LLM first runs inference over the given prompt candidates and dataset to produce $p(y|x,t)$, the output probability needed to calculate prompt selection scores. The prompt selection step then loads the extracted $p(y|x,t)$, calculates the scores, chooses a prompt based on them, and returns the selection result. The OTR (One-Token Response) Converter is used when $p(y|x,t)$ is calculated from only the first-token logits.
Quick Start
You can quickly explore the core functionalities by referring to the following notebook files.
LLM Inference & Prompt Selection
By running run_prompt_selection.py, you can extract $p(y|x,t)$ and select a prompt.
The extracted $p(y|x,t)$ through LLM inference is stored in the './extraction/results' directory. When running run_prompt_selection.py with the same configuration, it utilizes the previously saved extraction results without additional inference. For detailed information about inference, please refer to the $p(y|x,t)$ Extraction section.
```bash
python run_prompt_selection.py
```
Running the command as shown above will execute Prompt Selection according to the predefined default arguments.
You can also execute various combinations by adding -m or --multirun as follows:
```bash
python run_prompt_selection.py -m \
    method=MI,MI_G,MI_L,MI_GL,GE,GE_M,LE,MDL,MDL_M,ZLP,ZPM,ZMV,PPL,PPL_L \
    calibration=cbm-softmax,cbm-mean,cc-softmax,cc-mean,pmi-softmax,pmi-mean \
    decoder=opt-1.3b,opt-2.7b,opt-6.7b,opt-30b,opt-66b,gpt-neo-1.3b,gpt-neo-2.7b,gpt-j-6b,gpt2-xl,bloom-3b \
    dataset=sst2,ag_news,cb,imdb,newspop,rte,sst5,tweet_emotion,tweet_irony,piqa,copa,hellaswag,story_cloze \
    prompt=base_prompts,v12_prompts,v2_prompts,fewshot_prompt \
    first_token=false,true \
    sum_log_prob=false,true \
    fewshot=null,'1,2,4' \
    filter=false,true \
    unbalance=false,true
```
We used hydra to manage complex configurations. You can check the configurations in ./conf, and besides specifying arguments on the command line, you can modify the arguments by editing the ./conf/config.yaml file.
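To get a feel for the size of the multirun command above, the sketch below counts the jobs it would launch. It assumes Hydra's default basic sweeper, which runs one job per element of the Cartesian product of the comma-separated values. The `SWEEP` dictionary and `count_jobs` helper are illustrative names, not part of this repository.

```python
from math import prod

# Values copied from the multirun command above; each key is one swept argument.
SWEEP = {
    "method": "MI,MI_G,MI_L,MI_GL,GE,GE_M,LE,MDL,MDL_M,ZLP,ZPM,ZMV,PPL,PPL_L".split(","),
    "calibration": "cbm-softmax,cbm-mean,cc-softmax,cc-mean,pmi-softmax,pmi-mean".split(","),
    "decoder": ("opt-1.3b,opt-2.7b,opt-6.7b,opt-30b,opt-66b,"
                "gpt-neo-1.3b,gpt-neo-2.7b,gpt-j-6b,gpt2-xl,bloom-3b").split(","),
    "dataset": ("sst2,ag_news,cb,imdb,newspop,rte,sst5,tweet_emotion,tweet_irony,"
                "piqa,copa,hellaswag,story_cloze").split(","),
    "prompt": "base_prompts,v12_prompts,v2_prompts,fewshot_prompt".split(","),
    "first_token": ["false", "true"],
    "sum_log_prob": ["false", "true"],
    "fewshot": ["null", "1,2,4"],  # the quoted '1,2,4' counts as a single value
    "filter": ["false", "true"],
    "unbalance": ["false", "true"],
}


def count_jobs(sweep):
    """Number of jobs in the Cartesian product of all swept values."""
    return prod(len(values) for values in sweep.values())


print(count_jobs(SWEEP))  # 1397760
```

Because the full product is this large, it is usually more practical to sweep a few arguments at a time.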
After executing python run_prompt_selection.py, you can verify the prompt selection result through the CLI. The result of executing the above command is as follows:
```bash
[2023-10-30 02:38:35,738][main][INFO] -
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Evaluation results using predictions from selected prompt.
* X
Accuracy: 0.5906
F1 score: 0.5001
Prediction: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] ...
Target: [1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1] ...
Correct: [1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1] ...
* A
Accuracy: 0.9186
F1 score: 0.9183
Prediction: [1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1] ...
Target: [1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1] ...
Correct: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1] ...
* P
Accuracy: 0.5906
F1 score: 0.5001
Prediction: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] ...
Target: [1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1] ...
Correct: [1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1] ...
* PA
Accuracy: 0.9186
F1 score: 0.9183
Prediction: [1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1] ...
Target: [1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1] ...
Correct: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1] ...
Note that some datasets are missing label values,
so check the target results.
The predictions of the selected prompt were saved in the following file.
'./results/dataset=glue_sst2_validation__decoder=facebook--opt-2.7b__prompt=base_prompts__first_token=False__sum_log_prob=False__num_samples=1000__seed=42__fewshot=None__do_eval=True/method=MI__all_tokens=True__one_hot=False__select_for_each_x=False__cali_type=cbm__cali_norm_type=softmax__filter=False__unbalance=False.json'
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
You can obtain four different results depending on whether calibration is applied:
X: without applying any calibration
A: applying calibration only for Answer selection
P: applying calibration only for Prompt selection
PA: applying calibration for both Prompt selection and Answer selection.
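The four settings can be read as two independent switches: which probabilities are used to score and pick the prompt, and which are used to pick the answer. The sketch below illustrates this with nested lists indexed as `[prompt][instance][label]`. The scoring function `mean_max_prob` is a hypothetical stand-in, not one of the methods in this repository, and the calibrated probabilities are assumed to be precomputed.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def mean_max_prob(prompt_probs):
    """Hypothetical stand-in score: average probability of the top label."""
    return sum(max(row) for row in prompt_probs) / len(prompt_probs)


def predictions_per_mode(raw, calibrated, score=mean_max_prob):
    """Return the predictions of the selected prompt for X, A, P and PA."""
    modes = {
        "X": (raw, raw),                 # no calibration
        "A": (raw, calibrated),          # calibration for answer selection only
        "P": (calibrated, raw),          # calibration for prompt selection only
        "PA": (calibrated, calibrated),  # calibration for both
    }
    results = {}
    for mode, (select_src, answer_src) in modes.items():
        best_prompt = argmax([score(probs) for probs in select_src])
        results[mode] = [argmax(row) for row in answer_src[best_prompt]]
    return results


# Toy data: 2 prompts, 2 instances, 2 labels.
raw = [[[0.9, 0.1], [0.8, 0.2]], [[0.6, 0.4], [0.4, 0.6]]]
calibrated = [[[0.55, 0.45], [0.45, 0.55]], [[0.7, 0.3], [0.2, 0.8]]]
print(predictions_per_mode(raw, calibrated))
```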
The prompt selection result is saved in the './results/dataset=""__decoder=""__prompt=""__first_token=""__sum_log_prob=""__num_samples=""__seed=""__fewshot=""__do_eval=""/method=""__all_tokens=""__one_hot=""__select_for_each_x=""__cali_type=""__cali_norm_type=""__filter=""__unbalance="".json' file, matching the path printed in the log above. The format of the result file is as follows:
Prompt selection result format
Notes:
- Prompt selection can be applied to a dataset without ground truth labels, but the evaluation results cannot be verified in that case. Before applying prompt selection, check whether the dataset has ground truth labels; for a dataset without them, check whether the reported evaluation results are based on the Target of the prompt selection results.
Probability-based Prompt Selection Method

The following probability-based prompt selection methods are available: 'MI', 'GE', 'LE', 'MDL', 'ZLP', 'ZPM', 'ZMV', and 'PPL'.
To use a specific prompt selection method, pass the desired method to method. You can find detailed descriptions of each method in section 2.2 Existing Approaches of the paper.
Variants created by Prompt Selection Methods
The following methods are variants that modify the score calculation formula of existing probability-based prompt selection methods: 'MI_G', 'MI_L', 'MI_GL', 'GE_M', 'MDL_M', and 'PPL_L'.
You can check the arguments specific to these probability-based prompt selection methods in the ./conf/method directory.
If a method name is followed by '_L', it means that select_for_each_x is set to 'True', and instance-wise prompt selection is performed. The methods that support instance-wise prompt selection are 'MDL', 'MI', and 'PPL'.
If a method name is followed by '_G', it means that one_hot is set to 'True', and one-hot $p(y|x,t)$ is used for GE calculation.
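As a rough illustration of the entropy-based quantities behind these names, the sketch below computes mutual information as the entropy of the averaged distribution minus the average per-instance entropy, with an optional one-hot step for the first term. It assumes the usual textbook definitions. The function names are hypothetical and are not the repository's `get_mi` or `get_mi_g`, and nested lists of shape `[X][Y]` stand in for tensors.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def marginal(probs, one_hot=False):
    """Average p(y|x,t) over instances; optionally one-hot each row first."""
    num_labels = len(probs[0])
    totals = [0.0] * num_labels
    for row in probs:
        if one_hot:
            top = max(range(num_labels), key=row.__getitem__)
            row = [1.0 if i == top else 0.0 for i in range(num_labels)]
        for i, p in enumerate(row):
            totals[i] += p
    return [total / len(probs) for total in totals]


def mutual_information(probs, one_hot=False):
    """Entropy of the marginal minus the mean per-instance entropy."""
    global_entropy = entropy(marginal(probs, one_hot))
    local_entropy = sum(entropy(row) for row in probs) / len(probs)
    return global_entropy - local_entropy


soft = [[0.6, 0.4], [0.3, 0.7]]
print(mutual_information(soft), mutual_information(soft, one_hot=True))
```

With confident, balanced predictions such as `[[1.0, 0.0], [0.0, 1.0]]` the score reaches its maximum of `log(2)` for two labels, and with uniform predictions it is zero.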
Adding Custom Prompt Selection Method
If you want to add a new prompt selection method, refer to the ./method/methods.py module. Check other methods in this module, and create a new method according to the types and dimensions of input and output values.
||Task-wise Method|Instance-wise Method|
|:---------:|:-------:|:------------:|
|Input Variable|template_prob|tensor_dict_prob|
|Input Variable Type|torch.Tensor|torch.Tensor|
|Input Variable Size|[X, Y]|[T, X, Y]|
|Output Value Type|torch.Tensor|Tuple[List[float], List[int]]|
|Output Value Size|[](Scalar)|([X], [X])|
- T : The number of prompts
- X : The number of instances
- Y : The number of answer choices
The Instance-wise Method has two lists as output values. The first list should contain the results of calculated instance-wise prompt selection scores using the added method, and the second list should contain the indices of the selected prompts for each instance.
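The input and output shapes in the table can be sketched with a made-up method. The example below uses nested Python lists in place of `torch.Tensor` to stay dependency-free, and the "highest top-label probability" criterion is purely hypothetical. A real method added to ./method/methods.py would operate on tensors as described in the table.

```python
from typing import List, Tuple


def task_wise_max_conf(template_prob: List[List[float]]) -> float:
    """Task-wise: input of shape [X, Y], output a single scalar score."""
    return sum(max(row) for row in template_prob) / len(template_prob)


def instance_wise_max_conf(
    tensor_dict_prob: List[List[List[float]]],
) -> Tuple[List[float], List[int]]:
    """Instance-wise: input of shape [T, X, Y], output ([X] scores, [X] prompt indices)."""
    num_prompts = len(tensor_dict_prob)
    num_instances = len(tensor_dict_prob[0])
    scores, indices = [], []
    for x in range(num_instances):
        per_prompt = [max(tensor_dict_prob[t][x]) for t in range(num_prompts)]
        best = max(range(num_prompts), key=per_prompt.__getitem__)
        scores.append(per_prompt[best])
        indices.append(best)
    return scores, indices


# Toy data: T=2 prompts, X=2 instances, Y=2 answer choices.
probs = [[[0.9, 0.1], [0.5, 0.5]], [[0.6, 0.4], [0.01, 0.99]]]
print(task_wise_max_conf(probs[0]))
print(instance_wise_max_conf(probs))
```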
The newly added method is utilized in the get_*_wise_ps_scores function in the ./method/score.py module.
```python
methodFuncMap = {
    'MI': get_mi_g if one_hot else get_mi,
    'GE': get_ge if one_hot else get_ge_m,
    'LE': get_le,
    'MDL': get_le,
    'PPL': get_ppl,
    'ZLP': get_zlp,
    'ZPM': get_zpm,
    'ZMV': get_zmv,
}
```
When adding a method, follow the format of the entries in the methodFuncMap inside the get_*_wise_ps_result function. Additionally, add a yaml file representing the new method to the ./conf/method directory, following the yaml file format of the other methods in ./conf/method.
Calibration Method
Three calibration methods are available, allowing you to observe the changes in prompt selection results before and after applying each calibration.
You can check the arguments related to calibration in the ./conf/calibration directory.
You can choose one of the following calibration methods by specifying 'cbm', 'cc', or 'pmi' to cali_type. The default value is 'cbm'.
- cbm: Calibration By Marginalization, an enhanced calibration method proposed in this work that demonstrates effective results in prompt selection compared to existing methods.
- cc: Contextual Calibration, a calibration method proposed in Calibrate before use: Improving Few-Shot performance of language models.
- pmi: Domain Conditional PMI, a calibration method proposed in Surface form competition: Why the highest probability answer isn’t always right.
You can specify the normalization criteria for $\tilde{q}(y|x,t)$ using cali_norm_type. The options are 'softmax' or 'mean', and the default value is 'softmax'.
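The idea of calibration by marginalization can be sketched as dividing each $p(y|x,t)$ by the marginal $p(y|t)$, estimated here as the average of $p(y|x,t)$ over the instances, and then renormalizing over the labels. This is a minimal sketch under that assumption. The renormalization shown (dividing by the sum over labels) is only one possible choice and may not correspond exactly to either `cali_norm_type` option, and `calibrate_by_marginalization` is a hypothetical name.

```python
def calibrate_by_marginalization(probs):
    """probs has shape [X][Y]; returns calibrated probabilities of the same shape."""
    num_labels = len(probs[0])
    # Marginal p(y|t), estimated by averaging over instances (assumed nonzero).
    marginal = [sum(row[y] for row in probs) / len(probs) for y in range(num_labels)]
    calibrated = []
    for row in probs:
        ratios = [p / m for p, m in zip(row, marginal)]
        total = sum(ratios)
        calibrated.append([ratio / total for ratio in ratios])
    return calibrated


# A model biased toward label 0: both raw predictions are label 0.
raw = [[0.9, 0.1], [0.7, 0.3]]
calibrated = calibrate_by_marginalization(raw)
print([row.index(max(row)) for row in calibrated])  # [0, 1]
```

In this toy case the second instance flips to label 1 after calibration, because its probability for label 1 is above the average for that label.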
Dynamic Dataset
Dynamic datasets are datasets where answer choices are not fixed and vary for each instance. The experiment includes COPA, PIQA, StoryCloze, and Hellaswag as dynamic datasets. With such datasets, you can adjust the label distribution differently and observe the changes in prompt selection results based on the label distribution.
If you set unbalance to 'True', you can adjust the label distribution of COPA, PIQA, StoryCloze, and Hellaswag to be unbalanced at a ratio of 100:0 and observe the prompt selection results.
Filter
In Zero-Label Prompt Selection, a low-quality prompt filtering algorithm using k-means clustering and $p(y|x,t)$ was proposed. By setting filter to 'True', prompts can be filtered using the zero-label prompt selection (ZPS) filtering algorithm.
$p(y|x,t)$ Extraction
To perform probability-based prompt selection, you need to extract the output probability of the language model, denoted as $p(y|x,t)$, where $x,t$ represents the instantiated prompt.
If desired, it is also possible to perform LLM inference excluding prompt selection and extract $p(y|x,t)$ using the run_inference.py script.
```bash
python run_inference.py \
    decoder=opt-2.7b \
    dataset=sst2 \
    prompt=base_prompts \
    first_token=false \
    sum_log_prob=false \
    num_samples=1000 \
    seed=42 \
    fewshot=null \
    do_eval=true \
    mixed_precision=no
```
The arguments that affect extraction can be found in ./conf/config.yaml.
To calculate $p(y|x,t)$ using only the first token of $y$, set the first_token to 'True'. If set to 'False', all tokens of $y$ are used.
When calculating $p(y|x,t)$ using all tokens over $y$, you have the option to calculate either the mean log-probability or the sum of log-probability for all tokens over $y$. To extract the sum of log-probability for all tokens over $y$, set sum_log_prob to 'True'.
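The effect of `first_token` and `sum_log_prob` can be illustrated on the per-token log-probabilities of a single answer $y$. The helper below is a hypothetical sketch, not the repository's extraction code.

```python
def answer_log_prob(token_log_probs, first_token=False, sum_log_prob=False):
    """Combine the token log-probabilities of y into one score."""
    if first_token:
        # Use only the first token of y.
        return token_log_probs[0]
    total = sum(token_log_probs)
    # Sum over all tokens, or the mean when sum_log_prob is False.
    return total if sum_log_prob else total / len(token_log_probs)


log_probs = [-1.0, -2.0, -3.0]
print(answer_log_prob(log_probs, first_token=True))   # -1.0
print(answer_log_prob(log_probs))                     # -2.0
print(answer_log_prob(log_probs, sum_log_prob=True))  # -6.0
```

The mean does not penalize longer answers the way the sum does, which matters when answer choices differ in token length.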
To adjust the dataset sample size for prompt selection, use num_samples.
For reproducibility, you can set the random seed with seed.
If you want to select one of the fewshot prompts with a different number and order of examples for a single prompt template, you can use fewshot.
Detailed Explanation of fewshot
The values that can be passed to fewshot are '1', '2', '4', '1,2', '1,4', '2,4', '1,2,4'. Each number represents the number of randomly selected examples from the training dataset for the fewshot prompt. Passing '1' saves eight outputs of fewshot prompts with a single example, and then one of the eight prompts is selected. Passing '2' saves 20 outputs, and passing '4' saves 72 outputs. If multiple values are passed by concatenating them with ',', 100 (8+20+72) outputs are saved, and one of the 100 fewshot prompts, which is expected to work best, is selected.
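The output counts described for fewshot add up as follows. The mapping and helper below are a small hypothetical sketch that only reproduces the arithmetic stated in this section.

```python
# Number of saved fewshot prompt outputs per shot count, as stated above.
OUTPUTS_PER_SHOT = {1: 8, 2: 20, 4: 72}


def num_fewshot_outputs(fewshot):
    """Total saved outputs for a fewshot value such as '1' or '1,2,4'."""
    if fewshot is None:
        return 0  # fewshot=null: no fewshot prompts are generated
    shots = [int(part) for part in fewshot.split(",")]
    return sum(OUTPUTS_PER_SHOT[shot] for shot in shots)


print(num_fewshot_outputs("1"))      # 8
print(num_fewshot_outputs("1,2,4"))  # 100
```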
With do_eval, you can choose whether to record evaluation results ('Accuracy', 'Macro F1') for each prompt during the inference stage for extraction. If it is set to 'False', prompt selection can still be performed even if the dataset has no ground truth labels. However, you will only be able to see which prompt was finally selected and its predictions.
Notes:
- The results of $p(y|x,t)$ extraction are saved under the ./extraction/results/dataset=""__decoder=""__prompt=""__first_token=""__sum_log_prob=""__num_samples=""__seed=""__fewshot=""__do_eval="" directory.
- If results have already been extracted with the same arguments (i.e., the directory with the same name exists under ./extraction/results), the existing results will be provided, and additional extraction will not be performed.
- If a directory with the same name exists but the names of the selected prompt templates have changed in the settings, additional extraction will only be carried out for prompts with names that were not previously extracted.
- $p(y|x,t)$ extraction results from each prompt will be saved as JSON files under the directory. The output file has the following format:
Extraction output format
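The caching behavior in the notes above depends on the directory name built from the extraction arguments. The sketch below rebuilds that name from the pattern given in the first note and checks whether the directory already exists. The helper names are hypothetical, and the repository's own logic may differ, for example in how it handles newly added prompt templates.

```python
from pathlib import Path

# Arguments that appear in the extraction directory name, in order.
EXTRACTION_KEYS = [
    "dataset", "decoder", "prompt", "first_token", "sum_log_prob",
    "num_samples", "seed", "fewshot", "do_eval",
]


def extraction_dir(args, root="./extraction/results"):
    """Build the directory path for a given set of extraction arguments."""
    name = "__".join(f"{key}={args[key]}" for key in EXTRACTION_KEYS)
    return Path(root) / name


def needs_extraction(args, root="./extraction/results"):
    """True when no directory with the same name exists yet."""
    return not extraction_dir(args, root).is_dir()


args = {
    "dataset": "glue_sst2_validation", "decoder": "facebook--opt-2.7b",
    "prompt": "base_prompts", "first_token": False, "sum_log_prob": False,
    "num_samples": 1000, "seed": 42, "fewshot": None, "do_eval": True,
}
print(extraction_dir(args).name)
```

The example values are taken from the result path printed in the prompt selection log earlier in this README.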
Decoder
If you want to specify a particular model, you can pass the desired model name to the decoder. The models used in the experiment can be found in ./conf/decoder.
You can also add new models by adding .yaml files under ./conf/decoder. You can refer to existing files to configure the arguments.
Pass the desired model name to model_name_or_path. Adding models is limited to the decoder models supported by the Hugging Face transformers library.
❗ Verify the pad_token_id of the tokenizer to be used and pass the verified pad_token_id to ignore_index.
To apply model parallelism, use parallelize.
You can adjust the batch size with per_device_eval_batch_size.
Dataset
If you want to specify a particular dataset, you can pass the desired dataset name to dataset. The datasets used in the experiment can be found in ./conf/dataset.
You can also add new datasets by adding .yaml files under ./conf/dataset. You can refer to existing files to configure the arguments. To manage and utilize prompts, we rely on promptsource. Therefore, adding new datasets is limited to classification datasets existing in promptsource.
dataset_name corresponds to the path argument of the load_dataset method in the Hugging Face datasets library.
dataset_config_name corresponds to the name argument.
split corresponds to the split argument.
DATASET_KWARGS are automatically determined based on the configured dataset_name, dataset_config_name, and split.
DATASET_INFO is required for OTR (=One Token Response, first_token=True).
- num_classes should be set to the number of label categories.
- label should be set to the column name containing ground truth labels for the dataset. If it doesn't exist, you can set it to null. In such cases, don't forget to set the do_eval option to 'False' as well.
- is_dynamic should be set to True if the dataset you want to add is a dynamic task; otherwise, set it to False. For an explanation of dynamic tasks, refer to Dynamic Dataset.
- choices should be set to values corresponding to the answer_choices associated with the dataset's prompt templates.
TEMPLATE_INFO is required for Custom Prompt Addition. Set text_format to the column name containing text in the appropriate format and jinja_suffix to the column name containing ground truth labels in the format specified by jinja.
Note:
- If you want to apply a dataset that is not present in promptsource, you will need a templates.yaml file that adheres to the format used by promptsource. We recommend creating this file by referring to examples in the promptsource documentation and templates.yaml files for other datasets. Also, please note that if there are no ground truth labels, you can create the jinja section without adding jinja_suffix.
Prompt
You can find the list of prompts used in the experiment in the .yaml files under conf/prompt.
The prompts you want to use can be configured through prompt. Create a .yaml file under conf/prompt and pass the name of the created file to prompt.
prompt_config_name affects the directory name where the results of $p(y|x,t)$ extraction are stored ('prompt=&{prompt_config_name}').
By entering the names of prompt templates as a list in template_names, you can select prompts from those templates.
If you want to add new prompts, refer to Custom Prompt Addition.
Adding Custom Prompt
To manage and utilize prompts, we rely on promptsource. For more information on adding prompts, refer to promptsource.
You can add a prompt by running add_prompt.py.
Here is an example of adding a prompt for the 'ag_news' dataset:
```bash
python add_prompt.py dataset=ag_news
```
Instruction
When running python add_prompt.py dataset=ag_news, follow the instructions below:
```
Instruction example:
"{{ text }}" is about

Enter the instruction for the new prompt, using the example format provided above:
```
The placeholder '{{ ... }}' represents the location where the instance of the dataset will be inserted. The ... refers to the column name that contains the instances for each dataset.
```
Instruction example:
"{{ text }}" is about

Enter the instruction for the new prompt, using the example format provided above: {{ text }} this is testaddprompt.

Please verify that the entered instruction is correct: {{ text }} this is testaddprompt.

If the instruction is correct, press 'y'; otherwise, enter another key. If you choose another key, you will be able to edit the instruction: y
```
Follow the provided example and input the desired instruction accordingly.
Notes:
- The instruction must include '{{ ... }}'.
```
Instruction example:
"{{ text }}" is about

Enter the instruction for the new prompt, using the example format provided above: this is testaddprompt.

The instruction is missing '{{ text }}'. Please enter the instruction again.
```
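The placeholder requirement can be checked with a short regular expression. This is a hypothetical sketch of such a check, not the code used by add_prompt.py. The column name defaults to `text`, as in the ag_news example.

```python
import re

# Matches jinja-style placeholders such as "{{ text }}" and captures the column name.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def is_missing_placeholder(instruction, column="text"):
    """True when the instruction does not reference the given dataset column."""
    return column not in PLACEHOLDER.findall(instruction)


print(is_missing_placeholder("{{ text }} this is testaddprompt."))  # False
print(is_missing_placeholder("this is testaddprompt."))             # True
```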
Answer Choices
```
Answer_choices example:
politics ||| sports ||| business ||| science

Enter the answer_choices for the new prompt, using the example format provided above:
```
After entering the instruction, input the answer_choices.
```
Answer_choices example:
politics ||| sports ||| business ||| science

Enter the answer_choices for the new prompt, using the example format provided above: politics ||| sports ||| business ||| science

Please verify that the entered answer_choices are correct: politics ||| sports ||| business ||| science

If the answer_choices are correct, press 'y'; otherwise, enter another key. If you choose another key, you will be able to edit the answer_choices: y
```
Enter the desired answer_choices according to the provided example.
Notes:
- If you omit " ||| ", you will need to enter the answer_choices again.
```
Answer_choices example:
politics ||| sports ||| business ||| science

Enter the answer_choices for the new prompt, using the example format provided above: politics sports business science

Each answer_choice must be entered separated by " ||| ". Please enter the answer_choices again.
```
- If the number of answer_choices is different from the number of labels, you will need to enter the answer_choices again.
```
Answer_choices example:
politics ||| sports ||| business ||| science

Enter the answer_choices for the new prompt, using the example format provided above: politics ||| sports

The number of answer_choices must be 4. Please enter the answer_choices again.
```
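The two answer_choices checks described above (the " ||| " separator and the number of choices) can be sketched as follows. `parse_answer_choices` is a hypothetical helper written for illustration and is not taken from add_prompt.py.

```python
SEPARATOR = " ||| "


def parse_answer_choices(raw, num_labels):
    """Split the raw input into choices and validate it against the label count."""
    if num_labels > 1 and SEPARATOR not in raw:
        raise ValueError('Each answer_choice must be separated by " ||| ".')
    choices = [choice.strip() for choice in raw.split(SEPARATOR)]
    if len(choices) != num_labels:
        raise ValueError(f"The number of answer_choices must be {num_labels}.")
    return choices


print(parse_answer_choices("politics ||| sports ||| business ||| science", 4))
```

Inputs such as `"politics sports business science"` or `"politics ||| sports"` raise a `ValueError` for a four-label dataset, mirroring the two prompts to re-enter the answer_choices shown above.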
Prompt Name
Prompt name is used to identify and utilize the added prompt. After entering the answer_choices, input the prompt name to identify the added prompt.
```
Enter the prompt name: testaddprompt

Please verify that the entered prompt name is correct: testaddprompt

If the prompt name is correct, press 'y'; otherwise, enter another key. If you choose another key, you will be able to edit the prompt name: y
```
Notes:
- The prompt name cannot be duplicated.
```
Enter prompt name: prompt_00

A prompt with the same name already exists.
```
Checking Custom Prompt
Added prompts are saved in the following file, depending on the dataset_name and dataset_config_name:
./extraction/promptsource/templates/{dataset_name/dataset_config_name}/templates.yaml
You can check the example prompt that was added in the following file:
./extraction/promptsource/templates/ag_news/templates.yaml
Here is an example of how to use the added prompt:
```python
# Load an example from the datasets ag_news
from datasets import load_dataset
dataset = load_dataset("ag_news", split="train")
example = dataset[1]

# Load prompts for this dataset
from extraction.promptsource.templates import DatasetTemplates
ag_news_prompts = DatasetTemplates('ag_news')

# Select a prompt by its name
prompt = ag_news_prompts["testaddprompt"]

# Apply the prompt to the example
result = prompt.apply(example)
print("INPUT: ", result[0])
# INPUT: Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market. this is testaddprompt.
print("TARGET: ", result[1])
# TARGET: business
```
This example demonstrates how to load the ag_news dataset, load the prompts for the dataset, select the prompt named "testaddprompt", and apply it to an example from the dataset. The result shows the input with the applied prompt and the corresponding target label.
Reproduction
❗ To download the extracted $p(y|x,t)$ output for the experiment, approximately 50GB of free space is required. If you don't have enough space, you can directly download the Prompt Selection Score result.
If you want to reproduce the entire set of experimental results, including results for all datasets and models, please follow the instructions below.
Preparing $p(y|x,t)$
To reproduce the results, you need to extract the required $p(y|x,t)$ for prompt selection. The $p(y|x,t)$ used in the experiment can be downloaded using the following command:
```bash
python ./reproduction/download_experimental_results.py --result inference
```
Notes:
- The prompts used in the experiment can be found in the ./extraction/promptsource/templates directory. The prompts used in the experiment are specified in the templates.yaml file located in the directory of the dataset you want to reproduce.
Prompt Selection Score Result
After preparing the $p(y|x,t)$, you can use the following command to download all prompt selection results in the experiment:
```bash
python ./reproduction/download_experimental_results.py --result prompt_selection
```
All datasets, models, and prompt selection method results used in the experiment will be saved in the ./reproduction/ps_results directory.
Visualizing the Result
Once you have prepared ./reproduction/ps_results, you can refer to figures_ver2_tacl.ipynb and figures_ver1_arxiv.ipynb to recreate the figures presented in the paper.
Reproduce by Running Commands Directly
Extracting $p(y|x,t)$ for all prompts, datasets, and models used in the experiment requires a significant amount of resources and time. However, if you want to reproduce the results directly, you can use the following commands:
```bash
python run_prompt_selection.py -m \
    method=MI,MI_G,MI_L,MI_GL,GE,GE_M,LE,MDL,MDL_M,ZLP,ZPM,ZMV,PPL,PPL_L \
    calibration=cbm-softmax,cbm-mean,cc-softmax,cc-mean,pmi-softmax,pmi-mean \
    decoder=opt-1.3b,opt-2.7b,opt-6.7b,opt-30b,opt-66b,gpt-neo-1.3b,gpt-neo-2.7b,gpt-j-6b,gpt2-xl,bloom-3b \
    dataset=sst2,ag_news,cb,imdb,newspop,rte,sst5,tweet_emotion,tweet_irony \
    prompt=base_prompts \
    sum_log_prob=false \
    filter=false,true
```
```bash
python run_prompt_selection.py -m \
    method=MI,MI_G,MI_L,MI_GL,GE,GE_M,LE,MDL,MDL_M,ZLP,ZPM,ZMV,PPL,PPL_L \
    calibration=cbm-softmax,cbm-mean,cc-softmax,cc-mean,pmi-softmax,pmi-mean \
    decoder=opt-2.7b \
    dataset=sst2,ag_news,cb,imdb,newspop,rte,sst5,tweet_emotion,tweet_irony \
    prompt=v1_prompts,v12_prompts \
    sum_log_prob=false \
    filter=false,true
```
```bash
python run_prompt_selection.py -m \
    method=MI,MI_G,MI_L,MI_GL,GE,GE_M,LE,MDL,MDL_M,ZLP,ZPM,ZMV,PPL,PPL_L \
    calibration=cbm-softmax,cbm-mean,cc-softmax,cc-mean,pmi-softmax,pmi-mean \
    decoder=opt-2.7b \
    dataset=sst2,ag_news,cb,imdb,newspop,rte,sst5,tweet_emotion,tweet_irony \
    prompt=fewshot_prompt \
    fewshot='1,2,4' \
    sum_log_prob=false \
    filter=false,true
```
```bash
python run_prompt_selection.py -m \
    method=MI,MI_G,MI_L,MI_GL,GE,GE_M,LE,MDL,MDL_M,ZLP,ZPM,ZMV,PPL,PPL_L \
    calibration=cbm-softmax,cbm-mean,cc-softmax,cc-mean,pmi-softmax,pmi-mean \
    decoder=opt-1.3b,opt-2.7b,opt-6.7b,opt-30b,opt-66b,gpt-neo-1.3b,gpt-neo-2.7b,gpt-j-6b,gpt2-xl,bloom-3b \
    dataset=piqa,copa,hellaswag,story_cloze \
    prompt=base_prompts \
    sum_log_prob=true \
    filter=false,true \
    unbalance=false,true
```
```bash
python run_prompt_selection.py -m \
    method=MI,MI_G,MI_L,MI_GL,GE,GE_M,LE,MDL,MDL_M,ZLP,ZPM,ZMV,PPL,PPL_L \
    calibration=cbm-softmax,cbm-mean,cc-softmax,cc-mean,pmi-softmax,pmi-mean \
    decoder=opt-2.7b \
    dataset=piqa,copa,hellaswag,story_cloze \
    prompt=v1_prompts,v2_prompts \
    sum_log_prob=true \
    filter=false,true \
    unbalance=false,true
```
```bash
python run_prompt_selection.py -m \
    method=MI,MI_G,MI_L,MI_GL,GE,GE_M,LE,MDL,MDL_M,ZLP,ZPM,ZMV,PPL,PPL_L \
    calibration=cbm-softmax,cbm-mean,cc-softmax,cc-mean,pmi-softmax,pmi-mean \
    decoder=opt-2.7b \
    dataset=piqa,copa,hellaswag,story_cloze \
    prompt=fewshot_prompt \
    fewshot='1,2,4' \
    sum_log_prob=true \
    filter=false,true \
    unbalance=false,true
```
Dependencies
- accelerate *
- datasets *
- huggingface_hub *
- hydra-core ==1.3.2
- isort ==5.8.0
- jinja2 *
- pandas ==1.5.3
- plotly *
- protobuf *
- py7zr *
- pytimedinput ==2.0.1
- pyyaml >=5
- requests *
- scikit-learn *
- sentencepiece *
- torch *
- transformers ==4.27.1