ptec
Code repository corresponding to the paper "Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation" (NAACL 2024).
Science Score: 41.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.7%) to scientific vocabulary
Repository
Code repository corresponding to the paper "Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation" (NAACL 2024).
Basic Info
- Host: GitHub
- Owner: EQTPartners
- License: mit
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://arxiv.org/abs/2309.12075
- Size: 10.3 MB
Statistics
- Stars: 7
- Watchers: 8
- Forks: 2
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation
⚠️ This repository has migrated ⚠️
For an up-to-date codebase, issues, and pull requests, please continue to the new repository. This repository will no longer be maintained, and issues and pull requests may be ignored.
This repository contains the code accompanying the paper "Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation". We also recommend reading our blog post "How EQT Motherbrain uses LLMs to map companies to industry sectors".
Installation
After cloning this repository, the necessary packages can be installed with:
```bash
pip install -r requirements.txt
pip install -e .

# if using a Vertex AI notebook with CUDA
pip3 install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117 --no-cache-dir
```
Reproducibility
All experiments, including hyperparameter search, can be reproduced by running the following batch files:
bash
bash preprocessing/preprocessing.sh
bash sectors/experiments/run_experiments_gpu.sh
bash sectors/experiments/run_experiments_cpu.sh
Usage
The scripts can also be run individually:
Preprocessing
The preprocessed data for the hatespeech dataset is already included in this repository. However, the preprocessing can be rerun with:
bash
python preprocessing/get_dataset.py
python preprocessing/preprocess_data.py # this line will take ~10 min as it summarizes long descriptions and keyword lists
The preprocessed dataset can be augmented by paraphrasing with Vicuna:
bash
python preprocessing/paraphrase_augmentation.py
This will create a new dataset data/[DATASET]/train_augmented.json.
Running The Experiments
For test runs, all of the following commands include the --model_name=bigscience/bloom-560m flag, because this model is small enough to run on a CPU. It can be replaced with other LLaMA or BLOOM models hosted on Hugging Face. If the flag is omitted, huggyllama/llama-7b is used by default. All experimental results are saved as JSON files in the results/[DATASET]/ directory.
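To inspect the saved results afterwards, the JSON files can be collected with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `load_results` is hypothetical, and no assumption is made about the keys stored inside each file.

```python
import json
from pathlib import Path


def load_results(results_dir):
    """Return {file stem: parsed JSON} for every *.json file in results_dir."""
    directory = Path(results_dir)
    if not directory.is_dir():
        # Nothing has been written yet (or the path is wrong): return no results.
        return {}
    results = {}
    for path in sorted(directory.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            results[path.stem] = json.load(handle)
    return results


if __name__ == "__main__":
    # "hatespeech" is only an example value for [DATASET].
    for name, content in load_results("results/hatespeech").items():
        print(name, type(content).__name__)
```

The function only reads files, so it is safe to run while experiments are still writing to the directory, although a file that is half-written at that moment may fail to parse.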
N-shot experiments
bash
python sectors/experiments/nshot/nshot.py --model_name bigscience/bloom-560m
In order to use gpt-3.5-turbo as a model for n-shot prompting, a .env file with the OpenAI API credentials needs to be added to the root directory of this repository:
bash
OPENAI_SECRET_KEY = "secret key"
OPENAI_ORGANIZATION_ID = "org id"
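For checking that the .env file is readable before starting a paid API run, the following sketch parses it with the standard library only. It is an illustration under assumptions: the helper names `read_env_file` and `export_openai_credentials` are hypothetical, and the repository itself may load the file differently (for example with a dotenv package).

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse KEY = "value" lines into a dict; blank lines and # comments are skipped."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove surrounding whitespace and quote characters from the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def export_openai_credentials(path=".env"):
    """Copy the two variables named in this README into os.environ."""
    values = read_env_file(path)
    for key in ("OPENAI_SECRET_KEY", "OPENAI_ORGANIZATION_ID"):
        if key not in values:
            raise KeyError(f"{key} is missing from {path}")
        os.environ[key] = values[key]
    return values
```

Calling `export_openai_credentials()` from the repository root raises a `KeyError` that names the missing variable, which is easier to diagnose than an authentication failure in the middle of an experiment.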
Embedding Proximity
For these experiments, the embeddings still have to be generated by running the following code:
```bash
python embedding_proximity/generate_embeddings.py --model_name bigscience/bloom-560m

# for augmented data
python embedding_proximity/generate_embeddings.py --model_name bigscience/bloom-560m --augmented augmented
```
Then, the following code runs all embedding proximity experiments:
```bash
python embedding_proximity/vector_similarity.py --model_name bigscience/bloom-560m
python embedding_proximity/vector_similarity.py --model_name bigscience/bloom-560m --augmented augmented

python embedding_proximity/vector_similarity.py --type RadiusNN --model_name bigscience/bloom-560m
python embedding_proximity/vector_similarity.py --type RadiusNN --model_name bigscience/bloom-560m --augmented augmented

python embedding_proximity/classification_head/classification_head.py --model_name bigscience/bloom-560m
python embedding_proximity/classification_head/classification_head.py --model_name bigscience/bloom-560m --augmented augmented
```
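As an illustration of what a nearest-neighbour and a radius-based multi-label classifier over embeddings can look like, here is a small pure-Python sketch. It is not the repository's implementation: the function names, the cosine similarity measure, and the vote-share rule are all assumptions chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def _vote(neighbours, train_labels, vote_share):
    """Keep every label carried by at least vote_share of the neighbours."""
    if not neighbours:
        return set()
    counts = {}
    for i in neighbours:
        for label in train_labels[i]:
            counts[label] = counts.get(label, 0) + 1
    return {label for label, n in counts.items() if n / len(neighbours) >= vote_share}


def knn_multilabel(query, train_vectors, train_labels, k=3, vote_share=0.5):
    """Vote among the k training points most similar to the query."""
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: cosine_similarity(query, train_vectors[i]),
        reverse=True,
    )
    return _vote(ranked[:k], train_labels, vote_share)


def radius_multilabel(query, train_vectors, train_labels, min_similarity=0.9, vote_share=0.5):
    """Vote among all training points whose similarity reaches min_similarity."""
    neighbours = [
        i for i, vector in enumerate(train_vectors)
        if cosine_similarity(query, vector) >= min_similarity
    ]
    return _vote(neighbours, train_labels, vote_share)
```

The radius variant can return an empty label set when no training point is close enough, whereas the k-nearest variant always votes over exactly k neighbours (or fewer, if the training set is smaller).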
Prompt Tuning
bash
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --interrupt_threshold 0.01
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --interrupt_threshold 0.01 --augmented augmented
PTEC
bash
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --head ch --scheduler exponential --interrupt_threshold 0.01
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --head ch --scheduler exponential --interrupt_threshold 0.01 --augmented augmented
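The sketch below illustrates the general idea of a multi-label classification head applied to a text embedding, assuming that `--head ch` selects such a head. It is a pure-Python illustration, not the repository's code: the function names, the weights, and the 0.5 threshold are hypothetical, and the prompt tuning of the language model itself is not shown.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def classification_head(embedding, weights, biases, labels, threshold=0.5):
    """Score each label independently and keep those whose probability reaches the threshold.

    weights holds one row per label; each row has the same length as the embedding.
    """
    predicted = []
    for label, row, bias in zip(labels, weights, biases):
        logit = sum(w * x for w, x in zip(row, embedding)) + bias
        if sigmoid(logit) >= threshold:
            predicted.append(label)
    return predicted
```

Because every label gets its own sigmoid rather than sharing a softmax, any number of labels (including none) can be predicted for a single input, which is what a multi-label task requires.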
Other Resources
For an example of applying Trie Search, see notebooks/constrained_beam_search.ipynb
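For readers who want the idea before opening the notebook, the sketch below shows a trie over token-id sequences that restricts decoding to valid labels. It is an illustration under assumptions, not the notebook's code: the class name `TokenTrie` and the example token ids are hypothetical.

```python
class TokenTrie:
    """Prefix tree over the token-id sequences of the valid labels."""

    def __init__(self, sequences):
        self.root = {}
        for sequence in sequences:
            node = self.root
            for token in sequence:
                node = node.setdefault(token, {})
            node[None] = {}  # None marks the end of a complete label

    def allowed_next(self, prefix):
        """Return the token ids that may follow prefix; None means a label may end here."""
        node = self.root
        for token in prefix:
            if token not in node:
                return []  # the prefix does not match any label
            node = node[token]
        return list(node)


# Hypothetical token ids for three labels.
trie = TokenTrie([[5, 7], [5, 8, 2], [9]])
print(trie.allowed_next([5]))
```

During generation, a lookup such as `allowed_next` can be used to mask every token that would leave the set of valid labels. In Hugging Face Transformers, one place to plug in such a lookup is the `prefix_allowed_tokens_fn` argument of `generate`; whether the notebook uses that hook is not stated here.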
Citation
If you use or refer to this repository in your research, please cite our paper:
BibTeX
bash
@inproceedings{buchner2023prompt,
title={Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation},
author={Buchner, V. L. and Cao, L. and Kalo, J.-C. and von Ehrenheim, V.},
booktitle={to appear In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL)},
year={2024}
}
APA
Buchner, V. L., Cao, L., Kalo, J.-C., & von Ehrenheim, V. (2024). Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation. to appear In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL)
MLA
Buchner, Valentin Leonhard, et al. "Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation." to appear In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024.
Owner
- Name: EQT
- Login: EQTPartners
- Kind: organization
- Website: https://eqtpartners.com
- Repositories: 8
- Profile: https://github.com/EQTPartners
GitHub Events
Total
- Issues event: 1
Last Year
- Issues event: 1