ptec

Code repository corresponding to the paper "Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation" (NAACL 2024).

https://github.com/eqtpartners/ptec


Keywords

llm nlp



README.md

Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation


⚠️ This repository has migrated ⚠️

For an up-to-date codebase, issues, and pull requests, please continue to the new repository. This repository will no longer be maintained, and issues and pull requests may be ignored.

This repository contains the code accompanying the paper "Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation". We also recommend reading our blog post "How EQT Motherbrain uses LLMs to map companies to industry sectors".

Installation

After cloning this repository, the necessary packages can be installed with:

```bash
pip install -r requirements.txt
pip install -e .

# if using a vertex ai notebook with CUDA
pip3 install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117 --no-cache-dir
```

Reproducibility

All experiments, including hyperparameter search, can be reproduced by running the following batch files:

```bash
bash preprocessing/preprocessing.sh
bash sectors/experiments/run_experiments_gpu.sh
bash sectors/experiments/run_experiments_cpu.sh
```

Usage

The scripts can also be run individually:

Preprocessing

The preprocessed data for the hatespeech dataset is already contained in this repository. However, preprocessing can be rerun with:

```bash
python preprocessing/get_dataset.py
python preprocessing/preprocess_data.py  # this line will take ~10 min as it summarizes long descriptions and keyword lists
```

The preprocessed dataset can be augmented by applying paraphrasing with Vicuna:

```bash
python preprocessing/paraphrase_augmentation.py
```

This will create a new dataset data/[DATASET]/train_augmented.json.

Running The Experiments

For test runs, all of the following commands include the --model_name=bigscience/bloom-560m flag, since this model is small enough to run on a CPU. It can be replaced with other LLaMA or BLOOM models hosted on Hugging Face; if the flag is omitted, huggyllama/llama-7b is used by default. All experimental results are saved as JSON files in the results/[DATASET]/ directory.
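To inspect the saved results after a run, a small helper along these lines may be useful. It is only a sketch: it assumes nothing about the JSON schema the experiment scripts write, and the function name load_results is hypothetical rather than part of this repository.

```python
import json
from pathlib import Path


def load_results(results_dir):
    """Load every *.json file below results_dir into a dict keyed by relative path."""
    root = Path(results_dir)
    results = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            # Keys use forward slashes on every platform, e.g. "hatespeech/run.json".
            results[path.relative_to(root).as_posix()] = json.load(handle)
    return results
```

Calling load_results("results") from the repository root would then return one entry per result file, whatever structure each file happens to contain.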

N-shot experiments

```bash
python sectors/experiments/nshot/nshot.py --model_name bigscience/bloom-560m
```
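For readers unfamiliar with n-shot prompting, the idea is to place n labeled examples in front of the text to classify and let the model complete the labels. The sketch below shows one way such a prompt could be assembled. The template, the default instruction, and the function name build_nshot_prompt are assumptions made for illustration and are not taken from nshot.py.

```python
def build_nshot_prompt(examples, query, instruction="Assign all fitting labels to the text."):
    """Assemble an n-shot prompt from (text, labels) pairs followed by the open query."""
    parts = [instruction]
    for text, labels in examples:
        parts.append(f"Text: {text}\nLabels: {', '.join(labels)}")
    # The query repeats the example layout but leaves the answer for the model to fill in.
    parts.append(f"Text: {query}\nLabels:")
    return "\n\n".join(parts)
```

With zero examples this degenerates to a zero-shot prompt, which makes it easy to compare different values of n with the same template.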

In order to use gpt-3.5-turbo as a model for n-shot prompting, a .env file with the OpenAI API credentials needs to be added to the root directory of this repository:

```bash
OPENAI_SECRET_KEY = "secret key"
OPENAI_ORGANIZATION_ID = "org id"
```
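To check that a .env file in this format is readable before starting an experiment, a standard-library sketch such as the following can be used. How the repository itself loads the file is not stated here, so parse_env and load_env are hypothetical stand-ins rather than the project's own loader.

```python
import os


def parse_env(text):
    """Parse KEY = "value" lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def load_env(path=".env"):
    """Read the file at path and copy its entries into os.environ."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env(handle.read())
    os.environ.update(values)
    return values
```

After load_env(), the two variable names from the example above should be present in os.environ.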

Embedding Proximity

For these experiments, the embeddings first have to be generated by running the following code:

```bash
python embedding_proximity/generate_embeddings.py --model_name bigscience/bloom-560m

# for augmented data
python embedding_proximity/generate_embeddings.py --model_name bigscience/bloom-560m --augmented augmented
```

Then, the following code runs all embedding proximity experiments:

```bash
python embedding_proximity/vector_similarity.py --model_name bigscience/bloom-560m
python embedding_proximity/vector_similarity.py --model_name bigscience/bloom-560m --augmented augmented

python embedding_proximity/vector_similarity.py --type RadiusNN --model_name bigscience/bloom-560m
python embedding_proximity/vector_similarity.py --type RadiusNN --model_name bigscience/bloom-560m --augmented augmented

python embedding_proximity/classification_head/classification_head.py --model_name bigscience/bloom-560m
python embedding_proximity/classification_head/classification_head.py --model_name bigscience/bloom-560m --augmented augmented
```
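As a rough illustration of what embedding proximity classification means for a multi-label task, the sketch below predicts labels from the nearest training embeddings, once with a fixed number of neighbours and once with a similarity radius (compare the --type RadiusNN option above). It is a pure-Python toy under stated assumptions: cosine similarity as the proximity measure, a simple vote threshold, and hypothetical function names. The repository's scripts may differ in all of these.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_labels(query, train_vectors, train_labels, k=3, min_votes=2):
    """Return labels carried by at least min_votes of the k most similar training vectors."""
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: cosine_similarity(query, train_vectors[i]),
        reverse=True,
    )
    votes = {}
    for i in ranked[:k]:
        for label in train_labels[i]:
            votes[label] = votes.get(label, 0) + 1
    return sorted(label for label, count in votes.items() if count >= min_votes)


def radius_labels(query, train_vectors, train_labels, min_similarity=0.9):
    """Return the union of labels of all training vectors inside the similarity radius."""
    found = set()
    for vector, labels in zip(train_vectors, train_labels):
        if cosine_similarity(query, vector) >= min_similarity:
            found.update(labels)
    return sorted(found)
```

The k-neighbour variant always consults exactly k examples, while the radius variant may return no labels at all when nothing is close enough, which is one reason to evaluate both.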

Prompt Tuning

```bash
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --interrupt_threshold 0.01
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --interrupt_threshold 0.01 --augmented augmented
```

PTEC

```bash
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --head ch --scheduler exponential --interrupt_threshold 0.01
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --head ch --scheduler exponential --interrupt_threshold 0.01 --augmented augmented
```
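The --head ch option suggests that PTEC places a classification head on top of the prompt-tuned model's embeddings. As a hedged illustration of what such a head typically computes in a multi-label setting, the sketch below applies one linear layer followed by a per-label sigmoid and a decision threshold. The weights, biases, threshold, and function names are hypothetical, and the actual head in prompt_tune.py may be built differently.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def classification_head(embedding, weights, biases, threshold=0.5):
    """Apply one linear layer plus a per-label sigmoid and return the predicted label indices."""
    predicted = []
    for label_index, (row, bias) in enumerate(zip(weights, biases)):
        logit = sum(w * x for w, x in zip(row, embedding)) + bias
        # Each label is decided independently, so any number of labels can be returned.
        if sigmoid(logit) >= threshold:
            predicted.append(label_index)
    return predicted
```

Because every label gets its own probability, the output is always a subset of the known labels, in contrast to free-text generation, which can produce strings that match no label.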

Other Resources

For an example of applying Trie Search, see notebooks/constrained_beam_search.ipynb
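For orientation, trie search in constrained decoding usually means storing every valid label as a token sequence in a prefix tree and, at each generation step, allowing only tokens that continue some valid label. The sketch below shows that data structure in isolation. The class name LabelTrie and the integer token ids are hypothetical, and the notebook may implement the idea differently.

```python
class LabelTrie:
    """Prefix tree over tokenized label names."""

    def __init__(self, sequences=()):
        self.root = {}
        for sequence in sequences:
            self.add(sequence)

    def add(self, tokens):
        """Insert one label, given as a sequence of token ids."""
        node = self.root
        for token in tokens:
            node = node.setdefault(token, {})
        node[None] = {}  # None marks the end of a complete label

    def allowed_next(self, prefix):
        """Return tokens that may follow prefix; None in the result means a label may end here."""
        node = self.root
        for token in prefix:
            if token not in node:
                return []
            node = node[token]
        return list(node.keys())
```

During beam search, each beam's candidate tokens would be filtered through allowed_next for the tokens generated so far, so a beam can only ever spell out a label that exists in the trie.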

Citation

If you use or refer to this repository in your research, please cite our paper:

BibTeX

```bibtex
@inproceedings{buchner2023prompt,
  title={Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation},
  author={Buchner, V. L. and Cao, L. and Kalo, J.-C. and von Ehrenheim, V.},
  booktitle={to appear In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  year={2024}
}
```

APA

Buchner, V. L., Cao, L., Kalo, J.-C., & von Ehrenheim, V. (2024). Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation. to appear In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL)

MLA

Buchner, Valentin Leonhard, et al. "Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation." to appear In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024.

Owner

  • Name: EQT
  • Login: EQTPartners
  • Kind: organization

