https://github.com/amazon-science/context-aware-llm-clustering

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: acm.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.4%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: amazon-science
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 30.3 KB
Statistics
  • Stars: 9
  • Watchers: 5
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme Contributing License Code of conduct

README.md

Steps to reproduce results:

  1. Download the Amazon reviews datasets from https://jmcauley.ucsd.edu/data/amazon/links.html and preprocess them as described in this paper: https://dl.acm.org/doi/10.1145/3580305.3599519. Place the preprocessed datasets under data/.
  2. Set up the Python environment using the commands given below.
  3. Run promptllm.py to gather LLMc's clusterings for the Amazon reviews datasets. Then run parse_clusters.py to parse the outputs.
  4. Run preprocess.py to create two files for each dataset: idx2text.json and clusterings.json.
  5. Run main.py to perform clustering, using the commands given below.
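The order of steps 3 to 5 can be sketched as a small dry-run driver. This is a minimal sketch, not part of the repository: it assumes the scripts are run from the directory that contains them, and it passes no arguments to promptllm.py, parse_clusters.py, or preprocess.py because this README does not list any. The helper names `pipeline_commands` and `run_pipeline` are hypothetical; the main.py flags are copied from the Arts commands in this README.

```python
import shlex
import subprocess


def pipeline_commands(dataset="Arts"):
    """Return steps 3-5 as argument lists, in the order the README gives them."""
    return [
        # Step 3: collect the LLM clusterings (arguments unknown, none passed).
        ["python", "promptllm.py"],
        # Step 3: parse the LLM outputs (arguments unknown, none passed).
        ["python", "parse_clusters.py"],
        # Step 4: write idx2text.json and clusterings.json per dataset.
        ["python", "preprocess.py"],
        # Step 5: clustering run, flags copied from the Arts examples.
        ["python", "main.py", "--dataset", dataset,
         "--set_enc_type", "sia_hid_mean",
         "--train_size", "3000", "--val_size", "1000", "--test_size", "3000",
         "--loss_type", "triplet_neutral"],
    ]


def run_pipeline(dataset="Arts", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in pipeline_commands(dataset):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline()
```

Running with the default `dry_run=True` only prints the commands, which is a cheap way to confirm the order before starting the LLM prompting step.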

Setup conda environment

conda create -n llm_cluster python=3.10.9
source activate llm_cluster
pip install pytz pandas tqdm matplotlib pyarrow pydot
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
pip install transformers==4.35.2
pip install accelerate==0.25.0
pip install awscli boto3 botocore==1.31.63 --upgrade
pip install peft==0.4.0
pip install bitsandbytes==0.40.0
pip install scikit-learn==1.2.2
pip install sentencepiece==0.1.99
pip install protobuf
pip install evaluate
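Because most of the packages above are pinned to exact versions, it may help to verify the environment before a long run. The following is a minimal sketch, not part of the repository: `PINNED` and `find_mismatches` are hypothetical names, and the pins are copied from the pip commands above. A pin without a local label (such as `2.0.1`) is treated as matching an installed build that carries one (such as `2.0.1+cu117`), which mirrors how pip interprets `==`.

```python
from importlib import metadata

# Pins copied from the pip commands in this README.
PINNED = {
    "torch": "2.0.0+cu117",
    "torchvision": "0.15.1+cu117",
    "torchaudio": "2.0.1",
    "transformers": "4.35.2",
    "accelerate": "0.25.0",
    "botocore": "1.31.63",
    "peft": "0.4.0",
    "bitsandbytes": "0.40.0",
    "scikit-learn": "1.2.2",
    "sentencepiece": "0.1.99",
}


def version_matches(expected, installed):
    """Exact match if the pin has a local label, else ignore the installed one."""
    if "+" in expected:
        return installed == expected
    return installed.split("+")[0] == expected


def find_mismatches(pinned=PINNED, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every unmet pin."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed is None or not version_matches(expected, installed):
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches().items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package is installed at the expected version; the unpinned packages (pytz, pandas, and so on) are not checked.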

Table 2 (showing commands for Arts dataset)

python unsupervised.py --dataset Arts --val_size 1000 --test_size 3000
python main.py --dataset Arts --set_enc_type nia --train_size 1 --val_size 1000 --test_size 3000 --loss_type scl
python main.py --pretrain 1 --dataset Arts --set_enc_type sia_hid_mean --val_size 1000 --test_size 3000 --loss_type triplet_neutral
python main.py --dataset Arts --set_enc_type sia_hid_mean --train_size 1 --val_size 1000 --test_size 3000 --loss_type triplet_neutral --load_ckpt_path '../outputs/Arts/pretrain_sia_hid_mean|triplet_neutral|margin:0.3|cutoff:0.0|C:0.15|r:0.5|tau:0.5|max_items:None|max_clusters:None|train_size:0.8|model_name:google-flan-t5-base/clus_checkpoint_best.bin' --lr 5e-5

Vary loss functions

python main.py --dataset Arts --set_enc_type sia_first --train_size 3000 --val_size 1000 --test_size 3000 --loss_type triplet_neutral
python main.py --dataset Arts --set_enc_type sia_first --train_size 3000 --val_size 1000 --test_size 3000 --loss_type scl
python main.py --dataset Arts --set_enc_type sia_first --train_size 3000 --val_size 1000 --test_size 3000 --loss_type cross_entropy
python main.py --dataset Arts --set_enc_type sia_first --train_size 3000 --val_size 1000 --test_size 3000 --loss_type triplet
python main.py --dataset Arts --set_enc_type sia_first --train_size 3000 --val_size 1000 --test_size 3000 --loss_type basic

Vary set encoders

python main.py --dataset Arts --set_enc_type nia --train_size 3000 --val_size 1000 --test_size 3000 --loss_type triplet_neutral
python main.py --dataset Arts --set_enc_type fia --train_size 3000 --val_size 1000 --test_size 3000 --loss_type triplet_neutral
python main.py --dataset Arts --set_enc_type sia_hid_mean --train_size 3000 --val_size 1000 --test_size 3000 --loss_type triplet_neutral
python main.py --dataset Arts --set_enc_type sia_mean --train_size 3000 --val_size 1000 --test_size 3000 --loss_type triplet_neutral
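The loss-function and set-encoder sweeps each vary a single flag while holding the others fixed, so the command lines can be generated rather than typed. The following is a minimal sketch, not part of the repository: `main_command`, `loss_sweep`, and `encoder_sweep` are hypothetical helpers, and every flag value is copied from the commands in this README.

```python
import shlex

# Values copied from the sweep commands in this README.
LOSS_TYPES = ["triplet_neutral", "scl", "cross_entropy", "triplet", "basic"]
SET_ENCODERS = ["nia", "fia", "sia_hid_mean", "sia_mean"]
SIZES = ["--train_size", "3000", "--val_size", "1000", "--test_size", "3000"]


def main_command(set_enc_type, loss_type, dataset="Arts"):
    """Build one main.py command line with the flag order used in the README."""
    args = ["python", "main.py", "--dataset", dataset,
            "--set_enc_type", set_enc_type, *SIZES,
            "--loss_type", loss_type]
    return shlex.join(args)


def loss_sweep():
    """Vary the loss function with the sia_first set encoder held fixed."""
    return [main_command("sia_first", loss) for loss in LOSS_TYPES]


def encoder_sweep():
    """Vary the set encoder with the triplet_neutral loss held fixed."""
    return [main_command(enc, "triplet_neutral") for enc in SET_ENCODERS]


if __name__ == "__main__":
    for line in loss_sweep() + encoder_sweep():
        print(line)
```

Passing a different `dataset` value to `main_command` would produce the same sweeps for another dataset, assuming main.py accepts that dataset name.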

Finetuning ablation study

python main.py --dataset Arts --set_enc_type sia_hid_mean --train_size 3000 --val_size 1000 --test_size 3000 --loss_type triplet_neutral --load_ckpt_path '../outputs/Arts/pretrain_sia_hid_mean|triplet_neutral|margin:0.3|cutoff:0.0|C:0.15|r:0.5|tau:0.5|max_items:None|max_clusters:None|train_size:0.8|model_name:google-flan-t5-base/clus_checkpoint_best.bin' --lr 5e-5
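The checkpoint directory passed to `--load_ckpt_path` packs the run settings into its name as `|`-separated fields, which is also why the path is single-quoted on the command line (an unquoted `|` would be read by the shell as a pipe). The following is a minimal sketch of reading those fields back, not part of the repository: `parse_run_name` is a hypothetical helper, and it assumes only the layout visible in the path above, where fields containing `:` are key/value pairs and the rest are plain tags.

```python
from pathlib import PurePosixPath


def parse_run_name(ckpt_path):
    """Split the checkpoint's parent directory name into tags and key/value settings."""
    run_name = PurePosixPath(ckpt_path).parent.name
    tags, params = [], {}
    for field in run_name.split("|"):
        key, sep, value = field.partition(":")
        if sep:
            params[key] = value  # values are kept as strings, e.g. "0.3" or "None"
        else:
            tags.append(field)   # fields without ":" such as "triplet_neutral"
    return tags, params


if __name__ == "__main__":
    path = ("../outputs/Arts/pretrain_sia_hid_mean|triplet_neutral|margin:0.3|"
            "cutoff:0.0|C:0.15|r:0.5|tau:0.5|max_items:None|max_clusters:None|"
            "train_size:0.8|model_name:google-flan-t5-base/clus_checkpoint_best.bin")
    tags, params = parse_run_name(path)
    print(tags)
    print(params)
```

Values are left as strings on purpose, since the README does not say how the training script types them.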

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0