https://github.com/amazon-science/context-aware-llm-clustering

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
✓ DOI references: found 1 DOI reference(s) in README
✓ Academic publication links: links to acm.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (6.4%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: amazon-science
License: apache-2.0
Language: Python
Default Branch: main
Size: 30.3 KB

Statistics

Stars: 9
Watchers: 5
Forks: 0
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed about 2 years ago

Metadata Files

Readme, Contributing, License, Code of conduct

Steps to reproduce results:

1. Download the Amazon reviews datasets from https://jmcauley.ucsd.edu/data/amazon/links.html and preprocess them following this paper: https://dl.acm.org/doi/10.1145/3580305.3599519. Place the preprocessed datasets under data/.
2. Set up the Python environment using the commands given below.
3. Run promptllm.py to gather LLMc's clusterings for the Amazon reviews datasets. Then run parse_clusters.py to parse the outputs.
4. Run preprocess.py to create two files for each dataset: idx2text.json and clusterings.json.
5. Run main.py for clustering using the commands given below.
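Before moving on to main.py, it may help to confirm that preprocess.py produced readable JSON. The sketch below is not part of the repository: `summarize_json` is a hypothetical helper. It assumes only that idx2text.json and clusterings.json are valid JSON, and makes no assumption about their internal schema or where they are written.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Load a JSON file and return (top-level type name, number of entries).

    The entry count is None when the top-level value is not a dict or list.
    Nothing about the file's internal schema is assumed.
    """
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    size = len(data) if isinstance(data, (dict, list)) else None
    return type(data).__name__, size


if __name__ == "__main__":
    # The file locations are an assumption; adjust to wherever preprocess.py writes.
    for name in ("idx2text.json", "clusterings.json"):
        path = Path(name)
        if path.exists():
            print(name, summarize_json(path))
        else:
            print(name, "not found")
```

A file that fails to parse raises `json.JSONDecodeError`, which is usually enough to spot a truncated or partially written output.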

Setup conda environment

conda create -n llm_cluster python=3.10.9
source activate llm_cluster
pip install pytz pandas tqdm matplotlib pyarrow pydot
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
pip install transformers==4.35.2
pip install accelerate==0.25.0
pip install awscli boto3 botocore==1.31.63 --upgrade
pip install peft==0.4.0
pip install bitsandbytes==0.40.0
pip install scikit-learn==1.2.2
pip install sentencepiece==0.1.99
pip install protobuf
pip install evaluate
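Since the results depend on pinned versions, a quick check that the active environment matches the pins can save a failed run. This is a minimal sketch, not part of the repository: `find_mismatches` is a hypothetical helper, and the pins are copied from the install commands above. It uses the standard-library `importlib.metadata` to read installed versions.

```python
from importlib import metadata

# Pins copied from the install commands above.
PINNED = {
    "torch": "2.0.0+cu117",
    "torchvision": "0.15.1+cu117",
    "torchaudio": "2.0.1",
    "transformers": "4.35.2",
    "accelerate": "0.25.0",
    "botocore": "1.31.63",
    "peft": "0.4.0",
    "bitsandbytes": "0.40.0",
    "scikit-learn": "1.2.2",
    "sentencepiece": "0.1.99",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed. When a pin has no local
    label (no "+..."), a local label on the installed version is ignored, so
    "2.0.1+cu117" satisfies a pin of "2.0.1".
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        comparable = found
        if found is not None and "+" not in expected:
            comparable = found.split("+")[0]
        if comparable != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Running the script inside the activated llm_cluster environment prints one line per package whose installed version differs from the pin, and nothing when all pins are satisfied.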

Table 2 (showing commands for Arts dataset)

python unsupervised.py --dataset Arts --val_size 1000 --test_size 3000
python main.py --dataset Arts --set_enc_type nia --train_size 1 --val_size 1000 --test_size 3000 --loss_type scl
python main.py --pretrain 1 --dataset Arts --set_enc_type sia_hid_mean --val_size 1000 --test_size 3000 --loss_type triplet_neutral
python main.py --dataset Arts --set_enc_type sia_hid_mean --train_size 1 --val_size 1000 --test_size 3000 --loss_type triplet_neutral --load_ckpt_path '../outputs/Arts/pretrain_sia_hid_mean|triplet_neutral|margin:0.3|cutoff:0.0|C:0.15|r:0.5|tau:0.5|max_items:None|max_clusters:None|train_size:0.8|model_name:google-flan-t5-base/clus_checkpoint_best.bin' --lr 5e-5

Vary loss functions

python main.py --dataset Arts --set_enc_type sia_first --train_size 3000 --val_size 1000 --test_size 3000 --loss_type triplet_neutral
python main.py --dataset Arts --set_enc_type sia_first --train_size 3000 --val_size 1000 --test_size 3000 --loss_type scl
python main.py --dataset Arts --set_enc_type sia_first --train_size 3000 --val_size 1000 --test_size 3000 --loss_type cross_entropy
python main.py --dataset Arts --set_enc_type sia_first --train_size 3000 --val_size 1000 --test_size 3000 --loss_type triplet
python main.py --dataset Arts --set_enc_type sia_first --train_size 3000 --val_size 1000 --test_size 3000 --loss_type basic

Vary set encoders

python main.py --dataset Arts --set_enc_type nia --train_size 3000 --val_size 1000 --test_size 3000 --loss_type triplet_neutral
python main.py --dataset Arts --set_enc_type fia --train_size 3000 --val_size 1000 --test_size 3000 --loss_type triplet_neutral
python main.py --dataset Arts --set_enc_type sia_hid_mean --train_size 3000 --val_size 1000 --test_size 3000 --loss_type triplet_neutral
python main.py --dataset Arts --set_enc_type sia_mean --train_size 3000 --val_size 1000 --test_size 3000 --loss_type triplet_neutral
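The loss-function and set-encoder sweeps above differ only in one flag each, so they can be generated rather than typed out. The following is a rough sketch and not part of the repository: `build_command`, `loss_sweep`, and `encoder_sweep` are hypothetical helpers, and every flag and value is copied from the commands above. It only prints the command lines.

```python
import shlex

# Values taken from the "Vary loss functions" and "Vary set encoders" commands.
LOSS_TYPES = ["triplet_neutral", "scl", "cross_entropy", "triplet", "basic"]
SET_ENC_TYPES = ["nia", "fia", "sia_hid_mean", "sia_mean"]


def build_command(dataset, set_enc_type, loss_type,
                  train_size=3000, val_size=1000, test_size=3000):
    """Return the argv list for one main.py run."""
    return [
        "python", "main.py",
        "--dataset", dataset,
        "--set_enc_type", set_enc_type,
        "--train_size", str(train_size),
        "--val_size", str(val_size),
        "--test_size", str(test_size),
        "--loss_type", loss_type,
    ]


def loss_sweep(dataset="Arts"):
    """One run per loss type, with the set encoder fixed to sia_first."""
    return [build_command(dataset, "sia_first", loss) for loss in LOSS_TYPES]


def encoder_sweep(dataset="Arts"):
    """One run per set encoder, with the loss fixed to triplet_neutral."""
    return [build_command(dataset, enc, "triplet_neutral") for enc in SET_ENC_TYPES]


if __name__ == "__main__":
    for argv in loss_sweep() + encoder_sweep():
        print(shlex.join(argv))
```

To execute the runs instead of printing them, each argv list could be passed to `subprocess.run(argv, check=True)`. Whether other dataset names are accepted by `--dataset` depends on main.py, so only Arts is used as the default here.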

Finetuning ablation study

Owner

Name: Amazon Science
Login: amazon-science
Kind: organization

Website: https://amazon.science
Twitter: AmazonScience
Repositories: 80
Profile: https://github.com/amazon-science

GitHub Events

Total

Watch event: 2

Last Year

Watch event: 2

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
