https://github.com/amazon-science/anollm-large-language-models-for-tabular-anomaly-detection
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.2%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: amazon-science
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 104 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
AnoLLM: Large Language Models for Tabular Anomaly Detection (ICLR 2025)
This repository contains the implementation of the paper:
AnoLLM: Large Language Models for Tabular Anomaly Detection
International Conference on Learning Representations (ICLR 2025)
Che-Ping Tsai, Ganyu Teng, Phil Wallis, Wei Ding.
Introduction
AnoLLM is a novel framework that leverages large language models (LLMs) for unsupervised tabular anomaly detection. It handles mixed-type tabular data (e.g., continuous/numerical, discrete/categorical, and text features) by adapting a pre-trained LLM to tabular data serialized as text. During inference, AnoLLM assigns anomaly scores based on the negative log-likelihood computed by the LLM. Our empirical results indicate that AnoLLM delivers the best performance on six benchmark datasets with mixed feature types.
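The scoring idea can be illustrated with a small, self-contained sketch. It is not the repository's implementation: the "name is value" template, the averaging over random column orders (suggested only by the --n_permutations flag in the evaluation command), and the character bigram model standing in for a fine-tuned LLM are all assumptions made for illustration.

```python
import math
import random


def serialize_row(row, order):
    """Serialize one row as text, visiting columns in the given order."""
    return ", ".join(f"{name} is {row[name]}" for name in order)


def fit_bigram_model(texts):
    """Toy stand-in for an LLM: character bigram model with add-one smoothing."""
    pair_counts, context_counts, vocab = {}, {}, set()
    for text in texts:
        prev = "^"  # start-of-sequence marker
        for ch in text:
            pair_counts[(prev, ch)] = pair_counts.get((prev, ch), 0) + 1
            context_counts[prev] = context_counts.get(prev, 0) + 1
            vocab.add(ch)
            prev = ch
    vocab_size = len(vocab) + 1  # one extra slot for unseen characters

    def log_prob(text):
        total, prev = 0.0, "^"
        for ch in text:
            num = pair_counts.get((prev, ch), 0) + 1
            den = context_counts.get(prev, 0) + vocab_size
            total += math.log(num / den)
            prev = ch
        return total

    return log_prob


def anomaly_score(row, sequence_log_prob, n_permutations=4, seed=0):
    """Average negative log-likelihood of the row over random column orders."""
    rng = random.Random(seed)
    columns = list(row)
    total_nll = 0.0
    for _ in range(n_permutations):
        order = columns[:]
        rng.shuffle(order)
        total_nll -= sequence_log_prob(serialize_row(row, order))
    return total_nll / n_permutations


# Hypothetical "normal" training rows, serialized in both column orders.
normal_rows = [{"age": a, "city": "Paris"} for a in (29, 30, 31)]
train_texts = [
    serialize_row(r, order)
    for r in normal_rows
    for order in (["age", "city"], ["city", "age"])
]
log_prob = fit_bigram_model(train_texts)

score_normal = anomaly_score({"age": 30, "city": "Paris"}, log_prob)
score_anomalous = anomaly_score({"age": 30, "city": "Xqzvk"}, log_prob)
print(score_normal, score_anomalous)
```

A row that resembles the training data receives a lower score than a row containing an unfamiliar value. With a real LLM, `sequence_log_prob` would sum token log-probabilities instead of character bigram log-probabilities.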
Installing Dependencies
Python version: 3.10
Create environment
conda create -n anollm python=3.10
conda activate anollm
Install packages
pip install -r requirements.txt
Install Torch, ensuring that the version you choose is compatible with your CUDA version.
pip install torch==2.3.1
Override the pyod version to avoid bugs
pip install pyod==2.0.1
Rerun our experiments
- Download the following datasets from Kaggle and put them in data/[dataset_name]/
- Run the corresponding scripts for each experiment:
bash scripts/exp1-mixed_benchmark/run_anollm.sh
bash scripts/exp1-mixed_benchmark/run_baselines.sh
bash scripts/exp2-odds/run_anollm.sh
bash scripts/exp2-odds/run_baselines.sh
bash scripts/exp3-binning_effect/run_binning_odds.sh
bash scripts/exp4-model_size/run_anollm_1.7B_mixed.sh
bash scripts/exp4-model_size/run_anollm_1.7B_odds.sh
Using your own datasets
To use a custom dataset, create a dataframe with the following structure: {feature_name: feature_values}. Please refer to the load_dataset() function in src/data_utils.py for further guidance.
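As a rough illustration of that structure, the sketch below builds a pandas dataframe from a dictionary mapping feature names to value lists. The column names and values are hypothetical. How labels are attached and how the frame is wired into load_dataset() is not shown here, so check src/data_utils.py for the actual expectations.

```python
import pandas as pd

# Hypothetical features for illustration; replace with your own columns.
columns = {
    "age": [34, 51, 27],                       # continuous/numerical
    "payment_type": ["card", "cash", "card"],  # discrete/categorical
    "note": ["regular customer", "first visit", "refund requested"],  # free text
}

# Each key becomes a column; all value lists must have the same length.
df = pd.DataFrame(columns)
print(df.dtypes)
```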
Training Models
For AnoLLM, we use the following command:
CUDA_VISIBLE_DEVICES=$TRAIN_GPUS torchrun --nproc_per_node=$n_train_node train_anollm.py --dataset $dataset --n_splits $n_splits --split_idx 0 --binning standard --setting semi_supervised --max_steps 2000 --batch_size $batch_size --model $model
Check the argument parser in train_anollm.py for the available dataset and model options.
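The training command above uses --split_idx 0. If you want to train on every split, one option is to generate the command once per split, as in the sketch below. It uses only the flags shown in the command above. The dataset and model names are hypothetical placeholders, and the assumption that split indices run from 0 to n_splits - 1 should be verified against train_anollm.py.

```python
import shlex


def build_train_command(dataset, model, n_splits, split_idx, n_train_node, batch_size):
    """Build the torchrun argument list for one split (flags copied from the README)."""
    return [
        "torchrun", f"--nproc_per_node={n_train_node}", "train_anollm.py",
        "--dataset", dataset,
        "--n_splits", str(n_splits),
        "--split_idx", str(split_idx),
        "--binning", "standard",
        "--setting", "semi_supervised",
        "--max_steps", "2000",
        "--batch_size", str(batch_size),
        "--model", model,
    ]


n_splits = 5
commands = [
    build_train_command(
        dataset="my_dataset",  # placeholder: use a name accepted by the argument parser
        model="my_model",      # placeholder: use a model accepted by the argument parser
        n_splits=n_splits,
        split_idx=i,           # assumption: indices run from 0 to n_splits - 1
        n_train_node=2,
        batch_size=32,
    )
    for i in range(n_splits)
]

for cmd in commands:
    # Print instead of executing; pass cmd to subprocess.run(cmd, check=True) to launch.
    print(shlex.join(cmd))
```

Set CUDA_VISIBLE_DEVICES in the environment before launching, as in the original command.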
For baselines, we use the following command:
CUDA_VISIBLE_DEVICES=0 python evaluate_baselines.py --dataset $dataset --n_splits $n_splits --normalize --setting semi_supervised --split_idx $split_idx
Check the argument parser in evaluate_baselines.py for the available dataset options.
Evaluation
To evaluate AnoLLM, we use the following command:
CUDA_VISIBLE_DEVICES=$INFERENCE_GPUS torchrun --nproc_per_node=$n_test_node evaluate_anollm.py --dataset $dataset --n_splits $n_splits --split_idx 0 --setting semi_supervised --batch_size $eval_batch_size --n_permutations $n_permutations --model $model --binning standard
To collect the evaluation results, we use the following command:
python src/get_results.py --dataset $dataset --n_splits $n_splits --setting semi_supervised
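Anomaly scores are commonly summarized with a ranking metric such as AUC-ROC. The sketch below computes it in plain Python to show what such a metric measures. Whether src/get_results.py reports exactly this metric is an assumption, and the function name and the sample scores are illustrative only.

```python
def auc_roc(scores, labels):
    """Probability that a random anomaly (label 1) outscores a random
    normal sample (label 0); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomaly and one normal sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores: higher means more anomalous.
example_scores = [0.1, 0.4, 0.35, 0.8]
example_labels = [0, 0, 1, 1]
print(auc_roc(example_scores, example_labels))  # 0.75
```

The pairwise loop is quadratic in the number of samples, which is fine for a sketch; a library routine is a better fit for large test sets.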
License
This project is licensed under the Apache-2.0 License.
Acknowledgement
Baselines were adapted from https://github.com/vicliv/DTE. Part of the code was adapted from https://github.com/kathrinse/be_great. Thanks to all the authors for their great work!
Reference
@inproceedings{tsai2025anollm,
title={AnoLLM: Large Language Models for Tabular Anomaly Detection},
author={Tsai, Che-Ping and Teng, Ganyu and Wallis, Phil and Ding, Wei},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
note={Accepted, to appear},
}
Owner
- Name: Amazon Science
- Login: amazon-science
- Kind: organization
- Website: https://amazon.science
- Twitter: AmazonScience
- Repositories: 80
- Profile: https://github.com/amazon-science
GitHub Events
Total
- Issues event: 4
- Watch event: 16
- Delete event: 2
- Issue comment event: 4
- Public event: 1
- Push event: 1
- Pull request event: 3
- Fork event: 1
- Create event: 3
Last Year
- Issues event: 4
- Watch event: 16
- Delete event: 2
- Issue comment event: 4
- Public event: 1
- Push event: 1
- Pull request event: 3
- Fork event: 1
- Create event: 3
Dependencies
- adbench ==0.1.11
- datasets ==2.20.0
- deepod ==0.4.1
- feature_engine ==1.8.3
- gensim ==4.3.3
- numpy ==1.26.4
- pandas ==2.2.2
- peft ==0.11.1
- scikit_learn ==1.6.1
- scipy ==1.13.1
- tf-keras ==2.16.0
- tqdm ==4.66.4
- transformers ==4.48.2
- ucimlrepo ==0.0.7
- wandb ==0.17.4