https://github.com/amazon-science/anollm-large-language-models-for-tabular-anomaly-detection

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.2%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: amazon-science
License: apache-2.0
Language: Python
Default Branch: main
Size: 104 KB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 1
Releases: 0

Created over 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme Contributing License Code of conduct

AnoLLM: Large Language Models for Tabular Anomaly Detection (ICLR 2025)

This repository contains the implementation of the paper:

AnoLLM: Large Language Models for Tabular Anomaly Detection
International Conference on Learning Representations (ICLR 2025)
Che-Ping Tsai, Ganyu Teng, Phil Wallis, Wei Ding.

Introduction

AnoLLM is a framework that leverages large language models (LLMs) for unsupervised tabular anomaly detection. It handles mixed-type tabular data (e.g., continuous/numerical, discrete/categorical, and text features) by adapting a pre-trained LLM to tabular rows serialized as text. During inference, AnoLLM assigns each row an anomaly score based on the negative log-likelihood that the LLM gives to its serialized form, so rows the model finds unlikely receive higher scores. Our empirical results indicate that AnoLLM delivers the best performance on six benchmark datasets with mixed feature types.
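The scoring idea can be illustrated with a toy sketch. This is not the AnoLLM implementation: `serialize_row`, `ToyClauseModel`, and `anomaly_score` are hypothetical names, the "name is value" template is an assumed serialization, and a smoothed frequency model stands in for the fine-tuned LLM. Averaging over shuffled column orders is likewise an assumption, made here to mirror the `--n_permutations` flag of the evaluation command.

```python
import math
import random
from collections import Counter


def serialize_row(row, rng=None):
    """Turn a {feature: value} row into text; shuffle column order if rng is given."""
    items = list(row.items())
    if rng is not None:
        rng.shuffle(items)
    return ", ".join(f"{name} is {value}" for name, value in items)


class ToyClauseModel:
    """Stand-in for a fine-tuned LLM: an add-alpha smoothed frequency
    model over 'name is value' clauses (assumption, for illustration only)."""

    def __init__(self, texts, alpha=1.0):
        self.counts = Counter(c for t in texts for c in t.split(", "))
        self.total = sum(self.counts.values())
        self.alpha = alpha
        self.vocab = len(self.counts) + 1  # one extra slot for unseen clauses

    def nll(self, text):
        """Negative log-likelihood of a serialized row under the toy model."""
        denom = self.total + self.alpha * self.vocab
        return -sum(
            math.log((self.counts[c] + self.alpha) / denom)
            for c in text.split(", ")
        )


def anomaly_score(model, row, n_permutations=4, seed=0):
    """Average NLL over several random column orders; higher means more anomalous."""
    rng = random.Random(seed)
    scores = [model.nll(serialize_row(row, rng)) for _ in range(n_permutations)]
    return sum(scores) / len(scores)


# Fit the toy model on "normal" rows only, then score two new rows.
train_rows = [{"color": "red", "size": "S"}] * 5 + [{"color": "blue", "size": "S"}] * 5
model = ToyClauseModel([serialize_row(r) for r in train_rows])

typical = {"color": "red", "size": "S"}
unusual = {"color": "green", "size": "XL"}
print(anomaly_score(model, typical), anomaly_score(model, unusual))
```

For this toy model the column order barely matters because clause likelihoods are summed independently. With an autoregressive LLM the likelihood of a row depends on the order in which its columns appear, which is why averaging over permutations can be useful.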

Installing Dependencies

Python version: 3.10

Create environment

conda create -n anollm python=3.10
conda activate anollm

Install packages

pip install -r requirements.txt

Install Torch, ensuring that the version you choose is compatible with your CUDA version:

pip install torch==2.3.1

Override the installed pyod version to avoid bugs:

pip install pyod==2.0.1

Rerun our experiments

Download the following datasets from Kaggle and place them in data/[dataset_name]/:
- vifd (Vehicle Insurance Fraud Detection)
- fraudecom (Fraud E-commerce)

Run the corresponding scripts for each experiment:

bash scripts/exp1-mixed_benchmark/run_anollm.sh
bash scripts/exp1-mixed_benchmark/run_baselines.sh
bash scripts/exp2-odds/run_anollm.sh
bash scripts/exp2-odds/run_baselines.sh
bash scripts/exp3-binning_effect/run_binning_odds.sh
bash scripts/exp4-model_size/run_anollm_1.7B_mixed.sh
bash scripts/exp4-model_size/run_anollm_1.7B_odds.sh

Using your own datasets

To use a custom dataset, create a dataframe in which each column is one feature, i.e. with the structure {feature_name: feature_values}. Refer to the load_dataset() function in src/data_utils.py for further guidance.
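As a rough illustration of that structure, the sketch below builds a small mixed-type dataframe with pandas. The column names, the values, and the separate `labels` list (with 0 for normal and 1 for anomalous) are made up for this example. How the dataframe and labels must be registered with the code is not shown here and should be taken from load_dataset().

```python
import pandas as pd

# Hypothetical mixed-type features: one numerical, one categorical, one free text.
columns = {
    "age": [34, 51, 29],
    "merchant_category": ["grocery", "travel", "grocery"],
    "memo": ["weekly shop", "flight booking", "weekly shop"],
}

# Each dict key becomes a column; each list holds that feature's values per row.
df = pd.DataFrame(columns)

# Assumed label convention for evaluation only (0 = normal, 1 = anomalous).
labels = [0, 1, 0]

# Every feature column and the label list must describe the same rows.
assert len(df) == len(labels)
print(df.dtypes)
```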

Training Models

For AnoLLM, we use the following command:

CUDA_VISIBLE_DEVICES=$TRAIN_GPUS torchrun --nproc_per_node=$n_train_node train_anollm.py --dataset $dataset --n_splits $n_splits --split_idx 0 --binning standard --setting semi_supervised --max_steps 2000 --batch_size $batch_size --model $model

Check the argument parser in train_anollm.py for options for datasets and models.

For baselines, we use the following command:

CUDA_VISIBLE_DEVICES=0 python evaluate_baselines.py --dataset $dataset --n_splits $n_splits --normalize --setting semi_supervised --split_idx $split_idx

Check the argument parser in evaluate_baselines.py for the available dataset options.

Evaluation

To evaluate AnoLLM, we use the following command:

CUDA_VISIBLE_DEVICES=$INFERENCE_GPUS torchrun --nproc_per_node=$n_test_node evaluate_anollm.py --dataset $dataset --n_splits $n_splits --split_idx 0 --setting semi_supervised --batch_size $eval_batch_size --n_permutations $n_permutations --model $model --binning standard

We evaluate the quality of synthetic data using metrics from various aspects.

python src/get_results.py --dataset $dataset --n_splits $n_splits --setting semi_supervised
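The training and evaluation commands both take --n_splits and --split_idx, so running every split means repeating them with a different index. The sketch below only assembles and prints those command lines; it does not launch anything. `anollm_commands` is a hypothetical helper, `MODEL_NAME` is a placeholder to be replaced with a value accepted by the argument parser, and the default batch sizes, permutation count, and the assumption that split indices run from 0 to n_splits - 1 are illustrative rather than taken from the repository.

```python
import shlex


def anollm_commands(dataset, model, n_splits, split_idx, n_gpus=1,
                    batch_size=32, eval_batch_size=32, n_permutations=10):
    """Return (train, evaluate) argument lists for one split.

    Flag names come from the commands in this README; the default
    numeric values are placeholders, not recommended settings.
    """
    common = [
        "--dataset", dataset,
        "--n_splits", str(n_splits),
        "--split_idx", str(split_idx),
        "--setting", "semi_supervised",
        "--binning", "standard",
        "--model", model,
    ]
    launcher = ["torchrun", f"--nproc_per_node={n_gpus}"]
    train = launcher + ["train_anollm.py", *common,
                        "--max_steps", "2000",
                        "--batch_size", str(batch_size)]
    evaluate = launcher + ["evaluate_anollm.py", *common,
                           "--batch_size", str(eval_batch_size),
                           "--n_permutations", str(n_permutations)]
    return train, evaluate


# Dry run: print the commands for each split (assumed to be indexed from 0).
n_splits = 5
for split_idx in range(n_splits):
    train_cmd, eval_cmd = anollm_commands("vifd", "MODEL_NAME", n_splits, split_idx)
    print(shlex.join(train_cmd))
    print(shlex.join(eval_cmd))
```

To actually run a command, the argument list could be passed to `subprocess.run` with `CUDA_VISIBLE_DEVICES` set in the environment, as in the shell commands above.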

License

This project is licensed under the Apache-2.0 License.

Acknowledgement

Baselines were adapted from https://github.com/vicliv/DTE. Part of the code was adapted from https://github.com/kathrinse/be_great. Thanks to all the authors for their great work!

Reference

@inproceedings{tsai2025anollm,
  title={AnoLLM: Large Language Models for Tabular Anomaly Detection},
  author={Tsai, Che-Ping and Teng, Ganyu and Wallis, Phil and Ding, Wei},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  note={Accepted, to appear},
}

Owner

Name: Amazon Science
Login: amazon-science
Kind: organization

Website: https://amazon.science
Twitter: AmazonScience
Repositories: 80
Profile: https://github.com/amazon-science

GitHub Events

Total

Issues event: 4
Watch event: 16
Delete event: 2
Issue comment event: 4
Public event: 1
Push event: 1
Pull request event: 3
Fork event: 1
Create event: 3

Dependencies

requirements.txt pypi

adbench ==0.1.11
datasets ==2.20.0
deepod ==0.4.1
feature_engine ==1.8.3
gensim ==4.3.3
numpy ==1.26.4
pandas ==2.2.2
peft ==0.11.1
scikit_learn ==1.6.1
scipy ==1.13.1
tf-keras ==2.16.0
tqdm ==4.66.4
transformers ==4.48.2
ucimlrepo ==0.0.7
wandb ==0.17.4
