llama-classification

Text classification with Foundation Language Model LLaMA

https://github.com/sh0416/llama-classification

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.8%) to scientific vocabulary

Keywords

classification foundation-model gpt language-model llama pytorch
Last synced: 4 months ago

Repository

Text classification with Foundation Language Model LLaMA

Basic Info
  • Host: GitHub
  • Owner: sh0416
  • License: gpl-3.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 2.46 MB
Statistics
  • Stars: 114
  • Watchers: 3
  • Forks: 9
  • Open Issues: 0
  • Releases: 0
Topics
classification foundation-model gpt language-model llama pytorch
Created almost 3 years ago · Last pushed almost 3 years ago
Metadata Files
Readme License Citation

README.md

Text classification using LLaMA

This repository provides a basic codebase for text classification using LLaMA.

What system do I use for development?

  • Device: Nvidia 1xV100 GPU
  • Device Memory: 34G
  • Host Memory: 252G

If you need other information about hardware, please open an issue.

How to use

Experimental setup

  1. Get the checkpoint from the official LLaMA repository.
    1-1. I assume that the checkpoint is located in the project root directory and that the contents are arranged as follows:
    ```text
    checkpoints
    ├── llama
    │   ├── 7B
    │   │   ├── checklist.chk
    │   │   ├── consolidated.00.pth
    │   │   └── params.json
    │   └── tokenizer.model
    ```

  2. Prepare your Python environment. I recommend using Anaconda to keep the CUDA version isolated from your local machine.
    ```bash
    conda create -y -n llama-classification python=3.8
    conda activate llama-classification
    conda install cudatoolkit=11.7 -y -c nvidia
    conda list cudatoolkit  # check which CUDA version is installed (11.7)
    pip install -r requirements.txt
    ```

Method: Direct

The direct method compares the conditional probabilities p(y|x) across the candidate labels.
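As a minimal sketch of this idea (not the repository's actual code), prediction reduces to an argmax over per-label conditional log-probabilities; `label_logprob` below is a hypothetical stand-in for the LLaMA forward pass:

```python
def label_logprob(text: str, label: str) -> float:
    # Hypothetical stand-in for the model call: in the real pipeline this
    # would be the LLaMA log-probability of the label tokens given the text.
    # A toy word-overlap heuristic keeps the sketch self-contained.
    return sum(text.lower().split().count(w) for w in label.lower().split()) - len(label)

def predict_direct(text: str, labels: list[str]) -> str:
    # Direct method: choose the label y that maximizes p(y | x).
    return max(labels, key=lambda y: label_logprob(text, y))
```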

  1. Preprocess the data from huggingface datasets using the following scripts. From here on, we use the ag_news dataset.
    ```bash
    python run_preprocess_direct_ag_news.py
    python run_preprocess_direct_ag_news.py --sample=False --data_path=real/inputs_direct_ag_news.json  # Use it for full evaluation
    ```

  2. Run inference with LLaMA to compute the conditional probability and predict the class.
    ```bash
    torchrun --nproc_per_node 1 run_evaluate_direct_llama.py \
        --data_path samples/inputs_direct_ag_news.json \
        --output_path samples/outputs_direct_ag_news.json \
        --ckpt_dir checkpoints/llama/7B \
        --tokenizer_path checkpoints/llama/tokenizer.model
    ```

The calibration variant improves the direct method by calibrating the class scores.

  1. Calibrate using the following command.
    ```bash
    torchrun --nproc_per_node 1 run_evaluate_direct_calibrate_llama.py \
        --direct_input_path samples/inputs_direct_ag_news.json \
        --direct_output_path samples/outputs_direct_ag_news.json \
        --output_path samples/outputs_direct_calibrate_ag_news.json \
        --ckpt_dir checkpoints/llama/7B \
        --tokenizer_path checkpoints/llama/tokenizer.model
    ```
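One common form of calibration (an assumption about the exact method used here, in the spirit of contextual calibration) offsets each class score p(y|x) by the score the model assigns that class on a content-free input, penalizing classes the model favors a priori:

```python
def predict_calibrated(direct_logprobs: dict[str, float],
                       content_free_logprobs: dict[str, float]) -> str:
    # Compare log p(y|x) - log p(y|content-free) instead of log p(y|x)
    # alone, cancelling the model's prior bias toward frequent labels.
    return max(direct_logprobs,
               key=lambda y: direct_logprobs[y] - content_free_logprobs[y])
```

For example, with direct scores `{"world": -1.0, "sports": -1.2}` and content-free scores `{"world": -0.5, "sports": -2.0}`, the calibrated winner flips from "world" to "sports".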

Method: Channel

The channel method compares the conditional probabilities p(x|y), scoring the input text under each candidate label.
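A sketch of the scoring direction (with a toy scorer, not the repository's implementation): the label becomes the conditioning context and the input text is what gets scored:

```python
def predict_channel(text: str, labels: list[str], seq_logprob) -> str:
    # Channel method: choose the label y that maximizes p(x | y),
    # i.e. the likelihood of the input text conditioned on the label.
    return max(labels, key=lambda y: seq_logprob(prompt=y, completion=text))

def toy_logprob(prompt: str, completion: str) -> float:
    # Hypothetical scorer for the sketch; the real version would sum the
    # LLaMA token log-probabilities of `completion` given `prompt`.
    return float(sum(completion.lower().split().count(w)
                     for w in prompt.lower().split()))
```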

  1. Preprocess the data from huggingface datasets using the following scripts. From here on, we use the ag_news dataset.
    ```bash
    python run_preprocess_channel_ag_news.py
    python run_preprocess_channel_ag_news.py --sample=False --data_path=real/inputs_channel_ag_news.json  # Use it for full evaluation
    ```

  2. Run inference with LLaMA to compute the conditional probability and predict the class.
    ```bash
    torchrun --nproc_per_node 1 run_evaluate_channel_llama.py \
        --data_path samples/inputs_channel_ag_news.json \
        --output_path samples/outputs_channel_ag_news.json \
        --ckpt_dir checkpoints/llama/7B \
        --tokenizer_path checkpoints/llama/tokenizer.model
    ```

Method: Pure generation

  1. To evaluate in generate mode, you can reuse the preprocessed direct-format inputs.
    ```bash
    torchrun --nproc_per_node 1 run_evaluate_generate_llama.py \
        --data_path samples/inputs_direct_ag_news.json \
        --output_path samples/outputs_generate_ag_news.json \
        --ckpt_dir checkpoints/llama/7B \
        --tokenizer_path checkpoints/llama/tokenizer.model
    ```

Experiments

| Dataset | num_examples | k | method | accuracy | inference time |
|:---:|:---:|:---:|:---:|:---:|:---:|
| ag_news | 7600 | 1 | direct | 0.7682 | 00:38:40 |
| ag_news | 7600 | 1 | direct+calibrated | 0.8567 | 00:38:40 |
| ag_news | 7600 | 1 | channel | 0.7825 | 00:38:37 |
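The accuracy column above is the fraction of exact matches between predicted and gold labels over the 7,600 test examples; as a trivial sketch:

```python
def accuracy(preds: list, golds: list) -> float:
    # Fraction of predictions that exactly match the gold label.
    assert len(preds) == len(golds) and golds, "mismatched or empty lists"
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```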

Todo list

  • [x] Implement channel method
  • [ ] Experimental report
    • [x] Direct
    • [x] Channel
    • [ ] Generation
  • [ ] Implement other calibration methods
  • [ ] Support other datasets from huggingface datasets
  • [ ] Implement LLM.int8
  • [ ] Other evaluation metrics to measure different characteristics of the foundation model (LLaMA)

Final remark

  • I really appreciate the LLaMA project team publishing the checkpoint and their efficient inference code. Much of the work in this repository is based on the official repository.
  • Readers, don't hesitate to open issues or pull requests. You can bring me:
    • Feature requests
    • Questions about the detailed implementation
    • Discussion about the research direction

Citation

If you use this codebase for your research, citing my work would be appreciated.

```bibtex
@software{Lee_Simple_Text_Classification_2023,
  author  = {Lee, Seonghyeon},
  month   = {3},
  title   = {{Simple Text Classification Codebase using LLaMA}},
  url     = {https://github.com/sh0416/llama-classification},
  version = {1.1.0},
  year    = {2023}
}
```

Owner

  • Name: Seonghyeon
  • Login: sh0416
  • Kind: user
  • Location: Seoul
  • Company: POSTECH

Ph.D. Candidate at POSTECH · Researcher at ScatterLab Inc.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use my codebase for your research, please cite it as below. It would be welcome :)"
authors:
  - family-names: Lee
    given-names: Seonghyeon
title: "Simple Text Classification Codebase using LLaMA"
version: 1.1.0
date-released: 2023-03-19
url: "https://github.com/sh0416/llama-classification"

GitHub Events

Total
  • Watch event: 9
Last Year
  • Watch event: 9

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 15
  • Total Committers: 1
  • Avg Commits per committer: 15.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Seonghyeon s****w@g****m 15

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 2
  • Total pull requests: 1
  • Average time to close issues: 20 days
  • Average time to close pull requests: 1 minute
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 2.5
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • WuNein (1)
  • sunyuhan19981208 (1)
Pull Request Authors
  • sh0416 (1)
Top Labels
Issue Labels
Pull Request Labels