llama-classification
Text classification with Foundation Language Model LLaMA
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (9.8%) to scientific vocabulary
Keywords
Repository
Text classification with Foundation Language Model LLaMA
Basic Info
Statistics
- Stars: 114
- Watchers: 3
- Forks: 9
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
Text classification using LLaMA
This repository provides a basic codebase for text classification using LLaMA.
What system do I use for development?
- Device: Nvidia 1xV100 GPU
- Device Memory: 34G
- Host Memory: 252G
If you need other information about hardware, please open an issue.
How to use
Experimental setup
1. Get the checkpoint from the official LLaMA repository.
   1-1. I assume that the checkpoint is located in the project root directory and the contents are arranged as follows.

   ```text
   checkpoints
   ├── llama
   │   ├── 7B
   │   │   ├── checklist.chk
   │   │   ├── consolidated.00.pth
   │   │   └── params.json
   │   └── tokenizer.model
   ```

2. Prepare your Python environment. I recommend using Anaconda to segregate your local machine's CUDA version.
```bash
conda create -y -n llama-classification python=3.8
conda activate llama-classification
conda install cudatoolkit=11.7 -y -c nvidia
conda list cudatoolkit  # to check what CUDA version is installed (11.7)
pip install -r requirements.txt
```
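To catch a misplaced checkpoint early, a quick sanity check of the layout above can help. This is a sketch; the `verify_checkpoint` helper is hypothetical and not part of this repo:

```python
from pathlib import Path
from typing import List

# Files expected under checkpoints/llama/7B per the layout above (assumption:
# the single-shard 7B checkpoint; larger models ship more consolidated.*.pth files).
REQUIRED = ["checklist.chk", "consolidated.00.pth", "params.json"]

def verify_checkpoint(ckpt_dir: str, tokenizer_path: str) -> List[str]:
    """Return the names of missing checkpoint files (empty list if all present)."""
    ckpt = Path(ckpt_dir)
    missing = [name for name in REQUIRED if not (ckpt / name).exists()]
    if not Path(tokenizer_path).exists():
        missing.append("tokenizer.model")
    return missing
```

For example, `verify_checkpoint("checkpoints/llama/7B", "checkpoints/llama/tokenizer.model")` should return `[]` when the tree matches the layout above.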
Method: Direct
The direct method compares the conditional probabilities p(y|x) across labels.
Preprocess the data from Hugging Face datasets using the following scripts. From now on, we use the ag_news dataset.

```bash
python run_preprocess_direct_ag_news.py
python run_preprocess_direct_ag_news.py --sample=False --data_path=real/inputs_direct_ag_news.json  # Use it for full evaluation
```
Run inference to compute the conditional probability using LLaMA and predict the class.
```bash
torchrun --nproc_per_node 1 run_evaluate_direct_llama.py \
    --data_path samples/inputs_direct_ag_news.json \
    --output_path samples/outputs_direct_ag_news.json \
    --ckpt_dir checkpoints/llama/7B \
    --tokenizer_path checkpoints/llama/tokenizer.model
```
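Conceptually, the direct method scores each candidate label by the log-probability the model assigns to the label tokens given the input, then takes the argmax. A minimal sketch of that final scoring step, assuming the per-token log-probs have already been extracted from the model (the function name and toy numbers below are illustrative, not from this repo):

```python
from typing import Dict, List

def direct_predict(label_token_logprobs: Dict[str, List[float]]) -> str:
    """Pick the label y maximizing log p(y|x): the sum of the log-probs
    of y's tokens conditioned on the input x."""
    scores = {label: sum(lps) for label, lps in label_token_logprobs.items()}
    return max(scores, key=scores.get)

# Toy example: per-token log-probs for each ag_news label name.
logprobs = {
    "World": [-2.3, -0.4],
    "Sports": [-0.7, -0.2],
    "Business": [-3.1, -1.0],
    "Sci/Tech": [-2.8, -0.9],
}
print(direct_predict(logprobs))  # -> Sports
```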
Calibration improves the direct method by calibrating the predicted label probabilities.
- Calibrate using the following command.
```bash
torchrun --nproc_per_node 1 run_evaluate_direct_calibrate_llama.py \
    --direct_input_path samples/inputs_direct_ag_news.json \
    --direct_output_path samples/outputs_direct_ag_news.json \
    --output_path samples/outputs_direct_calibrate_ag_news.json \
    --ckpt_dir checkpoints/llama/7B \
    --tokenizer_path checkpoints/llama/tokenizer.model
```
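The sketch below follows the contextual-calibration idea: label probabilities computed for a content-free input (e.g. "N/A") are used to rescale the direct probabilities, so labels the model favors regardless of the input are down-weighted. This is an illustration under that assumption; the repo's exact calibration procedure may differ.

```python
from typing import Dict

def calibrate(direct_probs: Dict[str, float],
              content_free_probs: Dict[str, float]) -> str:
    """Rescale p(y|x) by p(y|content-free input) and take the argmax."""
    scores = {y: direct_probs[y] / content_free_probs[y] for y in direct_probs}
    return max(scores, key=scores.get)

# Toy example: the raw model over-predicts "World" even on a content-free input.
direct_probs = {"World": 0.5, "Sports": 0.3, "Business": 0.1, "Sci/Tech": 0.1}
cf_probs = {"World": 0.7, "Sports": 0.1, "Business": 0.1, "Sci/Tech": 0.1}
print(calibrate(direct_probs, cf_probs))  # -> Sports
```

Here the uncalibrated argmax would be "World" (0.5), but dividing by the content-free probabilities flips the prediction to "Sports".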
Method: Channel
The channel method compares the conditional probabilities p(x|y) across labels.
Preprocess the data from Hugging Face datasets using the following scripts. From now on, we use the ag_news dataset.

```bash
python run_preprocess_channel_ag_news.py
python run_preprocess_channel_ag_news.py --sample=False --data_path=real/inputs_channel_ag_news.json  # Use it for full evaluation
```
Run inference to compute the conditional probability using LLaMA and predict the class.
```bash
torchrun --nproc_per_node 1 run_evaluate_channel_llama.py \
    --data_path samples/inputs_channel_ag_news.json \
    --output_path samples/outputs_channel_ag_news.json \
    --ckpt_dir checkpoints/llama/7B \
    --tokenizer_path checkpoints/llama/tokenizer.model
```
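The practical difference from the direct method is the prompt direction: direct scores the label tokens after the input, while channel conditions on the label and scores the input tokens. A sketch of the two prompt shapes (these templates are illustrative assumptions, not the repo's actual templates):

```python
def build_direct_prompt(text: str, label: str) -> str:
    # Direct: p(y|x) -- the label tokens come last and get scored.
    return f"{text}\nTopic: {label}"

def build_channel_prompt(text: str, label: str) -> str:
    # Channel: p(x|y) -- condition on the label, score the input tokens.
    return f"Topic: {label}\n{text}"

print(build_direct_prompt("Stocks rallied today.", "Business"))
print(build_channel_prompt("Stocks rallied today.", "Business"))
```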
Method: Pure generation
- To evaluate using `generate` mode, you can use the preprocessed direct version.

  ```bash
  torchrun --nproc_per_node 1 run_evaluate_generate_llama.py \
      --data_path samples/inputs_direct_ag_news.json \
      --output_path samples/outputs_generate_ag_news.json \
      --ckpt_dir checkpoints/llama/7B \
      --tokenizer_path checkpoints/llama/tokenizer.model
  ```
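In generate mode the model produces free text, so the prediction must be mapped back to a label. A minimal matching sketch (assumption: the earliest label name appearing in the generation wins; the repo's actual matching rule may differ):

```python
from typing import List, Optional

# ag_news label names.
LABELS = ["World", "Sports", "Business", "Sci/Tech"]

def match_generation(generated: str, labels: List[str] = LABELS) -> Optional[str]:
    """Map free-form generated text to the earliest label name it mentions."""
    lowered = generated.lower()
    hits = [(lowered.index(l.lower()), l) for l in labels if l.lower() in lowered]
    return min(hits)[1] if hits else None

print(match_generation("The topic of this article is sports."))  # -> Sports
```

When the generation mentions no label name, the sketch returns `None`; a real evaluation would need a policy for such misses (e.g. count them as errors).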
Experiments
| Dataset | num_examples | k | method | accuracy | inference time |
|:---:|:---:|:---:|:---:|:---:|:---:|
| ag_news | 7600 | 1 | direct | 0.7682 | 00:38:40 |
| ag_news | 7600 | 1 | direct+calibrated | 0.8567 | 00:38:40 |
| ag_news | 7600 | 1 | channel | 0.7825 | 00:38:37 |
Todo list
- [x] Implement channel method
- [ ] Experimental report
- [x] Direct
- [x] Channel
- [ ] Generation
- [ ] Implement other calibration methods
- [ ] Support other datasets inside the huggingface datasets library
- [ ] Implement LLM.int8
- [ ] Other evaluation metrics to measure the different characteristics of the foundation model (LLaMA)
Final remark
- I really appreciate the LLaMA project team for publishing the checkpoint and their efficient inference code. Much of the work in this repository is based on the official repository.
- For the reader, don't hesitate to open issues or pull requests. You can give me:
- Any issue about other feature requests
- Any issue about the detailed implementation
- Any discussion about the research direction
Citation
Citing my work would be welcome if you use my codebase for your research.
@software{Lee_Simple_Text_Classification_2023,
author = {Lee, Seonghyeon},
month = {3},
title = {{Simple Text Classification Codebase using LLaMA}},
url = {https://github.com/github/sh0416/llama-classification},
version = {1.1.0},
year = {2023}
}
Owner
- Name: Seonghyeon
- Login: sh0416
- Kind: user
- Location: Seoul
- Company: POSTECH
- Website: https://www.linkedin.com/in/seonghyeondrewlee/
- Repositories: 3
- Profile: https://github.com/sh0416
Ph.D. Candidate at POSTECH; Researcher at ScatterLab Inc.
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use my codebase for your research, please cite it as below. It would be welcome :)"
authors:
- family-names: Lee
given-names: Seonghyeon
title: "Simple Text Classification Codebase using LLaMA"
version: 1.1.0
date-released: 2023-03-19
url: "https://github.com/github/sh0416/llama-classification"
GitHub Events
Total
- Watch event: 9
Last Year
- Watch event: 9
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 2
- Total pull requests: 1
- Average time to close issues: 20 days
- Average time to close pull requests: 1 minute
- Total issue authors: 2
- Total pull request authors: 1
- Average comments per issue: 2.5
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- WuNein (1)
- sunyuhan19981208 (1)
Pull Request Authors
- sh0416 (1)