llama-classification

Text classification with Foundation Language Model LLaMA

https://github.com/sh0416/llama-classification

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.8%) to scientific vocabulary

Keywords

classification foundation-model gpt language-model llama pytorch
Last synced: 4 months ago

Repository

Text classification with Foundation Language Model LLaMA

Basic Info
  • Host: GitHub
  • Owner: sh0416
  • License: gpl-3.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 2.46 MB
Statistics
  • Stars: 114
  • Watchers: 3
  • Forks: 9
  • Open Issues: 0
  • Releases: 0
Topics
classification foundation-model gpt language-model llama pytorch
Created almost 3 years ago · Last pushed almost 3 years ago
Metadata Files
Readme License Citation

README.md

Text classification using LLaMA

This repository provides a basic codebase for text classification using LLaMA.

What system do I use for development?

  • Device: Nvidia 1xV100 GPU
  • Device Memory: 34G
  • Host Memory: 252G

If you need other information about hardware, please open an issue.

How to use

Experimental setup

  1. Get the checkpoint from the official LLaMA repository.
    1-1. I assume that the checkpoint is located in the project root directory and that the contents are arranged as follows:
    ```text
    checkpoints
    ├── llama
    │   ├── 7B
    │   │   ├── checklist.chk
    │   │   ├── consolidated.00.pth
    │   │   └── params.json
    │   └── tokenizer.model
    ```

  2. Prepare your Python environment. I recommend using Anaconda to keep the CUDA version isolated from your local machine.
    ```bash
    conda create -y -n llama-classification python=3.8
    conda activate llama-classification
    conda install cudatoolkit=11.7 -y -c nvidia
    conda list cudatoolkit  # check which CUDA version is installed (11.7)
    pip install -r requirements.txt
    ```

Method: Direct

The direct method compares the conditional probabilities p(y|x) across the candidate labels.
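As a minimal sketch of this idea (not the repository's actual code), prediction reduces to an argmax over per-label conditional log-probabilities; `label_logprob` below is a hypothetical stand-in for the LLaMA forward pass:

```python
def label_logprob(text: str, label: str) -> float:
    # Hypothetical stand-in for the model call: in the real pipeline this
    # would be the LLaMA log-probability of the label tokens given the text.
    # A toy word-overlap heuristic keeps the sketch self-contained.
    return sum(text.lower().split().count(w) for w in label.lower().split()) - len(label)

def predict_direct(text: str, labels: list[str]) -> str:
    # Direct method: choose the label y that maximizes p(y | x).
    return max(labels, key=lambda y: label_logprob(text, y))
```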

  1. Preprocess the data from huggingface datasets using the following scripts. From here on, we use the ag_news dataset.
    ```bash
    python run_preprocess_direct_ag_news.py
    python run_preprocess_direct_ag_news.py --sample=False --data_path=real/inputs_direct_ag_news.json  # Use it for full evaluation
    ```

  2. Run inference with LLaMA to compute the conditional probability and predict the class.
    ```bash
    torchrun --nproc_per_node 1 run_evaluate_direct_llama.py \
        --data_path samples/inputs_direct_ag_news.json \
        --output_path samples/outputs_direct_ag_news.json \
        --ckpt_dir checkpoints/llama/7B \
        --tokenizer_path checkpoints/llama/tokenizer.model
    ```

The calibration variant improves the direct method by calibrating the class scores.

  1. Calibrate using the following command.
    ```bash
    torchrun --nproc_per_node 1 run_evaluate_direct_calibrate_llama.py \
        --direct_input_path samples/inputs_direct_ag_news.json \
        --direct_output_path samples/outputs_direct_ag_news.json \
        --output_path samples/outputs_direct_calibrate_ag_news.json \
        --ckpt_dir checkpoints/llama/7B \
        --tokenizer_path checkpoints/llama/tokenizer.model
    ```
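One common form of calibration (an assumption about the exact method used here, in the spirit of contextual calibration) offsets each class score p(y|x) by the score the model assigns that class on a content-free input, penalizing classes the model favors a priori:

```python
def predict_calibrated(direct_logprobs: dict[str, float],
                       content_free_logprobs: dict[str, float]) -> str:
    # Compare log p(y|x) - log p(y|content-free) instead of log p(y|x)
    # alone, cancelling the model's prior bias toward frequent labels.
    return max(direct_logprobs,
               key=lambda y: direct_logprobs[y] - content_free_logprobs[y])
```

For example, with direct scores `{"world": -1.0, "sports": -1.2}` and content-free scores `{"world": -0.5, "sports": -2.0}`, the calibrated winner flips from "world" to "sports".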

Method: Channel

The channel method compares the conditional probabilities p(x|y), scoring the input text under each candidate label.
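A sketch of the scoring direction (with a toy scorer, not the repository's implementation): the label becomes the conditioning context and the input text is what gets scored:

```python
def predict_channel(text: str, labels: list[str], seq_logprob) -> str:
    # Channel method: choose the label y that maximizes p(x | y),
    # i.e. the likelihood of the input text conditioned on the label.
    return max(labels, key=lambda y: seq_logprob(prompt=y, completion=text))

def toy_logprob(prompt: str, completion: str) -> float:
    # Hypothetical scorer for the sketch; the real version would sum the
    # LLaMA token log-probabilities of `completion` given `prompt`.
    return float(sum(completion.lower().split().count(w)
                     for w in prompt.lower().split()))
```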

  1. Preprocess the data from huggingface datasets using the following scripts. From here on, we use the ag_news dataset.
    ```bash
    python run_preprocess_channel_ag_news.py
    python run_preprocess_channel_ag_news.py --sample=False --data_path=real/inputs_channel_ag_news.json  # Use it for full evaluation
    ```

  2. Run inference with LLaMA to compute the conditional probability and predict the class.
    ```bash
    torchrun --nproc_per_node 1 run_evaluate_channel_llama.py \
        --data_path samples/inputs_channel_ag_news.json \
        --output_path samples/outputs_channel_ag_news.json \
        --ckpt_dir checkpoints/llama/7B \
        --tokenizer_path checkpoints/llama/tokenizer.model
    ```

Method: Pure generation

  1. To evaluate in generate mode, you can reuse the preprocessed direct-format inputs.
    ```bash
    torchrun --nproc_per_node 1 run_evaluate_generate_llama.py \
        --data_path samples/inputs_direct_ag_news.json \
        --output_path samples/outputs_generate_ag_news.json \
        --ckpt_dir checkpoints/llama/7B \
        --tokenizer_path checkpoints/llama/tokenizer.model
    ```

Experiments

| Dataset | num_examples | k | method | accuracy | inference time |
|:---:|:---:|:---:|:---:|:---:|:---:|
| ag_news | 7600 | 1 | direct | 0.7682 | 00:38:40 |
| ag_news | 7600 | 1 | direct+calibrated | 0.8567 | 00:38:40 |
| ag_news | 7600 | 1 | channel | 0.7825 | 00:38:37 |
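The accuracy column above is the fraction of exact matches between predicted and gold labels over the 7,600 test examples; as a trivial sketch:

```python
def accuracy(preds: list, golds: list) -> float:
    # Fraction of predictions that exactly match the gold label.
    assert len(preds) == len(golds) and golds, "mismatched or empty lists"
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```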

Todo list

  • [x] Implement channel method
  • [ ] Experimental report
    • [x] Direct
    • [x] Channel
    • [ ] Generation
  • [ ] Implement other calibration methods
  • [ ] Support other datasets from huggingface datasets
  • [ ] Implement LLM.int8
  • [ ] Other evaluation metrics to measure different characteristics of the foundation model (LLaMA)

Final remark

  • I really appreciate the LLaMA project team publishing the checkpoint and their efficient inference code. Much of the work in this repository is based on the official repository.
  • Readers, don't hesitate to open issues or pull requests. You can bring me:
    • Feature requests
    • Questions about the detailed implementation
    • Discussion about the research direction

Citation

If you use this codebase for your research, citing my work would be appreciated.

```bibtex
@software{Lee_Simple_Text_Classification_2023,
  author  = {Lee, Seonghyeon},
  month   = {3},
  title   = {{Simple Text Classification Codebase using LLaMA}},
  url     = {https://github.com/sh0416/llama-classification},
  version = {1.1.0},
  year    = {2023}
}
```

Owner

  • Name: Seonghyeon
  • Login: sh0416
  • Kind: user
  • Location: Seoul
  • Company: POSTECH

Ph.D. Candidate at POSTECH · Researcher at ScatterLab Inc.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use my codebase for your research, please cite it as below. It would be welcome :)"
authors:
  - family-names: Lee
    given-names: Seonghyeon
title: "Simple Text Classification Codebase using LLaMA"
version: 1.1.0
date-released: 2023-03-19
url: "https://github.com/sh0416/llama-classification"

GitHub Events

Total
  • Watch event: 9
Last Year
  • Watch event: 9

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 15
  • Total Committers: 1
  • Avg Commits per committer: 15.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Seonghyeon s****w@g****m 15

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 2
  • Total pull requests: 1
  • Average time to close issues: 20 days
  • Average time to close pull requests: 1 minute
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 2.5
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • WuNein (1)
  • sunyuhan19981208 (1)
Pull Request Authors
  • sh0416 (1)
Top Labels
Issue Labels
Pull Request Labels