llm-dp-finetune

End-to-end codebase for finetuning LLMs (LLaMA 2, 3, etc.) with or without DP

https://github.com/jyhong836/llm-dp-finetune

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.8%) to scientific vocabulary

Keywords

dp finetuning llm
Last synced: 6 months ago

Repository

End-to-end codebase for finetuning LLMs (LLaMA 2, 3, etc.) with or without DP

Basic Info
  • Host: GitHub
  • Owner: jyhong836
  • License: other
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 51.8 KB
Statistics
  • Stars: 12
  • Watchers: 1
  • Forks: 3
  • Open Issues: 0
  • Releases: 0
Topics
dp finetuning llm
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme · License · Code of conduct · Citation · Security · Support

README.md

Private Finetuning for LLMs (LLM-PFT)

The codebase for DP, scrubbing, and undefended fine-tuning of LLMs in LLM-PBE (arXiv).

This code is modified from the code of pii-leakage. This repository supports fine-tuning the latest LLMs, Flair Named Entity Recognition (NER) models, and the Private AI API (for scrubbing). It allows fine-tuning (i) undefended, (ii) differentially private, and (iii) scrubbed language models on ECHR and Enron.

This repository differs from pii-leakage in the following ways:

1. We replace opacus with a customized version of fast-dp, which is more memory-efficient and compatible with the latest PyTorch 2.0, CUDA, and distributed training (e.g., deepspeed).
2. We support the latest LLMs, e.g., LLaMA.
3. (WIP) We extend the scrubbing tool from Flair to Private AI.
4. We exclude PII analysis tools and focus on fine-tuning.
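The DP training that fast-dp provides follows the standard DP-SGD recipe: clip each per-sample gradient, sum, and add Gaussian noise calibrated to the clipping bound. The following is a minimal, library-free sketch of that core step for intuition only; it is not the fastDP API, and all names are illustrative:

```python
import math
import random

def dp_sgd_step(per_sample_grads, clip_norm=1.0, noise_multiplier=1.0, seed=0):
    """One DP-SGD step: clip each per-sample gradient to `clip_norm`,
    sum, add Gaussian noise scaled by noise_multiplier * clip_norm,
    and average over the batch."""
    rng = random.Random(seed)
    batch_size = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    summed = [0.0] * dim
    for g in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / (norm + 1e-12))  # clip, never amplify
        for i, x in enumerate(g):
            summed[i] += x * scale
    # Gaussian noise calibrated to the clipping bound
    noisy = [s + rng.gauss(0.0, noise_multiplier * clip_norm) for s in summed]
    return [x / batch_size for x in noisy]

# with noise_multiplier=0 the step reduces to plain clipped averaging
grad = dp_sgd_step([[3.0, 4.0], [0.3, 0.4]], clip_norm=1.0, noise_multiplier=0.0)
```

The memory savings of fast-dp come from how the per-sample gradients are computed and clipped inside the backward pass, not from changing this basic recipe.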

Build & Run

We recommend setting up a conda environment for this project.

```shell
conda create -n llm-pft python=3.10 -y
conda activate llm-pft

pip install torch

# if running `nvcc -V` yields nothing, do this:
conda install cuda -c nvidia -y
pip install torch==2.2.0 --index-url https://download.pytorch.org/whl/cu121

pip install -e .
```

Troubleshooting:

* `FlashAttention only supports Ampere and newer.`
  - Update transformers to the latest version.
  - Add the below if it is still not working:
    ```python
    import torch
    # to disable flash attn?
    torch.backends.cuda.enable_mem_efficient_sdp(False)
    ```
  - Remove fp16 and bf16 from both the deepspeed and run configs if you still see this.
* Cannot find the symbol link for a model (typically due to flash attn): install nvcc, e.g., by `conda install cuda -c nvidia -y`.
* `RuntimeError: 'weight' must be 2-D` with llama2 and zero3: try updating the packages below to the matched versions:
  ```shell
  pip install transformers==4.29.0
  pip install pydantic==1.10
  pip install deepspeed~=0.8.3
  # if fast tokenization is used
  pip install tokenizers==0.13.3
  # for the zero_grad_DP_stage3() error
  pip install fastDP@git+https://github.com/jyhong836/fast-differential-privacy.git
  ```
* `PrivacyEngine_Distributed_stage2_and_3.__init__.<locals>.zero_grad_DP_stage3() got an unexpected keyword argument 'set_to_none'`: install a fixed version of fast-dp via `pip install fastDP@git+https://github.com/jyhong836/fast-differential-privacy.git`.
* If you encounter the following error message when running the attack:
  ```
  if self.pad_token_id is not None and self.pad_token_id < 0:
  TypeError: '<' not supported between instances of 'list' and 'int'
  ```
  You can fix it by removing the `pad_token_id` item in the cached HuggingFace `config.json` (e.g., the path may be like `~/.cache/huggingface/hub/models--LLM-PBE--Llama3.1-8b-instruct-LLMPC-Red-Team/snapshots/xxx/config.json`) and running again.
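The `pad_token_id` fix can be scripted rather than edited by hand. A small hedged sketch (the function name is ours, not part of this repo; pass it your actual cached snapshot path):

```python
import json

def remove_pad_token_id(config_path):
    """Drop the offending `pad_token_id` entry from a HuggingFace config.json."""
    with open(config_path) as f:
        cfg = json.load(f)
    cfg.pop("pad_token_id", None)  # no-op if the key is absent
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
```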

Usage

We explain the following functions. The scripts are in the `./examples` folder and the run configurations are in the `./configs` folder.

* Fine-Tune: Fine-tune a pre-trained LM on a dataset (optionally with DP or scrubbing).

Fine-Tuning

We demonstrate how to fine-tune a LLaMA2 model on the ECHR dataset (i) without defenses, (ii) with scrubbing and (iii) with differentially private training (ε=8).

No Defense

```shell
export CUDA_VISIBLE_DEVICES=2,3,4,5
deepspeed fine_tune.py --config_path=configs/fine-tune/echr-llama2-7b-undefended.yml
```

With Scrubbing

Note: All PII will be scrubbed from the dataset. Scrubbing is a one-time operation that requires tagging all PII in the dataset first, which can take many hours depending on your setup. We do not provide tagged datasets.

```shell
export CUDA_LAUNCH_BLOCKING=1
deepspeed fine_tune.py --config_path=configs/fine-tune/echr-llama2-7b-scrubbed.yml
```
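Conceptually, scrubbing replaces each NER-tagged PII span with its entity-type placeholder. A toy illustration of that step, assuming spans have already been produced by a tagger such as Flair (the function and example are ours, not this repo's API):

```python
def scrub(text, entities):
    """Replace tagged PII spans (start, end, label) with placeholders.
    `entities` come from an NER tagger; spans are processed right-to-left
    so earlier character offsets stay valid as we edit."""
    for start, end, label in sorted(entities, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

out = scrub("Alice moved to Paris.", [(0, 5, "PERSON"), (15, 20, "LOC")])
# -> "[PERSON] moved to [LOC]."
```

The real pipeline differs in tagging quality and entity schema, but this is the substitution the scrubbed training data goes through.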

With DP (ε=8.0)

Note: We use the dp-transformers wrapper around PyTorch's opacus library.

```shell
# if device IDs are not 0,1,2,3, then do the below
export CUDA_VISIBLE_DEVICES=2,3,4,5
deepspeed --num_gpus=4 fine_tune.py --config_path=configs/fine-tune/echr-llama2-7b-dp8.yml
```

NOTE: Do not run the script directly with `python fine_tune.py ...`, which does not apply DP with ZeRO.

Datasets

The provided ECHR dataset wrapper already tags all PII in the dataset. The PII tagging is done using the Flair NER modules and can take several hours depending on your setup, but is a one-time operation that will be cached in subsequent runs.
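The tag-once-then-cache pattern described above can be sketched in a few lines; this is an illustration of the caching idea, not this repo's actual implementation, and all names are hypothetical:

```python
import json
import os

def tag_with_cache(docs, tagger, cache_path):
    """Run the expensive NER tagger once; reuse the JSON cache on later runs."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)  # cache hit: skip tagging entirely
    tagged = [tagger(d) for d in docs]  # one-time, potentially hours-long pass
    with open(cache_path, "w") as f:
        json.dump(tagged, f)
    return tagged
```

On the second and subsequent runs the cache file is found and the tagger is never invoked.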

Citation

Please consider citing the following paper if you found our work useful.

@article{li2024llm,
  title={LLM-PBE: Assessing Data Privacy in Large Language Models},
  author={Li, Qinbin and Hong, Junyuan and Xie, Chulin and Tan, Jeffrey and Xin, Rachel and Hou, Junyi and Yin, Xavier and Wang, Zhun and Hendrycks, Dan and Wang, Zhangyang and others},
  journal={Proceedings of the VLDB Endowment},
  volume={17},
  number={11},
  pages={3201--3214},
  year={2024},
  publisher={VLDB Endowment}
}

@InProceedings{lukas2023analyzing,
  title     = {Analyzing Leakage of Personally Identifiable Information in Language Models},
  author    = {Lukas, Nils and Salem, Ahmed and Sim, Robert and Tople, Shruti and Wutschitz, Lukas and Zanella-B{\'e}guelin, Santiago},
  booktitle = {2023 IEEE Symposium on Security and Privacy (SP)},
  year      = {2023},
  publisher = {IEEE Computer Society},
  pages     = {346-363},
  doi       = {10.1109/SP46215.2023.00154}
}

Owner

  • Name: Junyuan Hong
  • Login: jyhong836
  • Kind: user
  • Company: Michigan State University

Researcher on Federated Learning, Privacy, Trustworthy ML.

Citation (CITATION.bib)

@article{li2024llm,
  title={LLM-PBE: Assessing Data Privacy in Large Language Models},
  author={Li, Qinbin and Hong, Junyuan and Xie, Chulin and Tan, Jeffrey and Xin, Rachel and Hou, Junyi and Yin, Xavier and Wang, Zhun and Hendrycks, Dan and Wang, Zhangyang and others},
  journal={Proceedings of the VLDB Endowment},
  volume={17},
  number={11},
  pages={3201--3214},
  year={2024},
  publisher={VLDB Endowment}
}
@InProceedings{lukas2023analyzing,
  title      = {Analyzing Leakage of Personally Identifiable Information in Language Models},
  author     = {Lukas, Nils and Salem, Ahmed and Sim, Robert and Tople, Shruti and Wutschitz, Lukas and Zanella-B{\'e}guelin, Santiago},
  booktitle  = {2023 IEEE Symposium on Security and Privacy (SP)},
  year       = {2023},
  publisher  = {IEEE Computer Society},
  pages      = {346-363},
  doi        = {10.1109/SP46215.2023.00154}
}

GitHub Events

Total
  • Issues event: 1
  • Watch event: 7
  • Issue comment event: 1
  • Fork event: 3
Last Year
  • Issues event: 1
  • Watch event: 7
  • Issue comment event: 1
  • Fork event: 3

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: 5 months
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 2.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: 5 months
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 2.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Zuo-Lihan (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.txt pypi
  • torch *
  • tqdm *
  • transformers *
setup.py pypi