llm-dp-finetune
End-to-end codebase for finetuning LLMs (LLaMA 2, 3, etc.) with or without DP
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ✓ Academic publication links: Links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.8%) to scientific vocabulary
Keywords
Repository
End-to-end codebase for finetuning LLMs (LLaMA 2, 3, etc.) with or without DP
Basic Info
Statistics
- Stars: 12
- Watchers: 1
- Forks: 3
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
Private Finetuning for LLMs (LLM-PFT)
The codebase for LLM DP/scrubbing/undefended finetuning in LLM-PBE.
This code is modified from the code of pii-leakage. This repository supports fine-tuning the latest LLMs, Flair Named Entity Recognition (NER) models, and the Private AI API (for scrubbing). It allows fine-tuning (i) undefended, (ii) differentially-private, and (iii) scrubbed language models on ECHR and Enron.
The repository differs from pii-leakage in these ways:
1. We replace opacus with a customized version of fast-dp, which is more memory-efficient and compatible with PyTorch 2.0, recent CUDA versions, and distributed training (e.g., DeepSpeed).
2. We support the latest LLMs, e.g., LLaMA.
3. (WIP) We extend the scrubbing tool from Flair to Private AI.
4. We exclude the PII analysis tools and focus on fine-tuning.
Build & Run
We recommend setting up a conda environment for this project.
```shell
conda create -n llm-pft python=3.10 -y
conda activate llm-pft

pip install torch
# if running `nvcc -V` yields nothing, do this:
conda install cuda -c nvidia -y
pip install torch==2.2.0 --index-url https://download.pytorch.org/whl/cu121

pip install -e .
```
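The torch wheel above is pinned to the CUDA 12.1 index. If you are unsure which wheel matches your toolkit, a small helper (hypothetical, not part of this repository) can map the `nvcc -V` output to the corresponding PyTorch wheel index URL:

```python
import re
import subprocess
from typing import Optional


def cuda_wheel_index(nvcc_output: str) -> Optional[str]:
    """Map `nvcc -V` output to a PyTorch wheel index URL (e.g. .../cu121).

    Returns None when no CUDA release line is found, which mirrors the
    "nvcc -V yields empty" case in the instructions above.
    """
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    major, minor = match.groups()
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


def detect_wheel_index() -> Optional[str]:
    """Run nvcc and map its version; returns None if nvcc is missing."""
    try:
        out = subprocess.run(["nvcc", "-V"], capture_output=True, text=True).stdout
    except FileNotFoundError:
        return None
    return cuda_wheel_index(out)


sample = "Cuda compilation tools, release 12.1, V12.1.105"
print(cuda_wheel_index(sample))  # https://download.pytorch.org/whl/cu121
```

If `detect_wheel_index()` returns None, install the CUDA toolkit first (`conda install cuda -c nvidia -y`) as shown above.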
Troubleshooting:
* FlashAttention only supports Ampere and newer GPUs.
  - Update transformers to the latest version.
  - Add the lines below if it still does not work:
    ```python
    import torch
    # disable the memory-efficient SDP backend (intended to avoid flash attention)
    torch.backends.cuda.enable_mem_efficient_sdp(False)
    ```
  - Remove fp16 and bf16 from both the deepspeed config and the run config if you still see this error.
* Cannot find the symbolic link for the model (typically due to flash attn):
  - Install nvcc, e.g. by `conda install cuda -c nvidia -y`.
* `RuntimeError: 'weight' must be 2-D` with llama2 and ZeRO stage 3. Try updating the packages below to the matched versions:
  ```shell
  pip install transformers==4.29.0
  pip install pydantic==1.10
  pip install deepspeed~=0.8.3
  # if fast tokenization is used
  pip install tokenizers==0.13.3
  # for the zero-grad DP stage-3 error
  pip install fastDP@git+https://github.com/jyhong836/fast-differential-privacy.git
  ```
* `PrivacyEngine_Distributed_stage_2_and_3.__init__` errors: install a fixed version of fast-dp via `pip install fastDP@git+https://github.com/jyhong836/fast-differential-privacy.git`.
* If you encounter the following error message when running the attack:
  ```
  if self.pad_token_id is not None and self.pad_token_id < 0:
  TypeError: '<' not supported between instances of 'list' and 'int'
  ```
  You can fix it by removing the `pad_token_id` item in the HuggingFace cache `config.json` (e.g., the path may be like `~/.cache/huggingface/hub/models--LLM-PBE--Llama3.1-8b-instruct-LLMPC-Red-Team/snapshots/xxx/config.json`) and running again.
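The same fix can be applied programmatically. A minimal sketch (not part of this repository; point it at whatever snapshot path applies on your machine):

```python
import json
import tempfile
from pathlib import Path


def drop_pad_token_id(config_path: str) -> bool:
    """Remove a malformed `pad_token_id` entry from a cached config.json.

    Returns True if the key was present and removed, False otherwise.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    if "pad_token_id" not in config:
        return False
    del config["pad_token_id"]
    path.write_text(json.dumps(config, indent=2))
    return True


# demo on a throwaway file; a real run would target the cached snapshot path
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "config.json"
    demo.write_text(json.dumps({"model_type": "llama", "pad_token_id": [128001, 128008]}))
    print(drop_pad_token_id(str(demo)))  # True: key removed
    print(drop_pad_token_id(str(demo)))  # False: already gone
```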
Usage
We explain the following functions. The scripts are in the ./examples folder and
run configurations are in the ./configs folder.
* Fine-Tune: Fine-tune a pre-trained LM on a dataset (optionally with DP or scrubbing).
Fine-Tuning
We demonstrate how to fine-tune a LLaMA2 model on the ECHR dataset
(i) without defenses, (ii) with scrubbing and (iii) with differentially private training (ε=8).
No Defense
```shell
export CUDA_VISIBLE_DEVICES=2,3,4,5
deepspeed fine_tune.py --config_path=configs/fine-tune/echr-llama2-7b-undefended.yml
```
With Scrubbing
Note: All PII will be scrubbed from the dataset. Scrubbing is a one-time operation that requires tagging all PII in the dataset first, which can take many hours depending on your setup. We do not provide tagged datasets.
```shell
export CUDA_LAUNCH_BLOCKING=1
deepspeed fine_tune.py --config_path=configs/fine-tune/echr-llama2-7b-scrubbed.yml
```
With DP (ε=8.0)
Note: We use the dp-transformers wrapper around PyTorch's opacus library.
```shell
# if device IDs are not 0,1,2,3, then do below
export CUDA_VISIBLE_DEVICES=2,3,4,5
deepspeed --num_gpus=4 fine_tune.py --config_path=configs/fine-tune/echr-llama2-7b-dp8.yml
```
NOTE: Don't run the script directly with `python fine_tune.py ...`; that does not apply DP with ZeRO.
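For intuition, a DP training step clips each per-example gradient to a norm bound C and adds Gaussian noise scaled by a noise multiplier σ before averaging. A pure-Python sketch of this DP-SGD aggregation (illustrative only; the repository's fastDP engine implements a far more memory-efficient version inside the optimizer):

```python
import math
import random


def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng):
    """One DP-SGD aggregation step: clip each gradient, sum, add noise, average.

    per_example_grads: list of per-example gradient vectors (lists of floats).
    clip_norm: maximum L2 norm C for each per-example gradient.
    noise_multiplier: sigma; the Gaussian noise has std sigma * C.
    """
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0  # clip to norm C
        for i, g in enumerate(grad):
            summed[i] += g * scale
    n = len(per_example_grads)
    # add noise calibrated to the clipping norm, then average over the batch
    return [(s + rng.gauss(0.0, noise_multiplier * clip_norm)) / n for s in summed]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
noisy_avg = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
print(noisy_avg)  # with sigma=0: average of clipped grads -> [0.45, 0.6]
```

The ε=8.0 guarantee comes from accounting over many such noisy steps; the config file above sets the corresponding clipping norm and noise multiplier.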
Datasets
The provided ECHR dataset wrapper already tags all PII in the dataset. The PII tagging is done using the Flair NER modules and can take several hours depending on your setup, but is a one-time operation that will be cached in subsequent runs.
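Once tagging is done, scrubbing itself is a simple span replacement. A minimal sketch, assuming the tagger yields `(start, end, label)` character spans (a hypothetical format here; the repository uses Flair/Private AI taggers):

```python
def scrub(text, spans):
    """Replace tagged PII character spans with <LABEL> placeholders.

    spans: iterable of (start, end, label) tuples over `text`;
    replacement runs right-to-left so earlier offsets stay valid.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text


doc = "Judge Ann Smith heard the case in Strasbourg."
tags = [(6, 15, "PERSON"), (34, 44, "LOC")]
print(scrub(doc, tags))  # Judge <PERSON> heard the case in <LOC>.
```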
Citation
Please consider citing the following papers if you find our work useful.
```bibtex
@article{li2024llm,
  title={LLM-PBE: Assessing Data Privacy in Large Language Models},
  author={Li, Qinbin and Hong, Junyuan and Xie, Chulin and Tan, Jeffrey and Xin, Rachel and Hou, Junyi and Yin, Xavier and Wang, Zhun and Hendrycks, Dan and Wang, Zhangyang and others},
  journal={Proceedings of the VLDB Endowment},
  volume={17},
  number={11},
  pages={3201--3214},
  year={2024},
  publisher={VLDB Endowment}
}
@InProceedings{lukas2023analyzing,
  title = {Analyzing Leakage of Personally Identifiable Information in Language Models},
  author = {Lukas, Nils and Salem, Ahmed and Sim, Robert and Tople, Shruti and Wutschitz, Lukas and Zanella-B{\'e}guelin, Santiago},
  booktitle = {2023 IEEE Symposium on Security and Privacy (SP)},
  year = {2023},
  publisher = {IEEE Computer Society},
  pages = {346--363},
  doi = {10.1109/SP46215.2023.00154}
}
```
Owner
- Name: Junyuan Hong
- Login: jyhong836
- Kind: user
- Company: Michigan State University
- Website: https://jyhong.gitlab.io
- Twitter: hjy836
- Repositories: 47
- Profile: https://github.com/jyhong836
Researcher on Federated Learning, Privacy, Trustworthy ML.
GitHub Events
Total
- Issues event: 1
- Watch event: 7
- Issue comment event: 1
- Fork event: 3
Last Year
- Issues event: 1
- Watch event: 7
- Issue comment event: 1
- Fork event: 3
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 1
- Total pull requests: 0
- Average time to close issues: 5 months
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 2.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: 5 months
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 2.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- Zuo-Lihan (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- torch *
- tqdm *
- transformers *