https://github.com/chijames/pruneme
Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
○codemeta.json file
-
○.zenodo.json file
-
○DOI references
-
✓Academic publication links
Links to: arxiv.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (10.6%) to scientific vocabulary
Last synced: 7 months ago
·
JSON representation
Repository
Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models
Basic Info
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of arcee-ai/PruneMe
Created over 1 year ago
· Last pushed over 1 year ago
https://github.com/chijames/PruneMe/blob/main/
# Layer Pruning of Large Language Models
This repository hosts the [unofficial] implementation of a layer pruning strategy for Large Language Models (LLMs) based on the insights from the paper "[The Unreasonable Ineffectiveness of the Deeper Layers](https://arxiv.org/abs/2403.17887)" by Andrey Gromov et al. The paper empirically demonstrates that LLMs can have a substantial number of their layers removed with minimal performance degradation, especially on question-answering benchmarks.
## Summary of the Paper
The authors present a straightforward method for pruning layers from popular pretrained LLMs. They identify optimal blocks of layers for pruning by analyzing similarity across layers. The approach suggests that deeper layers in these LLMs are often more redundant than previously thought. After pruning, the model can be "healed" using Parameter-Efficient Fine-tuning (PEFT) methods like QLoRA to recover from the pruning-induced performance loss. These findings open avenues for significant computational resource reduction during both fine-tuning and inference stages, potentially improving memory and latency without severe impacts on performance.
## Repository Structure
The codebase is divided into two main parts:
### First compute the block/layer similarity given a dataset
Navigate to the `compute_block_similarity` directory to run the layer similarity computation script.
1. Open `run_layer_similarity.sh`.
2. Replace the placeholder arguments with your model and dataset information:
```bash
python layer_similarity.py --model_path "mistralai/Mistral-7B-Instruct-v0.2" \
--dataset "arcee-ai/sec-data-mini" \
--dataset_column "text" \
--batch_size 8 \
--max_length 1024 \
--layers_to_skip 8 \
--dataset_size 4000 \
--dataset_subset "train"
```
3. Execute the script to get the most optimum layer/block range to prune.
### Create the new model with the [Mergekit](https://github.com/arcee-ai/mergekit)
After identifying the layers to prune, you can proceed to the `slice_with_mergekit` directory. Here, you can leverage the MergeKit library to prune the identified layers from the model effectively.
## Observations and Insights
Our findings corroborate the theory presented in the paper `The Unreasonable Ineffectiveness of the Deeper Layers`that some portions of deeper layers in LLMs like `mistralai/Mistral-7B-Instruct-v0.2` can be pruned with minimal performance degradation. Specifically, we observed a pattern of high redundancy in deeper layers, aligning with the hypothesis that not all layers contribute equally to the model's performance. This redundancy indicates potential for computational efficiency improvements without sacrificing the quality of outcomes. Following this strategy, the pruned model can be found in the [arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer](https://huggingface.co/arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer), and it can generate coherent text, demonstrating the practical applicability of our pruning approach.

## Use Cases
The layer pruning strategy detailed in the paper and implemented in this codebase is beneficial for:
- Reducing the computational requirements of LLMs without significant performance loss.
- Facilitating fine-tuning and potential continuous pretraining of pruned models.
- Initial experiments suggest the pruned models can still generate coherent text.
- When combined with techniques like QLoRA, pruned models can heal from performance drops efficiently.
## Advanced Use Cases
By pruning larger models and continuing pre-training with domain-specific datasets, this strategy could revolutionize the way we merge smaller domain-adapted LLMs into larger ones through Data Flow Space (DFS) Merging, facilitating efficient domain adaptation.
Owner
- Login: chijames
- Kind: user
- Repositories: 3
- Profile: https://github.com/chijames
GitHub Events
Total
- Push event: 6
Last Year
- Push event: 6