From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression
SubFit introduces non-contiguous submodule replacement for LLMs, retaining 84.6% of dense downstream accuracy at 25% sparsity by fitting residual bypasses on calibration data, with no retraining.
Key Findings
Methodology
This paper proposes SubFit, a residual-based replacement framework that operates at the submodule level within Transformer models. Unlike traditional methods that prune entire layers or contiguous blocks, SubFit employs a non-contiguous selection strategy based on residual contribution scores for Attention and FeedForward modules. The process involves:
- scoring submodules by their impact on the residual stream using a residual rotation and magnitude metric;
- selecting low-impact submodules independently across depths;
- fitting a low-rank surrogate model for each removed submodule using calibration data, which captures the residual contribution via a shared basis for FeedForward modules and layer-specific parameters for Attention modules;
- replacing the submodules with lightweight residual bypasses that approximate their residual outputs.
This approach requires only calibration data and no retraining, and it leverages low-rank structure to reduce parameters and computation, enabling efficient model compression while maintaining task performance.
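The scoring and selection steps above can be sketched in a few lines. This is a minimal illustration, not the paper's metric: `residual_score` is a hypothetical stand-in that adds a rotation term (one minus the cosine between the residual before and after the submodule) to a relative norm-change term, and the `scores` dictionary holds made-up values.

```python
import math

def residual_score(x, delta):
    """Hypothetical score: how much adding `delta` rotates and rescales residual `x`."""
    out = [a + b for a, b in zip(x, delta)]
    dot = sum(a * b for a, b in zip(x, out))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_out = math.sqrt(sum(b * b for b in out))
    rotation = 1.0 - dot / (norm_x * norm_out)    # 0 when the direction is unchanged
    magnitude = abs(norm_out - norm_x) / norm_x   # 0 when the norm is unchanged
    return rotation + magnitude

def select_submodules(scores, n_remove):
    """Pick the n_remove lowest-scoring (layer, kind) pairs, ignoring adjacency."""
    ranked = sorted(scores, key=scores.get)
    return sorted(ranked[:n_remove])

# Made-up scores for a 4-layer model with 8 submodules in total.
scores = {
    (0, "attn"): 0.90, (0, "ffn"): 0.80,
    (1, "attn"): 0.05, (1, "ffn"): 0.60,
    (2, "attn"): 0.40, (2, "ffn"): 0.10,
    (3, "attn"): 0.02, (3, "ffn"): 0.70,
}
removed = select_submodules(scores, n_remove=2)  # 2 of 8 submodules = 25%
print(removed)
```

With these illustrative scores the two removed submodules sit in layers 1 and 3, which a contiguous block selector could not choose without also removing everything in layer 2.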
Key Results
- Experiments on ten LLMs (five base, five instruction-tuned) show that at 25% sparsity SubFit retains 84.6% of dense downstream accuracy with a 2.42× perplexity degradation, against 81.6% and 4.34× for the strongest of the four replacement-based baselines. The method achieves a better trade-off across the evaluated sparsity levels, especially under aggressive compression, with the perplexity gap growing from 0.11× at 12.5% sparsity to 1.92× at 25%.
- Inference speedup is measurable: at 25% sparsity, time-to-first-token (TTFT) improves by 1.18× to 1.40×, and KV-cache usage shrinks proportionally, confirming practical deployment benefits. The approach is also stable, with low variance in perplexity and accuracy across models and sparsity levels.
- Ablation studies reveal that non-contiguous submodule selection significantly improves robustness over traditional contiguous block pruning, especially at higher compression ratios. Residual fitting of FeedForward modules plays a dominant role in maintaining model performance, with shared low-rank bases providing parameter efficiency without sacrificing accuracy.
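The proportional KV-cache saving mentioned above can be illustrated with simple arithmetic. The sketch assumes that a replaced Attention submodule keeps no key/value cache and that the cache therefore scales with the number of remaining Attention submodules; the model dimensions (32 layers, 8 KV heads, head size 128, 4096 tokens, 2 bytes per value) are hypothetical and not taken from the paper.

```python
def kv_cache_bytes(n_attn, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor 2 covers keys and values; one cache per remaining Attention submodule.
    return 2 * n_attn * n_kv_heads * head_dim * seq_len * bytes_per_value

dense = kv_cache_bytes(n_attn=32, n_kv_heads=8, head_dim=128, seq_len=4096)
pruned = kv_cache_bytes(n_attn=24, n_kv_heads=8, head_dim=128, seq_len=4096)
saving = 1 - pruned / dense

print(f"dense: {dense / 2**20:.0f} MiB, pruned: {pruned / 2**20:.0f} MiB, saving: {saving:.0%}")
```

Under these assumptions, removing 8 of 32 Attention submodules cuts the per-sequence cache by a quarter. Removing FeedForward submodules would leave the cache unchanged, so the actual saving depends on which submodule types are selected.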
Significance
This work marks a paradigm shift in LLM compression by moving from layer-wise contiguous pruning to flexible, non-contiguous submodule-level selection. It leverages the inherent redundancy in pretrained transformers, enabling high compression ratios without retraining. The method addresses key industry challenges—reducing deployment costs, improving inference speed, and lowering memory footprint—while preserving model accuracy. Its simplicity, requiring only calibration data, makes it highly applicable for real-world scenarios, including edge deployment and resource-constrained environments. The insights into residual contributions of Attention and FeedForward modules deepen our understanding of transformer internal dynamics, opening avenues for more nuanced model optimization strategies.
Technical Contribution
The core technical innovations include:
- a residual contribution scoring mechanism that quantifies the impact of each submodule on the residual stream, guiding non-contiguous selection;
- a low-rank surrogate model for residual approximation, with closed-form solutions for parameter fitting based on calibration statistics;
- a shared low-rank basis for FeedForward modules across layers, reducing parameter count;
- a lightweight residual bypass that replaces the original submodule, preserving its residual contribution without re-deriving its internal computation;
- an end-to-end post-training compression pipeline that requires no further training or fine-tuning, only calibration data, making it practical for deployment.
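To make the idea of a closed-form low-rank fit from calibration statistics concrete, here is a sketch using classical reduced-rank regression. It is a swapped-in textbook technique, not necessarily the estimator SubFit uses: the function name `fit_low_rank_bypass`, the ridge term, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def fit_low_rank_bypass(X, Y, rank, ridge=1e-6):
    """Fit Y ~ X @ A @ B with A (d x r), B (r x d_out) in closed form.

    X: calibration inputs to the submodule, shape (n, d).
    Y: the submodule's residual contributions, shape (n, d_out).
    """
    d = X.shape[1]
    cov_xx = X.T @ X                      # input second-moment statistics
    cov_xy = X.T @ Y                      # input/output cross statistics
    # Ridge-regularised least-squares solution (full rank).
    W_ols = np.linalg.solve(cov_xx + ridge * np.eye(d), cov_xy)
    # Principal directions of the fitted outputs via a symmetric eigen-decomposition.
    gram = W_ols.T @ cov_xx @ W_ols
    _, eigvecs = np.linalg.eigh(gram)     # eigenvalues come back in ascending order
    V = eigvecs[:, -rank:]                # top-r eigenvectors, shape (d_out, r)
    return W_ols @ V, V.T                 # A, B

# Synthetic calibration data whose true map has rank 2.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
W_true = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 8))
Y = X @ W_true

A, B = fit_low_rank_bypass(X, Y, rank=2)
print(np.abs(X @ A @ B - Y).max())
```

At inference time such a bypass would compute `x + x @ A @ B` in place of `x + submodule(x)`, costing two thin matrix products instead of a full Attention or FeedForward pass.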
Novelty
This study is the first to systematically implement non-contiguous submodule-level replacement in transformer-based LLMs, contrasting with prior work that focused on contiguous layer or block pruning. The key novelty lies in the residual-based scoring and low-rank surrogate fitting, which enable precise approximation of residual contributions from individual submodules. This approach exploits the internal redundancy of transformers more granularly, leading to higher compression ratios with minimal performance loss. The shared basis for FeedForward modules across layers is also a novel engineering solution that balances parameter efficiency with accuracy, setting this work apart from existing pruning and replacement methods.
Limitations
- The method relies on calibration data, which may limit its effectiveness in scenarios with highly domain-specific tasks or unseen data distributions. Excessive compression beyond 37.5% may cause significant performance degradation due to the limitations of low-rank approximation.
- The assumption of low-rank residuals may not hold for all submodules, especially in highly non-linear or complex components, potentially increasing approximation errors.
- Scaling to extremely large models (e.g., hundreds of billions of parameters) poses computational challenges in covariance estimation and eigen-decomposition steps, requiring further optimization.
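A rough sense of the scaling concern in the last point: a dense covariance matrix grows quadratically with its dimension, and a dense symmetric eigen-decomposition grows cubically. The sketch below assumes the statistics are collected over the hidden dimension in 32-bit floats, and the hidden sizes are hypothetical examples rather than models from the paper.

```python
def covariance_cost(hidden_size, bytes_per_value=4):
    cov_bytes = hidden_size ** 2 * bytes_per_value  # one d x d matrix
    eig_ops = hidden_size ** 3                      # cubic term only, constant omitted
    return cov_bytes, eig_ops

for d in (4096, 8192, 16384):
    cov_bytes, eig_ops = covariance_cost(d)
    print(f"d={d}: covariance {cov_bytes / 2**20:.0f} MiB, "
          f"eigen-decomposition on the order of {eig_ops:.1e} operations")
```

Doubling the hidden size quadruples the memory per covariance matrix and multiplies the eigen-decomposition work by eight, and this cost is paid once per fitted submodule.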
Future Work
Future directions include extending the residual fitting framework to incorporate non-linear surrogate models, enabling better approximation of complex submodules. Exploring adaptive, data-driven selection strategies that consider task-specific importance could further improve compression-performance trade-offs. Additionally, integrating hardware-aware optimization and quantization techniques may enhance deployment efficiency. Investigating the applicability of this approach to multimodal models and real-time adaptive compression remains an open avenue, aiming to make large models more accessible and efficient across diverse applications.
AI Executive Summary
Large Language Models (LLMs) have revolutionized natural language processing, demonstrating unprecedented capabilities across a wide range of tasks. However, their enormous size—often hundreds of billions of parameters—poses significant challenges for deployment, especially in resource-constrained environments. Traditional model compression techniques, such as pruning entire layers or contiguous blocks, have made strides in reducing model size and inference latency. Yet, these methods often suffer from performance drops at high compression ratios, due to the rigid structure of layer-wise pruning and the inability to fully exploit the internal redundancy of transformers.
This paper introduces a novel approach called SubFit, which fundamentally rethinks the granularity of model compression. Instead of removing entire layers or contiguous blocks, SubFit operates at the submodule level—specifically targeting the Attention and FeedForward components within each transformer layer. The key insight is that the redundancy in these submodules is not confined to contiguous regions, and different submodules exhibit varying degrees of importance. By scoring each submodule based on its residual contribution to the model's residual stream, SubFit identifies those with minimal impact and selectively replaces them with lightweight residual bypasses.
The core technical innovation lies in fitting a low-rank surrogate model for each removed submodule using only calibration data. This surrogate captures the residual contribution of the submodule via shared low-rank bases for FeedForward modules and layer-specific parameters for Attention modules. The process involves a closed-form solution for parameter estimation, avoiding the need for retraining or fine-tuning, which significantly simplifies deployment. The residual bypass approximates the submodule's effect, preserving the model's overall performance while reducing parameters and computational load.
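The parameter saving from a shared FeedForward basis can be illustrated by counting weights. The parametrisation here is a hypothetical one chosen for the example (a single shared d × r basis plus one r × d map per replaced module, compared with an independent pair of d × r and r × d factors per module); the paper's exact factorisation may differ, and the sizes are made up.

```python
def independent_params(n_modules, d_model, rank):
    # Each bypass owns a down-projection (d x r) and an up-projection (r x d).
    return n_modules * 2 * d_model * rank

def shared_basis_params(n_modules, d_model, rank):
    # One shared d x r basis plus a per-module r x d map.
    return d_model * rank + n_modules * rank * d_model

n, d, r = 8, 4096, 64
indep = independent_params(n, d, r)
shared = shared_basis_params(n, d, r)
print(indep, shared, shared / indep)
```

Under this assumed layout the shared variant approaches half the parameters of the independent one as the number of replaced modules grows, because the basis cost is amortised across layers.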
Experiments across ten LLMs (five base and five instruction-tuned) demonstrate the effectiveness of SubFit. At 25% sparsity, the method retains 84.6% of dense downstream accuracy with a 2.42× perplexity degradation, against 81.6% and 4.34× for the strongest baselines. The compressed models also show inference speedups of 1.18× to 1.40× in time-to-first-token, along with KV-cache savings. Ablation studies confirm that non-contiguous submodule selection improves robustness, especially at higher compression ratios.
Overall, SubFit offers a practical and efficient way to deploy large models in real-world scenarios. Its ability to operate without retraining, combined with a better perplexity-accuracy trade-off than the evaluated baselines, makes it a promising direction for future research. The approach also provides new insights into the internal redundancy of transformers, paving the way for more granular model optimization strategies. Challenges remain, such as extending to larger models and more complex residual structures, but the current results are a meaningful step toward scalable, resource-efficient AI deployment, with implications for both industry and academia.
Deep Dive
Abstract
Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-layer granularity and contiguous selection. We argue that this is overly restrictive: in fact, redundancy in pretrained transformers is not confined to contiguous regions, nor does it evenly distribute between Attention and FeedForward outputs, implying that different strategies best approximate different submodule types and that removable components need not cluster within contiguous depth ranges. Based on this intuition, we introduce SubFit (Submodule-level Fitted residual replacement), which compresses LLMs at the submodule level: Attention and FeedForward submodules are selected non-contiguously, and each receives its own lightweight fitted residual bypass. SubFit operates post-training and requires only calibration data. Across ten LLMs (five base, five instruction-tuned), five sparsity levels from 12.5% to 37.5%, and four replacement-based baselines, SubFit achieves the best aggregate perplexity-accuracy trade-off across the evaluated sparsity levels, with larger gains under aggressive compression. At 25% sparsity, it retains 84.6% of dense downstream accuracy and incurs 2.42x perplexity degradation, against 81.6% and 4.34x for the strongest baselines, while delivering measurable inference speedup and KV-cache savings. Code is available at https://github.com/eliacunegatti/SubFit.