From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

TL;DR

SubFit replaces non-contiguously selected Attention and FeedForward submodules in LLMs with fitted residual bypasses, using only calibration data and no retraining. At 25% sparsity it retains 84.6% of dense downstream accuracy with a 2.42× perplexity increase.

cs.CL | Advanced | 2026-06-02
Elia Cunegatti, Marcus Vukojevic, Erik Nielsen, Giovanni Iacca
LLM compression, submodule replacement, sparsity, post-training tuning, Transformer pruning

Key Findings

Methodology

This paper proposes SubFit, a residual-based replacement framework that operates at the submodule level within Transformer models. Unlike traditional methods that prune entire layers or contiguous blocks, SubFit employs a non-contiguous selection strategy based on residual contribution scores for Attention and FeedForward modules. The process involves:

  • scoring submodules by their impact on the residual stream using a residual rotation and magnitude metric;
  • selecting low-impact submodules independently across depths;
  • fitting a low-rank surrogate model for each removed submodule using calibration data, which captures the residual contribution via a shared basis for FeedForward modules and layer-specific parameters for Attention modules;
  • replacing the submodules with lightweight residual bypasses that approximate their residual outputs.

This approach requires only calibration data and no retraining, and it leverages low-rank structure to reduce parameters and computation, enabling efficient model compression while maintaining task performance.
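The summary does not give the exact form of the rotation and magnitude metric. As a rough illustration only, the sketch below assumes a score that adds two terms: how far a submodule's output rotates the residual stream (one minus the cosine similarity between the stream before and after the update) and the norm of the update relative to the incoming stream. The function name `residual_contribution_score` and the additive combination are assumptions, not SubFit's definition.

```python
import numpy as np

def residual_contribution_score(h_in, delta, eps=1e-8):
    """Hypothetical score, not the paper's formula.

    h_in  : (tokens, hidden) residual stream entering the submodule
    delta : (tokens, hidden) submodule output that is added to the stream
    """
    h_out = h_in + delta
    norm_in = np.linalg.norm(h_in, axis=-1)
    norm_out = np.linalg.norm(h_out, axis=-1)
    # Rotation term: 1 - cosine similarity between the stream before and after.
    cos = np.sum(h_in * h_out, axis=-1) / (norm_in * norm_out + eps)
    rotation = 1.0 - cos
    # Magnitude term: size of the update relative to the incoming stream.
    magnitude = np.linalg.norm(delta, axis=-1) / (norm_in + eps)
    # Assumed combination: mean over calibration tokens of the sum of both terms.
    return float(np.mean(rotation + magnitude))
```

Under this assumed score, a submodule whose output barely changes the stream scores near zero and becomes a candidate for replacement.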

Key Results

  • Experiments on ten LLMs (five base, five instruction-tuned) show that at 25% sparsity SubFit retains 84.6% of dense downstream accuracy with a 2.42× perplexity increase, against 81.6% and 4.34× for the strongest of four replacement-based baselines. The trade-off improves across the evaluated sparsity levels, most clearly under aggressive compression: the perplexity gap to the baselines grows from 0.11× at 12.5% sparsity to 1.92× at 25% (4.34× minus 2.42×).
  • Inference gets measurably faster: at 25% sparsity, time-to-first-token (TTFT) improves by 1.18× to 1.40×, and KV-cache usage shrinks proportionally. Perplexity and accuracy also show low variance across models and sparsity levels.
  • Ablation studies reveal that non-contiguous submodule selection significantly improves robustness over traditional contiguous block pruning, especially at higher compression ratios. Residual fitting of FeedForward modules plays a dominant role in maintaining model performance, with shared low-rank bases providing parameter efficiency without sacrificing accuracy.
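To make the contrast between the two selection schemes concrete, the sketch below picks three submodules to remove from a toy list of scores, once freely across depth and once as a single contiguous window. The scores are made-up numbers, and the two helper functions are simplified illustrations rather than the procedures used in the paper or its baselines.

```python
def select_non_contiguous(scores, k):
    """Indices of the k lowest-scoring submodules, anywhere in depth."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:k])

def select_contiguous(scores, k):
    """Indices of the length-k contiguous window with the lowest total score."""
    best_start = min(range(len(scores) - k + 1),
                     key=lambda s: sum(scores[s:s + k]))
    return list(range(best_start, best_start + k))

# Made-up importance scores for 8 submodules (illustration only).
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.15, 0.6, 0.95]
free = select_non_contiguous(scores, 3)   # low-score submodules at any depth
block = select_contiguous(scores, 3)      # forced to be adjacent in depth

removed_free = sum(scores[i] for i in free)
removed_block = sum(scores[i] for i in block)
```

With these toy scores the contiguous window has to include one high-scoring submodule, so it removes more total importance than the free selection does.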

Significance

This work shifts replacement-based LLM compression from contiguous, full-layer selection to flexible, non-contiguous selection at the submodule level. It exploits the redundancy in pretrained transformers to reach high compression ratios without retraining. The method targets practical concerns (deployment cost, inference speed, and memory footprint) while largely preserving accuracy. Because it requires only calibration data, it suits real-world settings such as edge deployment and resource-constrained environments. Its analysis of how Attention and FeedForward modules contribute to the residual stream also informs more targeted model optimization strategies.

Technical Contribution

The core technical innovations include:

  • a residual contribution scoring mechanism that quantifies the impact of each submodule on the residual stream, guiding non-contiguous selection;
  • a low-rank surrogate model for residual approximation, with closed-form solutions for parameter fitting based on calibration statistics;
  • a shared low-rank basis for FeedForward modules across layers, reducing parameter count;
  • a lightweight residual bypass that replaces the original submodule, preserving its residual contribution without re-deriving the internal computation;
  • an end-to-end post-training compression pipeline that requires only calibration data, with no further training or fine-tuning, making it practical for deployment.
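One standard way to obtain a closed-form low-rank fit from calibration data is reduced-rank regression: solve ordinary least squares, then project the solution onto the top right singular vectors of the fitted outputs. The sketch below uses that textbook technique as a stand-in. It is not claimed to be SubFit's own closed form, and `fit_low_rank_bypass` together with the synthetic data are hypothetical.

```python
import numpy as np

def fit_low_rank_bypass(X, Y, rank):
    """Fit Y ≈ X @ A @ B with A: (d, rank), B: (rank, d) via reduced-rank regression.

    X : (tokens, d) submodule inputs from calibration data
    Y : (tokens, d) submodule outputs (its residual contribution)
    """
    W_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)     # full-rank least squares
    Y_hat = X @ W_ols
    _, _, Vt = np.linalg.svd(Y_hat, full_matrices=False)
    V_r = Vt[:rank].T                                 # top right singular vectors
    return W_ols @ V_r, V_r.T                         # A, B

# Synthetic calibration data with a truly rank-4 map plus small noise.
rng = np.random.default_rng(0)
n, d, r = 256, 16, 4
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))
Y = X @ W_true + 0.01 * rng.normal(size=(n, d))

A, B = fit_low_rank_bypass(X, Y, r)
rel_err = np.linalg.norm(X @ A @ B - Y) / np.linalg.norm(Y)
```

The bypass stores 2·d·r values instead of the d·d of a dense linear map, which is where the parameter saving of a low-rank surrogate comes from.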

Novelty

This study presents non-contiguous, submodule-level replacement for transformer-based LLMs, in contrast with prior replacement-based methods, which operate on full layers selected contiguously. The key novelty lies in the residual-based scoring and low-rank surrogate fitting, which approximate the residual contribution of each individual submodule. This exploits the internal redundancy of transformers at a finer granularity, allowing higher compression ratios with less performance loss. The shared basis for FeedForward modules across layers is a further engineering contribution that balances parameter efficiency with accuracy.

Limitations

  • The method relies on calibration data, which may limit its effectiveness on highly domain-specific tasks or unseen data distributions. Compression beyond 37.5%, the highest sparsity level evaluated, may cause significant performance degradation due to the limits of low-rank approximation.
  • The assumption of low-rank residuals may not hold for all submodules, especially in highly non-linear or complex components, potentially increasing approximation errors.
  • Scaling to extremely large models (e.g., hundreds of billions of parameters) poses computational challenges in covariance estimation and eigen-decomposition steps, requiring further optimization.
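One common way to keep covariance estimation tractable is to accumulate first and second moments batch by batch, so the full set of calibration activations never has to sit in memory. The sketch below shows that generic pattern followed by an eigendecomposition. The class `StreamingCovariance` and the helper `top_eigvecs` are hypothetical names, and nothing here is taken from the SubFit code.

```python
import numpy as np

class StreamingCovariance:
    """Accumulates sums of x and x x^T over batches of activations."""

    def __init__(self, dim):
        self.n = 0
        self.sum_x = np.zeros(dim)
        self.sum_xxT = np.zeros((dim, dim))

    def update(self, batch):
        # batch: (tokens, dim) activations from one calibration batch
        self.n += batch.shape[0]
        self.sum_x += batch.sum(axis=0)
        self.sum_xxT += batch.T @ batch

    def covariance(self):
        # Population covariance (divides by n, not n - 1).
        mean = self.sum_x / self.n
        return self.sum_xxT / self.n - np.outer(mean, mean)

def top_eigvecs(cov, k):
    """Eigenvectors of the k largest eigenvalues of a symmetric matrix."""
    _, vecs = np.linalg.eigh(cov)      # eigh returns eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 8))
acc = StreamingCovariance(8)
for start in range(0, 1000, 100):      # feed ten batches of 100 tokens
    acc.update(data[start:start + 100])
basis = top_eigvecs(acc.covariance(), 3)
```

Memory for the accumulator grows with the square of the hidden size, and the eigendecomposition with its cube, which is consistent with the scaling concern raised above.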

Future Work

Future directions include extending the residual fitting framework to incorporate non-linear surrogate models, enabling better approximation of complex submodules. Exploring adaptive, data-driven selection strategies that consider task-specific importance could further improve compression-performance trade-offs. Additionally, integrating hardware-aware optimization and quantization techniques may enhance deployment efficiency. Investigating the applicability of this approach to multimodal models and real-time adaptive compression remains an open avenue, aiming to make large models more accessible and efficient across diverse applications.

AI Executive Summary

Large Language Models (LLMs) have revolutionized natural language processing, demonstrating unprecedented capabilities across a wide range of tasks. However, their enormous size—often hundreds of billions of parameters—poses significant challenges for deployment, especially in resource-constrained environments. Traditional model compression techniques, such as pruning entire layers or contiguous blocks, have made strides in reducing model size and inference latency. Yet, these methods often suffer from performance drops at high compression ratios, due to the rigid structure of layer-wise pruning and the inability to fully exploit the internal redundancy of transformers.

This paper introduces a novel approach called SubFit, which fundamentally rethinks the granularity of model compression. Instead of removing entire layers or contiguous blocks, SubFit operates at the submodule level—specifically targeting the Attention and FeedForward components within each transformer layer. The key insight is that the redundancy in these submodules is not confined to contiguous regions, and different submodules exhibit varying degrees of importance. By scoring each submodule based on its residual contribution to the model's residual stream, SubFit identifies those with minimal impact and selectively replaces them with lightweight residual bypasses.

The core technical innovation lies in fitting a low-rank surrogate model for each removed submodule using only calibration data. This surrogate captures the residual contribution of the submodule via shared low-rank bases for FeedForward modules and layer-specific parameters for Attention modules. The process involves a closed-form solution for parameter estimation, avoiding the need for retraining or fine-tuning, which significantly simplifies deployment. The residual bypass approximates the submodule's effect, preserving the model's overall performance while reducing parameters and computational load.
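The idea of a basis shared across FeedForward modules can be sketched as follows, under assumptions of our own: the shared basis is taken from the pooled second moment of all removed modules' outputs, and each module then gets its own least-squares coefficients into that basis. The function `fit_shared_basis` and the synthetic data are hypothetical, and the paper's actual construction may differ.

```python
import numpy as np

def fit_shared_basis(inputs, outputs, rank):
    """inputs/outputs: lists of (tokens, d) arrays, one pair per removed FFN."""
    pooled = sum(Y.T @ Y for Y in outputs)        # (d, d) pooled second moment
    _, vecs = np.linalg.eigh(pooled)              # eigenvalues ascending
    U = vecs[:, -rank:]                           # shared output basis, (d, rank)
    # Per-module coefficients mapping the input into the shared basis.
    coeffs = [np.linalg.lstsq(X, Y @ U, rcond=None)[0]
              for X, Y in zip(inputs, outputs)]
    return U, coeffs

# Synthetic example: four modules whose outputs lie in one 3-dimensional subspace.
rng = np.random.default_rng(0)
d, r, n, layers = 16, 3, 200, 4
true_basis = np.linalg.qr(rng.normal(size=(d, r)))[0]
inputs = [rng.normal(size=(n, d)) for _ in range(layers)]
outputs = [X @ rng.normal(size=(d, r)) @ true_basis.T for X in inputs]

U, coeffs = fit_shared_basis(inputs, outputs, r)
errors = [np.linalg.norm(X @ C @ U.T - Y) / np.linalg.norm(Y)
          for X, C, Y in zip(inputs, coeffs, outputs)]
```

In this sketch the basis is stored once and each module adds only a d×r coefficient matrix, so the cost of the basis is amortized over all replaced FeedForward modules.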

Experiments across ten LLMs, five base and five instruction-tuned, support these claims. At 25% sparsity, SubFit retains 84.6% of dense downstream accuracy with a 2.42× perplexity increase, against 81.6% and 4.34× for the strongest baselines. The compressed models also run faster, with a 1.18× to 1.40× improvement in time-to-first-token, and use less KV-cache memory. Ablation studies indicate that non-contiguous submodule selection is more robust than contiguous selection, especially at higher compression ratios.
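A back-of-the-envelope calculation shows why removing Attention submodules shrinks the KV cache. The sketch assumes that a replaced Attention submodule no longer stores keys and values, and it uses a hypothetical model configuration that is not taken from the paper.

```python
def kv_cache_bytes(n_attention, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Per-sequence KV-cache size: keys and values for every attention submodule."""
    return 2 * n_attention * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration (illustration only): 32 attention submodules,
# 8 KV heads of dimension 128, a 4096-token context, 16-bit values.
dense = kv_cache_bytes(n_attention=32, n_kv_heads=8, head_dim=128, seq_len=4096)
# Assume 8 of the 32 attention submodules were replaced by bypasses.
pruned = kv_cache_bytes(n_attention=24, n_kv_heads=8, head_dim=128, seq_len=4096)
saving = 1 - pruned / dense
```

Under these assumptions the cache drops from 512 MiB to 384 MiB per sequence, a 25% saving that matches the fraction of Attention submodules removed.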

Overall, SubFit offers a practical way to deploy large models: it needs no retraining and achieves the best aggregate perplexity-accuracy trade-off among the evaluated methods. The approach also sheds light on the internal redundancy of transformers, which may inform more granular optimization strategies. Challenges remain, such as extending to larger models and more complex residual structures, but the current results are a step toward scalable, resource-efficient deployment.

Deep Dive

Abstract

Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-layer granularity and contiguous selection. We argue that this is overly restrictive: in fact, redundancy in pretrained transformers is not confined to contiguous regions, nor does it evenly distribute between Attention and FeedForward outputs, implying that different strategies best approximate different submodule types and that removable components need not cluster within contiguous depth ranges. Based on this intuition, we introduce SubFit (Submodule-level Fitted residual replacement), which compresses LLMs at the submodule level: Attention and FeedForward submodules are selected non-contiguously, and each receives its own lightweight fitted residual bypass. SubFit operates post-training and requires only calibration data. Across ten LLMs (five base, five instruction-tuned), five sparsity levels from 12.5% to 37.5%, and four replacement-based baselines, SubFit achieves the best aggregate perplexity-accuracy trade-off across the evaluated sparsity levels, with larger gains under aggressive compression. At 25% sparsity, it retains 84.6% of dense downstream accuracy and incurs 2.42x perplexity degradation, against 81.6% and 4.34x for the strongest baselines, while delivering measurable inference speedup and KV-cache savings. Code is available at https://github.com/eliacunegatti/SubFit.

cs.CL cs.AI
