Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks

TL;DR

Introduces Alternating Gradient Flow (AGF) to prevent structural collapse under 75% compression on ImageNet-1K.

cs.CV · Advanced · 2026-03-13
Tianhao Qian, Zhuoxuan Li, Jinde Cao, Xinli Shi, Hanjie Liu, Leszek Rutkowski
deep learning, structural pruning, dynamic routing, vision networks, gradient flow

Key Findings

Methodology

This paper proposes a decoupled kinetic paradigm based on Alternating Gradient Flow (AGF) for structural pruning and dynamic routing in deep networks. By utilizing an absolute feature-space Taylor expansion, AGF accurately captures the network's structural 'kinetic utility', preserving baseline functionality and exhibiting topological implicit regularization under extreme sparsity.

Key Results

  • Under a 75% compression stress test on ImageNet-1K, AGF effectively avoids structural collapse where traditional metrics fall below random sampling.
  • For dynamic inference on ImageNet-100, the AGF-guided hybrid routing framework achieves Pareto-optimal efficiency, reducing heavy expert usage by approximately 50% without sacrificing full-model accuracy.
  • AGF successfully avoids collapse seen in models trained from scratch under extreme sparsity, demonstrating topological implicit regularization.

Significance

This study holds significant implications for both academia and industry. It addresses the magnitude bias issue in structural pruning of deep vision networks and provides a novel perspective on understanding and optimizing the network's structural kinetic utility through AGF. This approach not only enhances model compression efficiency but also offers more precise signal guidance for dynamic routing.

Technical Contribution

The main technical contribution is a decoupled kinetic paradigm that uses Alternating Gradient Flow (AGF) to capture the network's structural kinetic utility. Compared with existing SOTA methods, AGF demonstrates superior topological implicit regularization under extreme sparsity and achieves Pareto-optimal efficiency in dynamic inference through a hybrid routing framework.

Novelty

This study is the first to apply Alternating Gradient Flow (AGF) to structural pruning and dynamic routing in deep networks. Compared to existing magnitude and gradient-based methods, AGF better captures the network's structural kinetic utility, avoiding the magnitude bias issue.

Limitations

  • AGF requires backward passes during the calibration phase, incurring higher offline computational overhead than forward-only metrics.
  • Under extreme compression conditions, all static proxies reach a performance ceiling, indicating the need for a hybrid routing strategy.

Future Work

Future research directions include further optimizing AGF's computational efficiency, exploring its application in other network architectures, and developing more efficient dynamic routing strategies to address performance bottlenecks under extreme sparsity.

AI Executive Summary

In deep learning, structural pruning and dynamic routing are key techniques for improving model efficiency. However, existing static metrics like weight magnitude or activation awareness suffer from magnitude bias in structural pruning, failing to preserve critical functional pathways.

To address this issue, this paper proposes a decoupled kinetic paradigm based on Alternating Gradient Flow (AGF). AGF uses an absolute feature-space Taylor expansion to accurately capture the network's structural 'kinetic utility', preserving baseline functionality and exhibiting topological implicit regularization under extreme sparsity.

In experiments, AGF successfully avoids structural collapse under a 75% compression stress test on ImageNet-1K, where traditional metrics fall below random sampling. AGF maintains the network's functional integrity under these conditions.

Additionally, for dynamic inference on ImageNet-100, the AGF-guided hybrid routing framework achieves Pareto-optimal efficiency, reducing heavy expert usage by approximately 50% without sacrificing full-model accuracy. This result demonstrates AGF's ability to provide more precise signal guidance under dynamic signal compression.
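As an illustration of this decoupling, a minimal routing loop might look like the following. This is a hedged sketch: `prior_fn` stands in for the paper's zero-cost physical priors, whose exact form is not specified in this summary, and the experts, threshold, and difficulty score are hypothetical.

```python
import numpy as np

def hybrid_route(inputs, light_expert, heavy_expert, prior_fn, threshold):
    """Per-sample two-expert routing.

    Offline, an AGF-guided search would fix the sub-structure that forms
    the light expert; online, a cheap per-sample score decides whether the
    light path suffices or the heavy expert must run. Returns the outputs
    and the fraction of samples that used the heavy expert.
    """
    outputs, heavy_used = [], 0
    for x in inputs:
        if prior_fn(x) <= threshold:        # "easy" sample: light path
            outputs.append(light_expert(x))
        else:                               # "hard" sample: full model
            outputs.append(heavy_expert(x))
            heavy_used += 1
    return outputs, heavy_used / len(inputs)

# Hypothetical prior: input standard deviation as a crude difficulty score.
def difficulty(x):
    return float(np.std(x))
```

The key design choice this illustrates is that the online decision costs almost nothing: the prior is a statistic of the input, not an extra learned network.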

However, AGF requires backward passes during the calibration phase, incurring higher offline computational overhead than forward-only metrics. Furthermore, under extreme compression conditions, all static proxies reach a performance ceiling, indicating the need for a hybrid routing strategy. Future research directions include further optimizing AGF's computational efficiency and exploring its application in other network architectures.

Deep Analysis

Background

Improving efficiency in deep learning has been an active research topic, especially under limited computational resources. Structural pruning and dynamic routing are the two main optimization strategies. Traditional pruning methods often rely on weight-magnitude or activation-aware metrics, such as Wanda and RIA. However, these metrics often fail to preserve critical functional pathways due to magnitude bias in the structural pruning of deep vision networks. This paper introduces Alternating Gradient Flow (AGF) as a new perspective for understanding and optimizing the network's structural kinetic utility.

Core Problem

In structural pruning of deep vision networks, traditional static metrics suffer from magnitude bias, failing to preserve critical functional pathways. This issue is particularly pronounced under extreme sparsity, leading to significant performance degradation. Solving this problem is crucial for improving model compression efficiency and the precision of dynamic routing.

Innovation

The core innovation of this paper is a decoupled kinetic paradigm based on Alternating Gradient Flow (AGF):

  • AGF uses an absolute feature-space Taylor expansion to accurately capture the network's structural 'kinetic utility'.
  • Under extreme sparsity, AGF preserves baseline functionality and exhibits topological implicit regularization.
  • The AGF-guided hybrid routing framework achieves Pareto-optimal efficiency in dynamic inference.
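To make the Taylor-expansion idea concrete, a first-order channel-utility score of this kind can be sketched as follows. This is a hedged sketch under an assumption: we take the absolute feature-space expansion to reduce to element-wise |activation × gradient| aggregated per channel (as in classic Taylor-based importance estimation); the paper's exact formulation may differ.

```python
import numpy as np

def taylor_channel_utility(activations, gradients):
    """Per-channel 'kinetic utility' proxy: sum of |a * g| over batch and space.

    A first-order Taylor expansion of the loss w.r.t. removing channel c gives
    |dL_c| ~ |sum_i a_{c,i} g_{c,i}|. Taking absolute values per element,
    before summing, prevents sign cancellation, one way to sidestep the
    magnitude bias of pure weight-norm scores.
    """
    # activations, gradients: arrays of shape (batch, channels, height, width)
    return np.abs(activations * gradients).sum(axis=(0, 2, 3))

def prune_mask(scores, sparsity):
    """Boolean keep-mask retaining the top (1 - sparsity) fraction of channels."""
    k = max(1, int(round(len(scores) * (1.0 - sparsity))))
    keep = np.argsort(scores)[-k:]        # indices of the k highest scores
    mask = np.zeros_like(scores, dtype=bool)
    mask[keep] = True
    return mask
```

For example, at the paper's 75% compression setting, `prune_mask(scores, 0.75)` would keep only the quarter of channels with the highest utility.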

Methodology

  • οΏ½οΏ½ Use Alternating Gradient Flow (AGF) for structural pruning to capture the network's structural kinetic utility. β€’ Utilize an absolute feature-space Taylor expansion to avoid magnitude bias. β€’ Conduct a 75% compression stress test on ImageNet-1K to validate AGF's effectiveness. β€’ Perform dynamic inference on ImageNet-100 to test the efficiency of the hybrid routing framework.

Experiments

Experiments are conducted on ImageNet-1K and ImageNet-100 using network architectures such as ResNet and ViT.

  • A 75% compression stress test on ImageNet-1K compares AGF with traditional metrics.
  • Dynamic inference on ImageNet-100 tests the efficiency of the hybrid routing framework.
  • Key hyperparameters include the compression rate and the dynamic routing strategy.

Results

Experimental results show that AGF effectively avoids structural collapse under extreme sparsity.

  • Under 75% compression on ImageNet-1K, AGF outperforms traditional metrics.
  • In dynamic inference on ImageNet-100, the AGF-guided hybrid routing framework achieves Pareto-optimal efficiency.

Applications

AGF can be used for structural pruning and dynamic routing of deep vision networks, making it suitable for resource-constrained scenarios.

  • In fields such as autonomous driving and real-time image processing, AGF can improve model efficiency and precision.
  • Its hybrid routing framework can be applied to applications requiring dynamic inference.

Limitations & Outlook

  • AGF requires backward passes during the calibration phase, incurring higher offline computational overhead than forward-only metrics.
  • Under extreme compression, all static proxies reach a performance ceiling, indicating the need for a hybrid routing strategy.
  • Future directions include optimizing AGF's computational efficiency and exploring its application to other network architectures.

Plain Language (accessible to non-experts)

Imagine a factory that needs to produce as many products as possible with limited resources. Traditional methods decide which machines to shut down based on their size, but this might overlook the critical role of some small machines. Alternating Gradient Flow (AGF) is like a smart factory manager who not only looks at the size of the machines but also observes each machine's contribution to overall efficiency during production. This way, even a small machine won't be shut down if it plays an important role in production. AGF ensures that the factory operates efficiently even with limited resources. When quick adjustments to the production line are needed, AGF provides precise guidance to ensure flexibility and efficiency.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a super complex game, and you need to complete tasks with limited time. Traditional methods decide which tasks to skip based on difficulty, but this might overlook the importance of some small tasks. Alternating Gradient Flow (AGF) is like your game assistant. It not only looks at the difficulty of tasks but also observes each task's contribution to your overall progress. This way, even a small task won't be skipped if it's important for completing the game. AGF ensures you can efficiently complete the game even with limited time. When you need to quickly adjust your strategy, AGF provides precise guidance to ensure flexibility and efficiency.

Glossary

Alternating Gradient Flow (AGF)

A framework for capturing the network's structural kinetic utility, avoiding magnitude bias through absolute feature-space Taylor expansion.

Used for structural pruning and dynamic routing in deep networks.

Structural Pruning

A technique to improve model efficiency by removing redundant structures in the network.

Applied in deep vision networks to reduce computational overhead.

Dynamic Routing

A technique to conditionally skip computations based on input complexity to optimize efficiency.

Used to improve inference efficiency in deep networks.

Kinetic Utility

A measure of the network's contribution to overall loss reduction during optimization.

Used to assess the importance of network structures.

Topological Implicit Regularization

A technique to avoid model collapse by preserving the network's topological structure.

Applied under extreme sparsity to improve model stability.

Magnitude Bias

A bias in traditional metrics due to over-reliance on weight magnitude.

Leads to the loss of critical functional pathways in structural pruning.

Hybrid Routing Framework

A framework combining AGF-guided offline structural search with online execution via zero-cost physical priors.

Used to improve efficiency in dynamic inference.

Sparsity Bottleneck

A phenomenon where all static proxies reach a performance ceiling under extreme sparsity.

Indicates the need for a hybrid routing strategy.

Zero-Cost Physical Priors

Physical priors used for online execution without additional computational overhead.

Used in the hybrid routing framework for dynamic inference.

Feature-Space Taylor Expansion

A mathematical tool for capturing the network's structural kinetic utility.

Used in AGF to avoid magnitude bias.

Open Questions (unanswered questions from this research)

  • 1 How to improve AGF's calibration efficiency without increasing computational overhead? Current methods require backward passes during calibration, leading to high offline computational costs. New methods are needed to reduce this cost.
  • 2 How effective is AGF in other network architectures? Current research focuses mainly on deep vision networks, and it's unclear if AGF is equally effective in other types of networks.
  • 3 How to further improve model stability under extreme sparsity? Although AGF addresses this issue to some extent, all static proxies still reach a performance ceiling under extreme conditions.
  • 4 Can more efficient dynamic routing strategies be developed to address performance bottlenecks under extreme sparsity? The current hybrid routing framework is effective but has room for improvement.
  • 5 Can new metrics be developed based on AGF to further improve the precision of structural pruning? Existing methods still suffer from magnitude bias in some cases.

Applications

Immediate Applications

Autonomous Driving

AGF can be used to optimize deep vision networks in autonomous driving systems, improving real-time image processing efficiency and precision.

Real-Time Image Processing

In tasks requiring quick response, AGF provides more precise signal guidance, improving processing efficiency.

Resource-Constrained Devices

On devices with limited computational resources, AGF can improve deep network efficiency through structural pruning and dynamic routing.

Long-term Vision

General Artificial Intelligence

By improving the efficiency and flexibility of deep networks, AGF is expected to drive the development of general artificial intelligence in the future.

Smart Cities

AGF's application can enhance the efficiency of various smart systems in cities, promoting the construction of smart cities.

Abstract

Efficient deep learning traditionally relies on static heuristics like weight magnitude or activation awareness (e.g., Wanda, RIA). While successful in unstructured settings, we observe a critical limitation when applying these metrics to the structural pruning of deep vision networks. These contemporary metrics suffer from a magnitude bias, failing to preserve critical functional pathways. To overcome this, we propose a decoupled kinetic paradigm inspired by Alternating Gradient Flow (AGF), utilizing an absolute feature-space Taylor expansion to accurately capture the network's structural "kinetic utility". First, we uncover a topological phase transition at extreme sparsity, where AGF successfully preserves baseline functionality and exhibits topological implicit regularization, avoiding the collapse seen in models trained from scratch. Second, transitioning to architectures without strict structural priors, we reveal a phenomenon of Sparsity Bottleneck in Vision Transformers (ViTs). Through a gradient-magnitude decoupling analysis, we discover that dynamic signals suffer from signal compression in converged models, rendering them suboptimal for real-time routing. Finally, driven by these empirical constraints, we design a hybrid routing framework that decouples AGF-guided offline structural search from online execution via zero-cost physical priors. We validate our paradigm on large-scale benchmarks: under a 75% compression stress test on ImageNet-1K, AGF effectively avoids the structural collapse where traditional metrics aggressively fall below random sampling. Furthermore, when systematically deployed for dynamic inference on ImageNet-100, our hybrid approach achieves Pareto-optimal efficiency. It reduces the usage of the heavy expert by approximately 50% (achieving an estimated overall cost of 0.92$\times$) without sacrificing the full-model accuracy.

cs.CV · cs.LG · cs.NE

References (20)

  1. Xinglong Sun, Barath Lakshmanan, Maying Shen et al. MDP: Multidimensional Vision Model Pruning with Latency Constraint. 2025.
  2. B. Hassibi, D. Stork. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon. 1992.
  3. Pavlo Molchanov, Arun Mallya, Stephen Tyree et al. Importance Estimation for Neural Network Pruning. 2019.
  4. Ziming Liu, Eric J. Michaud, Max Tegmark. Omnigrok: Grokking Beyond Algorithmic Data. 2022.
  5. Hao Li, Asim Kadav, Igor Durdanovic et al. Pruning Filters for Efficient ConvNets. 2016.
  6. Namhoon Lee, Thalaiyasingam Ajanthan, Philip H. S. Torr. SNIP: Single-shot Network Pruning based on Connection Sensitivity. 2018.
  7. Chuan Guo, Geoff Pleiss, Yu Sun et al. On Calibration of Modern Neural Networks. 2017.
  8. Gongfan Fang, Xinyin Ma, Mingli Song et al. DepGraph: Towards Any Structural Pruning. 2023.
  9. Chaoqi Wang, Guodong Zhang et al. Picking Winning Tickets Before Training by Preserving Gradient Flow. 2020.
  10. B. Bejnordi, Tijmen Blankevoort, M. Welling. Batch-shaping for Learning Conditional Channel Gated Networks. 2019.
  11. Andrew M. Saxe, James L. McClelland, S. Ganguli. Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. 2013.
  12. Yann LeCun, J. Denker, S. Solla. Optimal Brain Damage. 1989.
  13. Preetum Nakkiran, Gal Kaplun, Yamini Bansal et al. Deep Double Descent: Where Bigger Models and More Data Hurt. 2019.
  14. Yihui He, Xiangyu Zhang, Jian Sun. Channel Pruning for Accelerating Very Deep Neural Networks. 2017.
  15. Xin Wang, F. Yu, Zi-Yi Dou et al. SkipNet: Learning Dynamic Routing in Convolutional Networks. 2017.
  16. Yifan Yang, Kai Zhen, Bhavana Ganesh et al. Wanda++: Pruning Large Language Models via Regional Gradients. 2025.
  17. Lu Yin, You Wu, Zhenyu Zhang et al. Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity. 2023.
  18. Yinpeng Chen, Xiyang Dai, Mengchen Liu et al. Dynamic Convolution: Attention Over Convolution Kernels. 2019.
  19. W. Fedus, Barret Zoph, Noam Shazeer. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. 2021.
  20. Sidak Pal Singh, Dan Alistarh. WoodFisher: Efficient Second-Order Approximation for Neural Network Compression. 2020.