ZO-SAM: Zero-Order Sharpness-Aware Minimization for Efficient Sparse Training

TL;DR

ZO-SAM integrates zero-order optimization into Sharpness-Aware Minimization, halving SAM's backpropagation cost and improving efficiency and robustness in sparse training.

cs.LG Β· Advanced Β· 2026-03-14
Jie Ji, Gen Li, Kaiyuan Deng, Fatemeh Afghah, Xiaolong Ma
deep learning Β· sparse training Β· zero-order optimization Β· robustness Β· computational efficiency

Key Findings

Methodology

This paper introduces a novel optimization framework, ZO-SAM, which integrates zero-order optimization with Sharpness-Aware Minimization (SAM). Unlike traditional SAM, ZO-SAM requires only a single backpropagation during perturbation, utilizing zero-order gradient estimations to halve the computational cost. This approach stabilizes the training process and accelerates convergence by identifying flat minima, making it particularly suitable for sparse training scenarios.
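The step described above can be sketched end to end. This is a minimal illustration on a toy quadratic loss, not the authors' implementation: the direction count `q`, smoothing radius `mu`, perturbation radius `rho`, and learning rate are assumed hyperparameters, and the network loss is replaced by a simple function.

```python
import numpy as np

def loss(w):
    # Toy quadratic loss standing in for the network loss.
    return 0.5 * np.sum(w ** 2)

def grad(w):
    # Exact (first-order) gradient of the toy loss.
    return w

def rge(w, q=8, mu=1e-3, rng=None):
    """Random Gradient Estimation: average directional finite differences
    along q random unit directions (no backpropagation required)."""
    rng = rng or np.random.default_rng(0)
    d = w.size
    g = np.zeros_like(w)
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += (loss(w + mu * u) - loss(w)) / mu * u
    return g * d / q

def zo_sam_step(w, lr=0.1, rho=0.05):
    # Phase 1 (perturbation): a zero-order estimate replaces the
    # first backpropagation that conventional SAM would need here.
    g_est = rge(w)
    eps = rho * g_est / (np.linalg.norm(g_est) + 1e-12)
    # Phase 2 (update): exact first-order gradient at the perturbed point.
    return w - lr * grad(w + eps)

w = np.ones(10)
for _ in range(100):
    w = zo_sam_step(w)
```

Only the perturbation phase is derivative-free; the actual weight update still uses the exact gradient, which is what preserves SAM's convergence behavior while dropping one of its two backward passes.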

Key Results

  • On the CIFAR-10 and CIFAR-100 datasets, models using ZO-SAM improved accuracy by 0.38% to 2.54% at sparsity levels of 90%, 95%, and 98%. Specifically, on ResNet-32, ZO-SAM increased accuracy by 0.38% to 2.31% on CIFAR-10 and by 0.45% to 2.54% on CIFAR-100.
  • In experiments on the ImageNet-1K dataset using DeiT-Tiny and DeiT-Small architectures, ZO-SAM demonstrated superior performance, achieving accuracy improvements of up to 1.17% at 50% and 70% sparsity.
  • ZO-SAM exhibited outstanding robustness in distribution shift tests on the CIFAR-10-C dataset, significantly improving model accuracy, demonstrating its potential for real-world deployment.

Significance

ZO-SAM holds significant implications for both academia and industry. It addresses the issue of chaotic gradient signals at high sparsity levels, enhancing model convergence and generalization. Moreover, ZO-SAM excels in resource-constrained environments, reducing computational costs and enabling the deployment of deep learning models on edge devices and mobile applications.

Technical Contribution

The technical contribution of ZO-SAM lies in its innovative integration of zero-order optimization into the SAM framework, reducing computational overhead while improving training stability. Compared to existing sparse training methods, ZO-SAM significantly lowers computational demands while maintaining model performance, offering a more efficient solution for sparse training.

Novelty

ZO-SAM is the first to combine zero-order optimization with SAM, proposing a more efficient optimization method for sparse training. Unlike previous methods, ZO-SAM uses zero-order gradient estimations in the perturbation step, reducing computational costs while maintaining SAM's ability to identify flat minima.

Limitations

  • ZO-SAM may still face gradient estimation inaccuracies at extremely high sparsity levels, potentially affecting final model performance.
  • Although ZO-SAM reduces computational costs, additional hyperparameter tuning may be required in some cases to achieve optimal performance.
  • The applicability and effectiveness of ZO-SAM in specific deep learning architectures may require further validation.

Future Work

Future research directions include exploring the applicability of ZO-SAM in other deep learning architectures and its performance on larger datasets. Additionally, further optimizing the accuracy and efficiency of zero-order gradient estimations is an important research direction.

AI Executive Summary

Deep learning models have achieved remarkable success across various domains, but their substantial computational costs and memory demands limit their deployment in resource-constrained environments. Sparse neural networks offer an attractive solution by drastically reducing parameter count and computational overhead. However, existing sparse training methods often experience chaotic and noisy gradient signals, severely hindering convergence and generalization performance, especially at high sparsity levels. To tackle this critical challenge, this paper proposes Zero-Order Sharpness-Aware Minimization (ZO-SAM), a novel optimization framework that strategically integrates zero-order optimization within the SAM approach. Unlike traditional SAM, ZO-SAM requires only a single backpropagation step during perturbation, selectively utilizing zero-order gradient estimations. This innovative approach reduces the backpropagation computational cost by half, significantly lowering gradient variance and effectively eliminating associated computational overhead. By harnessing SAM's capacity for identifying flat minima, ZO-SAM stabilizes the training process and accelerates convergence. These efficiency gains are particularly important in sparse training scenarios, where computational cost is the primary bottleneck that limits the practicality of SAM. Moreover, models trained with ZO-SAM exhibit improved robustness under distribution shift, further broadening its practicality in real-world deployments.

In experiments, we tested ZO-SAM on ResNet-32 and ResNet-50 across CIFAR-10 and CIFAR-100 datasets, demonstrating significant accuracy improvements at various sparsity levels. Experiments on the ImageNet-1K dataset using DeiT-Tiny and DeiT-Small architectures also showcased ZO-SAM's superior performance. ZO-SAM not only improved model accuracy but also exhibited outstanding performance under distribution shifts, proving its potential for real-world applications.

The technical contribution of ZO-SAM lies in its innovative integration of zero-order optimization into the SAM framework, reducing computational overhead while improving training stability. Compared to existing sparse training methods, ZO-SAM significantly lowers computational demands while maintaining model performance, offering a more efficient solution for sparse training.

Despite its strengths, ZO-SAM may still face gradient estimation inaccuracies at extremely high sparsity levels. Additionally, while ZO-SAM reduces computational costs, additional hyperparameter tuning may be required in some cases to achieve optimal performance. The applicability and effectiveness of ZO-SAM in specific deep learning architectures may require further validation.

Future research directions include exploring the applicability of ZO-SAM in other deep learning architectures and its performance on larger datasets. Additionally, further optimizing the accuracy and efficiency of zero-order gradient estimations is an important research direction. Through these efforts, ZO-SAM is poised to play a greater role in a wider range of application scenarios.

Deep Analysis

Background

Deep learning has made significant advances over the past decade, particularly in fields such as computer vision, natural language processing, and speech recognition. However, these models typically require substantial computational resources and memory, which poses a major barrier in resource-constrained environments such as edge devices and mobile applications. To address this challenge, researchers have proposed sparse neural networks, which drastically reduce parameter count and computational costs by maintaining only a small proportion of active weights. Although sparse training is theoretically attractive, it still faces many challenges in practice, especially at high sparsity levels, where chaotic and noisy gradient signals severely affect model convergence and generalization performance.

Core Problem

The core problem of sparse training is how to maintain model convergence and generalization at high sparsity levels. Existing methods often rely on heuristic or metric-specific strategies, which produce chaotic gradient signals at high sparsity and degrade model performance. Moreover, as sparsity increases, the loss surface transforms from a smooth, wide basin into a steeper, narrower landscape, further destabilizing gradients and making effective gradient descent more difficult.

Innovation

The core innovation of this paper is the introduction of a new optimization framework, ZO-SAM, which integrates zero-order optimization with Sharpness-Aware Minimization (SAM). Specifically, ZO-SAM requires only a single backpropagation during perturbation and utilizes zero-order gradient estimations, halving the computational cost. This approach stabilizes the training process and accelerates convergence by identifying flat minima, making it particularly suitable for sparse training scenarios. Additionally, ZO-SAM demonstrates superior performance under distribution shifts, proving its potential for real-world applications.

Methodology

  • The ZO-SAM framework integrates zero-order optimization into SAM, reducing computational overhead.
  • During the perturbation step, ZO-SAM uses Random Gradient Estimation (RGE) to approximate gradients instead of traditional Coordinate-wise Gradient Estimation (CGE), lowering computational cost.
  • In the gradient update step, ZO-SAM retains exact first-order gradients to ensure training stability and convergence.
  • This selective integration lets ZO-SAM significantly reduce computational overhead while preserving SAM's ability to identify flat minima.
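The cost gap between the two estimators named above can be made concrete by counting function evaluations. For a d-dimensional parameter vector, CGE needs one finite difference per coordinate (d + 1 evaluations), while RGE averages q random directions (q + 1 evaluations, with q typically much smaller than d). The function and sizes below are illustrative, not from the paper.

```python
import numpy as np

calls = {"n": 0}

def f(w):
    calls["n"] += 1            # count function (forward-pass) evaluations
    return np.sum(np.sin(w))

def cge(w, mu=1e-4):
    """Coordinate-wise Gradient Estimation: one finite difference per coordinate."""
    f0 = f(w)
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = mu
        g[i] = (f(w + e) - f0) / mu
    return g

def rge(w, q=5, mu=1e-4, seed=0):
    """Random Gradient Estimation: q random directional finite differences."""
    rng = np.random.default_rng(seed)
    f0 = f(w)
    g = np.zeros_like(w)
    for _ in range(q):
        u = rng.standard_normal(w.size)
        u /= np.linalg.norm(u)
        g += (f(w + mu * u) - f0) / mu * u
    return g * w.size / q

w = np.linspace(0, 1, 1000)

calls["n"] = 0
cge(w)
cge_calls = calls["n"]         # d + 1 = 1001 evaluations for d = 1000

calls["n"] = 0
rge(w, q=5)
rge_calls = calls["n"]         # q + 1 = 6 evaluations, independent of d
```

The trade-off is variance: RGE's estimate is noisier than CGE's, which is why ZO-SAM confines it to the perturbation step and keeps exact gradients for the actual update.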

Experiments

We tested ZO-SAM on ResNet-32 and ResNet-50 across CIFAR-10 and CIFAR-100 datasets and conducted experiments on the ImageNet-1K dataset using DeiT-Tiny and DeiT-Small architectures. In our experiments, we compared the performance of ZO-SAM with existing sparse training methods such as SNIP, GraSP, SET, DSR, and RigL. We also conducted ablation studies to verify the effectiveness of ZO-SAM at different sparsity levels.

Results

Experimental results show that ZO-SAM significantly improves model accuracy at various sparsity levels. On the CIFAR-10 and CIFAR-100 datasets, models using ZO-SAM improved accuracy by 0.38% to 2.54% at sparsity levels of 90%, 95%, and 98%. Experiments on the ImageNet-1K dataset using DeiT-Tiny and DeiT-Small architectures also demonstrated ZO-SAM's superior performance. Additionally, ZO-SAM exhibited outstanding robustness in distribution shift tests on the CIFAR-10-C dataset, significantly improving model accuracy.

Applications

ZO-SAM has broad application potential in resource-constrained environments. It can be used for efficient deployment of deep learning models on edge devices and mobile applications, reducing computational costs and memory demands. Additionally, ZO-SAM's robustness under distribution shifts makes it suitable for real-world deployment scenarios requiring high reliability, such as autonomous driving and medical diagnostics.

Limitations & Outlook

Despite its strengths, ZO-SAM may still face gradient estimation inaccuracies at extremely high sparsity levels. Additionally, while ZO-SAM reduces computational costs, additional hyperparameter tuning may be required in some cases to achieve optimal performance. The applicability and effectiveness of ZO-SAM in specific deep learning architectures may require further validation. Future research directions include exploring the applicability of ZO-SAM in other deep learning architectures and its performance on larger datasets.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen cooking a meal. Traditional deep learning is like making a big feast with every ingredient: delicious, but it takes a lot of ingredients and time. Sparse training is making the same meal with only a fraction of the ingredients. ZO-SAM is like a smart chef who, instead of carefully re-measuring every single ingredient twice per dish (an expensive full gradient computation), quickly tastes the dish a few times to judge which way to adjust the seasoning (cheap zero-order estimates), and then does the precise measuring only once. The result is the same tasty meal with half the measuring work and less waste.

ELI14 (Explained like you're 14)

Hey there, young explorers! Training an AI model is like solving a giant puzzle. Traditional methods carefully study every single piece twice before each move, which takes forever. Sparse training plays the same game with far fewer pieces on the table. ZO-SAM is like a clever puzzle master who, instead of examining every piece twice, glances at a few random spots to decide which direction to work in, then makes one careful, precise move. Same finished puzzle, but with half the careful examining, so the game goes much faster. Isn't that cool?

Glossary

Zero-Order Optimization

An optimization method that does not require explicit gradient computation, estimating gradients through direct function evaluations, suitable for scenarios where backpropagation is costly or infeasible.

Used in ZO-SAM to reduce computational overhead.

Sharpness-Aware Minimization (SAM)

An optimization technique that improves generalization by guiding models toward flat minima.

Used in ZO-SAM to stabilize the training process.
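The SAM objective and its standard one-step approximation can be written out explicitly; ZO-SAM's change is confined to the perturbation step, where the gradient is replaced by a zero-order estimate. The notation below follows the common SAM formulation and is not necessarily the paper's exact symbols.

```latex
% SAM min-max objective over a perturbation ball of radius \rho
\min_{w} \; \max_{\|\epsilon\|_2 \le \rho} \; L(w + \epsilon)

% First-order approximation of the inner maximizer (conventional SAM)
\hat{\epsilon}(w) = \rho \, \frac{\nabla L(w)}{\|\nabla L(w)\|_2}

% ZO-SAM: replace \nabla L(w) in the perturbation step with a
% zero-order estimate \hat{g}(w); the update itself keeps the exact gradient
w \leftarrow w - \eta \, \nabla L\!\left(w + \rho \, \frac{\hat{g}(w)}{\|\hat{g}(w)\|_2}\right)
```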

Sparse Training

A training method that reduces parameter count and computational costs by maintaining only a small proportion of active weights.

The primary application scenario of ZO-SAM.
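As a concrete illustration of this definition, sparse training keeps a binary mask over the weights and applies it to both the weights and their updates. This is a generic magnitude-pruning sketch, not the paper's specific sparse-training scheme; the 90% sparsity level matches one of the levels evaluated in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))

# Build a binary mask keeping only the top 10% of weights by magnitude,
# i.e. 90% sparsity.
sparsity = 0.9
k = int(w.size * (1 - sparsity))
threshold = np.sort(np.abs(w).ravel())[-k]
mask = (np.abs(w) >= threshold).astype(w.dtype)

w *= mask                       # deactivate pruned weights

def masked_update(w, grad, mask, lr=0.1):
    # Gradients of pruned weights are masked out, so they stay at zero.
    return (w - lr * grad) * mask

w = masked_update(w, rng.standard_normal(w.shape), mask)
active_fraction = mask.mean()   # roughly 0.1 of weights remain active
```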

Gradient Variance

The degree of fluctuation in gradient updates, with high variance potentially leading to unstable training.

ZO-SAM reduces gradient variance to improve training stability.

Distribution Shift

A situation where test data distribution differs from training data, potentially degrading model performance.

ZO-SAM demonstrates robustness under distribution shifts.

Random Gradient Estimation (RGE)

A method that estimates gradients by averaging directional finite differences, reducing computational costs.

Used in ZO-SAM to approximate gradients.

Coordinate-wise Gradient Estimation (CGE)

A method that estimates gradients by evaluating perturbations along each coordinate axis, with higher computational costs.

ZO-SAM opts for RGE over CGE.

Loss Surface

The landscape of the loss function over a model's parameters; flat regions of this surface are associated with better generalization.

ZO-SAM navigates the loss surface by steering training toward flat minima.

Backpropagation

An algorithm for updating neural network weights by computing gradients.

ZO-SAM reduces backpropagation frequency to lower computational costs.

Hyperparameter Tuning

The process of adjusting model parameters to improve performance, often requiring significant computational resources.

ZO-SAM may still require hyperparameter tuning in some cases.

Open Questions (Unanswered questions from this research)

  1. The accuracy of gradient estimation in ZO-SAM at extremely high sparsity levels remains an area for further research. Current zero-order gradient estimations may lead to inaccurate updates in some scenarios, affecting final model performance.
  2. Effectively applying ZO-SAM to larger datasets remains an open question. While it performs well on smaller datasets, computational costs and memory demands may become bottlenecks at larger scales.
  3. The applicability of ZO-SAM across different deep learning architectures needs further validation. While it performs well in convolutional neural networks, its effectiveness in other architectures is yet to be determined.
  4. Further optimizing the accuracy and efficiency of zero-order gradient estimations is an important research direction. Existing methods may produce inaccurate gradient estimates in some cases, affecting model performance.
  5. The robustness and scalability of ZO-SAM in practical applications require further study. Despite its strong experimental performance, it may face different challenges in real-world deployments.

Applications

Immediate Applications

Edge Devices

ZO-SAM enables efficient deployment of deep learning models on edge devices, reducing computational costs and memory demands, enhancing device intelligence.

Mobile Applications

By reducing computational overhead, ZO-SAM makes it feasible to run complex deep learning models on mobile devices, enhancing user experience.

Autonomous Driving

ZO-SAM's robustness under distribution shifts makes it suitable for autonomous driving scenarios, improving vehicle decision-making in complex environments.

Long-term Vision

Medical Diagnostics

ZO-SAM can be used in medical imaging analysis, improving diagnostic accuracy and efficiency, supporting the development of intelligent healthcare.

Smart Cities

By deploying efficient deep learning models in smart cities, ZO-SAM can enhance urban management intelligence, improving residents' quality of life.

Abstract

Deep learning models, despite their impressive achievements, suffer from high computational costs and memory requirements, limiting their usability in resource-constrained environments. Sparse neural networks significantly alleviate these constraints by dramatically reducing parameter count and computational overhead. However, existing sparse training methods often experience chaotic and noisy gradient signals, severely hindering convergence and generalization performance, particularly at high sparsity levels. To tackle this critical challenge, we propose Zero-Order Sharpness-Aware Minimization (ZO-SAM), a novel optimization framework that strategically integrates zero-order optimization within the SAM approach. Unlike traditional SAM, ZO-SAM requires only a single backpropagation step during perturbation, selectively utilizing zero-order gradient estimations. This innovative approach reduces the backpropagation computational cost by half compared to conventional SAM, significantly lowering gradient variance and effectively eliminating associated computational overhead. By harnessing SAM's capacity for identifying flat minima, ZO-SAM stabilizes the training process and accelerates convergence. These efficiency gains are particularly important in sparse training scenarios, where computational cost is the primary bottleneck that limits the practicality of SAM. Moreover, models trained with ZO-SAM exhibit improved robustness under distribution shift, further broadening its practicality in real-world deployments.


References (20)

  • Comparing Rewinding and Fine-tuning in Neural Network Pruning. Alex Renda, Jonathan Frankle, Michael Carbin. 2020, 432 citations.
  • SNIP: Single-shot Network Pruning based on Connection Sensitivity. Namhoon Lee, Thalaiyasingam Ajanthan, Philip H. S. Torr. 2018, 1413 citations.
  • Pruning neural networks without any data by iteratively conserving synaptic flow. Hidenori Tanaka, D. Kunin, Daniel L. K. Yamins et al. 2020, 790 citations.
  • MEST: Accurate and Fast Memory-Economic Sparse Training Framework on the Edge. Geng Yuan, Xiaolong Ma, Wei Niu et al. 2021, 116 citations.
  • Picking Winning Tickets Before Training by Preserving Gradient Flow. Chaoqi Wang, Guodong Zhang et al. 2020, 729 citations.
  • EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets. Xiaohan Chen, Yu Cheng, Shuohang Wang et al. 2020, 109 citations.
  • The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. Jonathan Frankle, Michael Carbin. 2018, 4064 citations.
  • On the Design of Black-Box Adversarial Examples by Leveraging Gradient-Free Optimization and Operator Splitting Method. Pu Zhao, Sijia Liu, Pin-Yu Chen et al. 2019, 61 citations.
  • Optimal Rates for Zero-Order Convex Optimization: The Power of Two Function Evaluations. John C. Duchi, Michael I. Jordan, M. Wainwright et al. 2013, 557 citations.
  • Chasing Sparsity in Vision Transformers: An End-to-End Exploration. Tianlong Chen, Yu Cheng, Zhe Gan et al. 2021, 268 citations.
  • Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark. Yihua Zhang, Pingzhi Li, Junyuan Hong et al. 2024, 120 citations.
  • Black-box Adversarial Attacks with Limited Queries and Information. Andrew Ilyas, Logan Engstrom, Anish Athalye et al. 2018, 1348 citations.
  • Zeroth-Order Optimization with Trajectory-Informed Derivative Estimation. Yao Shu, Zhongxiang Dai, Weicong Sng et al. 2023, 18 citations.
  • Robust and Faster Zeroth-Order Minimax Optimization: Complexity and Applications. Weixin An, Yuanyuan Liu, Fanhua Shang et al. 2024, 4 citations.
  • Training data-efficient image transformers & distillation through attention. Hugo Touvron, M. Cord, Matthijs Douze et al. 2020, 8670 citations.
  • Fine-Tuning Language Models with Just Forward Passes. Sadhika Malladi, Tianyu Gao, Eshaan Nichani et al. 2023, 351 citations.
  • Efficient Sharpness-aware Minimization for Improved Training of Neural Networks. Jiawei Du, Hanshu Yan, Jiashi Feng et al. 2021, 165 citations.
  • Certified Zeroth-order Black-Box Defense with Robust UNet Denoiser. Astha Verma, Siddhesh Bangar, A. Subramanyam et al. 2023, 9 citations.
  • Sharpness-Aware Minimization for Efficiently Improving Generalization. Pierre Foret, Ariel Kleiner, H. Mobahi et al. 2020, 1780 citations.
  • Transfer Learning without Knowing: Reprogramming Black-box Machine Learning Models with Scarce Data and Limited Resources. Yun-Yun Tsai, Pin-Yu Chen, Tsung-Yi Ho. 2020, 111 citations.