SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

TL;DR

SafeSteer employs localized on-policy distillation focusing on safety tokens, reducing reliance on large datasets and auxiliary reward models, achieving a superior safety-capability trade-off.

cs.AI Advanced 2026-06-02
Hao Li, Jingkun An, Zijun Song, Pengyu Zhu, Rui Li, Hao Wang, Wendi Feng, Yesheng Liu, Lijun Li, Jin-Ge Yao, Lei Sha
AI safety, large model alignment, policy distillation, localized optimization, reverse KL

Key Findings

Methodology

SafeSteer utilizes an activation-guided safety teacher model constructed by injecting a refusal direction into the hidden states, providing stable safety signals. It then employs a contrastive log probability algorithm to automatically identify a sparse subset of safety tokens that are most sensitive to refusal behaviors. During training, the method restricts the reverse KL divergence penalty solely to these safety tokens, enabling localized fine-tuning that preserves the model’s general capabilities. The process involves: (1) building a safety teacher via activation steering, (2) selecting safety tokens through contrastive log probabilities, and (3) applying a localized reverse KL penalty during on-policy distillation. This approach requires only 100 harmful samples, with no need for external reward models or large-scale general data, significantly reducing training costs.

Key Results

  • Across seven safety benchmarks, SafeSteer achieves an average safety success rate of 94.78%, outperforming existing methods such as MoCAN and BFPO by a substantial margin. On the Qwen-3-4B-Instruct model, it reduces the harmful response rate to 1.13%, compared to 3.75% for the strongest baseline. In capability evaluations (e.g., MMLU, HumanEval), performance drops by less than 1.5%, showing that general skills are largely preserved. The method also cuts data requirements sharply, using only 100 harmful samples versus the thousands used by previous approaches, which lowers the alignment cost.
  • In ablation studies, restricting the reverse KL penalty to safety tokens prevents the degradation of model capabilities, unlike full-vocabulary penalties which cause significant performance drops. The safety token selection based on contrastive log probabilities effectively isolates safety-critical signals, enabling precise and sparse adjustments. The experimental results validate that localized reverse KL optimization maintains a strong safety profile while keeping the model’s general abilities intact.
  • Furthermore, visualization of hidden states via PCA shows that the internal representations of the model remain nearly identical before and after safety alignment, confirming that the sparse, localized approach avoids catastrophic forgetting. The method’s robustness across different model architectures and response lengths highlights its practical viability for real-world deployment.

Significance

This work addresses the longstanding challenge of balancing safety and capability in large language models. Traditional methods often compromise model performance to achieve safety, but SafeSteer introduces a fundamentally different approach by leveraging the sparse nature of safety signals. Its localized, data-efficient strategy significantly reduces the cost and complexity of safety alignment, making it feasible for rapid deployment in industry. The theoretical insights into activation steering and contrastive token selection open new avenues for fine-grained control of model behaviors. Overall, this research paves the way for safer, more reliable AI systems that do not sacrifice their core competencies, thus accelerating the adoption of trustworthy AI in sensitive applications like healthcare, finance, and autonomous systems.

Technical Contribution

SafeSteer’s main technical innovations include: 1) constructing a stable safety teacher model via activation steering, which injects a refusal direction into hidden states; 2) developing a contrastive log probability-based algorithm to automatically identify the most safety-sensitive tokens; 3) applying a localized reverse KL divergence penalty restricted to these safety tokens during on-policy distillation, avoiding the common issue of capability degradation caused by global penalties. This framework effectively decouples safety features from the general capability space, enabling efficient, low-cost safety alignment without external reward models or massive data. The method’s theoretical foundation leverages the mode-seeking property of reverse KL and the sparsity of safety signals, offering a new paradigm for scalable, precise safety control in large models.
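For reference, the "mode-seeking" remark refers to a standard property of the reverse KL divergence. Writing the student as \pi_s, the teacher as \pi_t, and the vocabulary as V (the symbol V is an assumption introduced here), the two directions of the divergence at a single decoding step are:

```latex
\begin{aligned}
\text{forward KL:}\quad D_{\mathrm{KL}}(\pi_t \,\|\, \pi_s) &= \sum_{v \in V} \pi_t(v)\,\log\frac{\pi_t(v)}{\pi_s(v)} \\
\text{reverse KL:}\quad D_{\mathrm{KL}}(\pi_s \,\|\, \pi_t) &= \sum_{v \in V} \pi_s(v)\,\log\frac{\pi_s(v)}{\pi_t(v)}
\end{aligned}
```

Because each reverse KL term is weighted by the student's own probability, the student is penalized mainly for placing mass where the teacher places little, which tends to concentrate it on a teacher mode rather than spreading it across all of them.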

Novelty

This research is the first to implement a localized, sparse reverse KL divergence penalty for safety alignment in large language models. Unlike prior approaches that apply global regularization or rely on extensive data, SafeSteer identifies a sparse set of safety tokens through contrastive log probabilities and restricts the divergence penalty to these tokens. This innovation effectively preserves the model’s broad capabilities while enhancing safety, representing a significant departure from traditional full-vocabulary or data-intensive methods. Its combination of activation-guided safety teacher construction and sparse token-level optimization marks a new direction in AI safety research, emphasizing efficiency, scalability, and precision.

Limitations

  • The approach assumes that the base model already possesses some inherent refusal capability; if the model lacks this, the safety token mining process may be ineffective, so the method may not apply to models with little built-in safety behavior.
  • Experiments are primarily conducted on models with up to 10 billion parameters; scalability and effectiveness on larger models (e.g., 100B+) remain to be validated, especially considering potential hyperparameter shifts.
  • Currently, the method is tested only on autoregressive text models; extending to multimodal or non-autoregressive architectures (e.g., diffusion models) requires further research.
  • The safety token selection relies on the quality of the contrastive log probability signals, which may vary across different tasks and datasets, potentially affecting robustness.

Future Work

Future research will focus on enhancing the robustness of safety token mining, especially for models with weaker inherent refusal capabilities. Exploring adaptive, dynamic safety token sets that evolve with user feedback and real-world interactions is also promising. Extending the framework to multimodal models, such as vision-language systems, and non-autoregressive architectures will broaden its applicability. Additionally, integrating reinforcement learning with human-in-the-loop feedback can further refine safety behaviors, enabling models to adapt to complex, evolving safety standards. Ultimately, the goal is to develop a unified, scalable safety alignment paradigm that balances efficiency, effectiveness, and adaptability across diverse AI systems.

AI Executive Summary

The rapid deployment of large language models (LLMs) in various applications has brought unprecedented capabilities but also significant safety challenges. Traditional safety alignment methods, such as reinforcement learning from human feedback (RLHF) or extensive fine-tuning on curated datasets, often incur an 'alignment tax': a degradation in the model’s general capabilities. This trade-off limits the broader utility of LLMs, especially in sensitive domains like healthcare, finance, and autonomous systems.

Addressing this critical issue, the paper introduces SafeSteer, a novel safety alignment framework that leverages the inherent sparsity of safety signals within the output distribution of LLMs. Unlike global regularization techniques, SafeSteer focuses on localized modifications by identifying and adjusting only the safety-critical tokens. The core innovation lies in constructing a stable safety teacher model through activation steering, which injects a refusal direction into the model’s internal states, enabling consistent safety signals without external strong teachers or prompt engineering.
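As an illustration of the steering step, the following minimal Python sketch extracts a refusal direction as the normalized difference between mean hidden states on harmful and harmless prompts, then adds it to a hidden-state vector. The difference-of-means construction, the scaling factor `alpha`, and the toy two-dimensional vectors are assumptions made for this example; the summary does not specify how the paper computes or scales the direction.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def refusal_direction(harmful_states, harmless_states):
    """Unit vector pointing from the harmless mean to the harmful mean (assumed construction)."""
    mu_harmful = mean_vector(harmful_states)
    mu_harmless = mean_vector(harmless_states)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def steer(hidden_state, direction, alpha):
    """Add alpha * direction to one hidden-state vector; alpha is a hypothetical strength."""
    return [h + alpha * d for h, d in zip(hidden_state, direction)]

# Toy hidden states: harmful prompts differ from harmless ones along the first axis.
harmful = [[2.0, 0.0], [4.0, 0.0]]
harmless = [[0.0, 0.0], [0.0, 0.0]]

r = refusal_direction(harmful, harmless)   # [1.0, 0.0]
steered = steer([1.0, 1.0], r, alpha=3.0)  # [4.0, 1.0]
```

In a real model the same addition would be applied to the residual stream at a chosen layer during the forward pass, which is what turns the base model into the safety teacher.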

Building upon this, the authors develop a contrastive log probability algorithm to automatically mine safety tokens—those most sensitive to refusal behaviors. These tokens are sparse and largely disjoint from tokens used for general tasks, allowing the model to be fine-tuned with a localized reverse KL divergence penalty confined to this subset. This targeted approach ensures that safety improvements do not come at the expense of the model’s core capabilities.
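To make the mining step concrete, here is a small Python sketch. It assumes the contrastive score of a token is the teacher's log probability minus the base model's log probability for that token in context. The function names, toy probabilities, and choice of `k` are hypothetical and not taken from the paper.

```python
import math

def contrastive_deltas(tokens, teacher_probs, base_probs):
    """Pair each realized token with delta = log p_teacher - log p_base."""
    return [
        (tok, math.log(pt) - math.log(pb))
        for tok, pt, pb in zip(tokens, teacher_probs, base_probs)
    ]

def top_k_tokens(deltas, k):
    """Return the k tokens with the largest contrastive delta."""
    ranked = sorted(deltas, key=lambda item: item[1], reverse=True)
    return [tok for tok, _ in ranked[:k]]

# Toy refusal trajectory with per-token probabilities under each model.
tokens = ["I", "cannot", "help", "with", "that"]
teacher_probs = [0.60, 0.80, 0.50, 0.90, 0.90]
base_probs = [0.30, 0.02, 0.40, 0.90, 0.85]

deltas = contrastive_deltas(tokens, teacher_probs, base_probs)
top2 = top_k_tokens(deltas, k=2)  # ["cannot", "I"]
```

Tokens such as "with", which both models predict equally well, get a delta near zero and are left out, which is the sparsity the method relies on.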

Extensive experiments across multiple models, including Llama-3-8B, Llama-3.2-3B, and Qwen-2.5-7B, demonstrate that SafeSteer achieves state-of-the-art safety performance, reducing harmful response rates to below 2% on several benchmarks. Simultaneously, the models retain nearly their original performance on general tasks, with less than 1.5% degradation. Remarkably, the method requires only 100 harmful samples—less than 1% of what previous methods demand—substantially lowering the cost and complexity of safety alignment.

This work’s significance extends beyond technical innovation. It provides a practical, scalable solution for deploying safer AI systems without sacrificing performance. By decoupling safety features from the general capability space through sparse, localized adjustments, SafeSteer paves the way for more trustworthy AI in real-world applications. Future directions include extending the framework to multimodal models, larger-scale architectures, and dynamic safety policies, promising a new era of efficient, reliable, and safe AI systems.

Deep Analysis

Background

The evolution of large language models (LLMs) has revolutionized natural language processing, enabling applications from chatbots to automated content generation. Early models like GPT-2 and BERT laid the groundwork, followed by more capable architectures such as LLaMA, PaLM, and Qwen, which demonstrated remarkable zero-shot and few-shot learning abilities. Despite these advances, safety concerns such as the generation of biased, harmful, or misleading content have become prominent. Traditional safety measures involve supervised fine-tuning with curated datasets or reinforcement learning from human feedback (RLHF), which help reduce unsafe outputs but often degrade the model’s broader capabilities, a phenomenon termed 'alignment tax.'


Recent research has explored various techniques, including orthogonal projection of safety gradients, sparse activation rerouting, and activation-based control, aiming to improve safety without sacrificing performance. However, these methods typically require large amounts of data, external reward models, or complex engineering, limiting their scalability and practicality. The inherent sparsity of safety signals—refusal behaviors and harmful content indicators—suggests that targeted, sparse adjustments could be more effective. This paper builds on these insights, proposing a localized, sparse approach to safety alignment that leverages the internal representations of models to identify and adjust only the critical safety tokens.

Core Problem

Existing safety alignment techniques often rely on broad, global modifications that inadvertently impair the model’s ability to perform general tasks. This is because safety signals are sparse and often disjoint from tokens used in normal operations. Consequently, global penalties or extensive data augmentation lead to a trade-off: enhancing safety at the cost of capability. Moreover, many methods depend on external reward models or large-scale datasets, increasing complexity and cost. The core challenge is to develop a method that can precisely target safety signals—those sparse, critical tokens—without affecting the model’s overall performance. Achieving this requires identifying these tokens automatically and applying localized adjustments during training, which remains a significant technical hurdle.

Innovation

The paper introduces a set of key innovations:

1) Activation Steering for Safety Teacher Construction: By comparing the hidden states of the base model on harmful versus harmless prompts, a refusal direction is extracted and injected into the model’s residual stream, creating a stable safety teacher without external models or prompt engineering.

2) Contrastive Log Probability for Safety Token Mining: This algorithm compares the output distributions of the safety teacher and the base model, identifying tokens with the highest contrastive log probability differences as safety-critical. These tokens are sparse and highly relevant for safety adjustments.

3) Localized Reverse KL Divergence Penalty: Instead of applying a global penalty, the method restricts the reverse KL divergence to the identified safety tokens, enabling sparse, targeted fine-tuning that preserves the model’s general capabilities.

This combination of activation-based control, contrastive token selection, and localized divergence regularization constitutes a novel framework that effectively balances safety and performance with minimal data and cost.
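The localized penalty in item 3 can be sketched as follows. The sketch assumes the restriction is implemented by summing the reverse KL terms only over vocabulary entries in the safety set at one decoding position, with no renormalization. The paper's exact formulation, including the normalization strategies it ablates, may differ, and all names and numbers here are illustrative.

```python
import math

def reverse_kl(student, teacher, token_ids=None):
    """Sum of student[v] * (log student[v] - log teacher[v]) over token_ids.

    With token_ids=None the sum runs over the full vocabulary (ordinary reverse KL).
    """
    ids = range(len(student)) if token_ids is None else token_ids
    return sum(
        student[v] * (math.log(student[v]) - math.log(teacher[v]))
        for v in ids
        if student[v] > 0.0
    )

# Toy next-token distributions over a 3-token vocabulary.
student = [0.70, 0.20, 0.10]
teacher = [0.10, 0.20, 0.70]
safety_ids = [2]  # hypothetical safety token subset S

full = reverse_kl(student, teacher)               # penalty over the whole vocabulary
local = reverse_kl(student, teacher, safety_ids)  # penalty over S only (negative here)
```

In this toy case the restricted sum is negative, because the student under-weights the safety token relative to the teacher; minimizing it pushes the student's probability on that token upward while leaving the other vocabulary entries without a direct gradient signal. A partial sum is not a true divergence, which may be one reason normalization is among the ablated design choices.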

Methodology

  • Construct a safety teacher model (πt) by comparing the hidden states of the base model (π0) on harmful and harmless prompts, extracting a refusal direction vector, and injecting it into the residual stream at a chosen layer.
  • Generate refusal trajectories using πt on harmful prompts, and compare these with baseline responses to identify safety-sensitive tokens via contrastive log probabilities.
  • For each trajectory, concatenate the response with the input and perform forward passes through both πt and π0, computing token-wise conditional probabilities.
  • Calculate the contrastive log probability difference (Δ) for each token, selecting the top-K tokens with the highest Δ values as safety tokens.
  • Aggregate votes for each token across multiple trajectories using a contrastive voting scheme, forming a safety token subset S.
  • During on-policy distillation, generate responses from the student model (πs) on harmful prompts, and compute the reverse KL divergence only over the safety token subset S.
  • Update the model parameters by minimizing this localized reverse KL loss, ensuring safety behaviors are learned without degrading general capabilities.
  • Evaluate safety success rate and model performance on multiple benchmarks to validate effectiveness.
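The aggregation in the voting step could look like the sketch below, which assumes a simple threshold vote: each trajectory casts one vote for each of its top-K tokens, and tokens reaching `min_votes` enter S. The paper's contrastive voting scheme is not detailed in this summary, so the rule, the function name, and the threshold are hypothetical.

```python
from collections import Counter

def vote_safety_tokens(per_trajectory_top_k, min_votes):
    """Build the safety token subset S from per-trajectory top-K selections."""
    votes = Counter()
    for selected in per_trajectory_top_k:
        votes.update(set(selected))  # at most one vote per token per trajectory
    return {tok for tok, count in votes.items() if count >= min_votes}

# Toy top-K selections from four refusal trajectories.
trajectories = [
    ["cannot", "I"],
    ["cannot", "sorry"],
    ["sorry", "cannot"],
    ["I", "the"],
]

S = vote_safety_tokens(trajectories, min_votes=2)  # {"cannot", "I", "sorry"}
```

Requiring agreement across trajectories filters out tokens such as "the" that score highly in a single rollout by chance.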

Experiments

The experimental setup trains models such as Llama-3-8B-Instruct, Llama-3.2-3B, and Qwen-2.5-7B using only 100 harmful samples from PKU-SafeRLHF, far fewer than previous methods use. Baselines include MoCAN, BFPO, NSPO, and DPO-Mix, trained on larger datasets. Safety is evaluated on benchmarks such as AdvBench, HarmBench, JailbreakBench, and SORRY-Bench, measured by attack success rate (ASR). General capabilities are assessed via MMLU, HumanEval, and AlpacaEval, using the DeepSeek-V4-Flash and lm-eval frameworks. Hyperparameters include the safety token subset size |S|, the response length H, and a sampling temperature of 0 or 1.0. Training uses multiple rollouts (M) with the localized reverse KL loss, and ablation studies test the effects of response length, token selection, and normalization strategies.

Results

Across multiple models, SafeSteer achieves a harmful response rate below 2% on several benchmarks, outperforming strong baselines like MoCAN (around 3.75%) and BFPO (around 3.8%). On Qwen-3-4B-Instruct, the harmful response rate drops to 1.13%, with minimal performance loss on general tasks (less than 1.5%). Ablation studies confirm that restricting the reverse KL to safety tokens prevents capability degradation, while full-vocabulary penalties cause significant drops. Visualization of hidden states shows that the internal representations of the model remain nearly identical before and after alignment, indicating that the sparse, localized approach effectively avoids catastrophic forgetting. These results demonstrate that SafeSteer provides a practical, low-cost solution for safe deployment of large models.

Applications

This approach is directly applicable to deploying safer AI systems in content moderation, enterprise chatbots, and sensitive data filtering. Its low data requirement and high efficiency make it suitable for rapid iteration and deployment in real-world scenarios. By focusing on sparse safety signals, it can be integrated into existing pipelines with minimal retraining. In the long term, extending this framework to multimodal models and larger architectures could enable safer AI in autonomous vehicles, medical diagnostics, and other critical domains. The ability to dynamically adjust safety policies based on sparse signals also opens avenues for adaptive, context-aware safety mechanisms.

Limitations & Outlook

The method assumes that the base model already exhibits some refusal behavior; models lacking this will face challenges in safety token mining. Its effectiveness on models larger than 10B parameters remains unverified, requiring further scaling studies. The current focus is on autoregressive text models, and adaptation to multimodal or non-autoregressive architectures is yet to be explored. Additionally, the safety token selection relies on the quality of contrastive signals, which may vary across datasets and tasks, potentially affecting robustness. Future work should address these limitations by developing more generalizable token mining algorithms and extending the approach to broader AI modalities.

Plain Language (accessible to non-experts)

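Imagine you run a factory with many different machines, each able to perform a variety of tasks. Some machines may malfunction or even do something dangerous. To keep the factory safe, you can install sensors on the critical parts of each machine and fine-tune only those critical points instead of recalibrating everything. That way you can quickly correct the spots that might cause danger without disrupting the normal operation of the whole factory.

SafeSteer works like this factory’s sensor system. By observing the machine’s internal "sensor" signals, it finds the sparse "critical points" related to safety. It then fine-tunes only those points, so that the machine firmly refuses dangerous requests while still working normally otherwise. This localized adjustment is more efficient and safer than a full recalibration, because it does not affect the machine’s overall performance. Like adjusting only the critical machines, it ensures safety without hurting production efficiency.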

ELI14 (explained like you're 14)

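Imagine your school has a super-smart robot teacher. The robot can answer all kinds of questions, but sometimes it says things that aren’t appropriate. To make it safer, the teachers tell it: "If you get a dangerous or rude question, refuse to answer." But teaching the robot to refuse can’t mean changing everything about it, or it would stop being smart.

SafeSteer is like fitting the robot with a special set of "safety sensors." These sensors find the key words or signals that tell the robot "don’t answer." Then only those key points are fine-tuned, so the robot learns to refuse firmly when something is dangerous and still answers cleverly the rest of the time. The robot ends up both safe and smart, and the adjustment doesn’t make it dumber. It’s like pressing the "stop" button only at the critical moment: safe and efficient.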

Glossary

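Activation Steering

A technique that controls model behavior by modifying the model’s internal activations, with the aim of steering the model toward a particular output response.

Used to build the safety teacher model and stabilize refusal behavior.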

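Reverse KL Divergence

A measure of the difference between two probability distributions; it is mode-seeking and is often used for localized optimization in model fine-tuning.

In SafeSteer it is restricted to safety tokens, enabling localized fine-tuning.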

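Safety Tokens

Sparse keywords or symbols in the model output that are associated with safety refusals, used to identify and adjust the model’s safety behavior.

Mined by contrasting distributions; they serve as the target subset for fine-tuning.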

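Contrastive Log Probability

A method that identifies sensitive tokens by comparing the output probabilities of two models on the same input.

Used for automatic mining of safety tokens.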

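Localized Fine-tuning

Fine-tuning only on a sparse subset of the model’s output, avoiding the capability loss caused by global adjustment.

Achieves the balance between safety and capability.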

Refusal Behavior

The model’s behavior of actively declining to answer unsafe or inappropriate requests.

Made stable through activation steering.

Sparse Adjustment

Fine-tuning only a small number of critical tokens, reducing the impact on the model’s overall capabilities.

One of the core techniques.

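Activation Space

The vector space of the model’s internal hidden states, used to analyze and steer model behavior.

The basis for building the safety teacher model.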

Sparse Safety Features

Safety-related signals or tokens that occur sparsely in the model output.

The target of mining and fine-tuning.

Safety Alignment

The process of making model outputs conform to human values and safety standards.

The core research goal of the paper.

Abstract

Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing methods mitigate this by balancing dual objectives, which heavily rely on massive general-purpose data or auxiliary reward models. In this paper, we argue that, because safety features are inherently sparse within the output distribution, alignment requires localized modifications rather than global trade-offs. To this end, we propose SafeSteer, which performs on-policy distillation confined to safety tokens. First, we construct a safety teacher via activation steering. Based on this teacher, we develop a safety token selection algorithm. Consequently, SafeSteer restricts the reverse KL penalty to these tokens during training to preserve general capabilities. Experimental results across diverse models show that our SafeSteer achieves a superior trade-off between safety and general capability compared with existing methods, attaining strong safety performance on seven safety benchmarks with only minimal degradation on five general capability benchmarks. Notably, SafeSteer requires only 100 harmful samples without using any general-purpose data, less than 1% of what previous baselines used, considerably reducing alignment cost. More details are on our project page at https://anjingkun.github.io/SafeSteer.

cs.AI cs.CL
