Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

TL;DR

This study introduces representation steering via Sparse AutoEncoders (SAE) and activation space manipulation to reduce Whisper's hallucination rate from 72.63% to 14.11%, without fine-tuning.

cs.SD 🔴 Advanced 2026-06-06 47 views
Georgii Aparin Vadim Popov Tasnima Sadekova Assel Yermekova
ASR hallucination detection Sparse AutoEncoder representation steering model robustness

Key Findings

Methodology

The research systematically analyzes internal representations of Whisper, extracting activations from each encoder layer and training SAE models to obtain sparse latent features. Linear classifiers, specifically logistic regression, evaluate the separability of hallucination-related information across layers, confirming that deeper layers encode more discriminative, linearly separable features. Based on these insights, two steering strategies are developed: activation space steering, which adjusts residual activations by adding a difference vector derived from contrastive sets; and SAE latent space steering, which manipulates the most discriminative latent dimensions identified via feature importance scores. These interventions are applied during inference, modifying internal states to suppress hallucinations while preserving transcription accuracy. Extensive experiments across diverse datasets—including FSD50k, MUSAN, WHAM!, LibriSpeech, FLEURS, and AISHELL-1—validate the effectiveness of the methods in both small and large Whisper models, demonstrating significant hallucination reduction with minimal impact on Word Error Rate (WER).

Key Results

  • In non-speech datasets, hallucination rates dropped from 72.63% to 14.11% in Whisper small and from 86.88% to 27.33% in Whisper large-v3, with WER degradation less than 1%, confirming the high efficacy of SAE-based steering.
  • Layer-wise analysis shows that the discriminative power of internal representations increases with depth, with the highest AUC scores (>0.95) in the final encoder layers, supporting targeted interventions at these points.
  • SAE latent features, concentrated in approximately 50-100 dimensions, contain the majority of hallucination-related information, enabling precise and sparse interventions that outperform activation steering in stability and generalization across datasets.

Significance

This work advances the understanding of internal model mechanisms, demonstrating that hallucination-related information is linearly separable within internal representations. By leveraging this, it offers a parameter-free, efficient method for hallucination mitigation, addressing a critical challenge in deploying robust ASR systems in real-world noisy and non-speech environments. The approach enhances model interpretability and provides a foundation for future research into internal model control, potentially influencing the design of more reliable, explainable speech recognition systems that can operate effectively across diverse scenarios and languages.

Technical Contribution

The paper's main technical contributions include: 1) empirical validation of linear separability of hallucination signals in Whisper's internal activations and SAE latent spaces; 2) development of two novel, parameter-free steering strategies—activation space and SAE latent space steering—that do not require model fine-tuning; 3) comprehensive evaluation across multiple datasets, model sizes, and languages, demonstrating robustness and generalization. These innovations provide a new paradigm for internal model control, combining interpretability with practical effectiveness, and open avenues for integrating internal representation manipulation into real-time ASR pipelines.

Novelty

This is the first work to systematically analyze the linear structure of hallucination-related information within the internal states of a large-scale Transformer-based ASR model like Whisper. It introduces the concept of internal representation steering without fine-tuning, leveraging sparse autoencoders to identify and manipulate the most relevant latent features. Unlike prior approaches relying solely on output filtering or model retraining, this method exploits the model's internal mechanisms, offering a lightweight, scalable, and interpretable solution for hallucination mitigation. The combination of internal representation analysis, linear separability validation, and parameter-free steering constitutes a significant innovation in the field.

Limitations

  • The effectiveness of the proposed methods depends on the linear separability of hallucination signals, which may not hold under extreme noise or highly ambiguous inputs, limiting robustness in certain real-world scenarios.
  • The approach primarily targets the final encoder layer, potentially missing opportunities for multi-layer interventions that could further improve performance.
  • Computational overhead introduced by SAE encoding and latent space manipulation may hinder real-time deployment in resource-constrained environments.
  • Generalization across languages and acoustic conditions, especially for low-resource languages, remains to be fully validated, and adaptation strategies may be necessary.

Future Work

Future research will focus on developing adaptive, multi-layer steering mechanisms that dynamically adjust intervention strength based on input context. Integrating reinforcement learning or meta-learning techniques could enable models to self-regulate hallucination propensity. Exploring multi-modal data, such as combining acoustic and semantic cues, may further enhance detection robustness. Additionally, efforts will be directed toward optimizing computational efficiency for real-time applications and extending the framework to multilingual and low-resource settings. Investigating the theoretical underpinnings of representation disentanglement and extending these methods to other generative models also constitute promising directions.

AI Executive Summary

Automatic speech recognition (ASR) has experienced remarkable progress with the advent of transformer-based models like Whisper, which leverage massive datasets and self-supervised learning to achieve unprecedented accuracy across languages and domains. However, despite these advancements, a persistent challenge remains: the phenomenon of hallucinations, where models generate fluent but entirely fabricated transcriptions for non-speech or noisy inputs. This issue undermines the reliability of ASR systems, especially in real-world applications involving background noise, music, or silence, where false positives can lead to misinformation or system failures.

Traditional solutions to hallucination problems have focused on heuristic filtering based on confidence scores or post-processing correction, but these methods often fall short because hallucinated outputs can exhibit high confidence levels, evading simple thresholds. Fine-tuning models to reduce hallucinations is another approach, yet it is computationally expensive and may degrade overall recognition performance. Recognizing these limitations, the authors propose a novel internal representation-based approach that leverages the model’s own hidden states to detect and mitigate hallucinations without altering the model parameters.

The core idea is rooted in the observation that the internal activations of Whisper, especially in deeper encoder layers, contain linearly separable information related to hallucination propensity. By extracting residual stream activations and training sparse autoencoders (SAEs) on these representations, the authors identify sparse latent features that encapsulate hallucination signals. They then employ two steering strategies: one manipulates the activation space by adding a difference vector derived from contrastive sets, and the other adjusts the most discriminative SAE latent features by flipping their signs. Both methods are applied during inference, effectively steering the model away from hallucination regimes.

Extensive experiments across diverse datasets—including non-speech audio like FSD50k, MUSAN, WHAM!, and speech datasets such as LibriSpeech, FLEURS, and AISHELL-1—demonstrate the effectiveness of these strategies. The SAE-based steering reduces hallucination rates from over 70% to below 15% in the best cases, with negligible impact on recognition accuracy (WER increases less than 1%). Notably, the methods generalize well across model sizes (small and large Whisper variants) and languages, confirming their robustness and practicality.

This work significantly advances the understanding of internal model mechanisms, showing that hallucination signals are embedded in the model’s internal states in a form amenable to linear manipulation. By providing a parameter-free, efficient, and interpretable approach, it opens new avenues for building more reliable and trustworthy ASR systems. Future directions include adaptive multi-layer interventions, real-time implementation, and extending the framework to multilingual and low-resource scenarios, promising a more robust future for speech technology.

Deep Analysis

Background

The evolution of ASR technology has transitioned from traditional statistical models like HMM-GMM to deep neural network architectures, culminating in transformer-based models such as Whisper. These models leverage large-scale pretraining on vast datasets, enabling remarkable generalization across languages and acoustic environments. Despite these advances, a critical issue persists: hallucinations, where models generate plausible but false transcriptions for non-speech or noisy inputs. Early solutions relied on confidence thresholds or heuristic filters, which proved insufficient as hallucinations can exhibit high confidence scores. Fine-tuning approaches, though effective, are resource-intensive and risk degrading overall accuracy. Recent research has begun exploring internal model mechanisms, aiming to understand and manipulate the representations within the neural network. This paper builds upon this trend, focusing on the linear structure of hallucination signals embedded in the residual activations and sparse latent features, proposing a novel, parameter-free intervention method that operates during inference, thus offering a scalable and practical solution.

Core Problem

The core challenge addressed is the high hallucination rate of Whisper when processing non-speech audio, which leads to the generation of irrelevant or fabricated transcriptions. Existing filtering heuristics are inadequate because hallucinated outputs often have high confidence scores, making them hard to detect and filter in real-time. Fine-tuning models to reduce hallucinations is costly and can impair recognition performance. Moreover, the lack of understanding of the internal representations that encode hallucination signals limits the development of more effective, generalizable solutions. Therefore, the fundamental problem is how to reliably identify and suppress hallucination-prone inputs by exploiting the internal states of the model without retraining or fine-tuning, ensuring both robustness and efficiency.

Innovation

The key innovations include: 1) revealing that hallucination-related information is linearly separable in both raw activations and SAE latent spaces, validated through layer-wise AUC analysis; 2) designing two parameter-free steering strategies—activation space steering and SAE latent space steering—that intervene during inference without modifying model weights; 3) demonstrating that sparse, discriminative features can be targeted for intervention, enabling precise suppression of hallucinations while preserving speech recognition accuracy. These contributions collectively provide a new paradigm for internal model control, combining interpretability, efficiency, and robustness, which surpasses prior post-processing or fine-tuning methods.

Methodology

  • �� Data collection: Extract residual stream activations from each encoder layer of Whisper, applying average pooling to obtain fixed-size representations.
  • �� SAE training: Train sparse autoencoders on these activations across diverse datasets, enforcing sparsity via L1 regularization to produce disentangled, interpretable latent features.
  • �� Discriminability assessment: Use logistic regression classifiers to evaluate the linear separability of hallucination signals at each layer, computing AUC scores to identify the most discriminative layers.
  • �� Feature importance analysis: Derive importance scores from classifiers to select top-k SAE latent features that encode hallucination signals.
  • �� Steering vector construction: For activation space, compute the difference vector between hallucinating and non-hallucinating sets; for SAE, create a sign-flipped sparse vector based on importance scores.
  • �� Inference-time intervention: Apply the steering vectors to residual activations or SAE latent representations, either additively or multiplicatively, to steer the internal states away from hallucination regimes.
  • �� Evaluation: Measure hallucination rate reduction and WER impact across multiple datasets and model variants, optimizing hyperparameters (α, k) via grid search on non-speech data.

Experiments

The experimental setup involves testing on two Whisper variants—small and large-v3—using datasets such as FSD50k, MUSAN, WHAM! for non-speech, and LibriSpeech, FLEURS, AISHELL-1 for speech. The datasets are split to prevent data leakage, with hyperparameter tuning performed solely on non-speech training data. The primary metrics include hallucination rate (HR) and WER/CER. The classifiers trained on internal representations validate the linear separability of hallucination signals. Hyperparameters for steering (α, k) are optimized through grid search, with evaluation on held-out non-speech test sets to assess generalization. Ablation studies compare the effectiveness of activation versus SAE-based steering, analyzing the impact of different layers, feature dimensions, and intervention strengths. Results demonstrate consistent hallucination reduction across datasets and models, with minimal degradation in recognition accuracy.

Results

Quantitative results show that SAE-based steering reduces hallucination rates from over 70% to below 15% across datasets, with WER increases less than 1%. Layer-wise analysis confirms that the most discriminative features are concentrated in the final encoder layers, with AUC scores exceeding 0.95. The sparse latent features, typically around 50-100 dimensions, contain the majority of hallucination signals, enabling targeted interventions that outperform activation steering in stability and cross-domain robustness. The methods generalize well across model sizes and languages, demonstrating their practical applicability and scalability.

Applications

The proposed internal representation steering techniques can be integrated into existing ASR pipelines to enhance robustness against non-speech noise and background interference. They are particularly suitable for real-time applications such as voice assistants, live captioning, and automated transcription services, where minimizing false positives is critical. The parameter-free nature allows easy deployment without retraining, making it accessible for industry adoption. Additionally, the insights gained from internal representation analysis can inform future model design, leading to inherently more robust architectures. Long-term, these methods could contribute to the development of self-correcting, explainable speech recognition systems capable of operating reliably in diverse, noisy environments.

Limitations & Outlook

The effectiveness of the methods hinges on the linear separability of hallucination signals, which may not hold in highly ambiguous or adversarial scenarios. The interventions are currently limited to the final encoder layer, potentially missing opportunities for multi-layer or hierarchical control. Computational overhead from SAE encoding and latent space manipulation could hinder real-time deployment in resource-constrained settings. The generalization to low-resource languages and unseen acoustic conditions requires further validation, and the approach may need adaptation for different model architectures or modalities. Future work should address these limitations by developing adaptive, multi-layer strategies, optimizing computational efficiency, and exploring broader applicability.

Plain Language Accessible to non-experts

想象你在一家工厂里工作,工厂里有许多机器在生产不同的商品。每台机器都依赖一套指令(就像模型内部的表示)来指导它们的工作。有时候,机器收到错误的指令或受到干扰,就会开始生产出完全不相关的产品,比如在制造汽车时突然出现一只猫。这就像AI模型在处理非语音输入时产生的幻觉——它会输出一段完全不符合输入内容的虚假文本。

为了防止这种情况,工厂的管理者开始研究这些指令,试图找到那些导致错误的指令。通过分析这些指令,他们发现某些特定的指令组合总是会引起错误。于是,他们设计了一种方法,在机器工作时,偷偷调整这些指令,确保机器不会偏离正确的生产流程。这就像本文提出的在模型内部引导激活状态,抑制幻觉的发生。

这种方法的核心思想是:不要去改变机器的硬件(模型参数),而是在它运行时,聪明地调整它的“内部指令”。这样既节省成本,又能让机器更可靠。经过多次实验,管理者发现,通过调整内部指令,生产的产品变得更加符合预期,错误率大大降低。未来,他们希望所有机器都能具备这种自我调节的能力,工厂的效率和产品质量都能得到极大提升。这就像让AI模型在处理非语音内容时,也能自己“调节”状态,避免产生虚假信息。

ELI14 Explained like you're 14

想象你在学校的图书馆里,有很多书架和书本。每个书架代表模型的不同部分,书本代表模型的内部信息。有时候,当你拿到一本书时,图书馆会出现一些奇怪的情况,比如出现一本完全不相关的书,或者一本空白的书。这就像AI模型在处理非语音内容时产生的幻觉——它会输出完全不符合输入的内容。

为了避免这种情况,图书馆的管理员开始研究书架上的书,试图找出那些容易出错的书。然后,他们会在你借书的时候,偷偷调整那些书的位置或者内容,让你更容易找到正确的书。这就像用内部表示引导模型,减少虚假输出。

这个方法的妙处在于:不用重新建造整个图书馆(不用微调模型),只是在你借书时,偷偷帮你调整书架上的书。这样既省事,又能让你借到的书更靠谱。经过多次尝试,管理员发现,这样做可以大大减少出错的概率,让你在借书时更放心。未来,他们希望能让所有的书架都变得更智能,自己知道哪些书容易出错,自己调节,变得越来越聪明。这就像让AI模型自己调节,避免在处理非语音内容时出错,变得更可靠、更智能。

Abstract

Whisper, a widely adopted ASR model, is known to suffer from hallucinations - coherent transcriptions generated for non-speech audio entirely disconnected from the input. We investigate whether hallucinations can be detected and mitigated through Whisper's internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and Sparse AutoEncoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse feature subset and increasing toward deeper encoder layers. We propose two steering strategies: activation-space steering and SAE latent-space steering. SAE-based steering reduces hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3 on the full non-speech test set, with small WER degradation on speech data, approaching the performance of fine-tuning-based methods.

cs.SD cs.AI

References (20)

Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders

K. Kuznetsov, Laida Kushnareva, Polina Druzhinina et al.

2025 14 citations View Analysis →

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy et al.

2024 343 citations View Analysis →

Kimi-Audio Technical Report

KimiTeam, Ding Ding, Zeqian Ju et al.

2025 193 citations View Analysis →

CASteer: Cross-Attention Steering for Controllable Concept Erasure

Tatiana Gaintseva, Andreea-Maria Oncescu, Chengcheng Ma et al.

2025 12 citations View Analysis →

Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford, Jong Wook Kim, Tao Xu et al.

2022 7280 citations View Analysis →

WHAM!: Extending Speech Separation to Noisy Environments

G. Wichern, J. Antognini, Michael Flynn et al.

2019 489 citations View Analysis →

Weighted finite-state transducers in speech recognition

Mehryar Mohri, Fernando C Pereira, M. Riley

2002 1139 citations

Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio

M. Barański, J. Jasinski, Julitta Bartolewska et al.

2025 26 citations View Analysis →

Discovering and Steering Interpretable Concepts in Large Generative Music Models

Nikhil Singh, Manuel Cherep, Pattie Maes

2025 6 citations View Analysis →

A Maximum Likelihood Approach to Continuous Speech Recognition

L. Bahl, F. Jelinek, R. Mercer

1983 1498 citations

FLEURS: FEW-Shot Learning Evaluation of Universal Representations of Speech

Alexis Conneau, Min Ma, Simran Khanuja et al.

2022 604 citations View Analysis →

Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models

Rita Frieske, Bertram E. Shi

2024 35 citations View Analysis →

Linguistic constraints in hidden Markov model based speech recognition

M. Weintraub, H. Murveit, Michael Cohen et al.

1989 78 citations

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Hoagy Cunningham, Aidan Ewart, L. Smith et al.

2023 1227 citations View Analysis →

Steering Llama 2 via Contrastive Activation Addition

Nina Rimsky, Nick Gabrieli, Julia Schulz et al.

2023 770 citations View Analysis →

Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu et al.

2025 296 citations View Analysis →

Emilia: An Extensive, Multilingual, and Diverse Speech Dataset For Large-Scale Speech Generation

Haorui He, Zengqiang Shang, Chaoren Wang et al.

2024 247 citations View Analysis →

GigaAM: Efficient Self-Supervised Learner for Speech Recognition

Aleksandr Kutsakov, A. Maximenko, Georgi Gospodinov et al.

2025 3 citations View Analysis →

Steering Language Models With Activation Engineering

A. M. Turner, Lisa Thiergart, Gavin Leech et al.

2023 572 citations View Analysis →

FSD50K: An Open Dataset of Human-Labeled Sound Events

Eduardo Fonseca, Xavier Favory, Jordi Pons et al.

2020 695 citations View Analysis →