The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model
This paper reveals that RLHF achieves shallow alignment by compressing partisan signals without removing the underlying partisan structure, as shown through internal representation analysis of Llama 3.1 8B.
Key Findings
Methodology
This study combines linear probing with sparse autoencoder (SAE) decomposition to investigate how RLHF changes the internal bias structure of Llama 3.1 8B. First, linear probes are trained on hidden states across layers to identify the partisan direction and to assess whether RLHF removes or alters this geometric bias. Next, an SAE decomposes the activations into features associated with partisan signals, revealing how those features change after RLHF. Finally, feature-level causal experiments test whether the identified bias pathways actually drive output generation. The analysis therefore compares the model before and after RLHF in three stages: bias direction detection, feature decoding, and causal validation.
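As a rough illustration of the probing step, the sketch below recovers a "partisan direction" from hidden states and scores inputs by projecting onto it. It is not the paper's procedure: it swaps in a difference-of-class-means probe for whatever probe the author trained, and the hidden states, labels, and dimensions are synthetic stand-ins chosen for the example.

```python
import numpy as np

def fit_probe_direction(hidden, labels):
    """Difference-of-class-means direction, normalised to unit length.
    (A simple stand-in for a trained linear probe; an assumption.)"""
    mu_pos = hidden[labels == 1].mean(axis=0)
    mu_neg = hidden[labels == 0].mean(axis=0)
    direction = mu_pos - mu_neg
    return direction / np.linalg.norm(direction)

def project(hidden, direction):
    """Scalar 'partisan signal' per input: projection onto the direction."""
    return hidden @ direction

def probe_accuracy(hidden, labels, direction):
    """Classify by thresholding the projection midway between class means."""
    scores = project(hidden, direction)
    threshold = 0.5 * (scores[labels == 1].mean() + scores[labels == 0].mean())
    return float(((scores > threshold).astype(int) == labels).mean())

# Synthetic hidden states (hypothetical): Gaussian noise plus a planted
# direction whose sign depends on the binary party label.
rng = np.random.default_rng(0)
d_model, n_samples = 64, 400
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
labels = rng.integers(0, 2, size=n_samples)
hidden = rng.normal(size=(n_samples, d_model))
hidden += 3.0 * np.outer(2 * labels - 1, true_dir)

direction = fit_probe_direction(hidden, labels)
acc = probe_accuracy(hidden, labels, direction)
cos = float(direction @ true_dir)
print(f"probe accuracy: {acc:.3f}, cosine with planted direction: {cos:.3f}")
```

Running the same projection on base-model and Instruct-model hidden states, layer by layer, is one way to compare how widely the signal spreads in each model.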
Key Results
- RLHF does not eliminate the geometric partisan structure; it compresses the variance of the partisan signal. The signal's range narrows from [-0.5, 1.253] in the base model to [-0.011, 0.388] in the instructed model, and its standard deviation drops from 0.234 to 0.07, a more than threefold reduction. In the SAE analysis, the activation of partisan-related features diminishes sharply after RLHF, with almost all policy-related features becoming inactive. RLHF thus achieves neutrality primarily through signal compression and pathway disconnection rather than structural removal.
- The primary mechanism of neutrality involves severing the causal pathway from partisan geometry to output, achieved by adjusting the activation of partisan features. Experiments show that activating partisan features in the instructed model reactivates biased outputs, confirming the causal link. Conversely, suppressing these features leads to balanced, multi-perspective responses. This indicates that the geometric bias remains but is functionally blocked, highlighting the superficial nature of RLHF's alignment process.
- The findings imply that the bias geometry persists internally and that the apparent neutrality is a consequence of feature-level suppression. The bias structure remains accessible and can be reactivated, either by directly activating the suppressed features or by inferring and amplifying a user's partisan identity, which reveals a potential vulnerability. RLHF's effect therefore appears limited to superficial signal modulation, and deeper bias structures may still pose risks, especially in adversarial or context-specific scenarios.
- Overall, this work advances understanding of how RLHF influences internal representations, emphasizing that shallow alignment via signal compression does not eradicate bias structures. It provides a mechanistic foundation for developing more robust alignment strategies that address underlying geometric biases rather than merely suppressing signals.
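To put the reported compression in concrete terms, the short calculation below turns the summary statistics quoted in the first result into ratios. It uses only the numbers stated there; the variance ratio additionally assumes that variance is simply the square of the reported standard deviation.

```python
# Summary statistics of the partisan signal as reported in the results above.
BASE = {"min": -0.5, "max": 1.253, "std": 0.234}
INSTRUCT = {"min": -0.011, "max": 0.388, "std": 0.07}

def compression_ratios(base, tuned):
    """How many times wider the base model's signal is than the tuned model's."""
    base_width = base["max"] - base["min"]
    tuned_width = tuned["max"] - tuned["min"]
    std_ratio = base["std"] / tuned["std"]
    return {
        "range_ratio": base_width / tuned_width,
        "std_ratio": std_ratio,
        "variance_ratio": std_ratio ** 2,
    }

ratios = compression_ratios(BASE, INSTRUCT)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

On these figures the range narrows by roughly 4.4 times and the standard deviation by roughly 3.3 times, which corresponds to about an 11-fold drop in variance.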
Significance
This research fundamentally challenges the common assumption that RLHF achieves deep alignment by removing undesirable biases. Instead, it shows that RLHF primarily compresses bias signals, leaving the geometric structures intact but functionally inactive. This insight has profound implications for AI safety, as models may retain latent biases that can be reactivated under certain conditions, posing risks of unintended behavior. It underscores the importance of developing alignment techniques that go beyond superficial signal suppression to address the root geometric and structural biases. Furthermore, the methodology combining linear probing and SAE provides a powerful toolkit for mechanistic interpretability, enabling researchers to dissect complex internal representations and causal pathways. The findings also inform policy and ethical considerations, emphasizing caution in assuming that current alignment methods fully mitigate bias and value misalignment in deployed AI systems.
Technical Contribution
This paper introduces a novel mechanistic framework combining linear probing and sparse autoencoder decomposition to analyze internal bias structures in large language models. It demonstrates that RLHF does not remove the geometric partisan direction but compresses the variance of the bias signal, effectively silencing it at the feature activation level. The study provides causal validation by manipulating feature activations, confirming the pathway from partisan geometry to output. The approach offers a new paradigm for understanding and controlling biases in neural networks, emphasizing the importance of internal feature-level analysis over surface-level output inspection. The integration of these techniques advances the interpretability and safety of large models, enabling targeted interventions at the internal representation level.
Novelty
This work offers a systematic mechanistic account showing that RLHF's bias mitigation operates primarily through signal compression and pathway disconnection rather than structural removal of geometric bias directions. Unlike prior studies focusing on output-level adjustments or superficial fine-tuning, this research examines the internal mechanisms and emphasizes the persistence of bias geometry within the residual stream. The combination of linear probes and SAE for interpretability and causal validation is its main methodological contribution, providing a deeper understanding of how alignment techniques influence internal representations. This insight challenges existing assumptions and opens new avenues for designing more robust, bias-resilient alignment strategies.
Limitations
- The analysis centers on partisan bias within the Llama 3.1 8B model, which may limit generalizability to other models or bias domains. The internal bias geometry might differ across architectures and training data.
- The study relies on linear probes and SAE, which may not capture complex nonlinear interactions or multi-modal influences on bias structures, potentially oversimplifying the internal dynamics.
- The causal experiments focus on activation manipulation at the feature level, but real-world bias reactivation could involve more subtle or context-dependent mechanisms, requiring further investigation.
- The findings are based on a specific dataset (congressional tweets) and may not fully represent biases arising from broader societal or cultural factors. Future work should validate across diverse datasets and bias types.
Future Work
Future research should extend the mechanistic analysis to other bias domains, such as racial or gender biases, and explore nonlinear and multi-modal internal representations. Developing methods to structurally modify or remove geometric bias directions could lead to more robust alignment. Additionally, integrating causal inference techniques and counterfactual analysis may enhance understanding of bias reactivation pathways. Expanding the scope to multi-language models and real-world deployment scenarios will be crucial for practical safety improvements. Finally, designing alignment strategies that address the root geometric structures rather than superficial signals could significantly advance AI safety and value alignment.
AI Executive Summary
The rapid advancement of large language models (LLMs) such as GPT and LLaMA has revolutionized natural language processing, enabling applications across industries from customer service to scientific research. Despite their impressive capabilities, these models harbor embedded biases and value-laden structures that pose significant safety and ethical challenges. Traditional approaches to alignment, notably Reinforcement Learning from Human Feedback (RLHF), aim to steer models toward human-compatible behaviors. However, recent insights suggest that RLHF primarily achieves superficial or shallow alignment, rather than fundamentally altering the internal representations that encode biases.
Wendy K. Tam’s recent study provides a mechanistic deep dive into this phenomenon, focusing on the internal geometry of partisan bias within the Llama 3.1 8B model. By employing linear probing techniques, the study identifies a specific partisan direction in the model’s residual stream, which correlates strongly with political orientation. The key discovery is that RLHF does not erase this geometric bias; instead, it compresses the variance of the partisan signal, effectively silencing it at the feature activation level.
To understand how this works, Tam utilizes sparse autoencoders (SAE) to decode the activation features associated with partisan bias. The analysis reveals that policy-related features, which activate sporadically in the base model, become completely inactive after RLHF. The bias signal’s compression results in responses that are balanced and non-partisan, but the underlying geometric structure remains intact within the residual stream. This indicates that RLHF’s neutrality is functionally achieved by pathway disconnection rather than structural removal.
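The idea of a feature that fires sporadically in one model and never in another can be sketched with a toy SAE encoder. Everything here is hypothetical: the weights are random, the "policy feature" is simply column 0, and the tenfold compression factor is invented for the example. The toy also ties inactivity to the compressed signal falling below the feature's bias threshold, which is an illustrative assumption and not a mechanism the summary attributes to the paper.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc, b_dec):
    """Standard ReLU SAE encoder: f = ReLU((x - b_dec) @ W_enc + b_enc)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def activation_frequency(features):
    """Fraction of inputs on which each feature fires (activation > 0)."""
    return (features > 0).mean(axis=0)

rng = np.random.default_rng(0)
d_model, n_features, n_samples = 16, 32, 1000

# Hypothetical encoder with unit-norm columns; column 0 is the "policy feature".
W_enc = rng.normal(size=(d_model, n_features))
W_enc /= np.linalg.norm(W_enc, axis=0, keepdims=True)
b_enc = np.full(n_features, -1.0)  # negative bias acts as a firing threshold
b_dec = np.zeros(d_model)
policy_dir = W_enc[:, 0]

# Toy "base" activations: the policy direction is strongly present on
# every tenth input and absent otherwise, plus small noise.
strength = np.where(np.arange(n_samples) % 10 == 0, 3.0, 0.0)
noise = 0.05 * rng.normal(size=(n_samples, d_model))
base_acts = noise + strength[:, None] * policy_dir

# Toy "Instruct" activations: same inputs, policy signal compressed tenfold.
inst_acts = noise + 0.1 * strength[:, None] * policy_dir

freq_base = activation_frequency(sae_encode(base_acts, W_enc, b_enc, b_dec))
freq_inst = activation_frequency(sae_encode(inst_acts, W_enc, b_enc, b_dec))
print(f"policy feature fires on {freq_base[0]:.0%} of base inputs, "
      f"{freq_inst[0]:.0%} of Instruct inputs")
```

Note that the direction `policy_dir` is identical in both cases; only the magnitude of the signal along it changes, which mirrors the distinction between intact geometry and silent features.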
Further causal experiments demonstrate that reactivating partisan features in the instructed model can reintroduce bias, confirming the pathway’s causal role. The bias control mechanism is thus superficial: it depends on feature activation rather than on the elimination of bias geometry. This points to a potential vulnerability, since mechanisms that infer and amplify a user’s partisan identity can reactivate the bias pathways and produce unintended biased outputs.
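A minimal sketch of what feature-level intervention can look like is given below. It assumes steering is done by adding a multiple of a feature's direction to the residual stream, and suppression by projecting that direction out; both are common interpretability practices, but the summary does not confirm that the paper implements its experiments this way. The residual vectors, the feature direction, and the scale `alpha` are hypothetical.

```python
import numpy as np

def unit(v):
    """Normalise a vector to unit length."""
    return v / np.linalg.norm(v)

def steer(resid, feature_dir, alpha):
    """Activate a feature: add alpha times its unit direction to each vector."""
    return resid + alpha * unit(feature_dir)

def ablate(resid, feature_dir):
    """Suppress a feature: remove each vector's component along its direction."""
    u = unit(feature_dir)
    return resid - np.outer(resid @ u, u)

# Hypothetical residual-stream vectors (5 token positions, width 32) and a
# hypothetical partisan feature direction.
rng = np.random.default_rng(0)
resid = rng.normal(size=(5, 32))
feature_dir = rng.normal(size=32)
u = unit(feature_dir)

steered = steer(resid, feature_dir, alpha=4.0)
ablated = ablate(resid, feature_dir)

# Steering shifts the projection onto the feature by exactly alpha;
# ablation leaves no component along the feature.
shift = steered @ u - resid @ u
print("projection shift after steering:", np.round(shift, 6))
print("projection after ablation:", np.round(ablated @ u, 6))
```

In a real experiment the modified vectors would be written back into the model's forward pass, and the generated text would then be compared against the unmodified run.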
Overall, Tam’s work challenges the assumption that RLHF achieves deep, structural alignment. Instead, it underscores the importance of understanding internal geometric biases and developing techniques that modify these root structures. The findings have profound implications for AI safety, model interpretability, and the design of future alignment strategies. Moving forward, integrating nonlinear analysis, multi-modal data, and structural bias removal could lead to more robust, trustworthy AI systems capable of aligning with complex human values in a resilient manner.
Deep Dive
Abstract
The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with "human values." Yet the process is opaque. What values are being encoded; whose values are they; and how does RLHF encode them? A growing body of evidence suggests that RLHF produces only functional compliance rather than deep alignment. We offer a mechanistic case study of this phenomenon for partisan political orientation with a comparison of the internal representations of Llama 3.1 8B before and after RLHF. We show that RLHF does not remove the structured partisan direction in the base model. Instead, it compresses the variance of the partisan signal to generate consistently balanced and non-partisan output. Sparse autoencoder decomposition reveals that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model. Feature-level steering experiments confirm the causal disconnect. RLHF thus encodes a norm of political neutrality, not by erasing the model's knowledge of partisanship, but by severing the causal pathway from partisan geometry to output generation. Importantly, this neutrality is functional, not structural, so the underlying geometry that enables partisan steering remains intact. The mechanisms that bypass RLHF's guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation. If RLHF operates by disconnecting rather than removing value-laden structure, then the same pattern may hold for other value domains, and the aligned model's behavior may be more fragile than its outputs suggest.