Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders
SAERL leverages Sparse Autoencoder activations to model diversity, difficulty, and quality for LLM post-training data engineering, improving Qwen2.5-Math-1.5B average accuracy by 3.00% over vanilla GRPO.
Key Findings
Methodology
This paper proposes the SAERL framework, which uses Sparse Autoencoder (SAE) activations extracted from large language model (LLM) internals to characterize three intrinsic data properties: diversity, difficulty, and quality. Specifically, it clusters samples in SAE feature space and applies moderate batch mixing to control batch diversity, employs an ElasticNet regressor trained on a small labeled subset to predict sample difficulty and construct easy-to-hard curricula within clusters, and trains a linear classifier over SAE activations to filter noisy or off-distribution samples. The framework is evaluated on mathematical reasoning tasks using the DeepMath-103K dataset with the GRPO and DAPO reinforcement learning algorithms, demonstrating consistent accuracy and training efficiency improvements across model scales. Notably, a single SAE trained on a smaller model transfers effectively to larger models and different RL algorithms, showcasing its lightweight and reusable nature.
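The clustering step can be illustrated with a minimal sketch. It assumes scikit-learn's `MiniBatchKMeans` and uses synthetic vectors in place of real SAE features; the number of clusters, the feature dimension, and the seed are illustrative choices, not values reported in the paper.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

# Stand-in for pooled SAE features: 3 well-separated synthetic groups, 16 dims.
centers = rng.normal(scale=10.0, size=(3, 16))
features = np.vstack([c + rng.normal(size=(200, 16)) for c in centers])

# K, batch_size, n_init and the seed are assumptions made for this sketch.
kmeans = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=3, random_state=0)
cluster_ids = kmeans.fit_predict(features)

# Group sample indices by cluster so that batches can be drawn cluster-first.
clusters = {k: np.flatnonzero(cluster_ids == k) for k in range(3)}
```

Each entry of `clusters` holds the indices of the samples assigned to one cluster, which is the grouping a cluster-first batch builder would start from.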
Key Results
- On the Qwen2.5-Math-1.5B model, SAERL improves average accuracy by 3.00% over vanilla GRPO and reduces training steps by 20% to reach target accuracy.
- Consistent gains are observed across different model scales (1.5B and 7B) and RL algorithms (GRPO and DAPO), demonstrating the generalizability of SAE-based data engineering.
- Ablation studies confirm that difficulty-based curriculum ordering, moderate batch mixing, and cluster-first grouping each significantly contribute to performance, with moderate batch mixing optimally balancing gradient coherence and data coverage.
Significance
This work pioneers the use of LLM internal activations as actionable signals for post-training data engineering, moving beyond traditional reliance on costly external feedback such as human preferences or verifier outcomes. By leveraging fine-grained, sparse SAE features, it captures nuanced data properties that enable more precise and interpretable data selection, curriculum construction, and filtering. This approach enhances training efficiency and final model performance, addressing long-standing challenges in reinforcement learning data engineering. It bridges model interpretability and data-driven training, offering both academic insights into model internal mechanisms and practical tools for industrial-scale LLM training optimization.
Technical Contribution
Technically, the paper innovates by applying Sparse Autoencoders to extract disentangled, sparse representations from LLM hidden states, enabling fine-grained modeling of intrinsic data properties. Unlike prior methods relying on coarse external signals or dense embeddings, SAERL grounds diversity, difficulty, and quality in explicit data engineering operations: batch construction via SAE-space clustering and mixing, curriculum ordering via calibrated difficulty regression, and data filtering via SAE-based classification. The moderate batch mixing strategy balances gradient stability and coverage, improving optimization dynamics. Furthermore, the demonstrated cross-model and cross-scale transferability of a single SAE encoder highlights its engineering efficiency and broad applicability, extending the frontier of model-internal signal utilization in RL data engineering.
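A minimal sketch of moderate batch mixing follows. It assumes batches are already ordered so that neighbouring pairs come from different clusters with similar difficulty and length; the function name `mix_adjacent_batches` and the `mix_ratio` parameter are hypothetical and not taken from the paper.

```python
def mix_adjacent_batches(batches, mix_ratio=0.125):
    """Swap the tail of each batch with the tail of its neighbour.

    batches: list of lists of sample ids, ordered so that the pairs
    (0, 1), (2, 3), ... come from different clusters (an assumption here).
    mix_ratio: fraction of each batch that is exchanged.
    """
    if not 0.0 <= mix_ratio <= 0.5:
        raise ValueError("mix_ratio should stay in [0, 0.5]")
    mixed = [list(b) for b in batches]  # copy so the input is left untouched
    for i in range(0, len(mixed) - 1, 2):
        a, b = mixed[i], mixed[i + 1]
        m = int(min(len(a), len(b)) * mix_ratio)
        if m == 0:
            continue  # nothing to swap; also avoids the a[-0:] pitfall
        a[-m:], b[-m:] = b[-m:], a[-m:]
    return mixed


batches = [[f"A{i}" for i in range(8)], [f"B{i}" for i in range(8)]]
mixed = mix_adjacent_batches(batches, mix_ratio=0.25)
```

With `mix_ratio=0.25` and batches of eight, the last two samples of each batch trade places, so each batch stays mostly within one cluster while still seeing a little of its neighbour.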
Novelty
This is the first work to systematically leverage sparse SAE activations from LLM internals for post-training RL data engineering. Its fundamental innovation lies in jointly modeling three intrinsic data properties and mapping them to concrete data engineering operations, validated across multiple models and RL algorithms. This contrasts with prior approaches that primarily depend on external feedback or dense hidden states, marking a paradigm shift towards intrinsic, interpretable, and reusable data signals.
Limitations
- The method is evaluated primarily on mathematical reasoning tasks with verifiable rewards; its effectiveness in other post-training domains like code generation, agentic RL, or instruction following remains untested.
- Difficulty and quality proxies rely on limited labeled subsets or source distribution labels, precluding fully unsupervised data property modeling and limiting automation.
- Theoretical guarantees linking SAE-space distances to training dynamics are lacking, and causal relationships remain to be established.
Future Work
Future directions include extending SAE-based data engineering to broader post-training tasks such as code generation, multi-step decision-making, and multi-modal learning. Exploring weaker supervision or unsupervised methods for difficulty and quality estimation could reduce labeling dependence. Additionally, integrating gradient-level analyses to theoretically ground SAE-space representations in training dynamics would deepen understanding and improve method robustness.
AI Executive Summary
Large language models (LLMs) have achieved remarkable success through pretraining and fine-tuning, but their capabilities are increasingly advanced via post-training reinforcement learning (RL). Effective data engineering during this stage—deciding which samples to use, their ordering, and batching strategies—is crucial for improving training efficiency and final performance. Traditional approaches rely heavily on external feedback signals such as human preferences, verifier outputs, or rollout success rates, which are costly and often sparse. Meanwhile, the rich internal activations of LLMs, reflecting how the model processes data, remain underexploited.
This paper introduces SAERL, a novel framework that harnesses Sparse Autoencoder (SAE) activations extracted from LLM internals to model three intrinsic data properties: diversity, difficulty, and quality. SAERL operationalizes these properties into concrete data engineering steps: clustering in SAE space combined with moderate batch mixing to control batch diversity; an ElasticNet-based difficulty predictor to construct easy-to-hard curricula within clusters; and a linear quality classifier to filter noisy or off-distribution samples. This intrinsic approach leverages fine-grained, sparse, and interpretable features, enabling more precise and efficient data selection and ordering.
The core technical insight is that SAE activations provide a disentangled representation of LLM internal states, capturing semantic and structural nuances beyond shallow metadata or dense embeddings. Moderate batch mixing balances gradient coherence and coverage, optimizing training dynamics. The difficulty proxy is calibrated per cluster to ensure meaningful curriculum ordering. The quality probe effectively discriminates target distribution samples from noise, enhancing data purity.
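The per-cluster calibration is not specified in detail here, so the sketch below shows one common shrinkage form purely as an assumption: each cluster's mean offset from the global mean is removed, with the correction scaled down for small clusters. The function name and `prior_strength` are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def shrinkage_calibrate(scores, cluster_ids, prior_strength=20.0):
    """Remove a shrunk per-cluster offset from predicted difficulty scores.

    The weight n / (n + prior_strength) trusts the cluster mean more as the
    cluster size n grows. This is an assumed form, not the paper's formula.
    """
    global_mean = fmean(scores)
    by_cluster = defaultdict(list)
    for s, c in zip(scores, cluster_ids):
        by_cluster[c].append(s)
    offsets = {}
    for c, vals in by_cluster.items():
        weight = len(vals) / (len(vals) + prior_strength)
        offsets[c] = weight * (fmean(vals) - global_mean)
    return [s - offsets[c] for s, c in zip(scores, cluster_ids)]


raw = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
ids = [0, 0, 0, 1, 1, 1]
half = shrinkage_calibrate(raw, ids, prior_strength=3.0)
full = shrinkage_calibrate(raw, ids, prior_strength=0.0)
```

Because the same offset is subtracted from every sample in a cluster, the within-cluster ordering is unchanged; only the scales across clusters move closer together.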
Empirical evaluation on the DeepMath-103K dataset with Qwen2.5-Math-1.5B and 7B models demonstrates that SAERL improves average accuracy by 3.00% over vanilla GRPO and reduces training steps by 20% to reach target accuracy. Gains are consistent across RL algorithms (GRPO, DAPO) and model scales, with a single SAE encoder trained on a smaller model effectively guiding larger models. Ablation studies confirm the necessity of each component, and batch diversity analysis reveals a concave relationship between mixing strength and performance, highlighting the importance of moderate mixing.
This work advances the field by shifting post-training data engineering from external feedback dependence to intrinsic model signals, integrating mechanistic interpretability with practical training improvements. It opens avenues for more adaptive, efficient, and interpretable LLM training. Limitations include domain specificity to mathematical reasoning, reliance on limited supervision for difficulty and quality estimation, and the need for theoretical grounding of SAE-space dynamics. Future work aims to generalize to broader tasks, reduce supervision, and deepen theoretical understanding.
Deep Analysis
Background
Large language models (LLMs) have revolutionized natural language processing by leveraging massive pretraining on diverse corpora. However, to achieve state-of-the-art performance on complex tasks, post-training fine-tuning, especially via reinforcement learning (RL), has become essential. RL enables models to optimize for specific objectives, such as alignment with human preferences or task-specific rewards. Effective data engineering during this phase—selecting which samples to train on, their ordering, and batching—is critical to maximize learning efficiency and final model quality.
Traditional post-training data engineering relies on external feedback signals including human annotations, verifier outcomes, or rollout success rates. While useful, these signals are costly to obtain, often sparse, and may not fully capture the intrinsic structure of the data as perceived by the model. Concurrently, advances in mechanistic interpretability have revealed that model internal activations encode rich semantic and structural information about input data. Sparse Autoencoders (SAEs) have emerged as a powerful tool to decompose dense LLM hidden states into sparse, disentangled features, offering a fine-grained lens into model internals.
Previous work has leveraged model internals for pretraining data selection or supervised fine-tuning, but their role in post-training RL data engineering remains underexplored. This gap motivates the current study to harness SAE activations as intrinsic signals for guiding data engineering operations, aiming to improve training efficiency and model performance.
Core Problem
Post-training data engineering faces three intertwined challenges: ensuring batch diversity to cover a wide range of semantic and reasoning patterns, constructing curricula that order samples from easy to hard to facilitate progressive learning, and filtering out noisy or off-distribution samples to maintain data quality. Existing methods predominantly depend on external signals such as human difficulty annotations, rollout accuracy, or preference labels, which are expensive and may not scale well.
Moreover, these external signals often provide coarse or delayed feedback, limiting their effectiveness in fine-grained data selection and ordering. The question arises whether intrinsic model signals—specifically, sparse and interpretable activations extracted via SAE—can reliably encode these data properties and be operationalized into concrete data engineering steps. Addressing this is crucial for developing more adaptive, efficient, and interpretable post-training pipelines for LLMs.
Innovation
This work introduces several key innovations:
- Application of Sparse Autoencoders to extract sparse, disentangled feature activations from LLM hidden states, providing a structured and interpretable representation space for data characterization.
- Joint modeling of three intrinsic data properties (diversity, difficulty, and quality) directly from SAE activations, moving beyond reliance on external or scalar feedback.
- Mapping these intrinsic properties to concrete data engineering operations: SAE-space clustering and moderate batch mixing for diversity-driven batch construction; ElasticNet-based difficulty prediction and cluster-wise calibration for curriculum learning; and linear classification over SAE features for quality-based data filtering.
- Introduction of a moderate batch mixing strategy that balances within-batch gradient coherence and cross-cluster coverage, optimizing training dynamics.
- Demonstration that a single SAE encoder trained on a smaller model can transfer effectively across model scales and RL algorithms, highlighting engineering efficiency and generalizability.
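The difficulty-prediction idea in the list above can be sketched as follows, assuming scikit-learn's `ElasticNet` and synthetic data in place of real SAE features and difficulty labels. The hyperparameters, the labeled-subset size, and the random cluster assignment are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n_samples, n_features, n_labeled = 2000, 64, 300

# Synthetic non-negative features standing in for SAE activations.
X = np.maximum(rng.normal(size=(n_samples, n_features)), 0.0)
true_w = np.zeros(n_features)
true_w[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]  # only a few features matter
difficulty = X @ true_w + 0.1 * rng.normal(size=n_samples)

# Fit on a small labeled subset only, then score every sample.
labeled = rng.choice(n_samples, size=n_labeled, replace=False)
model = ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10_000)
model.fit(X[labeled], difficulty[labeled])
predicted = model.predict(X)

# Placeholder cluster assignment; in practice this comes from clustering.
cluster_ids = rng.integers(0, 4, size=n_samples)

# Easy-to-hard ordering of sample indices inside each cluster.
curriculum = {
    k: np.flatnonzero(cluster_ids == k)[np.argsort(predicted[cluster_ids == k])]
    for k in range(4)
}
```

ElasticNet combines L1 and L2 penalties, so it tends to keep a small set of informative features, which suits sparse inputs where most dimensions are inactive for any given sample.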
Methodology
- SAE Representation Extraction: For each training sample, extract hidden activations from the 27th layer of the LLM separately for prompt and solution spans. Encode these activations using a pretrained SAE, then aggregate via mean and max pooling to capture sustained and localized patterns, resulting in a 960-dimensional sparse feature vector concatenated with shallow metadata (e.g., length, TeX ratio).
- Diversity-Driven Batch Construction: Perform MiniBatchKMeans clustering in SAE feature space to partition samples into K clusters reflecting semantic and reasoning structures. Construct batches by grouping samples within clusters and apply moderate batch mixing by swapping a small tail portion of samples between adjacent batches from different clusters but similar difficulty and length, balancing gradient coherence and coverage.
- Difficulty-Driven Curriculum Ordering: Train an ElasticNet regressor on a small labeled subset (3,000 samples) to predict continuous difficulty scores from SAE features. Calibrate scores cluster-wise using shrinkage-based corrections to normalize scales. Sort samples within clusters by calibrated difficulty to form easy-to-hard trajectories. Globally interleave batches across clusters stage-wise.
- Quality-Driven Data Filtering: Train a linear classifier over SAE features to estimate the probability of a sample belonging to the target distribution, using labeled source data. Filter samples by thresholding or top-K ranking to exclude noisy or off-distribution data before curriculum construction.
- Training and Evaluation: Integrate SAERL with the GRPO and DAPO RL algorithms, train on the DeepMath-103K dataset with batch size 128, and evaluate on six mathematical reasoning benchmarks (GSM8K, AMC23, MATH500, MinervaMath, OlympiadBench, AIME24) using the Avg@8 and Pass@8 metrics.
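The representation-extraction step above can be sketched with NumPy. The encoder below is a toy ReLU projection standing in for a pretrained SAE, and all dimensions are arbitrary: they do not reproduce the 960-dimensional vector, whose exact composition is not described here. The helper names are hypothetical.

```python
import numpy as np


def sae_encode(hidden, W_enc, b_enc):
    """Toy SAE encoder: ReLU(hidden @ W_enc + b_enc). Real SAEs may differ."""
    return np.maximum(hidden @ W_enc + b_enc, 0.0)


def pooled_features(hidden, prompt_len, W_enc, b_enc, metadata):
    """hidden: (seq_len, d_model) activations of one sample at a fixed layer."""
    acts = sae_encode(hidden, W_enc, b_enc)          # (seq_len, d_sae)
    spans = (acts[:prompt_len], acts[prompt_len:])   # prompt span, solution span
    pooled = []
    for span in spans:
        pooled.append(span.mean(axis=0))  # sustained patterns
        pooled.append(span.max(axis=0))   # localized spikes
    return np.concatenate(pooled + [np.asarray(metadata, dtype=float)])


rng = np.random.default_rng(0)
d_model, d_sae, seq_len, prompt_len = 32, 8, 20, 6
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
hidden = rng.normal(size=(seq_len, d_model))

# Metadata here is [length, TeX ratio]; both values are made up.
vec = pooled_features(hidden, prompt_len, W_enc, b_enc, metadata=[seq_len, 0.1])
```

The resulting vector has `4 * d_sae + 2` entries: mean and max pooling for each of the two spans, followed by the two metadata values.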
Experiments
Experiments are conducted on the DeepMath-103K dataset, a large-scale mathematical reasoning corpus with annotated topic labels and difficulty scores. Two model scales, Qwen2.5-Math-1.5B and 7B, are used to test scalability. Baselines include vanilla GRPO and DAPO without curriculum, Difficulty Curriculum Learning using external difficulty labels, rollout accuracy-based curriculum methods, and hidden state compression-based data selection.
Evaluation metrics include average accuracy (Avg@8) across five benchmarks and pass rate (Pass@8) on the competition-level AIME24 dataset. Ablation studies dissect the contributions of difficulty sorting, batch mixing, and cluster grouping. Training efficiency is measured by the number of steps to reach target accuracy. Additional experiments assess SAE activations’ ability to predict data diversity (topic labels), difficulty (Spearman correlation), and quality (Pearson correlation), as well as their effectiveness in filtering noisy samples from a mixed data pool.
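As a small illustration of the two metrics, the sketch below assumes the common reading in which Avg@8 is the mean correctness over eight sampled answers per problem, and Pass@8 is the fraction of problems with at least one correct answer among the eight. The function names and the toy results are hypothetical.

```python
def avg_at_k(results):
    """results: one list of booleans per problem, one entry per sampled answer."""
    return sum(sum(r) / len(r) for r in results) / len(results)


def pass_at_k(results):
    """Fraction of problems solved by at least one of the sampled answers."""
    return sum(any(r) for r in results) / len(results)


results = [
    [True] * 8,                  # always solved
    [False] * 7 + [True],        # solved once in eight tries
    [False] * 8,                 # never solved
    [True] * 4 + [False] * 4,    # solved half the time
]
avg8 = avg_at_k(results)    # (1 + 0.125 + 0 + 0.5) / 4 = 0.40625
pass8 = pass_at_k(results)  # 3 of 4 problems solved at least once = 0.75
```

The second problem shows why the two metrics diverge: it contributes only 0.125 to Avg@8 but counts fully toward Pass@8.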
Results
SAERL achieves a 3.00% average accuracy improvement over vanilla GRPO on Qwen2.5-Math-1.5B and reduces training steps by 20% to reach target accuracy. Consistent improvements are observed across model scales and RL algorithms, with SAERL outperforming external difficulty label-based curricula and rollout accuracy methods. Ablations confirm that difficulty-based curriculum ordering is essential, while moderate batch mixing and cluster-first grouping further enhance performance. Batch diversity analysis reveals a concave relationship between mixing strength and performance, with moderate mixing yielding optimal results. SAE activations enable accurate prediction of data diversity (topic label linear probe accuracy significantly above majority baseline), difficulty (Spearman ρ up to 0.749), and quality (Pearson r improved to 0.3715). In noisy data selection, SAE-based linear classifiers achieve ROC-AUC of 0.9911, effectively filtering off-distribution samples.
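To make the ROC-AUC figure and the filtering step concrete, the sketch below computes ROC-AUC as the fraction of (target, noise) pairs ranked correctly and then filters by a score threshold. The scores, labels, threshold, and function names are made up for illustration and are unrelated to the reported 0.9911.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. labels: 1 = target distribution, 0 = noise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def filter_by_threshold(sample_ids, scores, threshold=0.5):
    """Keep samples whose predicted target-distribution probability is high."""
    return [i for i, s in zip(sample_ids, scores) if s >= threshold]


scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
sample_ids = [f"s{i}" for i in range(6)]

auc = roc_auc(scores, labels)                    # 8 of 9 pairs ranked correctly
kept = filter_by_threshold(sample_ids, scores)   # ["s0", "s1", "s3"]
```

The pairwise form is quadratic in the number of samples, which is fine for a toy check; a rank-based implementation would be preferable on a large data pool.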
Applications
SAERL is directly applicable to post-training RL of LLMs, particularly in structured reasoning domains like mathematics. Its lightweight SAE encoder facilitates rapid deployment and cross-model scalability, making it attractive for industrial training pipelines aiming to optimize data selection and curriculum without extensive external labeling. By improving training efficiency and final model accuracy, SAERL can reduce computational costs and accelerate model iteration. Future extensions could adapt the framework to code generation, agentic RL, multi-modal tasks, and general instruction following, broadening its impact across AI applications.
Limitations & Outlook
The study focuses on mathematical reasoning with verifiable rewards, limiting demonstrated applicability to other domains. Difficulty and quality proxies rely on limited labeled data and source distribution labels, hindering fully unsupervised operation. Theoretical understanding of SAE-space distances’ causal impact on training dynamics is incomplete, necessitating further gradient-level analyses. Additionally, the optimal batch mixing parameter may vary by task and model, requiring adaptive tuning.
Abstract
Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores the rich intrinsic signals in model internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties (diversity, difficulty, and quality) using model internals extracted with a Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that the SAE transfers effectively across model families and scales, serving as a lightweight and reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-training data engineering.