LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards
LongTraceRL improves long-context reasoning by generating multi-hop questions via knowledge graph random walks, building tiered distractors from search agent trajectories, and adding an entity-level rubric reward, yielding consistent gains across five long-context benchmarks.
Key Findings
Methodology
This work introduces LongTraceRL, a framework that synthesizes long-context training data through knowledge graph random walks, producing multi-hop questions with verifiable reasoning chains. It uses search agent trajectories to construct tiered distractors: Tier-1 distractors are documents the agent read but did not cite, and Tier-2 distractors are documents that appeared in search results but were never opened. Together they make the training contexts harder and more realistic. The reward design adds an entity-level rubric reward, which provides fine-grained supervision based on the recall of gold entities along the reasoning chain and is applied only to responses with correct final answers (a positive-only strategy). Training uses the Group Relative Policy Optimization (GRPO) algorithm, combining outcome and process rewards to guide policy updates. Experiments across five benchmarks with models from 4B to 30B parameters show consistent gains over strong baselines, especially on deep reasoning tasks.
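The tier split described above can be illustrated with a minimal sketch. The `Trajectory` record, its field names, and the document ids are hypothetical and not taken from the LongTraceRL code; the sketch also assumes that cited documents are the evidence documents and are therefore excluded from both tiers.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical record of one search-agent run (field names are assumptions)."""
    searched: list[str] = field(default_factory=list)  # doc ids returned by search calls
    opened: set[str] = field(default_factory=set)      # doc ids the agent actually read
    cited: set[str] = field(default_factory=set)       # doc ids cited in the final answer


def tier_distractors(traj: Trajectory) -> tuple[list[str], list[str]]:
    """Split non-cited documents into Tier-1 (read, not cited) and Tier-2 (never opened)."""
    tier1, tier2, seen = [], [], set()
    for doc_id in traj.searched:
        # Skip repeated search hits and cited documents (assumed to be evidence).
        if doc_id in seen or doc_id in traj.cited:
            continue
        seen.add(doc_id)
        if doc_id in traj.opened:
            tier1.append(doc_id)   # read but not cited: high confusability
        else:
            tier2.append(doc_id)   # surfaced but never opened: low confusability
    return tier1, tier2


traj = Trajectory(
    searched=["d1", "d2", "d3", "d4", "d2"],
    opened={"d1", "d2", "d3"},
    cited={"d1"},
)
tier1, tier2 = tier_distractors(traj)
print(tier1, tier2)
```

In this toy run, `d2` and `d3` land in Tier-1 because the agent opened them without citing them, while `d4` lands in Tier-2 because it only appeared in search results.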
Key Results
- On Qwen3-4B, LongTraceRL achieves an average score of 56.5, surpassing the LongRLVR baseline by 2.5 points and improving on the base model by 5.7 points, with the largest gains on complex reasoning benchmarks such as AA-LCR (up 8.6 points, from 33.2 to 41.8).
- In larger models (30B), performance improves from 60.5 to 63.7, indicating scalability and robustness of the approach. Ablation studies confirm that Tier-1 distractors and entity-level rubric rewards are critical for these improvements.
- The experimental results reveal that the entity-level rubric reward effectively encourages models to ground their reasoning in relevant evidence, reducing shortcut solutions and enhancing interpretability. The layered distractor strategy significantly increases training difficulty, leading to better generalization in downstream tasks.
Significance
This research addresses two basic obstacles in long-context reasoning: the scarcity of challenging training data and the sparsity of reward signals. By combining knowledge graph-based question generation, distractors derived from search trajectories, and entity-level supervision, it improves the reasoning depth, evidence reliance, and robustness of large language models. These gains are relevant to applications that require multi-step inference over long documents, such as legal analysis, scientific research, and automated decision-making.
Technical Contribution
The core technical contributions include: 1) a novel data construction pipeline utilizing knowledge graph random walks combined with search agent trajectories to produce realistic, layered distractors that challenge models; 2) a fine-grained entity-level rubric reward that supervises intermediate reasoning steps, only applied to correct answers to prevent reward hacking; 3) an effective training framework based on GRPO that balances outcome and process supervision, leading to improved reasoning capabilities. These innovations collectively push the state-of-the-art in long-context reinforcement learning, enabling models to perform deeper, more reliable reasoning.
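As a rough illustration of the second contribution, the sketch below combines an outcome reward with an entity-recall rubric term that is added only when the final answer is correct. The case-insensitive substring matching, the `rubric_weight` value, and the function names are all assumptions made for illustration; the paper's actual matching procedure and weighting may differ.

```python
def rubric_reward(response: str, gold_entities: list[str]) -> float:
    """Fraction of gold entities mentioned in the response.

    Uses case-insensitive substring matching, which is an assumption here.
    """
    if not gold_entities:
        return 0.0
    text = response.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    return hits / len(gold_entities)


def total_reward(
    response: str,
    answer_correct: bool,
    gold_entities: list[str],
    rubric_weight: float = 0.5,  # hypothetical weight, not from the source
) -> float:
    """Outcome reward plus a rubric bonus that is granted to correct answers only."""
    if not answer_correct:
        # Positive-only strategy: wrong answers never earn rubric credit.
        return 0.0
    return 1.0 + rubric_weight * rubric_reward(response, gold_entities)


gold = ["Marie Curie", "Warsaw", "Pierre Curie", "Paris"]
response = "Marie Curie was born in Warsaw, so the answer is Poland."
print(total_reward(response, True, gold))
print(total_reward(response, False, gold))
```

Gating the rubric term on correctness means a response cannot raise its reward by listing entities while still answering wrongly, which is the reward-hacking path the positive-only strategy is meant to close.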
Novelty
This work is the first to systematically incorporate search trajectory-derived layered distractors into long-context RL training, significantly increasing data complexity and realism. Additionally, it introduces an entity-level rubric reward that directly supervises the reasoning process, contrasting with prior outcome-only reward methods. The combination of knowledge graph-based question generation, layered distractors, and entity-focused supervision constitutes a novel paradigm that addresses the core issues of sparse rewards and superficial reasoning in existing approaches.
Limitations
- The data generation relies heavily on Wikipedia's knowledge graph, limiting diversity to encyclopedic knowledge and potentially reducing applicability to specialized domains like medicine or law without further adaptation.
- Search trajectories depend on the capabilities of the deployed search agent, which may introduce bias or variability in distractor quality, affecting training effectiveness.
- The entity-level rubric reward focuses on entity recall, but does not explicitly encode logical or causal reasoning structures, which could be vital for more complex inference tasks.
Future Work
Future directions include expanding knowledge sources beyond Wikipedia to include domain-specific graphs, developing more sophisticated search agents for diverse data collection, and integrating logical reasoning modules to enhance the interpretability and depth of the reasoning process. Additionally, exploring more scalable training strategies and applying the framework to multilingual or multimodal contexts could further broaden its impact.
AI Executive Summary
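Long-context reasoning has long been one of the core challenges for large language models (LLMs). As inputs grow longer, models trying to locate and integrate key information are often thrown off by large amounts of distracting content, which leads to shallow reasoning, skewed answers, and even false statements. Conventional reinforcement learning (RL) methods rely on sparse, outcome-oriented rewards that cannot effectively supervise intermediate reasoning steps, which limits how much the model's reasoning can improve.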
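To address this, LongTraceRL proposes a new training strategy. First, it generates complex multi-hop questions through knowledge graph random walks, so that the training data has real reasoning depth and semantic relatedness. Second, it uses search agent trajectories to build tiered distractors: Tier-1 distractors are documents the agent read during search but did not cite, and they are highly confusable; Tier-2 distractors are documents that appeared in search results but were never opened, and they are less relevant. This design makes training considerably harder and pushes the model to reason more carefully.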
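On the reward side, the work introduces an entity-level rubric reward: only when the model's final answer is correct does it provide fine-grained process supervision, based on which entities the reasoning chain references. This positive-only strategy keeps the model from earning reward through skipped reasoning steps or reward hacking. Combined with the outcome reward, the model performs well on five long-context reasoning benchmarks, with an average gain of 5.7 points and especially clear gains on complex reasoning tasks.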
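Experiments show that LongTraceRL generalizes well across model sizes (4B to 30B) and clearly outperforms existing long-context RL methods. Its data construction and reward design offer a new technical route for long-context reasoning and could help advance applications such as automatic question answering and knowledge reasoning. Future work will expand the knowledge sources and diversify the reasoning logic, with the aim of improving the depth and breadth of model reasoning and supporting deployment in complex scenarios.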
Deep Analysis
Background
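With the rise of large pretrained language models (such as GPT and BERT), long-text understanding and reasoning has become an active research area. Early methods mostly relied on short texts or limited context and struggled with the long documents found in real scenarios. More recently, RL-based methods (e.g., DeepSeek-AI, LongRL) have tried to guide models toward deeper reasoning through reward mechanisms, but they are held back by sparse reward signals and weak distractor design, so reasoning depth and evidence reliance remain insufficient. Knowledge graphs provide structured support for multi-hop reasoning, and combining them with search strategies to make distractors more complex has become an important direction for improving long-context reasoning.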
Core Problem
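Current long-context reasoning models perform poorly when faced with large amounts of distracting content. The main cause is that training data lacks complexity: distractors are mostly sampled at random and have little semantic relation to the question, so the model never learns to separate key information from distraction. At the same time, the reward signal is mostly the correctness of the final answer, which cannot supervise intermediate reasoning steps. Models can therefore earn high reward through skipped reasoning steps or reward hacking, which undermines the faithfulness and reliability of their reasoning. These problems limit how well models handle complex real-world scenarios.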
Innovation
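This work makes two main innovations. First, it generates multi-hop questions with knowledge graph random walks and builds tiered distractors (Tier-1 and Tier-2) from search trajectories, which makes the training data much more complex and realistic. Second, it introduces an entity-level rubric reward that, only when the answer is correct, provides fine-grained process supervision based on which entities the reasoning chain references, mitigating both reward sparsity and reward hacking. Together these address the bottlenecks of weak distractor design and insufficient reward signals in earlier methods.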
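The first innovation starts from a random walk over a knowledge graph. The sketch below shows one simple way to sample such a path; the toy graph, the no-revisit rule, and the function name are assumptions for illustration and are not taken from the paper's pipeline, which works over the Wikipedia knowledge graph.

```python
import random

# Toy knowledge graph: entity -> list of (relation, neighbor). Illustrative only.
KG = {
    "Marie Curie": [("born_in", "Warsaw"), ("spouse", "Pierre Curie")],
    "Warsaw": [("capital_of", "Poland")],
    "Pierre Curie": [("born_in", "Paris")],
    "Paris": [("capital_of", "France")],
    "Poland": [],
    "France": [],
}


def random_walk(kg, start, hops, rng):
    """Sample a path of up to `hops` edges without revisiting an entity."""
    path, visited, node = [], {start}, start
    for _ in range(hops):
        candidates = [(rel, nxt) for rel, nxt in kg.get(node, []) if nxt not in visited]
        if not candidates:
            break  # dead end: return the shorter path
        rel, nxt = rng.choice(candidates)
        path.append((node, rel, nxt))
        visited.add(nxt)
        node = nxt
    return path


path = random_walk(KG, "Marie Curie", hops=3, rng=random.Random(0))
# Entities along the path can serve as the gold entities for a rubric reward.
gold_entities = [path[0][0]] + [tail for _, _, tail in path]
print(path)
print(gold_entities)
```

Each sampled path gives both a chain of relations that a question writer (human or LLM) can turn into a multi-hop question and a list of intermediate entities that can later be checked in the model's reasoning.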
Methodology
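- Multi-hop question generation: run random walks over the Wikipedia knowledge graph to sample entity paths, then use an LLM (such as GPT-5.2) to generate multi-hop questions and answers that satisfy the constraints.
- Search trajectory construction: use an enhanced search agent to simulate answering each question and record its trajectory, including the search, read, and cite steps.
- Distractor construction: based on the trajectory, treat documents that were read but not cited as Tier-1 distractors and search results that were never opened as Tier-2 distractors, which makes the distractors more semantically related and harder.
- Training data assembly: apply the traj-tiered strategy, which adds Tier-1 distractors first so that the training data is difficult and realistic.
- Reward design: introduce an entity-level rubric reward that, only when the model's answer is correct, gives fine-grained reward based on which entities the reasoning chain references.
- Optimization: train with GRPO, combining the outcome reward and the process reward to optimize the policy and improve reasoning depth and evidence reliance.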
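The data assembly step can be sketched as a budgeted fill that always keeps the evidence documents, then adds Tier-1 distractors before Tier-2. The whitespace token count, the greedy fill, and the final shuffle are assumptions made for this illustration; the source only states that Tier-1 distractors are added first.

```python
import random


def count_tokens(doc: str) -> int:
    """Crude whitespace token count (an assumption; a real tokenizer would differ)."""
    return len(doc.split())


def assemble_context(gold_docs, tier1_docs, tier2_docs, budget, rng):
    """Keep all gold docs, then greedily add Tier-1 and Tier-2 docs that fit the budget."""
    chosen = list(gold_docs)
    used = sum(count_tokens(doc) for doc in chosen)
    # Tier-1 comes first, so it gets priority when the budget is tight.
    for doc in list(tier1_docs) + list(tier2_docs):
        cost = count_tokens(doc)
        if used + cost <= budget:
            chosen.append(doc)
            used += cost
    # Shuffle so that document position does not reveal which ones are evidence.
    rng.shuffle(chosen)
    return chosen


context = assemble_context(
    gold_docs=["g one two"],
    tier1_docs=["a b c d", "e f"],
    tier2_docs=["h i j k l", "m"],
    budget=10,
    rng=random.Random(0),
)
print(context)
```

With a budget of 10 tokens, both Tier-1 documents fit, the five-token Tier-2 document is skipped, and the one-token Tier-2 document fills the remaining slot.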
Experiments
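Models of different sizes (4B, 8B, 30B) are trained and evaluated on five long-context reasoning benchmarks (such as LongBench and AA-LCR). Baselines include existing RL methods (such as LongRLVR), random-distractor methods, and one-shot search methods. Metrics cover answer accuracy, reasoning depth, and evidence coverage. Ablation studies verify the contributions of Tier-1 distractors, the entity-level reward, and the positive-only strategy. Training uses a maximum context length of 128K with GRPO optimization for 200 rounds; evaluation uses a temperature of 0.6 and a maximum generation length of 32K.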
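To show how GRPO uses such rewards, the sketch below computes group-relative advantages in the commonly described form: each sampled response's reward is normalized by the mean and standard deviation of its group. The use of the population standard deviation, the `eps` constant, and the example reward values are assumptions; the paper's exact training configuration is not reproduced here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the group's mean and (population) standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every response in the group gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four responses to one prompt:
# two wrong answers (0.0) and two correct answers with different rubric bonuses.
rewards = [0.0, 0.0, 1.2, 1.5]
advantages = group_relative_advantages(rewards)
print(advantages)

# With an outcome-only reward, both correct responses would look identical.
outcome_only = group_relative_advantages([0.0, 0.0, 1.0, 1.0])
print(outcome_only)
```

The comparison illustrates why a rubric term matters under GRPO: with outcome-only rewards the two correct responses receive the same advantage, whereas the rubric bonus lets the better-grounded one receive a larger update signal.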
Results
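Experiments show that LongTraceRL raises the average score across the five benchmarks by 5.7 points, with the largest gain on AA-LCR (8.6 points), clearly ahead of the compared methods. The model does especially well on complex reasoning tasks, with stronger reasoning depth and evidence reliance. Ablation studies show that Tier-1 distractors and the entity-level reward are the key drivers of the improvement, and that the positive-only strategy prevents reward hacking. Experiments across model sizes indicate that the method generalizes and scales well.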
Applications
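The method suits scenarios that need deep understanding and multi-hop reasoning, such as automatic question answering, knowledge reasoning, law, and finance. By building high-difficulty training data, it strengthens model reasoning in complex settings and meets industry needs for reliability and evidence grounding. In the future it could be combined with multi-source knowledge graphs and logical reasoning models to support commercial question answering systems.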
Limitations & Outlook
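The method depends on the Wikipedia knowledge graph, whose coverage is limited and does not extend well to specialized or industry knowledge, which restricts generalization. Search trajectories are produced by a specific search agent, which may be biased or not capable enough, affecting distractor quality. The reward mechanism focuses on entity references and does not fully account for the logical soundness or causal structure of the reasoning; future work should combine it with logical reasoning models.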
Plain Language Accessible to non-experts
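Imagine looking for an answer in an enormous library. It holds thousands upon thousands of books: some look very similar to each other, and some are completely unrelated. You want the right answer, but many of the books get in your way and make you lose track. So you search in a smart way: you first find some relevant books, read only parts of them, and remember the important clues, without necessarily quoting all of them. Along the way you also run into books that look relevant but actually mislead you. To train a smart assistant, we let it practice in this library: it uses a search strategy to find many books and learns to tell which ones really matter and which are distractions. We also reward it, but only when it finds the correct answer, and the score then also depends on whether it referred to the key clues in its reasoning. An assistant trained this way not only gives the right answer but can also explain clearly how it got there, like a good detective walking others through the reasoning.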
Glossary
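Knowledge Graph
A structured knowledge representation that organizes entities and relations as a graph, which makes reasoning and search easier.
The underlying structure used to generate multi-hop questions and distractors.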
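Reinforcement Learning with Verifiable Rewards (RLVR)
A reinforcement learning approach that guides the model with rewards that can be verified, typically by checking the final answer against a reference.
The core training framework for the long-context reasoning model.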
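Rubric Reward
A fine-grained process-supervision reward based on which gold entities the model references in its reasoning chain.
The key mechanism for improving reasoning depth and evidence reliance.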
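Tiered Distractors
Distractor documents split, according to the search trajectory, into a high-confusability tier and a low-confusability tier, which makes training harder.
Used to build more challenging training data.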
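Group Relative Policy Optimization (GRPO)
A reinforcement learning optimization algorithm that samples a group of responses per prompt and scores each one relative to the rewards of the rest of the group when updating the policy.
Used during training to improve the model's reasoning ability.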
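Multi-hop Question
A question whose answer can only be reached through several reasoning steps that connect multiple entities or pieces of information.
The key task type for training deep reasoning.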
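Search Agent
A model that carries out search behavior and is used to collect search trajectories and build distractors.
An important tool for distractor generation and data augmentation.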
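Entity-Level Supervision
Fine-grained supervision of entity references during reasoning, which strengthens the model's attention to key entities.
The core design idea behind the reward mechanism.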
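Knowledge Graph Random Walk
A random traversal of entity paths in a knowledge graph, used to generate multi-hop questions.
The basic method for generating the question-answer data.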
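Positive-Only Strategy
The rubric reward is granted only when the model's final answer is correct, which prevents the model from earning reward through reward hacking.
A key strategy in the design of the reward mechanism.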
Abstract
Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B to 30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Code, datasets, and models are available at https://github.com/THU-KEG/LongTraceRL.