Automated reproducibility assessments in the social and behavioral sciences using large language models

TL;DR

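Large language models (for example, Claude 4.7) are used to automatically assess the reproducibility of studies in the social and behavioral sciences. The assessment matches effect sizes and checks whether conclusions agree, which makes review more efficient.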

cs.AI · Advanced · 2026-06-12

Tobias Holtdirk Pietro Marcolongo Anna Steinberg Schulten Felix Henninger Stefan Rose Sarah Ball Bolei Ma Frauke Kreuter Markus Weinmann Stefan Feuerriegel

reproducibility · large language models · social science · code generation · effect size assessment

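Key Findings

Methodology

The paper proposes a framework for automated reproducibility assessment based on large language models (LLMs). The framework works as follows:
• The inputs are the study data, the original paper text, and predefined research claims.
• The LLM (Claude 4.7 in this summary) generates executable statistical analysis code.
• The analysis is run multiple times (five runs) to obtain effect size estimates (Cohen's d).
• A tolerance of ±0.05 decides whether the effect size matches the original result, and the pipeline also assesses whether the reanalysis supports the original conclusion.
The pipeline combines natural language understanding, code generation, and statistical analysis, with the goal of large-scale automated verification of research.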
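The matching step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and combining the five runs with a median is an assumption, since the summary does not say how runs are aggregated. Only the ±0.05 tolerance in Cohen's d comes from the source.

```python
from statistics import median

TOLERANCE = 0.05  # tolerance in Cohen's d, as stated in the summary


def aggregate_runs(estimates):
    """Combine per-run effect sizes into one estimate; failed runs are None.

    Using the median is an assumption made for this sketch.
    """
    valid = [d for d in estimates if d is not None]
    return median(valid) if valid else None


def matches_original(original_d, reanalysis_d, tolerance=TOLERANCE):
    """Return True when the reanalysis lies within the tolerance of the original."""
    if reanalysis_d is None:
        return False
    # The tiny epsilon guards against floating-point noise at the boundary.
    return abs(reanalysis_d - original_d) <= tolerance + 1e-12


# Hypothetical values for illustration only, not data from the paper.
runs = [0.41, 0.43, None, 0.42, 0.44]
estimate = aggregate_runs(runs)
print(estimate, matches_original(0.40, estimate))
```

A study whose runs all fail returns `None` from `aggregate_runs`, which corresponds to the "no viable effect size estimate" outcome and never counts as a match.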

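Key Results

Across 76 published studies, the LLM produced a viable effect size estimate for 69 (about 91%); for the remaining 7 it could not. Among the studies with a viable estimate, 41% recovered the original effect size within the ±0.05 tolerance. The LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, compared with 74% for human reanalysts. A correlation coefficient of 0.10 is also reported for effect size matching; the pipeline's practical strength lies in assessing whether conclusions are supported.
Human reanalysts recovered the original effect size in 34% of studies, so the LLM pipeline matched the original results more often and at lower cost. For large datasets and complex statistical models in particular, the LLM can generate analysis code quickly, which substantially reduces manual effort.
The 7 studies without a viable effect size estimate failed mainly because of missing data, complex models, or ambiguous descriptions. This points to a need for better model comprehension and data handling in future work.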
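The share of studies with a viable estimate follows directly from the two counts given in the abstract (76 studies, 7 without a viable estimate). The short calculation below only reproduces that arithmetic; the variable names are chosen for this illustration.

```python
TOTAL_STUDIES = 76      # N reported in the abstract
NO_VIABLE_ESTIMATE = 7  # studies for which no viable effect size was produced

viable = TOTAL_STUDIES - NO_VIABLE_ESTIMATE
viable_share = viable / TOTAL_STUDIES

print(f"{viable} of {TOTAL_STUDIES} studies ({viable_share:.1%}) yielded a viable estimate")
# prints "69 of 76 studies (90.8%) yielded a viable estimate"
```

The 41% match rate is reported for these remaining 69 studies, not for all 76.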

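Significance

The study addresses a bottleneck in the social sciences, where reproducibility checks have traditionally depended on manual reanalysis, by using large language models to automate and scale these checks. The approach makes research review more efficient and provides a basic tool for systematic auditing and meta-analysis, which supports scientific transparency and research quality. Standardized effect size matching and conclusion consistency checks make it possible to identify potentially irreproducible studies faster and more objectively, giving policy making and academic evaluation a sounder basis.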

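Technical Contributions

The core technical contribution is an automated analysis pipeline based on Claude 4.7 that combines natural language processing (NLP) with statistical analysis. The system reads the study materials, generates code that follows statistical standards, and runs the analysis several times to check that results are stable. Compared with traditional manual reanalysis, the method is efficient, scalable, and standardized. Further contributions include:
• predefined analysis templates to ensure consistency;
• a multiple-run strategy to improve the reliability of results;
• a ±0.05 effect size tolerance as the matching criterion, which makes the assessment more objective.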

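Novelty

This is presented as the first framework to systematically apply large language models to automated reproducibility assessment in the social and behavioral sciences. Unlike earlier approaches that rely on manual work or limited automation tools, it automates the full process from text comprehension through code generation to effect verification, extending the use of AI in scientific quality control. Its contribution lies in combining the paper text and study data with statistical metrics to provide a scalable, standardized verification tool that can serve as a model for large-scale research review.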

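Limitations

For some studies the model could not produce a viable effect size, mainly because of missing data or ambiguous descriptions, which limits how broadly the method applies.
Converting reported statistics (such as t or F values) to a standardized effect size (Cohen's d) rests on assumptions that may not hold, which can affect matching precision.
The model may be influenced by its pretraining data, so information leakage is possible and could compromise the objectivity of the results.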
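To make the conversion limitation concrete, the sketch below uses the textbook conversion from an independent-samples t statistic to Cohen's d, d = t · sqrt(1/n1 + 1/n2). Whether the paper uses this exact conversion is not stated in the summary; the function names and example numbers are hypothetical. The formula assumes two independent groups and a pooled-variance t-test, and applying it to a different design (for example, paired samples) gives a different and misleading d.

```python
import math


def d_from_t(t, n1, n2):
    """Cohen's d from an independent-samples, pooled-variance t statistic."""
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)


def d_from_f(f, n1, n2):
    """Absolute Cohen's d from a one-degree-of-freedom F statistic.

    With two groups, F equals t squared, so the direction of the effect is lost.
    """
    return d_from_t(math.sqrt(f), n1, n2)


# Hypothetical numbers for illustration only.
print(round(d_from_t(2.0, 50, 50), 3))  # prints 0.4
print(round(d_from_f(4.0, 50, 50), 3))  # prints 0.4
```

With a ±0.05 tolerance, even a small error introduced by a wrong design assumption in such a conversion can turn a match into a mismatch.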

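Future Directions

Future work will focus on improving the model's understanding of complex statistical models and the accuracy of its effect size estimates. Multimodal information such as figures and formulas could enrich the model's input and widen the range of studies it can handle. The plan also includes finer-grained quality control metrics combined with human expert judgment, forming a hybrid assessment system with higher reliability and broader applicability.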

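AI Overview

Ensuring that research results are reproducible is a core goal of the scientific community. Traditional reanalysis depends on people reconstructing the original data and analysis step by step, which is slow and hard to scale. Large language models (LLMs) such as Claude 4.7 open new possibilities for automated verification. This study proposes an LLM-based framework for automated reproducibility assessment that targets the verification bottleneck in the social and behavioral sciences.

The framework takes the study data, the paper text, and predefined research claims as input. The LLM generates the corresponding statistical analysis code, which is run several times to check that results are stable. By comparing the result with the original effect size (tolerance ±0.05) and judging whether the conclusion is supported, the system can assess reproducibility quickly. In the reported experiments with Claude 4.7, the system produced a viable effect size for 69 of 76 studies (about 91%), and 41% of those recovered the original effect size within the tolerance. The system reached the same qualitative conclusion as the original study in 96% of cases, compared with 74% for human reanalysis.

These results suggest that LLM-based assessment tools can substantially increase the efficiency and scale of research review. Such a tool can act as a first-pass screen for potentially irreproducible studies and can supply base data for systematic auditing and meta-analysis. For some complex studies the model still fails to produce a viable effect size, mainly because of missing data or ambiguous descriptions. Future work will focus on improving model comprehension, enriching the input, and incorporating human expert judgment to build a more reliable automated verification system.

In sum, the study offers the scientific community a new tool for transparency and quality control. As the technology matures, large-scale automated verification of research becomes feasible and can give scientific work a firmer footing.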

In-Depth Reading

Original Abstract

Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings can be recovered. However, such approaches are resource-intensive and difficult to scale. Here, we show that large language models (LLMs) can automate reproducibility assessments. Using N=76 published studies with predefined claims from the behavioral and social sciences, we compare LLM-generated analysis with the original findings and human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, our LLM pipeline recovered the original effect sizes in 41% of studies using a +/-0.05 tolerance in Cohen's d. Further, our LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where conclusions indicate whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases. Together, these results show that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.
