Automated reproducibility assessments in the social and behavioral sciences using large language models

TL;DR

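The study uses large language models (e.g., Claude 4.7) to automate reproducibility assessment in the social and behavioral sciences. Among studies with a viable estimate, the pipeline recovered the original effect size within ±0.05 (Cohen's d) in 41% of cases and reached the same qualitative conclusion as the original study in 96%.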

cs.AI 2026-06-12

Tobias Holtdirk Pietro Marcolongo Anna Steinberg Schulten Felix Henninger Stefan Rose Sarah Ball Bolei Ma Frauke Kreuter Markus Weinmann Stefan Feuerriegel

reproducibility large language models social sciences code generation effect size evaluation

Key Findings

Methodology

This paper introduces an automated reproducibility assessment pipeline built on large language models (LLMs), specifically Claude 4.7. The system ingests original datasets, research claims, and full-text articles, then generates executable statistical code to reproduce the reported results. The process involves:
• parsing the study materials and extracting relevant variables;
• prompting the LLM to produce analysis scripts;
• executing these scripts multiple times (five runs) to check stability;
• computing effect sizes (Cohen's d) from the generated outputs;
• comparing these effect sizes to the original findings within a ±0.05 tolerance;
• evaluating whether the qualitative conclusions (supporting or opposing the original claim) align.
The pipeline thus combines NLP, statistical programming, and robustness checks to support scalable, automated validation of empirical research.
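The comparison step at the end of the pipeline can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and collapsing the five runs with a median is an assumption made here, since the summary does not say how runs are aggregated. Only the ±0.05 tolerance in Cohen's d comes from the text.

```python
from statistics import median

TOLERANCE = 0.05  # tolerance in Cohen's d, as stated in the summary


def aggregate_runs(run_estimates):
    """Collapse per-run estimates into one value, skipping failed runs (None).

    Using the median is an assumption of this sketch.
    """
    valid = [d for d in run_estimates if d is not None]
    return median(valid) if valid else None


def compare_to_original(original_d, run_estimates, tolerance=TOLERANCE):
    """Report whether a viable estimate exists and whether it is within tolerance."""
    estimate = aggregate_runs(run_estimates)
    if estimate is None:
        return {"viable": False, "estimate": None, "within_tolerance": False}
    within = abs(estimate - original_d) <= tolerance
    return {"viable": True, "estimate": estimate, "within_tolerance": within}


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not taken from the paper.
    print(compare_to_original(0.42, [0.40, 0.41, None, 0.44, 0.40]))
```

A study for which every run fails is reported as not viable instead of being counted as a mismatch, which mirrors how the summary treats the studies without a usable estimate.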

Key Results

Across 76 published studies, the LLM produced a viable effect size estimate for 69 (about 91%); for the remaining 7 it could not. Among the 69, 41% of the reproduced effect sizes fell within ±0.05 of the original Cohen's d, and the pipeline reached the same qualitative conclusion as the original study in 96% of cases. Human reanalysts, by comparison, matched effect sizes in 34% of studies and conclusions in 74%. The correlation between LLM-derived and original effect sizes was weak (r=0.10), so point estimates vary even where the qualitative conclusion agrees. The automated pipeline also reduces manual effort and speeds up validation.
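To make the reported rates concrete, the sketch below shows one way such summary statistics could be computed from per-study records. The record layout and function names are hypothetical, the toy values are invented for illustration, and the rates are computed over studies with a viable estimate, which follows the abstract's wording ("for the remaining studies").

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Plain Pearson correlation; assumes both inputs vary and have equal length."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(records, tolerance=0.05):
    """Aggregate per-study results over studies with a viable estimate."""
    viable = [r for r in records if r["reproduced_d"] is not None]
    n = len(viable)
    matches = sum(abs(r["reproduced_d"] - r["original_d"]) <= tolerance for r in viable)
    agreements = sum(bool(r["supports_claim"]) for r in viable)
    return {
        "n_viable": n,
        "viable_share": n / len(records),
        "match_rate": matches / n,
        "agreement_rate": agreements / n,
        "pearson_r": pearson_r(
            [r["original_d"] for r in viable],
            [r["reproduced_d"] for r in viable],
        ),
    }


# Toy records, not data from the paper.
records = [
    {"original_d": 0.20, "reproduced_d": 0.22, "supports_claim": True},
    {"original_d": 0.50, "reproduced_d": 0.10, "supports_claim": True},
    {"original_d": 0.80, "reproduced_d": 0.79, "supports_claim": True},
    {"original_d": 0.30, "reproduced_d": 0.65, "supports_claim": False},
    {"original_d": 0.40, "reproduced_d": None, "supports_claim": None},
]
summary = summarize(records)
```

The toy records also show how a high agreement rate can coexist with a low match rate: a reproduced estimate can land far from the original and still support the same claim.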
The LLM's advantage over human reanalysts is largest for qualitative conclusions. The weak effect size correlation aligns with prior manual assessments, reflecting inherent variability in effect estimates. Because the approach scales, it allows systematic auditing of large literature corpora, giving journals, policymakers, and researchers a practical way to flag irreproducible findings.
Some studies (7 out of 76) failed to produce valid effect sizes, mainly due to missing data or ambiguous descriptions. The transformation of various test statistics into Cohen's d relies on assumptions that may not hold universally, potentially affecting accuracy. The model's understanding of complex statistical models remains limited, highlighting areas for future enhancement. Nonetheless, the overall performance indicates that LLMs can serve as reliable first-pass tools for reproducibility screening, especially when combined with human oversight.
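For readers unfamiliar with how test statistics are mapped onto Cohen's d, the following sketch lists three common textbook conversions. These are standard formulas, not necessarily the ones used in the paper, and each carries the kind of assumption the paragraph above warns about; the function names are hypothetical.

```python
import math


def d_from_t(t, n1, n2):
    """Independent-samples t statistic to Cohen's d.

    Assumes two independent groups of sizes n1 and n2 with a pooled variance.
    """
    return t * math.sqrt(1 / n1 + 1 / n2)


def d_from_r(r):
    """Correlation coefficient to Cohen's d.

    Assumes |r| < 1 and treats r like a point-biserial correlation
    with roughly equal group sizes.
    """
    return 2 * r / math.sqrt(1 - r ** 2)


def d_from_odds_ratio(odds_ratio):
    """Odds ratio to Cohen's d via the logit method.

    Assumes an underlying logistic distribution for the outcome.
    """
    return math.log(odds_ratio) * math.sqrt(3) / math.pi


if __name__ == "__main__":
    print(d_from_t(2.0, 50, 50))
    print(d_from_r(0.6))
    print(d_from_odds_ratio(1.0))
```

When a study's design departs from these assumptions (for example, a within-subject design fed into `d_from_t`), the converted value can differ noticeably from a directly computed d, which is one way the ±0.05 comparison can fail for reasons unrelated to reproducibility.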

Significance

This work marks a significant advancement in the automation of scientific validation, addressing the long-standing challenge of resource-intensive manual reanalysis in social sciences. By harnessing LLMs' natural language understanding and code generation capabilities, the framework enables rapid, large-scale auditing of empirical results. Such automation can improve transparency, foster reproducibility, and facilitate meta-analyses, ultimately strengthening the credibility of scientific findings. The approach also paves the way for integrating AI-driven tools into peer review and publication workflows, promoting a culture of openness and rigorous validation across disciplines.

Technical Contribution

The core technical innovation lies in integrating NLP-based understanding with statistical code generation within a robust, multi-run framework. The system employs prompt engineering to guide the LLM in producing accurate analysis scripts, which are then executed in isolated environments to ensure reproducibility. The pipeline standardizes effect size calculation from various test statistics, enabling cross-study comparability. Compared to prior tools, this approach offers a fully automated, scalable solution that combines natural language processing, statistical programming, and ensemble validation, setting a new benchmark for AI-assisted research verification.
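The multi-run execution step might look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's implementation: the function name is hypothetical, each script is assumed to print its effect size as the last line of standard output, and a fresh temporary directory plus a subprocess is only a light form of isolation (a real deployment would more likely use containers). The summary does not specify whether the five runs re-execute one script or regenerate the script each time, so the function accepts one script source per run.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_scripts(script_sources, timeout_s=60):
    """Run each script in its own temp directory and parse a float from stdout.

    Returns one entry per script: the parsed value, or None if the run failed,
    timed out, or did not print a number on its last output line.
    """
    results = []
    for source in script_sources:
        with tempfile.TemporaryDirectory() as workdir:
            script_path = Path(workdir) / "analysis.py"
            script_path.write_text(source, encoding="utf-8")
            try:
                completed = subprocess.run(
                    [sys.executable, str(script_path)],
                    cwd=workdir,
                    capture_output=True,
                    text=True,
                    timeout=timeout_s,
                )
            except subprocess.TimeoutExpired:
                results.append(None)
                continue
            if completed.returncode != 0:
                results.append(None)
                continue
            try:
                last_line = completed.stdout.strip().splitlines()[-1]
                results.append(float(last_line))
            except (ValueError, IndexError):
                results.append(None)
    return results
```

Recording failed runs as `None` instead of raising keeps the run count fixed, so a later aggregation step can tell unstable scripts apart from scripts that simply produce different estimates.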

Novelty

This is the first comprehensive framework applying large language models to automate the entire process of reproducibility assessment in social and behavioral sciences. Unlike previous efforts limited to code assistance or manual review, this pipeline automates understanding, code generation, execution, and evaluation, significantly reducing human labor. Its ability to process full-text articles, generate standardized code, and perform multi-run robustness checks distinguishes it from existing tools, representing a novel integration of AI and empirical validation.

Limitations

The effectiveness depends on data availability; studies with restricted or missing datasets cannot be analyzed, limiting scope.
Transformation assumptions for effect size standardization may introduce biases, especially in complex models.
Model errors or misunderstandings can lead to incorrect effect estimates, necessitating human verification.
Pretraining data contamination might inflate performance metrics, as some studies could have been encountered during training.
Computational costs, especially for large-scale assessments, remain significant, requiring optimization for broader deployment.

Future Work

Future research will focus on enhancing the model's understanding of complex statistical procedures, integrating multimodal inputs such as figures and equations, and refining effect size transformations. Developing hybrid systems combining AI automation with human oversight can improve accuracy and interpretability. Additionally, expanding the pipeline to handle more diverse datasets, including recent studies and those with limited data access, will broaden applicability. Long-term, integrating such tools into peer review workflows and research repositories could revolutionize scientific quality control, making reproducibility assessment a routine part of scholarly publishing.

AI Executive Summary

Reproducibility remains a cornerstone of scientific integrity, yet traditional methods for verifying research findings are labor-intensive and difficult to scale. In social and behavioral sciences, reanalyzing original datasets to confirm published results often requires significant manual effort, involving data cleaning, coding, and statistical validation. This bottleneck hampers large-scale systematic auditing and undermines confidence in empirical claims.

Recent advances in artificial intelligence, particularly large language models (LLMs) like Claude 4.7, offer a promising solution. These models excel at natural language understanding and code generation, enabling automation of complex analytical tasks. This study introduces an innovative pipeline that leverages LLMs to perform automated reproducibility assessments. The system ingests full-text articles, datasets, and research claims, then prompts the LLM to generate executable statistical code tailored to reproduce the specified effects. Multiple runs ensure robustness, and the results are compared against original effect sizes and conclusions.

The experimental evaluation involved 76 studies from the Multi100 dataset, covering economics, political science, and psychology. The pipeline produced a viable effect size estimate for 69 of them (about 91%), and 41% of those estimates matched the original effect size within a ±0.05 margin. The system also reached the same qualitative conclusion as the original study in 96% of cases, compared with 74% for human reanalysts. These results suggest that LLM-based automation can accelerate and scale the verification of empirical findings.

This approach offers multiple benefits. It reduces the manual effort required for data and code reanalysis, enabling routine quality checks during peer review or post-publication audits. It also provides a standardized, transparent method for effect size comparison and conclusion support, fostering greater trust in published research. Moreover, the pipeline's scalability allows systematic auditing across large literature corpora, identifying potentially irreproducible studies efficiently.

Despite its promise, the method has limitations. Some studies with missing data or ambiguous descriptions failed to produce valid effect sizes. Effect size transformations rely on assumptions that may not hold universally, and the models can occasionally misinterpret complex statistical procedures. Future work aims to improve understanding of intricate models, incorporate multimodal data, and develop hybrid human-AI systems for enhanced reliability. Overall, this research marks a significant step toward AI-assisted, large-scale scientific validation, promising a future where reproducibility checks are integrated seamlessly into the research lifecycle, ensuring higher standards of scientific rigor and transparency.

Deep Dive

Abstract

Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings can be recovered. However, such approaches are resource-intensive and difficult to scale. Here, we show that large language models (LLMs) can automate reproducibility assessments. Using N=76 published studies with predefined claims from the behavioral and social sciences, we compare LLM-generated analysis with the original findings and human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, our LLM pipeline recovered the original effect sizes in 41% of studies using a +/-0.05 tolerance in Cohen's d. Further, our LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where conclusions indicate whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases. Together, these results show that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.
