Automated reproducibility assessments in the social and behavioral sciences using large language models
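An LLM pipeline (Claude 4.7) automates reproducibility assessment in the social and behavioral sciences: across 76 published studies, it recovered the original effect size within ±0.05 (Cohen's d) in 41% of the studies with a usable estimate and reached the same qualitative conclusion as the original study in 96% of cases.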
Key Findings
Methodology
This paper introduces an automated reproducibility assessment pipeline leveraging large language models (LLMs), specifically Claude 4.7. The system ingests original datasets, research claims, and full-text articles, then generates executable statistical code to reproduce the reported results. The process involves:
- parsing the study materials and extracting relevant variables;
- prompting the LLM to produce analysis scripts;
- executing these scripts multiple times (five runs) to ensure stability;
- computing effect sizes (Cohen's d) from the generated outputs;
- comparing these effect sizes to original findings within a ±0.05 tolerance;
- evaluating whether the qualitative conclusions (supporting or opposing the original claim) align.
This comprehensive pipeline combines NLP, statistical programming, and robustness checks to facilitate scalable, automated validation of empirical research.
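The final comparison steps of the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, aggregating the five runs by their median is an assumption of this sketch (the summary does not say how runs are combined), and the qualitative conclusions are passed in as booleans because the summary does not define how support is determined.

```python
from statistics import median

TOLERANCE = 0.05  # tolerance in Cohen's d, as stated in the text


def aggregate_runs(run_estimates):
    """Combine per-run Cohen's d estimates, skipping failed runs (None).

    Using the median is an assumption of this sketch.
    Returns None when no run produced a usable estimate.
    """
    valid = [d for d in run_estimates if d is not None]
    return median(valid) if valid else None


def effect_size_matches(original_d, reproduced_d, tol=TOLERANCE):
    """True when the reproduced d lies within +/- tol of the original d."""
    # A tiny epsilon guards against floating-point noise at the boundary.
    return abs(original_d - reproduced_d) <= tol + 1e-12


def assess(original_d, run_estimates, original_supports, reproduced_supports):
    """Summarise one study: validity, effect-size match, conclusion match."""
    reproduced_d = aggregate_runs(run_estimates)
    if reproduced_d is None:
        # No run yielded an effect size, so nothing can be compared.
        return {"valid": False, "effect_match": None, "conclusion_match": None}
    return {
        "valid": True,
        "reproduced_d": reproduced_d,
        "effect_match": effect_size_matches(original_d, reproduced_d),
        "conclusion_match": original_supports == reproduced_supports,
    }
```

Calling `assess(0.40, [0.42, 0.43, None, 0.41, 0.44], True, True)` treats the third run as failed, aggregates the remaining four estimates, and reports both an effect-size match and a conclusion match. A study where every run fails is reported as not valid rather than as a mismatch, which mirrors how the summary counts studies without a usable estimate separately.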
Key Results
- In a dataset of 76 published studies, the LLM generated valid effect size estimates for 69 studies (about 91%; 7 studies yielded no usable estimate). Among these, 41% had effect sizes within ±0.05 of the original. The system reached the same qualitative conclusion as the original study in 96% of cases, compared with human reanalysts, who matched effect sizes in 34% of studies and conclusions in 74%. The correlation between LLM-derived and original effect sizes was weak (r=0.10), indicating that effect size estimates vary considerably even when the qualitative conclusion agrees. The automated pipeline reduces manual effort and speeds up the validation process.
- Compared to human reanalysts, the LLM agreed with the original qualitative conclusion more often (96% versus 74%). The effect size correlation, although weak, aligns with prior manual assessments, reflecting inherent variability in effect estimates. The approach's scalability allows systematic auditing of large literature corpora, providing a practical tool for journals, policymakers, and researchers to identify irreproducible findings efficiently.
- Some studies (7 out of 76) failed to produce valid effect sizes, mainly due to missing data or ambiguous descriptions. The transformation of various test statistics into Cohen's d relies on assumptions that may not hold universally, potentially affecting accuracy. The model's understanding of complex statistical models remains limited, highlighting areas for future enhancement. Nonetheless, the overall performance indicates that LLMs can serve as reliable first-pass tools for reproducibility screening, especially when combined with human oversight.
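To make the point about conversion assumptions concrete, the sketch below implements three textbook conversions to Cohen's d. Which conversions the pipeline actually applies is not stated in this summary, so the selection and the function names here are assumptions chosen for illustration; each docstring names the assumption the formula depends on.

```python
import math


def d_from_t(t, n1, n2):
    """Cohen's d from an independent-samples t statistic.

    Assumes two independent groups and a pooled-variance t-test.
    """
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)


def d_from_r(r):
    """Cohen's d from a correlation coefficient r.

    Assumes a dichotomous grouping variable with roughly equal group sizes.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 2.0 * r / math.sqrt(1.0 - r * r)


def d_from_odds_ratio(odds_ratio):
    """Cohen's d from an odds ratio.

    Assumes an underlying continuous trait with a logistic distribution.
    """
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    return math.log(odds_ratio) * math.sqrt(3.0) / math.pi
```

The sensitivity to assumptions is easy to see: `d_from_t(2.0, 50, 50)` gives d = 0.4, while the same t statistic with groups of 20 and 80 gives d = 0.5. That difference of 0.1 is twice the ±0.05 tolerance, so a misread group split alone can turn a match into a mismatch.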
Significance
This work marks a significant advancement in the automation of scientific validation, addressing the long-standing challenge of resource-intensive manual reanalysis in social sciences. By harnessing LLMs' natural language understanding and code generation capabilities, the framework enables rapid, large-scale auditing of empirical results. Such automation can improve transparency, foster reproducibility, and facilitate meta-analyses, ultimately strengthening the credibility of scientific findings. The approach also paves the way for integrating AI-driven tools into peer review and publication workflows, promoting a culture of openness and rigorous validation across disciplines.
Technical Contribution
The core technical innovation lies in integrating NLP-based understanding with statistical code generation within a robust, multi-run framework. The system employs prompt engineering to guide the LLM in producing accurate analysis scripts, which are then executed in isolated environments to ensure reproducibility. The pipeline standardizes effect size calculation from various test statistics, enabling cross-study comparability. Compared to prior tools, this approach offers a fully automated, scalable solution that combines natural language processing, statistical programming, and ensemble validation, setting a new benchmark for AI-assisted research verification.
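One way to picture the isolated, multi-run execution step is the sketch below, which runs a generated script several times, each in a fresh interpreter process and a fresh temporary directory. It is an assumption-laden illustration: the summary does not describe the actual execution environment, the function name and the `analysis.py` file name are hypothetical, and process-level isolation of this kind is not a security sandbox. It also reruns one fixed script, whereas the real pipeline may regenerate code for each run.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script_isolated(script_text, n_runs=5, timeout_s=60):
    """Run a generated analysis script n_runs times, each in a fresh process.

    Every run gets its own temporary working directory, so files written by
    one run cannot leak into the next. Returns one result dict per run.
    """
    results = []
    for run_index in range(n_runs):
        with tempfile.TemporaryDirectory() as workdir:
            script_path = Path(workdir) / "analysis.py"  # hypothetical name
            script_path.write_text(script_text, encoding="utf-8")
            try:
                proc = subprocess.run(
                    # -I starts Python in isolated mode (ignores user
                    # site-packages and PYTHON* environment variables).
                    [sys.executable, "-I", str(script_path)],
                    cwd=workdir,
                    capture_output=True,
                    text=True,
                    timeout=timeout_s,
                )
                results.append({
                    "run": run_index,
                    "ok": proc.returncode == 0,
                    "stdout": proc.stdout.strip(),
                    "stderr": proc.stderr.strip(),
                })
            except subprocess.TimeoutExpired:
                # A run that hangs is recorded as failed rather than retried.
                results.append({
                    "run": run_index,
                    "ok": False,
                    "stdout": "",
                    "stderr": "timeout",
                })
    return results
```

A caller would parse the effect size from each successful run's `stdout` and treat runs with `ok == False` as missing estimates. Recording failures explicitly, instead of raising, keeps a single broken run from discarding the other four.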
Novelty
This is the first comprehensive framework applying large language models to automate the entire process of reproducibility assessment in social and behavioral sciences. Unlike previous efforts limited to code assistance or manual review, this pipeline automates understanding, code generation, execution, and evaluation, significantly reducing human labor. Its ability to process full-text articles, generate standardized code, and perform multi-run robustness checks distinguishes it from existing tools, representing a novel integration of AI and empirical validation.
Limitations
- The effectiveness depends on data availability; studies with restricted or missing datasets cannot be analyzed, limiting scope.
- Transformation assumptions for effect size standardization may introduce biases, especially in complex models.
- Model errors or misunderstandings can lead to incorrect effect estimates, necessitating human verification.
- Pretraining data contamination might inflate performance metrics, as some studies could have been encountered during training.
- Computational costs, especially for large-scale assessments, remain significant, requiring optimization for broader deployment.
Future Work
Future research will focus on enhancing the model's understanding of complex statistical procedures, integrating multimodal inputs such as figures and equations, and refining effect size transformations. Developing hybrid systems combining AI automation with human oversight can improve accuracy and interpretability. Additionally, expanding the pipeline to handle more diverse datasets, including recent studies and those with limited data access, will broaden applicability. Long-term, integrating such tools into peer review workflows and research repositories could revolutionize scientific quality control, making reproducibility assessment a routine part of scholarly publishing.
AI Executive Summary
Reproducibility remains a cornerstone of scientific integrity, yet traditional methods for verifying research findings are labor-intensive and difficult to scale. In social and behavioral sciences, reanalyzing original datasets to confirm published results often requires significant manual effort, involving data cleaning, coding, and statistical validation. This bottleneck hampers large-scale systematic auditing and undermines confidence in empirical claims.
Recent advances in artificial intelligence, particularly large language models (LLMs) like Claude 4.7, offer a promising solution. These models excel at natural language understanding and code generation, enabling automation of complex analytical tasks. This study introduces an innovative pipeline that leverages LLMs to perform automated reproducibility assessments. The system ingests full-text articles, datasets, and research claims, then prompts the LLM to generate executable statistical code tailored to reproduce the specified effects. Multiple runs ensure robustness, and the results are compared against original effect sizes and conclusions.
The experimental evaluation involved 76 studies from the Multi100 dataset, covering economics, political science, and psychology. Of these, 69 studies (about 91%) yielded valid effect size estimates, and 41% of those matched the original effect sizes within a ±0.05 margin. More importantly, the system reached the same qualitative conclusion as the original study in 96% of cases, compared with 74% for human reanalysts. These results indicate that LLM-based automation can accelerate and scale the process of verifying empirical findings.
This approach offers multiple benefits. It reduces the manual effort required for data and code reanalysis, enabling routine quality checks during peer review or post-publication audits. It also provides a standardized, transparent method for effect size comparison and conclusion support, fostering greater trust in published research. Moreover, the pipeline's scalability allows systematic auditing across large literature corpora, identifying potentially irreproducible studies efficiently.
Despite its promise, the method has limitations. Some studies with missing data or ambiguous descriptions failed to produce valid effect sizes. Effect size transformations rely on assumptions that may not hold universally, and the models can occasionally misinterpret complex statistical procedures. Future work aims to improve understanding of intricate models, incorporate multimodal data, and develop hybrid human-AI systems for enhanced reliability. Overall, this research marks a significant step toward AI-assisted, large-scale scientific validation, promising a future where reproducibility checks are integrated seamlessly into the research lifecycle, ensuring higher standards of scientific rigor and transparency.
Deep Dive
Abstract
Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings can be recovered. However, such approaches are resource-intensive and difficult to scale. Here, we show that large language models (LLMs) can automate reproducibility assessments. Using N=76 published studies with predefined claims from the behavioral and social sciences, we compare LLM-generated analysis with the original findings and human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, our LLM pipeline recovered the original effect sizes in 41% of studies using a ±0.05 tolerance in Cohen's d. Further, our LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where conclusions indicate whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases. Together, these results show that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.