Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI

TL;DR

This study shows that large language models are influencing peer reviews at top AI conferences, especially their linguistic complexity and evaluative focus.

cs.CL · 2026-04-21
Wenqing Wu, Chengzhi Zhang, Yi Zhao, Tong Bao
Large Language Models · Peer Review · Academic Communication · Text Analysis · Artificial Intelligence

Key Findings

Methodology

This study employs a maximum likelihood estimation method to identify peer review reports potentially modified or generated by large language models (LLMs). It also automatically annotates evaluative aspects of individual review sentences. By analyzing peer review texts from ICLR and NeurIPS conferences, the study investigates the impact of LLMs on review text length, linguistic complexity, and evaluative focus.
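
The detection step follows the corpus-level maximum likelihood idea of Liang et al.: the word counts of observed reviews are modeled as a mixture of a human-written and an LLM-generated reference distribution, and the mixture weight α (the estimated share of LLM-modified reviews) is fitted by maximum likelihood. Below is a minimal sketch of that idea, not the authors' exact implementation; it assumes the reference word distributions have already been estimated from known human-written and LLM-generated corpora, and the toy vocabulary and data are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_llm_fraction(doc_word_counts, p_human, p_llm):
    """Estimate the fraction alpha of LLM-modified reviews by maximum likelihood,
    treating each review as drawn from a human or an LLM word distribution.

    doc_word_counts: (n_docs, vocab_size) word counts per review
    p_human, p_llm:  (vocab_size,) reference word probabilities estimated from
                     known human-written and LLM-generated corpora
    """
    counts = np.asarray(doc_word_counts, dtype=float)
    # Per-document multinomial log-likelihoods under each reference distribution
    ll_human = counts @ np.log(p_human + 1e-12)
    ll_llm = counts @ np.log(p_llm + 1e-12)

    def neg_log_likelihood(alpha):
        # Each review is human-written with prob. (1 - alpha), LLM-modified with prob. alpha
        doc_ll = np.logaddexp(np.log1p(-alpha) + ll_human,
                              np.log(alpha + 1e-12) + ll_llm)
        return -np.sum(doc_ll)

    result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
    return result.x

# Illustrative toy corpus with a 4-word vocabulary (roughly 30% LLM-modified)
rng = np.random.default_rng(0)
p_human = np.array([0.40, 0.30, 0.20, 0.10])
p_llm = np.array([0.10, 0.20, 0.30, 0.40])  # e.g. LLMs overusing certain adjectives
corpus = np.vstack([rng.multinomial(200, p_llm if rng.random() < 0.3 else p_human)
                    for _ in range(200)])
print(f"Estimated fraction of LLM-modified reviews: {estimate_llm_fraction(corpus, p_human, p_llm):.2f}")
```

In practice the reference distributions are built from real reviews and LLM-rewritten versions of them, and the estimate is reported at the corpus level rather than per review.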

Key Results

  • Result 1: Following the emergence of LLMs, peer review texts at ICLR and NeurIPS have become longer and more fluent, particularly among reviewers with lower confidence scores, where text length and fluency significantly increased.
  • Result 2: In LLM-assisted review reports, there is an increased emphasis on summaries and surface-level clarity, while attention to originality, replicability, and nuanced critical reasoning has declined.
  • Result 3: LLM-assisted review reports have a modest positive influence on the informativeness of recommendations.

Significance

This study reveals the profound impact of large language models on the academic peer review process, particularly in terms of changes in linguistic expression and evaluative dimensions. These findings help understand how LLMs alter the dynamics of academic communication and provide actionable insights for improving review practices.

Technical Contribution

This research is the first to systematically analyze the impact of large language models on peer review texts from a fine-grained perspective. It introduces an analytical framework that combines maximum likelihood estimation with automatic aspect annotation, offering new technical means for improving peer review processes.

Novelty

This study is the first to conduct a detailed analysis of the impact of large language models on the linguistic complexity and evaluative focus of peer review texts, distinguishing it from previous studies that only focused on overall text changes.

Limitations

  • Limitation 1: The study primarily relies on publicly available data from ICLR and NeurIPS, which may not be applicable to conferences in other fields.
  • Limitation 2: The accuracy of the maximum likelihood estimation method depends on the quality of the training data, which may lead to misjudgments.
  • Limitation 3: The potential impact of large language models on reviewer biases was not explored in depth.

Future Work

Future research could extend to peer reviews in other academic fields to explore the impact of large language models on review processes across different domains. Additionally, more precise detection models could be developed to enhance the identification of content generated by large language models.

AI Executive Summary

With the rapid advancement of large language models (LLMs), the academic community has faced unprecedented disruptions, particularly in the realm of academic communication. The primary function of peer review is to improve the quality of academic manuscripts along dimensions such as clarity, originality, and other evaluative aspects. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer-review reports has yet to be systematically examined.

This study examines the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at a fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also apply a previously established maximum likelihood estimation method to identify review reports that have potentially been modified or generated by LLMs. Finally, we assess the impact of the evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendations for paper decision-making.
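
The final step relates per-aspect emphasis in LLM-assisted reviews to how informative the recommendation is. One plausible way to set up such an analysis is a regression of an informativeness score on aspect counts and an LLM-assistance flag; the sketch below is ours, not the paper's specification, and every column name, the informativeness measure, and the synthetic data are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300  # hypothetical number of reviews

# Hypothetical per-review features: aspect sentence counts and an LLM-assistance flag
reviews = pd.DataFrame({
    "llm_assisted":  rng.integers(0, 2, n),
    "n_summary":     rng.poisson(3, n),
    "n_clarity":     rng.poisson(3, n),
    "n_originality": rng.poisson(2, n),
    "n_soundness":   rng.poisson(4, n),
})
# Hypothetical informativeness score of the recommendation, scaled to [0, 1]
reviews["informativeness"] = (
    0.4 + 0.03 * reviews["llm_assisted"] + 0.02 * reviews["n_soundness"]
    + rng.normal(0, 0.05, n)
).clip(0, 1)

# Linear model relating aspect emphasis and LLM assistance to informativeness
model = smf.ols(
    "informativeness ~ llm_assisted + n_summary + n_clarity + n_originality + n_soundness",
    data=reviews,
).fit()
print(model.summary())
```

A positive, modest coefficient on the LLM-assistance term would correspond to the "modest positive influence" reported in the results; the actual paper may use a different model or informativeness definition.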

The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly among reviewers with lower confidence scores. These phenomena are more obvious when comparing LLM-assisted and non-LLM-assisted reviews, and the aspects mentioned in LLM-assisted reports have a modest positive influence on the informativeness of the recommendations.

These findings reveal the profound impact of large language models on the academic peer review process, particularly in terms of changes in linguistic expression and evaluative dimensions. These changes may affect the dynamics of academic communication and have implications for the fairness and transparency of review practices. Therefore, understanding these effects can help refine peer review practices, ensure fairness and transparency, and provide actionable insights for adapting to the evolving academic landscape.

However, this study also has some limitations. First, the study primarily relies on publicly available data from ICLR and NeurIPS, which may not be applicable to conferences in other fields. Second, the accuracy of the maximum likelihood estimation method depends on the quality of the training data, which may lead to misjudgments. Finally, the potential impact of large language models on reviewer biases was not explored in depth. Future research could extend to peer reviews in other academic fields to explore the impact of large language models on review processes across different domains. Additionally, more precise detection models could be developed to enhance the identification of content generated by large language models.

Deep Analysis

Background

Peer review is a critical quality control mechanism in the academic research and publication process. Its primary purpose is to ensure the rigor and credibility of academic research, assist authors in improving their work, and identify potential errors and shortcomings. However, in recent years, the peer review mechanism has faced widespread criticism due to the surge in paper submissions and the shortage of domain experts qualified to serve as reviewers, particularly at top artificial intelligence conferences. Current peer review processes face several challenges, including bias, variability in review quality, unclear reviewer motivations, and imperfect review mechanisms. As submission volumes continue to rise, these issues are becoming increasingly pronounced. Some researchers have sought to mitigate these problems by enhancing fairness, reducing biases among novice reviewers, calibrating noisy peer review ratings, and improving mechanisms for matching papers with reviewers’ expertise. Other studies have explored the use of natural language processing techniques to support or refine the peer review process. These studies introduce the possibility of leveraging artificial intelligence to assist overburdened scientists in the peer review process. While these technologies may aid reviewers to some extent, their impact on the peer review process still requires further study.


In recent years, the impressive capabilities demonstrated by large language models (LLMs) have sparked extensive research and discussion within the academic community. At the same time, concerns have emerged about the potential erosion of peer review by LLMs, and researchers have begun to study and analyze the application and impact of LLMs in the peer review process. For example, Liang et al. not only evaluated the effectiveness of GPT-4 in generating scientific feedback but also proposed a method to estimate the extent of LLM usage in peer review texts; they found that some reviews from recent AI conferences may have been modified by LLMs. Latona et al. investigated the prevalence and impact of LLM-assisted peer reviews at ICLR 2024, finding that LLM-assisted reviews significantly influence review scores and submission acceptance rates. While these preliminary studies indicate that LLMs have begun to affect peer review, whether they are altering the core functions of peer review remains underexplored. As LLMs become increasingly integrated into scholarly workflows, it is therefore important to analyze their impact on peer review from multiple perspectives, including linguistic patterns, evaluation aspects, and recommendation-related signals. Understanding these effects can help refine peer review practices, ensure fairness and transparency, and provide actionable insights for adapting to the evolving academic landscape. Notably, major conferences such as NeurIPS have not yet established explicit policies on whether reviewers may use LLMs to assist in writing their reports. This policy ambiguity underscores the importance of examining how LLM assistance may already be shaping the linguistic and evaluative characteristics of peer review texts.

Core Problem

This study aims to explore the impact of large language models (LLMs) on the linguistic complexity and content-level expression of peer review texts. Specifically, the research focuses on the following three core questions:


1. How has the emergence of LLMs affected the linguistic complexity and aspect-level content expression of peer review texts?


2. In comparison to non-LLM-assisted reviews, which evaluation aspects are more prominently emphasized in LLM-assisted reviews?


3. How do the evaluation aspects emphasized in LLM-assisted reviews relate to reviewers’ scoring and confidence levels?


Addressing these questions is crucial for understanding the role of LLMs in the academic review process and how to improve review practices.

Innovation

The core innovations of this study include:


  • Fine-Grained Analysis: This is the first study to systematically analyze the impact of large language models on peer review texts from a fine-grained perspective, distinguishing it from previous studies that only focused on overall text changes.

  • Methodological Innovation: The study introduces a novel analytical framework that combines maximum likelihood estimation and automatic annotation techniques to identify peer review reports potentially modified or generated by LLMs.

  • Data Analysis: By analyzing peer review texts from ICLR and NeurIPS conferences, the study reveals the impact of LLMs on review text length, linguistic complexity, and evaluative focus.

These innovations provide new perspectives for understanding the role of large language models in academic communication.

Methodology

The study employs a detailed analytical approach, with the following steps:


  • Data Collection and Processing: Selected ICLR and NeurIPS as data sources, using the OpenReview platform to obtain peer review text data.

  • Review Sentence Aspect Identification: Based on the study by Yuan et al., a pre-trained aspect identification model is used to automatically annotate review sentences, identifying eight evaluative aspects (see the sketch after this list).

  • LLM-Assisted Peer Review Text Detection: Utilizes the maximum likelihood estimation model designed by Liang et al., combined with a predefined terminology dictionary, to detect peer review texts potentially assisted by LLMs.

  • Lexical and Syntactic Complexity Analysis: Uses the TAALES and TAASSC tools to calculate the lexical and syntactic complexity of review texts, analyzing their trends over time.
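
For the aspect-annotation step, the paper relies on a pre-trained sentence-level tagger in the spirit of Yuan et al.'s review-aspect scheme. The sketch below shows what such per-sentence annotation could look like; the checkpoint name and the aspect labels are placeholders rather than the authors' released artifacts, and the sentence splitter is deliberately naive.

```python
import re
from transformers import pipeline

# Placeholder checkpoint: assumes a sequence-classification model fine-tuned to tag
# review sentences with evaluative aspects (e.g. summary, clarity, originality,
# soundness, replicability, substance, comparison, recommendation).
classifier = pipeline("text-classification", model="your-org/review-aspect-tagger")

def annotate_review(review_text):
    """Split a review into sentences and tag each with its most likely aspect."""
    # Naive sentence splitter; a real pipeline would use spaCy or NLTK
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", review_text) if s.strip()]
    predictions = classifier(sentences)
    return list(zip(sentences, [p["label"] for p in predictions]))

review = ("The paper proposes a retrieval-augmented training scheme. "
          "The writing is clear and easy to follow. "
          "However, the novelty over prior work is limited.")
for sentence, aspect in annotate_review(review):
    print(f"[{aspect}] {sentence}")
```

The per-sentence labels can then be aggregated into per-review aspect counts, which is what the distribution analyses and the informativeness regression operate on.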

Experiments

The experimental design includes the following aspects:


  • Datasets: Selected peer review text data from ICLR and NeurIPS conferences, covering multiple years to analyze changes before and after the emergence of large language models.

  • Baselines: Compared with non-LLM-assisted peer review texts to analyze the impact of LLM assistance on review texts (a sketch of this kind of group comparison follows this list).

  • Evaluation Metrics: Include review text length, lexical complexity, syntactic complexity, and the distribution of evaluative aspects.

  • Hyperparameters: Adjusted parameters in the maximum likelihood estimation model to improve the accuracy of LLM-assisted text detection.
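
The core comparisons are distributional: review-level metrics computed before vs. after the emergence of LLMs, or for LLM-assisted vs. non-assisted reviews. The sketch below illustrates that kind of comparison with a token-count and type-token-ratio stand-in and a Mann-Whitney test; the paper's actual metrics come from TAALES/TAASSC and its own statistical procedures, so everything here is an assumption for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def review_metrics(text):
    tokens = text.split()
    length = len(tokens)
    # Type-token ratio as a crude stand-in for lexical diversity
    ttr = len(set(t.lower() for t in tokens)) / max(length, 1)
    return length, ttr

def compare_groups(group_a, group_b):
    """Compare review length and lexical diversity between two groups of reviews."""
    a = np.array([review_metrics(t) for t in group_a])
    b = np.array([review_metrics(t) for t in group_b])
    for i, name in enumerate(["length (tokens)", "type-token ratio"]):
        stat, p = mannwhitneyu(a[:, i], b[:, i], alternative="two-sided")
        print(f"{name}: group A mean={a[:, i].mean():.2f}, "
              f"group B mean={b[:, i].mean():.2f}, Mann-Whitney p={p:.3f}")

# Toy usage with placeholder review texts (e.g. pre-LLM vs. post-LLM reviews)
pre_llm_reviews = ["The method is sound but the evaluation is thin."] * 30
post_llm_reviews = ["The paper presents a clear and well-structured study of the proposed method."] * 30
compare_groups(pre_llm_reviews, post_llm_reviews)
```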

Results

The study results indicate:


  • Following the emergence of LLMs, peer review texts at ICLR and NeurIPS have become longer and more fluent, particularly among reviewers with lower confidence scores, where text length and fluency significantly increased.

  • In LLM-assisted review reports, there is an increased emphasis on summaries and surface-level clarity, while attention to originality, replicability, and nuanced critical reasoning has declined.

  • LLM-assisted review reports have a modest positive influence on the informativeness of recommendations. These findings reveal the profound impact of large language models on the academic peer review process, particularly in terms of changes in linguistic expression and evaluative dimensions.

Applications

The application scenarios of this study include:


  • Academic Review: Provides references for improving review practices at academic conferences and journals, especially in terms of the linguistic and evaluative dimensions of review texts.

  • Education Sector: Offers insights into the role of large language models in academic communication, helping students and researchers better understand and utilize these technologies.

  • AI Research: Provides empirical data on the application and impact of large language models in natural language processing and generation for AI researchers.

Limitations & Outlook

Despite revealing the impact of large language models on peer review, the study has some limitations:


  • Data Limitation: The study primarily relies on publicly available data from ICLR and NeurIPS, so its findings may not generalize to conferences in other fields.

  • Detection Model Accuracy: The accuracy of the maximum likelihood estimation method depends on the quality of the training data, which may lead to misjudgments.

  • Subjective Bias: The potential impact of large language models on reviewer biases was not explored in depth.

Future research could extend to peer reviews in other academic fields to explore the impact of large language models on review processes across different domains. Additionally, more precise detection models could be developed to enhance the identification of content generated by large language models.

Plain Language (Accessible to non-experts)

Imagine you're working in a large library, responsible for reviewing and recommending books. Traditionally, you'd carefully read each book, analyze its content, writing style, and uniqueness, then give your evaluation and recommendation. However, with technological advancements, you now have a smart assistant that can quickly scan books and help you identify the main points and writing style.

This smart assistant is like a large language model. It can help you complete your work faster but might also influence your judgment. For instance, it might focus more on the book's summary and surface clarity, while overlooking some deeper originality and critical thinking.

In this process, you realize that while the smart assistant makes your work more efficient, it also brings new challenges. You need to carefully balance the assistant's suggestions with your own judgment to ensure the recommended books are both interesting and deep.

This is similar to using large language models in academic peer review. They can help reviewers write review reports faster but might also affect the depth and quality of the review. Therefore, understanding and managing these impacts is crucial for maintaining the fairness and quality of reviews.

ELI14 (Explained like you're 14)

Hey there! Imagine you're in school participating in a super cool science competition, and the judges need to score each project. Traditionally, the judges would spend a lot of time reading each project, analyzing their creativity and scientific value, and then give their evaluation.

But now, the judges have a super assistant—a large language model! This assistant is like a super smart robot that can quickly read projects and help judges identify the main highlights and writing style. Sounds cool, right?

However, this assistant also has a little problem. It might focus more on the project's summary and surface clarity, while overlooking some deeper creativity and critical thinking. It's like playing a game where you only focus on the character's appearance and ignore their skills and strategy.

So, while this assistant makes reviewing faster, the judges also need to carefully balance the assistant's suggestions with their own judgment to ensure the selected projects are both interesting and deep. Just like in a game, you need to balance the character's appearance and skills to win the competition!

Glossary

Large Language Model

A large language model is a deep learning-based natural language processing model capable of generating and understanding human language. Such models typically have billions of parameters and can handle complex language tasks.

Examined in the paper as the driver of changes in the linguistic complexity and evaluative dimensions of peer review texts.

Peer Review

Peer review is a quality control mechanism in academic research and publication, where experts evaluate academic manuscripts to ensure their quality and credibility.

Used in the paper to analyze the impact of large language models on the review process.

Maximum Likelihood Estimation

Maximum likelihood estimation is a statistical method used to estimate model parameters that maximize the probability of observed data.

Used in the paper to identify peer review texts potentially generated by large language models.
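
In this setting, the parameter being estimated is the share of LLM-modified reviews. A sketch of the corresponding corpus-level mixture likelihood, in our notation rather than the paper's:

```latex
% \alpha: estimated fraction of LLM-modified reviews; x_i: word counts of review i
\hat{\alpha} = \arg\max_{\alpha \in [0,1]}
  \sum_{i=1}^{N} \log\!\Big( (1-\alpha)\, P(x_i \mid \text{human})
                             + \alpha\, P(x_i \mid \text{LLM}) \Big)
```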

Linguistic Complexity

Linguistic complexity refers to the sophistication of a text's vocabulary and syntax, covering both lexical diversity and the structural complexity of sentences.

Used in the paper to analyze the impact of large language models on review texts.

Evaluation Aspect

Evaluation aspects are different dimensions of focus in the review process, such as clarity, originality, replicability, etc.

Used in the paper to analyze the distribution of different dimensions in review texts.

ICLR

ICLR (the International Conference on Learning Representations) is a top-tier international machine learning conference focused on representation learning.

Used in the paper as one of the data sources for analyzing changes in review texts.

NeurIPS

NeurIPS (the Conference on Neural Information Processing Systems) is a top-tier international artificial intelligence conference covering research in neural information processing systems.

Used in the paper as one of the data sources for analyzing changes in review texts.

OpenReview

OpenReview is an open academic review platform that allows researchers to submit and review academic papers.

Used in the paper to obtain peer review text data from ICLR and NeurIPS conferences.

TAALES

TAALES (Tool for the Automatic Analysis of Lexical Sophistication) is a tool for evaluating the lexical complexity of a text, capable of calculating multiple lexical complexity indices.

Used in the paper to analyze the lexical complexity of review texts.

TAASSC

TAASSC (Tool for the Automatic Analysis of Syntactic Sophistication and Complexity) is a tool for evaluating the syntactic complexity of a text, capable of calculating multiple syntactic complexity indices.

Used in the paper to analyze the syntactic complexity of review texts.

Open Questions (Unanswered questions from this research)

  1. The impact of large language models on reviewer biases has not been fully explored. Existing research mainly focuses on changes in linguistic complexity and evaluative dimensions, while neglecting potential biases that reviewers may develop when using large language models. In-depth research in this area could help improve review practices, ensuring fairness and transparency.
  2. The accuracy of existing detection models for identifying content generated by large language models still needs improvement. Although maximum likelihood estimation can identify LLM-assisted review texts to some extent, its accuracy depends on the quality of the training data, which may lead to misjudgments. Developing more precise detection models is an important direction for future research.
  3. The impact of large language models on the review process in different fields has not been systematically studied. Existing research mainly focuses on the field of artificial intelligence, while review processes in other disciplines may be influenced by different factors. Expanding the scope of research could help comprehensively understand the role of large language models in academic communication.
  4. The impact of large language models on the depth of review text content needs further exploration. Although studies show that LLM-assisted review texts have improved in summaries and surface clarity, attention to originality and critical reasoning has decreased. The reasons for this phenomenon and potential solutions deserve further research.
  5. The application policy of large language models in academic peer review remains unclear. While some conferences have begun discussing the use of LLMs, a unified policy has not yet been formed. Clear policies could help regulate the use of LLMs in the review process, ensuring fairness and transparency.

Applications

Immediate Applications

Academic Conference Review

Large language models can help academic conferences improve review efficiency, especially when handling large volumes of submissions. By drafting portions of review reports automatically, they can free reviewers to focus on key issues, enhancing review quality.

Journal Editing

Journal editors can use large language models to quickly screen manuscripts and identify potentially high-quality articles. This can reduce editors' workload and improve the overall quality of journals.

Educational Assessment

Large language models can be used for assessment tasks in education, such as automatic grading and feedback generation for student essays. This can help teachers save time and provide students with more timely feedback.

Long-term Vision

Cross-Disciplinary Review

Large language models can be extended to review processes in other academic fields, improving the efficiency and quality of reviews. This could promote cross-disciplinary collaboration and communication, advancing scientific research.

Automated Academic Communication

With technological advancements, large language models may achieve higher degrees of automation in academic communication in the future. This will change the way academia works, enhancing research efficiency and innovation.

Abstract

With the rapid advancement of Large Language Models (LLMs), the academic community has faced unprecedented disruptions, particularly in the realm of academic communication. The primary function of peer review is improving the quality of academic manuscripts, such as clarity, originality, and other evaluation aspects. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer-review reports has yet to be systematically examined. In this study, we examine the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at a fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also use a previously established maximum likelihood estimation method to identify review reports that have potentially been modified or generated by LLMs. Finally, we assess the impact of evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendations for paper decision-making. The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly among reviewers with lower confidence scores. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.

cs.CL cs.AI cs.DL cs.IR

References (20)

  • Weixin Liang, Zachary Izzo, Yaohui Zhang et al. (2024). Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews.

  • Jialiang Lin, Jiaxin Song, Zhangping Zhou et al. (2021). Automated scholarly paper review: Concepts, technologies, and challenges.

  • Giuseppe Russo, Manoel Horta Ribeiro, Tim R. Davidson et al. (2024). The AI Review Lottery: Widespread AI-Assisted Peer Reviews Boost Paper Scores and Acceptance Rates.

  • Weizhe Yuan, Pengfei Liu, Graham Neubig (2021). Can We Automate Scientific Reviewing?

  • Yu Geng, Renmeng Cao, Xiaopu Han et al. (2022). Scientists are working overtime: when do scientists download scientific papers?

  • Jiangshu Du, Yibo Wang, Wenting Zhao et al. (2024). LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing.

  • Weixin Liang, Yuhui Zhang, Hancheng Cao et al. (2023). Can large language models provide useful feedback on research papers? A large-scale empirical analysis.

  • S. Cooke, Nathan Young, K. Peiman et al. (2024). A harm reduction approach to improving peer review by acknowledging its imperfections.

  • Abdelghani Maddi, Luis Miotti (2024). On the peer review reports: does size matter?

  • Qingyun Wang, Qi Zeng, Lifu Huang et al. (2020). ReviewRobot: Explainable Paper Review Generation based on Knowledge Synthesis.

  • Jonathan P. Tennant (2018). The state of the art in peer review.

  • Yixuan Even Xu, Steven Jecmen, Zimeng Song et al. (2023). A One-Size-Fits-All Approach to Improving Randomness in Paper Assignment.

  • David Tran, Alex Valtchanov, Keshav Ganapathy et al. (2020). An Open Review of OpenReview: A Critical Analysis of the Machine Learning Conference Review Process.

  • Yuxuan Lu, Yuqing Kong (2023). Calibrating "Cheap Signals" in Peer Review without a Prior.

  • Jianxiang Yu, Zichen Ding, Jiaqi Tan et al. (2024). Automated Peer Reviewing in Paper SEA: Standardization, Evaluation, and Analysis.

  • E. Mitchell, Yoonho Lee, Alexander Khazatsky et al. (2023). DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature.

  • Yiqiao Jin, Qinlin Zhao, Yiyang Wang et al. (2024). AgentReview: Exploring Peer Review Dynamics with LLM Agents.

  • Yanzhu Guo, Guokan Shang, Virgile Rennard et al. (2023). Automatic Analysis of Substantiation in Scientific Peer Reviews.

  • Weixin Liang, Yaohui Zhang, Zhengxuan Wu et al. (2024). Mapping the Increasing Use of LLMs in Scientific Papers.

  • C. W. Fox, Jennifer A. Meyer, Emilie Aimé (2023). Double-blind peer review affects reviewer ratings and editor decisions at an ecology journal.