Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
Diagnosing LLM judge reliability using transitivity analysis and conformal prediction sets, revealing that 33%-67% of documents exhibit at least one directed 3-cycle.
Key Findings
Methodology
This paper presents a two-pronged diagnostic toolkit applied to the SummEval dataset. First, a transitivity analysis reveals widespread per-input inconsistency, despite low aggregate violation rates (0.8%-4.1%). Second, split conformal prediction sets over 1-5 Likert scores provide theoretically guaranteed coverage, with set width serving as a per-instance reliability indicator. Critically, prediction set width shows consistent cross-judge agreement, demonstrating it captures document-level difficulty rather than judge-specific noise.
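For reference, here is the generic split conformal construction behind this guarantee, stated with a generic nonconformity score $s$ since this summary does not reproduce the paper's exact choice: given calibration scores $s_i = s(x_i, y_i)$, $i = 1,\dots,n$, set $\hat{q}$ to the $\lceil (n+1)(1-\alpha) \rceil / n$ empirical quantile of the $s_i$ and return $C(x) = \{\, y \in \{1,\dots,5\} : s(x, y) \le \hat{q} \,\}$; exchangeability then guarantees $\Pr[y \in C(x)] \ge 1 - \alpha$, so wider sets flag instances where the judge's score is less trustworthy.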
Key Results
- Result 1: Transitivity analysis shows that despite low overall violation rates (0.8%-4.1%), 33%-67% of documents exhibit at least one directed 3-cycle, indicating judge inconsistency on individual instances (a cycle-counting sketch follows this list).
- Result 2: Conformal prediction sets show a Spearman correlation of +0.576 (p<10^-100) across all judges and criteria, indicating a significant association between prediction set width and actual judge-human disagreement.
- Result 3: Across four judges and four criteria, relevance and coherence are judged most reliably, while fluency and consistency remain unreliable.
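To make the 3-cycle notion concrete, here is a minimal sketch of counting directed 3-cycles from one judge's pairwise preferences on a single document; the `prefs` layout and the exclusion of ties are assumptions of this sketch, not the paper's exact pipeline.

```python
from itertools import combinations

def count_3_cycles(prefs):
    """Count directed 3-cycles among systems for one document.

    prefs[(a, b)] == 1 means the judge strictly preferred system a's
    output over system b's output for this document (ties excluded).
    """
    systems = sorted({s for pair in prefs for s in pair})
    cycles = 0
    for a, b, c in combinations(systems, 3):
        # Check both cyclic orientations: a > b > c > a and a > c > b > a.
        if prefs.get((a, b)) and prefs.get((b, c)) and prefs.get((c, a)):
            cycles += 1
        elif prefs.get((a, c)) and prefs.get((c, b)) and prefs.get((b, a)):
            cycles += 1
    return cycles

# A > B, B > C, yet C > A: one intransitive cycle.
prefs = {("A", "B"): 1, ("B", "C"): 1, ("C", "A"): 1}
assert count_3_cycles(prefs) == 1
```

A document "exhibits a 3-cycle" in the paper's sense whenever this count is nonzero for at least one judge and criterion.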
Significance
This study provides crucial insights into the per-instance inconsistency of LLM judge systems for automatic NLG evaluation. By employing transitivity analysis and conformal prediction sets, the research demonstrates that the evaluation criterion matters more than the choice of judge: relevance and coherence are judged far more reliably than fluency and consistency. This finding is significant for both academia and industry, as it challenges the current unconditional trust in LLM judge systems and supplies concrete diagnostics for more reliable evaluation.
Technical Contribution
The technical contributions of this paper include the first per-document measurement of directed 3-cycle rates in LLM judges, linked to conformal uncertainty. Additionally, conformal prediction sets provide finite-sample coverage guarantees and serve as a deployment signal, with prediction set width correlating with actual judge error. Together, these contributions offer new theoretical guarantees and engineering tools for evaluating NLG systems.
Novelty
This paper is the first to combine transitivity analysis and conformal prediction sets to diagnose the reliability of LLM judge systems. Unlike previous work, this study not only focuses on aggregate evaluation metrics but also delves into the reliability of individual instance evaluations, revealing the importance of evaluation criteria.
Limitations
- Limitation 1: The study is conducted only on the SummEval dataset, and results may not generalize to other datasets or tasks, such as dialogue generation or machine translation.
- Limitation 2: Conformal prediction sets provide marginal coverage guarantees rather than per-document conditional coverage, potentially leading to overly tight prediction sets for difficult documents.
- Limitation 3: The study uses a fixed nonconformity score; future work could explore learned nonconformity scores based on judge confidence or LLM log-probabilities.
Future Work
Future research could expand to larger datasets and different NLG tasks, such as dialogue generation and machine translation. Additionally, exploring conditional conformal methods to improve coverage accuracy for difficult documents and developing dynamic nonconformity scoring systems based on judge confidence are promising directions.
AI Executive Summary
In the realm of automatic evaluation for natural language generation (NLG), LLM judge systems have gained traction due to their scalability. However, the reliability of these systems on a per-instance basis remains poorly understood. Existing evaluation methods often rely on system-level metrics like Kendall's τ or Pearson correlation with human scores, which, although seemingly impressive, often mask errors on individual instances.
This paper introduces a two-pronged diagnostic toolkit applied to the SummEval dataset to uncover inconsistencies in LLM judge systems on a per-instance level. First, a transitivity analysis reveals widespread per-input inconsistency, with 33%-67% of documents exhibiting at least one directed 3-cycle, despite low aggregate violation rates (0.8%-4.1%). Second, split conformal prediction sets over 1-5 Likert scores provide theoretically guaranteed coverage, with set width serving as a per-instance reliability indicator.
The study finds that prediction set width shows consistent cross-judge agreement, indicating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, relevance and coherence are judged most reliably, while fluency and consistency remain unreliable. This finding is significant for both academia and industry as it challenges the current unconditional trust in LLM judge systems and proposes more reliable evaluation methods.
Moreover, the results of transitivity analysis and conformal prediction sets converge, demonstrating that evaluation criteria matter more than the judge. This conclusion offers a new perspective for evaluating NLG systems, suggesting that coherence and relevance scores should be trusted more than fluency and consistency scores when deploying LLM judge systems.
Despite revealing per-instance inconsistency in LLM judge systems, the study has limitations. It is conducted only on the SummEval dataset, and results may not generalize to other datasets or tasks. Additionally, conformal prediction sets provide marginal coverage guarantees rather than per-document conditional coverage, potentially leading to overly tight prediction sets for difficult documents. Future research could expand to larger datasets and different NLG tasks and explore conditional conformal methods to improve coverage accuracy for difficult documents.
Deep Analysis
Background
Automatic evaluation of natural language generation (NLG) has become a cornerstone of modern natural language processing (NLP) research. With the advent of large language models (LLMs), LLM judge systems have rapidly been adopted as scalable proxies for human annotation. Traditional evaluation methods often rely on system-level metrics like Kendall's τ or Pearson correlation with human scores, which, although seemingly impressive, often mask errors on individual instances. Recent studies have begun to focus on the reliability of LLM judge systems, revealing systematic weaknesses on specific input types. However, existing research largely concentrates on aggregate evaluation metrics, lacking in-depth exploration of per-instance evaluation reliability.
Core Problem
The reliability of LLM judge systems on a per-instance basis remains poorly understood. Existing evaluation methods often rely on system-level metrics like Kendall's τ or Pearson correlation with human scores, which, although seemingly impressive, often mask errors on individual instances. A judge that is right 90% of the time can be spectacularly wrong on the 10% that matters most. Therefore, accurately assessing the per-instance reliability of LLM judge systems is a pressing issue.
Innovation
This paper introduces a two-pronged diagnostic toolkit applied to the SummEval dataset to uncover inconsistencies in LLM judge systems on a per-instance level. First, a transitivity analysis reveals widespread per-input inconsistency, with 33%-67% of documents exhibiting at least one directed 3-cycle, despite low aggregate violation rates (0.8%-4.1%). Second, split conformal prediction sets over 1-5 Likert scores provide theoretically guaranteed coverage, with set width serving as a per-instance reliability indicator. Prediction set width shows consistent cross-judge agreement, indicating it captures document-level difficulty rather than judge-specific noise.
Methodology
- Transitivity Analysis: Measure directed 3-cycle violation rates across four judges on the SummEval dataset, revealing widespread per-input inconsistency.
- Conformal Prediction Sets: Use split conformal prediction sets over 1-5 Likert scores to provide theoretically guaranteed coverage, with set width serving as a per-instance reliability indicator (a minimal sketch follows this list).
- Consistency Evaluation: Assess the association between prediction set width and actual judge-human disagreement using Spearman correlation, validating the consistent cross-judge agreement of prediction set width.
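Here is a minimal sketch of the split conformal step, assuming the common absolute-error nonconformity score |judge score − y|; the paper's exact score function is not reproduced in this summary.

```python
import numpy as np

def conformal_sets(cal_judge, cal_human, test_judge, alpha=0.1):
    """Split conformal prediction sets over 1-5 Likert scores.

    cal_judge / cal_human: judge and (rounded) human scores on the
    calibration split; test_judge: judge scores on new instances.
    """
    scores = np.abs(np.asarray(cal_judge) - np.asarray(cal_human))
    n = len(scores)
    # q_hat is the ceil((n+1)(1-alpha))-th smallest calibration score;
    # if that rank exceeds n, the set must include every label.
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    q_hat = np.inf if rank > n else np.sort(scores)[rank - 1]
    # Keep every Likert label whose nonconformity is within q_hat;
    # the resulting set width is the per-instance reliability signal.
    return [{y for y in range(1, 6) if abs(j - y) <= q_hat} for j in test_judge]
```

With enough calibration points, wider sets mark instances where the judge's score should not be taken at face value.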
Experiments
Experiments are conducted on the SummEval dataset, which contains 100 documents × 16 systems (=1,600 outputs), each output rated by three annotators on a 1-5 Likert scale. For cost efficiency, the experiments subsample to 30 documents × 8 systems, rounding averaged human scores to the nearest integer for conformal calibration. Judges include gpt-4o-mini, meta-llama/llama-3.1-70b-instruct, qwen/qwen-2.5-72b-instruct, and mistralai/mistral-small-3.1-24b-instruct. All responses are cached in SQLite.
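The paper states only that responses are cached in SQLite; a minimal sketch of such a cache (the table name, key scheme, and `call_llm` helper are hypothetical) might look like this:

```python
import hashlib
import sqlite3

conn = sqlite3.connect("judge_cache.db")
conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT)")

def cached_judge(model: str, prompt: str, call_llm) -> str:
    """Return a cached LLM response, calling the API only on a cache miss."""
    key = hashlib.sha256(f"{model}|{prompt}".encode()).hexdigest()
    row = conn.execute("SELECT response FROM cache WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return row[0]
    response = call_llm(model, prompt)  # e.g. a chat-completion request
    conn.execute("INSERT INTO cache VALUES (?, ?)", (key, response))
    conn.commit()
    return response
```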
Results
Transitivity analysis shows that despite low overall violation rates (0.8%-4.1%), 33%-67% of documents exhibit at least one directed 3-cycle, indicating judge inconsistency on individual instances. Conformal prediction sets show a Spearman correlation of +0.576 (p<10^-100) across all judges and criteria, indicating a significant association between prediction set width and actual judge-human disagreement. Across four judges and four criteria, relevance and coherence are judged most reliably, while fluency and consistency remain unreliable.
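A minimal sketch of the width-vs-disagreement check, assuming absolute judge-human difference as the disagreement measure and pooling all instances, as the reported correlation does:

```python
import numpy as np
from scipy.stats import spearmanr

def width_error_correlation(widths, judge_scores, human_scores):
    """Spearman correlation between conformal set width and judge error."""
    disagreement = np.abs(np.asarray(judge_scores) - np.asarray(human_scores))
    return spearmanr(widths, disagreement)  # (correlation, p-value)
```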
Applications
The applications of this study include the deployment of judge systems in automatic NLG evaluation. By using transitivity analysis and conformal prediction sets, researchers and practitioners can more accurately assess the per-instance reliability of LLM judge systems, thereby improving the credibility of evaluation results. Additionally, this method can be applied to other NLP tasks requiring automatic evaluation, such as machine translation and dialogue generation.
Limitations & Outlook
Despite revealing per-instance inconsistency in LLM judge systems, the study has limitations. It is conducted only on the SummEval dataset, and results may not generalize to other datasets or tasks. Additionally, conformal prediction sets provide marginal coverage guarantees rather than per-document conditional coverage, potentially leading to overly tight prediction sets for difficult documents. Future research could expand to larger datasets and different NLG tasks and explore conditional conformal methods to improve coverage accuracy for difficult documents.
Plain Language (Accessible to non-experts)
Imagine you're shopping in a large supermarket. There are many cashiers, each responsible for checking the price and quality of every item. Each cashier has their own standards; some may focus more on the appearance of the item, while others may care more about its functionality. Now, you want to know if these cashiers are consistent in their judgments for each item.
This study is like analyzing the consistency of these cashiers' judgments. Through transitivity analysis, researchers found that although the cashiers' judgments seem consistent overall, there are inconsistencies for certain items. It's like some cashiers think apples are better than bananas, bananas are better than oranges, but oranges are better than apples.
To better assess the reliability of each item's judgment, researchers used a method called conformal prediction sets. It's like scoring each item, and the wider the score range, the more uncertain the cashiers are about that item. This method allows researchers to more accurately assess the reliability of each item's judgment.
In short, this study is like helping the supermarket better evaluate the quality of each item, ensuring that each item's judgment is carefully analyzed and evaluated.
ELI14 (Explained like you're 14)
Hey there! You know how in school, teachers sometimes have different standards when grading you? Like, some teachers might care more about your homework quality, while others focus on your class participation.
This study is like analyzing those grading standards. Researchers found that although teachers' grades look consistent overall, individual cases can be contradictory. It's like a teacher ranking your essay above Sam's and Sam's above Alex's, but then ranking Alex's above yours, so the preferences go in a circle.
To better assess each student's performance, researchers used a method called conformal prediction sets. It's like scoring each student, and the wider the score range, the more uncertain the teachers are about that student's performance. This method allows researchers to more accurately assess each student's performance.
In short, this study is like helping schools better evaluate each student's performance, ensuring that each grade is carefully analyzed and evaluated.
Glossary
Transitivity Analysis
A method for evaluating the consistency of preferences among multiple options by measuring directed 3-cycle violation rates, revealing judge inconsistency on individual instances.
Used in this paper to uncover inconsistencies in LLM judge systems on a per-instance level.
Conformal Prediction Set
A method providing finite-sample coverage guarantees: the prediction set contains the true label with probability at least 1−α. In the split variant used here, the width of each set serves as a per-instance reliability indicator.
Used to assess the reliability of each instance's judgment.
SummEval Dataset
A summarization evaluation dataset containing 100 documents × 16 systems (1,600 outputs), each output rated by three annotators on a 1-5 Likert scale.
The primary dataset used for experiments in this paper.
Directed 3-Cycle
A cyclic preference relationship among three options, such as A is preferred over B, B over C, but C over A.
Used in transitivity analysis to evaluate judge preference consistency.
Spearman Correlation Coefficient
A non-parametric statistical measure of the monotonic relationship between two variables.
Used to assess the association between prediction set width and actual judge-human disagreement.
Kendall's τ
A statistical measure of the consistency between two rankings.
Used to evaluate the consistency between LLM judge systems and human scores.
Likert Scale
A rating scale commonly used to measure attitudes or opinions, typically ranging from 1 to 5.
Used to evaluate system outputs in the SummEval dataset.
Marginal Coverage Guarantee
A guarantee that, averaged over all inputs, the prediction set contains the true value with at least the target probability; it need not hold for each individual input.
The coverage guarantee provided by conformal prediction sets.
Conditional Coverage
The probability that a prediction set contains the true value conditional on a specific input or subgroup; a stronger requirement than marginal coverage.
Future research could explore using conditional conformal methods to improve coverage accuracy for difficult documents.
Nonconformity Score
A metric used to measure the difference between predictions and actual outcomes.
Used in the calculation of conformal prediction sets.
Open Questions (Unanswered questions from this research)
- 1 How can this method be validated on larger datasets and different NLG tasks? The current study is conducted only on the SummEval dataset, and results may not generalize to other datasets or tasks, such as dialogue generation or machine translation.
- 2 How can the coverage accuracy of conformal prediction sets for difficult documents be improved? The current method provides marginal coverage guarantees rather than per-document conditional coverage, potentially leading to overly tight prediction sets for difficult documents.
- 3 How can dynamic nonconformity scoring systems based on judge confidence be developed? The current study uses a fixed nonconformity score; future work could explore learned nonconformity scores based on judge confidence or LLM log-probabilities.
- 4 How can the reliability of LLM judge systems be improved under different evaluation criteria? The current study shows that evaluation criteria matter more than the judge; future research could explore optimization methods under different criteria.
- 5 How can the credibility of evaluation results be improved without increasing computational costs? The current method may require significant computational resources; future research could explore more efficient evaluation methods.
Applications
Immediate Applications
Automatic NLG Evaluation
Researchers and practitioners can use transitivity analysis and conformal prediction sets to more accurately assess the per-instance reliability of LLM judge systems, thereby improving the credibility of evaluation results.
Machine Translation Quality Assessment
By applying the methods in this paper, the quality of machine translation system outputs can be better assessed, especially in cases where evaluation criteria are inconsistent.
Dialogue Generation System Evaluation
In dialogue generation tasks, using the methods in this paper can help identify judge inconsistencies on individual instances, thereby improving the accuracy of system evaluations.
Long-term Vision
Cross-Domain Evaluation Standardization
The methods in this paper can provide standardized approaches for automatic evaluation across different domains, improving evaluation consistency and comparability between different tasks.
Intelligent Evaluation System Development
Future developments could include intelligent evaluation systems based on the methods in this paper, automatically identifying and correcting judge inconsistencies on individual instances, enhancing the intelligence level of evaluation systems.
Abstract
LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores providing theoretically-guaranteed $\geq (1-\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = +0.576$, $N = 1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$-$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.