Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
Diagnosing LLM judge reliability using transitivity analysis and conformal prediction sets, revealing that 33%-67% of documents exhibit at least one directed 3-cycle.
Key Findings
Methodology
This paper presents a two-pronged diagnostic toolkit applied to the SummEval dataset. First, a transitivity analysis reveals widespread per-input inconsistency, despite low aggregate violation rates (0.8%-4.1%). Second, split conformal prediction sets over 1-5 Likert scores provide theoretically guaranteed coverage, with set width serving as a per-instance reliability indicator. Critically, prediction set width shows consistent cross-judge agreement, demonstrating it captures document-level difficulty rather than judge-specific noise.
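For reference, here is the generic split conformal construction behind this guarantee, stated with a generic nonconformity score $s$ since this summary does not reproduce the paper's exact choice: given calibration scores $s_i = s(x_i, y_i)$, $i = 1,\dots,n$, set $\hat{q}$ to the $\lceil (n+1)(1-\alpha) \rceil / n$ empirical quantile of the $s_i$ and return $C(x) = \{\, y \in \{1,\dots,5\} : s(x, y) \le \hat{q} \,\}$; exchangeability then guarantees $\Pr[y \in C(x)] \ge 1 - \alpha$, so wider sets flag instances where the judge's score is less trustworthy.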
Key Results
- Result 1: Transitivity analysis shows that despite low overall violation rates (0.8%-4.1%), 33%-67% of documents exhibit at least one directed 3-cycle, indicating judge inconsistency on individual instances (a cycle-counting sketch follows this list).
- Result 2: Conformal prediction sets show a Spearman correlation of +0.576 (p<10^-100) across all judges and criteria, indicating a significant association between prediction set width and actual judge-human disagreement.
- Result 3: Across four judges and four criteria, relevance and coherence are judged most reliably, while fluency and consistency remain unreliable.
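To make the 3-cycle notion concrete, here is a minimal sketch of counting directed 3-cycles from one judge's pairwise preferences on a single document; the `prefs` layout and the exclusion of ties are assumptions of this sketch, not the paper's exact pipeline.

```python
from itertools import combinations

def count_3_cycles(prefs):
    """Count directed 3-cycles among systems for one document.

    prefs[(a, b)] == 1 means the judge strictly preferred system a's
    output over system b's output for this document (ties excluded).
    """
    systems = sorted({s for pair in prefs for s in pair})
    cycles = 0
    for a, b, c in combinations(systems, 3):
        # Check both cyclic orientations: a > b > c > a and a > c > b > a.
        if prefs.get((a, b)) and prefs.get((b, c)) and prefs.get((c, a)):
            cycles += 1
        elif prefs.get((a, c)) and prefs.get((c, b)) and prefs.get((b, a)):
            cycles += 1
    return cycles

# A > B, B > C, yet C > A: one intransitive cycle.
prefs = {("A", "B"): 1, ("B", "C"): 1, ("C", "A"): 1}
assert count_3_cycles(prefs) == 1
```

A document "exhibits a 3-cycle" in the paper's sense whenever this count is nonzero for at least one judge and criterion.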
Significance
This study provides crucial insights into the per-instance inconsistency of LLM judge systems for automatic NLG evaluation. By employing transitivity analysis and conformal prediction sets, the research demonstrates that the evaluation criterion matters more than the choice of judge: relevance and coherence are judged far more reliably than fluency and consistency. This finding is significant for both academia and industry, as it challenges the current unconditional trust in LLM judge systems and supplies concrete diagnostics for more reliable evaluation.
Technical Contribution
The technical contributions of this paper include the first per-document measurement of directed 3-cycle rates in LLM judges, linked to conformal uncertainty. Additionally, conformal prediction sets provide finite-sample coverage guarantees and serve as a deployment signal, with prediction set width correlating with actual judge error. Together, these contributions offer new theoretical guarantees and engineering tools for evaluating NLG systems.
Novelty
This paper is the first to combine transitivity analysis and conformal prediction sets to diagnose the reliability of LLM judge systems. Unlike previous work, this study not only focuses on aggregate evaluation metrics but also delves into the reliability of individual instance evaluations, revealing the importance of evaluation criteria.
Limitations
- Limitation 1: The study is conducted only on the SummEval dataset, and results may not generalize to other datasets or tasks, such as dialogue generation or machine translation.
- Limitation 2: Conformal prediction sets provide marginal coverage guarantees rather than per-document conditional coverage, potentially leading to overly tight prediction sets for difficult documents.
- Limitation 3: The study uses a fixed nonconformity score; future work could explore learned nonconformity scores based on judge confidence or LLM log-probabilities.
Future Work
Future research could expand to larger datasets and different NLG tasks, such as dialogue generation and machine translation. Additionally, exploring conditional conformal methods to improve coverage accuracy for difficult documents and developing dynamic nonconformity scoring systems based on judge confidence are promising directions.
AI Executive Summary
In the realm of automatic evaluation for natural language generation (NLG), LLM judge systems have gained traction due to their scalability. However, the reliability of these systems on a per-instance basis remains poorly understood. Existing evaluation methods often rely on system-level metrics like Kendall's τ or Pearson correlation with human scores, which, although seemingly impressive, often mask errors on individual instances.
This paper introduces a two-pronged diagnostic toolkit applied to the SummEval dataset to uncover inconsistencies in LLM judge systems on a per-instance level. First, a transitivity analysis reveals widespread per-input inconsistency, with 33%-67% of documents exhibiting at least one directed 3-cycle, despite low aggregate violation rates (0.8%-4.1%). Second, split conformal prediction sets over 1-5 Likert scores provide theoretically guaranteed coverage, with set width serving as a per-instance reliability indicator.
The study finds that prediction set width shows consistent cross-judge agreement, indicating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, relevance and coherence are judged most reliably, while fluency and consistency remain unreliable. This finding is significant for both academia and industry as it challenges the current unconditional trust in LLM judge systems and proposes more reliable evaluation methods.
Moreover, the results of transitivity analysis and conformal prediction sets converge, demonstrating that evaluation criteria matter more than the judge. This conclusion offers a new perspective for evaluating NLG systems, suggesting that coherence and relevance scores should be trusted more than fluency and consistency scores when deploying LLM judge systems.
Despite revealing per-instance inconsistency in LLM judge systems, the study has limitations. It is conducted only on the SummEval dataset, and results may not generalize to other datasets or tasks. Additionally, conformal prediction sets provide marginal coverage guarantees rather than per-document conditional coverage, potentially leading to overly tight prediction sets for difficult documents. Future research could expand to larger datasets and different NLG tasks and explore conditional conformal methods to improve coverage accuracy for difficult documents.
Deep Analysis
Background
Automatic evaluation of natural language generation (NLG) has become a cornerstone of modern natural language processing (NLP) research. With the advent of large language models (LLMs), LLM judge systems have rapidly been adopted as scalable proxies for human annotation. Traditional evaluation methods often rely on system-level metrics like Kendall's τ or Pearson correlation with human scores, which, although seemingly impressive, often mask errors on individual instances. Recent studies have begun to focus on the reliability of LLM judge systems, revealing systematic weaknesses on specific input types. However, existing research largely concentrates on aggregate evaluation metrics, lacking in-depth exploration of per-instance evaluation reliability.
Core Problem
The reliability of LLM judge systems on a per-instance basis remains poorly understood. Existing evaluation methods often rely on system-level metrics like Kendall's τ or Pearson correlation with human scores, which, although seemingly impressive, often mask errors on individual instances. A judge that is right 90% of the time can be spectacularly wrong on the 10% that matters most. Therefore, accurately assessing the per-instance reliability of LLM judge systems is a pressing issue.
Innovation
This paper introduces a two-pronged diagnostic toolkit applied to the SummEval dataset to uncover inconsistencies in LLM judge systems on a per-instance level. First, a transitivity analysis reveals widespread per-input inconsistency, with 33%-67% of documents exhibiting at least one directed 3-cycle, despite low aggregate violation rates (0.8%-4.1%). Second, split conformal prediction sets over 1-5 Likert scores provide theoretically guaranteed coverage, with set width serving as a per-instance reliability indicator. Prediction set width shows consistent cross-judge agreement, indicating it captures document-level difficulty rather than judge-specific noise.
Methodology
- Transitivity Analysis: Measure directed 3-cycle violation rates across four judges on the SummEval dataset, revealing widespread per-input inconsistency.
- Conformal Prediction Sets: Use split conformal prediction sets over 1-5 Likert scores to provide theoretically guaranteed coverage, with set width serving as a per-instance reliability indicator (a minimal sketch follows this list).
- Consistency Evaluation: Assess the association between prediction set width and actual judge-human disagreement using Spearman correlation, validating the consistent cross-judge agreement of prediction set width.
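Here is a minimal sketch of the split conformal step, assuming the common absolute-error nonconformity score |judge score − y|; the paper's exact score function is not reproduced in this summary.

```python
import numpy as np

def conformal_sets(cal_judge, cal_human, test_judge, alpha=0.1):
    """Split conformal prediction sets over 1-5 Likert scores.

    cal_judge / cal_human: judge and (rounded) human scores on the
    calibration split; test_judge: judge scores on new instances.
    """
    scores = np.abs(np.asarray(cal_judge) - np.asarray(cal_human))
    n = len(scores)
    # q_hat is the ceil((n+1)(1-alpha))-th smallest calibration score;
    # if that rank exceeds n, the set must include every label.
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    q_hat = np.inf if rank > n else np.sort(scores)[rank - 1]
    # Keep every Likert label whose nonconformity is within q_hat;
    # the resulting set width is the per-instance reliability signal.
    return [{y for y in range(1, 6) if abs(j - y) <= q_hat} for j in test_judge]
```

With enough calibration points, wider sets mark instances where the judge's score should not be taken at face value.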
Experiments
Experiments are conducted on the SummEval dataset, which contains 100 documents × 16 systems (=1,600 outputs), each output rated by three annotators on a 1-5 Likert scale. For cost efficiency, the experiments subsample to 30 documents × 8 systems, rounding averaged human scores to the nearest integer for conformal calibration. Judges include gpt-4o-mini, meta-llama/llama-3.1-70b-instruct, qwen/qwen-2.5-72b-instruct, and mistralai/mistral-small-3.1-24b-instruct. All responses are cached in SQLite.
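The paper states only that responses are cached in SQLite; a minimal sketch of such a cache (the table name, key scheme, and `call_llm` helper are hypothetical) might look like this:

```python
import hashlib
import sqlite3

conn = sqlite3.connect("judge_cache.db")
conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT)")

def cached_judge(model: str, prompt: str, call_llm) -> str:
    """Return a cached LLM response, calling the API only on a cache miss."""
    key = hashlib.sha256(f"{model}|{prompt}".encode()).hexdigest()
    row = conn.execute("SELECT response FROM cache WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return row[0]
    response = call_llm(model, prompt)  # e.g. a chat-completion request
    conn.execute("INSERT INTO cache VALUES (?, ?)", (key, response))
    conn.commit()
    return response
```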
Results
Transitivity analysis shows that despite low overall violation rates (0.8%-4.1%), 33%-67% of documents exhibit at least one directed 3-cycle, indicating judge inconsistency on individual instances. Conformal prediction sets show a Spearman correlation of +0.576 (p<10^-100) across all judges and criteria, indicating a significant association between prediction set width and actual judge-human disagreement. Across four judges and four criteria, relevance and coherence are judged most reliably, while fluency and consistency remain unreliable.
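A minimal sketch of the width-vs-disagreement check, assuming absolute judge-human difference as the disagreement measure and pooling all instances, as the reported correlation does:

```python
import numpy as np
from scipy.stats import spearmanr

def width_error_correlation(widths, judge_scores, human_scores):
    """Spearman correlation between conformal set width and judge error."""
    disagreement = np.abs(np.asarray(judge_scores) - np.asarray(human_scores))
    return spearmanr(widths, disagreement)  # (correlation, p-value)
```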
Applications
The applications of this study include the deployment of judge systems in automatic NLG evaluation. By using transitivity analysis and conformal prediction sets, researchers and practitioners can more accurately assess the per-instance reliability of LLM judge systems, thereby improving the credibility of evaluation results. Additionally, this method can be applied to other NLP tasks requiring automatic evaluation, such as machine translation and dialogue generation.
Limitations & Outlook
Despite revealing per-instance inconsistency in LLM judge systems, the study has limitations. It is conducted only on the SummEval dataset, and results may not generalize to other datasets or tasks. Additionally, conformal prediction sets provide marginal coverage guarantees rather than per-document conditional coverage, potentially leading to overly tight prediction sets for difficult documents. Future research could expand to larger datasets and different NLG tasks and explore conditional conformal methods to improve coverage accuracy for difficult documents.
Plain Language (Accessible to non-experts)
Imagine you're shopping in a large supermarket. There are many cashiers, each responsible for checking the price and quality of every item. Each cashier has their own standards; some may focus more on the appearance of the item, while others may care more about its functionality. Now, you want to know if these cashiers are consistent in their judgments for each item.
This study is like analyzing the consistency of these cashiers' judgments. Through transitivity analysis, researchers found that although the cashiers' judgments seem consistent overall, there are inconsistencies for certain items. It's like some cashiers think apples are better than bananas, bananas are better than oranges, but oranges are better than apples.
To better assess the reliability of each item's judgment, researchers used a method called conformal prediction sets. It's like scoring each item, and the wider the score range, the more uncertain the cashiers are about that item. This method allows researchers to more accurately assess the reliability of each item's judgment.
In short, this study is like helping the supermarket better evaluate the quality of each item, ensuring that each item's judgment is carefully analyzed and evaluated.
ELI14 (Explained like you're 14)
Hey there! You know how in school, teachers sometimes have different standards when grading you? Like, some teachers might care more about your homework quality, while others focus on your class participation.
This study is like analyzing those grading standards. Researchers found that although teachers' grades look consistent overall, individual cases can be contradictory. It's like a teacher ranking your essay above Sam's and Sam's above Alex's, but then ranking Alex's above yours, so the preferences go in a circle.
To better assess each student's performance, researchers used a method called conformal prediction sets. It's like scoring each student, and the wider the score range, the more uncertain the teachers are about that student's performance. This method allows researchers to more accurately assess each student's performance.
In short, this study is like helping schools better evaluate each student's performance, ensuring that each grade is carefully analyzed and evaluated.
Glossary
Transitivity Analysis
A method for evaluating the consistency of preferences among multiple options by measuring directed 3-cycle violation rates, revealing judge inconsistency on individual instances.
Used in this paper to uncover inconsistencies in LLM judge systems on a per-instance level.
Conformal Prediction Set
A method providing finite-sample coverage guarantees: the prediction set contains the true label with probability at least 1−α. In the split variant used here, the width of each set serves as a per-instance reliability indicator.
Used to assess the reliability of each instance's judgment.
SummEval Dataset
A summarization evaluation dataset containing 100 documents × 16 systems (1,600 outputs), each output rated by three annotators on a 1-5 Likert scale.
The primary dataset used for experiments in this paper.
Directed 3-Cycle
A cyclic preference relationship among three options, such as A is preferred over B, B over C, but C over A.
Used in transitivity analysis to evaluate judge preference consistency.
Spearman Correlation Coefficient
A non-parametric statistical measure of the monotonic relationship between two variables.
Used to assess the association between prediction set width and actual judge-human disagreement.
Kendall's τ
A statistical measure of the consistency between two rankings.
Used to evaluate the consistency between LLM judge systems and human scores.
Likert Scale
A rating scale commonly used to measure attitudes or opinions, typically ranging from 1 to 5.
Used to evaluate system outputs in the SummEval dataset.
Marginal Coverage Guarantee
A guarantee that, averaged over all inputs, the prediction set contains the true value with at least the target probability; it need not hold for each individual input.
The coverage guarantee provided by conformal prediction sets.
Conditional Coverage
The probability that a prediction set contains the true value conditional on a specific input or subgroup; a stronger requirement than marginal coverage.
Future research could explore using conditional conformal methods to improve coverage accuracy for difficult documents.
Nonconformity Score
A metric used to measure the difference between predictions and actual outcomes.
Used in the calculation of conformal prediction sets.
Open Questions (Unanswered questions from this research)
- 1 How can this method be validated on larger datasets and different NLG tasks? The current study is conducted only on the SummEval dataset, and results may not generalize to other datasets or tasks, such as dialogue generation or machine translation.
- 2 How can the coverage accuracy of conformal prediction sets for difficult documents be improved? The current method provides marginal coverage guarantees rather than per-document conditional coverage, potentially leading to overly tight prediction sets for difficult documents.
- 3 How can dynamic nonconformity scoring systems based on judge confidence be developed? The current study uses a fixed nonconformity score; future work could explore learned nonconformity scores based on judge confidence or LLM log-probabilities.
- 4 How can the reliability of LLM judge systems be improved under different evaluation criteria? The current study shows that evaluation criteria matter more than the judge; future research could explore optimization methods under different criteria.
- 5 How can the credibility of evaluation results be improved without increasing computational costs? The current method may require significant computational resources; future research could explore more efficient evaluation methods.
Applications
Immediate Applications
Automatic NLG Evaluation
Researchers and practitioners can use transitivity analysis and conformal prediction sets to more accurately assess the per-instance reliability of LLM judge systems, thereby improving the credibility of evaluation results.
Machine Translation Quality Assessment
By applying the methods in this paper, the quality of machine translation system outputs can be better assessed, especially in cases where evaluation criteria are inconsistent.
Dialogue Generation System Evaluation
In dialogue generation tasks, using the methods in this paper can help identify judge inconsistencies on individual instances, thereby improving the accuracy of system evaluations.
Long-term Vision
Cross-Domain Evaluation Standardization
The methods in this paper can provide standardized approaches for automatic evaluation across different domains, improving evaluation consistency and comparability between different tasks.
Intelligent Evaluation System Development
Future developments could include intelligent evaluation systems based on the methods in this paper, automatically identifying and correcting judge inconsistencies on individual instances, enhancing the intelligence level of evaluation systems.
Abstract
LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores providing theoretically-guaranteed $\geq (1-\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = +0.576$, $N = 1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$-$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.