The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text

TL;DR

GPT models predict experience ratings from open-ended survey text; prompt customization improves within ±1 agreement by about 2 percentage points, while the character of the input text matters far more than prompt or model choice.

cs.CL 🔴 Advanced 2026-04-22
Andrew Hong Jason Potteiger Luis E. Zapata
LLM NLP predictive scoring text annotation prompt engineering

Key Findings

Methodology

This study examines the impact of prompt design and model selection on the performance of GPT models in predicting experience ratings from open-ended survey text. Four configurations were tested: the original baseline prompt and a moderately customized version, combined with three GPT models (4.1, 4.1-mini, 5.2). Approximately 10,000 post-game surveys from five MLB teams were analyzed to assess the effects of prompt customization and model selection on prediction accuracy.

Key Results

  • Result 1: On GPT 4.1, prompt customization improved within ±1 agreement by about 2 percentage points, from 67% to 69%.
  • Result 2: Model swaps degraded performance: GPT 5.2 returned to baseline, and GPT 4.1-mini fell six percentage points below it.
  • Result 3: Accuracy varied with the linguistic character of the input text by more than an order of magnitude more than with prompt or model choice.

Significance

This research highlights the relative importance of prompt design and model selection in using large language models (LLMs) for predicting open-ended text ratings. It demonstrates that while prompt customization can improve model performance to some extent, the linguistic characteristics of the input text have a more significant impact on the final results. This finding is crucial for the field of natural language processing, particularly in scenarios where quantitative information needs to be extracted from unstructured text.

Technical Contribution

The technical contribution of this paper lies in revealing the specific role of prompt design in correcting model biases in reading text and the unreliability of model selection in this regard. By systematically analyzing and comparing the performance of different prompt and model configurations, the study provides a deeper understanding of the role of prompt engineering in LLM prediction tasks.

Novelty

This study is the first to systematically analyze the impact of prompt design and model selection on LLM predictions of open-ended text ratings, particularly in the context of post-sports event surveys. Such comparative analysis is unprecedented in existing literature.

Limitations

  • Limitation 1: The model performs poorly on texts containing negative operational details, with within ±1 agreement dropping to 42-44%.
  • Limitation 2: The improvement from prompt customization is mainly concentrated on texts where surface sentiment and user ratings diverge, failing to universally enhance prediction accuracy across all text types.
  • Limitation 3: The study does not fully eliminate prediction errors due to missing information in the text.

Future Work

Future research can further explore how to improve LLM prediction accuracy across different text types through enhanced prompt design and model selection. Additionally, the study can be extended to other domains of open-ended text prediction tasks to verify the generalizability of these findings.

AI Executive Summary

In today's data-driven world, understanding the complexities of user experience is crucial for businesses and researchers. Traditional survey methods often rely on closed-ended questions, which limit a comprehensive understanding of users' true feelings. This paper explores the potential of using large language models (LLMs) to predict user experience ratings from open-ended survey text.

Building on previous work that found an unoptimized GPT 4.1 prompt could predict user-reported experience ratings within one point 67% of the time, this paper further tests the relative impact of prompt design and model selection on that performance. Approximately 10,000 post-game surveys from five MLB teams were analyzed, comparing four configurations: the original baseline prompt and a moderately customized version, combined with three GPT models (4.1, 4.1-mini, 5.2).

Results indicate that prompt customization on GPT 4.1 improved within ±1 agreement by about 2 percentage points, from 67% to 69%. However, model swaps degraded performance: GPT 5.2 returned to baseline, and GPT 4.1-mini fell six percentage points below it. Across capable configurations, accuracy varied with the linguistic character of the input text by more than an order of magnitude more than with the choice of prompt or model.

Technically, the paper pinpoints which part of the accuracy ceiling each lever can move: prompt design can correct a bias in how the model reads text, while model selection shifts performance unreliably in either direction. Systematically comparing prompt and model configurations yields a sharper picture of what prompt engineering actually does in LLM prediction tasks.

While prompt customization can improve model performance to some extent, the linguistic characteristics of the input text have a more significant impact on the final results. This finding is crucial for the field of natural language processing, particularly in scenarios where quantitative information needs to be extracted from unstructured text. Future research can further explore how to improve LLM prediction accuracy across different text types through enhanced prompt design and model selection.

Deep Analysis

Background

In the field of natural language processing, using large language models (LLMs) for text prediction and annotation has become a significant approach. With the advancement of model capabilities in recent years, LLMs have shown increasingly better performance across various tasks. However, effectively extracting quantitative information from open-ended text remains a challenge. Traditional text analysis methods often rely on closed-ended questions, limiting a comprehensive understanding of users' true feelings. To overcome this limitation, researchers have begun exploring the potential of using LLMs to predict user experience ratings from open-ended text.

Core Problem

The core problem is accurately predicting user experience ratings from open-ended survey text. Due to the unstructured nature of text and the diversity of language, model performance can vary significantly across different text types. Additionally, models may have biases in reading text, which need to be corrected through prompt design. The study aims to assess the impact of prompt design and model selection on prediction accuracy and identify ways to improve model performance.

Innovation

The core innovation of this paper lies in systematically analyzing the impact of prompt design and model selection on LLM predictions of open-ended text ratings. The study is the first to conduct such comparative analysis in the context of post-sports event surveys, revealing the specific role of prompt design in correcting model biases in reading text. By systematically analyzing and comparing the performance of different prompt and model configurations, the study provides a deeper understanding of the role of prompt engineering in LLM prediction tasks.

Methodology

  • Four configurations were used: the original baseline prompt and a moderately customized version, combined with three GPT models (4.1, 4.1-mini, 5.2).

  • Approximately 10,000 post-game surveys from five MLB teams were analyzed.

  • The impact of prompt customization and model selection on prediction accuracy was assessed.

  • Multiple metrics were used to evaluate model performance, including exact match rate, within ±1 agreement, mean absolute error, and directional bias.
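The four metrics above can be sketched in a few lines. This is a minimal illustration, not the paper's code; `preds` and `ratings` are assumed parallel lists of model-predicted and fan-reported ratings.

```python
# Sketch of the four evaluation metrics listed above.
# `preds` and `ratings` are hypothetical parallel lists of
# model-predicted and fan-reported experience ratings.

def evaluate(preds, ratings):
    n = len(ratings)
    errors = [p - r for p, r in zip(preds, ratings)]
    return {
        # Share of predictions matching the survey rating exactly.
        "exact_match": sum(e == 0 for e in errors) / n,
        # Share of predictions within one point of the survey rating.
        "within_pm1": sum(abs(e) <= 1 for e in errors) / n,
        # Average magnitude of the prediction error.
        "mae": sum(abs(e) for e in errors) / n,
        # Signed mean error; negative means the model underestimates.
        "directional_bias": sum(errors) / n,
    }

m = evaluate(preds=[7, 8, 5, 9], ratings=[8, 8, 7, 9])
# e.g. {'exact_match': 0.5, 'within_pm1': 0.75, 'mae': 0.75,
#       'directional_bias': -0.75}
```

A negative `directional_bias`, as here, is the underestimation tendency the paper reports across configurations.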

Experiments

The experimental design involved analyzing approximately 10,000 post-game surveys from five MLB teams. The baseline and customized prompts were run on GPT 4.1, GPT 4.1-mini, and GPT 5.2. By comparing the performance of different prompt and model configurations, the impact of prompt design and model selection on prediction accuracy was assessed. Multiple metrics were used to evaluate model performance, including exact match rate, within ±1 agreement, mean absolute error, and directional bias.
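The evaluation loop over configurations might look like the sketch below. `predict_rating` is a placeholder for the actual LLM call, and the exact pairing of prompts and models in the four configurations is an assumption, since this summary does not spell it out.

```python
# Sketch of evaluating the four prompt/model configurations.
# The pairings in CONFIGS are an assumption inferred from the text
# (baseline on GPT 4.1, plus the customized prompt on all three models).

CONFIGS = [
    ("baseline", "gpt-4.1"),
    ("customized", "gpt-4.1"),
    ("customized", "gpt-4.1-mini"),
    ("customized", "gpt-5.2"),
]

def predict_rating(prompt, model, text):
    # Placeholder: a real implementation would send the prompt and
    # survey text to the named model and parse an integer rating.
    raise NotImplementedError

def run_grid(surveys, scorer=predict_rating):
    # One list of (predicted, reported) rating pairs per configuration,
    # ready for the metric computations described above.
    results = {}
    for prompt, model in CONFIGS:
        results[(prompt, model)] = [
            (scorer(prompt, model, s["text"]), s["rating"])
            for s in surveys
        ]
    return results
```

Passing a stub `scorer` makes the loop testable without API access; the real runs would substitute the model call.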

Results

Experimental results indicate that prompt customization on GPT 4.1 improved within ±1 agreement by about 2 percentage points, from 67% to 69%. Model swaps, however, degraded performance: GPT 5.2 returned to baseline, and GPT 4.1-mini fell six percentage points below it. Accuracy varied with the linguistic character of the input text by more than an order of magnitude more than with the choice of prompt or model.

Applications

The findings of this study can be applied in scenarios where quantitative information needs to be extracted from unstructured text, such as customer feedback analysis, market surveys, and user experience research. By improving prompt design and model selection, LLM prediction accuracy across different text types can be enhanced, providing businesses and researchers with more accurate insights into user experiences.

Limitations & Outlook

While prompt customization can improve model performance to some extent, the linguistic characteristics of the input text have a more significant impact on the final results. Additionally, the model performs poorly on texts containing negative operational details, with within ±1 agreement dropping to 42-44%. Future research can further explore how to improve LLM prediction accuracy across different text types through enhanced prompt design and model selection.

Plain Language (accessible to non-experts)

Imagine you're watching a baseball game, and after the game, you're asked to rate your overall experience. You might say the game was great but mention that the concession lines were too long. Now, researchers want to use a smart computer program to predict your rating without directly asking you. This program is like a very smart assistant that reads what you wrote about the game and then guesses what kind of rating you would give.

This assistant uses something called a large language model (LLM). It's like a super-intelligent reader that can understand every word you write and try to figure out your overall feeling about the game. Researchers found that by giving this assistant some special instructions, like "don't give a low score just because of small issues," it can make more accurate predictions.

However, this assistant sometimes makes mistakes, especially when you mention some negative details. Researchers found that the assistant tends to underestimate your rating when dealing with these negative details. It's like the assistant hears you complain about the long lines and assumes you didn't enjoy the game overall.

To make this assistant smarter, researchers are working on improving its "understanding ability," hoping it can better understand your writing and make more accurate rating predictions.

ELI14 (explained like you're 14)

Hey, imagine you just watched an awesome baseball game, and afterward, you're asked to rate your overall experience. You might say the game was fantastic but also mention that the snack lines were way too long. Now, there's this super-smart computer program that wants to guess your rating without asking you directly. This program is like a super-intelligent assistant that reads what you wrote about the game and then tries to guess what kind of rating you'd give.

This assistant uses something called a large language model (LLM). It's like a super-smart reader that can understand every word you write and try to figure out your overall feeling about the game. Researchers found that by giving this assistant some special instructions, like "don't give a low score just because of small issues," it can make more accurate predictions.

But sometimes, this assistant messes up, especially when you talk about some negative stuff. Researchers found that the assistant tends to underestimate your rating when dealing with these negative details. It's like the assistant hears you complain about the long lines and thinks you didn't enjoy the game overall.

To make this assistant even smarter, researchers are working on improving its "understanding ability," hoping it can better understand your writing and make more accurate rating predictions.

Glossary

Large Language Model (LLM)

A large language model is a deep learning model trained to understand and generate natural language text. Such models are widely used for natural language processing tasks such as text generation, translation, and sentiment analysis.

In this paper, LLMs are used to predict user experience ratings from open-ended survey text.

Prompt Engineering

Prompt engineering involves designing and optimizing input prompts to improve the performance of large language models on specific tasks. By adjusting the content and structure of prompts, the quality of model outputs can be influenced.

The paper studies the impact of prompt design on model prediction accuracy.
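As an illustration only (the paper's actual prompt wording is not reproduced in this summary), a customized prompt might append corrective guidance of the kind described in the plain-language section. All strings below are hypothetical.

```python
# Hypothetical illustration of prompt customization; the paper's
# actual prompts are not given in this summary.

BASELINE_PROMPT = (
    "Read the following post-game survey response and predict the "
    "fan's overall experience rating as a single integer.\n\n"
    "Response: {text}"
)

# The customized version appends guidance against over-weighting
# minor operational complaints -- the reading bias this article says
# prompt design can correct.
CUSTOMIZED_PROMPT = BASELINE_PROMPT + (
    "\n\nDo not lower the rating solely because of minor operational "
    "issues (e.g. long concession lines) when the overall tone is "
    "positive."
)

filled = CUSTOMIZED_PROMPT.format(text="Loved the game, but lines were long.")
```

The point of the design is that the customized prompt only adds an instruction; everything else is held fixed, so any accuracy change is attributable to the added guidance.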

GPT 4.1

GPT 4.1 is a large language model with strong natural language processing capabilities: it can understand complex text inputs and generate relevant outputs.

In this paper, GPT 4.1 is used to test the impact of prompt design on prediction accuracy.

Within ±1 Agreement

Within ±1 agreement refers to the proportion of sessions where the predicted rating falls within one point of the survey rating. This metric is used to evaluate the accuracy of model predictions.

The paper uses within ±1 agreement to compare the performance of different prompt and model configurations.

Mean Absolute Error (MAE)

Mean absolute error is a metric that quantifies the average magnitude of errors in model predictions. It represents the average difference between predicted and actual values.

The paper uses MAE to assess the precision of model predictions.
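In symbols, with predicted ratings, survey ratings, and the number of responses denoted as below, MAE and its signed companion (the directional bias defined next) take their standard forms; these are textbook definitions, not formulas quoted from the paper.

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|,
\qquad
\mathrm{Bias} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)
```

Here $\hat{y}_i$ is the predicted rating, $y_i$ the survey rating, and $n$ the number of responses; dropping the absolute value turns MAE into the signed bias.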

Directional Bias

Directional bias refers to the tendency of model predictions to systematically overestimate or underestimate actual values. A negative directional bias indicates a tendency to underestimate ratings.

The paper analyzes directional bias across different configurations.

Text Annotation

Text annotation involves adding structured labels or information to text data to facilitate analysis and processing.

In this paper, text annotation is used to extract user experience ratings from open-ended survey text.

Model Selection

Model selection involves choosing the most suitable model for a specific task from multiple candidates. Criteria may include performance, computational cost, and applicability.

The paper studies the impact of model selection on prediction accuracy.

Customized Prompt

A customized prompt is an input prompt designed according to specific task requirements to improve model prediction performance.

The paper tests the impact of customized prompts on model prediction accuracy.

Baseline Prompt

A baseline prompt is a standard input prompt that has not been optimized or customized, used to evaluate the basic performance of a model.

The paper uses a baseline prompt as a comparison benchmark.

Open Questions (unanswered questions from this research)

  1. How can model prediction accuracy be further improved when dealing with texts containing negative details? Current research finds that models perform poorly on texts with negative operational details, with within ±1 agreement dropping to 42-44%. Future research can explore improving prompt design and model selection to enhance performance on these text types.
  2. Are these research findings generalizable to other domains of open-ended text prediction tasks? The study focuses on the context of post-sports event surveys, and future research can extend to other domains to verify the generalizability of these findings.
  3. How can prompt design and model selection be effectively combined to maximize LLM prediction performance? Current research shows different impacts of prompt design and model selection on prediction accuracy, and future research can explore how to effectively combine the two.
  4. How can directional bias in model predictions be reduced? The paper finds that all configurations exhibit a tendency to underestimate ratings, and future research can explore methods to reduce this bias.
  5. In multilingual environments, how do prompt design and model selection impact prediction accuracy? The study focuses on a single-language environment, and future research can explore applications in multilingual contexts.

Applications

Immediate Applications

Customer Feedback Analysis

Businesses can use improved LLM technology to analyze customer feedback, extracting quantitative information from open-ended text to better understand customer experience and satisfaction.

Market Surveys

Market researchers can leverage LLMs to predict user experience ratings from open-ended survey text, gaining more accurate market insights.

User Experience Research

Researchers can use LLM technology to extract experience ratings from user-generated content to evaluate the user experience of products or services.

Long-term Vision

Automated Satisfaction Surveys

In the future, LLM technology could be used to automate satisfaction surveys, extracting ratings from open-ended text and reducing reliance on closed-ended questions.

Multilingual Text Analysis

As LLM technology advances, it could be applied in multilingual environments to extract consistent experience ratings from texts in different languages.

Abstract

An earlier paper (Hong, Potteiger, and Zapata 2026) established that an unoptimized GPT 4.1 prompt predicts fan-reported experience ratings within one point 67% of the time from open-ended survey text. This paper tests the relative impact of prompt design and model selection on that performance. We compared four configurations on approximately 10,000 post-game surveys from five MLB teams: the original baseline prompt and a moderately customized version, crossed with three GPT models (4.1, 4.1-mini, 5.2). Prompt customization added roughly two percentage points of within ±1 agreement on GPT 4.1 (from 67% to 69%). Both model swaps from that best configuration degraded performance: GPT 5.2 returned to the baseline, and GPT 4.1-mini fell six percentage points below it. Both levers combined were dwarfed by the input itself: across capable configurations, accuracy varied more than an order of magnitude more by the linguistic character of the text than by the choice of prompt or model. The ceiling has two parts. One is a bias in how the model reads text, which prompt design can correct. The other is a difference between what fans write about and what they actually decide, which no engineering can close because the missing information is not in the text. Prompt customization moved the first part; model selection moved neither reliably. The result is not that "prompt engineering helps a little" but that prompt engineering helps in a specific and predictable way, on the part of the ceiling it can reach.


References (18)

  • N. Schwarz (1999). Self-reports: How the questions shape the answers. 2797 citations, influential.
  • Jason Potteiger, Andrew Hong, Ito Zapata (2026). LLM Predictive Scoring and Validation: Inferring Experience Ratings from Unstructured Text. 1 citation, influential.
  • Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska (2026). LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias. 1 citation, influential.
  • P. Wakker, D. Kahneman, R. Sarin (1997). Back to Bentham? Explorations of experienced utility. 2415 citations, influential.
  • Miles Turpin, Julian Michael, Ethan Perez et al. (2023). Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. 1007 citations.
  • Hemanth Asirvatham, Elliott Mokski, A. Shleifer (2026). GPT as a Measurement Tool. 1 citation.
  • F. Gilardi, Meysam Alizadeh, M. Kubli (2023). ChatGPT outperforms crowd workers for text-annotation tasks. 1414 citations.
  • N. Schwarz, G. Clore (1983). Mood, misattribution, and judgments of well-being: Informative and directive functions of affective states. 5112 citations.
  • Natalie A. Carlson, Vanessa C. Burbano (2025). The use of LLMs to annotate data in management research: Foundational guidelines and warnings. 13 citations.
  • Jens O. Ludwig, Sendhil Mullainathan, Ashesh Rambachan (2024). Large Language Models: An Applied Econometric Framework. 40 citations.
  • Steve Rathje, Dan-Mircea Mirea, Ilia Sucholutsky et al. (2024). GPT is an effective tool for multilingual psychological text analysis. 312 citations.
  • Hannah L. Bunt, Alex Goddard, T. Reader et al. (2025). Validating the use of large language models for psychological text classification. 8 citations.
  • P. Harms (2022). Bad Is Stronger Than Good. 1334 citations.
  • Petter Törnberg (2024). Large Language Models Outperform Expert Coders and Supervised Classifiers at Annotating Political Social Media Messages. 74 citations.
  • Hauke Licht, Rupak Sarkar, Patrick Y. Wu et al. (2025). Measuring Scalar Constructs in Social Science with LLMs. 7 citations.
  • D. Kahneman, B. Fredrickson, Charles A. Schreiber et al. (1993). When More Pain Is Preferred to Less: Adding a Better End. 1466 citations.
  • C. Barrie, Elli Palaiologou, Petter Törnberg (2024). Prompt Stability Scoring for Text Annotation with Large Language Models. 17 citations.
  • Naoki Egami, Musashi Jacobs-Harukawa, Brandon M. Stewart et al. (2023). Using Imperfect Surrogates for Downstream Inference: Design-based Supervised Learning for Social Science Applications of Large Language Models. 44 citations.