Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

TL;DR

BLF system achieves state-of-the-art binary forecasting performance on ForecastBench using sequential Bayesian updating of linguistic beliefs.

cs.AI · Advanced · 2026-04-21
Kevin Murphy
Tags: Bayesian updating, language model, forecasting, machine learning, data analysis

Key Findings

Methodology

The paper introduces BLF (Bayesian Linguistic Forecaster), a system for binary forecasting built on three core ideas: 1) a Bayesian linguistic belief state, which combines numerical probability estimates with natural-language evidence summaries and is updated at each step of an iterative tool-use loop; 2) hierarchical multi-trial aggregation, which combines K independent trials using logit-space shrinkage with a data-dependent prior; 3) hierarchical calibration, i.e., Platt scaling with a hierarchical prior that avoids over-shrinking extreme predictions for sources with skewed base rates. Experiments show BLF outperforms all top public methods on the ForecastBench benchmark.
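The aggregation step (idea 2) can be sketched as follows. This is a minimal illustration with a fixed shrinkage weight and prior probability; the paper instead fits a data-dependent prior, and the function names here are hypothetical:

```python
import math

def logit(p, eps=1e-6):
    """Map a probability to log-odds, clipping away 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def shrinkage_aggregate(trial_probs, prior_prob=0.5, shrink=0.3):
    """Average K independent trial probabilities in logit space, then
    shrink the mean toward a prior log-odds. `prior_prob` and `shrink`
    are illustrative placeholders; the paper fits a data-dependent
    prior rather than fixing these by hand."""
    zs = [logit(p) for p in trial_probs]
    z_mean = sum(zs) / len(zs)
    z_shrunk = (1 - shrink) * z_mean + shrink * logit(prior_prob)
    return sigmoid(z_shrunk)
```

Working in logit space keeps the average well-behaved near 0 and 1, and the shrinkage term pulls noisy multi-trial means back toward the prior.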

Key Results

  • On 400 backtesting questions from ForecastBench, the BLF system outperforms all top public methods, including Cassi, GPT-5, Grok 4.20, and Foresight-32B. Specifically, BLF achieves a difficulty-adjusted Brier Index (ABI) of 94.8 on market questions, compared to 91.4 for Foresight-32B.
  • Ablation studies indicate that removing the structured belief state degrades the Brier Index by 5.1, a larger effect than removing web search (3.4).
  • BLF significantly outperforms the crowd baseline on market questions, while all other methods are statistically indistinguishable from simply returning the market price.

Significance

The BLF system represents a breakthrough in binary forecasting, particularly in handling extreme predictions with skewed base rates. By employing hierarchical calibration and multi-trial aggregation, BLF not only improves prediction accuracy but also demonstrates better adaptability across different datasets. This research provides new insights for future forecasting systems, especially in effectively integrating natural language processing and probabilistic inference.

Technical Contribution

The technical contribution of BLF lies in combining Bayesian linguistic belief states with hierarchical multi-trial aggregation and hierarchical calibration. Beyond the modeling ideas, the work contributes a robust back-testing framework with a leakage rate below 1.5% and a rigorous statistical methodology for comparing methods while controlling for various sources of noise. Compared to existing state-of-the-art methods, BLF shows higher accuracy and robustness on complex forecasting problems.

Novelty

The BLF system is the first to integrate Bayesian updating with natural language processing for binary forecasting. This innovation lies in its ability to dynamically update belief states and effectively handle extreme predictions through hierarchical calibration, offering higher prediction accuracy compared to traditional methods.

Limitations

  • The BLF system may perform poorly when dealing with completely unknown events, as it relies on historical data and the knowledge of language models.
  • Due to the system's complexity, computational costs may be high, especially during multi-trial aggregation.
  • Hierarchical shrinkage may lead to performance degradation on certain specific datasets.

Future Work

Future research directions include: 1) extending the BLF system to handle multi-class forecasting problems; 2) optimizing the system's computational efficiency to reduce costs; 3) testing in more real-world application scenarios to verify its broad applicability.

AI Executive Summary

Forecasting the probability of future events is a fundamental challenge with applications in geopolitics, finance, and public health. Traditional methods often struggle with dynamic environments, relying heavily on historical data and complex mathematical models.

The BLF (Bayesian Linguistic Forecaster) system offers a novel solution for binary forecasting by integrating Bayesian updating and natural language processing. Its core lies in a dynamically updated Bayesian linguistic belief state, combining numerical probability estimates with natural-language evidence summaries, enabling more accurate predictions at each iterative step.

The technical principles of the BLF system include three key components: 1) Bayesian linguistic belief state for dynamic belief updates; 2) hierarchical multi-trial aggregation using logit-space shrinkage with data-dependent priors; 3) hierarchical calibration using Platt scaling with hierarchical priors to avoid over-shrinking extreme predictions. These techniques enable BLF to excel in handling complex forecasting problems.
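The calibration step builds on Platt scaling. As a minimal sketch, a plain (non-hierarchical) Platt scaler can be fit by gradient descent on log loss; the hierarchical prior the paper places on the parameters per question source is not reproduced here, and the learning rate and step count below are arbitrary choices:

```python
import math

def fit_platt(raw_probs, outcomes, lr=0.1, steps=2000):
    """Fit p_cal = sigmoid(a * logit(p_raw) + b) by gradient descent on
    log loss. A plain Platt scaler for illustration only; the paper's
    hierarchical version additionally shares a prior over (a, b) across
    question sources."""
    logit = lambda p: math.log(p / (1 - p))
    sig = lambda z: 1 / (1 + math.exp(-z))
    zs = [logit(min(max(p, 1e-6), 1 - 1e-6)) for p in raw_probs]
    a, b = 1.0, 0.0
    for _ in range(steps):
        ga = gb = 0.0
        for z, y in zip(zs, outcomes):
            err = sig(a * z + b) - y  # gradient of log loss w.r.t. (a*z + b)
            ga += err * z
            gb += err
        a -= lr * ga / len(zs)
        b -= lr * gb / len(zs)
    return a, b
```

A fitted slope a < 1 shrinks overconfident forecasts toward 0.5; a hierarchical prior prevents that shrinkage from being overdone for sources whose extreme forecasts are genuinely reliable.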

On 400 backtesting questions from the ForecastBench benchmark, the BLF system outperforms all top public methods, including Cassi, GPT-5, Grok 4.20, and Foresight-32B. Specifically, BLF achieves a difficulty-adjusted Brier Index (ABI) of 94.8 on market questions, compared to 91.4 for Foresight-32B. This result demonstrates a significant improvement in prediction accuracy.

The broad application potential of the BLF system is evident in its ability to handle various types of forecasting problems, including market and dataset predictions. This success not only brings new technical breakthroughs to the forecasting field but also points the way for future research.

However, the BLF system has limitations, such as potential poor performance in completely unknown events and high computational costs. Future research will focus on optimizing computational efficiency and testing in more real-world application scenarios.

Deep Analysis

Background

Forecasting the probability of future events is a fundamental challenge with wide-ranging applications in geopolitics, finance, and public health. Recent advancements have shown that large language models (LLMs) can approach human-level forecasting when given web search access, and benchmarks like ForecastBench provide standardized evaluation and online leaderboards for comparing methods. Against this backdrop, BLF (Bayesian Linguistic Forecaster) integrates Bayesian updating with natural language processing: it maintains a dynamically updated linguistic belief state that combines numerical probability estimates with natural-language evidence summaries, rather than appending all retrieved evidence to an ever-growing context.

Core Problem

Traditional forecasting methods often struggle in dynamic environments, relying heavily on historical data and hand-built mathematical models, while LLM-based agents commonly accumulate every retrieved document in an ever-growing context. BLF addresses both problems with a compact, dynamically updated Bayesian linguistic belief state that combines numerical probability estimates with natural-language evidence summaries, refined at each step of an iterative tool-use loop.

Innovation

The innovation of the BLF system is to treat the forecast itself as a belief state that the model revises at each step of its tool-use loop: the numerical probability estimate and a compressed natural-language evidence summary are updated together, rather than re-derived from a single ever-growing prompt. Combined with multi-trial aggregation and hierarchical calibration, this lets BLF handle complex forecasting problems more reliably than traditional single-shot methods.

Methodology

The technical principles of the BLF system include three key components:


  • Bayesian linguistic belief state: for dynamic belief updates, combining numerical probability estimates with natural-language evidence summaries.
  • Hierarchical multi-trial aggregation: using logit-space shrinkage with data-dependent priors to combine independent trial results.
  • Hierarchical calibration: using Platt scaling with hierarchical priors to avoid over-shrinking extreme predictions.
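As a sketch of the first component, the belief state can be held as a small structured object that the model rewrites at each tool-use step, instead of appending every retrieved document to the context. The `revise_fn` interface and the toy update rule below are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class BeliefState:
    """Semi-structured belief: a probability plus a running evidence
    summary, replacing the pattern of appending every document to the prompt."""
    prob: float = 0.5
    evidence_summary: str = ""

def update_belief(state, new_evidence, revise_fn):
    """One step of the iterative tool-use loop. `revise_fn` stands in for
    the LLM call that reads (current belief, new evidence) and returns a
    revised probability plus a compressed summary; this interface is an
    assumption for illustration."""
    prob, summary = revise_fn(state.prob, state.evidence_summary, new_evidence)
    return BeliefState(prob=prob, evidence_summary=summary)

def toy_revise(prob, summary, evidence):
    """Deterministic stand-in for the LLM: nudge the probability and
    append a one-line digest of the new evidence."""
    direction = 0.1 if "supports" in evidence else -0.1
    new_prob = min(max(prob + direction, 0.01), 0.99)
    return new_prob, f"{summary} | {evidence}".strip(" |")

state = BeliefState()
for ev in ["search result supports YES", "expert quote supports YES"]:
    state = update_belief(state, ev, toy_revise)
```

The key design point is that the context stays bounded: each step carries forward only the current probability and a compressed summary.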

Experiments

On 400 backtesting questions from the ForecastBench benchmark, the BLF system outperforms all top public methods, including Cassi, GPT-5, Grok 4.20, and Foresight-32B. Specifically, BLF achieves a difficulty-adjusted Brier Index (ABI) of 94.8 on market questions, compared to 91.4 for Foresight-32B. This result demonstrates a significant improvement in prediction accuracy.

Results

Beyond the headline ABI numbers, ablation studies show that removing the structured belief state degrades the Brier Index by 5.1, a larger effect than removing web search (3.4). BLF also significantly outperforms the crowd baseline on market questions, whereas all other methods are statistically indistinguishable from simply returning the market price.

Applications

The broad application potential of the BLF system is evident in its ability to handle various types of forecasting problems, including market and dataset predictions. This success not only brings new technical breakthroughs to the forecasting field but also points the way for future research.

Limitations & Outlook

The BLF system has limitations, such as potential poor performance in completely unknown events and high computational costs. Future research will focus on optimizing computational efficiency and testing in more real-world application scenarios.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen trying to predict which dish your family will like. Traditional methods might involve looking at past recipes and family feedback, then making a rough guess. But the BLF system is like a smart assistant that not only references past recipes but also adjusts your menu based on real-time feedback and changes in your family's taste preferences. This assistant updates its predictions every time you cook, ensuring that the dishes you make always match your family's taste. In this way, the BLF system can make more accurate predictions in dynamically changing environments.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a prediction game where you have to guess tomorrow's weather. Traditional methods might involve looking at past weather records and making a rough guess. But the BLF system is like a super-smart weather forecaster that not only references past weather records but also adjusts its predictions based on the latest weather data and trends. This system updates its judgment every time it makes a prediction, ensuring you get the most accurate weather forecast. Isn't that cool?

Glossary

Bayesian Updating

A statistical method used to update probability estimates based on new data.

Used in the BLF system for dynamically updating prediction beliefs.

Language Model

A model used for generating and understanding natural language, typically based on deep learning.

Used in the BLF system to generate natural-language evidence summaries.

Belief State

A structured data representation of current prediction beliefs, including probability estimates and evidence summaries.

Used in the BLF system for dynamically updating predictions.

Hierarchical Calibration

A calibration method that avoids over-shrinking extreme predictions through hierarchical priors.

Used in the BLF system to improve prediction accuracy.

Platt Scaling

A technique used to convert raw prediction probabilities into calibrated probabilities.

Used in the BLF system for hierarchical calibration.

Logit-space Shrinkage

A method for achieving more stable predictions by adjusting the logit values of prediction results.

Used in the BLF system for multi-trial aggregation.

Brier Index

A metric for evaluating prediction accuracy. The raw Brier score is lower-is-better; the difficulty-adjusted Brier Index (ABI) reported in this paper is rescaled so that higher values indicate better performance.

Used in the BLF system's experiments to evaluate performance.
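The raw Brier score underlying this metric is simply the mean squared error of the forecasts. A minimal sketch (the difficulty adjustment behind the higher-is-better ABI figures reported above is not reproduced here):

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1
    outcomes; lower is better for this raw score."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
```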

ForecastBench

A benchmark for evaluating forecasting system performance, containing various types of questions.

Used in the BLF system's experiments for testing and comparison.

Ablation Study

A method for evaluating the impact of removing certain parts of a system on overall performance.

Used in the BLF system's experiments to validate the importance of each component.

Zero-shot Forecasting

A method for making predictions without prior training data.

Used in the BLF system's experiments for baseline comparison.

Open Questions (unanswered questions from this research)

  1. How can the BLF system improve prediction accuracy for completely unknown events? Current methods rely on historical data and language-model knowledge, which may perform poorly in completely unknown scenarios.
  2. How can the computational efficiency of the BLF system be optimized? Due to the system's complexity, computational costs may be high, especially during multi-trial aggregation.
  3. Hierarchical shrinkage may degrade performance on certain datasets. How can this be resolved without affecting performance elsewhere?
  4. How does the BLF system perform on multi-class forecasting problems? Current research focuses mainly on binary forecasting.
  5. How can the broad applicability of the BLF system be verified? Current experiments focus on a specific benchmark; testing in more real-world applications is needed.

Applications

Immediate Applications

Financial Market Forecasting

The BLF system can be used to predict stock market trends, helping investors make more informed decisions.

Public Health Alerts

By predicting the spread of epidemics, the BLF system can provide early warnings for public health agencies.

Geopolitical Analysis

The BLF system can be used to predict changes in international relations, helping governments formulate more effective foreign policies.

Long-term Vision

Intelligent Decision Systems

The BLF system can become a core component of future intelligent decision systems, helping various industries achieve automated decision-making.

Fully Automated Forecasting Platform

By integrating more data sources and forecasting models, the BLF system can evolve into a fully automated forecasting platform, widely applicable across various fields.

Abstract

We present BLF (Bayesian Linguistic Forecaster), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is built on three ideas. (1) A Bayesian linguistic belief state: a semi-structured representation combining numerical probability estimates with natural-language evidence summaries, updated by the LLM at each step of an iterative tool-use loop. This contrasts with the common approach of appending all retrieved evidence to an ever-growing context. (2) Hierarchical multi-trial aggregation: running K independent trials and combining them using logit-space shrinkage with a data-dependent prior. (3) Hierarchical calibration: Platt scaling with a hierarchical prior, which avoids over-shrinking extreme predictions for sources with skewed base rates. On 400 backtesting questions from the ForecastBench leaderboard, BLF outperforms all the top public methods, including Cassi, GPT-5, Grok 4.20, and Foresight-32B. Ablation studies show that the structured belief state is as impactful as web search access, and that shrinkage aggregation and hierarchical calibration each provide significant additional gains. In addition, we develop a robust back-testing framework with a leakage rate below 1.5%, and use rigorous statistical methodology to compare different methods while controlling for various sources of noise.


References (20)

  • Outcome-based Reinforcement Learning to Predict the Future. Benjamin Turtel, Danny Franklin, Kris Skotheim et al. (2025), 7 citations.
  • Stein's Estimation Rule and Its Competitors: An Empirical Bayes Approach. B. Efron, C. Morris (1973), 1054 citations.
  • AutoHarness: improving LLM agents by automatically synthesizing a code harness. Xinghua Lou, Miguel Lázaro-Gredilla, A. Dedieu et al. (2026), 5 citations.
  • Reasoning and Tools for Human-Level Forecasting. Elvis Hsieh, Preston Fu, Jonathan Chen (2024), 7 citations.
  • PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data. Pu-Jen Cheng, Junchen Liu, Yunshen Long (2026), 1 citation.
  • Wisdom of the silicon crowd: LLM ensemble prediction capabilities rival human crowd accuracy. P. Schoenegger, Indre Tuminauskaite, P. S. Park et al. (2024), 66 citations.
  • Can Language Models Use Forecasting Strategies? Sarah Pratt, S. Blumberg, Pietro K. Carolino et al. (2024), 12 citations.
  • OpenEP: Open-Ended Future Event Prediction. Yong Guan, Hao Peng, Xiaozhi Wang et al. (2024), 12 citations.
  • Judgmental forecasting: A review of progress over the last 25 years. Michael Lawrence, P. Goodwin, M. O'Connor et al. (2006), 529 citations.
  • TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems. Md. Atik Ahamed, Mihir Parmar, Palash Goyal et al. (2026), 1 citation.
  • Proper Scoring Rules for Estimation and Forecast Evaluation. Kartik G. Waghmare, J. Ziegel (2025), 19 citations.
  • Pitfalls in Evaluating Language Model Forecasters. Daniel Paleka, Shashwat Goel, Jonas Geiping et al. (2025), 11 citations.
  • ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities. Ezra Karger, Houtan Bastani, Chen Yueh-Han et al. (2024), 50 citations.
  • Simulated Ignorance Fails: A Systematic Study of LLM Behaviors on Forecasting Problems Before Model Knowledge Cutoff. Zehan Li, YuXuan Wang, Ali El Lahib et al. (2026), 2 citations.
  • Scaling Open-Ended Reasoning to Predict the Future. Nikhil Chandak, Shashwat Goel, Ameya Prabhu et al. (2025), 4 citations.
  • Time-R1: Towards Comprehensive Temporal Reasoning in LLMs. Zijia Liu, Peixuan Han, Haofei Yu et al. (2025), 17 citations.
  • Forecasting Future World Events with Neural Networks. Andy Zou, Tristan Xiao, Ryan Jia et al. (2022), 45 citations.
  • FinTradeBench: A Financial Reasoning Benchmark for LLMs. Yogesh Agrawal, Aniruddha Dutta, Mahadi Hasan et al. (2026), 1 citation.
  • Superforecasting: The Art and Science of Prediction. P. Tetlock, Dan Gardner (2015), 657 citations.
  • Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets. Jade Zhang, Gardenia Liu, Oliver Johansson et al. (2026), 1 citation.