The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence

TL;DR

Proposes a Markovian framework for auditing agentic AI reliability and oversight cost before deployment; on the BPI 2019 log, a refined state representation raises state-action blind mass from 0.0165 at τ=50 to 0.1253 at τ=1000.

cs.AI · 2026-03-26
Biplab Pal, Santanu Bhattacharya
Markov framework · agentic AI · reliability · oversight cost · enterprise workflow

Key Findings

Methodology

The paper introduces a measure-theoretic Markov framework to evaluate the reliability and oversight cost of agentic AI before deployment. Core quantities include state blind-spot mass B_n(τ), state-action blind mass B^SA_{π,n}(τ), an entropy-based human-in-the-loop escalation gate, and an expected oversight-cost identity over the workflow visitation measure. The framework is instantiated on the BPI 2019 purchase-to-pay log to validate its effectiveness.
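The two blind-mass quantities can be sketched from event-log counts. This is an illustrative sketch, not the paper's code: it assumes B_n(τ) is the fraction of held-out visitation mass falling on states whose training support count is below τ, and analogously for state-action pairs; the function and variable names are hypothetical.

```python
# Sketch: state blind-spot mass B_n(tau) and state-action blind mass
# B^SA_{pi,n}(tau) estimated from (state, action) event-log steps.
# Assumption: "blind" means training support count below the threshold tau.
from collections import Counter

def blind_masses(train_steps, test_steps, tau):
    """train_steps/test_steps: iterables of (state, action) pairs."""
    state_support = Counter(s for s, _ in train_steps)
    sa_support = Counter(train_steps)

    test = list(test_steps)
    n = len(test)
    # Fraction of held-out visits landing on under-supported states.
    b_state = sum(1 for s, _ in test if state_support[s] < tau) / n
    # Fraction of held-out visits whose (state, action) pair is under-supported.
    b_sa = sum(1 for sa in test if sa_support[sa] < tau) / n
    return b_state, b_sa

train = [("PO_created", "approve")] * 60 + [("PO_created", "reject")] * 3
test = [("PO_created", "approve")] * 9 + [("PO_created", "reject")]
# The state is well supported, but the rare reject pair is not:
print(blind_masses(train, test, tau=50))  # -> (0.0, 0.1)
```

The toy example mirrors the paper's headline finding: state-level support can look complete while decision-level (state-action) blind mass remains.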

Key Results

  • Result 1: On the BPI 2019 log, refining the operational state expanded the state space from 42 to 668 and raised state-action blind mass from 0.0165 at τ=50 to 0.1253 at τ=1000, showing that a workflow can look well supported at the state level while retaining substantial blind mass over next-step decisions.
  • Result 2: On the held-out split, the policy confidence m(s) = max_a π̂(a|s) tracked realized autonomous step accuracy within 3.4 percentage points on average, validating the framework's predictive calibration.
  • Result 3: The risk-weighted state-action blind mass was 0.0202 at τ=200 and 0.0505 at τ=1000, showing how weighting decisions by risk reshapes the assessment of credible autonomy.

Significance

This research provides a systematic way to evaluate the reliability and oversight cost of agentic AI before deployment, addressing the coupled reliability and economics challenges of large-scale enterprise workflows. The instantiation on a real procurement log demonstrates direct applicability to purchase-to-pay processes, and the framework extends to any engineering process with operational event logs. Beyond its academic contribution, it gives practitioners concrete tools for deciding where autonomy is statistically credible and where human oversight remains necessary.

Technical Contribution

The technical contribution is a measure-theoretic Markov framework for pre-deployment auditing of agentic AI reliability and oversight cost. By introducing state blind-spot mass and state-action blind mass, the framework identifies under-supported regions of a workflow before an agent is deployed into them, and its oversight-cost identity ties those same quantities to the expected economic burden of human review, offering theoretical guarantees that existing methods lack for large-scale enterprise workflows.

Novelty

This study is the first to propose a Markov framework for evaluating the reliability and oversight cost of agentic AI before deployment. Unlike existing enterprise workflow evaluation methods, it considers not only state support but also state-action blind mass and a reproducible risk-weighting mechanism, yielding a more comprehensive assessment of where autonomy is defensible.

Limitations

  • Limitation 1: The BPI log is observational rather than agent-generated, limiting the framework's ability to directly evaluate counterfactual effects of arbitrary actions.
  • Limitation 2: The state representation uses a first-order Markov approximation, which may not capture more complex state dependencies.
  • Limitation 3: The risk-proxy weights are reproducible by construction, but they remain proxies and may require recalibration for different application scenarios.

Future Work

Future research could extend the framework to support more complex state representations and counterfactual evaluations. Additionally, exploring its applicability in different domains and application scenarios could verify its generality and robustness. Further studies could also optimize the risk-weighting mechanism to improve the accuracy and economics of autonomous decision-making.

AI Executive Summary

In modern enterprises, deploying agentic artificial intelligence (AI) faces dual challenges of reliability and oversight cost. Traditional workflows are typically engineered to behave near-deterministically through approval rules, validation checks, and exception-handling logic. However, when systems driven by large language models (LLMs) or agentic policies are introduced, execution is no longer described by one-step plausibility alone but by a trajectory distribution over a constrained process.

This paper proposes a measure-theoretic Markov framework to evaluate the reliability and oversight cost of agentic AI before deployment. The framework's core quantities include state blind-spot mass B_n(τ), state-action blind mass B^SA_{π,n}(τ), an entropy-based human-in-the-loop escalation gate, and an expected oversight-cost identity over the workflow visitation measure. The framework is instantiated on the BPI 2019 purchase-to-pay log to validate its effectiveness.

In experiments, refining the operational state to include case context, economic magnitude, and actor class expanded the state space from 42 to 668, and state-action blind mass rose from 0.0165 at τ=50 to 0.1253 at τ=1000. This indicates that a workflow can appear well supported at the state level while substantial blind mass remains over next-step decisions, and that richer state representations expose reliability risks that coarse ones hide.

Beyond its academic interest, the framework gives practitioners a concrete pre-deployment audit: the same quantities that delimit statistically credible autonomy also determine the expected oversight burden. The procurement instantiation demonstrates the approach end to end on a real enterprise log and offers a template for other engineering processes.

However, the framework also has limitations. The BPI log is observational rather than agent-generated, limiting the framework's ability to directly evaluate counterfactual effects of arbitrary actions. Additionally, the state representation uses a first-order Markov approximation, which may not capture more complex state dependencies. Future research could extend the framework to support more complex state representations and counterfactual evaluations.

Deep Analysis

Background

With the rapid development of AI technology, agentic AI is being applied ever more widely in enterprises, which face the dual challenges of reliability and oversight cost when deploying it. Traditional enterprise workflows are engineered to behave near-deterministically through approval rules, validation checks, and exception-handling logic; once systems driven by large language models (LLMs) or agentic policies are introduced, execution is described not by one-step plausibility alone but by a trajectory distribution over a constrained process. Much recent work has aimed at improving agent autonomy and reliability, but how to evaluate reliability and oversight cost before deployment in large-scale enterprise workflows has remained unsolved.

Core Problem

The core problem in deploying agentic AI is ensuring reliability while controlling oversight cost. A single plausible next step is not enough: what matters is whether the whole trajectory distribution stays statistically supported, locally unambiguous, and economically governable. This mismatch between near-deterministic workflow engineering and stochastic agent behavior is no longer hypothetical, and enterprises need a systematic, pre-deployment way to assess the feasibility and economics of autonomy in large-scale workflows.

Innovation

The core innovation of this paper is a measure-theoretic Markov framework for evaluating the reliability and oversight cost of agentic AI before deployment:

  • State blind-spot mass B_n(τ) and state-action blind mass B^SA_{π,n}(τ) identify under-supported areas in workflows.
  • An entropy-based human-in-the-loop escalation gate more accurately assesses the reliability of autonomous decision-making.
  • An expected oversight-cost identity over the workflow visitation measure couples reliability and economics.

Methodology

This paper proposes a measure-theoretic Markov framework to evaluate the reliability and oversight cost of agentic AI before deployment, built on four quantities:

  • State blind-spot mass B_n(τ): measures the proportion of deployment mass in low-support states.
  • State-action blind mass B^SA_{π,n}(τ): measures the support for choosing the next action in agentic systems.
  • Entropy-based human-in-the-loop escalation gate: combines Shannon entropy with reproducible risk weighting to establish an escalation rule for human intervention.
  • Expected oversight-cost identity over the workflow visitation measure: couples reliability and economics at the level of workflow visitation.
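The last two quantities can be combined into a small sketch. This is a hedged illustration, not the paper's implementation: it assumes the gate escalates a state to a human whenever the Shannon entropy of π̂(·|s) exceeds a threshold, and that expected oversight cost is the visitation-weighted sum of per-step review costs; the threshold form, cost model, and all names are assumptions.

```python
# Sketch: entropy-based escalation gate plus an expected oversight-cost
# identity over a visitation measure nu(s). Gate form and cost model assumed.
import math

def entropy(probs):
    """Shannon entropy (bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_oversight_cost(visitation, policy, h_max, c_human, c_auto=0.0):
    """visitation: {state: nu(s)}; policy: {state: {action: prob}}.
    Escalate state s to a human when H(pi_hat(.|s)) > h_max; expected cost
    is the visitation-weighted sum of per-step costs."""
    cost = 0.0
    for s, nu in visitation.items():
        h = entropy(policy[s].values())
        cost += nu * (c_human if h > h_max else c_auto)
    return cost

policy = {"s_clear": {"approve": 0.97, "reject": 0.03},    # low entropy: autonomous
          "s_ambiguous": {"approve": 0.5, "reject": 0.5}}  # max entropy: escalate
visitation = {"s_clear": 0.9, "s_ambiguous": 0.1}
print(expected_oversight_cost(visitation, policy, h_max=0.5, c_human=1.0))  # -> 0.1
```

The design point this illustrates is the paper's coupling: the same local-ambiguity signal that gates autonomy also determines, through the visitation measure, how much human review the deployment will cost.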

Experiments

The experiments use the BPI 2019 purchase-to-pay log, which contains 251,734 cases and 1,595,923 events over 42 distinct workflow actions. The log was split chronologically 80/20 into training and held-out portions, and a log-driven simulated agent constructed from the training split was used to validate the framework's predictions on the held-out split. The experiments compared state blind-spot mass and state-action blind mass under different state representations and analyzed the effect of risk weighting on the assessed scope of autonomous decision-making.
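The evaluation protocol above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's code: a chronological 80/20 split, a count-based next-action policy π̂(a|s) fit on the training portion, and the confidence m(s) = max_a π̂(a|s) compared against realized greedy step accuracy on the held-out portion. The toy event data and all names are hypothetical.

```python
# Sketch: chronological split, count-based policy estimation, and the
# confidence-vs-accuracy comparison m(s) = max_a pi_hat(a|s).
from collections import Counter, defaultdict

def fit_policy(steps):
    """Estimate pi_hat(a|s) from (state, action) counts."""
    counts = defaultdict(Counter)
    for s, a in steps:
        counts[s][a] += 1
    return {s: {a: c / sum(ac.values()) for a, c in ac.items()}
            for s, ac in counts.items()}

steps = ([("created", "approve")] * 4 + [("created", "reject")]
         + [("created", "approve")] * 4 + [("created", "reject")])
cut = int(0.8 * len(steps))                  # chronological 80/20 split
train, held_out = steps[:cut], steps[cut:]

pi_hat = fit_policy(train)
m = {s: max(p.values()) for s, p in pi_hat.items()}        # confidence m(s)
greedy = {s: max(p, key=p.get) for s, p in pi_hat.items()}
acc = sum(greedy[s] == a for s, a in held_out) / len(held_out)
print(m["created"], acc)  # -> 0.875 0.5
```

On the real log the paper reports m(s) tracking realized accuracy within 3.4 percentage points on average; the toy gap here simply shows the two quantities being measured on the same held-out steps.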

Results

The experimental results show that the state space expanded from 42 to 668 under the refined representation, and state-action blind mass rose from 0.0165 at τ=50 to 0.1253 at τ=1000: the workflow appears well supported at the state level, yet substantial blind mass remains over next-step decisions. Incorporating case context, economic magnitude, and actor class lets the framework assess the reliability of autonomous decision-making more precisely. The risk-weighted state-action blind mass was 0.0202 at τ=200 and 0.0505 at τ=1000, showing how risk weighting changes the assessment of autonomous decision-making.
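A risk-weighted variant of the blind mass can be sketched as follows. This is a hedged illustration: it assumes under-supported (s, a) pairs are weighted by a risk proxy w(s, a), larger for high-value steps, and normalized by the total test weight; the weight function, normalization, and names are assumptions, not the paper's definitions.

```python
# Sketch: risk-weighted state-action blind mass. Under-supported pairs are
# weighted by a risk proxy w(s, a) instead of counted uniformly (assumed form).
from collections import Counter

def risk_weighted_blind_mass(train_steps, test_steps, tau, weight):
    support = Counter(train_steps)
    test = list(test_steps)
    total = sum(weight(s, a) for s, a in test)
    blind = sum(weight(s, a) for s, a in test if support[(s, a)] < tau)
    return blind / total

def weight(s, a):
    # Toy risk proxy: high-value purchase orders weigh 10x ordinary ones.
    return 10.0 if s.endswith("high_value") else 1.0

train = [("po_low", "approve")] * 100 + [("po_high_value", "approve")] * 5
test = [("po_low", "approve")] * 8 + [("po_high_value", "approve")] * 2
print(risk_weighted_blind_mass(train, test, tau=50, weight=weight))  # -> 5/7
```

The rare high-value pair dominates the weighted mass (20 of 28 units) even though it is only 2 of 10 test steps, which is the point of risk weighting: blind spots on economically significant decisions count more.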

Applications

The framework can be directly applied to enterprise procurement workflows to evaluate the reliability and oversight cost of agentic AI. By identifying under-supported areas in workflows, enterprises can optimize the reliability of autonomous decision-making, thereby improving AI deployment efficiency and economic benefits. Additionally, the framework can be applied to other engineering processes as long as operational event logs are available.

Limitations & Outlook

The framework's limitations include: the BPI log is observational rather than agent-generated, so counterfactual effects of arbitrary actions cannot be evaluated directly; the state representation uses a first-order Markov approximation, which may miss longer-range dependencies; and the risk-proxy weights, while reproducible, may require recalibration in other settings. Future research could extend the framework to richer state representations and counterfactual evaluation.

Plain Language: Accessible to non-experts

Imagine you work at a large company responsible for overseeing the procurement process. You need to ensure that every purchase order goes through the correct approval process and can be handled promptly when issues arise. However, as the company grows, manually handling these processes becomes increasingly difficult. So, you decide to introduce an intelligent system to help automate these processes.

This intelligent system is like a smart assistant that can predict the next action for each order based on historical data. It decides whether human intervention is needed based on different order types, amounts, and responsible parties. The core of this system is its ability to identify orders that lack sufficient support in historical data, prompting human assistance in these cases.

In this way, you can not only improve work efficiency but also reduce the risk of errors. This system is like your capable assistant, helping you navigate the busy work environment with ease. Even in the most complex situations, it can help you make informed decisions, ensuring that every order is handled properly.

ELI14: Explained like you're 14

Hey there! Imagine you're playing a super complex strategy game. The game has many missions, each with different steps, and you need to make the right decisions at each step. You have to manage resources, complete missions, and make sure you don't mess up.

Now, imagine you have a super smart assistant that can help you predict the next step for each mission. This assistant is like a game guide, telling you what to do next based on your previous game records. If some missions are too complex, it will remind you to handle them yourself.

The cool thing about this assistant is that it can identify which missions don't have enough support in the historical records, prompting your help in these cases. This way, you can make smarter decisions in the game, ensuring that every mission is completed smoothly.

So, this smart assistant is like your secret weapon, helping you win the game!

Glossary

Markov Framework

A mathematical model used to describe the transition process between different states in a system.

Used to evaluate the reliability and oversight cost of agentic AI.

State Blind-Spot Mass

Measures the proportion of deployment mass in low-support states.

Used to identify under-supported areas in workflows.

State-Action Blind Mass

Measures the support for choosing the next action in agentic systems.

Used to optimize the reliability of autonomous decision-making.

Entropy

A metric for measuring information uncertainty.

Used to assess the reliability of autonomous decision-making.

Human-in-the-Loop Escalation Gate

Establishes an escalation rule for human intervention by introducing entropy and risk weighting.

Used to optimize the reliability of autonomous decision-making.

Oversight Cost

The expected cost of human review and escalation incurred while supervising an agent, computed over the workflow visitation measure.

Used to evaluate the economic viability of agentic AI.

BPI 2019 Log

A dataset containing purchase-to-pay process event logs.

Used to validate the effectiveness of the Markov framework.

Agentic AI

An artificial intelligence system capable of making autonomous decisions.

Applied in enterprise workflows.

Large Language Model

A natural language processing model trained on large amounts of text data.

Used for decision support in agentic AI.

Support Deficiency

Lack of sufficient support samples in historical data.

Used to identify situations requiring human intervention.

Open Questions: Unanswered questions from this research

  • Open Question 1: How can the framework's generality and robustness be verified across domains? Current research focuses on procurement workflows, and applicability elsewhere remains unverified.
  • Open Question 2: How can the risk-weighting mechanism be optimized to improve the accuracy and economics of autonomous decision-making? The existing weights may need adjustment per application scenario.
  • Open Question 3: How can the state representation be extended to capture higher-order dependencies without an unmanageable growth in state space? The current first-order Markov approximation may miss longer-range structure.
  • Open Question 4: How can counterfactual effects of arbitrary actions be evaluated from observational logs? The current design cannot evaluate them directly.
  • Open Question 5: How can oversight cost be reduced without degrading autonomous decision accuracy? Under the current framework, the two may trade off against each other.

Applications

Immediate Applications

Enterprise Procurement Process Optimization

By identifying under-supported areas in workflows, enterprises can optimize the reliability of autonomous decision-making, thereby improving AI deployment efficiency and economic benefits.

Engineering Process Optimization

The framework can be applied to other engineering processes as long as operational event logs are available. By evaluating the reliability and oversight cost of agentic AI, enterprises can optimize process efficiency.

Risk Management

By introducing a risk-weighting mechanism, enterprises can more accurately assess the reliability of autonomous decision-making, thereby reducing the risk of errors.

Long-term Vision

Cross-Domain Applications

Future research could explore the framework's applicability in different domains and application scenarios to verify its generality and robustness.

Intelligent Decision Support Systems

By extending state representation and optimizing the risk-weighting mechanism, future systems could develop more intelligent decision support systems to improve the accuracy and economics of autonomous decision-making.

Abstract

Agentic artificial intelligence (AI) in organizations is a sequential decision problem constrained by reliability and oversight cost. When deterministic workflows are replaced by stochastic policies over actions and tool calls, the key question is not whether a next step appears plausible, but whether the resulting trajectory remains statistically supported, locally unambiguous, and economically governable. We develop a measure-theoretic Markov framework for this setting. The core quantities are state blind-spot mass B_n(tau), state-action blind mass B^SA_{pi,n}(tau), an entropy-based human-in-the-loop escalation gate, and an expected oversight-cost identity over the workflow visitation measure. We instantiate the framework on the Business Process Intelligence Challenge 2019 purchase-to-pay log (251,734 cases, 1,595,923 events, 42 distinct workflow actions) and construct a log-driven simulated agent from a chronological 80/20 split of the same process. The main empirical finding is that a large workflow can appear well supported at the state level while retaining substantial blind mass over next-step decisions: refining the operational state to include case context, economic magnitude, and actor class expands the state space from 42 to 668 and raises state-action blind mass from 0.0165 at tau=50 to 0.1253 at tau=1000. On the held-out split, m(s) = max_a pi-hat(a|s) tracks realized autonomous step accuracy within 3.4 percentage points on average. The same quantities that delimit statistically credible autonomy also determine expected oversight burden. The framework is demonstrated on a large-scale enterprise procurement workflow and is designed for direct application to engineering processes for which operational event logs are available.

