FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents

TL;DR

FinHarness, an inline lifecycle safety harness, cuts attack success rate (ASR) on FinVault from 38.3% to 15.0% while using 4.7× fewer advanced-judge calls than an always-advanced configuration.

cs.CL 2026-05-27
Haoxuan Jia, Yang Liu, Bin Chong, Yingguang Yang, Yancheng Chen, Jiayu Liang, Qian Li, Hanning Lu, Kefu Xu, Hao Zheng, Chongyang Zhang, Hao Peng, Philip S. Yu
financial safety, large language models, agent security, tool call monitoring, risk assessment

Key Findings

Methodology

FinHarness proposes an inline lifecycle safety harness framework tailored for financial LLM agents, comprising three core components: a Query Monitor that fuses single-turn intent signals with cross-turn drift to produce a session-level risk cumulant; a Tool Monitor that evaluates each prospective tool call based on permission tiers, parameter anomalies, business facts, and tool sequence priors; and a Cascade module that adaptively routes verification calls between a lightweight judge (gpt-4o-mini) and an advanced-tier judge (gpt-4o) using a sliding risk window. Triggered risk factors are dynamically injected as ex-ante evidence into the agent's input, enabling autonomous refusal, replanning, or approval decisions, thus closing the loop between safety signals and agent policy.

Key Results

  • FinHarness reduces Attack Success Rate (ASR) from 38.3% to 15.0% on the FinVault benchmark while benign approval falls only slightly, from 41.1% to 39.3%, a favorable safety-utility trade-off.
  • Compared to a baseline always invoking the advanced judge, FinHarness reduces advanced judge calls by a factor of 4.7 through intelligent risk-window-based routing, significantly lowering computational overhead.
  • Ablation studies reveal that the Query Monitor provides zero-cost early intent risk warnings, the Cascade module efficiently distributes judge calls, and the evidence injection mechanism increases agent self-rejection by 15.7 percentage points and active interception by 6.7 points, enhancing multi-step attack trajectory containment.

Significance

This work addresses the critical challenge of securing financial LLM agents executing irreversible multi-step workflows, overcoming limitations of traditional boundary filters and post-hoc auditing that either miss mid-trajectory risks or intervene too late. By embedding safety inline within the agent's execution lifecycle and enabling real-time risk assessment and feedback, FinHarness significantly improves the trustworthiness and operational safety of automated financial processes. This advancement provides a practical and scalable solution for deploying secure LLM agents in finance, with potential to influence broader AI safety paradigms in high-stakes domains.

Technical Contribution

FinHarness's technical novelty lies in its end-to-end inline architecture that integrates deterministic compliance priors over query and tool signals into a unified session-level risk cumulant. The sliding risk window and adaptive judge routing balance accuracy and computational cost, ensuring bounded per-step overhead. The dynamic injection of fired risk factors as structured evidence into the agent input enables a novel closed-loop interaction between safety monitoring and agent policy, empowering autonomous risk-aware decision-making. Additionally, the selective episodic memory mechanism recalls semantically and contextually relevant history steps to support informed judge evaluations, enhancing detection of complex multi-step attacks.

Novelty

FinHarness is the first framework to embed a comprehensive safety harness inline within the financial LLM agent's execution lifecycle, enabling real-time multi-step risk monitoring and intervention. Unlike prior approaches relying solely on boundary filtering or post-execution auditing, it fuses multi-turn intent drift and tool call risk into a session-level cumulant and dynamically routes judge invocations, achieving efficient and effective risk containment without collapsing legitimate approvals.

Limitations

  • FinHarness depends on a fixed set of rule heads and static parameters, limiting its adaptability to evolving attacker strategies and novel attack vectors.
  • The evaluation is constrained to single-run experiments on specific benchmarks, lacking robustness analysis across different model versions, prompt variations, and tool simulators, which may affect generalizability.
  • Certain single-step syntactic attacks perform better against lightweight judges, where FinHarness shows regression, indicating a need for integrating fast-reject paths to enhance defense coverage.

Future Work

Future directions include developing adaptive rule learning and dynamic risk assessment methods to counter emerging attacks, extending robustness evaluations across diverse models and environments, integrating fast rejection mechanisms to improve response latency, and generalizing the inline safety harness concept to other high-risk domains beyond finance.

AI Executive Summary

Large language models (LLMs) have become integral to automating complex financial workflows, but their deployment as autonomous agents introduces significant security challenges. Financial LLM agents must navigate multi-step business processes involving irreversible actions such as fund transfers and contract signings, making them vulnerable to prompt-injection attacks that can cause unauthorized operations. Traditional security mechanisms—boundary filters that operate only at conversation boundaries and post-hoc LLM judges auditing after task termination—are insufficient. Boundary filters lack visibility into mid-trajectory tool calls, allowing fragmented or obfuscated attacks to slip through, while post-hoc auditing intervenes too late and incurs computational costs that scale linearly with the length of interaction traces.

To address these challenges, FinHarness introduces an inline lifecycle safety harness that wraps the financial LLM agent end-to-end, embedding safety monitoring and intervention directly within the agent's execution loop. The framework comprises three main components: a Query Monitor that computes a session-level risk cumulant by fusing single-turn intent signals with cross-turn drift patterns; a Tool Monitor that evaluates each proposed tool call against deterministic compliance priors including permission tiers, parameter anomalies, business facts, and tool sequence anomalies; and a Cascade module that adaptively routes verification calls between a lightweight judge (gpt-4o-mini) and an advanced-tier judge (gpt-4o) based on a sliding risk window. Crucially, triggered risk factors are re-injected as structured ex-ante evidence into the agent's input, enabling the agent to autonomously refuse, re-plan, or approve actions, thus closing the feedback loop between safety signals and agent policy.
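As a rough illustration of the re-injection step, the sketch below formats fired risk factors into an evidence block placed ahead of the agent's input. The `RiskFactor` type, the `[RISK EVIDENCE]` tag layout, the function name, and the example head labels are assumptions made for illustration; the summary does not specify the actual prompt format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskFactor:
    """One fired rule head (hypothetical structure, not taken from the paper)."""
    head: str    # rule identifier, e.g. "Q4" or "H2" (labels assumed)
    source: str  # "query", "tool", or "cascade"
    note: str    # short human-readable reason


def inject_evidence(agent_input: str, fired: list[RiskFactor]) -> str:
    """Prepend fired risk factors to the agent input as ex-ante evidence."""
    if not fired:
        return agent_input  # nothing fired: leave the input untouched
    lines = [f"- [{f.source}:{f.head}] {f.note}" for f in fired]
    block = "[RISK EVIDENCE]\n" + "\n".join(lines) + "\n[/RISK EVIDENCE]"
    return block + "\n\n" + agent_input


fired = [
    RiskFactor("Q4", "query", "urgency/coercion wording in user turn"),
    RiskFactor("H2", "tool", "dangerous parameter match in proposed call"),
]
prompt = inject_evidence("User: please release the funds today.", fired)
print(prompt)
```

Because the evidence arrives before the agent acts, the agent can weigh it when choosing to refuse, re-plan, or approve, rather than being overruled after the fact.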

This inline approach offers several technical advantages. By maintaining a session-level risk cumulant with gravity decay, the system preserves awareness of structural risk signals across turns, preventing attackers from evading detection through intermittent benign behavior. The Cascade module's selective episodic memory recalls semantically and contextually relevant prior steps to inform judge decisions, enhancing detection of complex multi-step attacks. The dynamic injection of risk evidence empowers the agent with self-regulation capabilities, reducing reliance on external intervention.
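A minimal sketch of how such a cumulant could behave is given below. The summary does not define "gravity decay", so this example substitutes a simple geometric decay for the single-turn intent component and keeps the cross-turn drift component non-decaying. The class name, the decay factor, and the additive fusion are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class QueryCumulant:
    """Session-level risk state (illustrative; parameters are assumptions)."""
    decay: float = 0.5       # assumed per-turn decay of intent risk
    transient: float = 0.0   # decaying single-turn intent component
    structural: float = 0.0  # non-decaying cross-turn drift component

    def update(self, intent_score: float, drift_score: float) -> float:
        """Fold one user turn into the cumulant and return its new value."""
        self.transient = self.decay * self.transient + intent_score
        self.structural += drift_score  # drift evidence is never forgotten
        return self.value

    @property
    def value(self) -> float:
        """Fused risk, clipped to at most 1.0."""
        return min(1.0, self.transient + self.structural)


c = QueryCumulant()
# (intent_score, drift_score) per turn: a risky start, then benign turns.
turns = [(0.2, 0.0), (0.0, 0.3), (0.0, 0.0), (0.0, 0.0)]
history = [round(c.update(i, d), 3) for i, d in turns]
print(history)
```

In this toy run the intent component fades over the benign turns, but the cumulant never drops below the accumulated drift score, which is the property the paragraph above attributes to the design.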

Empirical evaluation on the FinVault benchmark, encompassing 214 cases across 31 financial scenarios and multiple attack families, demonstrates that FinHarness reduces the Attack Success Rate (ASR) from 38.3% to 15.0% while maintaining a benign approval rate close to baseline levels (41.1% to 39.3%). Compared to always invoking the advanced judge, FinHarness achieves a 4.7× reduction in advanced judge calls, significantly lowering computational overhead. Ablation studies confirm the importance of each component, with the Query Monitor providing zero-cost early warnings and the Cascade module effectively balancing safety and efficiency. Notably, external guardrail baselines achieve lower ASR but at the cost of drastically reduced benign approvals, highlighting FinHarness's superior safety-utility trade-off.

FinHarness represents a significant advancement in securing financial LLM agents by embedding safety inline within their execution lifecycle, enabling real-time multi-step risk assessment and autonomous agent self-regulation. This work not only enhances the trustworthiness of automated financial workflows but also sets a precedent for inline safety architectures in other high-stakes domains. Future work will focus on adaptive defenses, robustness across diverse environments, and extending the framework beyond finance to broader AI agent safety challenges.

Deep Analysis

Background

The rapid evolution of large language models (LLMs) has transformed the landscape of intelligent agents, enabling automated execution of complex workflows across various domains. In finance, LLM agents have been deployed to autonomously handle multi-step business processes such as loan approvals, fund transfers, and contract management. These agents interleave natural language reasoning with external tool invocations, as exemplified by frameworks like ReAct and Toolformer, which synergize reasoning and acting capabilities. However, the financial domain presents unique challenges due to the irreversible nature of many state-changing operations and stringent compliance requirements.


Existing safety mechanisms primarily rely on two paradigms: boundary filters that perform coarse allow/deny decisions at conversation boundaries, and post-hoc LLM judges that audit completed interaction traces. Boundary filters are lightweight and stateless but lack visibility into intermediate tool calls, making them vulnerable to sophisticated prompt-injection attacks that fragment malicious payloads across turns or embed them within retrieved documents. Post-hoc judges offer higher accuracy but intervene too late, often after irreversible actions have been executed, and their computational cost scales linearly with the length of the interaction trace, limiting scalability.


Recent benchmarks such as AgentDojo and Agent-SafetyBench have highlighted these vulnerabilities in tool-using agents, while domain-specific evaluations like FinVault have underscored the criticality of securing financial workflows. Runtime guardrail systems like GuardAgent and InferAct attempt to mitigate risks by integrating prompt-injection defenses, agent alignment, and code-level risk scanners, but often lack lifecycle integration or impose high computational overhead. Thus, there remains a pressing need for an inline, lifecycle-integrated safety mechanism that can monitor and intervene in real-time during agent execution, balancing security, efficiency, and business continuity.

Core Problem

The core problem addressed by FinHarness is the real-time detection and prevention of unauthorized, irreversible operations executed by financial LLM agents during multi-step workflows. Specifically, the challenge lies in blocking adversarial trajectories that exploit prompt-injection vulnerabilities to trigger illicit tool calls, such as unauthorized fund transfers or KYC data leaks, without impeding legitimate multi-turn business processes.


Traditional boundary filters fail to monitor the execution loop where tool calls occur, leaving a blind spot that attackers can exploit by fragmenting malicious instructions or hiding them within retrieved content. Post-execution auditing by LLM judges, while accurate, is inherently reactive and computationally expensive, often arriving too late to prevent damage. Moreover, the linear growth in audit complexity with trace length makes it impractical for long interactions.


Therefore, the problem requires designing a safety harness that operates inline within the agent's execution lifecycle, observing every intermediate state and tool invocation, dynamically assessing risk, and providing timely feedback to the agent. This harness must achieve high recall in blocking attacks, preserve benign approval rates to maintain business utility, and maintain bounded per-step computational cost independent of trajectory length. Balancing these competing objectives in a high-stakes financial environment is a significant technical bottleneck.

Innovation

FinHarness introduces several core innovations to address the outlined challenges:


1. Inline Lifecycle Architecture: Unlike prior approaches that treat safety as an external or post-hoc process, FinHarness embeds safety monitoring and intervention directly within the agent's execution loop. This inline positioning enables real-time observation and control over every tool call, closing the visibility gap present in boundary filters.


2. Multi-Dimensional Risk Cumulant: The framework fuses deterministic compliance priors over user queries and tool signals into a unified session-level risk cumulant. This cumulant integrates single-turn intent indicators (e.g., verb tiers, amount thresholds, coercion lexicons) with cross-turn drift signals (e.g., false references, debug modes, approval code anomalies), capturing both immediate and evolving risk patterns.


3. Adaptive Judge Routing via Sliding Risk Window: FinHarness employs a sliding window over recent risk scores to decide whether to route verification to a lightweight judge (gpt-4o-mini) or an advanced-tier judge (gpt-4o). This adaptive routing balances the trade-off between computational cost and verification accuracy, ensuring efficient resource utilization.


4. Dynamic Evidence Injection: Triggered risk factors are re-injected as structured ex-ante evidence into the agent's input across multiple contextual zones (user conversation, recalled history, current step). This mechanism empowers the agent to autonomously refuse, re-plan, or escalate actions, fostering a closed-loop interaction between safety signals and agent policy.


5. Selective Episodic Memory: To support informed judge decisions, the system recalls at most two prior steps selected based on semantic similarity, entity overlap, and information provider indicators, ensuring bounded memory usage while capturing relevant context.


Collectively, these innovations enable FinHarness to achieve high safety recall, maintain benign approval rates, and control computational overhead in complex financial workflows.
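The bounded recall in item 5 can be sketched as a top-k selection over prior steps. In the example below, token-set Jaccard overlap stands in for semantic similarity, and the `Step` type, the weights (0.5, 0.3, 0.2), and the function names are assumptions rather than details from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    text: str
    entities: set[str] = field(default_factory=set)
    is_info_provider: bool = False  # e.g. the step returned a record or document


def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap in [0, 1]; returns 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recall_steps(current: Step, history: list[Step], k: int = 2) -> list[Step]:
    """Return at most k prior steps ranked by a combined relevance score."""
    cur_tokens = set(current.text.lower().split())

    def score(step: Step) -> float:
        sim = jaccard(cur_tokens, set(step.text.lower().split()))
        ent = jaccard(current.entities, step.entities)
        return 0.5 * sim + 0.3 * ent + 0.2 * float(step.is_info_provider)

    return sorted(history, key=score, reverse=True)[:k]


history = [
    Step("greet customer"),
    Step("fetch account record for ACC-1", {"ACC-1"}, True),
    Step("check weather"),
    Step("verify transfer limit for ACC-1", {"ACC-1"}),
]
current = Step("transfer funds from ACC-1", {"ACC-1"})
recalled = recall_steps(current, history)
print([s.text for s in recalled])
```

Capping the result at k = 2 keeps the judge's context size fixed no matter how long the session grows.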

Methodology

FinHarness's methodology comprises the following key components and processes:


  • Query Monitor:
      • Input: User turn text at step k.
      • Process: Compute single-turn intent risk scores (Q1-Q5) based on verb tier, amount magnitude, risk-product flags, coercion/urgency lexicons, and injection lexicons.
      • Compute cross-turn drift scores (D1-D5) detecting false references, pseudo test modes, phantom approval codes, verb-tier jumps, and closing imperatives.
      • Fuse these into a session-level risk cumulant C_query with gravity decay to preserve structural drift signals.
      • Output: Advisory labels (UNSAFE, UNCERTAIN, SAFE, or dead zone) injected as evidence into agent input.

  • Tool Monitor:
      • Input: Proposed tool call at step t.
      • Process: Evaluate risk heads H1-H5 including permission tier prior, dangerous parameter matches, argument anomalies, business fact discrepancies, and tool sequence anomalies.
      • Fuse tool risk c_tool with latest query risk C_query to obtain step risk s_t.
      • Output: Step-level risk score and fired rule heads.

  • Cascade Module:
      • Input: Step risk scores over a sliding window of length W.
      • Process: Sum risk scores; if above threshold θ, route verification to advanced judge (gpt-4o), else to lightweight judge (gpt-4o-mini).
      • Recall up to two prior steps based on semantic similarity, entity overlap, and info-provider indicators for judge context.
      • Output: Judge verdict (UNSAFE, UNCERTAIN, SAFE) and fired evidence.

  • Dynamic Evidence Injection:
      • Fired rule heads from Query Monitor, Tool Monitor, and Cascade are dynamically injected into agent input at multiple contextual zones.
      • Enables agent to autonomously decide to refuse, re-plan, escalate, or approve.

  • Design Guarantees:
      • Bounded recalled history ensures per-step cost is independent of trajectory length.
      • Non-decaying structural memory prevents attackers from evading detection by intermittent benign behavior.

This architecture ensures real-time, efficient, and adaptive safety monitoring tightly coupled with agent decision-making.
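The Cascade routing rule described above (sum the last W step risks and compare against θ) can be sketched as follows. The fusion of c_tool and C_query via `max`, the default values of W and θ, and the class and method names are assumptions; the summary does not state the actual fusion function or parameter settings.

```python
from collections import deque


class Cascade:
    """Route each step to a judge tier from a sliding window of step risks."""

    def __init__(self, window: int = 3, theta: float = 1.0) -> None:
        self.theta = theta                  # assumed routing threshold
        self.risks = deque(maxlen=window)   # keeps only the last W step risks

    def route(self, c_tool: float, c_query: float) -> str:
        """Record this step's risk and pick the judge tier for it."""
        s_t = max(c_tool, c_query)  # assumed fusion rule; the paper's may differ
        self.risks.append(s_t)
        return "advanced" if sum(self.risks) > self.theta else "lightweight"


cascade = Cascade(window=3, theta=1.0)
# (c_tool, c_query) per step: risk spikes at the third step, then subsides.
steps = [(0.1, 0.2), (0.3, 0.2), (0.7, 0.4), (0.1, 0.1), (0.0, 0.1), (0.0, 0.1)]
routes = [cascade.route(t, q) for t, q in steps]
print(routes)
```

Since the deque holds at most W entries, the routing decision costs the same at step 5 as at step 500, which is one way to obtain the bounded per-step overhead listed under the design guarantees.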

Experiments

Experiments were conducted on the FinVault benchmark, which includes 214 cases (107 benign and 107 attack) spanning 31 financial scenarios such as personal credit, home mortgage, supply-chain finance, insurance, and securities. The attacks cover four broad families: parameter injection, pledge/margin manipulation, document fabrication, and privacy subject shifting. Additional evaluations were performed on an 856-trace all-attack synthesis set and the financial subset of Agent-SafetyBench (ASB) for out-of-distribution testing.


The agent backbone was gpt-4o-mini with native compliance prompts and tool annotations. Multiple baselines were compared, including no harness (B0), prompt-hardening (B1), per-step cheap and advanced judges with varying history lengths (B2-B4), off-the-shelf LlamaFirewall mappings (B5), and adapted GuardAgent and InferAct references (B6/B7).


Metrics included Attack Success Rate (ASR), benign approval rate, and judge call counts. Decoding was greedy with temperature zero. Ablations analyzed the contributions of Query Monitor, Cascade routing, and evidence injection. Statistical significance was assessed via McNemar tests.
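For readers who want to reproduce this style of evaluation, the sketch below computes a rate metric and a two-sided exact McNemar p-value from paired per-case outcomes. The summary does not say whether the authors used the exact or the chi-square variant, so the exact binomial form here is an assumption, and the data is a made-up toy example.

```python
from math import comb


def rate(flags: list[bool]) -> float:
    """Fraction of True entries, e.g. attack successes or benign approvals."""
    return sum(flags) / len(flags)


def mcnemar_exact(a: list[bool], b: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes."""
    only_a = sum(x and not y for x, y in zip(a, b))  # discordant: True in a only
    only_b = sum(y and not x for x, y in zip(a, b))  # discordant: True in b only
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2**n
    return min(1.0, 2 * tail)


# Toy paired outcomes on 10 attack cases (made-up data, not FinVault results).
# True means the attack succeeded under that configuration.
baseline = [True, True, True, True, True, True, False, False, False, False]
harness = [True, False, False, False, False, False, False, False, False, False]
p_value = mcnemar_exact(baseline, harness)
print(rate(baseline), rate(harness), p_value)
```

McNemar's test suits this setup because both configurations are run on the same cases, so only the cases where the two outcomes disagree carry information about the difference.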

Results

FinHarness's main configuration reduced ASR from 38.3% to 15.0% on FinVault while maintaining benign approval at 39.3%, closely matching advanced judge baselines. Compared to always invoking the advanced judge, it reduced advanced judge calls by 4.7×, significantly lowering computational cost. The Query Monitor provided zero-cost early warnings, firing on 53 of 107 attacks with 94.6% precision. The Cascade module routed 73% of hard-stops to the lightweight judge, reserving the advanced judge for residual cases. Adapted external guardrails achieved lower ASR (2.8% and 0.9%) but at the cost of collapsing benign approvals to 8.4%, demonstrating FinHarness's superior safety-utility balance. Per-attack-type analysis showed largest gains on emotional manipulation (-43.8 pp) and document fabrication (-25.0 pp), with some regression on direct injection attacks.
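One way to sanity-check the headline percentages is to convert them back to case counts. The sketch below assumes each rate is a fraction of the 107 attack cases or the 107 benign cases described in the experiments; the implied counts are derived by rounding and are not figures reported in the source.

```python
ATTACK_CASES = 107  # FinVault attack cases, per the experiment description
BENIGN_CASES = 107  # FinVault benign cases


def implied_count(percent: float, total: int) -> int:
    """Nearest whole-case count for a reported percentage of `total`."""
    return round(percent / 100 * total)


def as_percent(count: int, total: int) -> float:
    """Percentage rounded to one decimal, matching the reporting precision."""
    return round(100 * count / total, 1)


reported = {
    "baseline ASR": (38.3, ATTACK_CASES),
    "FinHarness ASR": (15.0, ATTACK_CASES),
    "baseline benign approval": (41.1, BENIGN_CASES),
    "FinHarness benign approval": (39.3, BENIGN_CASES),
}
for name, (pct, total) in reported.items():
    n = implied_count(pct, total)
    print(f"{name}: {pct}% ~ {n}/{total} (round-trips to {as_percent(n, total)}%)")
```

Under the same assumption, the Query Monitor's 53 correct fires at 94.6% precision would correspond to roughly 56 fires in total, since 53/56 rounds to 94.6%.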

Applications

FinHarness is directly applicable to financial automation scenarios requiring secure multi-step workflows, including personal loan processing, supply-chain financing, insurance claims, and securities trading. Its inline monitoring and adaptive judge routing enable real-time risk assessment with low latency, suitable for deployment in banking institutions and fintech platforms. By empowering agents with autonomous risk-aware decision-making, it reduces reliance on human intervention, enhancing operational efficiency and customer experience. The framework's modular design allows extension to other high-risk domains where LLM agents interact with irreversible tools, such as healthcare and legal services.

Limitations & Outlook

FinHarness relies on a fixed set of deterministic rule heads and static parameters, limiting adaptability to evolving attacker tactics and novel threats. The evaluation is limited to single-run experiments on specific benchmarks, lacking robustness assessments across diverse model versions, prompt templates, and tool simulators, which may affect generalizability. Additionally, certain single-step syntactic attacks perform better against lightweight judges, where FinHarness shows slight regression, indicating the need for integrating fast-reject mechanisms to complement the current architecture. These limitations highlight areas for future enhancement in dynamic learning and broader validation.

Abstract

Finance LLM agents must simultaneously block prompt-induced unauthorized actions and approve legitimate multi-step business workflows. However, boundary filters often miss irreversible mid-trajectory tool calls, while post-hoc LLM judges perform auditing only after termination, which is too late for intervention and comes at a computational cost that scales linearly with trace length. We present FinHarness, an inline safety harness that wraps a finance agent end-to-end with three components: a Query Monitor that fuses single-turn intent with cross-turn drift, a Tool Monitor that evaluates each prospective tool call, and a Cascade module that integrates per-step risk and adaptively routes verification between a lightweight and an advanced-tier LLM judge. Fired risk factors are re-injected into the agent input as ex-ante evidence, enabling the agent to refuse, re-plan, or approve on its own. On FinVault, routed FinHarness cuts ASR from 38.3% to 15.0% while largely preserving benign approval (41.1% → 39.3%), and uses 4.7× fewer advanced-judge calls than an always-advanced ablation.

References (10)

LlamaFirewall: An open source guardrail system for building secure AI agents
Sahana Chennabasappa, Cyrus Nikolaidis, D. Song et al.
2025

GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning
Zhen Xiang, Linzhi Zheng, Yanjie Li et al.
2024

SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment
Xixun Lin, Yang Liu, Yancheng Chen et al.
2026

Defending Against Indirect Prompt Injection Attacks With Spotlighting
Keegan Hines, Gary Lopez, M. Hall et al.
2024

Toolformer: Language Models Can Teach Themselves to Use Tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì et al.
2023

Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra et al.
2023

Agent-SafetyBench: Evaluating the Safety of LLM Agents
Zhexin Zhang, Shiyao Cui, Yida Lu et al.
2024

FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments
Zhi Yang, Runguo Li, Qiqi Qiang et al.
2026

IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents
Hengyu An, Jinghuai Zhang, Tianyu Du et al.
2025

StruQ: Defending Against Prompt Injection with Structured Queries
Sizhe Chen, Julien Piet, Chawin Sitawarin et al.
2024