RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

TL;DR

RubricsTree builds a hierarchical Boolean rubric system from expert-curated clinical criteria, enabling scalable, expert-aligned evaluation with over 100 atomic metrics and surpassing a large-scale evaluation baseline in expert alignment (ICC 0.876 vs. 0.291).

cs.CL 2026-06-17

Weizhi Zhang Zechen Li Hamid Palangi Ben Graef A. Ali Heydari Simon A. Lee Salman Rahman Ray Luo Zeinab Esmaeilpour Erik Schenck Chloe Zhang Yamin Li Menglian Zhou Philip S. Yu Daniel McDuff Lindsey Sunden Mark Malhotra Shwetak Patel Ahmed A. Metwally


medical AI, automatic evaluation, hierarchical taxonomy, expert alignment, scalability

Key Findings

Methodology

RubricsTree employs a hierarchical knowledge graph guided by clinical experts, which evolved from the insights of 4,000 real user queries and now includes over 100 verifiable Boolean metrics. Its core components are:
• A layered taxonomy that runs from macro capabilities down to micro verification points and is grounded in clinical literature;
• A context-aware adaptive routing mechanism that uses semantic relevance scores g(q, c, l_i) to activate only the relevant rubric subset, reducing noise;
• An auto-weighting scheme that distributes weights uniformly from parent to child nodes, so that micro-verifications aggregate consistently;
• A systematic meta-evaluation that uses ICC and Cohen’s κ to validate alignment with expert judgments, plus perturbation tests to check robustness.
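The auto-weighting and aggregation steps can be illustrated with a minimal sketch. It assumes a plain tree in which each parent splits its weight equally among its children and the final score is the weighted fraction of Boolean checks that pass. The class `RubricNode`, the functions `leaf_weights` and `aggregate`, the rubric names, and the renormalisation over judged leaves are all hypothetical choices for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class RubricNode:
    """A node in a hypothetical rubric tree; leaves are Boolean checks."""
    name: str
    children: list["RubricNode"] = field(default_factory=list)


def leaf_weights(node: RubricNode, weight: float = 1.0) -> dict[str, float]:
    """Split a node's weight equally among its children, down to the leaves.

    Leaf names are assumed to be unique across the tree.
    """
    if not node.children:
        return {node.name: weight}
    share = weight / len(node.children)
    weights: dict[str, float] = {}
    for child in node.children:
        weights.update(leaf_weights(child, share))
    return weights


def aggregate(weights: dict[str, float], verdicts: dict[str, bool]) -> float:
    """Weighted fraction of passed checks.

    Assumption: leaves without a verdict (not activated for this query)
    are dropped and the remaining weights are renormalised.
    """
    active = {k: w for k, w in weights.items() if k in verdicts}
    total = sum(active.values())
    if total == 0:
        return 0.0
    return sum(w for k, w in active.items() if verdicts[k]) / total


# Toy tree: two macro capabilities with made-up leaf rubrics.
tree = RubricNode("root", [
    RubricNode("medical_skills", [
        RubricNode("cites_guideline"),
        RubricNode("flags_red_flags"),
    ]),
    RubricNode("health_memory", [
        RubricNode("uses_recent_sleep_data"),
    ]),
])

weights = leaf_weights(tree)
verdicts = {
    "cites_guideline": True,
    "flags_red_flags": False,
    "uses_recent_sleep_data": True,
}
score = aggregate(weights, verdicts)
print(weights)
print(score)
```

In this toy tree the single leaf under `health_memory` carries weight 0.5, while each of the two leaves under `medical_skills` carries 0.25, so a sparsely populated branch is not diluted by a densely populated sibling.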

Key Results

RubricsTree achieves an overall ICC of 0.876 and Cohen’s κ of 0.787 in expert alignment, well above the baseline (ICC 0.291, κ 0.431). Agreement is high across all four clinical scenarios (health data, action plans, symptoms, and explanations), with κ exceeding 0.65 in each.
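For readers unfamiliar with Cohen’s κ, the following sketch computes the unweighted statistic for two raters: observed agreement corrected for the agreement expected by chance. The labels are toy values rather than data from the paper, and the source summary does not say which κ variant the authors used, so the unweighted form is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy pass/fail verdicts on eight responses (illustrative only).
expert = [1, 1, 0, 1, 0, 0, 1, 1]
judge = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(expert, judge)
print(kappa)
```

Here the raters agree on 7 of 8 items (0.875) and chance agreement is 0.5, giving κ = 0.75.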
In robustness tests that simulate degraded inputs (missing instructions, incomplete data, inappropriate prompts, or inaccurate signals), RubricsTree detects over 93% of corrupted responses, and the average penalty (ΔMP) stays positive, which supports its use as a safety and reliability check.
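A robustness check of this kind can be summarised with two numbers. This summary does not define ΔMP, so the sketch below makes its own assumptions: the penalty is the score of the clean response minus the score of its degraded counterpart, and a degradation counts as detected when that penalty is strictly positive. The function name `robustness_summary` and the scores are hypothetical.

```python
def robustness_summary(clean_scores: list[float],
                       degraded_scores: list[float]) -> tuple[float, float]:
    """Return (mean penalty, detection rate) for paired evaluation scores.

    Assumptions: penalty = clean - degraded; "detected" means penalty > 0.
    """
    if len(clean_scores) != len(degraded_scores) or not clean_scores:
        raise ValueError("need the same non-zero number of paired scores")
    penalties = [c - d for c, d in zip(clean_scores, degraded_scores)]
    mean_penalty = sum(penalties) / len(penalties)
    detection_rate = sum(p > 0 for p in penalties) / len(penalties)
    return mean_penalty, detection_rate


# Toy paired scores: the second degraded response is not penalised at all.
clean = [0.9, 0.8, 0.75, 1.0]
degraded = [0.5, 0.8, 0.25, 0.5]
mean_penalty, detection_rate = robustness_summary(clean, degraded)
print(mean_penalty, detection_rate)
```

With these toy values the mean penalty is about 0.35 and three of four degradations are detected.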
When used as structured instructions, text feedback, or training rewards, RubricsTree yields relative gains of up to ~66% on HealthBench for the Gemini, GPT, and Qwen model families, confirming its practical utility for continuous model optimization.

Significance

This work addresses the critical challenge of scalable, reliable, and expert-aligned evaluation of personal health AI systems. By integrating a hierarchical, verifiable rubric system with adaptive routing, it overcomes the limitations of static benchmarks and subjective auto-judges. The framework enables real-time, fine-grained assessment aligned with clinical standards, facilitating safer deployment and iterative improvement of health agents. Its ability to detect contextual degradation and improve model performance has profound implications for healthcare industry standards, regulatory approval, and global health equity, especially in resource-limited settings. Ultimately, RubricsTree paves the way for trustworthy, scalable, and continuously evolving AI-driven healthcare solutions.

Technical Contribution

The key technical innovations include:
• Development of a hierarchical, expert-guided rubric taxonomy grounded in clinical literature, enabling precise, verifiable evaluation metrics;
• Design of a semantic relevance-based adaptive routing mechanism that dynamically activates relevant rubric subsets, greatly reducing evaluation cost and noise;
• Implementation of an automatic, top-down weight distribution scheme that ensures micro-verifications contribute proportionally to overall scores without manual tuning;
• Introduction of a comprehensive meta-evaluation protocol that measures alignment with expert judgments, robustness to perturbations, and invariance across judge settings, establishing a rigorous validation framework.
These contributions collectively advance the state of the art in automated, expert-aligned evaluation for complex, open-ended AI tasks in healthcare.

Novelty

This work is the first to integrate a hierarchical, expert-verified Boolean rubric system with a dynamic, semantic relevance-based routing mechanism for large-scale health AI evaluation. Unlike prior static benchmarks or heuristic auto-judges, RubricsTree offers a structured, evolving, and clinically grounded evaluation framework that adapts to individual queries and ongoing knowledge updates. Its systematic meta-evaluation ensures high expert alignment and robustness, setting a new standard for trustworthy AI assessment in healthcare. This approach fundamentally shifts the paradigm from holistic, subjective scoring to transparent, atomic verification, enabling scalable, precise, and safe AI deployment.

Limitations

While highly effective, the system relies on continuous expert input for evolving the rubric taxonomy, which can be resource-intensive and may lag behind rapid advances in medical knowledge. Automating this process remains a challenge.
The framework primarily targets structured, verifiable clinical criteria; it may be less effective for highly subjective or emergent medical issues requiring nuanced judgment beyond binary verification.
In extremely complex or multi-modal scenarios, the current hierarchical Boolean approach might oversimplify certain assessments, necessitating integration with more advanced interpretability methods or probabilistic models.

Future Work

Future directions include automating knowledge graph updates via NLP and machine learning techniques, integrating multi-modal data (images, signals), and expanding the framework to handle more subjective or emergent medical judgments. Additionally, exploring real-time deployment in clinical workflows and regulatory validation will be crucial for industry adoption. Further research may also focus on combining this structured evaluation with explainability tools to enhance transparency and trustworthiness of AI health systems.

AI Executive Summary

The rapid proliferation of personal health data from wearable sensors and digital health records has catalyzed the development of intelligent personal health agents (PHAs). These AI-powered systems aim to provide real-time health insights, personalized recommendations, and multi-step reasoning, transforming healthcare from episodic treatment to continuous wellness management. However, evaluating the performance of such complex, open-ended systems remains a significant challenge.

Traditional evaluation methods, such as static multiple-choice benchmarks like MedQA and MedMCQA, are inadequate for assessing the nuanced, multi-turn responses required in real-world health scenarios. Expert annotation, while accurate, is prohibitively costly and unscalable. Conversely, existing auto-judging approaches suffer from inconsistency, subjectivity, and limited clinical alignment. This gap hampers the safe and effective deployment of personal health AI at scale.

Addressing this critical need, the authors introduce RubricsTree, a hierarchical, expert-guided evaluation framework that decomposes health responses into over 100 verifiable Boolean metrics. These metrics are organized into a layered taxonomy, from broad capabilities like medical skills and health memory down to atomic, clinically grounded verification points. The system employs a semantic relevance-based adaptive routing mechanism, which dynamically activates only the pertinent subset of rubrics for each user query, greatly improving efficiency and relevance.
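The routing idea can be sketched as a simple threshold filter: score each rubric against the query and keep only those above a cutoff. The scorer below is a deliberately crude stand-in (token overlap) for whatever semantic relevance model the authors use, and it ignores any user context. The function names, rubric descriptions, and threshold are hypothetical.

```python
def relevance(query: str, rubric_text: str) -> float:
    """Stand-in scorer: Jaccard overlap of lowercase tokens.

    This is NOT a semantic model; it only marks where one would plug in.
    """
    q = set(query.lower().split())
    r = set(rubric_text.lower().split())
    union = q | r
    return len(q & r) / len(union) if union else 0.0


def route(query: str, rubrics: dict[str, str], threshold: float = 0.1) -> list[str]:
    """Return the names of rubrics whose relevance meets the threshold."""
    return [name for name, text in rubrics.items()
            if relevance(query, text) >= threshold]


# Made-up rubric descriptions keyed by rubric name.
rubrics = {
    "sleep_data": "uses sleep score data from last night",
    "medication": "checks drug interaction and dosage",
}
active = route("why was my sleep score low last night", rubrics)
print(active)
```

For this query only `sleep_data` is activated, so the medication rubric is never judged and cannot add noise to the score.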

A key innovation is the automatic, top-down weight distribution scheme that aggregates micro-verifications into a comprehensive evaluation score, ensuring clinical reliability without manual tuning. The framework also incorporates a systematic meta-evaluation protocol, measuring alignment with expert judgments using ICC and Cohen’s κ, and robustness under simulated perturbations such as missing or incorrect data. Experimental results show that RubricsTree achieves an ICC of 0.876 and κ of 0.787, well above the baseline's ICC of 0.291 and κ of 0.431.

Furthermore, when used as structured instructions, text feedback, or training rewards, RubricsTree yields relative gains of up to ~66% on HealthBench for the Gemini, GPT, and Qwen model families, demonstrating its practical utility for model optimization and safety assurance. Its detection rate of over 93% on degraded inputs across the tested perturbations underscores its robustness.

This work marks a substantial step forward in scalable, expert-aligned evaluation of health AI, enabling continuous, safe, and effective deployment. By bridging clinical rigor with automation, RubricsTree addresses longstanding challenges in healthcare AI assessment, paving the way for trustworthy, scalable, and evolving personal health systems that can serve diverse populations worldwide. Future work will focus on automating knowledge updates, incorporating multi-modal data, and expanding to subjective or emergent medical judgments, further strengthening its role as a foundational evaluation infrastructure for the industry.

Deep Dive

Abstract

The LLM-empowered personal health agents with user health (sensor) metrics have offered a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically-verifiable Boolean rubrics, evolving from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expertise panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides a scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.

cs.CL cs.AI
