Developing and evaluating a chatbot to support maternal health care

TL;DR

Developed a chatbot for maternal health in India using stage-aware triage and hybrid retrieval, achieving 86.7% emergency recall.

cs.AI, Advanced, 2026-03-14

Smriti Jha, Vidhi Jain, Jianyu Xu, Grace Liu, Sowmya Ramesh, Jitender Nagpal, Gretchen Chapman, Benjamin Bellows, Siddhartha Goyal, Aarti Singh, Bryan Wilder


maternal health chatbot, hybrid retrieval, multilingual processing, high-risk triage

Key Findings

Methodology

The study developed a chatbot system for maternal health in India that integrates stage-aware triage, hybrid retrieval over curated guidelines, and evidence-conditioned generation from an LLM. Safety in high-risk scenarios is addressed through layered safeguards and a multi-method evaluation workflow. Specific methods include:
• Stage-aware triage: routing high-risk queries to expert templates.
• Hybrid retrieval: retrieving over curated maternal/newborn guidelines.
• Evidence-conditioned generation: generating answers with a large language model (LLM) conditioned on the retrieved evidence.
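The stage-aware triage routing described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the stage names, the red-flag phrases, the template wording, and the use of plain substring matching are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical red-flag phrases per maternal stage. A real taxonomy would be
# authored by clinicians and would be far larger.
RED_FLAGS = {
    "pregnancy": ["heavy bleeding", "severe headache", "baby not moving"],
    "postpartum": ["heavy bleeding", "foul discharge"],
    "newborn": ["not feeding", "difficulty breathing"],
}

# Hypothetical expert-written templates, one per stage.
EXPERT_TEMPLATES = {
    "pregnancy": "This may be an emergency. Please go to the nearest health facility now.",
    "postpartum": "This may be an emergency after delivery. Please seek care immediately.",
    "newborn": "This may be an emergency for your baby. Please take the baby to a facility now.",
}


@dataclass
class TriageDecision:
    stage: str
    emergency: bool
    matched_flag: Optional[str]
    response_template: Optional[str]


def triage(query: str, stage: str) -> TriageDecision:
    """Flag a query as an emergency if it contains a red flag for its stage."""
    text = query.lower()
    for flag in RED_FLAGS.get(stage, []):
        if flag in text:
            return TriageDecision(stage, True, flag, EXPERT_TEMPLATES[stage])
    # No red flag matched: the query continues to retrieval and generation.
    return TriageDecision(stage, False, None, None)


decision = triage("My baby is not feeding since last night", "newborn")
print(decision.emergency, decision.matched_flag)
```

The same phrase can route differently depending on the stage, which is the point of making triage stage-aware: "not feeding" is a red flag in the hypothetical newborn list but not in the pregnancy list.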

Key Results

Result 1: On the labeled triage benchmark (N=150), the system achieved 86.7% emergency recall, and the missed-emergency vs. over-escalation trade-off is reported explicitly.
Result 2: Retrieval was evaluated on a synthetic multi-evidence benchmark (N=100) annotated with chunk-level evidence labels.
Result 3: End-to-end answers were compared by an LLM judge on real queries (N=781) using clinician-codesigned criteria, complemented by expert validation.
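To make the missed-emergency vs. over-escalation trade-off concrete, the sketch below computes both quantities from gold and predicted triage labels. The function name and the ten toy labels are hypothetical; they are not the benchmark data and do not reproduce the 86.7% figure.

```python
def triage_metrics(gold, pred):
    """Compute emergency recall and over-escalation rate.

    gold and pred are parallel lists of booleans, where True means "emergency".
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g and p for g, p in zip(gold, pred))          # emergencies caught
    fn = sum(g and not p for g, p in zip(gold, pred))      # emergencies missed
    fp = sum(not g and p for g, p in zip(gold, pred))      # non-emergencies escalated
    tn = sum(not g and not p for g, p in zip(gold, pred))  # non-emergencies passed through
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    over_escalation = fp / (fp + tn) if (fp + tn) else float("nan")
    return {
        "emergency_recall": recall,
        "missed_emergencies": fn,
        "over_escalation_rate": over_escalation,
        "over_escalated": fp,
    }


# Toy labels: 4 true emergencies followed by 6 non-emergencies.
gold = [True, True, True, True, False, False, False, False, False, False]
pred = [True, True, True, False, True, False, False, False, False, False]
print(triage_metrics(gold, pred))
```

With these toy labels, three of four emergencies are caught (recall 0.75) and one of six non-emergencies is escalated. Making the triage rule more aggressive typically raises the first number at the cost of the second.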

Significance

The study matters for building trustworthy medical assistants in multilingual, noisy settings. Its central argument is that defense-in-depth design paired with multi-method evaluation, rather than any single model or evaluation method, is what makes such assistants dependable. Beyond the research contribution, the approach is relevant to practical deployments in low-resource areas, where it could improve access to maternal health information.

Technical Contribution

Technical contributions include:
• Introducing a stage-aware triage mechanism that evaluates and routes risks based on different maternal stages.
• Developing a hybrid retrieval system that combines sparse and dense retrieval techniques to improve retrieval accuracy and coverage.
• Designing a multi-layered evaluation strategy for high-stakes deployment under limited expert supervision.
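One common way to combine a sparse ranking and a dense ranking is reciprocal rank fusion (RRF). The sketch below is an illustration under that assumption: this summary does not state which fusion rule the system uses, and the chunk identifiers and the constant k=60 (the value conventionally used with RRF) are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by chunk id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


sparse_ranking = ["c1", "c2", "c3"]  # e.g. from keyword matching
dense_ranking = ["c3", "c1", "c4"]   # e.g. from embedding similarity
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

Chunks that both retrievers rank highly (c1 and c3 here) rise to the top, while chunks found by only one retriever are kept but ranked lower. RRF uses only ranks, so the two retrievers' raw scores do not need to be on the same scale.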

Novelty

The study combines stage-aware triage, hybrid retrieval, and evidence-conditioned generation in the maternal health domain, and pairs them with a multi-method evaluation workflow that the authors describe as their core contribution. The design targets short, underspecified, code-mixed queries in multilingual, noisy settings and grounds generated answers in retrieved evidence.

Limitations

Limitation 1: The system may be limited in handling very complex medical issues as it relies on predefined templates and retrieved evidence.
Limitation 2: The system's applicability may be limited in other languages or regions due to its reliance on specific language models and retrieval mechanisms.
Limitation 3: In some cases, the system may not fully replace human expert judgment, especially in complex medical decision-making.

Future Work

Future work could include:
• Expanding the system to support more languages and regions, increasing its applicability.
• Further optimizing retrieval and generation mechanisms to enhance the handling of complex queries.
• Conducting larger-scale real-world testing to validate the system's performance in different scenarios.

AI Executive Summary

Providing effective medical care during pregnancy remains a key challenge for global public health. Despite progress towards ensuring antenatal care access, many pregnant women still lack access to medical information and expert care. To address this issue, researchers have developed a novel chatbot system aimed at providing reliable health information for maternal health in India.

The system, developed through a collaboration between academic researchers, a health tech company, a public health nonprofit, and a hospital, integrates stage-aware triage, hybrid retrieval, and evidence-conditioned generation. Stage-aware triage identifies high-risk queries and routes them to expert templates, ensuring appropriate guidance in emergencies. Hybrid retrieval retrieves information over curated maternal/newborn guidelines, combining sparse and dense retrieval techniques to improve accuracy and coverage.

In experiments, the system achieved 86.7% emergency recall on the labeled triage benchmark (N=150), with the missed-emergency vs. over-escalation trade-off reported explicitly. Retrieval was assessed on a synthetic multi-evidence benchmark (N=100), and end-to-end answers were assessed through an LLM-as-judge comparison on real queries (N=781) and expert validation. This multi-layered evaluation strategy is designed for high-stakes deployment under limited expert supervision.

This study is significant for academia and offers new possibilities for practical applications. Especially in low-resource areas, the system can enhance access to maternal health information, facilitating early detection of high-risk pregnancies and adoption of health-supporting behaviors.

However, the system may be limited in handling very complex medical issues as it relies on predefined templates and retrieved evidence. Furthermore, its applicability may be limited in other languages or regions due to its reliance on specific language models and retrieval mechanisms. Future work could include expanding the system to support more languages and regions, further optimizing retrieval and generation mechanisms, and conducting larger-scale real-world testing.

Deep Analysis

Background

Maternal health is a critical area of global public health. Despite progress in ensuring antenatal care access, many pregnant women still lack access to medical information and expert care. In recent years, researchers have begun exploring the potential of large language models (LLMs) for delivering health information. However, bridging the gap between prototypes and deployable systems remains challenging, especially in low-resource, multilingual environments. Existing rule-based chatbots often handle complex medical issues poorly, while general-purpose LLMs, despite their fluency, are not tailored to specific domains.

Core Problem

In low-resource environments, users generally have low health literacy and limited access to medical information. User queries are often short, underspecified, and code-mixed across languages. Answering these queries requires regional context-specific grounding, and partial or missing symptom context makes safe routing decisions difficult. Existing systems often perform poorly in handling these complex issues, failing to provide reliable health information and guidance.

Innovation

The core innovations of this study include:
• Stage-aware triage: Evaluating and routing risks based on different maternal stages to ensure appropriate guidance in emergencies.
• Hybrid retrieval: Combining sparse and dense retrieval techniques to improve retrieval accuracy and coverage.
• Evidence-conditioned generation: Generating answers with a large language model conditioned on retrieved evidence.
• Multi-layered evaluation strategy: Supporting high-stakes deployment under limited expert supervision through component-level and end-to-end testing.

Methodology

• Stage-aware triage: Using a structured taxonomy to evaluate and route risks based on different maternal stages.
• Hybrid retrieval: Retrieving over curated maternal/newborn guidelines, combining sparse and dense retrieval techniques.
• Evidence-conditioned generation: Generating answers with a large language model conditioned on retrieved evidence.
• Multi-layered evaluation strategy: Supporting high-stakes deployment under limited expert supervision through component-level and end-to-end testing.
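The components above can be read as layers of one request path. The sketch below shows one possible orchestration with the three components injected as functions. The function names, the return shape, and the fallback message used when retrieval finds nothing are assumptions for illustration, not details from the paper.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical wording for the case where no evidence is retrieved.
FALLBACK = "I could not find guidance on this. Please consult a health worker."


def answer_query(
    query: str,
    stage: str,
    triage_fn: Callable[[str, str], Optional[str]],
    retrieve_fn: Callable[[str], List[str]],
    generate_fn: Callable[[str, List[str]], str],
) -> Dict[str, object]:
    """Route a query through triage, retrieval, and generation in that order."""
    # Layer 1: triage returns an expert template for high-risk queries.
    template = triage_fn(query, stage)
    if template is not None:
        return {"route": "expert_template", "answer": template, "evidence": []}
    # Layer 2: retrieve guideline chunks; refuse to generate without evidence.
    evidence = retrieve_fn(query)
    if not evidence:
        return {"route": "fallback", "answer": FALLBACK, "evidence": []}
    # Layer 3: generate an answer conditioned on the retrieved evidence.
    return {
        "route": "generated",
        "answer": generate_fn(query, evidence),
        "evidence": evidence,
    }


# Stub components standing in for the real triage, retriever, and LLM.
def stub_triage(query, stage):
    return "Please go to a facility now." if "bleeding" in query else None


def stub_retrieve(query):
    return ["Take iron tablets daily."] if "iron" in query else []


def stub_generate(query, evidence):
    return f"Based on {len(evidence)} guideline chunk(s): {evidence[0]}"


print(answer_query("heavy bleeding", "pregnancy", stub_triage, stub_retrieve, stub_generate)["route"])
print(answer_query("when to take iron", "pregnancy", stub_triage, stub_retrieve, stub_generate)["route"])
```

Because triage runs first, a high-risk query never reaches the generator in this sketch, which is one way to read the defense-in-depth idea.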

Experiments

The experimental design includes three main components:
• Labeled triage benchmark: Measuring emergency recall on 150 labeled samples.
• Synthetic multi-evidence retrieval benchmark: Evaluating retrieval against chunk-level evidence labels on 100 samples.
• LLM-as-judge comparison on real queries: Comparing answers on 781 real queries using clinician-codesigned criteria, complemented by expert validation.
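For the retrieval benchmark, chunk-level evidence labels make it possible to score a ranked list per query. The sketch below shows two plausible metrics for multi-evidence questions. The metric choices and the chunk identifiers are assumptions, since this summary does not say which retrieval metrics the paper reports.

```python
def evidence_recall_at_k(retrieved, gold, k):
    """Fraction of gold evidence chunks that appear in the top-k results."""
    gold_set = set(gold)
    if not gold_set:
        raise ValueError("each benchmark query needs at least one gold chunk")
    return len(gold_set & set(retrieved[:k])) / len(gold_set)


def all_evidence_hit_at_k(retrieved, gold, k):
    """True only if every gold evidence chunk appears in the top-k results."""
    return set(gold) <= set(retrieved[:k])


# A question whose answer needs two guideline chunks (hypothetical ids).
retrieved = ["g7", "g2", "g9", "g4"]
gold = ["g2", "g4"]

print(evidence_recall_at_k(retrieved, gold, k=3))   # 0.5
print(all_evidence_hit_at_k(retrieved, gold, k=3))  # False
print(all_evidence_hit_at_k(retrieved, gold, k=4))  # True
```

The stricter all-evidence metric matters for multi-evidence questions, where retrieving only some of the needed chunks can leave the generator with an incomplete picture.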

Results

The system achieved 86.7% emergency recall on the labeled triage benchmark (N=150), and the missed-emergency vs. over-escalation trade-off is reported explicitly. Retrieval was evaluated on the synthetic multi-evidence benchmark (N=100) using chunk-level evidence labels. End-to-end answers were compared by an LLM judge on real queries (N=781) and checked through expert validation. The abstract gives no headline figures for the retrieval, LLM-as-judge, or expert-validation evaluations.

Applications

The system can be directly applied to maternal health information services in India, helping to improve the accessibility and accuracy of health information. By providing reliable health information and guidance, the system can facilitate early detection of high-risk pregnancies and adoption of health-supporting behaviors. The design could also be adapted to other low-resource, multilingual environments, although doing so would require support for additional languages and regions.

Limitations & Outlook

Despite the system's good performance in experiments, it may be limited in handling very complex medical issues as it relies on predefined templates and retrieved evidence. Furthermore, its applicability may be limited in other languages or regions due to its reliance on specific language models and retrieval mechanisms. Future work could include expanding the system to support more languages and regions, further optimizing retrieval and generation mechanisms, and conducting larger-scale real-world testing.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen cooking a meal. You have a recipe (like our health guidelines), but you need to adjust the method based on different ingredients (like different maternal stages). Our chatbot is like a smart assistant that not only helps you find the key steps in the recipe but also gives advice based on the ingredients you have. For example, if you find some ingredients in the fridge that are about to expire, it will remind you to use those first, just like our system prioritizes high-risk health issues. This assistant can also explain each step in simple language, ensuring that the dish you make is both delicious and safe.

ELI14 (explained like you're 14)

Hey there, imagine you're playing a super cool game called 'Maternal Health Guardian.' In this game, you need to help a pregnant mom get the health information she needs. She sends you messages, but sometimes they're short and in different languages. Your task is to find the best answer, like finding hidden treasures in the game!

To complete the task, you have a super assistant that helps you find the most important information from various guides. This assistant is like an NPC (non-player character) in the game, telling you what to do next and making sure you don't miss any important clues.

Sometimes, the pregnant mom asks really urgent questions, like she's not feeling well. That's when your assistant alerts you that this question is important and needs immediate attention, just like an emergency mission in the game!

Through this game, you not only help the mom get the information she needs but also learn a lot about health. Isn't that cool?

Glossary

Stage-aware triage

A mechanism that evaluates and routes risks based on different maternal stages, ensuring appropriate guidance in emergencies.

Used to identify high-risk queries and route them to expert templates.

Hybrid retrieval

Combines sparse and dense retrieval techniques to improve retrieval accuracy and coverage.

Retrieves over curated maternal/newborn guidelines.

Evidence-conditioned generation

Uses a large language model for generation, ensuring answers are based on reliable evidence.

Generates answers based on retrieved evidence.
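Evidence conditioning is often implemented by placing the retrieved chunks in the prompt and instructing the model to answer only from them. The sketch below builds such a prompt. The instruction wording, the chunk fields (`source`, `text`), and the `max_chunks` cap are hypothetical, and no LLM is called.

```python
def build_evidence_prompt(question, chunks, max_chunks=3):
    """Build a prompt that asks the model to answer only from numbered evidence.

    Each chunk is a dict with hypothetical "source" and "text" fields.
    """
    if not chunks:
        raise ValueError("evidence-conditioned generation needs at least one chunk")
    evidence_lines = [
        f"[{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks[:max_chunks], start=1)
    ]
    return (
        "Answer the question using only the evidence below. "
        "Cite evidence numbers like [1]. "
        "If the evidence does not cover the question, say so.\n\n"
        "Evidence:\n" + "\n".join(evidence_lines) + "\n\n"
        f"Question: {question}\nAnswer:"
    )


chunks = [
    {"source": "Antenatal guideline", "text": "Take iron and folic acid tablets daily."},
    {"source": "Nutrition guideline", "text": "Eat green leafy vegetables regularly."},
]
print(build_evidence_prompt("How can I prevent anemia in pregnancy?", chunks))
```

Numbering the chunks lets the generated answer point back to specific evidence, which makes later checks of whether the answer is supported easier to perform.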

LLM-as-judge

A method in which a large language model acts as the evaluator, scoring or comparing generated answers against predefined criteria.

Used for comparison on real queries with clinician-designed criteria.
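A pairwise LLM-as-judge comparison can be sketched as follows. The judge here is a deterministic stub so the example runs without a model, and the order-swapping step is a common precaution against position bias that is assumed for illustration rather than taken from the paper. The criteria strings are placeholders, not the clinician-codesigned criteria.

```python
from collections import Counter


def compare_pair(judge, question, answer_a, answer_b, criteria):
    """Compare two answers with a judge, asking twice with the order swapped.

    judge(question, first, second, criteria) must return "first", "second",
    or "tie". A system wins only if it is preferred in both orders.
    """
    verdict_1 = {"first": "A", "second": "B", "tie": "tie"}[
        judge(question, answer_a, answer_b, criteria)
    ]
    verdict_2 = {"first": "B", "second": "A", "tie": "tie"}[
        judge(question, answer_b, answer_a, criteria)
    ]
    return verdict_1 if verdict_1 == verdict_2 else "tie"


def length_judge(question, first, second, criteria):
    """Stub judge that prefers the longer answer. A real judge would call an LLM."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


def always_first_judge(question, first, second, criteria):
    """Stub judge with pure position bias, to show why the swap matters."""
    return "first"


criteria = ["placeholder criterion 1", "placeholder criterion 2"]
pairs = [("Rest.", "Rest and drink fluids."), ("See a doctor today.", "Wait.")]
verdicts = Counter(
    compare_pair(length_judge, "toy question", a, b, criteria) for a, b in pairs
)
print(verdicts)
print(compare_pair(always_first_judge, "toy question", "x", "y", criteria))  # tie
```

A judge that always prefers whichever answer it sees first produces contradictory verdicts across the two orders, so the comparison collapses to a tie instead of crediting either system.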

Emergency recall rate

The proportion of true emergency queries that the system correctly flags as emergencies, usually expressed as a percentage.

Achieved 86.7% in the labeled triage benchmark.

Multi-layered evaluation strategy

An evaluation approach that combines component-level and end-to-end tests to support high-stakes deployment under limited expert supervision.

Used to evaluate the system's performance in different scenarios.

Sparse retrieval

A retrieval technique based on keyword matching, typically used for precise matches.

Combined with dense retrieval in hybrid retrieval.
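BM25 is a widely used sparse scoring function and serves here as an example of keyword-based retrieval. Whether the system uses BM25 specifically is not stated in this summary. The whitespace tokenizer, the parameter values k1=1.5 and b=0.75, and the three toy documents are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    if not docs:
        raise ValueError("docs must not be empty")
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "iron tablets during pregnancy",
    "newborn feeding schedule",
    "danger signs in pregnancy",
]
scores = bm25_scores("newborn feeding", docs)
print(scores.index(max(scores)))  # 1
```

Only documents that share a literal token with the query receive a non-zero score, which is why sparse retrieval is precise on exact terms but can miss paraphrased or code-mixed queries.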

Dense retrieval

A retrieval technique based on semantic similarity, capable of handling queries in different languages and expressions.

Combined with sparse retrieval in hybrid retrieval.
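Dense retrieval ranks chunks by the similarity between a query embedding and chunk embeddings. The sketch below uses cosine similarity over hand-written two-dimensional vectors. In practice the vectors would come from an embedding model, and both the vectors and the chunk ids here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dense_rank(query_vec, chunk_vecs, top_k=3):
    """Return the ids of the top_k chunks most similar to the query vector."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:top_k]]


# Toy embeddings standing in for the output of an embedding model.
chunk_vecs = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [0.5, 0.5]}
print(dense_rank([1.0, 0.0], chunk_vecs))  # ['a', 'c', 'b']
```

Because similarity is computed in embedding space rather than over literal tokens, a query can match a chunk even when the two share no words.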

Chunk-level evidence labels

Labels used to mark whether retrieved evidence chunks are directly related to the question.

Used in the synthetic multi-evidence retrieval benchmark.

Expert templates

Predefined response templates used to handle high-risk or emergency queries.

Used in stage-aware triage for routing high-risk queries.

Open Questions (unanswered questions from this research)

1. In low-resource environments, effectively handling short queries in multilingual and noisy settings remains a challenge. Existing systems often handle such queries poorly and fail to provide reliable health information and guidance. Future research needs to explore more effective retrieval and generation mechanisms to improve system applicability and accuracy.
2. Despite the system's good performance in experiments, it may be limited in handling very complex medical issues as it relies on predefined templates and retrieved evidence. Future research needs to explore more flexible generation mechanisms to improve system performance in complex scenarios.
3. The system's applicability may be limited in other languages or regions due to its reliance on specific language models and retrieval mechanisms. Future research needs to explore more general solutions to improve the system's cross-language and cross-region applicability.
4. The system does not replace human expert judgment, especially in complex medical decision-making. Future research needs to explore how to better integrate human expert knowledge with the system's automation capabilities to improve system reliability.
5. How to deploy such systems in high-stakes settings under limited expert supervision remains a challenge. Future research needs to explore more effective evaluation strategies to ensure system safety and reliability in different scenarios.

Applications

Immediate Applications

Maternal health information services in India

The system can be directly applied to maternal health information services in India, helping to improve the accessibility and accuracy of health information.

Health information services in low-resource environments

The design could be adapted to other low-resource, multilingual environments to provide reliable health information and guidance, although doing so would require support for additional languages and regions.

Early detection of high-risk pregnancies

By providing reliable health information and guidance, the system can facilitate early detection of high-risk pregnancies and adoption of health-supporting behaviors.

Long-term Vision

Cross-language and cross-region applicability

In the future, the system can be expanded to support more languages and regions, increasing its applicability and impact.

Handling complex medical issues

In the future, retrieval and generation mechanisms can be further optimized to improve the handling of complex queries.

Abstract

The ability to provide trustworthy maternal health information using phone-based chatbots can have a significant impact, particularly in low-resource settings where users have low health literacy and limited access to care. However, deploying such systems is technically challenging: user queries are short, underspecified, and code-mixed across languages, answers require regional context-specific grounding, and partial or missing symptom context makes safe routing decisions difficult. We present a chatbot for maternal health in India developed through a partnership between academic researchers, a health tech company, a public health nonprofit, and a hospital. The system combines (1) stage-aware triage, routing high-risk queries to expert templates, (2) hybrid retrieval over curated maternal/newborn guidelines, and (3) evidence-conditioned generation from an LLM. Our core contribution is an evaluation workflow for high-stakes deployment under limited expert supervision. Targeting both component-level and end-to-end testing, we introduce: (i) a labeled triage benchmark (N=150) achieving 86.7% emergency recall, explicitly reporting the missed-emergency vs. over-escalation trade-off; (ii) a synthetic multi-evidence retrieval benchmark (N=100) with chunk-level evidence labels; (iii) LLM-as-judge comparison on real queries (N=781) using clinician-codesigned criteria; and (iv) expert validation. Our findings show that trustworthy medical assistants in multilingual, noisy settings require defense-in-depth design paired with multi-method evaluation, rather than any single model and evaluation method choice.

cs.AI cs.CL cs.IR

