Developing and evaluating a chatbot to support maternal health care
Developed a chatbot for maternal health in India using stage-aware triage and hybrid retrieval, achieving 86.7% emergency recall.
Key Findings
Methodology
The study developed a chatbot system for maternal health in India that integrates stage-aware triage, hybrid retrieval over curated guidelines, and evidence-conditioned generation from an LLM. Safety in high-risk scenarios is assessed through a multi-layered evaluation workflow. Specific methods include:
- Stage-aware triage: routing high-risk queries to expert templates.
- Hybrid retrieval: retrieving over curated maternal/newborn guidelines.
- Evidence-conditioned generation: using a large language model (LLM) to generate answers from the retrieved evidence.
Key Results
- Result 1: In the labeled triage benchmark (N=150), the system achieved 86.7% emergency recall, explicitly reporting the missed-emergency vs. over-escalation trade-off.
- Result 2: In the synthetic multi-evidence retrieval benchmark (N=100), the system was evaluated using chunk-level evidence labels.
- Result 3: On real user queries (N=781), responses were compared using an LLM-as-judge with clinician-codesigned criteria, complemented by expert validation.
Significance
The study matters for building trustworthy medical assistants in multilingual, noisy settings. Its central finding is that such assistants need a defense-in-depth design paired with multi-method evaluation, rather than reliance on any single model or evaluation method. Beyond its academic contribution, the approach is relevant to practical deployment in low-resource areas, where it could improve access to maternal health information.
Technical Contribution
Technical contributions include:
- Introducing a stage-aware triage mechanism that evaluates and routes risks based on different maternal stages.
- Developing a hybrid retrieval system that combines sparse and dense retrieval techniques to improve retrieval accuracy and coverage.
- Designing a multi-layered evaluation strategy for high-risk deployment under limited expert supervision.
Novelty
This study is the first to combine stage-aware triage and hybrid retrieval techniques in the maternal health domain, providing a novel solution. Compared to existing work, this system better handles short queries in multilingual, noisy environments and provides evidence-based generation.
Limitations
- Limitation 1: The system may be limited in handling very complex medical issues as it relies on predefined templates and retrieved evidence.
- Limitation 2: The system's applicability may be limited in other languages or regions due to its reliance on specific language models and retrieval mechanisms.
- Limitation 3: In some cases, the system may not fully replace human expert judgment, especially in complex medical decision-making.
Future Work
Future work could include:
- Expanding the system to support more languages and regions, increasing its applicability.
- Further optimizing retrieval and generation mechanisms to enhance the handling of complex queries.
- Conducting larger-scale real-world testing to validate the system's performance in different scenarios.
AI Executive Summary
Providing effective medical care during pregnancy remains a key challenge for global public health. Despite progress towards ensuring antenatal care access, many pregnant women still lack access to medical information and expert care. To address this issue, researchers have developed a novel chatbot system aimed at providing reliable health information for maternal health in India.
The system, developed through a collaboration between academic researchers, a health tech company, a public health nonprofit, and a hospital, integrates stage-aware triage, hybrid retrieval, and evidence-conditioned generation. Stage-aware triage identifies high-risk queries and routes them to expert templates, ensuring appropriate guidance in emergencies. Hybrid retrieval retrieves information over curated maternal/newborn guidelines, combining sparse and dense retrieval techniques to improve accuracy and coverage.
In experiments, the system achieved 86.7% emergency recall in the labeled triage benchmark, explicitly reporting the missed-emergency vs. over-escalation trade-off. Additionally, the synthetic multi-evidence retrieval benchmark and LLM-as-judge comparison on real queries demonstrated the system's effectiveness. A multi-layered evaluation strategy allows for high-risk deployment under limited expert supervision, ensuring safety and reliability.
This study is significant for academia and offers new possibilities for practical applications. Especially in low-resource areas, the system can enhance access to maternal health information, facilitating early detection of high-risk pregnancies and adoption of health-supporting behaviors.
However, the system may be limited in handling very complex medical issues as it relies on predefined templates and retrieved evidence. Furthermore, its applicability may be limited in other languages or regions due to its reliance on specific language models and retrieval mechanisms. Future work could include expanding the system to support more languages and regions, further optimizing retrieval and generation mechanisms, and conducting larger-scale real-world testing.
Deep Analysis
Background
Maternal health is a critical area of global public health. Despite progress in ensuring antenatal care access, many pregnant women still lack access to medical information and expert care. In recent years, researchers have begun exploring the potential of large language models (LLMs) in accessing health information. However, bridging the gap between prototypes and deployable systems remains challenging, especially in low-resource, multilingual environments. Existing rule-based chatbots often perform poorly in handling complex medical issues, while LLMs, despite their prowess in natural language processing, lack customization for specific domains.
Core Problem
In low-resource environments, users generally have low health literacy and limited access to medical information. User queries are often short, underspecified, and code-mixed across languages. Answering these queries requires regional context-specific grounding, and partial or missing symptom context makes safe routing decisions difficult. Existing systems often perform poorly in handling these complex issues, failing to provide reliable health information and guidance.
Innovation
The core innovations of this study include:
- Stage-aware triage: Evaluating and routing risks based on different maternal stages to ensure appropriate guidance in emergencies.
- Hybrid retrieval: Combining sparse and dense retrieval techniques to improve retrieval accuracy and coverage.
- Evidence-conditioned generation: Using a large language model for generation, ensuring answers are based on reliable evidence.
- Multi-layered evaluation strategy: Allowing high-risk deployment under limited expert supervision, ensuring safety and reliability.
Methodology
- Stage-aware triage: Using a structured taxonomy to evaluate and route risks based on different maternal stages.
- Hybrid retrieval: Retrieving over curated maternal/newborn guidelines, combining sparse and dense retrieval techniques.
- Evidence-conditioned generation: Using a large language model for generation, ensuring answers are based on reliable evidence.
- Multi-layered evaluation strategy: Allowing high-risk deployment under limited expert supervision, ensuring safety and reliability.
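The triage step can be pictured as a routing decision made before any retrieval or generation happens. The following is a minimal sketch under stated assumptions: the stage names, red-flag phrases, template texts, and the `triage` function are all hypothetical, and simple phrase matching stands in for the structured taxonomy, which this summary does not reproduce.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stage-specific red-flag phrases (illustrative only; not the
# taxonomy used in the study).
RED_FLAGS = {
    "pregnancy": ["heavy bleeding", "severe headache", "no fetal movement"],
    "postpartum": ["heavy bleeding", "high fever", "foul discharge"],
    "newborn": ["not feeding", "difficulty breathing", "yellow skin"],
}

# Hypothetical expert-written templates, one per stage, plus a generic fallback.
EXPERT_TEMPLATES = {
    "pregnancy": "This may be an emergency in pregnancy. Please go to the nearest health facility now.",
    "postpartum": "This may be an emergency after delivery. Please go to the nearest health facility now.",
    "newborn": "This may be an emergency for your baby. Please go to the nearest health facility now.",
}
GENERIC_TEMPLATE = "This may be an emergency. Please go to the nearest health facility now."


@dataclass
class Route:
    path: str                # "expert_template" or "retrieve_and_generate"
    message: Optional[str]   # template text when escalated, otherwise None


def triage(query: str, stage: str) -> Route:
    """Route a query to an expert template if it matches a red flag for its stage."""
    text = query.lower()
    # If the stage is unknown, check the red flags of every stage to stay cautious.
    phrases = RED_FLAGS.get(stage) or [p for ps in RED_FLAGS.values() for p in ps]
    for phrase in phrases:
        if phrase in text:
            return Route("expert_template", EXPERT_TEMPLATES.get(stage, GENERIC_TEMPLATE))
    # No red flag matched: hand over to hybrid retrieval and LLM generation.
    return Route("retrieve_and_generate", None)
```

Only queries that fall through to `retrieve_and_generate` would reach the LLM, so in this sketch an escalated query always receives the expert-written text rather than generated output.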
Experiments
The experimental design includes three main components:
- Labeled triage benchmark: Measuring emergency recall on 150 labeled samples.
- Synthetic multi-evidence retrieval benchmark: Evaluating retrieval against chunk-level evidence labels on 100 samples.
- LLM-as-judge comparison on real queries: Comparing responses on 781 real queries using clinician-codesigned criteria, complemented by expert validation.
Results
Experimental results show that the system achieved 86.7% emergency recall in the labeled triage benchmark, explicitly reporting the missed-emergency vs. over-escalation trade-off. The synthetic multi-evidence retrieval benchmark scored retrieval against chunk-level evidence labels, and the LLM-as-judge comparison on real queries, together with expert validation, assessed end-to-end answer quality; the abstract reports no headline numbers for these three components.
Applications
The system can be directly applied to maternal health information services in India, helping to improve the accessibility and accuracy of health information. By providing reliable health information and guidance, the system can facilitate early detection of high-risk pregnancies and adoption of health-supporting behaviors. Additionally, the system can be used in other low-resource, multilingual environments for health information services.
Limitations & Outlook
Despite the system's good performance in experiments, it may be limited in handling very complex medical issues as it relies on predefined templates and retrieved evidence. Furthermore, its applicability may be limited in other languages or regions due to its reliance on specific language models and retrieval mechanisms. Future work could include expanding the system to support more languages and regions, further optimizing retrieval and generation mechanisms, and conducting larger-scale real-world testing.
Plain Language Accessible to non-experts
Imagine you're in a kitchen cooking a meal. You have a recipe (like our health guidelines), but you need to adjust the method based on different ingredients (like different maternal stages). Our chatbot is like a smart assistant that not only helps you find the key steps in the recipe but also gives advice based on the ingredients you have. For example, if you find some ingredients in the fridge that are about to expire, it will remind you to use those first, just like our system prioritizes high-risk health issues. This assistant can also explain each step in simple language, ensuring that the dish you make is both delicious and safe.
ELI14 Explained like you're 14
Hey there, imagine you're playing a super cool game called 'Maternal Health Guardian.' In this game, you need to help a pregnant mom get the health information she needs. She sends you messages, but sometimes they're short and in different languages. Your task is to find the best answer, like finding hidden treasures in the game!
To complete the task, you have a super assistant that helps you find the most important information from various guides. This assistant is like an NPC (non-player character) in the game, telling you what to do next and making sure you don't miss any important clues.
Sometimes, the pregnant mom asks really urgent questions, like she's not feeling well. That's when your assistant alerts you that this question is important and needs immediate attention, just like an emergency mission in the game!
Through this game, you not only help the mom get the information she needs but also learn a lot about health. Isn't that cool?
Glossary
Stage-aware triage
A mechanism that evaluates and routes risks based on different maternal stages, ensuring appropriate guidance in emergencies.
Used to identify high-risk queries and route them to expert templates.
Hybrid retrieval
Combines sparse and dense retrieval techniques to improve retrieval accuracy and coverage.
Retrieves over curated maternal/newborn guidelines.
Evidence-conditioned generation
Uses a large language model for generation, ensuring answers are based on reliable evidence.
Generates answers based on retrieved evidence.
LLM-as-judge
An evaluation method in which a large language model acts as the evaluator, scoring or comparing generated answers against predefined criteria.
Used for comparison on real queries with clinician-designed criteria.
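A pairwise LLM-as-judge comparison can be aggregated into win rates. The sketch below is illustrative only: `pairwise_win_rate` and `toy_judge` are hypothetical names, the toy judge is a keyword rule standing in for a real LLM call with clinician-codesigned criteria, and judging each pair in both orders is a common precaution against position bias that the study is not confirmed to have used.

```python
from typing import Callable, Dict, List, Tuple

# A judge takes (query, first_answer, second_answer) and returns
# "first", "second", or "tie". In practice this would wrap an LLM call.
Judge = Callable[[str, str, str], str]


def pairwise_win_rate(items: List[Tuple[str, str, str]], judge: Judge) -> Dict[str, float]:
    """Compare system A and system B answers; count a win only if the verdict
    is the same when the presentation order is swapped."""
    if not items:
        raise ValueError("need at least one (query, answer_a, answer_b) item")
    wins_a = wins_b = ties = 0
    for query, ans_a, ans_b in items:
        verdict_ab = judge(query, ans_a, ans_b)  # A shown first
        verdict_ba = judge(query, ans_b, ans_a)  # B shown first
        if verdict_ab == "first" and verdict_ba == "second":
            wins_a += 1
        elif verdict_ab == "second" and verdict_ba == "first":
            wins_b += 1
        else:
            ties += 1  # explicit tie or order-dependent verdict
    n = len(items)
    return {"a": wins_a / n, "b": wins_b / n, "tie": ties / n}


def toy_judge(query: str, first: str, second: str) -> str:
    """Stand-in judge: prefers the answer that refers the user to a health facility."""
    f = "health facility" in first.lower()
    s = "health facility" in second.lower()
    if f and not s:
        return "first"
    if s and not f:
        return "second"
    return "tie"


items = [
    ("fever after delivery", "Please visit a health facility today.", "Drink water."),
    ("diet in pregnancy", "Eat iron-rich foods.", "Eat green vegetables."),
]
rates = pairwise_win_rate(items, toy_judge)
```

Treating order-dependent verdicts as ties is a conservative choice; it lowers both win rates rather than crediting a system for a preference that may only reflect answer position.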
Emergency recall rate
The fraction of queries labeled as emergencies that the system correctly flags as emergencies, usually expressed as a percentage. A missed emergency lowers this rate.
Achieved 86.7% in the labeled triage benchmark.
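To make the missed-emergency vs. over-escalation trade-off concrete, both rates can be computed from the same set of labeled triage decisions. This is a small sketch with made-up toy labels; the counts below are not the benchmark's data, and `triage_metrics` is a hypothetical helper name.

```python
from typing import Dict, Sequence


def triage_metrics(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Dict[str, float]:
    """y_true[i] is True if query i is a labeled emergency;
    y_pred[i] is True if the system escalated query i."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)          # emergencies caught
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)      # emergencies missed
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)      # over-escalations
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)  # correctly not escalated
    return {
        "emergency_recall": tp / (tp + fn) if (tp + fn) else float("nan"),
        "over_escalation_rate": fp / (fp + tn) if (fp + tn) else float("nan"),
    }


# Toy data: 10 emergencies (8 escalated) and 20 non-emergencies (3 escalated).
y_true = [True] * 10 + [False] * 20
y_pred = [True] * 8 + [False] * 2 + [True] * 3 + [False] * 17
metrics = triage_metrics(y_true, y_pred)
# emergency_recall = 8 / 10, over_escalation_rate = 3 / 20
```

Loosening the escalation rule typically raises recall and over-escalation together, which is why reporting the two side by side is more informative than recall alone.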
Multi-layered evaluation strategy
Allows high-risk deployment under limited expert supervision, ensuring safety and reliability.
Used to evaluate the system's performance in different scenarios.
Sparse retrieval
A retrieval technique based on keyword matching, typically used for precise matches.
Combined with dense retrieval in hybrid retrieval.
Dense retrieval
A retrieval technique based on semantic similarity, capable of handling queries in different languages and expressions.
Combined with sparse retrieval in hybrid retrieval.
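One common way to combine a sparse ranking and a dense ranking is reciprocal rank fusion (RRF), which sums `1 / (k + rank)` for each chunk across the input rankings. The sketch below assumes RRF with the conventional `k = 60`; this summary does not state which fusion function the system uses, and the chunk IDs are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by chunk ID for determinism.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical outputs of a keyword-based and an embedding-based retriever.
sparse_ranking = ["c1", "c2", "c3"]
dense_ranking = ["c3", "c1", "c4"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
# fused is ["c1", "c3", "c2", "c4"]: chunks found by both retrievers rise to the top.
```

Because RRF uses only ranks, it does not require the sparse and dense scores to be on comparable scales.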
Chunk-level evidence labels
Labels used to mark whether retrieved evidence chunks are directly related to the question.
Used in the synthetic multi-evidence retrieval benchmark.
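With chunk-level labels, retrieval on a multi-evidence question can be scored by how many of the labeled evidence chunks appear in the top-k results. The following sketch assumes a recall@k style metric; the metric actually used in the benchmark is not stated in this summary, and the function name and chunk IDs are hypothetical.

```python
from typing import Iterable, Sequence


def evidence_recall_at_k(retrieved: Sequence[str], gold: Iterable[str], k: int) -> float:
    """Fraction of labeled evidence chunks that appear in the top-k retrieved chunks."""
    gold_set = set(gold)
    if not gold_set:
        raise ValueError("each question needs at least one labeled evidence chunk")
    top_k = set(retrieved[:k])
    return len(top_k & gold_set) / len(gold_set)


# Hypothetical question whose answer needs two evidence chunks.
retrieved = ["c7", "c2", "c9", "c4"]
gold = {"c2", "c4"}
recall_at_3 = evidence_recall_at_k(retrieved, gold, k=3)  # only c2 is in the top 3
recall_at_4 = evidence_recall_at_k(retrieved, gold, k=4)  # both c2 and c4 are in the top 4
```

For multi-evidence questions, a stricter variant would count a question as solved only when this value reaches 1.0, since a partially grounded answer may omit a needed fact.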
Expert templates
Predefined response templates used to handle high-risk or emergency queries.
Used in stage-aware triage for routing high-risk queries.
Open Questions Unanswered questions from this research
- 1. In low-resource environments, effectively handling short queries in multilingual and noisy settings remains a challenge. Existing systems often perform poorly in handling these complex issues, failing to provide reliable health information and guidance. Future research needs to explore more effective retrieval and generation mechanisms to improve system applicability and accuracy.
- 2. Despite the system's good performance in experiments, it may be limited in handling very complex medical issues as it relies on predefined templates and retrieved evidence. Future research needs to explore more flexible generation mechanisms to improve system performance in complex scenarios.
- 3. The system's applicability may be limited in other languages or regions due to its reliance on specific language models and retrieval mechanisms. Future research needs to explore more general solutions to improve the system's cross-language and cross-region applicability.
- 4. In some cases, the system may not fully replace human expert judgment, especially in complex medical decision-making. Future research needs to explore how to better integrate human expert knowledge with the system's automation capabilities to improve system reliability.
- 5. How to deploy such a system safely in high-risk scenarios under limited expert supervision remains a challenge. Future research needs to explore more effective evaluation strategies to ensure system safety and reliability in different scenarios.
Applications
Immediate Applications
Maternal health information services in India
The system can be directly applied to maternal health information services in India, helping to improve the accessibility and accuracy of health information.
Health information services in low-resource environments
The system can be used in other low-resource, multilingual environments for health information services, providing reliable health information and guidance.
Early detection of high-risk pregnancies
By providing reliable health information and guidance, the system can facilitate early detection of high-risk pregnancies and adoption of health-supporting behaviors.
Long-term Vision
Cross-language and cross-region applicability
In the future, the system can be expanded to support more languages and regions, increasing its applicability and impact.
Handling complex medical issues
In the future, the retrieval and generation mechanisms can be further optimized to improve the handling of complex queries.
Abstract
The ability to provide trustworthy maternal health information using phone-based chatbots can have a significant impact, particularly in low-resource settings where users have low health literacy and limited access to care. However, deploying such systems is technically challenging: user queries are short, underspecified, and code-mixed across languages, answers require regional context-specific grounding, and partial or missing symptom context makes safe routing decisions difficult. We present a chatbot for maternal health in India developed through a partnership between academic researchers, a health tech company, a public health nonprofit, and a hospital. The system combines (1) stage-aware triage, routing high-risk queries to expert templates, (2) hybrid retrieval over curated maternal/newborn guidelines, and (3) evidence-conditioned generation from an LLM. Our core contribution is an evaluation workflow for high-stakes deployment under limited expert supervision. Targeting both component-level and end-to-end testing, we introduce: (i) a labeled triage benchmark (N=150) achieving 86.7% emergency recall, explicitly reporting the missed-emergency vs. over-escalation trade-off; (ii) a synthetic multi-evidence retrieval benchmark (N=100) with chunk-level evidence labels; (iii) LLM-as-judge comparison on real queries (N=781) using clinician-codesigned criteria; and (iv) expert validation. Our findings show that trustworthy medical assistants in multilingual, noisy settings require defense-in-depth design paired with multi-method evaluation, rather than any single model and evaluation method choice.