Comparative Analysis of Large Language Models in Generating Telugu Responses for Maternal Health Queries
Using BERT Score and expert evaluation, this study analyzes the performance of ChatGPT-4o, GeminiAI, and Perplexity AI in generating Telugu maternal health responses, with Gemini leading in expert evaluations.
Key Findings
Methodology
This study combines automated semantic analysis with expert evaluation, using BERT Score to assess the semantic similarity between LLM-generated answers and expert responses. Ten gynecologists fluent in Telugu conducted qualitative evaluations of the generated answers, examining accuracy, fluency, relevance, coherence, and completeness. This comprehensive evaluation framework explores the impact of input language on the quality of Telugu responses.
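The BERT Score used here works by greedily matching each token's contextual embedding in a generated answer against the tokens of the expert reference, and vice versa. A minimal sketch of that matching idea, using toy 2-D vectors in place of real BERT embeddings (the function name and example vectors are illustrative, not from the paper):

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """BERTScore-style precision, recall, and F1 from token embeddings."""
    # Normalize rows so dot products become cosine similarities.
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                       # pairwise cosine-similarity matrix
    precision = sim.max(axis=1).mean()  # each candidate token -> best reference match
    recall = sim.max(axis=0).mean()     # each reference token -> best candidate match
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: three candidate tokens scored against two reference tokens.
cand = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
ref = np.array([[1.0, 0.0], [0.0, 1.0]])
p, r, f1 = bertscore_f1(cand, ref)
print(round(f1, 3))  # -> 0.966
```

In practice the study would obtain these embeddings from a pretrained BERT model (e.g. via the `bert-score` package) rather than from toy vectors; the precision/recall/F1 aggregation is the same.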
Key Results
- Result 1: Perplexity achieved the highest F1 score of 0.704 when prompted in English, indicating strong semantic alignment with expert responses.
- Result 2: Gemini demonstrated high semantic similarity in both English and Telugu prompts, with F1 scores closely following Perplexity.
- Result 3: ChatGPT's semantic alignment improved when prompted in Telugu, highlighting its sensitivity to input language.
Significance
This study reveals the critical role of selecting the appropriate LLM and prompt language in obtaining high-quality information in low-resource language settings. The findings emphasize the need for improved LLM assistance in regional languages, particularly in sensitive areas like maternal health. By providing a comprehensive evaluation of different models, the study offers valuable insights and guidance for future LLM applications in regional languages.
Technical Contribution
The technical contributions of this study include the first systematic evaluation of LLMs in generating Telugu maternal health responses, combining BERT Score and expert evaluation for comprehensive performance analysis. The study reveals the impact of model selection and prompt language on answer quality, offering new perspectives and methods for optimizing LLMs in regional languages.
Novelty
This study is the first to systematically evaluate LLM performance in the low-resource language of Telugu, specifically in the maternal health domain. Compared to previous studies, this research not only examines semantic similarity but also provides a comprehensive performance analysis through expert evaluation.
Limitations
- Limitation 1: The study is limited to Telugu and does not cover other low-resource languages, restricting the generalizability of the conclusions.
- Limitation 2: The subjectivity of expert evaluations may affect the objectivity of the results, requiring further validation.
- Limitation 3: The training data of LLMs was not analyzed in detail, which may affect the comprehensive understanding of model performance.
Future Work
Future research directions include expanding to more low-resource languages, increasing the diversity of maternal health questions, and fine-tuning LLMs specifically for regional languages. Additionally, future work will explore users' trust in and use of AI-generated advice to enhance the reliability and acceptance of AI tools in real healthcare settings.
AI Executive Summary
In low-resource language settings, particularly languages like Telugu, the performance of large language models (LLMs) remains underexplored. Existing research primarily focuses on resource-rich languages, overlooking the potential and challenges of applying LLMs in regional languages.
This study conducts a comparative analysis of three LLMs: ChatGPT-4o, GeminiAI, and Perplexity AI, examining their performance in generating Telugu responses to maternal health queries. The study employs BERT Score as a semantic similarity metric and combines it with expert evaluations from professional gynecologists, assessing the accuracy, fluency, relevance, coherence, and completeness of the generated answers.
The results indicate that Gemini excels in producing accurate and coherent Telugu responses related to maternal health, while Perplexity performs well when prompted in Telugu. ChatGPT's performance shows room for improvement, especially when prompted in English. The study highlights the importance of selecting the appropriate LLM and prompt language for retrieving high-quality information.
These findings not only provide valuable insights and guidance for future LLM applications in regional languages but also emphasize the need for improved LLM assistance in healthcare, particularly in low-resource languages. By offering a comprehensive evaluation of different models, the study provides new perspectives and methods for optimizing LLMs in regional languages.
However, the study has some limitations, such as being limited to Telugu and not covering other low-resource languages, which restricts the generalizability of the conclusions. Additionally, the subjectivity of expert evaluations may affect the objectivity of the results, requiring further validation. Future research directions include expanding to more low-resource languages, increasing the diversity of maternal health questions, and fine-tuning LLMs specifically for regional languages.
Deep Analysis
Background
In recent years, large language models (LLMs) have made significant advancements in the field of natural language processing, particularly in generating fluent and contextually appropriate text. However, their performance in low-resource languages remains inconsistent. Existing research primarily focuses on resource-rich languages like English and Chinese, neglecting the potential and challenges of applying LLMs in regional languages. In sensitive domains like maternal health, the accuracy and trustworthiness of models are crucial. Therefore, systematically evaluating LLM performance in low-resource languages is of great importance.
Core Problem
The core problem of this study is to evaluate LLM performance in generating Telugu responses to maternal health queries. As Telugu is a low-resource language, the performance of existing models in this language has not been thoroughly studied. Additionally, the maternal health domain requires high accuracy and completeness of information, necessitating the development of a comprehensive evaluation framework to thoroughly assess model performance in this domain.
Innovation
The core innovations of this study include:
1) The first systematic evaluation of LLM performance in the low-resource language of Telugu, specifically in the maternal health domain.
2) A comprehensive performance analysis combining BERT Score and expert evaluation, revealing the impact of model selection and prompt language on answer quality.
3) A proposed comprehensive evaluation framework that combines automated semantic analysis and expert evaluation, offering new perspectives and methods for optimizing LLMs in regional languages.
Methodology
The methodology of this study includes the following steps:
- Data Collection: Collect common maternal health-related questions covering topics such as nutrition, symptom management, fetal development, and antenatal care, presented in both English and Telugu.
- Model Generation: Use ChatGPT-4o, GeminiAI, and Perplexity AI to generate Telugu responses.
- Automated Evaluation: Use BERT Score to assess the semantic similarity between generated answers and expert responses.
- Expert Evaluation: Invite ten gynecologists fluent in Telugu to conduct qualitative evaluations of the generated answers, examining accuracy, fluency, relevance, coherence, and completeness.
- Comprehensive Analysis: Combine automated and expert evaluation results to analyze the impact of input language on the quality of generated answers.
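The comprehensive-analysis step above could be sketched as follows: average each model's expert ratings (here assumed to be on a 1-5 scale across the five dimensions) and place them alongside its BERT Score F1. All numbers and the scale are invented for illustration; they are not the study's data.

```python
from statistics import mean

# model -> list of (accuracy, fluency, relevance, coherence, completeness)
# ratings per answer; values below are placeholders, not the paper's data.
expert_ratings = {
    "Gemini": [(5, 4, 5, 5, 4), (4, 5, 5, 4, 5)],
    "ChatGPT-4o": [(4, 4, 4, 3, 4), (4, 3, 4, 4, 3)],
}
bert_f1 = {"Gemini": 0.70, "ChatGPT-4o": 0.65}  # placeholder F1 values

def summarize(model):
    """Mean expert rating across answers and dimensions, plus BERT F1."""
    per_answer = [mean(dims) for dims in expert_ratings[model]]
    return {"expert_mean": mean(per_answer), "bert_f1": bert_f1[model]}

for m in expert_ratings:
    print(m, summarize(m))
```

Reporting both numbers side by side is what lets the study check whether automated semantic similarity agrees with the experts' qualitative judgments.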
Experiments
The experimental design includes the following aspects:
- Datasets: Use a bilingual dataset covering common maternal health-related questions.
- Baselines: Select ChatGPT-4o, GeminiAI, and Perplexity AI as baseline models.
- Evaluation Metrics: Use BERT Score to assess semantic similarity and combine expert evaluation to examine accuracy, fluency, relevance, coherence, and completeness.
- Hyperparameters: Conduct experiments based on the default settings of the models to ensure reproducibility of the results.
- Ablation Studies: Analyze the impact of input language on the quality of generated answers.
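The input-language ablation amounts to grouping F1 scores by model and prompt language and comparing the group means. A small sketch of that grouping, with made-up placeholder scores rather than the paper's results:

```python
from collections import defaultdict
from statistics import mean

# (model, prompt_language, bert_f1) tuples -- illustrative values only.
runs = [
    ("Perplexity", "en", 0.704), ("Perplexity", "te", 0.68),
    ("ChatGPT-4o", "en", 0.62), ("ChatGPT-4o", "te", 0.66),
]

by_key = defaultdict(list)
for model, lang, f1 in runs:
    by_key[(model, lang)].append(f1)

# Positive delta: the model aligns better with expert answers when
# prompted in Telugu; negative: better when prompted in English.
for model in sorted({m for m, _, _ in runs}):
    delta = mean(by_key[(model, "te")]) - mean(by_key[(model, "en")])
    print(f"{model}: Telugu-vs-English F1 delta = {delta:+.3f}")
```

With real scores, this sign-of-delta view is what supports claims like ChatGPT improving under Telugu prompts.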
Results
The results analysis shows:
- Perplexity achieved the highest F1 score of 0.704 when prompted in English, indicating strong semantic alignment with expert responses.
- Gemini demonstrated high semantic similarity in both English and Telugu prompts, with F1 scores closely following Perplexity.
- ChatGPT's semantic alignment improved when prompted in Telugu, highlighting its sensitivity to input language.
- Expert evaluations show that Gemini excels in producing accurate and coherent Telugu responses related to maternal health, while Perplexity performs well when prompted in Telugu.
Applications
The application scenarios of this study include:
- Medical Consultation: Use LLMs to provide accurate maternal health information in low-resource language settings, enhancing the accessibility of healthcare services.
- Educational Training: Provide training materials in regional languages for medical professionals, promoting the dissemination and sharing of knowledge.
- Health Management: Offer personalized health management advice for pregnant women, improving the effectiveness of health management.
Limitations & Outlook
The limitations of this study include:
- The study is limited to Telugu and does not cover other low-resource languages, restricting the generalizability of the conclusions.
- The subjectivity of expert evaluations may affect the objectivity of the results, requiring further validation.
- The training data of the LLMs was not analyzed in detail, which may affect the comprehensive understanding of model performance.

Future research directions include expanding to more low-resource languages, increasing the diversity of maternal health questions, and fine-tuning LLMs specifically for regional languages.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen with three chefs: ChatGPT, Gemini, and Perplexity. Their task is to prepare a dish (answer) based on a recipe (question). The kitchen has recipes in two languages: English and Telugu. Chef Gemini can make delicious dishes regardless of the recipe language. Chef Perplexity performs better when using Telugu recipes. Chef ChatGPT occasionally misses important steps when using English recipes. This study is like evaluating these three chefs' performances under different recipe languages to see who can make dishes that best match the recipe requirements. This way, we can understand which chef performs better in different language settings and how to improve their cooking skills.
ELI14 (Explained like you're 14)
Imagine you're playing a game with three characters: ChatGPT, Gemini, and Perplexity. Their task is to answer questions about maternal health. The game has prompts in two languages: English and Telugu. Gemini character can give great answers no matter the prompt language. Perplexity character performs better when given Telugu prompts. ChatGPT character sometimes misses important details when using English prompts. This study is like evaluating these three characters' performances under different prompt languages to see who can give answers that best match the game's requirements. This way, we can understand which character performs better in different language settings and how to improve their answering skills.
Glossary
Large Language Model
A large language model is a deep learning-based natural language processing model capable of generating and understanding natural language text.
In this paper, large language models are used to generate Telugu maternal health responses.
BERT Score
BERT Score is a metric used to evaluate semantic similarity by comparing contextual embeddings to capture semantic alignment.
This paper uses BERT Score to assess the semantic similarity between LLM-generated answers and expert responses.
Semantic Similarity
Semantic similarity refers to the degree of similarity in meaning between two texts, typically evaluated by comparing contextual embeddings.
In this paper, semantic similarity is used to evaluate the alignment of LLM-generated answers with expert responses.
Telugu
Telugu is a regional language in India, considered a low-resource language, and is relatively under-researched in natural language processing.
This paper studies LLM performance in generating Telugu maternal health responses.
Expert Evaluation
Expert evaluation involves qualitative analysis by domain experts to assess the accuracy, fluency, relevance, coherence, and completeness of generated text.
This paper combines expert evaluation with automated metrics for comprehensive analysis of LLM-generated answers.
Accuracy
Accuracy refers to the correctness of medical facts in the generated text and is a critical dimension of expert evaluation.
In this paper, accuracy is used to assess the medical correctness of LLM-generated maternal health responses.
Fluency
Fluency refers to the grammatical and natural use of language in the generated text and is a critical dimension of expert evaluation.
In this paper, fluency is used to assess the language quality of LLM-generated Telugu text.
Relevance
Relevance refers to the appropriateness and focus of the generated text on the query and is a critical dimension of expert evaluation.
In this paper, relevance is used to assess the focus of LLM-generated answers on maternal health queries.
Coherence
Coherence refers to the logical structure and flow of the generated text and is a critical dimension of expert evaluation.
In this paper, coherence is used to assess the logical flow of LLM-generated answers.
Completeness
Completeness refers to the coverage of all aspects of the query in the generated text and is a critical dimension of expert evaluation.
In this paper, completeness is used to assess the comprehensiveness of LLM-generated answers to maternal health queries.
Open Questions (Unanswered questions from this research)
- 1 How can similar LLM performance evaluations be conducted in other low-resource languages? Current methods focus on Telugu, and systematic research on other low-resource languages is lacking.
- 2 How can the subjectivity of expert evaluations be reduced so it does not compromise the objectivity of the results? Current evaluation methods rely on experts' subjective judgments, potentially leading to inconsistent results.
- 3 How can LLM training data be optimized to improve performance in low-resource languages? Existing studies do not analyze training data in detail, which may limit model performance.
- 4 How can user trust in and acceptance of AI-generated advice be enhanced in real healthcare settings? Current research focuses on model performance evaluation and lacks in-depth study of user experience.
- 5 How can LLMs be further optimized for regional languages to improve their application in maternal health? Existing research focuses on performance evaluation and lacks targeted optimization strategies.
Applications
Immediate Applications
Medical Consultation
Use LLMs to provide accurate maternal health information in low-resource language settings, enhancing the accessibility of healthcare services.
Educational Training
Provide training materials in regional languages for medical professionals, promoting the dissemination and sharing of knowledge.
Health Management
Offer personalized health management advice for pregnant women, improving the effectiveness of health management.
Long-term Vision
Regional Language Healthcare Services
Enhance the accessibility and quality of healthcare services in low-resource language settings by optimizing LLM performance in regional languages.
Global Health Information Sharing
Promote the sharing and dissemination of global health information through the application of multilingual LLMs, improving global health levels.
Abstract
Large Language Models (LLMs) have been progressively exhibiting their capabilities in various areas of research. The performance of LLMs in the acute maternal healthcare area, predominantly in low-resource languages like Telugu, Hindi, Tamil, and Urdu, is still unstudied. This study presents how ChatGPT-4o, GeminiAI, and Perplexity AI respond to pregnancy-related questions asked in different languages. A bilingual dataset is used to obtain results by applying a semantic similarity metric (BERT Score) and expert assessments from experienced gynecologists. Multiple parameters, including accuracy, fluency, relevance, coherence, and completeness, are taken into consideration by the gynecologists to rate the responses generated by the LLMs. Gemini outperforms the other LLMs in producing accurate and coherent pregnancy-relevant responses in Telugu, while Perplexity performed well when the prompts were in Telugu. ChatGPT's performance leaves room for improvement. The results state that both selecting an LLM and the prompting language play a crucial role in retrieving information. Altogether, we emphasize the need to improve LLM assistance in regional languages for healthcare purposes.