Comparative Analysis of Large Language Models in Generating Telugu Responses for Maternal Health Queries
Using BERT Score and expert evaluation, this study analyzes the performance of ChatGPT-4o, GeminiAI, and Perplexity AI in generating Telugu maternal health responses, with Gemini leading in expert evaluations.
Key Findings
Methodology
This study combines automated semantic analysis with expert evaluation, using BERT Score to assess the semantic similarity between LLM-generated answers and expert responses. Ten gynecologists fluent in Telugu conducted qualitative evaluations of the generated answers, examining accuracy, fluency, relevance, coherence, and completeness. This comprehensive evaluation framework explores the impact of input language on the quality of Telugu responses.
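The BERT Score used here works by greedily matching each token's contextual embedding in a generated answer against the tokens of the expert reference, and vice versa. A minimal sketch of that matching idea, using toy 2-D vectors in place of real BERT embeddings (the function name and example vectors are illustrative, not from the paper):

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """BERTScore-style precision, recall, and F1 from token embeddings."""
    # Normalize rows so dot products become cosine similarities.
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                       # pairwise cosine-similarity matrix
    precision = sim.max(axis=1).mean()  # each candidate token -> best reference match
    recall = sim.max(axis=0).mean()     # each reference token -> best candidate match
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: three candidate tokens scored against two reference tokens.
cand = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
ref = np.array([[1.0, 0.0], [0.0, 1.0]])
p, r, f1 = bertscore_f1(cand, ref)
print(round(f1, 3))  # -> 0.966
```

In practice the study would obtain these embeddings from a pretrained BERT model (e.g. via the `bert-score` package) rather than from toy vectors; the precision/recall/F1 aggregation is the same.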
Key Results
- Result 1: Perplexity achieved the highest F1 score of 0.704 when prompted in English, indicating strong semantic alignment with expert responses.
- Result 2: Gemini demonstrated high semantic similarity in both English and Telugu prompts, with F1 scores closely following Perplexity.
- Result 3: ChatGPT's semantic alignment improved when prompted in Telugu, highlighting its sensitivity to input language.
Significance
This study reveals the critical role of selecting the appropriate LLM and prompt language in obtaining high-quality information in low-resource language settings. The findings emphasize the need for improved LLM assistance in regional languages, particularly in sensitive areas like maternal health. By providing a comprehensive evaluation of different models, the study offers valuable insights and guidance for future LLM applications in regional languages.
Technical Contribution
The technical contributions of this study include the first systematic evaluation of LLMs in generating Telugu maternal health responses, combining BERT Score and expert evaluation for comprehensive performance analysis. The study reveals the impact of model selection and prompt language on answer quality, offering new perspectives and methods for optimizing LLMs in regional languages.
Novelty
This study is the first to systematically evaluate LLM performance in the low-resource language of Telugu, specifically in the maternal health domain. Compared to previous studies, this research not only examines semantic similarity but also provides a comprehensive performance analysis through expert evaluation.
Limitations
- Limitation 1: The study is limited to Telugu and does not cover other low-resource languages, restricting the generalizability of the conclusions.
- Limitation 2: The subjectivity of expert evaluations may affect the objectivity of the results, requiring further validation.
- Limitation 3: The training data of LLMs was not analyzed in detail, which may affect the comprehensive understanding of model performance.
Future Work
Future research directions include expanding to more low-resource languages, increasing the diversity of maternal health questions, and fine-tuning LLMs specifically for regional languages. Additionally, future work will explore users' trust in and use of AI-generated advice to enhance the reliability and acceptance of AI tools in real healthcare settings.
AI Executive Summary
In low-resource language settings, particularly languages like Telugu, the performance of large language models (LLMs) remains underexplored. Existing research primarily focuses on resource-rich languages, overlooking the potential and challenges of applying LLMs in regional languages.
This study conducts a comparative analysis of three LLMs: ChatGPT-4o, GeminiAI, and Perplexity AI, examining their performance in generating Telugu responses to maternal health queries. The study employs BERT Score as a semantic similarity metric and combines it with expert evaluations from professional gynecologists, assessing the accuracy, fluency, relevance, coherence, and completeness of the generated answers.
The results indicate that Gemini excels in producing accurate and coherent Telugu responses related to maternal health, while Perplexity performs well when prompted in Telugu. ChatGPT's performance shows room for improvement, especially when prompted in English. The study highlights the importance of selecting the appropriate LLM and prompt language for retrieving high-quality information.
These findings not only provide valuable insights and guidance for future LLM applications in regional languages but also emphasize the need for improved LLM assistance in healthcare, particularly in low-resource languages. By offering a comprehensive evaluation of different models, the study provides new perspectives and methods for optimizing LLMs in regional languages.
However, the study has some limitations, such as being limited to Telugu and not covering other low-resource languages, which restricts the generalizability of the conclusions. Additionally, the subjectivity of expert evaluations may affect the objectivity of the results, requiring further validation. Future research directions include expanding to more low-resource languages, increasing the diversity of maternal health questions, and fine-tuning LLMs specifically for regional languages.
Deep Analysis
Background
In recent years, large language models (LLMs) have made significant advancements in the field of natural language processing, particularly in generating fluent and contextually appropriate text. However, their performance in low-resource languages remains inconsistent. Existing research primarily focuses on resource-rich languages like English and Chinese, neglecting the potential and challenges of applying LLMs in regional languages. In sensitive domains like maternal health, the accuracy and trustworthiness of models are crucial. Therefore, systematically evaluating LLM performance in low-resource languages is of great importance.
Core Problem
The core problem of this study is to evaluate LLM performance in generating Telugu responses to maternal health queries. As Telugu is a low-resource language, the performance of existing models in this language has not been thoroughly studied. Additionally, the maternal health domain requires high accuracy and completeness of information, necessitating the development of a comprehensive evaluation framework to thoroughly assess model performance in this domain.
Innovation
The core innovations of this study include:
1) The first systematic evaluation of LLM performance in the low-resource language of Telugu, specifically in the maternal health domain.
2) A comprehensive performance analysis combining BERT Score and expert evaluation, revealing the impact of model selection and prompt language on answer quality.
3) A proposed comprehensive evaluation framework that combines automated semantic analysis and expert evaluation, offering new perspectives and methods for optimizing LLMs in regional languages.
Methodology
The methodology of this study includes the following steps:
- Data Collection: Collect common maternal health-related questions covering topics such as nutrition, symptom management, fetal development, and antenatal care, presented in both English and Telugu.
- Model Generation: Use ChatGPT-4o, GeminiAI, and Perplexity AI to generate Telugu responses.
- Automated Evaluation: Use BERT Score to assess the semantic similarity between generated answers and expert responses.
- Expert Evaluation: Invite ten gynecologists fluent in Telugu to conduct qualitative evaluations of the generated answers, examining accuracy, fluency, relevance, coherence, and completeness.
- Comprehensive Analysis: Combine automated and expert evaluation results to analyze the impact of input language on the quality of generated answers.
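The comprehensive-analysis step above could be sketched as follows: average each model's expert ratings (here assumed to be on a 1-5 scale across the five dimensions) and place them alongside its BERT Score F1. All numbers and the scale are invented for illustration; they are not the study's data.

```python
from statistics import mean

# model -> list of (accuracy, fluency, relevance, coherence, completeness)
# ratings per answer; values below are placeholders, not the paper's data.
expert_ratings = {
    "Gemini": [(5, 4, 5, 5, 4), (4, 5, 5, 4, 5)],
    "ChatGPT-4o": [(4, 4, 4, 3, 4), (4, 3, 4, 4, 3)],
}
bert_f1 = {"Gemini": 0.70, "ChatGPT-4o": 0.65}  # placeholder F1 values

def summarize(model):
    """Mean expert rating across answers and dimensions, plus BERT F1."""
    per_answer = [mean(dims) for dims in expert_ratings[model]]
    return {"expert_mean": mean(per_answer), "bert_f1": bert_f1[model]}

for m in expert_ratings:
    print(m, summarize(m))
```

Reporting both numbers side by side is what lets the study check whether automated semantic similarity agrees with the experts' qualitative judgments.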
Experiments
The experimental design includes the following aspects:
- Datasets: Use a bilingual dataset covering common maternal health-related questions.
- Baselines: Select ChatGPT-4o, GeminiAI, and Perplexity AI as baseline models.
- Evaluation Metrics: Use BERT Score to assess semantic similarity and combine expert evaluation to examine accuracy, fluency, relevance, coherence, and completeness.
- Hyperparameters: Conduct experiments based on the default settings of the models to ensure reproducibility of the results.
- Ablation Studies: Analyze the impact of input language on the quality of generated answers.
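The input-language ablation amounts to grouping F1 scores by model and prompt language and comparing the group means. A small sketch of that grouping, with made-up placeholder scores rather than the paper's results:

```python
from collections import defaultdict
from statistics import mean

# (model, prompt_language, bert_f1) tuples -- illustrative values only.
runs = [
    ("Perplexity", "en", 0.704), ("Perplexity", "te", 0.68),
    ("ChatGPT-4o", "en", 0.62), ("ChatGPT-4o", "te", 0.66),
]

by_key = defaultdict(list)
for model, lang, f1 in runs:
    by_key[(model, lang)].append(f1)

# Positive delta: the model aligns better with expert answers when
# prompted in Telugu; negative: better when prompted in English.
for model in sorted({m for m, _, _ in runs}):
    delta = mean(by_key[(model, "te")]) - mean(by_key[(model, "en")])
    print(f"{model}: Telugu-vs-English F1 delta = {delta:+.3f}")
```

With real scores, this sign-of-delta view is what supports claims like ChatGPT improving under Telugu prompts.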
Results
The results analysis shows:
- Perplexity achieved the highest F1 score of 0.704 when prompted in English, indicating strong semantic alignment with expert responses.
- Gemini demonstrated high semantic similarity in both English and Telugu prompts, with F1 scores closely following Perplexity.
- ChatGPT's semantic alignment improved when prompted in Telugu, highlighting its sensitivity to input language.
- Expert evaluations show that Gemini excels in producing accurate and coherent Telugu responses related to maternal health, while Perplexity performs well when prompted in Telugu.
Applications
The application scenarios of this study include:
- Medical Consultation: Use LLMs to provide accurate maternal health information in low-resource language settings, enhancing the accessibility of healthcare services.
- Educational Training: Provide training materials in regional languages for medical professionals, promoting the dissemination and sharing of knowledge.
- Health Management: Offer personalized health management advice for pregnant women, improving the effectiveness of health management.
Limitations & Outlook
The limitations of this study include:
- The study is limited to Telugu and does not cover other low-resource languages, restricting the generalizability of the conclusions.
- The subjectivity of expert evaluations may affect the objectivity of the results, requiring further validation.
- The training data of the LLMs was not analyzed in detail, which may affect the comprehensive understanding of model performance.

Future research directions include expanding to more low-resource languages, increasing the diversity of maternal health questions, and fine-tuning LLMs specifically for regional languages.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen with three chefs: ChatGPT, Gemini, and Perplexity. Their task is to prepare a dish (answer) based on a recipe (question). The kitchen has recipes in two languages: English and Telugu. Chef Gemini can make delicious dishes regardless of the recipe language. Chef Perplexity performs better when using Telugu recipes. Chef ChatGPT occasionally misses important steps when using English recipes. This study is like evaluating these three chefs' performances under different recipe languages to see who can make dishes that best match the recipe requirements. This way, we can understand which chef performs better in different language settings and how to improve their cooking skills.
ELI14 (Explained like you're 14)
Imagine you're playing a game with three characters: ChatGPT, Gemini, and Perplexity. Their task is to answer questions about maternal health. The game has prompts in two languages: English and Telugu. Gemini character can give great answers no matter the prompt language. Perplexity character performs better when given Telugu prompts. ChatGPT character sometimes misses important details when using English prompts. This study is like evaluating these three characters' performances under different prompt languages to see who can give answers that best match the game's requirements. This way, we can understand which character performs better in different language settings and how to improve their answering skills.
Glossary
Large Language Model
A large language model is a deep learning-based natural language processing model capable of generating and understanding natural language text.
In this paper, large language models are used to generate Telugu maternal health responses.
BERT Score
BERT Score is a metric used to evaluate semantic similarity by comparing contextual embeddings to capture semantic alignment.
This paper uses BERT Score to assess the semantic similarity between LLM-generated answers and expert responses.
Semantic Similarity
Semantic similarity refers to the degree of similarity in meaning between two texts, typically evaluated by comparing contextual embeddings.
In this paper, semantic similarity is used to evaluate the alignment of LLM-generated answers with expert responses.
Telugu
Telugu is a regional language in India, considered a low-resource language, and is relatively under-researched in natural language processing.
This paper studies LLM performance in generating Telugu maternal health responses.
Expert Evaluation
Expert evaluation involves qualitative analysis by domain experts to assess the accuracy, fluency, relevance, coherence, and completeness of generated text.
This paper combines expert evaluation with automated metrics for comprehensive analysis of LLM-generated answers.
Accuracy
Accuracy refers to the correctness of medical facts in the generated text and is a critical dimension of expert evaluation.
In this paper, accuracy is used to assess the medical correctness of LLM-generated maternal health responses.
Fluency
Fluency refers to the grammatical and natural use of language in the generated text and is a critical dimension of expert evaluation.
In this paper, fluency is used to assess the language quality of LLM-generated Telugu text.
Relevance
Relevance refers to the appropriateness and focus of the generated text on the query and is a critical dimension of expert evaluation.
In this paper, relevance is used to assess the focus of LLM-generated answers on maternal health queries.
Coherence
Coherence refers to the logical structure and flow of the generated text and is a critical dimension of expert evaluation.
In this paper, coherence is used to assess the logical flow of LLM-generated answers.
Completeness
Completeness refers to the coverage of all aspects of the query in the generated text and is a critical dimension of expert evaluation.
In this paper, completeness is used to assess the comprehensiveness of LLM-generated answers to maternal health queries.
Open Questions (Unanswered questions from this research)
- 1 How can similar LLM performance evaluations be conducted in other low-resource languages? Current methods focus on Telugu, and systematic research on other low-resource languages is lacking.
- 2 How can the subjectivity of expert evaluations be reduced so it does not compromise the objectivity of the results? Current evaluation methods rely on experts' subjective judgments, potentially leading to inconsistent results.
- 3 How can LLM training data be optimized to improve performance in low-resource languages? Existing studies do not analyze training data in detail, which may limit model performance.
- 4 How can user trust in and acceptance of AI-generated advice be enhanced in real healthcare settings? Current research focuses on model performance evaluation and lacks in-depth study of user experience.
- 5 How can LLMs be further optimized for regional languages to improve their application in maternal health? Existing research focuses on performance evaluation and lacks targeted optimization strategies.
Applications
Immediate Applications
Medical Consultation
Use LLMs to provide accurate maternal health information in low-resource language settings, enhancing the accessibility of healthcare services.
Educational Training
Provide training materials in regional languages for medical professionals, promoting the dissemination and sharing of knowledge.
Health Management
Offer personalized health management advice for pregnant women, improving the effectiveness of health management.
Long-term Vision
Regional Language Healthcare Services
Enhance the accessibility and quality of healthcare services in low-resource language settings by optimizing LLM performance in regional languages.
Global Health Information Sharing
Promote the sharing and dissemination of global health information through the application of multilingual LLMs, improving global health levels.
Abstract
Large Language Models (LLMs) have been progressively exhibiting their capabilities in various areas of research. The performance of LLMs in the acute maternal healthcare area, predominantly in low-resource languages like Telugu, Hindi, Tamil, and Urdu, is still unstudied. This study presents how ChatGPT-4o, GeminiAI, and Perplexity AI respond to pregnancy-related questions asked in different languages. A bilingual dataset is used to obtain results by applying a semantic similarity metric (BERT Score) and expert assessments from experienced gynecologists. Multiple parameters, including accuracy, fluency, relevance, coherence, and completeness, are taken into consideration by the gynecologists to rate the responses generated by the LLMs. Gemini outperforms the other LLMs in producing accurate and coherent pregnancy-relevant responses in Telugu, while Perplexity performed well when the prompts were in Telugu. ChatGPT's performance leaves room for improvement. The results state that both selecting an LLM and the prompting language play a crucial role in retrieving information. Altogether, we emphasize the need to improve LLM assistance in regional languages for healthcare purposes.