Do What I Say: A Spoken Prompt Dataset for Instruction-Following
Introduced the DOWIS dataset to evaluate SLLMs in multilingual settings, finding that text prompts outperform spoken prompts.
Key Findings
Methodology
The paper introduces the DOWIS dataset, a multilingual dataset of spoken and written prompts designed to evaluate Speech Large Language Models (SLLMs) in instruction-following tasks. DOWIS spans nine tasks and eleven languages, providing ten prompt variants per task-language pair across five styles. The study analyzes the interplay between prompt modality, style, language, and task type.
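As a back-of-envelope reading of these figures (a count derived from the numbers above, not one reported separately here; the actual inventory may differ if some task-language pairs are not fully covered):

```latex
9 \ \text{tasks} \times 11 \ \text{languages} \times 10 \ \text{variants} \approx 990 \ \text{prompts per modality},
\qquad 10 \ \text{variants} \div 5 \ \text{styles} = 2 \ \text{variants per style}.
```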
Key Results
- Result 1: Text prompts outperform spoken prompts, with the largest gaps in low-resource and cross-lingual settings and in tasks with text output.
- Result 2: For tasks requiring speech output, such as text-to-speech synthesis and speech-to-speech translation, spoken prompts perform on par with or better than text prompts.
- Result 3: Informal prompts, whether written or spoken, consistently perform worse across tasks, highlighting the importance of evaluating models with diverse prompt styles.
Significance
The introduction of the DOWIS dataset fills a gap in current SLLM evaluations, providing a more realistic and comprehensive evaluation method. By analyzing the impact of different prompt modalities and styles, the study reveals current models' shortcomings in handling spoken instructions and emphasizes the importance of considering diverse prompts in model development. This research provides a crucial foundation for future model improvements and evaluations.
Technical Contribution
The technical contribution of this paper lies in providing the first multilingual dataset of spoken and textual prompts that can be combined with existing task benchmarks, lowering the barrier for speech instruction-following evaluation. The use of Phi-4 Multimodal and Qwen2.5-Omni models in the study demonstrates performance differences under various prompt conditions, offering directions for future model improvements.
Novelty
DOWIS is the first multilingual parallel spoken and textual prompt dataset written and recorded by native speakers. Unlike existing benchmarks, DOWIS decouples instructions from task inputs, allowing it to be paired with any existing benchmark, providing more natural and diverse language evaluation.
Limitations
- Limitation 1: In low-resource and cross-lingual settings, the evaluated models still perform worse with spoken prompts than with text prompts, indicating difficulty handling spoken instructions.
- Limitation 2: Informal prompts perform poorly across tasks, possibly due to their more colloquial nature.
- Limitation 3: Models show a preference for prompts recorded by speakers of one gender over the other, possibly reflecting gender biases in the models.
Future Work
Future research can explore improving models' performance under spoken instructions, especially in low-resource and cross-lingual settings. Further studies can also analyze how different prompt styles and speaker genders affect model performance, with the aim of reducing biases.
AI Executive Summary
In recent years, Speech Large Language Models (SLLMs) have made significant progress and now support a wide range of tasks. However, these models are typically evaluated with text prompts, which may not reflect real-world scenarios in which users interact through speech. To address this gap, the paper introduces DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instructions. DOWIS spans nine tasks and eleven languages, providing ten prompt variants per task-language pair across five styles, all written and recorded by native speakers. Because DOWIS decouples instructions from task inputs, it can be combined with existing task benchmarks, lowering the barrier for speech instruction-following evaluation. Using DOWIS, the authors benchmark two state-of-the-art SLLMs, Phi-4 Multimodal and Qwen2.5-Omni, analyzing the interplay between prompt modality, style, language, and task type. Text prompts consistently outperform spoken prompts, particularly in low-resource and cross-lingual settings; only for tasks with speech output do spoken prompts close the gap. These findings reveal current models' shortcomings in handling spoken instructions, underline the importance of diverse prompts in evaluation, and point to future work on improving spoken instruction-following in low-resource and cross-lingual settings and on reducing style- and gender-related biases.
Deep Analysis
Background
Speech Large Language Models (SLLMs) have recently achieved remarkable progress in the field of natural language processing. These models can handle both speech and text tasks, demonstrating strong instruction-following capabilities. However, current evaluation methods primarily rely on text prompts, which do not align with how users interact with these models in real-world scenarios. Existing speech instruction-following benchmarks, such as SpeechInstructBench and URO-Bench, have limitations: they support only English and Chinese, generate their instructions with text-to-speech systems, and cannot be reused with other datasets. Furthermore, these benchmarks focus on general instruction-following and reasoning, while researchers also need to evaluate spoken instruction-following for specific tasks such as speech recognition or audio chaptering.
Core Problem
Current evaluation methods for Speech Large Language Models primarily rely on text prompts, which do not reflect how users interact with these models in real-world scenarios. Evaluating models under spoken instructions is crucial for achieving more natural human-machine interaction. However, existing benchmarks for speech instruction evaluation have limitations in terms of language and task coverage, failing to comprehensively reflect model capabilities.
Innovation
This paper presents the DOWIS dataset, the first multilingual dataset of spoken and textual prompts that can be combined with existing task benchmarks. DOWIS includes nine tasks and eleven languages, providing ten prompt variants per task-language pair across five styles. Unlike existing benchmarks, DOWIS decouples instructions from task inputs, allowing it to be paired with any existing benchmark, providing more natural and diverse language evaluation.
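To illustrate the decoupling, the sketch below attaches a single DOWIS instruction to a task input taken from an unrelated benchmark. The record layout, field names, and file paths are hypothetical; this summary does not describe the actual release format.

```python
# Minimal sketch of pairing a decoupled DOWIS prompt with an existing
# benchmark example. Field names and paths are illustrative only.
from dataclasses import dataclass

@dataclass
class DowisPrompt:
    task: str        # e.g. "asr", "speech_translation"
    language: str    # e.g. "en", "de"
    style: str       # one of the five prompt styles
    text: str        # written form of the instruction
    audio_path: str  # the same instruction recorded by a native speaker

def build_request(prompt: DowisPrompt, benchmark_input: str, modality: str) -> dict:
    """Combine a DOWIS instruction with a task input from any benchmark.

    `benchmark_input` is the untouched task input (e.g. a FLEURS utterance
    for ASR); only the instruction comes from DOWIS.
    """
    instruction = prompt.audio_path if modality == "speech" else prompt.text
    return {"instruction": instruction,
            "instruction_modality": modality,
            "input": benchmark_input}

# The same FLEURS utterance evaluated once with a text instruction
# and once with the matching spoken instruction.
prompt = DowisPrompt("asr", "de", "formal",
                     "Transcribe the following audio.", "de/asr_formal_01.wav")
text_request = build_request(prompt, "fleurs/de/utt_0001.wav", "text")
speech_request = build_request(prompt, "fleurs/de/utt_0001.wav", "speech")
```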
Methodology
- Dataset construction: English prompts are collected for nine tasks and translated into ten additional languages.
- Speech recording: native speakers record the prompts on their phones or computers, simulating real-world conditions.
- Dataset statistics: DOWIS contains 3 hours and 17 minutes of audio, covering nine tasks and eleven languages.
- Model evaluation: the state-of-the-art SLLMs Phi-4 Multimodal and Qwen2.5-Omni are benchmarked, analyzing the interplay between prompt modality, style, language, and task type (see the sketch after this list).
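To make the factor grid concrete, the following sketch enumerates every (task, language, style, modality) combination and scores each one separately, which is the structure the analysis above relies on. `run_model`, `score`, and the style names are placeholders, not APIs or identifiers from the paper.

```python
# Hedged sketch of an evaluation grid over prompt modality, style, language, and task.
from itertools import product
import random

TASKS = ["asr", "tts", "speech_translation", "machine_translation",
         "speech_to_speech_translation", "speech_summarization",
         "text_summarization", "audio_chaptering", "spoken_qa"]
MODALITIES = ["text", "speech"]
STYLES = ["style_1", "style_2", "style_3", "style_4", "style_5"]  # placeholder names

def run_model(model_name, task, language, style, modality):
    """Placeholder for prompting an SLLM (e.g. Phi-4 Multimodal or Qwen2.5-Omni)
    with the chosen instruction variant; returns a dummy output string."""
    return f"{model_name} output for {task}/{language}/{style}/{modality}"

def score(task, output):
    """Placeholder for the task-specific metric (WER, BERTScore, CometKiwi, ...)."""
    return random.random()

def evaluate(model_name, languages):
    """Score every cell of the grid so the four factors can be compared afterwards."""
    return {
        (task, lang, style, mod): score(task, run_model(model_name, task, lang, style, mod))
        for task, lang, style, mod in product(TASKS, languages, STYLES, MODALITIES)
    }

results = evaluate("phi-4-multimodal", ["en", "de", "it"])
print(len(results), "evaluation cells")  # 9 tasks x 3 languages x 5 styles x 2 modalities
```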
Experiments
The experimental design evaluates the Phi-4 Multimodal and Qwen2.5-Omni models on the DOWIS dataset. Evaluation tasks include automatic speech recognition, text-to-speech synthesis, speech translation, machine translation, speech-to-speech translation, speech summarization, text summarization, audio chapter generation, and spoken question answering. Evaluation draws on datasets such as FLEURS and MCIF and uses metrics including Word Error Rate (WER), BERTScore, and CometKiwi.
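For reference, Word Error Rate is defined as WER = (S + D + I) / N, where S, D, and I are the substituted, deleted, and inserted words and N is the number of reference words. The snippet below is a minimal, hedged illustration of computing WER and BERTScore with the commonly used jiwer and bert-score packages; it is not the paper's evaluation pipeline, and CometKiwi (reference-free quality estimation) would typically be computed separately with Unbabel's COMET toolkit.

```python
# Hedged example of two of the cited metrics using open-source packages.
from jiwer import wer           # pip install jiwer
from bert_score import score    # pip install bert-score

reference = "the committee approved the proposal without changes"
hypothesis = "the committee approved proposal without any changes"

# Word Error Rate: (substitutions + deletions + insertions) / reference words.
print("WER:", wer(reference, hypothesis))

# BERTScore: similarity between candidate and reference token embeddings.
P, R, F1 = score([hypothesis], [reference], lang="en")
print("BERTScore F1:", F1.item())
```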
Results
Experimental results show that text prompts outperform spoken prompts, with the largest gaps in low-resource and cross-lingual settings and in tasks with text output. Only for tasks requiring speech output do spoken prompts close the gap. Furthermore, informal prompts, whether written or spoken, consistently perform worse across tasks, highlighting the importance of evaluating models with diverse prompt styles.
Applications
The DOWIS dataset can be used to evaluate Speech Large Language Models' instruction-following capabilities in multilingual settings, providing developers with a more comprehensive evaluation tool. The dataset can also help researchers analyze the impact of different prompt modalities and styles on model performance, driving model improvement and optimization.
Limitations & Outlook
Even with DOWIS's multilingual spoken and written prompts, the evaluated models still perform worse with spoken prompts than with text prompts in low-resource and cross-lingual settings. Additionally, the models show a preference for prompts recorded by speakers of one gender, possibly reflecting gender biases. Future research can explore improving performance under spoken instructions, especially in low-resource and cross-lingual settings, and reducing such biases.
Plain Language (accessible to non-experts)
Imagine you're at an international conference and want a translation assistant to help you understand speakers in different languages. Traditional assistants might only accept typed input, meaning you would have to type out what each speaker says yourself, which is slow and inconvenient. The DOWIS dataset is like a multilingual exam for speech assistants: it checks whether they can understand and follow spoken instructions given in many different languages. Using this exam, researchers can measure and improve how naturally assistants respond to spoken instructions, so that a translation helper can handle speech in different languages more quickly and accurately.
ELI14 (explained like you're 14)
Hey there! Imagine you're playing a super cool game that can understand what you say and react to your commands. Isn't that awesome? But a lot of the time, these systems can only understand typed commands, not spoken ones. It's like wanting your pet dog to follow your voice, but it can only read notes you write. To make these systems smarter, scientists have built something called the DOWIS dataset. It's like a giant test course that checks how well a system understands spoken instructions in lots of different languages. By running systems through this test course, scientists can see what they get wrong and improve them, so that one day your game character really will understand the commands you say out loud, in whatever language you speak. Cool, right?
Glossary
Speech Large Language Models (SLLMs)
Models capable of handling both speech and text tasks, demonstrating strong instruction-following capabilities.
Used in this paper to evaluate instruction-following capabilities in multilingual settings.
DOWIS dataset
A multilingual dataset of spoken and written prompts designed to evaluate SLLMs in instruction-following tasks.
Introduced in this paper to fill the gap in current evaluation methods.
Phi-4 Multimodal
One of the state-of-the-art SLLMs evaluated on speech and text tasks in this study.
Used in this paper for benchmarking.
Qwen2.5-Omni
One of the state-of-the-art SLLMs evaluated on speech and text tasks in this study.
Used in this paper for benchmarking.
Text prompts
Textual instructions used to direct models to perform specific tasks.
Used in this paper for performance comparison with spoken prompts.
Spoken prompts
Spoken instructions used to direct models to perform specific tasks.
Used in this paper for performance comparison with text prompts.
Word Error Rate (WER)
A metric used to evaluate speech recognition performance, measuring the proportion of word-level errors (substitutions, deletions, and insertions) relative to the reference transcript.
Used in this paper to evaluate automatic speech recognition tasks.
BERTScore
A metric used to evaluate text generation quality, which compares BERT embeddings of the generated and reference texts to compute their similarity.
Used in this paper to evaluate text generation tasks.
CometKiwi
A metric used to evaluate translation quality without requiring reference translations, highly correlated with human evaluation.
Used in this paper to evaluate machine translation and speech translation tasks.
MCIF
A multimodal crosslingual instruction-following benchmark providing text and spoken question-answering data for evaluation.
Used in this paper to evaluate spoken question-answering tasks.
Open Questions (unanswered questions from this research)
- 1. How can models' instruction-following under spoken instructions be improved in low-resource and cross-lingual settings? Current models perform poorly in these settings, so more effective strategies for generalization are needed.
- 2. How can models' biases toward prompts recorded by speakers of one gender be reduced? The study shows that models exhibit such preferences, which calls for further research into mitigating them.
- 3. How can informal prompts be handled better? Informal prompts perform poorly across tasks, possibly because of their more colloquial nature.
- 4. How can spoken instruction-following be strengthened without increasing computational cost? Handling spoken instructions may require more computational resources than handling text.
- 5. How can robustness in multilingual settings be improved? Current models may struggle with multilingual tasks and need to become more robust.
Applications
Immediate Applications
Multilingual Speech Assistants
The DOWIS dataset can be used to train and evaluate multilingual speech assistants, enabling them to understand and respond to spoken instructions more naturally.
Cross-Language Translation Tools
The dataset can aid in developing smarter translation tools capable of handling spoken inputs from different languages.
Speech Recognition Systems
Researchers can improve the performance of speech recognition systems, especially in multilingual settings, using the DOWIS dataset.
Long-term Vision
Intelligent Meeting Assistants
In the future, the DOWIS dataset can be used to develop intelligent meeting assistants capable of real-time translation and summarization of meeting content.
Globalized Human-Machine Interaction
Applications of the DOWIS dataset can drive the development of globalized human-machine interaction, enabling users of different languages to interact with technology more naturally.
Abstract
Speech Large Language Models (SLLMs) have rapidly expanded, supporting a wide range of tasks. These models are typically evaluated using text prompts, which may not reflect real-world scenarios where users interact with speech. To address this gap, we introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair, across five styles. Using DOWIS, we benchmark state-of-the-art SLLMs, analyzing the interplay between prompt modality, style, language, and task type. Results show that text prompts consistently outperform spoken prompts, particularly for low-resource and cross-lingual settings. Only for tasks with speech output, spoken prompts do close the gap, highlighting the need for speech-based prompting in SLLM evaluation.
References (20)
SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information
Chih-Kai Yang, Neo Ho, Yen-Ting Piao et al.
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Changli Tang, Wenyi Yu, Guangzhi Sun et al.
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
Prabhat Pandey, R. Swaminathan, Vijay Girish et al.
MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
Dingdong Wang, Jincenzi Wu, Junan Li et al.
PandaGPT: One Model To Instruction-Follow Them All
Yixuan Su, Tian Lan, Huayang Li et al.
MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
Sara Papi, Maike Zufle, Marco Gaido et al.
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder et al.
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
Qian Yang, Jin Xu, Wenrui Liu et al.
Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
Kuofeng Gao, Shu-Tao Xia, Ke Xu et al.
Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps
Giuseppe Attanasio, Beatrice Savoldi, Dennis Fucci et al.
From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition
A. Morris, V. Maier, P. Green
From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions
Fabian Retkowski, Alexander Waibel
Are LLMs Breaking MT Metrics? Results of the WMT24 Metrics Shared Task
Markus Freitag, Nitika Mathur, Daniel Deutsch et al.
Robust Speech Recognition via Large-Scale Weak Supervision
Alec Radford, Jong Wook Kim, Tao Xu et al.
URO-Bench: Towards Comprehensive Evaluation for End-to-End Spoken Dialogue Models
Ruiqi Yan, Xiquan Li, Wenxi Chen et al.
On The Landscape of Spoken Language Models: A Comprehensive Survey
Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien et al.
Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Sara Papi, Javier García Gilabert, Zachary Hopton et al.
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson et al.
VoiceBench: Benchmarking LLM-Based Voice Assistants
Yiming Chen, Xianghu Yue, Chen Zhang et al.
FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech
Alexis Conneau, Min Ma, Simran Khanuja et al.