Do What I Say: A Spoken Prompt Dataset for Instruction-Following
Introduced the DOWIS dataset to evaluate SLLMs in multilingual settings, finding that text prompts outperform spoken prompts.
Key Findings
Methodology
The paper introduces the DOWIS dataset, a multilingual dataset of spoken and written prompts designed to evaluate Speech Large Language Models (SLLMs) in instruction-following tasks. DOWIS spans nine tasks and eleven languages, providing ten prompt variants per task-language pair across five styles. The study analyzes the interplay between prompt modality, style, language, and task type.
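As a back-of-envelope reading of these figures (a count derived from the numbers above, not one reported separately here; the actual inventory may differ if some task-language pairs are not fully covered):

```latex
9 \ \text{tasks} \times 11 \ \text{languages} \times 10 \ \text{variants} \approx 990 \ \text{prompts per modality},
\qquad 10 \ \text{variants} \div 5 \ \text{styles} = 2 \ \text{variants per style}.
```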
Key Results
- Result 1: Text prompts outperform spoken prompts, with the largest gaps in low-resource and cross-lingual settings and in tasks with text output.
- Result 2: For tasks requiring speech output, such as text-to-speech synthesis and speech-to-speech translation, spoken prompts perform on par with or better than text prompts.
- Result 3: Informal prompts, whether written or spoken, consistently perform worse across tasks, highlighting the importance of evaluating models with diverse prompt styles.
Significance
The introduction of the DOWIS dataset fills a gap in current SLLM evaluations, providing a more realistic and comprehensive evaluation method. By analyzing the impact of different prompt modalities and styles, the study reveals current models' shortcomings in handling spoken instructions and emphasizes the importance of considering diverse prompts in model development. This research provides a crucial foundation for future model improvements and evaluations.
Technical Contribution
The technical contribution of this paper lies in providing the first multilingual dataset of spoken and textual prompts that can be combined with existing task benchmarks, lowering the barrier for speech instruction-following evaluation. The use of Phi-4 Multimodal and Qwen2.5-Omni models in the study demonstrates performance differences under various prompt conditions, offering directions for future model improvements.
Novelty
DOWIS is the first multilingual parallel spoken and textual prompt dataset written and recorded by native speakers. Unlike existing benchmarks, DOWIS decouples instructions from task inputs, allowing it to be paired with any existing benchmark, providing more natural and diverse language evaluation.
Limitations
- Limitation 1: In low-resource and cross-lingual settings, the evaluated models still perform worse with spoken prompts than with text prompts, indicating difficulty handling spoken instructions.
- Limitation 2: Informal prompts perform poorly across tasks, possibly due to their more colloquial nature.
- Limitation 3: Models show a preference for prompts recorded by speakers of one gender over the other, possibly reflecting gender biases in the models.
Future Work
Future research can explore improving models' performance under spoken instructions, especially in low-resource and cross-lingual settings. Further studies can also analyze how different prompt styles and speaker genders affect model performance, with the aim of reducing biases.
AI Executive Summary
In recent years, Speech Large Language Models (SLLMs) have made significant progress and now support a wide range of tasks. However, these models are typically evaluated with text prompts, which may not reflect real-world scenarios in which users interact through speech. To address this gap, the paper introduces DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instructions. DOWIS spans nine tasks and eleven languages, providing ten prompt variants per task-language pair across five styles, all written and recorded by native speakers. Because DOWIS decouples instructions from task inputs, it can be combined with existing task benchmarks, lowering the barrier for speech instruction-following evaluation. Using DOWIS, the authors benchmark two state-of-the-art SLLMs, Phi-4 Multimodal and Qwen2.5-Omni, analyzing the interplay between prompt modality, style, language, and task type. Text prompts consistently outperform spoken prompts, particularly in low-resource and cross-lingual settings; only for tasks with speech output do spoken prompts close the gap. These findings reveal current models' shortcomings in handling spoken instructions, underline the importance of diverse prompts in evaluation, and point to future work on improving spoken instruction-following in low-resource and cross-lingual settings and on reducing style- and gender-related biases.
Deep Analysis
Background
Speech Large Language Models (SLLMs) have recently achieved remarkable progress in the field of natural language processing. These models can handle both speech and text tasks, demonstrating strong instruction-following capabilities. However, current evaluation methods primarily rely on text prompts, which do not align with how users interact with these models in real-world scenarios. Existing speech instruction-following benchmarks, such as SpeechInstructBench and URO-Bench, have limitations: they support only English and Chinese, generate their instructions with text-to-speech systems, and cannot be reused with other datasets. Furthermore, these benchmarks focus on general instruction-following and reasoning, while researchers also need to evaluate spoken instruction-following for specific tasks such as speech recognition or audio chaptering.
Core Problem
Current evaluation methods for Speech Large Language Models primarily rely on text prompts, which do not reflect how users interact with these models in real-world scenarios. Evaluating models under spoken instructions is crucial for achieving more natural human-machine interaction. However, existing benchmarks for speech instruction evaluation have limitations in terms of language and task coverage, failing to comprehensively reflect model capabilities.
Innovation
This paper presents the DOWIS dataset, the first multilingual dataset of spoken and textual prompts that can be combined with existing task benchmarks. DOWIS includes nine tasks and eleven languages, providing ten prompt variants per task-language pair across five styles. Unlike existing benchmarks, DOWIS decouples instructions from task inputs, allowing it to be paired with any existing benchmark, providing more natural and diverse language evaluation.
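To illustrate the decoupling, the sketch below attaches a single DOWIS instruction to a task input taken from an unrelated benchmark. The record layout, field names, and file paths are hypothetical; this summary does not describe the actual release format.

```python
# Minimal sketch of pairing a decoupled DOWIS prompt with an existing
# benchmark example. Field names and paths are illustrative only.
from dataclasses import dataclass

@dataclass
class DowisPrompt:
    task: str        # e.g. "asr", "speech_translation"
    language: str    # e.g. "en", "de"
    style: str       # one of the five prompt styles
    text: str        # written form of the instruction
    audio_path: str  # the same instruction recorded by a native speaker

def build_request(prompt: DowisPrompt, benchmark_input: str, modality: str) -> dict:
    """Combine a DOWIS instruction with a task input from any benchmark.

    `benchmark_input` is the untouched task input (e.g. a FLEURS utterance
    for ASR); only the instruction comes from DOWIS.
    """
    instruction = prompt.audio_path if modality == "speech" else prompt.text
    return {"instruction": instruction,
            "instruction_modality": modality,
            "input": benchmark_input}

# The same FLEURS utterance evaluated once with a text instruction
# and once with the matching spoken instruction.
prompt = DowisPrompt("asr", "de", "formal",
                     "Transcribe the following audio.", "de/asr_formal_01.wav")
text_request = build_request(prompt, "fleurs/de/utt_0001.wav", "text")
speech_request = build_request(prompt, "fleurs/de/utt_0001.wav", "speech")
```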
Methodology
- Dataset construction: English prompts are collected for nine tasks and translated into ten additional languages.
- Speech recording: native speakers record the prompts on their phones or computers, simulating real-world conditions.
- Dataset statistics: DOWIS contains 3 hours and 17 minutes of audio, covering nine tasks and eleven languages.
- Model evaluation: the state-of-the-art SLLMs Phi-4 Multimodal and Qwen2.5-Omni are benchmarked, analyzing the interplay between prompt modality, style, language, and task type (see the sketch after this list).
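To make the factor grid concrete, the following sketch enumerates every (task, language, style, modality) combination and scores each one separately, which is the structure the analysis above relies on. `run_model`, `score`, and the style names are placeholders, not APIs or identifiers from the paper.

```python
# Hedged sketch of an evaluation grid over prompt modality, style, language, and task.
from itertools import product
import random

TASKS = ["asr", "tts", "speech_translation", "machine_translation",
         "speech_to_speech_translation", "speech_summarization",
         "text_summarization", "audio_chaptering", "spoken_qa"]
MODALITIES = ["text", "speech"]
STYLES = ["style_1", "style_2", "style_3", "style_4", "style_5"]  # placeholder names

def run_model(model_name, task, language, style, modality):
    """Placeholder for prompting an SLLM (e.g. Phi-4 Multimodal or Qwen2.5-Omni)
    with the chosen instruction variant; returns a dummy output string."""
    return f"{model_name} output for {task}/{language}/{style}/{modality}"

def score(task, output):
    """Placeholder for the task-specific metric (WER, BERTScore, CometKiwi, ...)."""
    return random.random()

def evaluate(model_name, languages):
    """Score every cell of the grid so the four factors can be compared afterwards."""
    return {
        (task, lang, style, mod): score(task, run_model(model_name, task, lang, style, mod))
        for task, lang, style, mod in product(TASKS, languages, STYLES, MODALITIES)
    }

results = evaluate("phi-4-multimodal", ["en", "de", "it"])
print(len(results), "evaluation cells")  # 9 tasks x 3 languages x 5 styles x 2 modalities
```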
Experiments
The experimental design evaluates the Phi-4 Multimodal and Qwen2.5-Omni models on the DOWIS dataset. Evaluation tasks include automatic speech recognition, text-to-speech synthesis, speech translation, machine translation, speech-to-speech translation, speech summarization, text summarization, audio chapter generation, and spoken question answering. Evaluation draws on datasets such as FLEURS and MCIF and uses metrics including Word Error Rate (WER), BERTScore, and CometKiwi.
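For reference, Word Error Rate is defined as WER = (S + D + I) / N, where S, D, and I are the substituted, deleted, and inserted words and N is the number of reference words. The snippet below is a minimal, hedged illustration of computing WER and BERTScore with the commonly used jiwer and bert-score packages; it is not the paper's evaluation pipeline, and CometKiwi (reference-free quality estimation) would typically be computed separately with Unbabel's COMET toolkit.

```python
# Hedged example of two of the cited metrics using open-source packages.
from jiwer import wer           # pip install jiwer
from bert_score import score    # pip install bert-score

reference = "the committee approved the proposal without changes"
hypothesis = "the committee approved proposal without any changes"

# Word Error Rate: (substitutions + deletions + insertions) / reference words.
print("WER:", wer(reference, hypothesis))

# BERTScore: similarity between candidate and reference token embeddings.
P, R, F1 = score([hypothesis], [reference], lang="en")
print("BERTScore F1:", F1.item())
```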
Results
Experimental results show that text prompts outperform spoken prompts, with the largest gaps in low-resource and cross-lingual settings and in tasks with text output. Only for tasks requiring speech output do spoken prompts close the gap. Furthermore, informal prompts, whether written or spoken, consistently perform worse across tasks, highlighting the importance of evaluating models with diverse prompt styles.
Applications
The DOWIS dataset can be used to evaluate Speech Large Language Models' instruction-following capabilities in multilingual settings, providing developers with a more comprehensive evaluation tool. The dataset can also help researchers analyze the impact of different prompt modalities and styles on model performance, driving model improvement and optimization.
Limitations & Outlook
Even with DOWIS's multilingual spoken and written prompts, the evaluated models still perform worse with spoken prompts than with text prompts in low-resource and cross-lingual settings. Additionally, the models show a preference for prompts recorded by speakers of one gender, possibly reflecting gender biases. Future research can explore improving performance under spoken instructions, especially in low-resource and cross-lingual settings, and reducing such biases.
Plain Language (accessible to non-experts)
Imagine you're at an international conference and want a translation assistant to help you understand speakers in different languages. Traditional assistants might only accept typed input, meaning you would have to type out what each speaker says yourself, which is slow and inconvenient. The DOWIS dataset is like a multilingual exam for speech assistants: it checks whether they can understand and follow spoken instructions given in many different languages. Using this exam, researchers can measure and improve how naturally assistants respond to spoken instructions, so that a translation helper can handle speech in different languages more quickly and accurately.
ELI14 (explained like you're 14)
Hey there! Imagine you're playing a super cool game that can understand what you say and react to your commands. Isn't that awesome? But a lot of the time, these systems can only understand typed commands, not spoken ones. It's like wanting your pet dog to follow your voice, but it can only read notes you write. To make these systems smarter, scientists have built something called the DOWIS dataset. It's like a giant test course that checks how well a system understands spoken instructions in lots of different languages. By running systems through this test course, scientists can see what they get wrong and improve them, so that one day your game character really will understand the commands you say out loud, in whatever language you speak. Cool, right?
Glossary
Speech Large Language Models (SLLMs)
Models capable of handling both speech and text tasks, demonstrating strong instruction-following capabilities.
Used in this paper to evaluate instruction-following capabilities in multilingual settings.
DOWIS dataset
A multilingual dataset of spoken and written prompts designed to evaluate SLLMs in instruction-following tasks.
Introduced in this paper to fill the gap in current evaluation methods.
Phi-4 Multimodal
One of the state-of-the-art SLLMs evaluated on speech and text tasks in this study.
Used in this paper for benchmarking.
Qwen2.5-Omni
One of the state-of-the-art SLLMs evaluated on speech and text tasks in this study.
Used in this paper for benchmarking.
Text prompts
Textual instructions used to direct models to perform specific tasks.
Used in this paper for performance comparison with spoken prompts.
Spoken prompts
Spoken instructions used to direct models to perform specific tasks.
Used in this paper for performance comparison with text prompts.
Word Error Rate (WER)
A metric used to evaluate speech recognition performance, measuring the proportion of word-level errors (substitutions, deletions, and insertions) relative to the reference transcript.
Used in this paper to evaluate automatic speech recognition tasks.
BERTScore
A metric used to evaluate text generation quality, which compares BERT embeddings of the generated and reference texts to compute their similarity.
Used in this paper to evaluate text generation tasks.
CometKiwi
A metric used to evaluate translation quality without requiring reference translations, highly correlated with human evaluation.
Used in this paper to evaluate machine translation and speech translation tasks.
MCIF
A multimodal crosslingual instruction-following benchmark providing text and spoken question-answering data for evaluation.
Used in this paper to evaluate spoken question-answering tasks.
Open Questions (unanswered questions from this research)
- 1. How can models' instruction-following under spoken instructions be improved in low-resource and cross-lingual settings? Current models perform poorly in these settings, so more effective strategies for generalization are needed.
- 2. How can models' biases toward prompts recorded by speakers of one gender be reduced? The study shows that models exhibit such preferences, which calls for further research into mitigating them.
- 3. How can informal prompts be handled better? Informal prompts perform poorly across tasks, possibly because of their more colloquial nature.
- 4. How can spoken instruction-following be strengthened without increasing computational cost? Handling spoken instructions may require more computational resources than handling text.
- 5. How can robustness in multilingual settings be improved? Current models may struggle with multilingual tasks and need to become more robust.
Applications
Immediate Applications
Multilingual Speech Assistants
The DOWIS dataset can be used to train and evaluate multilingual speech assistants, enabling them to understand and respond to spoken instructions more naturally.
Cross-Language Translation Tools
The dataset can aid in developing smarter translation tools capable of handling spoken inputs from different languages.
Speech Recognition Systems
Researchers can improve the performance of speech recognition systems, especially in multilingual settings, using the DOWIS dataset.
Long-term Vision
Intelligent Meeting Assistants
In the future, the DOWIS dataset can be used to develop intelligent meeting assistants capable of real-time translation and summarization of meeting content.
Globalized Human-Machine Interaction
Applications of the DOWIS dataset can drive the development of globalized human-machine interaction, enabling users of different languages to interact with technology more naturally.
Abstract
Speech Large Language Models (SLLMs) have rapidly expanded, supporting a wide range of tasks. These models are typically evaluated using text prompts, which may not reflect real-world scenarios where users interact with speech. To address this gap, we introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair, across five styles. Using DOWIS, we benchmark state-of-the-art SLLMs, analyzing the interplay between prompt modality, style, language, and task type. Results show that text prompts consistently outperform spoken prompts, particularly for low-resource and cross-lingual settings. Only for tasks with speech output, spoken prompts do close the gap, highlighting the need for speech-based prompting in SLLM evaluation.
References (20)
SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information
Chih-Kai Yang, Neo Ho, Yen-Ting Piao et al.
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Changli Tang, Wenyi Yu, Guangzhi Sun et al.
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
Prabhat Pandey, R. Swaminathan, Vijay Girish et al.
MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
Dingdong Wang, Jincenzi Wu, Junan Li et al.
PandaGPT: One Model To Instruction-Follow Them All
Yixuan Su, Tian Lan, Huayang Li et al.
MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
Sara Papi, Maike Zufle, Marco Gaido et al.
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder et al.
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
Qian Yang, Jin Xu, Wenrui Liu et al.
Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
Kuofeng Gao, Shu-Tao Xia, Ke Xu et al.
Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps
Giuseppe Attanasio, Beatrice Savoldi, Dennis Fucci et al.
From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition
A. Morris, V. Maier, P. Green
From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions
Fabian Retkowski, Alexander Waibel
Are LLMs Breaking MT Metrics? Results of the WMT24 Metrics Shared Task
Markus Freitag, Nitika Mathur, Daniel Deutsch et al.
Robust Speech Recognition via Large-Scale Weak Supervision
Alec Radford, Jong Wook Kim, Tao Xu et al.
URO-Bench: Towards Comprehensive Evaluation for End-to-End Spoken Dialogue Models
Ruiqi Yan, Xiquan Li, Wenxi Chen et al.
On The Landscape of Spoken Language Models: A Comprehensive Survey
Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien et al.
Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Sara Papi, Javier García Gilabert, Zachary Hopton et al.
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson et al.
VoiceBench: Benchmarking LLM-Based Voice Assistants
Yiming Chen, Xianghu Yue, Chen Zhang et al.
FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech
Alexis Conneau, Min Ma, Simran Khanuja et al.