SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring

TL;DR

SCOPE pairs a frozen LLM with a plug-in open-set classifier, reaching 91.05% open-set detection accuracy and correcting 96.63% of anomalous readbacks in ATC readback monitoring.

cs.LG 2026-05-28

Qihan Deng Minghua Zhang Yang Yang Zhenyu Gao

Natural Language Processing Open-set Recognition Large Language Models Air Traffic Control Real-time Monitoring

Key Findings

Methodology

The SCOPE framework combines a frozen large language model (LLM) with a lightweight plug-in open-set classifier (POC) to detect anomalies in pilot readbacks of ATC instructions. The core components are:
• POC, which models known categories in semantic feature space and uses k-nearest neighbors (KNN) to detect unknown ones;
• Diverse Example Anchored Retrieval (DEAR), which retrieves scenario-relevant, diverse examples to enrich the context for in-context learning (ICL);
• Air Traffic Chain-of-Thought (ATCoT), which guides the model through structured semantic reasoning over intents and slots.
The system also applies rule-based semantic reordering to generate explanations and correction suggestions. Experiments on a semi-synthetic communication dataset show that SCOPE achieves 91.05% open-set detection accuracy and a 96.63% correction rate, with latency low enough for operational deployment. The architecture draws on the semantic reasoning of the LLM while keeping cost down through plug-in modules, which addresses the difficulty of deploying large models in safety-critical, real-time environments.
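The idea of flagging unknown categories by distance in feature space can be illustrated with a minimal sketch. It is not the authors' POC: the function name `knn_open_set_predict`, the mean-distance rejection rule, the threshold value, the category labels, and the two-dimensional toy features are all assumptions made for illustration. In practice the features would come from a sentence encoder or the LLM itself.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_open_set_predict(query, train_feats, train_labels, k=3, threshold=0.5):
    """Return a known label by majority vote over the k nearest neighbors,
    or "unknown" when the mean distance to those neighbors exceeds
    the threshold (a hypothetical rejection rule)."""
    ranked = sorted(
        (euclidean(query, feat), label)
        for feat, label in zip(train_feats, train_labels)
    )
    nearest = ranked[:k]
    mean_dist = sum(dist for dist, _ in nearest) / len(nearest)
    if mean_dist > threshold:
        return "unknown"
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


# Toy 2-D features standing in for semantic embeddings (illustrative only).
train_feats = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # cluster for "correct"
    (1.0, 1.0), (1.1, 1.0), (1.0, 1.1),   # cluster for "wrong_altitude"
]
train_labels = ["correct"] * 3 + ["wrong_altitude"] * 3

print(knn_open_set_predict((0.05, 0.05), train_feats, train_labels))
print(knn_open_set_predict((5.0, 5.0), train_feats, train_labels))
```

The first query sits inside a known cluster and receives that cluster's label; the second is far from every training sample and is rejected as unknown. Because only the stored features and the threshold change when new categories appear, a classifier of this kind can be updated without touching the frozen LLM.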

Key Results

In few-shot settings, SCOPE attains 91.05% accuracy in open-set detection, outperforming baselines such as BERT-based classifiers (which achieve around 75%). It also corrects 96.63% of anomalous pilot readbacks across diverse communication scenarios.
The system keeps inference latency below 50 milliseconds, which meets the real-time requirements of air traffic control operations. Combining DEAR and ATCoT improves detection performance, especially on complex, unseen communication patterns.
Ablation studies indicate that structured reasoning and diverse example retrieval both contribute substantially to the accuracy gains.

Significance

This research addresses a critical gap in aviation safety: automatic detection of communication anomalies in ATC operations. Traditional rule-based and shallow ML methods struggle with the semantic variability and evolving terminology of pilot-controller exchanges. By drawing on the semantic understanding and reasoning capabilities of large language models, SCOPE offers a scalable, interpretable, and efficient alternative. Its ability to detect unknown communication patterns and explain its decisions supports situational awareness and reduces the burden on human attention. The framework’s low latency and high accuracy make it a candidate for real-time deployment in safety-critical environments, and the approach may extend to multimodal, multi-domain safety monitoring in intelligent transportation systems.

Technical Contribution

This work introduces several key innovations:
• A hybrid architecture combining a frozen LLM with a lightweight plug-in open-set classifier (POC), enabling open-set recognition without retraining the large model;
• A Diverse Example Anchored Retrieval (DEAR) mechanism that selects scenario-relevant, diverse support examples to improve in-context learning performance;
• Air Traffic Chain-of-Thought (ATCoT), which guides the LLM through structured semantic reasoning over intents and slots, improving interpretability and accuracy.
Together, these contributions address the limitations of existing methods, which often rely on fine-tuning large models or lack open-set capabilities. The framework also incorporates rule-based semantic reordering so that outputs are both accurate and explainable. The result is a system capable of high-precision anomaly detection at low inference cost, suitable for real-world safety-critical applications.
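Selecting examples that are both relevant and diverse can be sketched with Maximal Marginal Relevance (MMR), a standard diversity-based reranking technique. The summary does not say which selection rule DEAR uses, so MMR is a stand-in here, and the function names, the trade-off weight `lam`, and the toy vectors are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query, candidates, k=2, lam=0.5):
    """Greedily pick k candidate indices, trading relevance to the query
    against similarity to the candidates already picked."""
    selected = []
    remaining = list(range(len(candidates)))

    def mmr_score(i):
        relevance = cosine(query, candidates[i])
        # Redundancy is the highest similarity to any already-selected example.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lam * relevance - (1.0 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings: candidates 0 and 1 are near-duplicates, candidate 2 differs.
query = (1.0, 0.2)
candidates = [(1.0, 0.0), (0.99, 0.05), (0.7, 0.7)]

print(mmr_select(query, candidates, k=2, lam=0.5))  # balances relevance and diversity
print(mmr_select(query, candidates, k=2, lam=1.0))  # relevance only
```

With `lam=0.5` the second pick skips the near-duplicate in favor of the dissimilar candidate; with `lam=1.0` the selection reduces to plain nearest-neighbor retrieval, which would fill the prompt with redundant examples.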

Novelty

This study is presented as the first to combine a frozen large language model with a lightweight open-set plug-in classifier for real-time ATC readback anomaly detection. Unlike prior approaches that require extensive retraining or fine-tuning, SCOPE relies on in-context learning and a modular plug-in design, so it can operate in few-shot settings without updating the LLM. Its use of diverse, scenario-relevant examples (DEAR) enriches the context given to the model, while the structured reasoning module (ATCoT) improves interpretability and the detection of complex communication intents. This integrated approach addresses the open-set recognition challenge in aviation communication, a problem that previous models largely overlooked. The framework’s efficiency, scalability, and explainability are its main advantages over existing methods.

Limitations

The model's robustness in noisy real-world environments, such as severe radio interference or multilingual exchanges, remains unverified. Its performance may degrade under such conditions, so further validation is needed.
The system relies on predefined scenario examples and rules, which require continuous updates to keep pace with evolving aviation terminology and procedures. Automating this update process poses a challenge.
Deployment at scale involves hardware and integration complexities, especially in resource-constrained environments. Model compression and edge deployment strategies need further development to facilitate widespread adoption.

Future Work

Future research will focus on enhancing robustness against environmental noise and multilingual scenarios, possibly through multimodal data fusion. Developing adaptive, self-updating example retrieval mechanisms will be crucial for maintaining system relevance over time. Additionally, exploring federated learning approaches could enable continuous online learning without compromising data privacy. Extending the framework to multimodal inputs, such as combining speech, radar, and visual data, could further improve detection accuracy and situational awareness. These directions aim to realize a fully autonomous, scalable, and explainable air traffic safety monitoring system capable of supporting next-generation intelligent transportation networks.

AI Executive Summary

The safety and efficiency of modern air traffic depend on precise communication between air traffic controllers (ATCos) and pilots. Traditionally, this is verified manually through pilot readbacks, in which pilots repeat instructions to confirm understanding. With rising air traffic volume, human operators face increasing cognitive load, which raises the risk of miscommunication. Readback anomalies remain implicated in approximately 80% of aviation incidents, which motivates automated, reliable monitoring.

Existing approaches have primarily used rule-based systems or shallow machine learning models, which struggle to generalize across the diverse and evolving phraseology of ATC communications. These methods often lack robustness in real-world scenarios characterized by noise, multilingual exchanges, and domain-specific terminology. Large language models (LLMs), such as GPT-4, offer much stronger reasoning and contextual comprehension. Nonetheless, deploying these models in safety-critical, real-time environments remains difficult because of computational cost and the need for interpretability.

In response, this study introduces SCOPE, a novel lightweight framework that couples a frozen LLM with a plug-in open-set classifier (POC), designed explicitly for ATC readback anomaly detection. The system integrates three key modules: DEAR, which retrieves scenario-relevant, diverse examples to enrich context; ATCoT, which guides structured semantic reasoning over intents and slots; and rule-based semantic reordering to generate explanations and corrections. This architecture enables high detection accuracy (91.05%) and correction rate (96.63%) while maintaining low latency suitable for operational deployment.
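To make the rule-based explanation and correction step concrete, here is a minimal sketch that compares slots extracted from an instruction with slots extracted from the readback. It assumes an upstream step (for example the LLM guided by ATCoT) has already produced the slot dictionaries. The function `check_readback`, the slot names, and the call sign are hypothetical, and the source does not describe the actual rules at this level of detail.

```python
def check_readback(instruction_slots, readback_slots):
    """Compare instruction slots with readback slots and report mismatches.

    Returns a dict with an anomaly flag, human-readable explanations, and a
    suggested corrected readback (here simply the instruction's slot values).
    """
    issues = []
    for slot, expected in instruction_slots.items():
        actual = readback_slots.get(slot)
        if actual is None:
            issues.append(f"missing {slot}: expected '{expected}'")
        elif actual != expected:
            issues.append(f"wrong {slot}: heard '{actual}', expected '{expected}'")
    corrected = dict(instruction_slots) if issues else dict(readback_slots)
    return {
        "anomalous": bool(issues),
        "issues": issues,
        "corrected_readback": corrected,
    }


# Hypothetical example: the pilot transposes the flight level digits.
instruction = {"callsign": "CCA1234", "action": "climb", "altitude": "FL320"}
readback = {"callsign": "CCA1234", "action": "climb", "altitude": "FL230"}

result = check_readback(instruction, readback)
print(result["anomalous"])
print(result["issues"])
```

Keeping this comparison in plain rules rather than in the LLM makes each flagged mismatch traceable to a specific slot, which is one way to obtain the kind of explanation the framework aims for.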

The experimental results demonstrate that SCOPE outperforms existing baselines, including models based solely on BERT or LSTM, especially in few-shot settings. Its ability to identify unknown communication patterns and provide interpretable explanations marks a significant step toward autonomous, safety-critical AI systems in aviation. The framework’s modular design ensures scalability and adaptability, paving the way for broader applications in multimodal transportation safety monitoring.

Despite these advances, challenges remain. The robustness of the system under extreme noise, multilingual scenarios, and rapidly evolving industry terminology needs further validation. Integrating continuous online learning and multimodal data sources will also be important for future development. Overall, SCOPE offers a promising path toward safer air traffic management through interpretable, AI-driven monitoring.

Deep Dive

Abstract

Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. However, readback anomalies remain implicated in approximately 80% of aviation incidents. This vulnerability is further exacerbated by rising traffic volume and elevated cognitive workload, thereby motivating automated readback monitoring by machine. Traditional rule-based and machine learning approaches struggle to generalize across the highly variable and evolving phraseology of air traffic controller-pilot communications. While Large Language Models (LLMs) have opened a new avenue through their strong reasoning and generalization capabilities, existing approaches still face deployment and computational barriers in practice. In this work, we propose Semantic reasoning for Communication via Open-set Plug-in with Examples (SCOPE), a novel lightweight-training LLM framework that advances both the efficiency and accuracy of machine-based ATC readback monitoring. The core idea is to couple a plug-in open-set classifier with a carefully designed in-context learning mechanism on top of a frozen LLM. Extensive experiments on the semi-synthetic communication dataset show that SCOPE attains superior accuracy while delivering the low-latency response required for operational environments. Under a few-shot setting, SCOPE achieves 91.05% accuracy in open-set detection and corrects 96.63% of anomalous readbacks, thereby outperforming the strongest available baselines while providing explanations for its decisions. These findings demonstrate the potential of our framework as a practical pathway toward interpretable and controllable ATC readback monitoring.

cs.LG cs.AI cs.CL cs.HC cs.IR

