Interpretable Chinese Metaphor Identification via LLM-Assisted MIPVU Rule Script Generation: A Comparative Protocol Study
LLM-assisted MIPVU rule script generation enables interpretable Chinese metaphor identification; protocol choice is the main source of variation.
Key Findings
Methodology
This paper introduces an LLM-assisted pipeline for metaphor identification, operationalizing four protocols: MIP/MIPVU lexical analysis, CMDAG conceptual mapping annotation, emotion-based detection, and simile-oriented identification. Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision.
Key Results
- Result 1: Evaluated on seven Chinese metaphor datasets, Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, while pairwise Cohen's kappa between Protocols A and D is merely 0.001, and Protocols B and C exhibit near-perfect agreement (kappa = 0.986).
- Result 2: An interpretability audit shows all protocols achieve 100% deterministic reproducibility, with rationale correctness from 0.40 to 0.87 and editability from 0.80 to 1.00.
- Result 3: Error analysis identifies conceptual-domain mismatch and register sensitivity as dominant failure modes.
Significance
This study is the first to conduct a cross-protocol comparison in Chinese metaphor identification, revealing that protocol choice is the largest source of variation, exceeding model-level variation. This indicates that in metaphor identification, protocol selection is more crucial than model selection. Furthermore, rule-script architectures achieve competitive performance while maintaining full transparency, offering new directions for future research.
Technical Contribution
The technical contribution of this paper lies in implementing four metaphor identification protocols as executable rule scripts, providing complete interpretability and auditability. Unlike existing end-to-end classifiers, this approach allows for detailed auditing and modification of each decision step, ensuring reproducibility and transparency of results.
Novelty
This study is the first to use LLMs for generating executable metaphor identification rule scripts and to conduct cross-protocol comparisons in the Chinese context. This approach not only improves identification accuracy but also enhances the interpretability and reproducibility of results.
Limitations
- Limitation 1: Because Chinese lacks morphological markers, metaphor identification relies almost entirely on context and world knowledge, which increases task difficulty.
- Limitation 2: Differences between protocols may lead to inconsistent identification results, especially when dealing with complex metaphor structures.
- Limitation 3: Although rule scripts provide transparency, the LLMs they rely on may introduce biases, particularly when handling unseen corpora.
Future Work
Future research could explore better integration of multiple protocols to improve the accuracy and consistency of metaphor identification. Additionally, developing richer Chinese metaphor annotation resources and enhancing LLMs' contextual understanding capabilities are important research directions.
AI Executive Summary
Metaphor identification is a foundational task in figurative language processing, yet most computational approaches operate as opaque classifiers offering no insight into why an expression is judged metaphorical. This interpretability gap is especially acute for Chinese, where rich figurative traditions, absent morphological cues, and limited annotated resources compound the challenge.
This paper presents an LLM-assisted pipeline that operationalizes four metaphor identification protocols—MIP/MIPVU lexical analysis, CMDAG conceptual mapping annotation, emotion-based detection, and simile-oriented identification—as executable, human-auditable rule scripts. Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision.
We evaluate on seven Chinese metaphor datasets spanning token-, sentence-, and span-level annotation, establishing the first cross-protocol comparison for Chinese metaphor identification. Within-protocol evaluation shows Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, while cross-protocol analysis reveals striking divergence: pairwise Cohen's kappa between Protocols A and D is merely 0.001, whereas Protocols B and C exhibit near-perfect agreement (kappa = 0.986).
An interpretability audit shows all protocols achieve 100% deterministic reproducibility, with rationale correctness from 0.40 to 0.87 and editability from 0.80 to 1.00. Error analysis identifies conceptual-domain mismatch and register sensitivity as dominant failure modes.
Our results demonstrate that protocol choice is the single largest source of variation in metaphor identification, exceeding model-level variation, and that rule-script architectures achieve competitive performance while maintaining full transparency. We release our codebase, protocol implementations, and evaluation scripts to support reproducible research on interpretable figurative language processing.
Deep Analysis
Background
Metaphor pervades human language, structuring how we reason about abstract concepts through concrete experience. Automatic metaphor identification, the task of determining whether a given linguistic expression is used metaphorically, has received sustained attention in computational linguistics, driven by applications in sentiment analysis, machine translation, and discourse understanding.
Despite considerable progress, the field faces a persistent challenge. State-of-the-art neural classifiers based on pre-trained language models achieve strong performance on benchmark datasets but provide no structured explanation for their decisions. A model may correctly flag a token as metaphorical yet offer no insight into the conceptual mapping, basic-meaning contrast, or figurative mechanism that justifies the label. This opacity limits both scientific understanding of what these models learn and practical deployment in educational or annotation-support settings where users need to know.
In Chinese, this interpretability problem is compounded by several challenges. First, Chinese lacks the morphological inflections and derivational patterns that provide surface cues for metaphor in Indo-European languages; the distinction between literal and figurative senses must be resolved almost entirely through context and world knowledge. Second, Chinese figurative language encompasses diverse phenomena, including conceptual metaphor, simile, metonymy, and culture-specific figures of speech, that do not map neatly onto annotation frameworks developed for English. Third, annotated resources for Chinese metaphor remain relatively scarce and fragmented across incompatible annotation schemes.
Core Problem
The core problem in metaphor identification is the tension between performance and interpretability: state-of-the-art neural classifiers perform well on benchmarks but offer no transparency into their decision-making, a gap that is particularly pronounced for Chinese.
The complexity of Chinese lies in its lack of morphological markers, making metaphor identification almost entirely reliant on context and world knowledge. Additionally, Chinese figurative language phenomena are diverse, encompassing conceptual metaphor, simile, metonymy, etc., which do not easily correspond to annotation frameworks developed for English. The scarcity and incompatibility of annotated resources further exacerbate the difficulty of identification.
Therefore, achieving high performance while providing interpretability in metaphor identification remains a pressing challenge.
Innovation
The core innovation of this paper is the introduction of an LLM-assisted pipeline for metaphor identification, operationalizing four protocols as executable rule scripts. This approach not only improves identification accuracy but also enhances the interpretability and reproducibility of results.
- LLM-assisted rule script generation: Using LLMs to generate executable metaphor identification rule scripts achieves full transparency of the identification process.
- Cross-protocol comparison: Conducting the first cross-protocol comparison in Chinese metaphor identification reveals that protocol choice is the largest source of variation.
- Modular design: Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision.
Methodology
The methodology of this paper consists of the following steps:
- Preprocessing: Text is segmented, POS-tagged, and normalized to ensure consistency.
- Candidate Selection: Depending on the protocol, analysis targets are selected. For example, Protocol A selects all content words, while Protocol B selects sentences containing potential cross-domain expressions.
- Semantic Analysis: LLMs retrieve and contrast contextual and basic meanings, or evaluate emotional valence.
- Classification Decision: Binary or multi-class decisions are made according to protocol criteria. For example, Protocol A labels a token as metaphorical if its contextual meaning contrasts with its basic meaning.
- Rationale Generation: A structured explanation is generated for each decision, including the protocol step that triggered it, the key evidence, and a confidence indicator.
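The steps above can be sketched as a minimal rule script. Everything here is illustrative (the names `mip_protocol`, `Rationale`, and the sense strings are not from the paper), and the LLM call is injected as a plain function, stubbed with a deterministic lookup, so that every other step in the chain remains deterministic and auditable:

```python
# Minimal sketch of one rule-script protocol (hypothetical names; not the
# paper's actual implementation). The LLM call is passed in as a function so
# every other step stays deterministic and auditable.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rationale:
    step: str        # protocol step that triggered the decision
    evidence: str    # key evidence (e.g. contrasting senses)
    confidence: str  # coarse confidence indicator

@dataclass
class Decision:
    token: str
    metaphorical: bool
    rationale: Rationale

def mip_protocol(tokens: list[tuple[str, str]],
                 ask_llm: Callable[[str], str]) -> list[Decision]:
    """Protocol A (MIP) sketch: flag content words whose contextual
    meaning contrasts with their basic meaning."""
    decisions = []
    for word, pos in tokens:
        if pos not in {"NOUN", "VERB", "ADJ", "ADV"}:  # candidate selection
            continue
        contextual = ask_llm(f"Contextual meaning of '{word}': ...")
        basic = ask_llm(f"Basic (most concrete) meaning of '{word}': ...")
        contrast = contextual != basic                 # MIP decision criterion
        decisions.append(Decision(
            token=word,
            metaphorical=contrast,
            rationale=Rationale(
                step="basic-vs-contextual contrast",
                evidence=f"contextual={contextual!r}, basic={basic!r}",
                confidence="high" if contrast else "low",
            ),
        ))
    return decisions

# Deterministic stub standing in for a temperature-0 LLM call.
senses = {"Contextual meaning of '淹没': ...": "overwhelm (abstract)",
          "Basic (most concrete) meaning of '淹没': ...": "submerge in water"}
out = mip_protocol([("淹没", "VERB"), ("的", "PART")],
                   lambda p: senses.get(p, "unknown"))
```

Because the only non-deterministic component is isolated behind `ask_llm`, a reviewer can replay, inspect, or edit any individual step without touching the rest of the chain.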
Experiments
The experimental design includes evaluation on seven Chinese metaphor datasets, covering token-, sentence-, and span-level annotation. The benchmark datasets used include PSU CMC, CMC, CMDAG, Chinese Simile, NLPCC 2024 T9, ConFiguRe, and ChineseMCorpus.
Each protocol is evaluated on its most closely aligned dataset using standard train/dev/test splits. For cross-protocol evaluation, all four protocols are applied to a common subset of PSU CMC, converted to sentence-level labels.
Experiments use GPT-4 as the underlying LLM, with temperature set to 0 to maximize determinism. Evaluation uses standard metrics: precision, recall, and F1 score for binary classification; partial-match F1 for span extraction.
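For concreteness, the binary-classification metrics reduce to the standard definitions over true positives, false positives, and false negatives; this is a generic sketch, not the paper's evaluation code:

```python
# Standard precision/recall/F1 for binary metaphor labels
# (illustrative; not the paper's evaluation script).
def prf1(gold: list[int], pred: list[int]) -> tuple[float, float, float]:
    tp = sum(g == p == 1 for g, p in zip(gold, pred))   # true positives
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# tp=2, fp=1, fn=1 -> precision = recall = f1 = 2/3
p, r, f = prf1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```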
Results
Experimental results show that Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, with an overall accuracy of 0.898. Register-level analysis reveals that academic prose performs best with an F1 of 0.598.
Protocol B (CMDAG) achieves an F1 of 0.347 on sentence-level identification, with high precision but low recall: the metaphors it flags are largely correct, but it misses many metaphorical sentences.
Protocol C (Emotion) mirrors Protocol B's performance, suggesting that emotion-based and conceptual-mapping approaches capture a similar subset of metaphors in this dataset. Protocol D (Simile) achieves an F1 of 0.392 in binary classification, showing moderate precision.
Applications
This method can be directly applied to Chinese text metaphor identification, especially in scenarios requiring high transparency and interpretability, such as educational and annotation support environments.
Additionally, the method can be used to develop more complex figurative language processing systems, integrating multiple protocols to improve identification accuracy and consistency.
In industrial applications, this method can be used in sentiment analysis, machine translation, and discourse understanding, providing more accurate and interpretable results.
Limitations & Outlook
Despite its competitive performance in metaphor identification, the method has several limitations. First, because Chinese lacks morphological markers, identification relies almost entirely on context and world knowledge, which increases task difficulty.
Second, differences between protocols may lead to inconsistent identification results, especially for complex metaphor structures.
Finally, although rule scripts provide transparency, the LLMs they rely on may introduce biases, particularly when handling unseen corpora. Future research could explore better integration of multiple protocols to improve the accuracy and consistency of metaphor identification.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen, cooking a meal. You have four different recipes, each with its own steps and requirements. To make a delicious dish, you need to choose one recipe and follow the steps one by one.
During this process, you might encounter some issues, like missing ingredients or unclear steps. At this point, you can turn to an experienced chef who will provide advice and guidance based on your needs, helping you solve the problem.
In metaphor identification, we're like choosing and executing different recipes. Each protocol is like a recipe, with its own steps and requirements. We use a large language model (LLM) as our 'chef' to provide advice and guidance when needed.
This way, we can better understand and identify metaphors in text, just like making a delicious dish. Each step is transparent, and we can see how each decision is made and adjust as needed.
ELI14 (Explained like you're 14)
Hey there! Let's talk about something cool: metaphor identification. Imagine you're playing a game with four different characters, each with their own skills and missions.
In this game, you need to choose a character and complete missions. Each character has its own strengths, like some are good at attacking, while others are good at defending. To win the game, you need to choose the right character based on the situation.
In metaphor identification, we're like choosing different characters. Each protocol is like a character, with its own skills and missions. We use a large language model (LLM) as our 'game guide' to provide advice and guidance when needed.
This way, we can better understand and identify metaphors in text, just like winning a game. Each step is transparent, and we can see how each decision is made and adjust as needed. Isn't that cool?
Glossary
Metaphor Identification
Metaphor identification is the task of determining whether a given linguistic expression is used metaphorically.
In this paper, metaphor identification is achieved through four protocols.
Large Language Model (LLM)
A large language model is a deep learning-based model capable of understanding and generating natural language.
This paper uses LLMs to assist in generating metaphor identification rule scripts.
MIP/MIPVU
MIP/MIPVU is a metaphor identification protocol that identifies metaphors through lexical analysis.
Protocol A uses MIP/MIPVU for token-level metaphor identification.
CMDAG
CMDAG is a metaphor identification protocol that identifies metaphors through conceptual mapping annotation.
Protocol B uses CMDAG for sentence-level metaphor identification.
Emotion-Based Detection
Emotion-based detection is a metaphor identification protocol that identifies metaphors through affective incongruity.
Protocol C uses emotion-based detection for metaphor identification.
Simile-Oriented Identification
Simile-oriented identification is a metaphor identification protocol that identifies metaphors through explicit comparison markers.
Protocol D uses simile-oriented identification for metaphor identification.
Rule Script
A rule script is an executable program used to implement metaphor identification protocols.
The metaphor identification method proposed in this paper uses rule scripts for implementation.
Cross-Protocol Comparison
Cross-protocol comparison refers to comparing the performance of different metaphor identification protocols on the same dataset.
This paper conducts the first cross-protocol comparison in Chinese metaphor identification.
Cohen's kappa
Cohen's kappa is a statistical measure used to assess the agreement between classifiers.
This paper uses Cohen's kappa to evaluate the agreement between protocols.
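The cross-protocol agreement figures (e.g. kappa = 0.001 between Protocols A and D, 0.986 between B and C) follow the standard two-rater formula, which a short sketch makes concrete (illustrative code, not the paper's):

```python
# Cohen's kappa for two binary labelers: observed agreement corrected
# for chance agreement (standard formula; illustrative only).
def cohens_kappa(a: list[int], b: list[int]) -> float:
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    pa1 = sum(a) / n                                  # P(rater a labels 1)
    pb1 = sum(b) / n                                  # P(rater b labels 1)
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)            # chance agreement
    return (po - pe) / (1 - pe) if pe != 1 else 1.0

# 4/6 observed agreement, 0.5 expected by chance -> kappa = 1/3
k = cohens_kappa([1, 0, 1, 1, 0, 0], [1, 0, 1, 0, 0, 1])
```

A kappa near 0, as between Protocols A and D, means the two protocols agree no more often than chance, i.e. they are effectively measuring different phenomena.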
Interpretability
Interpretability refers to whether the decision-making process of a model or algorithm is transparent and understandable.
The method in this paper achieves high interpretability through rule scripts.
Open Questions (Unanswered questions from this research)
- Open Question 1: How can we improve the accuracy and consistency of metaphor identification without increasing computational complexity? Current methods perform poorly on complex metaphor structures, requiring more efficient algorithms.
- Open Question 2: How can we develop richer Chinese metaphor annotation resources to support more comprehensive metaphor identification research? Existing annotation resources are scarce and mutually incompatible, limiting the breadth and depth of research.
- Open Question 3: How can we improve LLMs' contextual understanding to better support metaphor identification? Current LLMs may introduce biases when handling unseen corpora.
- Open Question 4: How can we better integrate multiple protocols to improve the accuracy and consistency of metaphor identification? Existing protocols perform poorly on complex metaphor structures, requiring more effective combination strategies.
- Open Question 5: How can we improve metaphor identification performance while maintaining high transparency? Rule-script methods, although transparent, may not match end-to-end classifiers in accuracy.
Applications
Immediate Applications
Educational Field
This method can be used in metaphor identification teaching in the educational field, helping students better understand and analyze metaphors in text.
Sentiment Analysis
In sentiment analysis, this method can provide more accurate and interpretable results, helping businesses better understand customer feedback.
Machine Translation
In machine translation, this method can improve translation accuracy, especially when dealing with figurative language.
Long-term Vision
Cross-Language Metaphor Identification
This method can be extended to other languages, achieving cross-language metaphor identification and promoting multilingual text analysis.
Intelligent Text Analysis Systems
This method can be used to develop intelligent text analysis systems, providing more comprehensive and in-depth text understanding capabilities.
Abstract
Metaphor identification is a foundational task in figurative language processing, yet most computational approaches operate as opaque classifiers offering no insight into why an expression is judged metaphorical. This interpretability gap is especially acute for Chinese, where rich figurative traditions, absent morphological cues, and limited annotated resources compound the challenge. We present an LLM-assisted pipeline that operationalises four metaphor identification protocols--MIP/MIPVU lexical analysis, CMDAG conceptual-mapping annotation, emotion-based detection, and simile-oriented identification--as executable, human-auditable rule scripts. Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision. We evaluate on seven Chinese metaphor datasets spanning token-, sentence-, and span-level annotation, establishing the first cross-protocol comparison for Chinese metaphor identification. Within-protocol evaluation shows Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, while cross-protocol analysis reveals striking divergence: pairwise Cohen's kappa between Protocols A and D is merely 0.001, whereas Protocols B and C exhibit near-perfect agreement (kappa = 0.986). An interpretability audit shows all protocols achieve 100% deterministic reproducibility, with rationale correctness from 0.40 to 0.87 and editability from 0.80 to 1.00. Error analysis identifies conceptual-domain mismatch and register sensitivity as dominant failure modes. Our results demonstrate that protocol choice is the single largest source of variation in metaphor identification, exceeding model-level variation, and that rule-script architectures achieve competitive performance while maintaining full transparency.
References (20)
ConFiguRe: Exploring Discourse-level Chinese Figures of Speech
Dawei Zhu, Qiusi Zhan, Zhejian Zhou et al.
Neural Multitask Learning for Simile Recognition
Lizhen Liu, Xiao Hu, Wei Song et al.
The measurement of observer agreement for categorical data.
J. Landis, G. Koch
MelBERT: Metaphor Detection via Contextualized Late Interaction using Metaphorical Identification Theories
Minjin Choi, Sunkyung Lee, Eunseong Choi et al.
Semantic classifications for detection of verb metaphors
Beata Beigman Klebanov, C. W. Leong, E. Gutiérrez et al.
Metaphor Detection with Cross-Lingual Model Transfer
Yulia Tsvetkov, Leonid Boytsov, A. Gershman et al.
Metaphor Detection with Effective Context Denoising
Shunyu Wang, Yucheng Li, Chenghua Lin et al.
A Report on the 2018 VUA Metaphor Detection Shared Task
C. W. Leong, Beata Beigman Klebanov, Ekaterina Shutova
CMDAG: A Chinese Metaphor Dataset with Annotated Grounds as CoT for Boosting Metaphor Generation
Yujie Shao, Xinrong Yao, Xingwei Qu et al.
A method for linguistic metaphor identification: From MIP to MIPVU
G. Steen
Metaphor: A Practical Introduction
Z. Kövecses, R. Benczes
MIP: A method for identifying metaphorically used words in discourse
G. Steen, L. Cameron, A. Cienki et al.
Models of Metaphor in NLP
Ekaterina Shutova
Metaphor Interpretation as Embodied Simulation
R. Gibbs
PAL: Program-aided Language Models
Luyu Gao, Aman Madaan, Shuyan Zhou et al.
DeepMet: A Reading Comprehension Paradigm for Token-level Metaphor Detection
Chuandong Su, F. Fukumoto, Xiaoxi Huang et al.
Explainable Metaphor Identification Inspired by Conceptual Metaphor Theory
Mengshi Ge, Rui Mao, E. Cambria
Pre-Training with Whole Word Masking for Chinese BERT
Yiming Cui, Wanxiang Che, Ting Liu et al.
A Report on the 2020 VUA and TOEFL Metaphor Detection Shared Task
C. W. Leong, Beata Beigman Klebanov, Chris Hamill et al.
Verb Metaphor Detection via Contextual Relation Learning
Wei Song, Shuhui Zhou, Ruiji Fu et al.