Interpretable Chinese Metaphor Identification via LLM-Assisted MIPVU Rule Script Generation: A Comparative Protocol Study
LLM-assisted MIPVU rule script generation enables interpretable Chinese metaphor identification; protocol choice is the main source of variation.
Key Findings
Methodology
This paper introduces an LLM-assisted pipeline for metaphor identification, operationalizing four protocols: MIP/MIPVU lexical analysis, CMDAG conceptual mapping annotation, emotion-based detection, and simile-oriented identification. Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision.
Key Results
- Result 1: Evaluated on seven Chinese metaphor datasets, Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, while pairwise Cohen's kappa between Protocols A and D is merely 0.001, and Protocols B and C exhibit near-perfect agreement (kappa = 0.986).
- Result 2: An interpretability audit shows all protocols achieve 100% deterministic reproducibility, with rationale correctness from 0.40 to 0.87 and editability from 0.80 to 1.00.
- Result 3: Error analysis identifies conceptual-domain mismatch and register sensitivity as dominant failure modes.
Significance
This study is the first to conduct a cross-protocol comparison in Chinese metaphor identification, revealing that protocol choice is the largest source of variation, exceeding model-level variation. This indicates that in metaphor identification, protocol selection is more crucial than model selection. Furthermore, rule-script architectures achieve competitive performance while maintaining full transparency, offering new directions for future research.
Technical Contribution
The technical contribution of this paper lies in implementing four metaphor identification protocols as executable rule scripts, providing complete interpretability and auditability. Unlike existing end-to-end classifiers, this approach allows for detailed auditing and modification of each decision step, ensuring reproducibility and transparency of results.
Novelty
This study is the first to use LLMs for generating executable metaphor identification rule scripts and to conduct cross-protocol comparisons in the Chinese context. This approach not only improves identification accuracy but also enhances the interpretability and reproducibility of results.
Limitations
- Limitation 1: Because Chinese lacks morphological markers, metaphor identification relies almost entirely on context and world knowledge, which increases task difficulty.
- Limitation 2: Differences between protocols may lead to inconsistent identification results, especially when dealing with complex metaphor structures.
- Limitation 3: Although rule scripts provide transparency, the LLMs they rely on may introduce biases, particularly when handling unseen corpora.
Future Work
Future research could explore better integration of multiple protocols to improve the accuracy and consistency of metaphor identification. Additionally, developing richer Chinese metaphor annotation resources and enhancing LLMs' contextual understanding capabilities are important research directions.
AI Executive Summary
Metaphor identification is a foundational task in figurative language processing, yet most computational approaches operate as opaque classifiers offering no insight into why an expression is judged metaphorical. This interpretability gap is especially acute for Chinese, where rich figurative traditions, absent morphological cues, and limited annotated resources compound the challenge.
This paper presents an LLM-assisted pipeline that operationalizes four metaphor identification protocols—MIP/MIPVU lexical analysis, CMDAG conceptual mapping annotation, emotion-based detection, and simile-oriented identification—as executable, human-auditable rule scripts. Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision.
We evaluate on seven Chinese metaphor datasets spanning token-, sentence-, and span-level annotation, establishing the first cross-protocol comparison for Chinese metaphor identification. Within-protocol evaluation shows Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, while cross-protocol analysis reveals striking divergence: pairwise Cohen's kappa between Protocols A and D is merely 0.001, whereas Protocols B and C exhibit near-perfect agreement (kappa = 0.986).
An interpretability audit shows all protocols achieve 100% deterministic reproducibility, with rationale correctness from 0.40 to 0.87 and editability from 0.80 to 1.00. Error analysis identifies conceptual-domain mismatch and register sensitivity as dominant failure modes.
Our results demonstrate that protocol choice is the single largest source of variation in metaphor identification, exceeding model-level variation, and that rule-script architectures achieve competitive performance while maintaining full transparency. We release our codebase, protocol implementations, and evaluation scripts to support reproducible research on interpretable figurative language processing.
Deep Analysis
Background
Metaphor pervades human language, structuring how we reason about abstract concepts through concrete experience. Automatic metaphor identification, the task of determining whether a given linguistic expression is used metaphorically, has received sustained attention in computational linguistics, driven by applications in sentiment analysis, machine translation, and discourse understanding.
Despite considerable progress, the field faces a persistent challenge. State-of-the-art neural classifiers based on pre-trained language models achieve strong performance on benchmark datasets but provide no structured explanation for their decisions. A model may correctly flag a token as metaphorical yet offer no insight into the conceptual mapping, basic-meaning contrast, or figurative mechanism that justifies the label. This opacity limits both scientific understanding of what these models learn and practical deployment in educational or annotation-support settings where users need to know.
In Chinese, this interpretability problem is compounded by several challenges. First, Chinese lacks the morphological inflections and derivational patterns that provide surface cues for metaphor in Indo-European languages; the distinction between literal and figurative senses must be resolved almost entirely through context and world knowledge. Second, Chinese figurative language encompasses diverse phenomena, including conceptual metaphor, simile, metonymy, and culture-specific figures of speech, that do not map neatly onto annotation frameworks developed for English. Third, annotated resources for Chinese metaphor remain relatively scarce and fragmented across incompatible annotation schemes.
Core Problem
The core problem in metaphor identification is the tension between performance and interpretability: state-of-the-art neural classifiers perform well on benchmarks but offer no transparency into their decision-making, a gap that is particularly pronounced for Chinese.
The complexity of Chinese lies in its lack of morphological markers, making metaphor identification almost entirely reliant on context and world knowledge. Additionally, Chinese figurative language phenomena are diverse, encompassing conceptual metaphor, simile, metonymy, etc., which do not easily correspond to annotation frameworks developed for English. The scarcity and incompatibility of annotated resources further exacerbate the difficulty of identification.
Therefore, achieving high performance while providing interpretability in metaphor identification remains a pressing challenge.
Innovation
The core innovation of this paper is the introduction of an LLM-assisted pipeline for metaphor identification, operationalizing four protocols as executable rule scripts. This approach not only improves identification accuracy but also enhances the interpretability and reproducibility of results.
- LLM-assisted rule script generation: Using LLMs to generate executable metaphor identification rule scripts achieves full transparency of the identification process.
- Cross-protocol comparison: Conducting the first cross-protocol comparison in Chinese metaphor identification reveals that protocol choice is the largest source of variation.
- Modular design: Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision.
Methodology
The methodology of this paper consists of the following steps:
- Preprocessing: Text is segmented, POS-tagged, and normalized to ensure consistency.
- Candidate Selection: Depending on the protocol, analysis targets are selected. For example, Protocol A selects all content words, while Protocol B selects sentences containing potential cross-domain expressions.
- Semantic Analysis: LLMs retrieve and contrast contextual and basic meanings, or evaluate emotional valence.
- Classification Decision: Binary or multi-class decisions are made according to protocol criteria. For example, Protocol A labels a token as metaphorical if its contextual meaning contrasts with its basic meaning.
- Rationale Generation: A structured explanation is generated for each decision, including the protocol step that triggered it, the key evidence, and a confidence indicator.
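The steps above can be sketched as a minimal rule script. Everything here is illustrative (the names `mip_protocol`, `Rationale`, and the sense strings are not from the paper), and the LLM call is injected as a plain function, stubbed with a deterministic lookup, so that every other step in the chain remains deterministic and auditable:

```python
# Minimal sketch of one rule-script protocol (hypothetical names; not the
# paper's actual implementation). The LLM call is passed in as a function so
# every other step stays deterministic and auditable.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rationale:
    step: str        # protocol step that triggered the decision
    evidence: str    # key evidence (e.g. contrasting senses)
    confidence: str  # coarse confidence indicator

@dataclass
class Decision:
    token: str
    metaphorical: bool
    rationale: Rationale

def mip_protocol(tokens: list[tuple[str, str]],
                 ask_llm: Callable[[str], str]) -> list[Decision]:
    """Protocol A (MIP) sketch: flag content words whose contextual
    meaning contrasts with their basic meaning."""
    decisions = []
    for word, pos in tokens:
        if pos not in {"NOUN", "VERB", "ADJ", "ADV"}:  # candidate selection
            continue
        contextual = ask_llm(f"Contextual meaning of '{word}': ...")
        basic = ask_llm(f"Basic (most concrete) meaning of '{word}': ...")
        contrast = contextual != basic                 # MIP decision criterion
        decisions.append(Decision(
            token=word,
            metaphorical=contrast,
            rationale=Rationale(
                step="basic-vs-contextual contrast",
                evidence=f"contextual={contextual!r}, basic={basic!r}",
                confidence="high" if contrast else "low",
            ),
        ))
    return decisions

# Deterministic stub standing in for a temperature-0 LLM call.
senses = {"Contextual meaning of '淹没': ...": "overwhelm (abstract)",
          "Basic (most concrete) meaning of '淹没': ...": "submerge in water"}
out = mip_protocol([("淹没", "VERB"), ("的", "PART")],
                   lambda p: senses.get(p, "unknown"))
```

Because the only non-deterministic component is isolated behind `ask_llm`, a reviewer can replay, inspect, or edit any individual step without touching the rest of the chain.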
Experiments
The experimental design includes evaluation on seven Chinese metaphor datasets, covering token-, sentence-, and span-level annotation. The benchmark datasets used include PSU CMC, CMC, CMDAG, Chinese Simile, NLPCC 2024 T9, ConFiguRe, and ChineseMCorpus.
Each protocol is evaluated on its most closely aligned dataset using standard train/dev/test splits. For cross-protocol evaluation, all four protocols are applied to a common subset of PSU CMC, converted to sentence-level labels.
Experiments use GPT-4 as the underlying LLM, with temperature set to 0 to maximize determinism. Evaluation uses standard metrics: precision, recall, and F1 score for binary classification; partial-match F1 for span extraction.
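For concreteness, the binary-classification metrics reduce to the standard definitions over true positives, false positives, and false negatives; this is a generic sketch, not the paper's evaluation code:

```python
# Standard precision/recall/F1 for binary metaphor labels
# (illustrative; not the paper's evaluation script).
def prf1(gold: list[int], pred: list[int]) -> tuple[float, float, float]:
    tp = sum(g == p == 1 for g, p in zip(gold, pred))   # true positives
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# tp=2, fp=1, fn=1 -> precision = recall = f1 = 2/3
p, r, f = prf1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```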
Results
Experimental results show that Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, with an overall accuracy of 0.898. Register-level analysis reveals that academic prose performs best with an F1 of 0.598.
Protocol B (CMDAG) achieves an F1 of 0.347 on sentence-level identification, with high precision but low recall: the metaphors it flags are largely correct, but it misses many metaphorical sentences.
Protocol C (Emotion) mirrors Protocol B's performance, suggesting that emotion-based and conceptual-mapping approaches capture a similar subset of metaphors in this dataset. Protocol D (Simile) achieves an F1 of 0.392 in binary classification, showing moderate precision.
Applications
This method can be directly applied to Chinese text metaphor identification, especially in scenarios requiring high transparency and interpretability, such as educational and annotation support environments.
Additionally, the method can be used to develop more complex figurative language processing systems, integrating multiple protocols to improve identification accuracy and consistency.
In industrial applications, this method can be used in sentiment analysis, machine translation, and discourse understanding, providing more accurate and interpretable results.
Limitations & Outlook
Despite its competitive performance in metaphor identification, the method has several limitations. First, because Chinese lacks morphological markers, identification relies almost entirely on context and world knowledge, which increases task difficulty.
Second, differences between protocols may lead to inconsistent identification results, especially for complex metaphor structures.
Finally, although rule scripts provide transparency, the LLMs they rely on may introduce biases, particularly when handling unseen corpora. Future research could explore better integration of multiple protocols to improve the accuracy and consistency of metaphor identification.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen, cooking a meal. You have four different recipes, each with its own steps and requirements. To make a delicious dish, you need to choose one recipe and follow the steps one by one.
During this process, you might encounter some issues, like missing ingredients or unclear steps. At this point, you can turn to an experienced chef who will provide advice and guidance based on your needs, helping you solve the problem.
In metaphor identification, we're like choosing and executing different recipes. Each protocol is like a recipe, with its own steps and requirements. We use a large language model (LLM) as our 'chef' to provide advice and guidance when needed.
This way, we can better understand and identify metaphors in text, just like making a delicious dish. Each step is transparent, and we can see how each decision is made and adjust as needed.
ELI14 (Explained like you're 14)
Hey there! Let's talk about something cool: metaphor identification. Imagine you're playing a game with four different characters, each with their own skills and missions.
In this game, you need to choose a character and complete missions. Each character has its own strengths, like some are good at attacking, while others are good at defending. To win the game, you need to choose the right character based on the situation.
In metaphor identification, we're like choosing different characters. Each protocol is like a character, with its own skills and missions. We use a large language model (LLM) as our 'game guide' to provide advice and guidance when needed.
This way, we can better understand and identify metaphors in text, just like winning a game. Each step is transparent, and we can see how each decision is made and adjust as needed. Isn't that cool?
Glossary
Metaphor Identification
Metaphor identification is the task of determining whether a given linguistic expression is used metaphorically.
In this paper, metaphor identification is achieved through four protocols.
Large Language Model (LLM)
A large language model is a deep learning-based model capable of understanding and generating natural language.
This paper uses LLMs to assist in generating metaphor identification rule scripts.
MIP/MIPVU
MIP/MIPVU is a metaphor identification protocol that identifies metaphors through lexical analysis.
Protocol A uses MIP/MIPVU for token-level metaphor identification.
CMDAG
CMDAG is a metaphor identification protocol that identifies metaphors through conceptual mapping annotation.
Protocol B uses CMDAG for sentence-level metaphor identification.
Emotion-Based Detection
Emotion-based detection is a metaphor identification protocol that identifies metaphors through affective incongruity.
Protocol C uses emotion-based detection for metaphor identification.
Simile-Oriented Identification
Simile-oriented identification is a metaphor identification protocol that identifies metaphors through explicit comparison markers.
Protocol D uses simile-oriented identification for metaphor identification.
Rule Script
A rule script is an executable program used to implement metaphor identification protocols.
The metaphor identification method proposed in this paper uses rule scripts for implementation.
Cross-Protocol Comparison
Cross-protocol comparison refers to comparing the performance of different metaphor identification protocols on the same dataset.
This paper conducts the first cross-protocol comparison in Chinese metaphor identification.
Cohen's kappa
Cohen's kappa is a statistical measure used to assess the agreement between classifiers.
This paper uses Cohen's kappa to evaluate the agreement between protocols.
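The cross-protocol agreement figures (e.g. kappa = 0.001 between Protocols A and D, 0.986 between B and C) follow the standard two-rater formula, which a short sketch makes concrete (illustrative code, not the paper's):

```python
# Cohen's kappa for two binary labelers: observed agreement corrected
# for chance agreement (standard formula; illustrative only).
def cohens_kappa(a: list[int], b: list[int]) -> float:
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    pa1 = sum(a) / n                                  # P(rater a labels 1)
    pb1 = sum(b) / n                                  # P(rater b labels 1)
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)            # chance agreement
    return (po - pe) / (1 - pe) if pe != 1 else 1.0

# 4/6 observed agreement, 0.5 expected by chance -> kappa = 1/3
k = cohens_kappa([1, 0, 1, 1, 0, 0], [1, 0, 1, 0, 0, 1])
```

A kappa near 0, as between Protocols A and D, means the two protocols agree no more often than chance, i.e. they are effectively measuring different phenomena.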
Interpretability
Interpretability refers to whether the decision-making process of a model or algorithm is transparent and understandable.
The method in this paper achieves high interpretability through rule scripts.
Open Questions (Unanswered questions from this research)
- Open Question 1: How can we improve the accuracy and consistency of metaphor identification without increasing computational complexity? Current methods perform poorly on complex metaphor structures, requiring more efficient algorithms.
- Open Question 2: How can we develop richer Chinese metaphor annotation resources to support more comprehensive metaphor identification research? Existing annotation resources are scarce and mutually incompatible, limiting the breadth and depth of research.
- Open Question 3: How can we improve LLMs' contextual understanding to better support metaphor identification? Current LLMs may introduce biases when handling unseen corpora.
- Open Question 4: How can we better integrate multiple protocols to improve the accuracy and consistency of metaphor identification? Existing protocols perform poorly on complex metaphor structures, requiring more effective combination strategies.
- Open Question 5: How can we improve metaphor identification performance while maintaining high transparency? Rule-script methods, although transparent, may not match end-to-end classifiers in accuracy.
Applications
Immediate Applications
Educational Field
This method can be used in metaphor identification teaching in the educational field, helping students better understand and analyze metaphors in text.
Sentiment Analysis
In sentiment analysis, this method can provide more accurate and interpretable results, helping businesses better understand customer feedback.
Machine Translation
In machine translation, this method can improve translation accuracy, especially when dealing with figurative language.
Long-term Vision
Cross-Language Metaphor Identification
This method can be extended to other languages, achieving cross-language metaphor identification and promoting multilingual text analysis.
Intelligent Text Analysis Systems
This method can be used to develop intelligent text analysis systems, providing more comprehensive and in-depth text understanding capabilities.
Abstract
Metaphor identification is a foundational task in figurative language processing, yet most computational approaches operate as opaque classifiers offering no insight into why an expression is judged metaphorical. This interpretability gap is especially acute for Chinese, where rich figurative traditions, absent morphological cues, and limited annotated resources compound the challenge. We present an LLM-assisted pipeline that operationalises four metaphor identification protocols--MIP/MIPVU lexical analysis, CMDAG conceptual-mapping annotation, emotion-based detection, and simile-oriented identification--as executable, human-auditable rule scripts. Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision. We evaluate on seven Chinese metaphor datasets spanning token-, sentence-, and span-level annotation, establishing the first cross-protocol comparison for Chinese metaphor identification. Within-protocol evaluation shows Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, while cross-protocol analysis reveals striking divergence: pairwise Cohen's kappa between Protocols A and D is merely 0.001, whereas Protocols B and C exhibit near-perfect agreement (kappa = 0.986). An interpretability audit shows all protocols achieve 100% deterministic reproducibility, with rationale correctness from 0.40 to 0.87 and editability from 0.80 to 1.00. Error analysis identifies conceptual-domain mismatch and register sensitivity as dominant failure modes. Our results demonstrate that protocol choice is the single largest source of variation in metaphor identification, exceeding model-level variation, and that rule-script architectures achieve competitive performance while maintaining full transparency.
References (20)
ConFiguRe: Exploring Discourse-level Chinese Figures of Speech
Dawei Zhu, Qiusi Zhan, Zhejian Zhou et al.
Neural Multitask Learning for Simile Recognition
Lizhen Liu, Xiao Hu, Wei Song et al.
The measurement of observer agreement for categorical data.
J. Landis, G. Koch
MelBERT: Metaphor Detection via Contextualized Late Interaction using Metaphorical Identification Theories
Minjin Choi, Sunkyung Lee, Eunseong Choi et al.
Semantic classifications for detection of verb metaphors
Beata Beigman Klebanov, C. W. Leong, E. Gutiérrez et al.
Metaphor Detection with Cross-Lingual Model Transfer
Yulia Tsvetkov, Leonid Boytsov, A. Gershman et al.
Metaphor Detection with Effective Context Denoising
Shunyu Wang, Yucheng Li, Chenghua Lin et al.
A Report on the 2018 VUA Metaphor Detection Shared Task
C. W. Leong, Beata Beigman Klebanov, Ekaterina Shutova
CMDAG: A Chinese Metaphor Dataset with Annotated Grounds as CoT for Boosting Metaphor Generation
Yujie Shao, Xinrong Yao, Xingwei Qu et al.
A method for linguistic metaphor identification: From MIP to MIPVU
G. Steen
Metaphor: A Practical Introduction
Z. Kövecses, R. Benczes
MIP: A method for identifying metaphorically used words in discourse
G. Steen, L. Cameron, A. Cienki et al.
Models of Metaphor in NLP
Ekaterina Shutova
Metaphor Interpretation as Embodied Simulation
R. Gibbs
PAL: Program-aided Language Models
Luyu Gao, Aman Madaan, Shuyan Zhou et al.
DeepMet: A Reading Comprehension Paradigm for Token-level Metaphor Detection
Chuandong Su, F. Fukumoto, Xiaoxi Huang et al.
Explainable Metaphor Identification Inspired by Conceptual Metaphor Theory
Mengshi Ge, Rui Mao, E. Cambria
Pre-Training with Whole Word Masking for Chinese BERT
Yiming Cui, Wanxiang Che, Ting Liu et al.
A Report on the 2020 VUA and TOEFL Metaphor Detection Shared Task
C. W. Leong, Beata Beigman Klebanov, Chris Hamill et al.
Verb Metaphor Detection via Contextual Relation Learning
Wei Song, Shuhui Zhou, Ruiji Fu et al.