From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text

TL;DR

A dual-aspect evaluation framework analyzes LLMs on Vietnamese legal text, revealing readability-accuracy trade-offs.

cs.CL 🔴 Advanced 2026-04-18
Van-Truong Le
Large Language Models Legal Reasoning Error Analysis Text Simplification Vietnamese Law

Key Findings

Methodology

This paper introduces a dual-aspect evaluation framework combining quantitative benchmarking and qualitative error analysis. First, it establishes a performance benchmark for four state-of-the-art large language models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) across three key dimensions: Accuracy, Readability, and Consistency. Second, it conducts a large-scale error analysis on a curated dataset of 60 complex Vietnamese legal articles using a novel, expert-validated error typology to understand the reasons behind these performance scores.

Key Results

  • Result 1: Grok-1 excels in Readability and Consistency but compromises on fine-grained legal Accuracy. Claude 3 Opus achieves high Accuracy scores that mask a significant number of subtle but critical reasoning errors.
  • Result 2: Error analysis identifies Incorrect Example and Misinterpretation as the most prevalent failures, indicating that the primary challenge for current LLMs is controlled, accurate legal reasoning rather than summarization.
  • Result 3: By integrating a quantitative benchmark with a qualitative deep dive, the study provides a holistic and actionable assessment of LLMs for legal applications.

Significance

This research provides crucial insights into the performance trade-offs of large language models when handling complex legal texts, offering guidance for legal AI applications. It not only aids in model selection but also points the way for model improvements, particularly in enhancing legal reasoning capabilities. By identifying and categorizing error types, the study offers concrete suggestions for future model development.

Technical Contribution

The paper's technical contribution lies in proposing a dual-aspect evaluation framework that combines quantitative benchmarking with qualitative error analysis. This approach not only reveals performance differences in legal text processing but also provides an in-depth understanding of model failure modes. By introducing an expert-validated error typology, the research offers a new perspective for evaluating and improving legal AI.

Novelty

This study is the first to apply a dual-aspect evaluation framework to the assessment of large language models on Vietnamese legal texts. Unlike previous studies that focus primarily on surface performance, this paper delves into the reasoning errors of models, revealing their systemic weaknesses in legal reasoning.

Limitations

  • Limitation 1: The dataset size is relatively small, consisting of only 60 legal articles, which may not fully reflect model performance across other legal domains.
  • Limitation 2: Error annotation relies on law students who, despite rigorous training, may lack the practical experience of professional legal experts.
  • Limitation 3: The experimental design is limited to a zero-shot setting, not considering other techniques that might improve performance.

Future Work

Future research could expand the dataset size to cover more legal domains, further validating model generalizability. Additionally, exploring techniques like few-shot learning or chain-of-thought prompting could improve reasoning capabilities. The study should also include open-weight models to enhance reproducibility and investigate the correlation between training data transparency and legal reasoning performance.

AI Executive Summary

The complexity of Vietnam's legal texts presents a significant barrier to public access to justice. While Large Language Models (LLMs) offer a promising solution for legal text simplification, evaluating their true capabilities requires a multifaceted approach that goes beyond surface-level metrics. This paper introduces a comprehensive dual-aspect evaluation framework to address this need. First, we establish a performance benchmark for four state-of-the-art large language models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) across three key dimensions: Accuracy, Readability, and Consistency. Second, to understand the 'why' behind these performance scores, we conduct a large-scale error analysis on a curated dataset of 60 complex Vietnamese legal articles, using a novel, expert-validated error typology. Our results reveal a crucial trade-off: models like Grok-1 excel in Readability and Consistency but compromise on fine-grained legal Accuracy, while models like Claude 3 Opus achieve high Accuracy scores that mask a significant number of subtle but critical reasoning errors. The error analysis pinpoints Incorrect Example and Misinterpretation as the most prevalent failures, confirming that the primary challenge for current LLMs is not summarization but controlled, accurate legal reasoning. By integrating a quantitative benchmark with a qualitative deep dive, our work provides a holistic and actionable assessment of LLMs for legal applications.


Deep Analysis

Background

In Vietnam, the complexity of legal texts and the use of specialized terminology make it difficult for ordinary citizens to understand and access legal information. This situation is particularly pronounced in civil law systems, where legal provisions are often expressed in complex legal language and structures, hindering public understanding of their fundamental rights and obligations. In recent years, the advent of Large Language Models (LLMs) has offered new possibilities for simplifying legal texts. By translating complex legal provisions into more understandable language, LLMs have the potential to lower the barrier to public access to legal information. However, this potential is accompanied by the risk of generating fluent but inaccurate legal simplifications. Therefore, evaluating the capabilities of LLMs in legal text processing has become crucial. Existing research primarily focuses on surface performance metrics such as legal accuracy, user-perceived readability, and output consistency, but these metrics fail to explain the reasons behind model performance.

Core Problem

The complexity of Vietnam's legal texts presents a significant barrier to public access to justice. While LLMs offer a promising solution for legal text simplification, evaluating their true capabilities requires a multifaceted approach that goes beyond surface-level metrics. Existing research primarily focuses on surface performance metrics such as legal accuracy, user-perceived readability, and output consistency, but these metrics fail to explain the reasons behind model performance. A model might achieve a high accuracy score by correctly summarizing the general rule for inheritance, yet completely miss a critical exception for a specific circumstance, a subtle but catastrophic reasoning error that superficial scores would mask.

Innovation

This paper introduces a dual-aspect evaluation framework combining quantitative benchmarking and qualitative error analysis. This approach not only reveals performance differences in legal text processing but also provides an in-depth understanding of model failure modes. By introducing an expert-validated error typology, the research offers a new perspective for evaluating and improving legal AI. Unlike previous studies that focus primarily on surface performance, this paper delves into the reasoning errors of models, revealing their systemic weaknesses in legal reasoning.

Methodology

  • Establish performance benchmark: Evaluate four state-of-the-art large language models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) across three dimensions: Accuracy, Readability, and Consistency.

  • Conduct large-scale error analysis: Analyze 60 complex Vietnamese legal articles using a novel, expert-validated error typology.

  • Dataset selection: Choose 20 articles each from the Penal Code 2015, Civil Code 2015, and Land Law 2024 to ensure representativeness and challenge.

  • Task design: Use a zero-shot prompt asking models to act as legal assistants, explaining legal articles and providing practical examples for laypersons.

  • Evaluation metrics: Include Legal Accuracy, Readability, and Consistency, rated by law students and non-expert participants.
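The task setup described above can be sketched as follows. The prompt wording, model identifiers, and job structure are illustrative assumptions for exposition, not the paper's exact implementation:

```python
# Sketch of the zero-shot generation setup (hypothetical prompt wording and
# job structure; the paper does not publish its exact prompt).

MODELS = ["GPT-4o", "Claude 3 Opus", "Gemini 1.5 Pro", "Grok-1"]

PROMPT_TEMPLATE = (
    "You are a legal assistant helping a layperson in Vietnam.\n"
    "Explain the following legal article in plain language and give one "
    "practical example of how it applies.\n\n"
    "Article:\n{article}"
)


def build_prompt(article_text: str) -> str:
    """Fill the zero-shot template; no examples or fine-tuning are used."""
    return PROMPT_TEMPLATE.format(article=article_text)


def build_jobs(articles: list[str], runs_per_article: int = 2) -> list[dict]:
    """One generation job per (model, article, run):
    4 models x 60 articles x 2 runs = 480 outputs."""
    return [
        {"model": m, "article_id": i, "run": r, "prompt": build_prompt(a)}
        for m in MODELS
        for i, a in enumerate(articles)
        for r in range(runs_per_article)
    ]
```

Generating each article twice per model is what makes the Consistency dimension measurable: two runs of the same prompt can be compared directly.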

Experiments

The experiments evaluate four state-of-the-art large language models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) on 60 complex legal articles drawn from the Penal Code 2015, the Civil Code 2015, and the Land Law 2024. Each model generates two outputs per article, yielding a corpus of 480 outputs (60 articles × 4 models × 2 runs). Outputs are rated for Legal Accuracy, Readability, and Consistency by law students and non-expert participants, and a novel expert-validated error typology supports a detailed error analysis of the model outputs.
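Aggregating the human ratings into per-model dimension scores can be sketched as below; the flat rating format is a hypothetical illustration, since the paper does not specify its data layout:

```python
# Hypothetical aggregation of human ratings into per-model means on the
# three dimensions (Accuracy, Readability, Consistency).
from collections import defaultdict
from statistics import mean


def aggregate(ratings: list[dict]) -> dict:
    """ratings: [{'model': ..., 'dimension': ..., 'score': ...}, ...]
    Returns {model: {dimension: mean score}} over all raters and articles."""
    buckets = defaultdict(list)
    for r in ratings:
        buckets[(r["model"], r["dimension"])].append(r["score"])
    out = defaultdict(dict)
    for (model, dim), scores in buckets.items():
        out[model][dim] = mean(scores)
    return dict(out)
```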

Results

The experimental results show that Grok-1 excels in Readability and Consistency but compromises on fine-grained legal Accuracy. Claude 3 Opus achieves high Accuracy scores that mask a significant number of subtle but critical reasoning errors. Error analysis identifies Incorrect Example and Misinterpretation as the most prevalent failures, indicating that the primary challenge for current LLMs is controlled, accurate legal reasoning rather than summarization. By integrating a quantitative benchmark with a qualitative deep dive, the study provides a holistic and actionable assessment of LLMs for legal applications.
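The "most prevalent failures" finding rests on a frequency count over annotated error labels. A minimal sketch of that tally, assuming one label string per annotated error (the annotation format itself is not specified in the source):

```python
# Tallying error-typology labels to find the most prevalent failure modes.
# The label strings come from the paper's typology; the flat-list input
# format is an assumption for illustration.
from collections import Counter


def error_frequencies(annotations: list[str]) -> list[tuple[str, int]]:
    """Count each error label and return (label, count) pairs,
    most common first."""
    return Counter(annotations).most_common()
```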

Applications

The findings of this study can be directly applied to the simplification of legal texts and public access to legal information. By identifying and categorizing error types, the study offers concrete suggestions for future model development. Additionally, the results can be used to evaluate and improve the safety and reliability of existing legal AI applications, particularly in civil law systems like Vietnam.

Limitations & Outlook

Although the dual-aspect evaluation framework provides deep insight into LLM capabilities on legal text, the study has several limitations. First, the dataset is relatively small, consisting of only 60 legal articles, which may not fully reflect model performance across other legal domains. Second, error annotation relies on law students who, despite rigorous training, may lack the practical experience of professional legal experts. Additionally, the experimental design is limited to a zero-shot setting and does not consider other techniques that might improve performance. Future research could expand the dataset, cover more legal domains, and explore techniques like few-shot learning or chain-of-thought prompting.

Plain Language (accessible to non-experts)

Imagine you're in a complex maze, with legal texts written on the walls, and you need to find a path to the exit. Large Language Models are like your guides, helping you understand these complex legal texts and pointing you in the right direction. However, sometimes these guides might take a wrong turn, leading you to a dead end. This is similar to the errors models might make when processing legal texts. To ensure these guides can accurately lead you out of the maze, researchers have designed a new evaluation method. They not only focus on whether the guides can quickly find the exit (i.e., model performance) but also analyze why the guides might take wrong turns (i.e., error analysis). Through this method, they hope to improve the guides' abilities, enabling them to better help you navigate the maze and understand legal texts in the future.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a super complex puzzle game, with lots of levels, and each level has these hard-to-understand legal texts. You need a super-smart assistant to help you solve these puzzles, right? That's what Large Language Models do! They're like NPCs in the game, helping you turn those complex legal texts into something simple and easy to understand. But sometimes, these NPCs can mess up, like missing important info or misunderstanding the rules. To make these NPCs smarter, scientists have come up with a new way to test them. They not only look at whether these NPCs can solve the puzzles quickly but also analyze why they might mess up. This way, they can find ways to improve the NPCs, making them perform better in future games! Isn't that cool?

Glossary

Large Language Model

A large language model is an AI model trained on vast amounts of text data, capable of generating and understanding natural language text.

In this paper, large language models are used to simplify and explain complex legal texts.

Legal Reasoning

Legal reasoning refers to the process of logical analysis and judgment in a legal context, often involving the interpretation and application of legal provisions.

The paper analyzes models' reasoning errors to reveal their systemic weaknesses in legal reasoning.

Error Analysis

Error analysis is an evaluation method that identifies and categorizes errors in model outputs to help improve model performance.

The paper uses error analysis to reveal failure modes in legal text processing.

Text Simplification

Text simplification is the process of converting complex text into a more understandable form, often to improve accessibility.

The paper explores the capabilities and challenges of LLMs in legal text simplification.

Expert Validation

Expert validation involves the assessment of research methods or results by domain experts to confirm their validity.

The paper uses expert validation to ensure the accuracy of the error typology.

Zero-shot Learning

Zero-shot learning asks a model to perform a task using only the instructions in the prompt, without task-specific examples or fine-tuning.

The paper uses zero-shot learning to evaluate the models' capabilities in legal text processing.

Consistency

Consistency refers to the stability and reliability of a model's output across multiple runs.

The paper evaluates models' consistency to assess their stability in legal text processing.
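The paper does not specify how consistency between the two runs per article is scored; as one simple proxy under that assumption, token-level Jaccard similarity between the two generations could be computed:

```python
# Illustrative consistency proxy (NOT the paper's metric): token-level
# Jaccard similarity between two generations for the same article.
def jaccard_consistency(run_a: str, run_b: str) -> float:
    """Overlap of word sets between two runs; 1.0 = identical vocabulary."""
    a, b = set(run_a.lower().split()), set(run_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A set-overlap proxy ignores word order and paraphrase, so it rewards stable vocabulary rather than stable meaning; human raters, as used in the paper, capture the latter.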

Readability

Readability refers to the ease with which a text can be read and understood by the target audience.

The paper evaluates the readability of model outputs to assess their performance in legal text simplification.

Legal Accuracy

Legal accuracy refers to the correctness and completeness of a model's output in terms of legal content.

The paper evaluates models' legal accuracy to assess their performance in legal text processing.

Incorrect Example

An incorrect example is when a model provides an example that does not match the legal provision or draws a legally incorrect conclusion.

The paper analyzes incorrect examples to reveal systemic weaknesses in legal reasoning.

Open Questions Unanswered questions from this research

  • 1 Current large language models still exhibit systematic reasoning errors when processing complex legal texts. Although models can generate fluent text, they perform poorly when applying legal principles to novel scenarios. This indicates a fundamental gap between linguistic competence and abstract reasoning capabilities.
  • 2 Existing evaluation methods primarily focus on surface performance metrics like accuracy and readability, failing to delve into the reasons behind model performance. A comprehensive evaluation method combining quantitative benchmarking and qualitative error analysis is needed to reveal systemic weaknesses in models.
  • 3 While large language models perform well in legal text simplification, they often fail in generating specific examples. This suggests that models' reasoning capabilities in generative tasks still need improvement.
  • 4 Current research mainly focuses on English legal texts, with relatively few studies on other languages and legal systems. More research is needed on non-English languages and civil law systems to improve model generalizability and applicability.
  • 5 Existing legal AI applications emphasize factual correctness but may overlook subtle yet critical reasoning errors in generative tasks. A new error typology is needed to capture these unique error types.

Applications

Immediate Applications

Legal Text Simplification

Simplify complex legal texts using large language models to improve public access to and understanding of legal information.

Legal Education Assistance

Provide tools for law students and practitioners to better understand and apply legal provisions.

Legal Information Retrieval

Enhance the efficiency and accuracy of legal information retrieval using large language models, supporting legal research and practice.

Long-term Vision

Intelligent Legal Assistant

Develop intelligent assistants capable of providing accurate legal advice to help the public resolve legal issues.

Safety of Legal AI Systems

Improve the reasoning capabilities of models to enhance the safety and reliability of legal AI systems in public services.

Abstract

The complexity of Vietnam's legal texts presents a significant barrier to public access to justice. While Large Language Models offer a promising solution for legal text simplification, evaluating their true capabilities requires a multifaceted approach that goes beyond surface-level metrics. This paper introduces a comprehensive dual-aspect evaluation framework to address this need. First, we establish a performance benchmark for four state-of-the-art large language models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) across three key dimensions: Accuracy, Readability, and Consistency. Second, to understand the "why" behind these performance scores, we conduct a large-scale error analysis on a curated dataset of 60 complex Vietnamese legal articles, using a novel, expert-validated error typology. Our results reveal a crucial trade-off: models like Grok-1 excel in Readability and Consistency but compromise on fine-grained legal Accuracy, while models like Claude 3 Opus achieve high Accuracy scores that mask a significant number of subtle but critical reasoning errors. The error analysis pinpoints Incorrect Example and Misinterpretation as the most prevalent failures, confirming that the primary challenge for current LLMs is not summarization but controlled, accurate legal reasoning. By integrating a quantitative benchmark with a qualitative deep dive, our work provides a holistic and actionable assessment of LLMs for legal applications.

