ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation
ESG-Bench is a benchmark for long-context ESG report analysis that, combined with task-specific Chain-of-Thought prompting strategies, significantly reduces hallucinations.
Key Findings
Methodology
ESG-Bench frames ESG report analysis as a QA task with verifiability constraints, enabling systematic evaluation of LLMs' ability to extract and reason over ESG content. Task-specific Chain-of-Thought (CoT) prompting strategies and CoT-annotated reasoning paths are used to fine-tune multiple state-of-the-art LLMs, significantly reducing hallucinations.
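To make the "QA task with verifiability constraints" framing concrete, here is a minimal sketch of what one benchmark record might look like. The field names and checks are illustrative assumptions, not the paper's actual schema: the key idea is that every answer must be backed by evidence spans that verifiably occur in the report excerpt.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a verifiability-constrained QA record; field names
# are illustrative and not taken from the ESG-Bench paper itself.
@dataclass
class ESGQARecord:
    question: str
    context: str                        # excerpt from the ESG report
    answer: str                         # gold answer ("" if unanswerable)
    evidence_spans: list = field(default_factory=list)  # (start, end) offsets

    def is_answerable(self) -> bool:
        """A record counts as answerable only when gold evidence exists."""
        return bool(self.evidence_spans)

    def evidence_is_verifiable(self) -> bool:
        """Every cited span must lie inside the source context."""
        return all(0 <= s < e <= len(self.context)
                   for s, e in self.evidence_spans)

record = ESGQARecord(
    question="What was the company's Scope 1 emissions reduction target?",
    context="We commit to reducing Scope 1 emissions by 42% by 2030.",
    answer="42% by 2030",
    evidence_spans=[(43, 54)],
)
```

An answer whose spans fail `evidence_is_verifiable` would be flagged rather than scored, which is what distinguishes this setup from ordinary open-ended QA.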
Key Results
- On ESG-Bench, models using CoT strategies significantly outperform standard prompting and direct fine-tuning, reducing hallucinations by over 30%.
- Experiments demonstrate that CoT strategies are effective not only in the ESG domain but also transferable to other QA benchmarks like HaluEval and BioASQ, showing higher accuracy and reliability.
- Comparative analysis of different fine-tuning strategies reveals that CoT fine-tuning enhances reasoning consistency and factual accuracy in long-text contexts.
Significance
The introduction of ESG-Bench provides a systematic framework for ESG report analysis and hallucination mitigation, particularly in socially sensitive and compliance-critical environments. This research offers new perspectives on the reliability of LLMs when handling complex long texts and lays the groundwork for future compliance-analysis tools.
Technical Contribution
Technical contributions include the first framing of ESG report analysis as a QA task with verifiability constraints and the introduction of Chain-of-Thought prompting strategies to reduce hallucinations. This approach offers a new structured strategy for reasoning in long-text contexts, significantly improving models' factual consistency and reasoning transparency.
Novelty
ESG-Bench is the first benchmark specifically designed for long-context ESG report QA, providing human-verified hallucination annotations and tasks. The novelty lies in applying Chain-of-Thought strategies to long-text analysis, significantly reducing hallucinations.
Limitations
- ESG-Bench currently focuses on English ESG reports and does not cover multilingual or cross-cultural ESG report analysis.
- Due to the complexity and diversity of ESG reports, models may still have limitations when handling reports from specific industries or domains.
- The current Chain-of-Thought strategy may perform suboptimally in extremely long texts or highly complex reports.
Future Work
Future research directions include expanding ESG-Bench to cover multilingual and cross-cultural ESG reports, developing more general hallucination mitigation strategies, and exploring how to enhance models' reasoning capabilities in extremely long texts.
AI Executive Summary
As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requirement in many regions and a key channel for documenting sustainability practices and assessing firms' long-term and ethical performance. However, the length and complexity of ESG disclosures make them difficult to interpret and automate the analysis reliably. To support scalable and trustworthy analysis, this paper introduces ESG-Bench, a benchmark dataset for ESG report understanding and hallucination mitigation in large language models (LLMs). ESG-Bench contains human-annotated question-answer (QA) pairs grounded in real-world ESG report contexts, with fine-grained labels indicating whether model outputs are factually supported or hallucinated. Framing ESG report analysis as a QA task with verifiability constraints enables systematic evaluation of LLMs' ability to extract and reason over ESG content and provides a new use case: mitigating hallucinations in socially sensitive, compliance-critical settings. We design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Our experiments show that these CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations, and that the gains transfer to existing QA benchmarks beyond the ESG domain.
Accurate and trustworthy ESG (Environmental, Social, and Governance) reporting is increasingly essential for sustainable development, regulatory accountability, and ethical corporate conduct. ESG provides a framework for assessing how companies manage sustainability-related risks across environmental, social, and governance pillars. Once largely voluntary, ESG disclosure has become a legal requirement in many regions, most notably through EU regulations such as the Corporate Sustainability Reporting Directive and the Sustainable Finance Disclosure Regulation. This shift reflects growing expectations for transparency in corporate impacts on society and the environment. ESG reporting therefore plays a critical role in enabling compliance and supporting stakeholders' evaluation of long-term performance.
Corporations now publish extensive ESG reports for investors, regulators, and the public. However, the usefulness of these disclosures depends on their credibility and comparability. Third-party ESG rating agencies such as Sustainalytics and MSCI have been widely criticized for methodological opacity and inconsistency, with studies showing that their scores often diverge substantially even for the same company due to differences in indicator selection, weighting schemes, and data sources. These controversies undermine stakeholder trust and highlight that ESG assessments are far from standardized. Combined with the growing length and complexity of sustainability reports, this inconsistency increases the need for scalable, transparent tools that can support reliable and evidence-grounded interpretation.
The emergence of large language models (LLMs) offers new opportunities for automating the analysis of ESG disclosures at scale. However, the complexity and diversity of ESG reports pose significant challenges for reliable LLM deployment. Companies may engage in greenwashing to appear more sustainable, misleading investors and stakeholders about their true ESG impact. ESG reports require deep contextual understanding, industry-specific knowledge, and familiarity with regulatory frameworks, barriers that LLMs may struggle with given their reliance on general knowledge. Moreover, ESG reports mix text, tables, and graphics and often span hundreds of pages, while LLMs remain limited in efficient document parsing, robust memory recall, and cross-sectional understanding of lengthy reports.
In this paper, we present ESG-Bench, a benchmark for hallucination-aware ESG question answering. We build the dataset through a model–then–annotator pipeline, establish a taxonomy of hallucination types, evaluate multiple LLMs on ESG-Bench, and propose a task-specific Chain-of-Thought strategy for reducing hallucinations in long-context ESG analysis. Our contributions are summarized below:
• ESG-Bench is a benchmark dataset specifically designed for long-context QA and hallucination mitigation in ESG reporting. To the best of our knowledge, it is the first structured resource that supports both systematic evaluation and targeted mitigation of hallucinations in this socially and regulatorily significant domain.
• We develop a fine-tuning approach based on task-specific CoT prompting and CoT-annotated reasoning traces. This method significantly improves factual grounding and reduces hallucinated outputs, demonstrating the effectiveness of structured reasoning in a domain-specific QA task.
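The CoT-annotated fine-tuning described above can be sketched as follows. This is a hedged illustration of how a training example *could* be serialized, with the reasoning trace placed before the final grounded answer; the prompt wording and JSON-style layout are assumptions, not the paper's exact format.

```python
# Illustrative sketch (not the paper's exact format) of a CoT-annotated
# supervised fine-tuning example: the target contains the reasoning trace
# first, then the final answer grounded in the report excerpt.
def build_cot_example(question, context, rationale_steps, answer):
    prompt = (
        "Answer using ONLY the ESG report excerpt below. "
        "If the excerpt lacks the information, answer 'Not stated'.\n\n"
        f"Excerpt: {context}\nQuestion: {question}\n"
    )
    rationale = "\n".join(f"Step {i}: {s}"
                          for i, s in enumerate(rationale_steps, 1))
    target = f"{rationale}\nAnswer: {answer}"
    return {"prompt": prompt, "target": target}

ex = build_cot_example(
    question="Does the report state a net-zero target year?",
    context="The company aims for net-zero operations by 2040.",
    rationale_steps=[
        "Locate statements about net-zero commitments.",
        "The excerpt states a 2040 target for operations.",
    ],
    answer="Yes, 2040.",
)
```

Training on targets of this shape is what forces the model to justify an answer with intermediate steps before committing to it, rather than emitting an answer directly.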
Deep Analysis
Background
In recent years, as corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting has become a legal requirement in many regions and a key channel for documenting sustainability practices and assessing firms' long-term and ethical performance. ESG provides a framework for assessing how companies manage sustainability-related risks across environmental, social, and governance pillars. Once largely voluntary, ESG disclosure has become a legal requirement in many regions, most notably through EU regulations such as the Corporate Sustainability Reporting Directive and the Sustainable Finance Disclosure Regulation. This shift reflects growing expectations for transparency in corporate impacts on society and the environment. ESG reporting therefore plays a critical role in enabling compliance and supporting stakeholders' evaluation of long-term performance.

Corporations now publish extensive ESG reports for investors, regulators, and the public. However, the usefulness of these disclosures depends on their credibility and comparability. Third-party ESG rating agencies such as Sustainalytics and MSCI have been widely criticized for methodological opacity and inconsistency, with studies showing that their scores often diverge substantially even for the same company due to differences in indicator selection, weighting schemes, and data sources. These controversies undermine stakeholder trust and highlight that ESG assessments are far from standardized. Combined with the growing length and complexity of sustainability reports, this inconsistency increases the need for scalable, transparent tools that can support reliable and evidence-grounded interpretation.
Core Problem
The complexity and diversity of ESG reports pose significant challenges for reliable LLM deployment. Companies may engage in greenwashing to appear more sustainable, misleading investors and stakeholders about their true ESG impact. ESG reports require deep contextual understanding, industry-specific knowledge, and familiarity with regulatory frameworks, barriers that LLMs may struggle with given their reliance on general knowledge. The reports also mix text, tables, and graphics and often span hundreds of pages, while LLMs remain limited in efficient document parsing, robust memory recall, and cross-sectional understanding of lengthy documents. In addition, LLMs rely heavily on parametric knowledge that may conflict with the factual content of ESG reports. This misalignment frequently leads to hallucinations: answers that are not grounded in the source document. We classify hallucinations into two types: (1) unsupported hallucinations, where the model introduces information absent from the source, and (2) omissive hallucinations, where the model fails to answer despite relevant evidence.
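The two-way distinction between unsupported and omissive hallucinations can be sketched as a labelling rule. This is a deliberate simplification: the substring-based support check below stands in for the paper's human-verified annotation, and the abstention phrases are assumed, not taken from the dataset.

```python
# Minimal sketch of the two-way hallucination labelling: the substring
# support check is a simplification of the paper's human verification,
# and the abstention phrases are illustrative assumptions.
def label_response(answer: str, context: str, gold_answerable: bool) -> str:
    abstained = answer.strip().lower() in {"not stated", "i don't know", ""}
    if abstained:
        # Refusing although evidence exists is an omissive hallucination.
        return ("omissive_hallucination" if gold_answerable
                else "correct_abstention")
    supported = answer.strip().lower() in context.lower()
    # Asserting content absent from the source is an unsupported hallucination.
    return "grounded" if supported else "unsupported_hallucination"
```

Under this rule, a made-up figure gets `unsupported_hallucination`, while a needless "Not stated" on an answerable question gets `omissive_hallucination`, matching the taxonomy described above.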
Innovation
ESG-Bench frames ESG report analysis as a QA task with verifiability constraints, enabling systematic evaluation of LLMs' ability to extract and reason over ESG content and providing a new use case: mitigating hallucinations in socially sensitive, compliance-critical settings. We design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Our experiments show that these CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations, and that the gains transfer to existing QA benchmarks beyond the ESG domain.
Methodology
- ESG-Bench dataset construction: built through a model–then–annotator pipeline, establishing a taxonomy of hallucination types.
- Task-specific Chain-of-Thought strategy: designed task-specific CoT prompting strategies and fine-tuned multiple state-of-the-art LLMs using CoT-annotated reasoning paths.
- Systematic evaluation: evaluated multiple LLMs on ESG-Bench, focusing on hallucination mitigation.
- Experimental validation: compared different fine-tuning strategies, validating the effectiveness of CoT strategies in reducing hallucinations.
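The model–then–annotator construction loop in the first bullet above can be sketched in a few lines. The function names and record fields are hypothetical; the point is only the two-stage flow, where a model drafts an answer and a human annotator verifies, labels, and corrects it.

```python
# Hedged sketch of a "model-then-annotator" dataset construction loop;
# function names and record fields are illustrative, not the paper's API.
def build_benchmark(questions, model_answer, annotate):
    dataset = []
    for q in questions:
        draft = model_answer(q)                # stage 1: model drafts an answer
        label, corrected = annotate(q, draft)  # stage 2: human verifies/corrects
        dataset.append({"question": q, "draft": draft,
                        "label": label, "answer": corrected})
    return dataset

# Toy usage with stub callables standing in for the LLM and the annotator.
data = build_benchmark(
    ["q1", "q2"],
    model_answer=lambda q: "draft:" + q,
    annotate=lambda q, d: ("grounded", d.upper()),
)
```

Keeping both the raw draft and the corrected answer is what makes the resulting records usable for hallucination labelling and for fine-tuning at the same time.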
Experiments
The experimental design evaluates multiple large language models on the ESG-Bench dataset, focusing on hallucination mitigation. We selected several state-of-the-art LLMs, including Llama-3.2-3B-Instruct, Gemma-2-2B-it, and Mistral-7B-Instruct-v0.3, and tested them on the ESG-Bench, HaluEval, and BioASQ benchmarks, assessing their ability to generate responses while identifying hallucinations. The evaluation metrics are WA (With Answer) and WoA (Without Answer) accuracy, which jointly assess the models' ability to generate accurate answers and to abstain appropriately when sufficient information is unavailable.
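A minimal sketch of the two accuracies, under the assumption that WA accuracy is measured on answerable questions and WoA accuracy on unanswerable ones (where the correct behaviour is to abstain); the exact matching criterion in the paper may differ.

```python
# Sketch of WA/WoA accuracy, assuming WA scores answerable questions and
# WoA scores unanswerable ones where the model should abstain.
def wa_woa_accuracy(examples):
    wa_correct = wa_total = woa_correct = woa_total = 0
    for ex in examples:
        if ex["answerable"]:
            wa_total += 1
            wa_correct += ex["prediction"] == ex["gold"]
        else:
            woa_total += 1
            woa_correct += ex["prediction"] == "abstain"
    wa = wa_correct / wa_total if wa_total else 0.0
    woa = woa_correct / woa_total if woa_total else 0.0
    return wa, woa

examples = [
    {"answerable": True,  "prediction": "42%",     "gold": "42%"},
    {"answerable": True,  "prediction": "2050",    "gold": "2040"},
    {"answerable": False, "prediction": "abstain", "gold": None},
    {"answerable": False, "prediction": "2030",    "gold": None},
]
wa, woa = wa_woa_accuracy(examples)  # -> (0.5, 0.5)
```

Reporting the two numbers separately is what makes the evaluation fair: a model cannot inflate its score by always answering (hurting WoA) or by always abstaining (hurting WA).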
Results
The experimental results show that on ESG-Bench, models using CoT strategies significantly outperform standard prompting and direct fine-tuning, reducing hallucinations by over 30%. Experiments demonstrate that CoT strategies are effective not only in the ESG domain but also transfer to other QA benchmarks such as HaluEval and BioASQ, showing higher accuracy and reliability. Comparative analysis of different fine-tuning strategies reveals that CoT fine-tuning enhances reasoning consistency and factual accuracy in long-text contexts.
Applications
The application scenarios of ESG-Bench include corporate ESG audits and compliance verification, as well as providing a valuable resource for training summarization models on long ESG documents. Annotator-corrected responses enable fine-tuning of ESG-specific QA models for improved factual grounding, while hallucination labels aid in developing mitigation strategies. The dataset also serves as a benchmarking tool for evaluating answer accuracy, retrieval robustness, and format-specific performance.
Limitations & Outlook
ESG-Bench currently focuses on English ESG reports and does not cover multilingual or cross-cultural ESG report analysis. Given the complexity and diversity of ESG reports, models may still have limitations when handling reports from specific industries or domains, and the current Chain-of-Thought strategy may perform suboptimally on extremely long texts or highly complex reports. Future research directions include expanding ESG-Bench to multilingual and cross-cultural ESG reports, developing more general hallucination mitigation strategies, and exploring how to enhance models' reasoning capabilities over extremely long texts.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen cooking a meal. You have a recipe (ESG report), but it's very long and has many complex steps (complex ESG report). You need an assistant (large language model) to help you understand and execute these steps. However, sometimes the assistant might misunderstand some parts of the recipe, leading to a wrong dish (hallucination). To avoid this, we need a new method (ESG-Bench), which acts like a detailed guidebook, helping the assistant better understand the recipe and ensuring each step is accurate. This method uses something called a Chain-of-Thought strategy, which is like adding annotations between each step to ensure the assistant understands the logic behind it before proceeding. It's like giving the assistant a clear thought process, so it doesn't get lost in the complex recipe and makes a delicious dish. This way, we can ensure the assistant's accuracy and reliability when handling complex recipes.
ELI14 (Explained like you're 14)
Hey there! Imagine you're doing a super long school project (like an ESG report), but it's so long you don't know where to start. So, you get a super smart robot assistant (large language model) to help you out. But sometimes, this robot makes mistakes, like making up stuff that doesn't exist (hallucinations). To make the robot more reliable, we designed a new method (ESG-Bench), like giving it a super compass to keep it from getting lost in the complex report. This compass is called a Chain-of-Thought strategy, which gives the robot some hints at each step to make sure it doesn't mess up. This way, the robot can help you finish the report better instead of causing trouble. Isn't that cool?
Glossary
ESG Report
Environmental, Social, and Governance (ESG) reports are documents companies use to disclose their performance in sustainability and social responsibility.
In this paper, ESG reports are the core objects of analysis and hallucination mitigation.
Hallucination
In natural language processing, hallucination refers to information generated by a model that is inconsistent with or unsupported by the source document.
In this paper, hallucination is the main problem to be mitigated.
Large Language Model (LLM)
A large language model is a deep learning model trained on vast amounts of data, capable of generating and understanding natural language.
In this paper, LLMs are used to analyze and understand ESG reports.
Chain-of-Thought (CoT)
Chain-of-Thought is a prompting strategy that improves a model's performance by guiding it through explicit step-by-step reasoning.
In this paper, CoT is used to reduce hallucinations in long-text contexts.
Question-Answering (QA) Task
A QA task is a natural language processing task aimed at extracting information from text to answer specific questions.
In this paper, ESG report analysis is framed as a QA task.
Fine-tuning
Fine-tuning refers to further training a pre-trained model on specific task data to improve its performance on that task.
In this paper, fine-tuning is used to improve model performance on ESG-Bench.
HaluEval
HaluEval is a benchmark dataset for evaluating hallucinations in model outputs across diverse tasks and domains.
In this paper, HaluEval is used to validate the transferability of CoT strategies.
BioASQ
BioASQ is a scientific literature QA benchmark focused on the life sciences domain.
In this paper, BioASQ is used to validate the transferability of CoT strategies.
Supervised Learning
Supervised learning is a machine learning method that trains models using labeled data to make predictions on new data.
In this paper, supervised learning is used to fine-tune models to reduce hallucinations.
Dataset
A dataset is an organized collection of data used to train and evaluate machine learning models.
In this paper, ESG-Bench is a dataset used for evaluating and reducing hallucinations.
Open Questions (Unanswered questions from this research)
1. How can hallucinations be effectively reduced in multilingual and cross-cultural ESG reports? Current methods mainly focus on English reports, and more general strategies are needed.
2. How can models' reasoning capabilities be enhanced in extremely long texts or highly complex reports? The existing Chain-of-Thought strategy may perform suboptimally in these scenarios.
3. How can ESG-Bench be applied across different fields and industries? Given the complexity and diversity of ESG reports, models may still have limitations when handling reports from specific industries or domains.
4. How can models' reasoning consistency and factual accuracy in long-text contexts be further improved? Although CoT strategies have made significant progress, room for improvement remains.
5. How can other techniques (e.g., multimodal learning) be integrated to enhance ESG report analysis? Current research mainly focuses on text.
Applications
Immediate Applications
Corporate ESG Audits
ESG-Bench can be used for corporate ESG audits, helping to identify and reduce hallucinations in reports, ensuring the accuracy and credibility of disclosures.
Compliance Verification
By using ESG-Bench, regulatory bodies can more effectively verify corporate compliance, ensuring reports meet relevant regulations and standards.
Long-text Summarization
ESG-Bench provides a valuable resource for training summarization models on long ESG documents, helping to improve model accuracy in handling complex long texts.
Long-term Vision
Multilingual ESG Analysis
In the future, ESG-Bench can be expanded to cover multilingual and cross-cultural ESG report analysis, helping global companies improve report transparency and consistency.
Intelligent Compliance Tools
By combining ESG-Bench with other technologies, intelligent compliance tools can be developed to help companies automate compliance processes, improving efficiency and accuracy.