ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation
ESG-Bench is a benchmark for long-context ESG report analysis that, combined with task-specific Chain-of-Thought prompting strategies, significantly reduces hallucinations.
Key Findings
Methodology
ESG-Bench frames ESG report analysis as a QA task with verifiability constraints, enabling systematic evaluation of LLMs' ability to extract and reason over ESG content. Task-specific Chain-of-Thought (CoT) prompting strategies and CoT-annotated reasoning paths are used to fine-tune multiple state-of-the-art LLMs, significantly reducing hallucinations.
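To make the "QA task with verifiability constraints" framing concrete, here is a minimal sketch of what one benchmark record might look like. The field names and checks are illustrative assumptions, not the paper's actual schema: the key idea is that every answer must be backed by evidence spans that verifiably occur in the report excerpt.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a verifiability-constrained QA record; field names
# are illustrative and not taken from the ESG-Bench paper itself.
@dataclass
class ESGQARecord:
    question: str
    context: str                        # excerpt from the ESG report
    answer: str                         # gold answer ("" if unanswerable)
    evidence_spans: list = field(default_factory=list)  # (start, end) offsets

    def is_answerable(self) -> bool:
        """A record counts as answerable only when gold evidence exists."""
        return bool(self.evidence_spans)

    def evidence_is_verifiable(self) -> bool:
        """Every cited span must lie inside the source context."""
        return all(0 <= s < e <= len(self.context)
                   for s, e in self.evidence_spans)

record = ESGQARecord(
    question="What was the company's Scope 1 emissions reduction target?",
    context="We commit to reducing Scope 1 emissions by 42% by 2030.",
    answer="42% by 2030",
    evidence_spans=[(43, 54)],
)
```

An answer whose spans fail `evidence_is_verifiable` would be flagged rather than scored, which is what distinguishes this setup from ordinary open-ended QA.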
Key Results
- On ESG-Bench, models using CoT strategies significantly outperform standard prompting and direct fine-tuning, reducing hallucinations by over 30%.
- Experiments demonstrate that CoT strategies are effective not only in the ESG domain but also transferable to other QA benchmarks like HaluEval and BioASQ, showing higher accuracy and reliability.
- Comparative analysis of different fine-tuning strategies reveals that CoT fine-tuning enhances reasoning consistency and factual accuracy in long-text contexts.
Significance
The introduction of ESG-Bench provides a systematic framework for ESG report analysis and hallucination mitigation, particularly in socially sensitive and compliance-critical environments. This research offers new perspectives on the reliability of LLMs when handling complex long texts and lays the groundwork for future compliance-analysis tools.
Technical Contribution
Technical contributions include the first framing of ESG report analysis as a QA task with verifiability constraints and the introduction of Chain-of-Thought prompting strategies to reduce hallucinations. This approach offers a new structured strategy for reasoning in long-text contexts, significantly improving models' factual consistency and reasoning transparency.
Novelty
ESG-Bench is the first benchmark specifically designed for long-context ESG report QA, providing human-verified hallucination annotations and tasks. The novelty lies in applying Chain-of-Thought strategies to long-text analysis, significantly reducing hallucinations.
Limitations
- ESG-Bench currently focuses on English ESG reports and does not cover multilingual or cross-cultural ESG report analysis.
- Due to the complexity and diversity of ESG reports, models may still have limitations when handling reports from specific industries or domains.
- The current Chain-of-Thought strategy may perform suboptimally in extremely long texts or highly complex reports.
Future Work
Future research directions include expanding ESG-Bench to cover multilingual and cross-cultural ESG reports, developing more general hallucination mitigation strategies, and exploring how to enhance models' reasoning capabilities in extremely long texts.
AI Executive Summary
As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requirement in many regions and a key channel for documenting sustainability practices and assessing firms' long-term and ethical performance. However, the length and complexity of ESG disclosures make them difficult to interpret and automate the analysis reliably. To support scalable and trustworthy analysis, this paper introduces ESG-Bench, a benchmark dataset for ESG report understanding and hallucination mitigation in large language models (LLMs). ESG-Bench contains human-annotated question-answer (QA) pairs grounded in real-world ESG report contexts, with fine-grained labels indicating whether model outputs are factually supported or hallucinated. Framing ESG report analysis as a QA task with verifiability constraints enables systematic evaluation of LLMs' ability to extract and reason over ESG content and provides a new use case: mitigating hallucinations in socially sensitive, compliance-critical settings. We design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Our experiments show that these CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations, and that the gains transfer to existing QA benchmarks beyond the ESG domain.
Accurate and trustworthy ESG (Environmental, Social, and Governance) reporting is increasingly essential for sustainable development, regulatory accountability, and ethical corporate conduct. ESG provides a framework for assessing how companies manage sustainability-related risks across environmental, social, and governance pillars. Once largely voluntary, ESG disclosure has become a legal requirement in many regions, most notably through EU regulations such as the Corporate Sustainability Reporting Directive and the Sustainable Finance Disclosure Regulation. This shift reflects growing expectations for transparency in corporate impacts on society and the environment. ESG reporting therefore plays a critical role in enabling compliance and supporting stakeholders' evaluation of long-term performance.
Corporations now publish extensive ESG reports for investors, regulators, and the public. However, the usefulness of these disclosures depends on their credibility and comparability. Third-party ESG rating agencies such as Sustainalytics and MSCI have been widely criticized for methodological opacity and inconsistency, with studies showing that their scores often diverge substantially even for the same company due to differences in indicator selection, weighting schemes, and data sources. These controversies undermine stakeholder trust and highlight that ESG assessments are far from standardized. Combined with the growing length and complexity of sustainability reports, this inconsistency increases the need for scalable, transparent tools that can support reliable and evidence-grounded interpretation.
The emergence of large language models (LLMs) offers new opportunities for automating the analysis of ESG disclosures at scale. However, the complexity and diversity of ESG reports pose significant challenges for reliable LLM deployment. Companies may engage in greenwashing to appear more sustainable, misleading investors and stakeholders about their true ESG impact. ESG reports require deep contextual understanding, industry-specific knowledge, and familiarity with regulatory frameworks, barriers that LLMs may struggle with given their reliance on general knowledge. Moreover, ESG reports mix text, tables, and graphics and often span hundreds of pages, while LLMs remain limited in efficient document parsing, robust memory recall, and cross-sectional understanding of lengthy reports.
In this paper, we present ESG-Bench, a benchmark for hallucination-aware ESG question answering. We build the dataset through a model–then–annotator pipeline, establish a taxonomy of hallucination types, evaluate multiple LLMs on ESG-Bench, and propose a task-specific Chain-of-Thought strategy for reducing hallucinations in long-context ESG analysis. Our contributions are summarized below:
• ESG-Bench is a benchmark dataset specifically designed for long-context QA and hallucination mitigation in ESG reporting. To the best of our knowledge, it is the first structured resource that supports both systematic evaluation and targeted mitigation of hallucinations in this socially and regulatorily significant domain.
• We develop a fine-tuning approach based on task-specific CoT prompting and CoT-annotated reasoning traces. This method significantly improves factual grounding and reduces hallucinated outputs, demonstrating the effectiveness of structured reasoning in a domain-specific QA task.
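The CoT-annotated fine-tuning described above can be sketched as follows. This is a hedged illustration of how a training example *could* be serialized, with the reasoning trace placed before the final grounded answer; the prompt wording and JSON-style layout are assumptions, not the paper's exact format.

```python
# Illustrative sketch (not the paper's exact format) of a CoT-annotated
# supervised fine-tuning example: the target contains the reasoning trace
# first, then the final answer grounded in the report excerpt.
def build_cot_example(question, context, rationale_steps, answer):
    prompt = (
        "Answer using ONLY the ESG report excerpt below. "
        "If the excerpt lacks the information, answer 'Not stated'.\n\n"
        f"Excerpt: {context}\nQuestion: {question}\n"
    )
    rationale = "\n".join(f"Step {i}: {s}"
                          for i, s in enumerate(rationale_steps, 1))
    target = f"{rationale}\nAnswer: {answer}"
    return {"prompt": prompt, "target": target}

ex = build_cot_example(
    question="Does the report state a net-zero target year?",
    context="The company aims for net-zero operations by 2040.",
    rationale_steps=[
        "Locate statements about net-zero commitments.",
        "The excerpt states a 2040 target for operations.",
    ],
    answer="Yes, 2040.",
)
```

Training on targets of this shape is what forces the model to justify an answer with intermediate steps before committing to it, rather than emitting an answer directly.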
Deep Analysis
Background
In recent years, as corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting has become a legal requirement in many regions and a key channel for documenting sustainability practices and assessing firms' long-term and ethical performance. ESG provides a framework for assessing how companies manage sustainability-related risks across environmental, social, and governance pillars. Once largely voluntary, ESG disclosure has become a legal requirement in many regions, most notably through EU regulations such as the Corporate Sustainability Reporting Directive and the Sustainable Finance Disclosure Regulation. This shift reflects growing expectations for transparency in corporate impacts on society and the environment. ESG reporting therefore plays a critical role in enabling compliance and supporting stakeholders' evaluation of long-term performance.

Corporations now publish extensive ESG reports for investors, regulators, and the public. However, the usefulness of these disclosures depends on their credibility and comparability. Third-party ESG rating agencies such as Sustainalytics and MSCI have been widely criticized for methodological opacity and inconsistency, with studies showing that their scores often diverge substantially even for the same company due to differences in indicator selection, weighting schemes, and data sources. These controversies undermine stakeholder trust and highlight that ESG assessments are far from standardized. Combined with the growing length and complexity of sustainability reports, this inconsistency increases the need for scalable, transparent tools that can support reliable and evidence-grounded interpretation.
Core Problem
The complexity and diversity of ESG reports pose significant challenges for reliable LLM deployment. Companies may engage in greenwashing to appear more sustainable, misleading investors and stakeholders about their true ESG impact. ESG reports require deep contextual understanding, industry-specific knowledge, and familiarity with regulatory frameworks, barriers that LLMs may struggle with given their reliance on general knowledge. The reports also mix text, tables, and graphics and often span hundreds of pages, while LLMs remain limited in efficient document parsing, robust memory recall, and cross-sectional understanding of lengthy documents. In addition, LLMs rely heavily on parametric knowledge that may conflict with the factual content of ESG reports. This misalignment frequently leads to hallucinations: answers that are not grounded in the source document. We classify hallucinations into two types: (1) unsupported hallucinations, where the model introduces information absent from the source, and (2) omissive hallucinations, where the model fails to answer despite relevant evidence.
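The two-way distinction between unsupported and omissive hallucinations can be sketched as a labelling rule. This is a deliberate simplification: the substring-based support check below stands in for the paper's human-verified annotation, and the abstention phrases are assumed, not taken from the dataset.

```python
# Minimal sketch of the two-way hallucination labelling: the substring
# support check is a simplification of the paper's human verification,
# and the abstention phrases are illustrative assumptions.
def label_response(answer: str, context: str, gold_answerable: bool) -> str:
    abstained = answer.strip().lower() in {"not stated", "i don't know", ""}
    if abstained:
        # Refusing although evidence exists is an omissive hallucination.
        return ("omissive_hallucination" if gold_answerable
                else "correct_abstention")
    supported = answer.strip().lower() in context.lower()
    # Asserting content absent from the source is an unsupported hallucination.
    return "grounded" if supported else "unsupported_hallucination"
```

Under this rule, a made-up figure gets `unsupported_hallucination`, while a needless "Not stated" on an answerable question gets `omissive_hallucination`, matching the taxonomy described above.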
Innovation
ESG-Bench frames ESG report analysis as a QA task with verifiability constraints, enabling systematic evaluation of LLMs' ability to extract and reason over ESG content and providing a new use case: mitigating hallucinations in socially sensitive, compliance-critical settings. We design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Our experiments show that these CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations, and that the gains transfer to existing QA benchmarks beyond the ESG domain.
Methodology
- ESG-Bench dataset construction: built through a model–then–annotator pipeline, establishing a taxonomy of hallucination types.
- Task-specific Chain-of-Thought strategy: designed task-specific CoT prompting strategies and fine-tuned multiple state-of-the-art LLMs using CoT-annotated reasoning paths.
- Systematic evaluation: evaluated multiple LLMs on ESG-Bench, focusing on hallucination mitigation.
- Experimental validation: compared different fine-tuning strategies, validating the effectiveness of CoT strategies in reducing hallucinations.
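The model–then–annotator construction loop in the first bullet above can be sketched in a few lines. The function names and record fields are hypothetical; the point is only the two-stage flow, where a model drafts an answer and a human annotator verifies, labels, and corrects it.

```python
# Hedged sketch of a "model-then-annotator" dataset construction loop;
# function names and record fields are illustrative, not the paper's API.
def build_benchmark(questions, model_answer, annotate):
    dataset = []
    for q in questions:
        draft = model_answer(q)                # stage 1: model drafts an answer
        label, corrected = annotate(q, draft)  # stage 2: human verifies/corrects
        dataset.append({"question": q, "draft": draft,
                        "label": label, "answer": corrected})
    return dataset

# Toy usage with stub callables standing in for the LLM and the annotator.
data = build_benchmark(
    ["q1", "q2"],
    model_answer=lambda q: "draft:" + q,
    annotate=lambda q, d: ("grounded", d.upper()),
)
```

Keeping both the raw draft and the corrected answer is what makes the resulting records usable for hallucination labelling and for fine-tuning at the same time.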
Experiments
The experimental design evaluates multiple large language models on the ESG-Bench dataset, focusing on hallucination mitigation. We selected several state-of-the-art LLMs, including Llama-3.2-3B-Instruct, Gemma-2-2B-it, and Mistral-7B-Instruct-v0.3, and tested them on the ESG-Bench, HaluEval, and BioASQ benchmarks, assessing their ability to generate responses while identifying hallucinations. The evaluation metrics are WA (With Answer) and WoA (Without Answer) accuracy, which jointly assess the models' ability to generate accurate answers and to abstain appropriately when sufficient information is unavailable.
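A minimal sketch of the two accuracies, under the assumption that WA accuracy is measured on answerable questions and WoA accuracy on unanswerable ones (where the correct behaviour is to abstain); the exact matching criterion in the paper may differ.

```python
# Sketch of WA/WoA accuracy, assuming WA scores answerable questions and
# WoA scores unanswerable ones where the model should abstain.
def wa_woa_accuracy(examples):
    wa_correct = wa_total = woa_correct = woa_total = 0
    for ex in examples:
        if ex["answerable"]:
            wa_total += 1
            wa_correct += ex["prediction"] == ex["gold"]
        else:
            woa_total += 1
            woa_correct += ex["prediction"] == "abstain"
    wa = wa_correct / wa_total if wa_total else 0.0
    woa = woa_correct / woa_total if woa_total else 0.0
    return wa, woa

examples = [
    {"answerable": True,  "prediction": "42%",     "gold": "42%"},
    {"answerable": True,  "prediction": "2050",    "gold": "2040"},
    {"answerable": False, "prediction": "abstain", "gold": None},
    {"answerable": False, "prediction": "2030",    "gold": None},
]
wa, woa = wa_woa_accuracy(examples)  # -> (0.5, 0.5)
```

Reporting the two numbers separately is what makes the evaluation fair: a model cannot inflate its score by always answering (hurting WoA) or by always abstaining (hurting WA).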
Results
The experimental results show that on ESG-Bench, models using CoT strategies significantly outperform standard prompting and direct fine-tuning, reducing hallucinations by over 30%. Experiments demonstrate that CoT strategies are effective not only in the ESG domain but also transfer to other QA benchmarks such as HaluEval and BioASQ, showing higher accuracy and reliability. Comparative analysis of different fine-tuning strategies reveals that CoT fine-tuning enhances reasoning consistency and factual accuracy in long-text contexts.
Applications
The application scenarios of ESG-Bench include corporate ESG audits and compliance verification, as well as providing a valuable resource for training summarization models on long ESG documents. Annotator-corrected responses enable fine-tuning of ESG-specific QA models for improved factual grounding, while hallucination labels aid in developing mitigation strategies. The dataset also serves as a benchmarking tool for evaluating answer accuracy, retrieval robustness, and format-specific performance.
Limitations & Outlook
ESG-Bench currently focuses on English ESG reports and does not cover multilingual or cross-cultural ESG report analysis. Given the complexity and diversity of ESG reports, models may still have limitations when handling reports from specific industries or domains, and the current Chain-of-Thought strategy may perform suboptimally on extremely long texts or highly complex reports. Future research directions include expanding ESG-Bench to multilingual and cross-cultural ESG reports, developing more general hallucination mitigation strategies, and exploring how to enhance models' reasoning capabilities over extremely long texts.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen cooking a meal. You have a recipe (ESG report), but it's very long and has many complex steps (complex ESG report). You need an assistant (large language model) to help you understand and execute these steps. However, sometimes the assistant might misunderstand some parts of the recipe, leading to a wrong dish (hallucination). To avoid this, we need a new method (ESG-Bench), which acts like a detailed guidebook, helping the assistant better understand the recipe and ensuring each step is accurate. This method uses something called a Chain-of-Thought strategy, which is like adding annotations between each step to ensure the assistant understands the logic behind it before proceeding. It's like giving the assistant a clear thought process, so it doesn't get lost in the complex recipe and makes a delicious dish. This way, we can ensure the assistant's accuracy and reliability when handling complex recipes.
ELI14 (Explained like you're 14)
Hey there! Imagine you're doing a super long school project (like an ESG report), but it's so long you don't know where to start. So, you get a super smart robot assistant (large language model) to help you out. But sometimes, this robot makes mistakes, like making up stuff that doesn't exist (hallucinations). To make the robot more reliable, we designed a new method (ESG-Bench), like giving it a super compass to keep it from getting lost in the complex report. This compass is called a Chain-of-Thought strategy, which gives the robot some hints at each step to make sure it doesn't mess up. This way, the robot can help you finish the report better instead of causing trouble. Isn't that cool?
Glossary
ESG Report
Environmental, Social, and Governance (ESG) reports are documents companies use to disclose their performance in sustainability and social responsibility.
In this paper, ESG reports are the core objects of analysis and hallucination mitigation.
Hallucination
In natural language processing, hallucination refers to information generated by a model that is inconsistent with or unsupported by the source document.
In this paper, hallucination is the main problem to be mitigated.
Large Language Model (LLM)
A large language model is a deep learning model trained on vast amounts of data, capable of generating and understanding natural language.
In this paper, LLMs are used to analyze and understand ESG reports.
Chain-of-Thought (CoT)
Chain-of-Thought is a prompting strategy that improves a model's performance by guiding it through explicit step-by-step reasoning.
In this paper, CoT is used to reduce hallucinations in long-text contexts.
Question-Answering (QA) Task
A QA task is a natural language processing task aimed at extracting information from text to answer specific questions.
In this paper, ESG report analysis is framed as a QA task.
Fine-tuning
Fine-tuning refers to further training a pre-trained model on specific task data to improve its performance on that task.
In this paper, fine-tuning is used to improve model performance on ESG-Bench.
HaluEval
HaluEval is a benchmark dataset for evaluating hallucinations in model outputs across diverse tasks and domains.
In this paper, HaluEval is used to validate the transferability of CoT strategies.
BioASQ
BioASQ is a scientific literature QA benchmark focused on the life sciences domain.
In this paper, BioASQ is used to validate the transferability of CoT strategies.
Supervised Learning
Supervised learning is a machine learning method that trains models using labeled data to make predictions on new data.
In this paper, supervised learning is used to fine-tune models to reduce hallucinations.
Dataset
A dataset is an organized collection of data used to train and evaluate machine learning models.
In this paper, ESG-Bench is a dataset used for evaluating and reducing hallucinations.
Open Questions (Unanswered questions from this research)
1. How can hallucinations be effectively reduced in multilingual and cross-cultural ESG reports? Current methods mainly focus on English reports, and more general strategies are needed.
2. How can models' reasoning capabilities be enhanced in extremely long texts or highly complex reports? The existing Chain-of-Thought strategy may perform suboptimally in these scenarios.
3. How can ESG-Bench be applied across different fields and industries? Given the complexity and diversity of ESG reports, models may still have limitations when handling reports from specific industries or domains.
4. How can models' reasoning consistency and factual accuracy in long-text contexts be further improved? Although CoT strategies have made significant progress, room for improvement remains.
5. How can other techniques (e.g., multimodal learning) be integrated to enhance ESG report analysis? Current research mainly focuses on text.
Applications
Immediate Applications
Corporate ESG Audits
ESG-Bench can be used for corporate ESG audits, helping to identify and reduce hallucinations in reports, ensuring the accuracy and credibility of disclosures.
Compliance Verification
By using ESG-Bench, regulatory bodies can more effectively verify corporate compliance, ensuring reports meet relevant regulations and standards.
Long-text Summarization
ESG-Bench provides a valuable resource for training summarization models on long ESG documents, helping to improve model accuracy in handling complex long texts.
Long-term Vision
Multilingual ESG Analysis
In the future, ESG-Bench can be expanded to cover multilingual and cross-cultural ESG report analysis, helping global companies improve report transparency and consistency.
Intelligent Compliance Tools
By combining ESG-Bench with other technologies, intelligent compliance tools can be developed to help companies automate compliance processes, improving efficiency and accuracy.