Structured Linked Data as a Memory Layer for Agent-Orchestrated Retrieval
Using structured linked data as a memory layer, in the form of enhanced entity pages, improves RAG retrieval accuracy by 29.6% in standard RAG and by 29.8% in the full agentic pipeline.
Key Findings
Methodology
This paper proposes a method using structured linked data as a memory layer to enhance retrieval accuracy and answer quality in Retrieval-Augmented Generation (RAG) systems. The study employs Schema.org markup and dereferenceable entity pages, combined with Vertex AI Vector Search 2.0 and Google Agent Development Kit (ADK) for experiments. The experimental design tests seven conditions, covering three document representations and two retrieval modes, plus an Enhanced+ condition.
Key Results
- Result 1: In standard RAG systems, using an enhanced entity page format (including llms.txt-style agent instructions, breadcrumbs, and neural search capabilities) improved retrieval accuracy by 29.6%.
- Result 2: In the full agentic pipeline, the enhanced entity page format improved retrieval accuracy by 29.8%, demonstrating the advantage of structured data in multi-hop link traversal.
- Result 3: The Enhanced+ variant achieved the highest absolute scores in accuracy and completeness (accuracy: 4.85/5, completeness: 4.55/5), though the incremental gain over the base enhanced format is not statistically significant.
Significance
This study shows that structured linked data can substantially improve RAG performance, particularly in complex retrieval tasks that must integrate information from multiple sources. Because the approach builds on Schema.org markup and knowledge graph representations that many websites already publish, it is relevant to both academic research and industry applications that demand high precision and completeness.
Technical Contribution
The paper's technical contribution is a method for using structured linked data as a memory layer, which improves retrieval accuracy and answer quality over RAG pipelines that index plain HTML. By introducing an enhanced entity page format, the paper shows how an agent can exploit existing knowledge graph links for multi-hop traversal without constructing a new graph, which opens new engineering possibilities.
Novelty
The paper positions structured linked data as a memory layer for RAG systems. Unlike RAG systems that treat documents as flat text, the method exposes structured data from knowledge graphs through an enhanced entity page format, providing a new way to integrate information.
Limitations
- Limitation 1: Although the Enhanced+ variant achieves the best absolute scores, the incremental gain over the base enhanced format is not statistically significant, indicating that further navigational affordances may not always lead to significant performance improvements.
- Limitation 2: The experimental results may be influenced by specific domain datasets, with more pronounced effects in domains rich in knowledge graph information.
- Limitation 3: Due to the complexity of the experimental design, the reproducibility of results may be limited, especially across different knowledge graphs and data platforms.
Future Work
Future research directions include exploring the application of structured linked data as a memory layer in broader domains, further optimizing the enhanced entity page format, and developing more efficient multi-hop link traversal algorithms. Additionally, research could expand to other types of structured data and knowledge graphs to verify the generalizability of this method.
AI Executive Summary
In the field of information retrieval, Retrieval-Augmented Generation (RAG) systems have become the dominant architecture, yet most systems treat documents as unstructured text, ignoring the rich structured metadata and linked relationships that knowledge graphs provide. This paper proposes a method using structured linked data as a memory layer to enhance retrieval accuracy and answer quality.
By employing Schema.org markup and dereferenceable entity pages, combined with Vertex AI Vector Search 2.0 and Google Agent Development Kit (ADK), the paper conducts experiments across four domains (editorial, legal, travel, e-commerce). The experimental design includes seven conditions, covering three document representations and two retrieval modes, plus an Enhanced+ condition.
The results show that while JSON-LD markup alone provides only modest improvements, the enhanced entity page format significantly improves retrieval accuracy: +29.6% for standard RAG and +29.8% for the full agentic pipeline. The Enhanced+ variant achieves the highest absolute scores in accuracy and completeness, though the incremental gain over the base enhanced format is not statistically significant.
The approach is most relevant to applications that demand high precision and completeness, in both academia and industry. By leveraging structured linked data, the paper demonstrates how an agent can use existing knowledge graph links for multi-hop traversal without constructing a new graph, offering new engineering possibilities.
Deep Analysis
Background
With the rise of generative AI, the way users access information has fundamentally changed. Search engines increasingly augment traditional results with AI-generated summaries, a paradigm exemplified by Google's AI Mode, which retrieves, reasons over, and synthesizes information from multiple web sources. Understanding and optimizing for this new retrieval paradigm is critical for content creators, marketers, and organizations that depend on search visibility.
Retrieval-Augmented Generation (RAG) has emerged as the dominant architecture for grounding large language model (LLM) outputs in factual, up-to-date information. However, most RAG implementations treat documents as unstructured text, discarding the rich structured metadata that many websites already provide via Schema.org markup and knowledge graph representations. This paper investigates whether structured linked data can improve RAG system performance and proposes a novel method to leverage this data.
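To make concrete what a flat-text pipeline discards, the following minimal sketch pulls Schema.org JSON-LD out of a page using only the Python standard library. The sample HTML, the `example.org` identifiers, and the helper names `JsonLdExtractor` and `extract_jsonld` are illustrative assumptions, not part of the paper's released code.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


def extract_jsonld(html):
    """Return every JSON-LD object embedded in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page: the JSON-LD block carries a typed entity and a link to
# another entity, neither of which survives if only the body text is indexed.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article",
 "headline": "Example headline",
 "about": {"@id": "https://example.org/entity/rag"}}
</script>
</head><body><p>Flat text that a plain pipeline would index.</p></body></html>
"""

entities = extract_jsonld(SAMPLE_HTML)
print(entities[0]["@type"], entities[0]["about"]["@id"])
```

The `about` link is the kind of relationship an agent could later dereference; a pipeline that indexes only the paragraph text never sees it.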
Core Problem
The prevalent issue with current RAG systems is that they treat documents as flat text, ignoring the structured metadata and linked relationships provided by knowledge graphs. This approach leads to inadequate retrieval accuracy and completeness, especially in complex retrieval tasks requiring multi-source information integration. Addressing this problem is crucial for enhancing RAG system performance, particularly in applications requiring high precision and completeness.
Innovation
The core innovation of this paper is the proposal of using structured linked data as a memory layer in RAG systems. Specifically, the paper effectively utilizes structured data from knowledge graphs through an enhanced entity page format, which includes llms.txt-style agent instructions, breadcrumbs, and neural search capabilities. This method significantly improves retrieval accuracy and answer quality compared to existing RAG systems, providing a new way of information integration.
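As a rough illustration of the idea, the sketch below assembles a text entity page from three of the ingredients named above: agent instructions, breadcrumbs, and the entity's JSON-LD. The page layout, the function `render_entity_page`, the `example.org` URLs, and the search endpoint are all hypothetical; the paper's actual templates may look quite different.

```python
import json


def render_entity_page(entity, breadcrumbs, agent_instructions):
    """Render an entity as text: instructions first, then breadcrumbs, then JSON-LD."""
    lines = ["Agent instructions:"]
    lines += [f"- {step}" for step in agent_instructions]
    lines.append("")
    lines.append("Breadcrumbs: " + " > ".join(breadcrumbs))
    lines.append("")
    lines.append(f"Entity: {entity['name']}")
    lines.append("Structured data (JSON-LD):")
    lines.append(json.dumps(entity, indent=2, sort_keys=True))
    return "\n".join(lines)


# Hypothetical entity; the @id values stand in for dereferenceable entity pages.
entity = {
    "@context": "https://schema.org",
    "@id": "https://example.org/entity/hotel-aurora",
    "@type": "Hotel",
    "name": "Hotel Aurora",
    "containedInPlace": {"@id": "https://example.org/entity/rome"},
}

page = render_entity_page(
    entity,
    breadcrumbs=["Home", "Travel", "Rome", "Hotel Aurora"],
    agent_instructions=[
        "Follow @id links to load related entities.",
        # Hypothetical endpoint standing in for the page's search capability.
        "Query https://example.org/search?q=<query> for semantic search.",
    ],
)
print(page)
```

Putting the instructions at the top is a design choice of this sketch: an agent reading the page sequentially learns how to navigate before it reaches the data.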
Methodology
The methodology of this paper includes the following key steps:
- Use Schema.org markup and dereferenceable entity pages as sources of structured linked data.
- Combine Vertex AI Vector Search 2.0 for retrieval with Google Agent Development Kit (ADK) for agentic reasoning.
- Test seven conditions, covering three document representations (plain HTML, HTML with JSON-LD, and an enhanced agentic-optimized entity page) and two retrieval modes (standard RAG and agentic RAG with multi-hop link traversal), plus an Enhanced+ condition.
- Conduct experiments across four domains (editorial, legal, travel, e-commerce) to validate the method's effectiveness.
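The condition count follows from the design: three representations crossed with two retrieval modes give six cells, and Enhanced+ adds a seventh. A minimal sketch of that enumeration is shown here; the identifier strings are made up for illustration, and the retrieval mode of the Enhanced+ condition is left as `None` because this summary does not state it.

```python
from itertools import product

# Labels are illustrative; the paper may name its conditions differently.
REPRESENTATIONS = ["plain_html", "html_jsonld", "enhanced_entity_page"]
RETRIEVAL_MODES = ["standard_rag", "agentic_rag"]


def build_conditions():
    """Return the 3 x 2 grid of conditions plus the extra Enhanced+ condition."""
    conditions = [
        {"representation": rep, "mode": mode}
        for rep, mode in product(REPRESENTATIONS, RETRIEVAL_MODES)
    ]
    # Assumption: the retrieval mode for Enhanced+ is not specified here.
    conditions.append({"representation": "enhanced_plus", "mode": None})
    return conditions


conditions = build_conditions()
for index, condition in enumerate(conditions, start=1):
    print(index, condition["representation"], condition["mode"])
```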
Experiments
The experiments cover four domains (editorial, legal, travel, e-commerce), using Vertex AI Vector Search 2.0 for retrieval and the Google Agent Development Kit (ADK) for agentic reasoning. They include seven conditions: three document representations crossed with two retrieval modes, plus an Enhanced+ condition. The evaluation comprises 2,443 individual query evaluations, and the authors release their dataset and evaluation framework to support reproducibility.
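The agentic retrieval mode relies on multi-hop link traversal. The sketch below shows the general shape of such a traversal as a hop-limited breadth-first walk over an in-memory set of pages. It is a deterministic stand-in, not the paper's ADK agent, which presumably lets the LLM decide which links to follow; the `PAGES` data, the `fetch` callable, and `traverse_links` are assumptions made for the example.

```python
from collections import deque


def traverse_links(start_url, fetch, max_hops=2):
    """Breadth-first walk over entity links, visiting pages up to max_hops away."""
    hops = {start_url: 0}  # URL -> number of hops from the start page
    visited_order = []
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:  # unknown or unreachable page
            continue
        visited_order.append(url)
        if hops[url] == max_hops:  # do not expand beyond the hop budget
            continue
        for link in page.get("links", []):
            if link not in hops:
                hops[link] = hops[url] + 1
                queue.append(link)
    return visited_order


# Toy link structure standing in for dereferenceable entity pages.
PAGES = {
    "https://example.org/hotel": {
        "links": ["https://example.org/city", "https://example.org/brand"]
    },
    "https://example.org/city": {"links": ["https://example.org/country"]},
    "https://example.org/brand": {"links": []},
    "https://example.org/country": {"links": ["https://example.org/continent"]},
    "https://example.org/continent": {"links": []},
}

reachable = traverse_links("https://example.org/hotel", PAGES.get, max_hops=2)
print(reachable)
```

Bounding the number of hops keeps the amount of fetched context predictable; with `max_hops=2` the walk reaches the country page but never loads the continent page.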
Results
The enhanced entity page format significantly improves retrieval accuracy: +29.6% for standard RAG and +29.8% for the full agentic pipeline, the latter showing the advantage of structured data in multi-hop link traversal. The Enhanced+ variant achieves the highest absolute scores (accuracy: 4.85/5, completeness: 4.55/5), though its incremental gain over the base enhanced format is not statistically significant.
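This summary does not say how the percentage gains were computed. If they are relative improvements in mean accuracy over a baseline condition (an assumption, with the symbols below introduced only for illustration), the calculation would take this form:

```latex
\Delta_{\text{rel}}
  = \frac{\bar{s}_{\text{enhanced}} - \bar{s}_{\text{baseline}}}{\bar{s}_{\text{baseline}}}
    \times 100\%
```

Here \bar{s} denotes a mean accuracy score on the 5-point scale. Under this reading, a gain of +29.6% means the enhanced mean is 1.296 times the baseline mean.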
Applications
This method can be directly applied to high-precision and high-completeness applications such as legal document retrieval, travel information integration, and e-commerce product recommendation. By leveraging structured linked data, the method can significantly enhance retrieval accuracy and completeness, especially in complex retrieval tasks requiring multi-source information integration.
Limitations & Outlook
Despite the excellent performance of this method in experiments, there are some limitations. First, although the Enhanced+ variant achieves the best absolute scores, the incremental gain over the base enhanced format is not statistically significant, indicating that further navigational affordances may not always lead to significant performance improvements. Additionally, the experimental results may be influenced by specific domain datasets, with more pronounced effects in domains rich in knowledge graph information. Finally, due to the complexity of the experimental design, the reproducibility of results may be limited, especially across different knowledge graphs and data platforms.
Plain Language (accessible to non-experts)
Imagine searching for a specific book in a large library. Traditional RAG systems are like a librarian who relies solely on the book title catalog, only able to find books by their titles without leveraging the relationships between them. The method proposed in this paper is like a super librarian who has access to a complete bibliography and related information, able to find not only the book by its title but also related books and materials.
This method uses structured linked data, allowing the RAG system to act like the super librarian, utilizing the relationships between books to improve retrieval accuracy and completeness. In this way, the system can better integrate multi-source information, providing more comprehensive and accurate answers.
This method is particularly suitable for complex tasks requiring the integration of large amounts of information, such as legal document retrieval, travel information integration, and e-commerce product recommendation. By leveraging structured linked data, the system can better understand and integrate information, enhancing retrieval accuracy and completeness.
In summary, this method allows RAG systems to act like a super librarian, utilizing the relationships between books to improve retrieval accuracy and completeness. This approach offers a new perspective for information retrieval and generation, potentially impacting academia and industry.
ELI14 (explained like you're 14)
Hey there! Imagine you're in a huge library trying to find a book about dinosaurs. A regular librarian can only help you find the book by its title, but if they're a super librarian, they can use the connections between books to find even more related books and information!
That's what this paper's method does! By using structured linked data, the system acts like that super librarian, using the connections between books to improve retrieval accuracy and completeness. This way, you get more comprehensive and accurate answers!
This method is especially useful for complex tasks that need a lot of information, like legal document retrieval, travel information integration, and e-commerce product recommendation. By using structured linked data, the system can better understand and integrate information, enhancing retrieval accuracy and completeness.
So, next time you're in a library looking for a book, think of this super librarian! They're the superheroes of information retrieval!
Glossary
RAG System (Retrieval-Augmented Generation)
An architecture combining information retrieval and generation to enhance the accuracy and completeness of generated results using retrieved information.
The paper explores how to use structured linked data to improve RAG system performance.
Structured Linked Data
Data represented through Schema.org markup and dereferenceable entity pages, providing rich metadata and linked relationships.
The paper uses structured linked data as a memory layer in RAG systems.
Schema.org Markup
A markup format used to embed structured data in web pages, helping search engines better understand page content.
The paper uses Schema.org markup as a source of structured linked data.
Knowledge Graph
A graph structure representing entities and their relationships, widely used in information retrieval and semantic understanding.
The paper investigates how to leverage structured data from knowledge graphs to improve RAG system performance.
Vertex AI Vector Search 2.0
An AI-native search engine designed for efficient information retrieval, combining dense semantic search and sparse keyword search.
The paper uses Vertex AI Vector Search 2.0 for information retrieval.
Google Agent Development Kit (ADK)
A production framework for building multi-tool agents, supporting complex multi-step reasoning and tool use.
The paper combines ADK for agentic reasoning.
Enhanced Entity Page Format
An optimized document representation format including llms.txt-style agent instructions, breadcrumbs, and neural search capabilities.
The paper proposes an enhanced entity page format to improve retrieval performance.
Multi-hop Link Traversal
A retrieval method integrating information through multiple link hops, mimicking the behavior of AI-powered search systems.
The paper explores the application of multi-hop link traversal in RAG systems.
llms.txt-style Agent Instructions
Instruction format providing explicit guidance for LLM agents, helping them better understand and use structured data.
The enhanced entity page format in the paper includes llms.txt-style agent instructions.
Breadcrumb Navigation
A navigation tool helping users understand their position within a website's structure, often used to enhance user experience.
The enhanced entity page format in the paper includes breadcrumb navigation.
Open Questions (unanswered questions from this research)
- How can structured linked data as a memory layer be applied in broader domains? Current research focuses on specific domains, and future exploration is needed for its potential in other areas.
- How can the enhanced entity page format be further optimized? Although the current format significantly improves retrieval performance, there is room for improvement, especially in navigational features.
- How can more efficient multi-hop link traversal algorithms be developed? Existing algorithms may be inefficient in some cases, requiring further optimization to improve retrieval efficiency.
- How can the generalizability of this method be verified across different knowledge graphs and data platforms? Current experimental results may be influenced by specific domain datasets, requiring validation on broader platforms.
- How can the reproducibility issue caused by the complexity of experimental design be addressed? More standardized experimental frameworks are needed to improve the reproducibility of results.
Applications
Immediate Applications
Legal Document Retrieval
By leveraging structured linked data, the system can more accurately retrieve and integrate legal documents, enhancing retrieval accuracy and completeness.
Travel Information Integration
Using structured linked data, the system can better integrate travel information, providing users with more comprehensive and accurate travel recommendations.
E-commerce Product Recommendation
By leveraging structured linked data, the system can more accurately recommend e-commerce products, enhancing the shopping experience for users.
Long-term Vision
Cross-domain Information Integration
By further optimizing the application of structured linked data, the system can achieve information integration across broader domains, providing more comprehensive solutions.
Intelligent Search Engines
By leveraging structured linked data, future search engines can more intelligently understand and integrate information, providing users with more accurate and comprehensive search results.
Abstract
Retrieval-Augmented Generation (RAG) systems typically treat documents as flat text, ignoring the structured metadata and linked relationships that knowledge graphs provide. In this paper, we investigate whether structured linked data, specifically Schema.org markup and dereferenceable entity pages served by a Linked Data Platform, can improve retrieval accuracy and answer quality in both standard and agentic RAG systems. We conduct a controlled experiment across four domains (editorial, legal, travel, e-commerce) using Vertex AI Vector Search 2.0 for retrieval and the Google Agent Development Kit (ADK) for agentic reasoning. Our experimental design tests seven conditions: three document representations (plain HTML, HTML with JSON-LD, and an enhanced agentic-optimized entity page) crossed with two retrieval modes (standard RAG and agentic RAG with multi-hop link traversal), plus an Enhanced+ condition that adds rich navigational affordances and entity interlinking. Our results reveal that while JSON-LD markup alone provides only modest improvements, our enhanced entity page format, incorporating llms.txt-style agent instructions, breadcrumbs, and neural search capabilities, achieves substantial gains: +29.6% accuracy improvement for standard RAG and +29.8% for the full agentic pipeline. The Enhanced+ variant, with richer navigational affordances, achieves the highest absolute scores (accuracy: 4.85/5, completeness: 4.55/5), though the incremental gain over the base enhanced format is not statistically significant. We release our dataset, evaluation framework, and enhanced entity page templates to support reproducibility.