Visual Commonsense Driven Knowledge Refinements for Scene Graph Generation
Proposes a knowledge refinement framework that combines automatic rule mining with ASP-based abductive reasoning, improving scene graph generation by roughly 4-8 percentage points of F1@50 across three benchmarks.
Key Findings
Methodology
This paper introduces a model-agnostic, semantically-guided knowledge refinement framework that enhances scene graph generation by systematically mining commonsense-grounded constraints from training data. The process involves an offline rule mining stage, where spatial, functional, and relational regularities are extracted using statistical analysis of annotated datasets such as VG150, PSG, and IndoorVG. These rules are compiled into an ASP (Answer Set Programming) program, which encodes spatial configurations, role cardinalities, and logical regularities. During inference, the neural scene graph predictions are converted into ASP facts, and abductive reasoning is performed to select relations that satisfy the mined constraints, effectively filtering out physically impossible or illogical relations and recovering missing ones. This approach does not require retraining or manual rule authoring, and it transfers seamlessly across different architectures and datasets, providing a formal, interpretable, and scalable solution to improve the logical consistency of scene graphs.
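To make the offline mining stage concrete, here is a minimal sketch of how one kind of rule, a role cardinality limit, could be derived from annotated triples and rendered as an ASP integrity constraint. It is an illustration under stated assumptions, not the paper's pipeline: the function names, the `min_support` threshold, and the `rel/3` and `obj/1` predicate names are all hypothetical.

```python
from collections import Counter, defaultdict


def mine_cardinality_rules(scenes, min_support=3):
    """Return {predicate: max subjects observed per object}.

    scenes: list of scenes, each a list of (subject_id, predicate, object_id).
    Predicates annotated fewer than min_support times are skipped
    (min_support is a hypothetical threshold, not a value from the paper).
    """
    support = Counter()
    max_fan_in = defaultdict(int)
    for triples in scenes:
        # Count how many distinct subjects share each (predicate, object) pair.
        fan_in = Counter()
        for _subj, pred, obj in set(triples):
            fan_in[(pred, obj)] += 1
        for (pred, _obj), n in fan_in.items():
            support[pred] += n
            max_fan_in[pred] = max(max_fan_in[pred], n)
    return {p: max_fan_in[p] for p in support if support[p] >= min_support}


def to_asp_constraint(pred, limit):
    """Render a cardinality rule as an ASP integrity constraint string.

    rel/3 and obj/1 are assumed fact names chosen for this sketch.
    """
    return f":- obj(O), #count {{ S : rel(S, {pred}, O) }} > {limit}."


if __name__ == "__main__":
    scenes = [
        [("man1", "driving", "car1"), ("man1", "wearing", "hat1")],
        [("woman1", "driving", "bus1"), ("dog1", "sitting_in", "bus1"),
         ("cat1", "sitting_in", "bus1")],
        [("man2", "driving", "car2"), ("kid1", "sitting_in", "car2")],
    ]
    for pred, limit in sorted(mine_cardinality_rules(scenes).items()):
        print(to_asp_constraint(pred, limit))
```

Spatial and relational regularities would need their own statistics; the sketch covers only the cardinality case.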
Key Results
- Across three benchmarks (VG150, PSG, IndoorVG), the proposed method consistently improves F1@50 by approximately 4-8 percentage points over baseline models such as Motifs, Transformer, and REACT++, with the largest gains observed in long-tail and zero-shot relation predictions. The Constraint Violation Rate (CVR) decreases by about 15%, indicating fewer physically or logically implausible relations. The zero-shot recall (zsR@50) also improves by 4.5%, demonstrating enhanced generalization. The method maintains comparable object detection performance while significantly boosting relation accuracy and consistency.
- The integration of rules effectively reduces spatial and logical inconsistencies, leading to more coherent scene graphs. The rule verification process filters out noisy or incorrect rules, ensuring robustness. The computational overhead remains minimal, with rule mining taking only a few minutes and reasoning completing within hundreds of milliseconds per image, confirming practical deployability.
- Ablation studies reveal that spatial constraints contribute most to reducing physically impossible relations, while functional and relational rules improve logical coherence and relation diversity. The framework's architecture-agnostic design enables broad applicability, with consistent improvements observed across different neural models and datasets.
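The summary does not define the Constraint Violation Rate precisely. As a rough illustration, the sketch below assumes CVR is the fraction of predicted relations that violate at least one constraint; the two example checks and all function names are hypothetical.

```python
def constraint_violation_rate(relations, constraints):
    """Fraction of relations that violate at least one constraint.

    relations: list of (subject, predicate, object) triples.
    constraints: callables taking (relation, all_relations) and returning
    True when the relation violates the constraint.
    This per-relation definition is an assumption made for this sketch.
    """
    if not relations:
        return 0.0
    violating = sum(
        1 for rel in relations
        if any(check(rel, relations) for check in constraints)
    )
    return violating / len(relations)


def self_relation(rel, _all_relations):
    """An object cannot stand in a relation with itself."""
    subj, _pred, obj = rel
    return subj == obj


def mutual_containment(rel, all_relations):
    """Two objects cannot each be inside the other."""
    subj, pred, obj = rel
    return pred == "inside" and (obj, "inside", subj) in all_relations


if __name__ == "__main__":
    checks = [self_relation, mutual_containment]
    raw = [("cup", "on", "table"), ("cup", "inside", "box"),
           ("box", "inside", "cup"), ("table", "on", "table")]
    refined = [("cup", "on", "table"), ("cup", "inside", "box")]
    print(constraint_violation_rate(raw, checks))      # 0.75
    print(constraint_violation_rate(refined, checks))  # 0.0
```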
Significance
This work addresses a critical bottleneck in scene graph generation: the inability of purely data-driven models to enforce deep structural and commonsense constraints, especially under annotation sparsity. By integrating symbolic reasoning with neural predictions, the framework enhances the logical and spatial plausibility of generated scene graphs, which is vital for downstream tasks like visual reasoning, question answering, and autonomous navigation. Its model-agnostic and rule-based nature ensures broad applicability, providing a pathway toward more trustworthy and interpretable visual understanding systems. The formal, verifiable corrections introduced by the ASP-based approach offer a significant step forward in bridging the gap between deep learning and symbolic AI, fostering more robust and explainable scene understanding.
Technical Contribution
The paper's main technical innovations include:
- An automatic, data-driven rule mining pipeline that extracts spatial, functional, and relational regularities from annotated datasets without manual intervention.
- The formalization of these rules into an ASP program, enabling logical reasoning over neural predictions.
- The development of abductive reasoning procedures that combine neural scores with symbolic constraints, optimizing for globally consistent scene graphs.
- A rule verification mechanism that filters out unreliable rules based on empirical validation, ensuring robustness.
- Demonstration of the framework's transferability across different neural architectures and datasets, with consistent performance gains.
These contributions collectively advance the integration of symbolic reasoning into neural scene understanding, providing a scalable, interpretable, and effective method for knowledge-guided scene graph refinement.
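The rule verification idea can be sketched as a simple empirical filter: a candidate rule is kept only if annotated validation scenes rarely violate it. This is one plausible reading, not the paper's procedure; the `max_violation_rate` threshold, the helper names, and the per-scene violation measure are assumptions.

```python
from collections import Counter


def at_most_one(predicate):
    """Build a rule: every object has at most one subject for `predicate`."""
    def holds(scene):
        counts = Counter(obj for _s, p, obj in scene if p == predicate)
        return all(n <= 1 for n in counts.values())
    return holds


def verify_rules(candidate_rules, validation_scenes, max_violation_rate=0.05):
    """Keep rules violated by at most max_violation_rate of the scenes.

    candidate_rules: {name: callable(scene) -> True if the scene satisfies it}.
    Returns {name: observed violation rate} for the rules that pass.
    With no validation scenes there is no evidence, so nothing is kept.
    """
    kept = {}
    for name, holds in candidate_rules.items():
        if not validation_scenes:
            continue
        violations = sum(1 for scene in validation_scenes if not holds(scene))
        rate = violations / len(validation_scenes)
        if rate <= max_violation_rate:
            kept[name] = rate
    return kept


if __name__ == "__main__":
    scenes = [
        [("man", "driving", "car"), ("man", "riding", "horse")],
        [("woman", "driving", "bus"), ("kid", "riding", "bus"),
         ("man", "riding", "bus")],
        [("man", "driving", "truck")],
        [("girl", "riding", "bike")],
    ]
    candidates = {
        "one_driver_per_vehicle": at_most_one("driving"),
        "one_rider_per_object": at_most_one("riding"),
    }
    # The rider rule fails on the bus scene and is filtered out.
    print(verify_rules(candidates, scenes))
```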
Novelty
The work combines fully automated, data-driven extraction of commonsense rules from annotated datasets with ASP-based abductive reasoning for scene graph refinement. Unlike prior approaches that rely on external knowledge bases or manually crafted rules, it mines rules through statistical analysis and formalizes them in a verifiable reasoning framework. Its architecture-agnostic design allows integration with various neural models without retraining, and the formal reasoning makes each correction explainable and checks it for consistency with the mined constraints. The novelty lies in this combination of automatic rule mining, formal logic, and abductive inference applied to deep learning-based scene understanding.
Limitations
- The rule mining process depends heavily on the quality and diversity of training data; noisy or biased datasets may produce less reliable rules, affecting reasoning outcomes.
- ASP-based abductive reasoning, while efficient, may face scalability issues as the number of objects and relations increases significantly, potentially limiting real-time applications.
- The current framework primarily targets static images; extending to dynamic scenes with temporal and causal relations remains an open challenge.
- The approach assumes that the mined rules sufficiently cover the scene semantics; rare or context-specific relations may still be misclassified or missed, requiring further rule expansion.
Future Work
Future research will focus on extending the framework to dynamic and temporal scenes, integrating causal reasoning, and enabling end-to-end training that combines neural and symbolic components. Additionally, efforts will be made to automate rule updating through continual learning, adapt to new domains with minimal supervision, and optimize reasoning efficiency for real-time deployment. Exploring hybrid models that jointly learn rules and predictions could further enhance robustness and scalability, paving the way for more comprehensive and trustworthy scene understanding systems.
AI Executive Summary
Scene graphs serve as a powerful means to structurally interpret visual scenes, enabling applications from image captioning to autonomous navigation. However, current deep learning-based scene graph generation (SGG) models often struggle with relation accuracy, especially under annotation sparsity and long-tail distributions. These models tend to produce relations that violate physical or logical constraints, undermining their reliability and interpretability. Recognizing this challenge, the present study introduces a novel knowledge refinement framework that leverages visual commonsense to improve the consistency and accuracy of scene graphs.
The core innovation lies in automatically mining spatial, functional, and relational rules directly from training data. These rules encapsulate essential regularities, such as spatial configurations (e.g., containment, relative direction), role capacities (e.g., one driver per vehicle), and logical relations (e.g., symmetry, transitivity). Once mined, they are encoded into an ASP (Answer Set Programming) formalism, which enables rigorous, declarative reasoning over neural predictions. During inference, the neural model's relation predictions are transformed into ASP facts, and abductive reasoning is performed to select relations that satisfy the constraints, effectively filtering out physically impossible or illogical relations while recovering missing ones.
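The inference step described above can be illustrated with a small sketch. It renders scored predictions as ASP-style facts and then selects the highest-scoring subset that violates no constraint. Note the substitution: a brute-force search stands in for the ASP solver, so this only mimics the filtering behaviour and does not recover missing relations. The `cand/4` fact name, the integer score scaling, and the function names are assumptions.

```python
from collections import Counter
from itertools import combinations


def to_asp_facts(predictions):
    """Render {(subj, pred, obj): score} as ASP facts.

    Scores are scaled to integers (0-100) because ASP has no floats.
    cand/4 is a hypothetical fact name chosen for this sketch.
    """
    return [
        f"cand({s}, {p}, {o}, {round(score * 100)})."
        for (s, p, o), score in predictions.items()
    ]


def too_many_drivers(relations):
    """Violation check for the rule 'one driver per vehicle'."""
    drivers = Counter(o for _s, p, o in relations if p == "driving")
    return any(n > 1 for n in drivers.values())


def select_consistent(predictions, violates):
    """Return the consistent subset of relations with the highest total score.

    Brute-force stand-in for the solver: exponential in the number of
    candidates, so suitable only for tiny examples.
    """
    candidates = list(predictions)
    best, best_score = [], 0.0
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            if violates(subset):
                continue
            score = sum(predictions[r] for r in subset)
            if score > best_score:
                best, best_score = list(subset), score
    return best


if __name__ == "__main__":
    predictions = {
        ("man", "driving", "car"): 0.9,
        ("dog", "driving", "car"): 0.4,
        ("man", "wearing", "hat"): 0.7,
    }
    print("\n".join(to_asp_facts(predictions)))
    # The lower-scoring second driver is dropped.
    print(select_consistent(predictions, too_many_drivers))
```

A real ASP encoding would express the same preference with optimization statements over the candidate facts rather than by enumerating subsets.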
Extensive experiments on three benchmarks (VG150, PSG, and IndoorVG) demonstrate the effectiveness of this approach. Across architectures such as Motifs, Transformer, and REACT++, the method consistently improves F1@50 by roughly 4-8 percentage points, with the most notable gains in long-tail and zero-shot relation predictions. The Constraint Violation Rate (CVR) drops by approximately 15%, indicating more plausible relations. These improvements are achieved without retraining the neural models, highlighting the framework's flexibility and efficiency.
This work significantly advances the integration of symbolic reasoning into neural scene understanding, addressing a long-standing gap between data-driven learning and structured knowledge. Its model-agnostic, explainable, and scalable design opens new avenues for deploying more trustworthy AI systems in real-world applications like autonomous driving, robotics, and surveillance. Future directions include extending the framework to dynamic scenes, incorporating causal reasoning, and developing end-to-end trainable models that unify neural and symbolic components, ultimately pushing the frontier of intelligent visual perception.
Deep Dive
Abstract
Learning-driven Scene Graph Generation (SGG) models excel on frequent relation types but degrade sharply under annotation sparsity, failing to capture reliable visual commonsense knowledge. We propose a model-agnostic, semantically-guided knowledge refinement framework that systematically mines commonsense-grounded constraints from training data - capturing spatial, functional, and qualitative relational regularities - and uses general declarative commonsense reasoning to correct and refine ranked SGG predictions at inference time. The framework requires no manual rule authoring, no model retraining, and transfers across datasets and architectures. On three standard benchmarks, we obtain consistent improvements over strong baselines, demonstrating that structured visual commonsense reasoning over deep scene semantics is a practical and effective complement to purely learning-based scene graph generation.