Visual Commonsense Driven Knowledge Refinements for Scene Graph Generation

TL;DR

Proposes a knowledge refinement framework that combines automatic rule mining with ASP-based abductive reasoning, improving scene graph generation by roughly 4-8 percentage points of F1@50 across three benchmarks.

cs.CV · Advanced · 2026-06-05

Maëlic Neau, Salim Baloch, Jakob Suchan, Zoe Falomir, Mehul Bhatt

Scene Graph Generation, Visual Commonsense, Knowledge Reasoning, ASP, Deep Learning

Key Findings

Methodology

This paper introduces a model-agnostic, semantically-guided knowledge refinement framework that enhances scene graph generation by systematically mining commonsense-grounded constraints from training data. The process involves an offline rule mining stage, where spatial, functional, and relational regularities are extracted using statistical analysis of annotated datasets such as VG150, PSG, and IndoorVG. These rules are compiled into an ASP (Answer Set Programming) program, which encodes spatial configurations, role cardinalities, and logical regularities. During inference, the neural scene graph predictions are converted into ASP facts, and abductive reasoning is performed to select relations that satisfy the mined constraints, effectively filtering out physically impossible or illogical relations and recovering missing ones. This approach does not require retraining or manual rule authoring, and it transfers seamlessly across different architectures and datasets, providing a formal, interpretable, and scalable solution to improve the logical consistency of scene graphs.
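The step in which neural predictions become ASP facts can be sketched as follows. This is a minimal illustration, not the paper's encoding: the prediction tuple layout, the fact name `candidate/4`, and the integer scaling factor are all assumptions made for the example. Scores are scaled to integers because ASP solvers such as clingo do not handle floating-point terms.

```python
from typing import List, Tuple

# Hypothetical prediction format: (subject_id, predicate_label, object_id, confidence in [0, 1]).
# Object ids are assumed to already be valid lowercase ASP constants (e.g. "person1").
Prediction = Tuple[str, str, str, float]


def _atom(label: str) -> str:
    """Turn a predicate label such as 'sitting on' into a lowercase ASP constant."""
    return label.strip().lower().replace(" ", "_")


def to_asp_facts(predictions: List[Prediction], scale: int = 1000) -> List[str]:
    """Render relation predictions as ASP fact strings.

    The fact name candidate/4 is an illustrative choice, not the paper's.
    Confidences are scaled and rounded to integers for the solver.
    """
    facts = []
    for subj, pred, obj, score in predictions:
        weight = round(score * scale)
        facts.append(f"candidate({subj},{_atom(pred)},{obj},{weight}).")
    return facts


if __name__ == "__main__":
    preds = [
        ("person1", "sitting on", "chair1", 0.87),
        ("person1", "riding", "chair1", 0.12),
    ]
    print("\n".join(to_asp_facts(preds)))
```

The resulting strings could be passed to an ASP solver together with the mined rule program; how the paper actually names and structures its facts is not stated in this summary.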

Key Results

Across three benchmarks (VG150, PSG, IndoorVG), the proposed method consistently improves F1@50 by approximately 4-8 percentage points over baseline models such as Motifs, Transformer, and REACT++, with the largest gains observed in long-tail and zero-shot relation predictions. The Constraint Violation Rate (CVR) decreases by about 15%, indicating fewer physically or logically implausible relations. The zero-shot recall (zsR@50) also improves by 4.5%, demonstrating enhanced generalization. The method maintains comparable object detection performance while significantly boosting relation accuracy and consistency.
The integration of rules effectively reduces spatial and logical inconsistencies, leading to more coherent scene graphs. The rule verification process filters out noisy or incorrect rules, ensuring robustness. The computational overhead remains minimal, with rule mining taking only a few minutes and reasoning completing within hundreds of milliseconds per image, confirming practical deployability.
Ablation studies reveal that spatial constraints contribute most to reducing physically impossible relations, while functional and relational rules improve logical coherence and relation diversity. The framework's architecture-agnostic design enables broad applicability, with consistent improvements observed across different neural models and datasets.
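This summary does not spell out how the Constraint Violation Rate is computed. One plausible reading, used here purely as an assumption, is the share of predicted relations that violate at least one constraint. The two constraints below (`irreflexive`, `single_driver`) are hypothetical examples, not rules taken from the paper.

```python
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
# A constraint returns True when the given triple violates it within the graph.
Constraint = Callable[[Triple, List[Triple]], bool]


def irreflexive(t: Triple, graph: List[Triple]) -> bool:
    """Illustrative constraint: an object cannot be related to itself."""
    return t[0] == t[2]


def single_driver(t: Triple, graph: List[Triple]) -> bool:
    """Illustrative constraint: a vehicle has at most one driver."""
    if t[1] != "driving":
        return False
    drivers = {s for s, p, o in graph if p == "driving" and o == t[2]}
    return len(drivers) > 1


def constraint_violation_rate(graph: List[Triple], constraints: Iterable[Constraint]) -> float:
    """Assumed definition: fraction of triples violating at least one constraint."""
    checks = list(constraints)
    if not graph:
        return 0.0
    violating = sum(1 for t in graph if any(c(t, graph) for c in checks))
    return violating / len(graph)


if __name__ == "__main__":
    graph = [
        ("man1", "driving", "car1"),
        ("woman1", "driving", "car1"),
        ("cup1", "on", "cup1"),
        ("cup1", "on", "table1"),
    ]
    print(constraint_violation_rate(graph, [irreflexive, single_driver]))
```

Under this assumed definition, comparing the value before and after refinement would give the kind of reduction the results describe.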

Significance

This work addresses a critical bottleneck in scene graph generation: the inability of purely data-driven models to enforce deep structural and commonsense constraints, especially under annotation sparsity. By integrating symbolic reasoning with neural predictions, the framework enhances the logical and spatial plausibility of generated scene graphs, which is vital for downstream tasks like visual reasoning, question answering, and autonomous navigation. Its model-agnostic and rule-based nature ensures broad applicability, providing a pathway toward more trustworthy and interpretable visual understanding systems. The formal, verifiable corrections introduced by the ASP-based approach offer a significant step forward in bridging the gap between deep learning and symbolic AI, fostering more robust and explainable scene understanding.

Technical Contribution

The paper's main technical innovations include:
• An automatic, data-driven rule mining pipeline that extracts spatial, functional, and relational regularities from annotated datasets without manual intervention.
• The formalization of these rules into an ASP program, enabling logical reasoning over neural predictions.
• The development of abductive reasoning procedures that combine neural scores with symbolic constraints, optimizing for globally consistent scene graphs.
• A rule verification mechanism that filters out unreliable rules based on empirical validation, ensuring robustness.
• Demonstration of the framework's transferability across different neural architectures and datasets, with consistent performance gains.
These contributions collectively advance the integration of symbolic reasoning into neural scene understanding, providing a scalable, interpretable, and effective method for knowledge-guided scene graph refinement.
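To make the mining and verification contributions concrete, here is a hedged sketch of how one rule family (role cardinality, such as "at most one subject per object for a predicate") might be mined from annotations and then checked on held-out data. The thresholds `min_support` and `min_conf`, the data layout, and the `selected/3` atom in the rendered constraint are assumptions for illustration; the paper's actual statistics and encoding may differ.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]   # (subject_id, predicate, object_id)
Dataset = List[List[Triple]]    # one list of annotated triples per image


def unique_subject_ratio(images: Dataset) -> Dict[str, Tuple[int, float]]:
    """Per predicate: (support, share of (predicate, object) pairs with exactly one subject)."""
    support: Counter = Counter()
    single: Counter = Counter()
    for triples in images:
        subjects = defaultdict(set)
        for s, p, o in triples:
            subjects[(p, o)].add(s)
        for (p, _), subs in subjects.items():
            support[p] += 1
            single[p] += int(len(subs) == 1)
    return {p: (support[p], single[p] / support[p]) for p in support}


def mine_cardinality_rules(train: Dataset, held_out: Dataset,
                           min_support: int = 2, min_conf: float = 0.95) -> List[str]:
    """Mine 'at most one subject' predicates, then keep those that hold on held-out data."""
    mined = {p for p, (n, ratio) in unique_subject_ratio(train).items()
             if n >= min_support and ratio >= min_conf}
    check = unique_subject_ratio(held_out)
    # Verification step (assumed policy): rules unseen or unreliable on held-out data are dropped.
    return sorted(p for p in mined if p in check and check[p][1] >= min_conf)


def to_asp_constraint(predicate: str) -> str:
    """Render a mined rule as an ASP integrity constraint (illustrative encoding)."""
    return f":- selected(S1,{predicate},O), selected(S2,{predicate},O), S1 != S2."


if __name__ == "__main__":
    train = [
        [("m1", "driving", "c1"), ("m1", "near", "c2"), ("w1", "near", "c2")],
        [("m2", "driving", "c3"), ("d1", "near", "c3"), ("m2", "near", "c3")],
    ]
    held_out = [[("m3", "driving", "c4")]]
    for rule in mine_cardinality_rules(train, held_out):
        print(to_asp_constraint(rule))
```

Dropping rules that never occur in the held-out split is a conservative choice made for this sketch; a real pipeline might instead keep them with lower priority.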

Novelty

The work's novelty lies in its fully automated, data-driven extraction of commonsense rules from annotated datasets, coupled with ASP-based abductive reasoning for scene graph refinement. Unlike prior approaches that rely on external knowledge bases or manual rule crafting, this method uses statistical analysis to mine rules and formalizes them into a verifiable reasoning framework. Its architecture-agnostic design allows integration with various neural models without retraining, and the formal reasoning supports explainability and consistency. The combination of automatic rule mining, formal logic, and abductive inference is what distinguishes it from earlier ways of integrating symbolic knowledge into deep learning-based scene understanding.

Limitations

The rule mining process depends heavily on the quality and diversity of training data; noisy or biased datasets may produce less reliable rules, affecting reasoning outcomes.
ASP-based abductive reasoning, while efficient, may face scalability issues as the number of objects and relations increases significantly, potentially limiting real-time applications.
The current framework primarily targets static images; extending to dynamic scenes with temporal and causal relations remains an open challenge.
The approach assumes that the mined rules sufficiently cover the scene semantics; rare or context-specific relations may still be misclassified or missed, requiring further rule expansion.

Future Work

Future research will focus on extending the framework to dynamic and temporal scenes, integrating causal reasoning, and enabling end-to-end training that combines neural and symbolic components. Additionally, efforts will be made to automate rule updating through continual learning, adapt to new domains with minimal supervision, and optimize reasoning efficiency for real-time deployment. Exploring hybrid models that jointly learn rules and predictions could further enhance robustness and scalability, paving the way for more comprehensive and trustworthy scene understanding systems.

AI Executive Summary

Scene graphs serve as a powerful means to structurally interpret visual scenes, enabling applications from image captioning to autonomous navigation. However, current deep learning-based scene graph generation (SGG) models often struggle with relation accuracy, especially under annotation sparsity and long-tail distributions. These models tend to produce relations that violate physical or logical constraints, undermining their reliability and interpretability. Recognizing this challenge, the present study introduces a novel knowledge refinement framework that leverages visual commonsense to improve the consistency and accuracy of scene graphs.

The core innovation lies in automatically mining spatial, functional, and relational rules directly from training data. These rules encapsulate essential regularities, such as spatial configurations (e.g., containment, relative direction), role capacities (e.g., one driver per vehicle), and logical relations (e.g., symmetry, transitivity). Once mined, they are encoded into an ASP (Answer Set Programming) formalism, which enables rigorous, declarative reasoning over neural predictions. During inference, the neural model's relation predictions are transformed into ASP facts, and abductive reasoning is performed to select relations that satisfy the constraints, effectively filtering out physically impossible or illogical relations while recovering missing ones.
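The selection step described above can be illustrated with a small pure-Python stand-in for the solver. This is not ASP and not the paper's procedure: it exhaustively searches subsets of candidate relations and keeps the highest-scoring subset that satisfies two hypothetical constraints (one driver per vehicle, and asymmetry of `inside`). It covers only the filtering side, not the recovery of missing relations, and its exponential cost makes it suitable only for a handful of candidates.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Candidate = Tuple[str, str, str, float]  # (subject, predicate, object, neural score)


def consistent(selection: Sequence[Candidate]) -> bool:
    """Check two illustrative constraints on a selected set of relations."""
    drivers = {}
    seen = set()
    for s, p, o, _ in selection:
        if p == "driving":
            if o in drivers and drivers[o] != s:
                return False  # role cardinality: one driver per vehicle
            drivers[o] = s
        seen.add((s, p, o))
    # Asymmetry: 'inside' cannot hold in both directions between two objects.
    return not any(p == "inside" and (o, p, s) in seen for s, p, o in seen)


def best_consistent_subset(candidates: Sequence[Candidate]) -> List[Candidate]:
    """Return the consistent subset with the highest total score (brute force)."""
    best: List[Candidate] = []
    best_score = 0.0
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            score = sum(c[3] for c in subset)
            if score > best_score and consistent(subset):
                best, best_score = list(subset), score
    return best


if __name__ == "__main__":
    candidates = [
        ("man1", "driving", "car1", 0.9),
        ("dog1", "driving", "car1", 0.4),
        ("dog1", "inside", "car1", 0.7),
        ("car1", "inside", "dog1", 0.2),
    ]
    for relation in best_consistent_subset(candidates):
        print(relation)
```

An ASP solver reaches the same kind of result declaratively, with constraints stated as rules and the score expressed as an optimization objective, which scales far better than this enumeration.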

Extensive experiments on three benchmarks (VG150, PSG, and IndoorVG) demonstrate the effectiveness of this approach. Across architectures such as Motifs, Transformer, and REACT++, the method consistently improves F1@50 by roughly 4-8 percentage points, with the most notable gains in long-tail and zero-shot relation predictions. The Constraint Violation Rate (CVR) drops by approximately 15%, indicating enhanced relation plausibility. These improvements are achieved without retraining the neural models, highlighting the framework's flexibility and efficiency.

This work significantly advances the integration of symbolic reasoning into neural scene understanding, addressing a long-standing gap between data-driven learning and structured knowledge. Its model-agnostic, explainable, and scalable design opens new avenues for deploying more trustworthy AI systems in real-world applications like autonomous driving, robotics, and surveillance. Future directions include extending the framework to dynamic scenes, incorporating causal reasoning, and developing end-to-end trainable models that unify neural and symbolic components, ultimately pushing the frontier of intelligent visual perception.

Deep Dive

Abstract

Learning-driven Scene Graph Generation (SGG) models excel on frequent relation types but degrade sharply under annotation sparsity, failing to capture reliable visual commonsense knowledge. We propose a model-agnostic, semantically-guided knowledge refinement framework that systematically mines commonsense-grounded constraints from training data - capturing spatial, functional, and qualitative relational regularities - and uses general declarative commonsense reasoning to correct and refine ranked SGG predictions at inference time. The framework requires no manual rule authoring, no model retraining, and transfers across datasets and architectures. On three standard benchmarks, we obtain consistent improvements over strong baselines, demonstrating that structured visual commonsense reasoning over deep scene semantics is a practical and effective complement to purely learning-based scene graph generation.
