A Turbo-Inference Strategy for Object Detection and Instance Segmentation

TL;DR

Proposes Turbo-Inference, an iterative inference strategy leveraging detection-segmentation feedback, improving COCO and Cityscapes mAP by over 1% without retraining.

cs.CV, 2026-06-11

Zhen Zhao, Gang Zhang, Xiaolin Hu, Liang Tang


Object Detection, Instance Segmentation, Inference Strategy, Multi-task Learning, Deep Learning

Key Findings

Methodology

This paper introduces a Turbo-Inference approach that creates a closed-loop interaction between the detection and segmentation tasks during inference. It employs two modules: a turbo-detection head and a turbo-segmentation head. The turbo-detection head refines the initial detection boxes and classification scores using the coarse masks predicted by a baseline model, drawing on the spatial structure of the masks and an uncertainty metric (Maskness). The turbo-segmentation head then uses the refined boxes to generate more accurate masks via RoIAlign and convolutional layers. The process iterates multiple times, with each cycle improving detection localization and mask quality. Crucially, the method does not require retraining the network, which makes it compatible with existing architectures such as Mask R-CNN, HTC, and RTMDet. Experiments on the COCO, Cityscapes, and iFLYTEK datasets show consistent performance gains, supporting the effectiveness of the iterative feedback mechanism.
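The closed loop described above can be sketched as a plain alternation between the two heads. This is a minimal illustration, not the authors' implementation: the function name `turbo_inference`, the callables `turbo_detect` and `turbo_segment`, and the toy stand-ins below are all hypothetical, and the default of three rounds is only an example value.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Mask = List[List[float]]                 # soft mask, values in [0, 1]


def turbo_inference(
    boxes: List[Box],
    scores: List[float],
    masks: List[Mask],
    turbo_detect: Callable[[List[Box], List[float], List[Mask]],
                           Tuple[List[Box], List[float]]],
    turbo_segment: Callable[[List[Box]], List[Mask]],
    num_iters: int = 3,
) -> Tuple[List[Box], List[float], List[Mask]]:
    """Alternate box/score refinement and re-segmentation for num_iters rounds."""
    for _ in range(num_iters):
        # Detection step: refine boxes and scores from the current masks.
        boxes, scores = turbo_detect(boxes, scores, masks)
        # Segmentation step: predict new masks inside the refined boxes.
        masks = turbo_segment(boxes)
    return boxes, scores, masks


# Toy stand-ins (hypothetical): each round moves a box halfway to a fixed target.
TARGET: Box = (10.0, 10.0, 20.0, 20.0)


def toy_detect(boxes, scores, masks):
    moved = [tuple((b + t) / 2 for b, t in zip(box, TARGET)) for box in boxes]
    return moved, scores


def toy_segment(boxes):
    return [[[1.0]] for _ in boxes]  # one dummy 1x1 mask per box


final_boxes, final_scores, final_masks = turbo_inference(
    [(0.0, 0.0, 40.0, 40.0)], [0.9], [[[1.0]]], toy_detect, toy_segment
)
print(final_boxes)
```

In a real pipeline the two callables would wrap the heads of the pretrained model; because the loop only calls them, no weights change, which is what makes the strategy retraining-free.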

Key Results

On COCO, applying Turbo-Inference to Mask R-CNN with ResNet-50-FPN backbone improved box AP by 1.1% and mask AP by 1.3%, with FPS decreasing from 15.7 to 12.0, indicating a favorable tradeoff between accuracy and speed.
On the Cityscapes and iFLYTEK datasets, similar improvements were observed, especially in challenging scenarios with occlusion and complex backgrounds, supporting the method's robustness.
Repeated iterations of the feedback loop progressively refined the detection boxes and masks. Returns diminished after a few cycles, but the final results still improved on the baseline models.
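To make the speed side of the tradeoff concrete, the quoted FPS figures can be converted into per-image latency. The sketch below uses only the numbers reported above; it assumes FPS is the reciprocal of single-image latency (batch size 1), which the text does not state.

```python
# Figures quoted in the text above.
baseline_fps = 15.7
turbo_fps = 12.0
box_ap_gain = 1.1
mask_ap_gain = 1.3

# Assumption: FPS is the reciprocal of per-image latency (batch size 1).
baseline_ms = 1000.0 / baseline_fps
turbo_ms = 1000.0 / turbo_fps
extra_ms = turbo_ms - baseline_ms
throughput_drop = 1.0 - turbo_fps / baseline_fps

print(f"latency: {baseline_ms:.1f} ms -> {turbo_ms:.1f} ms (+{extra_ms:.1f} ms)")
print(f"throughput drop: {throughput_drop:.1%}")
print(f"gains: +{box_ap_gain} box AP, +{mask_ap_gain} mask AP")
```

Under that assumption, the gains cost roughly 20 ms per image, or about a quarter of the baseline throughput.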

Significance

This work introduces a novel inference-stage feedback mechanism that enhances the synergy between detection and segmentation tasks, addressing the long-standing issue of detection accuracy bottlenecking segmentation quality. By operating solely during inference, it avoids retraining costs and can be integrated into existing systems seamlessly. The approach improves robustness in cluttered and occluded scenes, which are common in real-world applications like autonomous driving, remote sensing, and surveillance. Its ability to boost performance without retraining paves the way for deploying more accurate perception systems in resource-constrained environments, marking a significant step forward in multi-task visual understanding.

Technical Contribution

The core technical innovation is the design of a multi-stage, feedback-driven inference pipeline that iteratively refines detection and segmentation outputs. The turbo-detection head employs mask-based box refinement and uncertainty-guided classification adjustment, while the turbo-segmentation head leverages refined detection boxes for more precise mask prediction. The entire process forms a closed loop, enabling progressive enhancement of both tasks. This approach differs fundamentally from existing multi-task models that rely on joint training; here, the optimization occurs dynamically during inference, making it flexible and easy to deploy. The method also introduces the Maskness metric for uncertainty quantification, which effectively filters redundant predictions and improves overall accuracy.
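The summary does not define Maskness. A definition used elsewhere in the instance-segmentation literature (for example for mask rescoring in SOLO) is the mean soft-mask probability over the pixels that pass the binarization threshold. The sketch below assumes that definition and a multiplicative score adjustment; both are assumptions rather than details confirmed by this paper, and the names `maskness` and `rescore` are illustrative.

```python
from typing import List

Mask = List[List[float]]  # soft mask, values in [0, 1]


def maskness(soft_mask: Mask, threshold: float = 0.5) -> float:
    """Mean probability over foreground pixels (assumed definition)."""
    foreground = [p for row in soft_mask for p in row if p > threshold]
    if not foreground:
        return 0.0  # no confident pixel: treat the mask as fully uncertain
    return sum(foreground) / len(foreground)


def rescore(cls_score: float, soft_mask: Mask, threshold: float = 0.5) -> float:
    """Scale the classification score by mask confidence (assumed rule)."""
    return cls_score * maskness(soft_mask, threshold)


confident_mask = [[0.9, 0.9], [0.1, 0.1]]
uncertain_mask = [[0.6, 0.55], [0.4, 0.1]]

confident_score = rescore(0.8, confident_mask)
uncertain_score = rescore(0.8, uncertain_mask)
print(confident_score, uncertain_score)
```

With equal classification scores, the detection whose mask is less certain ends up ranked lower, which is how an uncertainty measure of this kind can suppress redundant predictions.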

Novelty

This is the first work to implement a purely inference-stage, iterative detection-segmentation feedback loop that leverages mask structure and uncertainty to refine predictions. Unlike prior multi-task methods that require joint training or additional supervision, this approach dynamically enhances results without retraining. Its innovative use of mask-based refinement and uncertainty-guided filtering distinguishes it from existing post-processing or multi-stage refinement techniques, offering a new paradigm for real-time, high-precision multi-task inference.

Limitations

The iterative process increases inference time, which may limit real-time applications, especially with many cycles. Optimization of iteration count is necessary for balancing speed and accuracy.
The method’s effectiveness depends on initial detection and segmentation quality; poor initial results limit the potential of refinement, especially in extremely cluttered scenes.
In scenarios with severe occlusion or tiny objects, the mask-based refinement may struggle, and the method might require additional modules or multi-scale features to handle such cases effectively.

Future Work

Future research could focus on adaptive iteration schemes that determine the optimal number of refinement cycles based on scene complexity, thus balancing speed and accuracy dynamically. Integrating lightweight attention mechanisms or graph neural networks could further improve robustness against occlusion and boundary ambiguity. Extending the framework to 3D detection and multi-modal data, such as combining LiDAR and RGB images, could broaden its applicability. Additionally, exploring hardware-aware optimization for deployment on edge devices remains an important direction.
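One simple form an adaptive iteration scheme could take is to stop once the boxes no longer move between rounds. The sketch below is not from the paper: the convergence rule, the `iou_tol` threshold, and the toy refinement step are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def refine_until_stable(
    boxes: List[Box],
    refine_step: Callable[[List[Box]], List[Box]],
    max_iters: int = 10,
    iou_tol: float = 0.7,
) -> Tuple[List[Box], int]:
    """Run refine_step until every box overlaps its previous version by iou_tol."""
    for iteration in range(1, max_iters + 1):
        new_boxes = refine_step(boxes)
        stable = all(box_iou(old, new) >= iou_tol
                     for old, new in zip(boxes, new_boxes))
        boxes = new_boxes
        if stable:
            return boxes, iteration
    return boxes, max_iters


# Toy refinement step (hypothetical): move each box halfway to a fixed target.
TARGET: Box = (10.0, 10.0, 20.0, 20.0)


def toy_step(boxes: List[Box]) -> List[Box]:
    return [tuple((b + t) / 2 for b, t in zip(box, TARGET)) for box in boxes]


stable_boxes, rounds_used = refine_until_stable([(0.0, 0.0, 40.0, 40.0)], toy_step)
print(stable_boxes, rounds_used)
```

A rule like this spends extra rounds only on images whose boxes are still changing, so easy scenes pay less than a fixed iteration count would charge them.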

AI Executive Summary

Object detection and instance segmentation are fundamental tasks in computer vision, underpinning applications from autonomous driving to remote sensing. Traditional top-down methods follow a detect-then-segment paradigm, where an initial detector localizes objects with bounding boxes, and a subsequent segmentation head predicts pixel-level masks within these boxes. While effective, this approach heavily relies on the accuracy of the initial detection; errors in bounding boxes directly impair segmentation quality. Moreover, existing methods typically treat detection and segmentation as separate stages, lacking effective interaction during inference.

Recent advances have sought to bridge this gap by designing joint architectures or multi-stage refinement pipelines. However, these often involve complex training procedures or additional supervision, increasing model complexity and deployment difficulty. The challenge remains: how to leverage the rich information embedded in segmentation masks to refine detection results dynamically, without retraining the entire model?

This paper introduces Turbo-Inference, a novel inference-stage strategy that creates a feedback loop between the detection and segmentation tasks. The core idea is to iteratively refine detection boxes and classification scores using mask information, and vice versa, thereby progressively enhancing both outputs. The method employs two modules: a turbo-detection head, which refines detection boxes based on mask structure and uncertainty metrics, and a turbo-segmentation head, which generates more accurate masks from the refined boxes. These modules operate in a closed loop over multiple iterations, akin to a turbocharger boosting engine performance.

The approach is model-agnostic, compatible with architectures like Mask R-CNN, HTC, and RTMDet, and requires no retraining. Extensive experiments on the COCO, Cityscapes, and iFLYTEK datasets demonstrate consistent improvements in detection and segmentation metrics. For instance, applying Turbo-Inference to Mask R-CNN with a ResNet-50-FPN backbone yields a 1.1% increase in box AP and 1.3% in mask AP, while inference speed drops from 15.7 to 12.0 FPS. The iterative process reduces false positives, improves localization, and enhances mask quality, especially in challenging scenarios involving occlusion and clutter.

This work advances the state of the art by introducing a simple yet powerful mechanism for multi-task refinement during inference. Its ability to boost performance without additional training cost makes it highly practical for real-world deployment. The broader impact includes enabling more accurate perception systems in autonomous vehicles, surveillance, and remote sensing, where robustness and efficiency are paramount. Future directions include adaptive iteration control, integration with multi-scale features, and extension to 3D and multi-modal tasks, promising a rich avenue for further research and industrial application.

Deep Analysis

Background

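Object detection and instance segmentation are core computer vision tasks that have advanced considerably over the years. Early methods such as the R-CNN series (Girshick, 2014; 2015) achieved high detection accuracy through region proposals and multi-stage classification. With the rise of deep learning, Faster R-CNN (Ren et al., 2015) introduced the region proposal network (RPN), greatly increasing detection speed. Single-stage detectors such as YOLO (Redmon et al., 2016) and FCOS (Tian et al., 2019) reach high accuracy while remaining fast. For instance segmentation, Mask R-CNN (He et al., 2017) combined region proposals with a fully convolutional network and became the standard baseline. Multi-task joint learning has gradually become mainstream, driving the joint optimization of detection and segmentation, but problems remain: detection errors degrade segmentation, and models are complex. Multi-stage refinement (e.g., Cascade R-CNN) and multi-task fusion (e.g., HTC) keep improving performance, but mechanisms for interaction between the tasks at inference time remain underexplored.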

Core Problem

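Existing detect-then-segment methods depend heavily on the accuracy of the detection boxes, so detection errors directly degrade segmentation quality. Blurry box boundaries, occlusion, and complex backgrounds severely limit model performance. In addition, detection and segmentation are usually optimized jointly during training but lack an effective interaction mechanism at inference time, so the two tasks do not fully complement each other. The open problem is how to make detection and segmentation cooperate dynamically at inference time and thereby raise overall performance. This bears not only on model accuracy but also on real-time behavior and robustness in practical applications.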

Innovation

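The novelty of the proposed Turbo-Inference strategy is a closed loop between detection and segmentation that uses the spatial structure of masks and uncertainty information to refine detection boxes and classification scores in reverse. Specifically:
• the turbo-detection head refines detection boxes and classification scores through reverse mapping of the masks and an uncertainty metric;
• the turbo-segmentation head uses the refined boxes to generate more accurate masks.
The two modules iterate several times at inference, forming a closed loop that progressively improves detection and segmentation. Unlike conventional methods that optimize the tasks jointly during training, this work exchanges information only at inference, which avoids a complex training procedure and greatly simplifies deployment. The strategy is compatible with a range of detection and segmentation architectures, giving it broad applicability and extensibility.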

Methodology

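• Starting from a pretrained detection model, first run standard detection and coarse mask prediction at inference time;
• The turbo-detection head uses the spatial structure of the masks, together with the Maskness and box refinement modules, to refine the detection boxes in reverse:
  • map each mask back onto its box and refine the box boundaries using a threshold;
  • use mask uncertainty to adjust the classification scores and filter redundant detections;
• The turbo-segmentation head takes the refined boxes, extracts features with RoIAlign, and predicts finer masks;
• The two steps above are iterated several times, forming a closed loop that progressively improves the detection and segmentation results;
• Different stopping conditions and iteration counts are used to trade performance gains against computational cost.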
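The threshold-based box refinement step can be illustrated with a tight bounding box around the confident mask pixels, mapped from the mask grid back to image coordinates. This is a simplified sketch under stated assumptions: the mask is taken to be a small grid covering the current box (as in Mask R-CNN style heads), the threshold of 0.5 is arbitrary, and the function name `refine_box_from_mask` is hypothetical. The paper's actual refinement rule may differ.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image coordinates
Mask = List[List[float]]                 # soft mask grid covering the box


def refine_box_from_mask(roi_box: Box, soft_mask: Mask, threshold: float = 0.5) -> Box:
    """Shrink roi_box to the extent of mask cells whose value exceeds threshold."""
    x1, y1, x2, y2 = roi_box
    rows, cols = len(soft_mask), len(soft_mask[0])
    fg_rows = [r for r, row in enumerate(soft_mask) if any(p > threshold for p in row)]
    fg_cols = [c for c in range(cols) if any(row[c] > threshold for row in soft_mask)]
    if not fg_rows:
        return roi_box  # nothing above threshold: keep the original box
    cell_w = (x2 - x1) / cols  # width of one mask cell in image pixels
    cell_h = (y2 - y1) / rows  # height of one mask cell in image pixels
    return (
        x1 + min(fg_cols) * cell_w,
        y1 + min(fg_rows) * cell_h,
        x1 + (max(fg_cols) + 1) * cell_w,
        y1 + (max(fg_rows) + 1) * cell_h,
    )


coarse_mask = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.1],
    [0.2, 0.9, 0.7, 0.1],
    [0.0, 0.1, 0.1, 0.0],
]
refined = refine_box_from_mask((100.0, 50.0, 140.0, 90.0), coarse_mask)
print(refined)
```

Because the mask lives inside the current box, this version can only shrink a box. Letting a box grow would need a mask predicted over a region larger than the box, which this sketch does not attempt.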

Experiments

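• Validation on the COCO, Cityscapes, and iFLYTEK datasets, using AP to evaluate detection and segmentation performance;
• Comparison across backbones (ResNet-50, ResNet-101, Swin Transformer, and others);
• Different numbers of iteration rounds (e.g., 3 or 4) to observe how performance changes;
• Standard training strategies to keep the models consistent, with the focus on performance gains at the inference stage;
• Ablation studies to verify the contribution of each module and to analyze the relation between iteration count and performance.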

Results

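• On COCO, Turbo-Inference raises the Mask R-CNN baseline by 1.1% box AP and 1.3% mask AP; detection speed drops from 15.7 FPS to 12.0 FPS in exchange for the accuracy gain;
• On Cityscapes and iFLYTEK, both detection and segmentation performance improve, most notably under occlusion and complex backgrounds;
• Successive iterations bring progressive gains, supporting the effectiveness of the closed loop, and combining the method with Soft NMS improves performance further.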
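For readers unfamiliar with Soft NMS, the sketch below shows its Gaussian variant: instead of discarding boxes that overlap a higher-scoring box, it decays their scores by `exp(-IoU^2 / sigma)`. The implementation is a plain-Python illustration with example values for `sigma` and `score_thresh`; how the paper wires Soft NMS into the loop is not described in this summary.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes: List[Box], scores: List[float],
             sigma: float = 0.5, score_thresh: float = 0.001) -> List[Tuple[Box, float]]:
    """Gaussian Soft-NMS: decay overlapping scores instead of removing boxes."""
    remaining = list(zip(boxes, scores))
    kept: List[Tuple[Box, float]] = []
    while remaining:
        # Pick the highest-scoring remaining box and keep it.
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best)
        kept.append((best_box, best_score))
        # Decay the scores of the others according to their overlap with it.
        decayed = []
        for box, score in remaining:
            score *= math.exp(-(box_iou(best_box, box) ** 2) / sigma)
            if score >= score_thresh:
                decayed.append((box, score))
        remaining = decayed
    return kept


demo_boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
demo_scores = [0.9, 0.8, 0.7]
kept = soft_nms(demo_boxes, demo_scores)
for box, score in kept:
    print(box, round(score, 3))
```

In the demo, the second box overlaps the first heavily, so its score is reduced and it falls below the distant third box in the ranking, but it is not removed outright as it would be under hard NMS.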

Applications

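• Object detection and scene understanding in autonomous driving systems, improving the accuracy of vehicle perception;
• High-precision recognition and segmentation of ground objects in remote sensing image analysis, supporting land-use monitoring;
• Intelligent surveillance and security, strengthening the detection of abnormal behavior and targets and improving system robustness.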

Limitations & Outlook

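• Repeated iteration increases computational cost and hurts real-time performance, so the iteration strategy needs optimizing;
• When the initial detection or segmentation quality is poor, reverse refinement has limited effect, especially in extreme scenes;
• Under extreme occlusion or for very small objects, mask-based reverse refinement may introduce noise; future work should bring in multi-scale features and multi-modal information.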

Abstract

Object detection and instance segmentation tasks are closely related. Existing top-down instance segmentation methods usually follow a detect-then-segment paradigm, where an initial detector is used to recognize and localize objects with bounding boxes, followed by the segmentation of an instance mask within each bounding box. In such methods, the detection accuracy directly influences the subsequent segmentation performance. However, previous research has seldom explored the impact of the instance segmentation task on object detection. In this paper, we present a turbo-inference strategy for the top-down methods that leverages the complementary information between detection and segmentation tasks iteratively. Specifically, we design two modules, a turbo-detection head and a turbo-segmentation head, which facilitate communication between the tasks. The two modules form a closed loop that interlaces the detection and segmentation results without retraining the model. Comprehensive experiments on the COCO, iFLYTEK, and Cityscapes datasets demonstrate that our method substantially enhances both detection and segmentation accuracies with a certain increase in computational cost. The proposed method represents a tradeoff between prediction accuracy and inference speed. Code is available at https://github.com/zhaozhen2333/Turbo-Learning.git.


References (20)

Shu Liu, Lu Qi, Haifang Qin et al. Path Aggregation Network for Instance Segmentation. 2018.

Kaiming He, Georgia Gkioxari, Piotr Dollár et al. Mask R-CNN. 2017.

Chengqi Lyu, Wenwei Zhang, Haian Huang et al. RTMDet: An Empirical Study of Designing Real-Time Object Detectors. 2022.

Saining Xie, Ross B. Girshick, Piotr Dollár et al. Aggregated Residual Transformations for Deep Neural Networks. 2016.

Kai Chen, Jiangmiao Pang, Jiaqi Wang et al. Hybrid Task Cascade for Instance Segmentation. 2019.

Sanghyun Woo, Shoubhik Debnath, Ronghang Hu et al. ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. 2023.

Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick et al. Feature Pyramid Networks for Object Detection. 2016.

Kaiming He, X. Zhang, Shaoqing Ren et al. Deep Residual Learning for Image Recognition. 2015.

Chien-Yao Wang, H. Liao, I-Hau Yeh et al. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. 2019.

Marius Cordts, Mohamed Omran, Sebastian Ramos et al. The Cityscapes Dataset for Semantic Urban Scene Understanding. 2016.

Ze Liu, Yutong Lin, Yue Cao et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 2021.

Kai Chen, Jiaqi Wang, Jiangmiao Pang et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. 2019.

Navaneeth Bodla, Bharat Singh, R. Chellappa et al. Soft-NMS -- Improving Object Detection with One Line of Code. 2017.

Wei Feng, Wentao Liu, Tong Li et al. Turbo Learning Framework for Human-Object Interactions Recognition and Human Pose Estimation. 2019.

Tsung-Yi Lin, M. Maire, Serge J. Belongie et al. Microsoft COCO: Common Objects in Context. 2014.

Junjie He, Pengyu Li, Yifeng Geng et al. FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation. 2023.

Lei Ke, Yu-Wing Tai, Chi-Keung Tang. Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers. 2021.

A. Krizhevsky, I. Sutskever, Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. 2012.

Nicolas Carion, Francisco Massa, Gabriel Synnaeve et al. End-to-End Object Detection with Transformers. 2020.

Roland S. Zimmermann, Julien N. Siems. Faster Training of Mask R-CNN by Focusing on Instance Boundaries. 2018.
