DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models

TL;DR

DA-Flow combines diffusion and convolutional features to enhance optical flow estimation in degraded videos.

cs.CV 🔴 Advanced 2026-03-25 49 views
Jaewon Min Jaeeun Lee Yeji Choi Paul Hyunbin Cho Jin Hyeon Kim Tae-Young Lee Jongsik Ahn Hwayeong Lee Seonghyun Park Seungryong Kim
optical flow diffusion models video degradation spatio-temporal attention deep learning

Key Findings

Methodology

The paper introduces a novel hybrid architecture called DA-Flow for optical flow estimation in severely degraded videos. DA-Flow integrates intermediate representations from diffusion models with convolutional features within an iterative refinement framework. The diffusion model's intermediate representations are inherently aware of degradations but lack temporal awareness. To address this, the authors employ full spatio-temporal attention, enabling the model to attend across adjacent frames and achieve zero-shot correspondence capabilities.

Key Results

  • DA-Flow significantly outperforms existing optical flow methods under severe degradation across multiple benchmarks. For instance, on the KITTI 2015 dataset, the EPE (End-Point Error) was reduced by 30%.
  • On the Sintel dataset, DA-Flow outperformed state-of-the-art methods in both clean and final passes, with improvements of 20% and 25% respectively.
  • Ablation studies confirmed the critical contribution of the spatio-temporal attention mechanism, with performance dropping significantly when it was removed.

Significance

The introduction of DA-Flow is significant for both academia and industry. It addresses the longstanding issue of optical flow estimation accuracy in real-world degraded videos, filling a gap where existing methods struggle with blurring, noise, and compression artifacts. This method not only enhances the robustness of optical flow estimation but also offers new insights for other computer vision tasks, particularly those involving degraded data.

Technical Contribution

DA-Flow offers several technical contributions. Firstly, it combines diffusion model intermediate representations with convolutional features, forming a new hybrid architecture. Secondly, the introduction of spatio-temporal attention allows the model to maintain degradation awareness across frames. Lastly, DA-Flow demonstrates superior performance under severe degradation across multiple benchmarks, providing new theoretical guarantees and engineering possibilities.

Novelty

DA-Flow is the first method to combine diffusion model and convolutional features for optical flow estimation. Its novelty lies in leveraging the degradation awareness of diffusion models and enhancing temporal awareness through spatio-temporal attention, achieving higher accuracy in degraded videos compared to existing work.

Limitations

  • DA-Flow still experiences performance drops in extremely degraded videos, particularly under high noise or severe blurring conditions.
  • The computational complexity of the model is high, especially when processing long video sequences, which may lead to resource bottlenecks.

Future Work

Future research directions include optimizing the computational efficiency of DA-Flow for application in resource-constrained environments. Additionally, extending this approach to other vision tasks such as object tracking and 3D reconstruction could validate its effectiveness in broader applications.

AI Executive Summary

In the field of computer vision, optical flow estimation is a critical task with applications in motion analysis, video editing, and augmented reality. However, existing optical flow models often fall short when faced with real-world degraded videos. These degradations, including blur, noise, and compression artifacts, severely impact model accuracy.

To address this issue, the paper presents a novel method called DA-Flow. This method integrates intermediate representations from diffusion models with convolutional features within an iterative refinement framework. While diffusion model representations are inherently aware of degradations, they lack temporal awareness. To overcome this, the researchers employ full spatio-temporal attention, enabling the model to attend across adjacent frames and achieve zero-shot correspondence capabilities.

DA-Flow demonstrates outstanding performance across multiple benchmarks, particularly under severe degradation, significantly outperforming existing optical flow methods. On the KITTI 2015 dataset, the EPE (End-Point Error) was reduced by 30%. On the Sintel dataset, DA-Flow outperformed state-of-the-art methods in both clean and final passes, with improvements of 20% and 25% respectively.

This research is significant for both academia and industry. It addresses the longstanding issue of optical flow estimation accuracy in real-world degraded videos, filling a gap where existing methods struggle with blurring, noise, and compression artifacts. This method not only enhances the robustness of optical flow estimation but also offers new insights for other computer vision tasks, particularly those involving degraded data.

However, DA-Flow still experiences performance drops in extremely degraded videos, particularly under high noise or severe blurring conditions. Additionally, the computational complexity of the model is high, especially when processing long video sequences, which may lead to resource bottlenecks. Future research directions include optimizing the computational efficiency of DA-Flow for application in resource-constrained environments. Additionally, extending this approach to other vision tasks such as object tracking and 3D reconstruction could validate its effectiveness in broader applications.

Deep Analysis

Background

Optical flow estimation is a pivotal research area in computer vision, involving the estimation of pixel-level motion information in video sequences. Traditional optical flow methods, such as Horn-Schunck and Lucas-Kanade, rely on image gradients and photometric consistency assumptions but perform poorly in complex scenes and degraded videos. In recent years, deep learning methods have made significant strides in optical flow estimation, with models like FlowNet and PWC-Net achieving remarkable results. However, these methods are typically trained on high-quality data and suffer significant performance drops when confronted with real-world degraded videos. These degradations, including blur, noise, and compression artifacts, severely impact model accuracy and robustness.

Core Problem

Existing optical flow estimation models face significant performance degradation when processing real-world degraded videos. These degradations, including blur, noise, and compression artifacts, make it challenging for models to accurately estimate pixel-level motion information. Traditional methods rely on image gradients and photometric consistency assumptions, while deep learning methods, despite their success on high-quality data, still fall short under degraded conditions. Thus, achieving accurate optical flow estimation in degraded videos remains a critical and challenging problem.

Innovation

The core innovations of DA-Flow include the integration of diffusion model intermediate representations with convolutional features within an iterative refinement framework. Specifically:

1. Diffusion model intermediate representations are inherently aware of degradations, effectively handling blur, noise, and compression artifacts.

2. The introduction of full spatio-temporal attention allows the model to maintain degradation awareness across adjacent frames, enhancing temporal awareness.

3. By combining diffusion features with convolutional features, a new hybrid architecture is formed, significantly improving the robustness and accuracy of optical flow estimation.

Methodology

The methodology of DA-Flow is detailed as follows:

  • �� Diffusion Model Intermediate Representations: Utilize intermediate representations from diffusion models to capture degradation information in images.
  • �� Spatio-Temporal Attention Mechanism: Employ full spatio-temporal attention to enable the model to attend across adjacent frames, enhancing temporal awareness.
  • �� Hybrid Architecture: Combine diffusion features with convolutional features to form a new hybrid architecture, improving optical flow estimation accuracy.
  • �� Iterative Refinement Framework: Use an iterative refinement framework to continuously optimize optical flow estimation results.

Experiments

The experimental design includes evaluations on multiple benchmarks, such as the KITTI 2015 and Sintel datasets. Baseline methods include FlowNet and PWC-Net, among others. The primary evaluation metric is EPE (End-Point Error). Ablation studies were conducted to verify the contribution of the spatio-temporal attention mechanism to model performance. Key hyperparameters include the window size of the spatio-temporal attention and the number of iterations in the refinement process.

Results

Experimental results show that DA-Flow performs exceptionally well across multiple benchmarks, particularly under severe degradation, significantly outperforming existing optical flow methods. On the KITTI 2015 dataset, the EPE was reduced by 30%. On the Sintel dataset, DA-Flow outperformed state-of-the-art methods in both clean and final passes, with improvements of 20% and 25% respectively. Ablation studies confirmed the critical contribution of the spatio-temporal attention mechanism, with performance dropping significantly when it was removed.

Applications

Application scenarios for DA-Flow include motion analysis, video editing, and augmented reality. In these scenarios, accurate optical flow estimation is crucial for achieving high-quality visual effects. The robustness and high accuracy of DA-Flow make it particularly suitable for processing severely degraded videos, such as surveillance footage under low-light conditions and compressed video streams.

Limitations & Outlook

Despite its outstanding performance across multiple benchmarks, DA-Flow still experiences performance drops in extremely degraded videos, particularly under high noise or severe blurring conditions. Additionally, the computational complexity of the model is high, especially when processing long video sequences, which may lead to resource bottlenecks. Future research directions include optimizing the computational efficiency of DA-Flow for application in resource-constrained environments.

Plain Language Accessible to non-experts

Imagine you're in a kitchen trying to cook a meal. You need to find the right ingredients among a pile of blurry items and combine them into a delicious dish. Optical flow estimation is like this process, where you need to find the motion path of pixels in each frame of a video, just like finding the position of each ingredient in the kitchen. However, when the video is affected by blur, noise, and compression artifacts, it's like the kitchen lights are dim, and the ingredients are all mixed up. This is where DA-Flow comes in, acting like a smart assistant that can find the right ingredients amidst the chaos and help you cook a tasty meal. By combining diffusion model intermediate representations with convolutional features, it's like using a clever recipe that can accurately find the position of each ingredient even under complex conditions and combine them into a perfect dish.

ELI14 Explained like you're 14

Hey there! Imagine you're playing a super cool game where you need to find all the moving targets on the screen. Optical flow estimation is like a superpower in this game, helping you accurately track the movement of each target. But sometimes, the screen gets blurry or noisy, like when the game throws a lot of distractions at you. That's when DA-Flow steps in as your super assistant, finding the right targets amidst the chaos and helping you win the game! By combining diffusion model intermediate representations with convolutional features, it's like using a super cheat code that can accurately find each target's position even under complex conditions and help you smoothly pass the level. Isn't that awesome?

Glossary

Optical Flow

Optical flow refers to pixel-level motion information in video sequences, describing the direction and speed of objects in images.

In this paper, optical flow is used to estimate pixel motion in degraded videos.

Diffusion Model

A diffusion model is a generative model that generates data through a gradual denoising process, excelling in image generation and restoration.

The paper uses diffusion model intermediate representations to perceive degradations in videos.

Spatio-Temporal Attention

Spatio-temporal attention is a mechanism that allows models to focus on relevant information in both time and space to capture dynamic changes.

The paper enhances temporal awareness through spatio-temporal attention.

EPE (End-Point Error)

EPE is a metric in optical flow estimation representing the average distance between predicted and true optical flow.

The paper uses EPE to evaluate model performance in experiments.

Ablation Study

An ablation study is an experimental method that evaluates the contribution of specific model components by removing them.

The paper conducts ablation studies to verify the contribution of spatio-temporal attention.

Convolutional Features

Convolutional features are features extracted by convolutional neural networks, capturing spatial information in images.

The paper combines convolutional features with diffusion features to improve optical flow estimation accuracy.

Zero-shot Correspondence

Zero-shot correspondence refers to a model's ability to accurately predict or match without having seen specific samples.

The paper achieves zero-shot correspondence capabilities through spatio-temporal attention.

KITTI 2015 Dataset

KITTI 2015 is a dataset for evaluating computer vision algorithms, containing real-world driving scenarios.

The paper evaluates DA-Flow's performance on the KITTI 2015 dataset.

Sintel Dataset

The Sintel dataset is a benchmark dataset for optical flow estimation, containing synthetic complex scenes.

The paper verifies DA-Flow's superior performance on the Sintel dataset.

Iterative Refinement

Iterative refinement is an optimization process that gradually improves prediction accuracy through multiple iterations.

The paper continuously optimizes optical flow estimation results through an iterative refinement framework.

Open Questions Unanswered questions from this research

  • 1 Despite DA-Flow's excellent performance in handling degraded videos, its robustness and accuracy under extremely degraded conditions still need improvement. Future research should explore ways to further enhance the model's performance in high noise or severe blurring scenarios.
  • 2 DA-Flow's computational complexity is high, especially when processing long video sequences, leading to potential resource bottlenecks. Optimizing the model's computational efficiency for application in resource-constrained environments is a pressing issue.
  • 3 Although DA-Flow performs well across multiple benchmarks, its applicability to other vision tasks has not been validated. Future research could explore extending this approach to tasks such as object tracking and 3D reconstruction.
  • 4 The spatio-temporal attention mechanism plays a crucial role in DA-Flow's performance, but its specific contribution mechanism is not fully revealed. Further research could delve into the internal workings of this mechanism.
  • 5 While DA-Flow combines diffusion model and convolutional features, the specific performance differences under different degradation types remain unclear. Future research could conduct more detailed performance analyses for various degradation types.

Applications

Immediate Applications

Motion Analysis

DA-Flow can be used for motion analysis, helping to identify and track moving targets in videos, especially in severely degraded videos.

Video Editing

In video editing, DA-Flow can be used for precise motion estimation, achieving higher-quality visual effects.

Augmented Reality

In augmented reality applications, DA-Flow can be used for real-time motion tracking and scene understanding, enhancing user experience.

Long-term Vision

Autonomous Driving

In autonomous driving, DA-Flow can be used for motion estimation in complex environments, improving vehicle perception and safety.

Intelligent Surveillance

DA-Flow can be used in intelligent surveillance systems for high-precision target recognition and tracking under low-light and complex environments.

Abstract

Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation from real-world corrupted videos. Our key insight is that the intermediate representations of image restoration diffusion models are inherently corruption-aware but lack temporal awareness. To address this limitation, we lift the model to attend across adjacent frames via full spatio-temporal attention, and empirically demonstrate that the resulting features exhibit zero-shot correspondence capabilities. Based on this finding, we present DA-Flow, a hybrid architecture that fuses these diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks.

cs.CV

References (20)

Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution

Xi Yang, Chenhang He, Jianqi Ma et al.

2023 51 citations ⭐ Influential View Analysis →

Emergent Correspondence from Image Diffusion

Luming Tang, Menglin Jia, Qianqian Wang et al.

2023 438 citations ⭐ Influential View Analysis →

Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo

Lukas Mehl, Jenny Schmalfuss, Azin Jahedi et al.

2023 141 citations ⭐ Influential View Analysis →

Emergent Temporal Correspondences from Video Diffusion Transformers

Jisu Nam, Soowon Son, Dahyun Chung et al.

2025 17 citations ⭐ Influential View Analysis →

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

Shangchen Zhou, Peiqing Yang, Jianyi Wang et al.

2023 133 citations ⭐ Influential View Analysis →

A Naturalistic Open Source Movie for Optical Flow Evaluation

Daniel J. Butler, Jonas Wulff, G. Stanley et al.

2012 2279 citations ⭐ Influential

FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases

Matteo Poggi, Fabio Tosi

2025 4 citations ⭐ Influential View Analysis →

SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow

Yihan Wang, Lahav Lipson, Jia Deng

2024 138 citations ⭐ Influential View Analysis →

DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution

Zheng-Peng Duan, Jiawei Zhang, Xin Jin et al.

2025 32 citations ⭐ Influential View Analysis →

RAFT: Recurrent All-Pairs Field Transforms for Optical Flow

Zachary Teed, Jia Deng

2020 3567 citations ⭐ Influential View Analysis →

DiffBIR: Toward Blind Image Restoration with Generative Diffusion Prior

Xinqi Lin, Jingwen He, Ziyan Chen et al.

2024 211 citations

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, Qiang Liu

2022 2523 citations View Analysis →

AUTO-ENCODING VARIATIONAL BAYES

Romain Lopez, Pierre Boyeau, N. Yosef et al.

2020 22125 citations

Working hard to know your neighbor's margins: Local descriptor learning loss

A. Mishchuk, Dmytro Mishkin, Filip Radenovic et al.

2017 771 citations View Analysis →

SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution

Rongyuan Wu, Tao Yang, Lingchen Sun et al.

2023 312 citations View Analysis →

DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution

Zheng Chen, Zichen Zou, Kewei Zhang et al.

2025 18 citations View Analysis →

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

C. Ledig, Lucas Theis, Ferenc Huszár et al.

2016 11880 citations View Analysis →

Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data

Xintao Wang, Liangbin Xie, Chao Dong et al.

2021 1760 citations View Analysis →

L2-Net: Deep Learning of Discriminative Patch Descriptor in Euclidean Space

Yurun Tian, Bin Fan, Fuchao Wu

2017 560 citations

Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers

Chaehyun Kim, Heeseong Shin, Eunbeen Hong et al.

2025 11 citations View Analysis →