Visual-ERM: Reward Modeling for Visual Equivalence
Visual-ERM enhances vision-to-code tasks with fine-grained visual rewards, significantly outperforming existing models.
Ziyu Liu, Shengyuan Ding, Xinyu Fang et al.
STEVO-Bench evaluates video world models' ability to evolve state during observation interruptions, revealing limitations.
Ziqi Ma, Mengzhan Liufu, Georgia Gkioxari
InterEdit uses Semantic-Aware Plan Token Alignment and Interaction-Aware Frequency Token Alignment for multi-human 3D motion editing.
Yebin Yang, Di Wen, Lei Qi et al.
Introduces Alternating Gradient Flow (AGF) to prevent structural collapse under 75% compression on ImageNet-1K.
Tianhao Qian, Zhuoxuan Li, Jinde Cao et al.
EVATok achieves efficient visual autoregressive generation with adaptive video tokenization, saving 24.4% tokens on average.
Tianwei Xiong, Jun Hao Liew, Zilong Huang et al.
MM-CondChain uses VPIR for visually grounded deep compositional reasoning, with the top model achieving only 53.33 Path F1.
Haozhan Shen, Shilin Yan, Hongwei Xue et al.
OmniStream achieves perception, reconstruction, and action in visual streams using causal spatiotemporal attention and 3D-RoPE, excelling across 29 datasets.
Yibin Yan, Jilan Xu, Shangzhe Di et al.
DreamVideo-Omni achieves multi-subject video customization with latent identity reinforcement learning, enhancing identity fidelity and motion control precision.
Yujie Wei, Xinyu Liu, Shiwei Zhang et al.
AutoGaze autoregressively selects multi-scale video patches, reducing redundancy and enhancing efficiency, enabling 1K-frame 4K video processing.
Baifeng Shi, Stephanie Fu, Long Lian et al.
EndoCoT activates MLLMs' reasoning potential, achieving 92.1% accuracy, 8.3% higher than the baseline.
Xuanlang Dai, Yujie Zhou, Long Xing et al.
BiGain enhances diffusion models by frequency separation, improving classification accuracy by 7.15% and FID by 0.34.
Jiacheng Liu, Shengkun Tang, Jiacheng Cui et al.
RDNet enhances salient object detection in optical remote sensing images using dynamic adaptive modules.
Bin Wan, Runmin Cong, Xiaofei Zhou et al.
The O3N framework achieves state-of-the-art performance on the QuadOcc and Human360Occ benchmarks using a polar-spiral topology for 360° spatial representation.
Mengfei Duan, Hao Shi, Fei Teng et al.
Introduces the UniCAC benchmark to evaluate 24 algorithms under various optical aberrations.
Xiaolong Qian, Qi Jiang, Yao Gao et al.
The COMIC system uses LLM critics to generate sketch-comedy videos approaching professional quality.
Susung Hong, Brian Curless, Ira Kemelmacher-Shlizerman et al.
V2M-Zero generates time-aligned music from video using event curves, achieving significant improvements in audio quality and beat alignment across datasets.
Yan-Bo Lin, Jonah Casebeer, Long Mai et al.
DynVLA uses Dynamics CoT to predict compact world dynamics, excelling on datasets like NAVSIM.
Shuyao Shang, Bing Zhan, Yunfei Yan et al.