SEAOTTER: Sensor Embedded Autoencoding with One-Time Transcode for Efficient Reconstruction
SEAOTTER combines a low-complexity learned latent encoder with a learnable JPEG codec and a one-time cloud transcode. At a 200:1 compression ratio, it encodes 7× faster and decodes 3.5× faster than AVIF, with +8% ImageNet top-1 accuracy.
Key Findings
Methodology
The SEAOTTER framework has three main components: a sensor-side analysis transform (GA) based on a pre-trained FRAPPE autoencoder, a cloud-side synthesis transform (GS), and a learned JPEG codec (JQ) with a learnable color transform (F, F⁻¹) and learnable quantization matrices. The sensor encoder, designed for extreme resource constraints, produces a quantized int8 latent representation that is losslessly compressed and transmitted. The cloud reconstructs an intermediate image via GS, then re-encodes it with the end-to-end trained JPEG encoder to produce a standard JPEG file. Any conventional JPEG decoder can decode this file; an optional learned inverse color transform can then be applied, so the output stays compatible with existing infrastructure. At a 200:1 compression ratio, the pipeline encodes 7 times faster and decodes 3.5 times faster than AVIF, with +8% ImageNet top-1 accuracy.
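The learnable JPEG stage described above can be illustrated on a single 8×8 block. The following is a minimal sketch, not the authors' implementation: it assumes the color transform F is a 3×3 matrix initialized from the standard JFIF RGB-to-YCbCr matrix, and that each channel has its own 8×8 quantization table. In a trained codec both F and the tables would be learned parameters. The level shift, chroma subsampling, and entropy coding of real JPEG are omitted, and all function names are hypothetical.

```python
import math

# Standard JFIF RGB -> YCbCr matrix. Assumption: a learned F could start here.
F_INIT = [
    [0.299, 0.587, 0.114],
    [-0.168736, -0.331264, 0.5],
    [0.5, -0.418688, -0.081312],
]

def matmul(a, b):
    """Plain matrix product for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inverse3(m):
    """Inverse of a 3x3 matrix via the adjugate; plays the role of F^-1."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[v / det for v in row] for row in adj]

def dct_matrix(n=8):
    """Orthonormal DCT-II basis; the 2-D block transform is D @ B @ D^T."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)])
    return rows

def color_transform(planes, m):
    """Apply a 3x3 matrix m across three 8x8 channel planes."""
    return [
        [[sum(m[k][c] * planes[c][y][x] for c in range(3)) for x in range(8)] for y in range(8)]
        for k in range(3)
    ]

def codec_roundtrip(rgb, f, qtables):
    """Color transform, block DCT, quantize, then invert every step."""
    d = dct_matrix()
    dt = transpose(d)
    recon = []
    for plane, q in zip(color_transform(rgb, f), qtables):
        coeffs = matmul(matmul(d, plane), dt)
        # Rounding to a multiple of the table entry is the only lossy step.
        quantized = [[round(c / s) for c, s in zip(cr, qr)] for cr, qr in zip(coeffs, q)]
        dequant = [[v * s for v, s in zip(vr, qr)] for vr, qr in zip(quantized, q)]
        recon.append(matmul(matmul(dt, dequant), d))
    return color_transform(recon, inverse3(f))

def max_abs_error(a, b):
    """Largest per-sample difference between two sets of channel planes."""
    return max(abs(x - y) for pa, pb in zip(a, b) for ra, rb in zip(pa, pb) for x, y in zip(ra, rb))
```

Larger table entries discard more coefficient precision and lower the bitrate, which is why the color matrix and the tables are natural quantities to optimize jointly against a downstream loss.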
Key Results
- At a compression ratio of 200:1, SEAOTTER encodes 7 times faster and decodes 3.5 times faster than AVIF, while boosting ImageNet Top-1 accuracy by 8%, reaching 69.02%.
- Across various applications including dense semantic segmentation and vision-language models, task-aware fine-tuning further enhances downstream performance, with improvements of up to 19.85 percentage points in accuracy at low bitrates, validating the framework’s adaptability.
- The learnable JPEG codec, with optimized color transforms and quantization matrices, maintains full compatibility with existing JPEG infrastructure, enabling seamless deployment without hardware modifications, and achieves a balanced trade-off between compression efficiency and perceptual quality.
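To put the 200:1 figure in concrete terms, the arithmetic below converts a compression ratio into bits per pixel and a file size. It is a small sketch that assumes the ratio is measured against uncompressed 8-bit RGB (24 bits per pixel); the source does not state the baseline, and the function names and the 1920×1080 frame size are illustrative only.

```python
def bits_per_pixel(ratio, raw_bpp=24.0):
    """Bitrate after compression, assuming raw_bpp bits per uncompressed pixel."""
    return raw_bpp / ratio

def compressed_bytes(width, height, ratio, raw_bpp=24.0):
    """Compressed size in bytes for a width x height image at the given ratio."""
    return width * height * raw_bpp / ratio / 8

if __name__ == "__main__":
    print(bits_per_pixel(200))                # bits per pixel at 200:1
    print(compressed_bytes(1920, 1080, 200))  # bytes for one 1920x1080 frame
```

Under that assumption, 200:1 corresponds to 0.12 bits per pixel, or roughly 31 kB for a 1920×1080 frame that occupies about 6.2 MB uncompressed.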
Significance
This work addresses the critical bottleneck in cloud robotics: transmitting high-resolution visual data under severe bandwidth and power constraints. By integrating learned latent representations with standard JPEG formats, SEAOTTER provides a practical, scalable solution that bridges the gap between resource-limited sensors and data-hungry downstream applications. Its compatibility with existing infrastructure accelerates deployment in real-world robotic systems, autonomous vehicles, and remote sensing platforms. The framework’s ability to deliver high throughput and improved accuracy under extreme compression ratios signifies a major advancement in edge-AI and cloud-edge collaboration, fostering more intelligent, efficient, and accessible robotic systems.
Technical Contribution
The paper’s key technical innovations include: (1) a low-complexity, pre-trained FRAPPE encoder tailored for sensor-side deployment, drastically reducing computational costs; (2) a novel end-to-end learned JPEG codec that optimizes color space, quantization, and rate proxy functions, surpassing traditional JPEG performance; and (3) a one-time cloud transcode mechanism that converts the autoencoder’s latent into a standard JPEG file, enabling broad compatibility and minimal decoding overhead. Together, these contributions let resource-constrained sensors produce high-quality, standard-compliant images suitable for diverse downstream tasks, while maintaining high throughput and low latency.
Novelty
This work is the first to combine ultra-lightweight learned latent encoding with a fully end-to-end trainable JPEG codec that incorporates learnable color transforms and quantization matrices. Unlike prior neural codecs that rely on bespoke formats or expensive decoding, SEAOTTER balances extremely fast encoding, broad compatibility, and high downstream accuracy. A single learned color transform shared across multiple bitrates, coupled with a one-time cloud transcode, distinguishes it from existing solutions like DE-AAE or WaLLoC, which use bespoke formats and add costly decoding. This fusion of classical standards with deep learning marks a significant step forward in practical image compression.
Limitations
- While SEAOTTER excels in efficiency and compatibility, operating at extremely low bitrates can cause perceptible quality degradation, potentially impacting high-precision visual tasks.
- The training process involves complex end-to-end optimization with multiple loss components, which can be computationally intensive and sensitive to hyperparameter tuning.
- The generalization across diverse sensors and environmental conditions requires further validation, and robustness to real-world variations remains an open challenge.
Future Work
Future directions include enhancing the perceptual quality at ultra-low bitrates through perceptually optimized loss functions, extending the framework to multi-modal data such as video and LiDAR, and developing dedicated hardware accelerators to further reduce latency and power consumption. Additionally, exploring adaptive bitrate strategies and unsupervised training methods could improve robustness and ease of deployment in diverse robotic platforms.
AI Executive Summary
In the rapidly evolving field of robotics and autonomous systems, high-resolution visual data acquisition has become ubiquitous, driven by affordable, low-power sensors capable of capturing billions of pixels per second. However, the challenge lies in efficiently transmitting and storing this data, especially under strict bandwidth and power constraints typical of edge devices. Conventional image compression standards like JPEG and MPEG, while mature, struggle to meet the demands of modern applications that require both high compression ratios and minimal perceptual loss. Newer codecs such as AV1 and AVIF offer improved rate-distortion performance but demand significant computational resources, making them impractical for resource-constrained sensors.
Recent advances in deep learning have introduced autoencoder-based compression schemes that excel under extreme resource limitations. These neural autoencoders, like DE-AAE and WaLLoC, leverage learned representations to achieve high compression efficiency. Nonetheless, their reliance on bespoke formats and computationally intensive decoding hinders widespread adoption, especially in existing infrastructure built around the JPEG standard. The core problem is how to reconcile the need for ultra-efficient, low-complexity encoding with broad compatibility and high downstream task performance.
The paper presents SEAOTTER, a novel framework that addresses this challenge by integrating a pre-trained, ultra-lightweight autoencoder with a learnable JPEG codec. The sensor-side encoder, based on FRAPPE, produces a low-complexity latent representation suitable for real-time deployment on low-power hardware. This latent is losslessly compressed and transmitted to the cloud, where a powerful synthesis transform reconstructs an intermediate image. Subsequently, a learned JPEG encoder, incorporating a learnable color transform and quantization matrices, re-encodes the image into a standard JPEG file. This file can be decoded by any conventional JPEG decoder, ensuring compatibility with existing infrastructure.
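The sensor-side step of quantizing the latent to int8 and compressing it losslessly can be sketched as follows. This is an illustration under stated assumptions rather than the paper's method: the latent is treated as a flat list of floats, `scale` is a hypothetical fixed quantization scale, and `zlib` stands in for whichever lossless coder the real system uses.

```python
import struct
import zlib

def quantize_int8(latent, scale):
    """Scale, round, and clamp float latent values to the int8 range [-128, 127]."""
    return [max(-128, min(127, round(v * scale))) for v in latent]

def pack_latent(latent, scale):
    """Quantize to int8, then compress losslessly (zlib is a stand-in coder)."""
    q = quantize_int8(latent, scale)
    raw = struct.pack(f"{len(q)}b", *q)  # one signed byte per latent value
    return zlib.compress(raw, 9)

def unpack_latent(payload, scale):
    """Invert the lossless stage exactly, then undo the scaling.

    Only the rounding and clamping in quantize_int8 lose information.
    """
    raw = zlib.decompress(payload)
    q = struct.unpack(f"{len(raw)}b", raw)
    return [v / scale for v in q]
```

Because the lossless stage is exactly invertible, the cloud recovers the same int8 values the sensor produced, and all distortion is confined to the quantization step.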
One of the key innovations is the one-time transcode process, which converts the autoencoder’s latent into a standard JPEG artifact, optimizing downstream performance without repeated decoding overhead. The learned color transform and quantization matrices are trained end-to-end, surpassing traditional JPEG performance and enabling multi-rate operation. Extensive experiments demonstrate that at a compression ratio of 200:1, SEAOTTER achieves 7× faster encoding and 3.5× faster decoding compared to AVIF, while improving ImageNet classification accuracy by 8%. It also excels across dense segmentation and zero-shot vision-language tasks, with task-specific fine-tuning further boosting downstream metrics.
This work significantly advances the state of the art in resource-efficient image compression for cloud robotics, offering a practical solution that balances speed, accuracy, and infrastructure compatibility. Its ability to operate under severe bandwidth and power constraints, while maintaining high task performance, opens new avenues for deploying intelligent visual systems in real-world scenarios. Future work will focus on improving perceptual quality at ultra-low bitrates, extending the framework to multi-modal data, and developing hardware accelerators to facilitate widespread adoption. Overall, SEAOTTER exemplifies how combining classical standards with modern deep learning can unlock new potential in edge AI and robotic perception.
Deep Dive
Abstract
In robotics systems, vast amounts of visual data are easily captured at high resolution using low-cost, low-power hardware. Yet, limited bandwidth and on-device compute resources prevent full utilization when transmitted via conventional codecs like JPEG/MPEG. Newer codecs, like AV1/AVIF, improve the rate-distortion trade-off, but demand far more resources for encoding, impractical without custom ASICs. Recent asymmetric autoencoders deliver high quality under extreme power and bandwidth constraints, but add prohibitive decoding cost and use bespoke formats that ignore decades of infrastructure built around standards like JPEG. To address these limitations, we introduce a compression framework for cloud robotics based on a Sensor Embedded Autoencoder paired with a One-Time Transcode for Efficient Reconstruction (SEAOTTER). Because the sensor, cloud, and consumer stages face very different power and bandwidth budgets, SEAOTTER combines the compactness of a learned latent with the broad usability of a standard JPEG file. Since naive transcoding degrades performance, we propose a learnable JPEG color and quantization transform that enables increased accuracy for global, dense, and vision-language-based perception. Using SEAOTTER, we train both general-purpose and task-aware transcoding pipelines for a pre-trained, frozen encoder. At a compression ratio of 200:1 and compared to AVIF, we observe 7 times faster encoding, 3.5 times faster decoding, and +8% ImageNet top-1 accuracy, while retaining compatibility with JPEG infrastructure. Our code is available at https://github.com/UT-SysML/seaotter.