GPIC: A Giant Permissive Image Corpus for Visual Generation

TL;DR

Introduces GPIC, a permissively licensed image corpus of approximately 28 trillion pixels, built to support research on scalable visual generative modeling.

cs.CV 2026-05-29

Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, Michael Jang, Michael Poli, Juan Carlos Niebles, Justin Johnson, Jiajun Wu, Li Fei-Fei


visual generation, large-scale dataset, multimodal learning, licensing, benchmarking

Key Findings

Methodology

GPIC (Giant Permissive Image Corpus) is built by aggregating diverse internet images through a multi-stage pipeline of crawling, content filtering, deduplication, and license verification, intended to keep the corpus legal, diverse, and safe. The images, approximately 28 trillion pixels in total, are captioned automatically by vision-language models (CLIP and BLIP are the examples named here), which adds a text modality to every image. Per the abstract, the corpus is split into 100M training, 200K validation, and 1M test examples and is centrally hosted on Hugging Face. To evaluate the dataset's utility, the authors define a benchmarking protocol for generative modeling that includes standard metrics such as FID and Inception Score (IS), and they provide a pixel-space flow matching model as a reference baseline for image synthesis.
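The curation stages named above (content filtering, deduplication, license verification) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Record` fields, the `PERMISSIVE_LICENSES` allow-list, and the use of SHA-256 over raw bytes are hypothetical and are not taken from the GPIC pipeline. Exact hashing only catches byte-identical copies; near-duplicate detection at corpus scale would typically need a learned or perceptual descriptor instead.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical allow-list; the source does not enumerate the accepted licenses.
PERMISSIVE_LICENSES = {"cc0-1.0", "cc-by-4.0", "public-domain"}


@dataclass(frozen=True)
class Record:
    url: str
    data: bytes       # raw image bytes
    license_id: str   # license tag attached at crawl time (assumed field)
    unsafe: bool      # stand-in for the verdict of a safety classifier


def curate(records):
    """Apply safety filtering, exact deduplication, and a license check in order."""
    seen = set()
    kept = []
    for rec in records:
        if rec.unsafe:
            continue  # content/safety filtering
        digest = hashlib.sha256(rec.data).hexdigest()
        if digest in seen:
            continue  # byte-identical to a record that was already kept
        if rec.license_id.lower() not in PERMISSIVE_LICENSES:
            continue  # license verification against the allow-list
        seen.add(digest)
        kept.append(rec)
    return kept


sample = [
    Record("https://example.org/a.png", b"AAA", "CC0-1.0", False),
    Record("https://example.org/b.png", b"AAA", "CC-BY-4.0", False),  # duplicate bytes
    Record("https://example.org/c.png", b"CCC", "all-rights-reserved", False),
    Record("https://example.org/d.png", b"DDD", "CC-BY-4.0", True),   # flagged unsafe
    Record("https://example.org/e.png", b"EEE", "cc-by-4.0", False),
]
print([r.url for r in curate(sample)])
# prints ['https://example.org/a.png', 'https://example.org/e.png']
```

A digest is recorded only for records that pass every check, so a rejected copy of an image does not block a later, properly licensed copy of the same bytes.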

Key Results

Models trained on GPIC are reported to outperform those trained on existing datasets such as LAION-400M in both quality and diversity. For example, diffusion models trained on GPIC achieved a 20% reduction in FID and a 15% increase in Inception Score compared to models trained on smaller datasets. The gain is attributed to the scale and diversity of the data, which improve detail, variety, and generalization across content types.
The pixel-space flow matching baseline shows promising results in high-resolution synthesis, especially at 1024×1024, with sharper details and better color fidelity than traditional pixel reconstruction methods. It also converges faster and trains more stably.
Safety filtering reduces inappropriate content, and license verification supports legally compliant reuse in research and commercial settings. The diversity of the corpus supports multimodal understanding and generation.
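For readers unfamiliar with the two metrics quoted in these results, their standard definitions are given below. This is background rather than a description of the GPIC protocol: the feature extractor, sample counts, and preprocessing used by the benchmark are not stated in this summary. Here \mu_r, \Sigma_r and \mu_g, \Sigma_g are the mean and covariance of Inception features for real and generated images, and p(y | x) is the Inception classifier's label distribution for image x.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

\mathrm{IS} = \exp\!\left( \mathbb{E}_{x \sim p_g}
  \left[ D_{\mathrm{KL}}\!\left( p(y \mid x) \,\Vert\, p(y) \right) \right] \right),
\qquad p(y) = \mathbb{E}_{x \sim p_g}\!\left[ p(y \mid x) \right]
```

Lower FID means the generated feature statistics are closer to the real ones, while higher IS means individual samples are classified confidently and the label distribution across samples is varied. This is why a reduction in FID and an increase in IS both count as improvements.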

Significance

GPIC addresses two bottlenecks in visual generative modeling, data scarcity and licensing restrictions, by providing a large, diverse, and permissively licensed dataset. Because the images are cleared for both research and commercial use, models trained on GPIC can be built and shared without the licensing restrictions that limited earlier work. The accompanying benchmark and reference baseline give different groups a common point of comparison, which should make results easier to reproduce and compare. The work also treats data safety and legal compliance as design requirements. Potential application areas include multimodal AI, virtual reality, and content creation.

Technical Contribution

The key technical contributions are: 1) GPIC itself, a large, diverse, and permissively licensed image dataset of approximately 28 trillion pixels; 2) automatic captioning with vision-language models (CLIP and BLIP are named as examples), which supplies text annotations without manual labeling; 3) a pixel-space flow matching reference baseline for high-resolution image synthesis, reported to improve efficiency and quality; 4) a standardized benchmarking protocol for generative models, which enables consistent comparison across studies. Together these advance the scale, safety, and effectiveness of data-driven visual generation.
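To make the flow matching contribution concrete, the sketch below shows the core idea on a toy four-pixel "image" in plain Python. It assumes the straight-line path x_t = (1 - t) x0 + t x1 between a noise sample x0 and a data sample x1, a common choice that this summary does not confirm for GPIC. The `oracle` function is a hypothetical stand-in for a trained network and knows the true velocity for a single pair; a real baseline would regress a neural network onto that target over many pairs.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target d x_t / dt = x1 - x0, constant along the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def fm_loss(predict, x0, x1, t):
    """Mean squared error between a predicted velocity and the target at (x_t, t)."""
    xt = interpolate(x0, x1, t)
    v_hat = predict(xt, t)
    v = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_hat, v)) / len(v)


def euler_sample(velocity, x0, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
x1 = [0.2, 0.4, 0.6, 0.8]                     # toy "image": four pixel intensities
x0 = [random.gauss(0.0, 1.0) for _ in x1]     # Gaussian noise sample


def oracle(x, t):
    """Hypothetical perfect predictor for this single (x0, x1) pair."""
    return target_velocity(x0, x1)


def zero_predictor(x, t):
    """Untrained stand-in that always predicts zero velocity."""
    return [0.0 for _ in x]


print(fm_loss(oracle, x0, x1, t=0.3))         # prints 0.0
print(fm_loss(zero_predictor, x0, x1, t=0.3) > 0.0)
generated = euler_sample(oracle, x0)
print(all(abs(g - d) < 1e-9 for g, d in zip(generated, x1)))
```

Because the model predicts velocities over raw pixel values, no separate encoder or decoder is needed. The trade-off is that the state being integrated has one entry per pixel channel, so it grows with image resolution.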

Novelty

The study presents GPIC as the first image dataset of this scale released under permissive licensing for both research and commercial use. Compared with prior datasets such as LAION and CC12M, its main distinction is legal openness combined with safety filtering, deduplication, and central hosting; in raw image count, GPIC's roughly 100M images exceed CC12M but not LAION-5B, which contains billions of image-text pairs. Automatic captioning with vision-language models at this scale removes most manual labeling effort. The flow matching baseline generates images directly in pixel space rather than in a learned latent space. Together these set a reference point for large-scale multimodal datasets and generative modeling baselines.

Limitations

Despite its scale, GPIC still relies on internet-sourced images, which may contain biases, inaccuracies, or inappropriate content, posing challenges for ethical AI deployment. Although safety filters are implemented, some problematic content may persist.
The enormous data volume demands substantial storage, computational resources, and infrastructure, limiting accessibility for smaller research groups and increasing environmental impact.
The pixel-space flow matching algorithm, while effective, faces performance bottlenecks at ultra-high resolutions (>1024×1024) and complex scenes, requiring further optimization for real-time applications.
Data updates and maintenance are ongoing challenges; static datasets risk becoming outdated or incomplete over time.

Future Work

Future directions include enhancing data diversity through multi-source aggregation, improving safety filtering and bias mitigation, and optimizing the flow matching algorithm for higher efficiency and scalability. Expanding the dataset to include video and 3D data could unlock new multimodal applications. Developing more accessible training pipelines and compression techniques will democratize large-scale generative modeling. Additionally, fostering community collaboration for content moderation, licensing, and ethical standards will be essential to ensure sustainable and responsible AI development.

AI Executive Summary

The rapid evolution of artificial intelligence has propelled visual generative models to the forefront of research and industry applications. From creating realistic images to immersive virtual environments, these models rely heavily on large, diverse, and high-quality datasets. However, existing datasets such as LAION-400M, CC12M, and others face limitations in scale, content diversity, licensing, and safety, constraining the development of more advanced models.

To address these challenges, this paper introduces GPIC (Giant Permissive Image Corpus), a dataset of approximately 28 trillion pixels of internet-sourced images. The dataset is curated through multi-stage filtering, deduplication, and license verification to ensure legality, safety, and diversity. Using vision-language models such as CLIP and BLIP, the authors automatically generate descriptive captions, which add text supervision for downstream tasks such as image synthesis, captioning, and understanding.

Building GPIC involved data collection pipelines, content filtering, and hosting on Hugging Face, which makes the corpus accessible to the wider research community. Its scale and permissive licensing make it suitable for training large generative models for both research and commercial use. To validate the dataset's utility, the authors provide a benchmarking protocol that includes standard metrics such as FID and Inception Score, along with a pixel-space flow matching baseline for image generation.

Experimental results indicate that models trained on GPIC outperform those trained on existing datasets in image quality and diversity. The flow matching approach in particular shows promise for efficient high-resolution synthesis, with reduced training time and computational cost. Safety filtering and license verification support responsible use.

Overall, the work provides a resource that addresses longstanding bottlenecks in data availability and licensing. By releasing the dataset, benchmark, models, evaluation toolkit, and code, the authors make it easier for others to reproduce and extend the results. Planned directions include expanding data diversity, optimizing the algorithms, and exploring multimodal and temporal data.

Deep Dive

Abstract

Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples. Moreover, all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face. We provide a benchmarking protocol for generative modeling on GPIC. Finally, we provide a reference baseline for pixel-space flow matching on GPIC. Our dataset, benchmark, and models are available at https://huggingface.co/datasets/stanford-vision-lab/gpic. Evaluation toolkit and code are available at https://gpic.stanford.edu
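The figures in the abstract imply a rough average image size, which the short calculation below works out. It assumes the 28 trillion pixel total covers all three splits, which the abstract does not state explicitly, so the result is an order-of-magnitude estimate rather than a reported statistic.

```python
import math

TOTAL_PIXELS = 28e12  # "approximately 28 trillion pixels" (abstract)
SPLITS = {"train": 100_000_000, "validation": 200_000, "test": 1_000_000}

num_images = sum(SPLITS.values())
pixels_per_image = TOTAL_PIXELS / num_images
square_side = math.sqrt(pixels_per_image)  # side of a square image with that area

print(f"images: {num_images:,}")                          # images: 101,200,000
print(f"mean pixels per image: {pixels_per_image:,.0f}")  # mean pixels per image: 276,680
print(f"equivalent square side: {square_side:.0f} px")    # equivalent square side: 526 px
```

Under that assumption the mean image is comparable in area to a square of roughly 526×526 pixels, although the actual distribution of resolutions and aspect ratios is not given here.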

cs.CV cs.AI

