VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Key Findings

Methodology

The VFIG method employs a coarse-to-fine training curriculum, starting with supervised fine-tuning (SFT) to learn atomic primitives, followed by reinforcement learning (RL) to optimize global diagram fidelity, layout consistency, and topological edge cases. The VFIG-DATA dataset comprises 66K high-quality figure-SVG pairs, curated from real-world paper figures and procedurally generated diagrams. The VFIG-BENCH evaluation suite introduces novel metrics to assess the structural integrity of complex figures.

Key Results

VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, with a VLM-Judge score of 0.829 on VFIG-BENCH.
Supported by the VFIG-DATA dataset, VFIG handles more complex figure structures and generalizes better than models trained on existing small-scale datasets.
In experiments, VFIG adapted robustly to various figure types and performed particularly well on professional diagrams, outperforming the open-source baselines.

Significance

The introduction of VFIG holds significant implications for academia and industry. It addresses the traditionally labor-intensive task of figure vectorization, reducing the manual effort required to reconstruct figures. By incorporating a large-scale dataset and novel training methods, VFIG not only enhances conversion accuracy but also lays a solid foundation for future research. This study's success is poised to advance the fields of technical illustration and digital design, making graphic editing more efficient and flexible.

Technical Contribution

VFIG's technical contributions lie in its coarse-to-fine training approach and the large-scale VFIG-DATA dataset. The reinforcement learning phase improves model performance on complex figure structures by optimizing global diagram fidelity, layout consistency, and topological edge cases. Additionally, the VFIG-BENCH evaluation suite provides new metrics for assessing the structural integrity of complex figures.

Novelty

VFIG applies vision-language models to complex, high-fidelity figure-to-SVG conversion. Compared to existing figure conversion methods, which rely on small-scale datasets, it handles more complex figures with higher accuracy by introducing a large-scale dataset and a coarse-to-fine training strategy.

Limitations

VFIG may experience performance degradation when handling extremely complex or irregular figures, primarily due to limited model generalization in these scenarios.
Despite the large scale of the VFIG-DATA dataset, it may still lack specialized figures from certain domains, potentially affecting model performance in these areas.
The computational cost of VFIG's training and inference processes is high, which may limit its application in resource-constrained environments.

Future Work

Future research directions include expanding the VFIG-DATA dataset to cover more domains, optimizing model computational efficiency, and exploring more advanced training methods to further enhance model performance and adaptability. Additionally, investigating the application of VFIG in real-time graphic editing and augmented reality is a promising avenue for exploration.

AI Executive Summary

Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. However, in practice, original vector source files are frequently lost or inaccessible, leaving only 'flat' rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is a prohibitively labor-intensive process, requiring specialized expertise to recover the original geometric intent.

To bridge this gap, we propose VFIG, a family of Vision-Language Models trained for complex and high-fidelity figure-to-SVG conversion. While this task is inherently data-driven, existing datasets are typically small-scale and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs, curated from a diverse mix of real-world paper figures and procedurally generated diagrams.

Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases.

Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, achieving a VLM-Judge score of 0.829 on VFIG-BENCH.

However, VFIG may experience performance degradation when handling extremely complex or irregular figures, primarily due to limited model generalization in these scenarios. Despite the large scale of the VFIG-DATA dataset, it may still lack specialized figures from certain domains, potentially affecting model performance in these areas. Future research directions include expanding the VFIG-DATA dataset to cover more domains, optimizing model computational efficiency, and exploring more advanced training methods to further enhance model performance and adaptability.

Deep Analysis

Background

In technical illustration and digital design, Scalable Vector Graphics (SVG) is widely used for its resolution independence and flexible editability. Over time, however, original vector files are often lost or become inaccessible, leaving only rasterized images that are difficult to edit. This situation is particularly common in academic publications and professional designs, and it makes modifying and reusing graphics hard. Traditional graphic reconstruction relies on manual work, which is time-consuming and requires specialized skills, making it impractical at scale. Deep learning has made automated graphic conversion possible, but existing methods still struggle with complex graphics, largely because available datasets are small and lack complexity.

Core Problem

The core problem is how to effectively convert complex rasterized graphics into editable SVG format. The difficulty lies in restoring the geometric structure and semantic information of the graphics. Existing methods often rely on small-scale datasets, lacking generalization capabilities for complex graphics. Additionally, traditional graphic conversion methods perform poorly in handling multi-level structures and topological edge cases, resulting in insufficient fidelity and consistency in conversion results. Solving this problem is crucial for improving the efficiency and flexibility of graphic editing.

Innovation

The core innovations of VFIG lie in its coarse-to-fine training strategy and large-scale dataset support. First, VFIG introduces VFIG-DATA, a dataset comprising 66K high-quality figure-SVG pairs, significantly enhancing model training effectiveness. Second, VFIG employs a coarse-to-fine training curriculum, combining supervised fine-tuning and reinforcement learning to progressively optimize graphic fidelity and consistency. Additionally, the introduction of the VFIG-BENCH evaluation suite provides new standards for assessing the structural integrity of complex figures. Compared to existing methods, VFIG excels in handling complex graphic structures.

Methodology

1. Construction of the VFIG-DATA dataset: collecting 66K high-quality figure-SVG pairs, covering real-world paper figures and procedurally generated diagrams.
2. Coarse-to-fine training strategy:
   Supervised Fine-Tuning (SFT): learning atomic primitives to establish initial graphic structures.
   Reinforcement Learning (RL): optimizing global diagram fidelity, layout consistency, and topological edge cases.
3. VFIG-BENCH evaluation suite:
   Introducing novel evaluation metrics to assess the structural integrity of complex figures.
   Testing model performance on various types of figures to ensure generalization capability.
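
The summary does not state which reward the RL stage optimizes. To make the idea concrete, the following is a toy sketch only: `toy_structure_reward` is a hypothetical function, not the paper's reward. It assumes a reward that is zero for unparseable SVG and otherwise grows with the overlap between the primitive counts of the predicted and reference SVG.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def primitive_counts(svg_text: str) -> Counter:
    """Count elements by tag name, ignoring any XML namespace prefix."""
    root = ET.fromstring(svg_text)
    return Counter(el.tag.split("}")[-1] for el in root.iter())


def toy_structure_reward(pred_svg: str, ref_svg: str) -> float:
    """Hypothetical reward in [0, 1]; NOT the reward used by VFIG.

    Returns 0.0 when the prediction is not well-formed XML, otherwise the
    multiset overlap (intersection over union) of element tags.
    """
    try:
        pred = primitive_counts(pred_svg)
    except ET.ParseError:
        return 0.0
    ref = primitive_counts(ref_svg)
    overlap = sum((pred & ref).values())  # element-wise minimum of counts
    total = sum((pred | ref).values())    # element-wise maximum of counts
    return overlap / total if total else 0.0


reference = "<svg><rect/><rect/><line/></svg>"
missing_box = "<svg><rect/><line/></svg>"
malformed = "<svg><rect></svg>"

print(toy_structure_reward(reference, reference))
print(toy_structure_reward(missing_box, reference))
print(toy_structure_reward(malformed, reference))
```

A reward of this kind scores the whole output at once, which is why it suits a refinement stage: token-level supervised training can teach individual primitives, while a sequence-level score can penalize a diagram that is missing a box or does not parse at all.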

Experiments

The experimental design includes training and evaluating the model using the VFIG-DATA dataset. Baseline models include existing figure conversion methods for performance comparison with VFIG. Evaluation metrics include the VLM-Judge score, which measures model performance in complex figure conversion tasks. Experiments also include ablation studies to analyze the impact of different training strategies on model performance. Key hyperparameter adjustments ensure model stability and efficiency.

Results

Experimental results show that VFIG achieves a VLM-Judge score of 0.829 on VFIG-BENCH, the best result among open-source models and on par with GPT-5.2. Ablation studies indicate that the coarse-to-fine training strategy plays a crucial role in enhancing model performance. VFIG handles various types of figures well, particularly professional diagrams, showing higher fidelity and consistency than existing open-source methods.

Applications

Application scenarios for VFIG include graphic editing and reuse in technical illustration and digital design. By automating graphic conversion, designers and researchers can more efficiently modify and extend existing graphics. Additionally, VFIG can be applied in education and publishing, aiding in the rapid generation of high-quality graphic content.

Limitations & Outlook

Despite VFIG's excellent performance in complex figure conversion, it may experience performance degradation when handling extremely complex or irregular figures. Furthermore, the computational cost of VFIG's training and inference processes is high, which may limit its application in resource-constrained environments. Future research can overcome these limitations by expanding datasets and optimizing algorithms.

Plain Language Accessible to non-experts

Imagine you have a picture of a diagram, such as a flowchart saved as a PNG. You can look at it, but you cannot easily move a box or change a label, and it turns blurry when you zoom in. VFIG redraws that picture in a format called SVG, which stores the figure as shapes, lines, and text rather than pixels. It is like tracing a photo of a blueprint back into the original drawing, so that you can edit it freely: changing colors, reshaping elements, or adding new ones. VFIG learns to do this by studying 66K figures paired with their SVG versions.

ELI14 Explained like you're 14

Hey there! Did you know that sometimes those cool charts we see online are actually made in a format called SVG? SVG is like a super flexible drawing board that you can zoom in and out without it getting blurry. But sometimes, we only have the picture version of these charts, and changing them becomes a hassle. That's where VFIG comes in! It's like a super-smart robot artist that can turn these pictures into SVG format. This way, we can change these charts however we like! Isn't that awesome? But, VFIG does have some small issues, like it might make a few mistakes when dealing with really complex charts. But scientists are working hard to improve it and make it even stronger!

Glossary

Scalable Vector Graphics (SVG)

SVG is an XML-based vector graphic format that allows graphics to be scaled without losing quality.

In the paper, SVG is the target format for graphic conversion.
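
As an illustration of what "XML-based" means in practice, here is a minimal sketch that builds a two-box diagram with Python's standard `xml.etree.ElementTree` module. The diagram, the helper names `build_diagram` and `count_primitives`, and all coordinates are made up for this example and do not come from the paper.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def build_diagram() -> str:
    """Return a two-box flow diagram as an SVG string (illustrative only)."""
    ET.register_namespace("", SVG_NS)
    # viewBox defines the coordinate system; a renderer can scale it to any size.
    svg = ET.Element(f"{{{SVG_NS}}}svg", {"viewBox": "0 0 220 60"})
    for x, label in ((10, "Input"), (130, "Output")):
        ET.SubElement(svg, f"{{{SVG_NS}}}rect", {
            "x": str(x), "y": "10", "width": "80", "height": "40",
            "fill": "none", "stroke": "black",
        })
        text = ET.SubElement(svg, f"{{{SVG_NS}}}text", {
            "x": str(x + 40), "y": "35", "text-anchor": "middle",
        })
        text.text = label
    # A connector between the right edge of box 1 and the left edge of box 2.
    ET.SubElement(svg, f"{{{SVG_NS}}}line", {
        "x1": "90", "y1": "30", "x2": "130", "y2": "30", "stroke": "black",
    })
    return ET.tostring(svg, encoding="unicode")


def count_primitives(svg_text: str) -> dict:
    """Parse the SVG back and count elements by tag name."""
    counts = {}
    for el in ET.fromstring(svg_text).iter():
        tag = el.tag.split("}")[-1]  # strip the namespace prefix
        counts[tag] = counts.get(tag, 0) + 1
    return counts


svg_text = build_diagram()
print(svg_text)
print(count_primitives(svg_text))
```

Because every box, label, and connector is a separate element with explicit attributes, editing the figure means changing an attribute value, which is the editability that a raster image lacks.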

Vision-Language Models (VLM)

Vision-language models combine visual and language information to perform tasks.

VFIG uses VLM for figure-to-SVG conversion.

Supervised Fine-Tuning (SFT)

Supervised fine-tuning continues training a pretrained model on labeled input-output pairs so that its outputs match the reference outputs.

VFIG uses SFT to learn atomic primitives in the early training stage.

Reinforcement Learning (RL)

Reinforcement learning is a machine learning method that optimizes decisions through a reward mechanism.

VFIG uses RL to optimize global diagram fidelity and consistency.

VFIG-DATA

VFIG-DATA is a large-scale dataset containing 66K figure-SVG pairs.

Used to train the VFIG model, enhancing its generalization capabilities.

VFIG-BENCH

VFIG-BENCH is an evaluation suite for assessing model performance in complex figure conversion tasks.

Used to evaluate VFIG's performance, providing metrics for structural integrity.

VLM-Judge Score

The VLM-Judge score is a metric for measuring model performance in figure conversion tasks.

VFIG achieves a VLM-Judge score of 0.829 on VFIG-BENCH.

Ablation Study

An ablation study is a method of analyzing the impact of removing or modifying model components.

Used to analyze the role of different training strategies in VFIG.

Topology

Topology concerns the properties of a figure that are preserved when it is stretched or moved, such as which elements are connected to which.

VFIG optimizes the topological edge cases of graphics.
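
A small sketch can make the distinction between layout and topology concrete. The diagrams and the `same_topology` helper below are hypothetical and simplified: they compare labeled connections directly rather than testing general graph isomorphism, and they are not a metric from VFIG-BENCH.

```python
def connectivity(edges):
    """Return the set of undirected connections, ignoring coordinates."""
    return {frozenset(pair) for pair in edges}


def same_topology(d1, d2):
    """True when both diagrams have the same nodes and the same connections."""
    return (set(d1["nodes"]) == set(d2["nodes"])
            and connectivity(d1["edges"]) == connectivity(d2["edges"]))


# Two layouts of the same flowchart: positions differ, connections do not.
horizontal = {
    "nodes": {"A": (0, 0), "B": (100, 0), "C": (200, 0)},
    "edges": [("A", "B"), ("B", "C")],
}
vertical = {
    "nodes": {"A": (0, 0), "B": (0, 80), "C": (0, 160)},
    "edges": [("B", "A"), ("C", "B")],
}
# A conversion error: the second connector is attached to the wrong box.
miswired = {
    "nodes": {"A": (0, 0), "B": (100, 0), "C": (200, 0)},
    "edges": [("A", "B"), ("A", "C")],
}

print(same_topology(horizontal, vertical))
print(same_topology(horizontal, miswired))
```

The miswired diagram shares every coordinate with the horizontal one, so a rendering of the two could look nearly the same, yet the diagrams say different things. This is the kind of error that a connectivity check is meant to expose.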

Global Diagram Fidelity

Global diagram fidelity refers to the similarity of the converted graphic to the original.

VFIG optimizes global diagram fidelity through RL.

Open Questions Unanswered questions from this research

1. How can VFIG's performance on extremely complex or irregular figures be further improved? Current methods have limited generalization capabilities in these scenarios, requiring more advanced algorithms and dataset support.
2. How can the computational cost of VFIG be reduced in resource-constrained environments? The current training and inference processes are computationally expensive, limiting its application.
3. How can the VFIG-DATA dataset be expanded to cover more domains? The existing dataset may lack specialized figures from certain fields.
4. What is the potential for VFIG's application in real-time graphic editing? More efficient algorithms are needed to support real-time processing.
5. How can VFIG be applied in emerging fields like augmented reality? Research is needed on its adaptability and performance in different environments.

Applications

Immediate Applications

Technical Illustration Editing

Designers can use VFIG to quickly convert rasterized graphics into SVG format, facilitating subsequent editing and modification, thereby improving work efficiency.

Academic Publishing

Researchers can utilize VFIG to convert diagrams in papers into editable SVG format, facilitating modification and reuse.

Digital Design

Digital designers can use VFIG to convert existing graphic materials into SVG, enhancing design flexibility and scalability.

Long-term Vision

Real-Time Graphic Editing

With algorithm optimization, VFIG is expected to be applied in real-time graphic editing, supporting more efficient design processes.

Augmented Reality Applications

With further work, VFIG could be applied in augmented reality to support real-time conversion and display of complex graphics.

Abstract

Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only "flat" rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is a prohibitively labor-intensive process, requiring specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision-Language Models trained for complex and high-fidelity figure-to-SVG conversion. While this task is inherently data-driven, existing datasets are typically small-scale and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs, curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, achieving a VLM-Judge score of 0.829 on VFIG-BENCH.

cs.CV cs.AI
