UniVector: Unified Vector Extraction via Instance-Geometry Interaction

Yan, Yue, Xia et al.

Vector extraction (VE) retrieves structured vector geometry from raster images, offering high-fidelity representation and broad applicability. Existing methods, however, are usually tailored to a single vector type (e.g., polygons, polylines, line segments), requiring separate models for different structures. This stems from treating instance attributes (category, structure) and geometric attributes (point coordinates, connections) independently, limiting the ability to capture complex structures. Inspired by the human brain's simultaneous use of semantic and spatial interactions in visual perception, we propose UniVector, a unified VE framework that leverages instance-geometry interaction to extract multiple vector types within a single model. UniVector encodes vectors as structured queries containing both instance- and geometry-level information, and iteratively updates them through an interaction module for cross-level context exchange. A dynamic shape constraint further refines global structures and key points. To benchmark multi-structure scenarios, we introduce the Multi-Vector dataset with diverse polygons, polylines, and line segments. Experiments show UniVector sets a new state of the art on both single- and multi-structure VE tasks. Code and dataset will be released at https://github.com/yyyyll0ss/UniVector.

Basic Information

Paper ID: 2510.13234
Title: UniVector: Unified Vector Extraction via Instance-Geometry Interaction
Authors: Yinglong Yan, Jun Yue, Shaobo Xia, Hanmeng Sun, Tianxu Ying, Chengcheng Wu, Sifan Lan, Min He, Pedram Ghamisi, Leyuan Fang
Category: cs.CV (Computer Vision)
Publication Date: October 15, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.13234v1

Abstract

Vector extraction (VE) retrieves structured vector geometric information from raster images, providing high-fidelity representations and broad applicability. However, existing methods are typically customized for single vector types (e.g., polygons, polylines, line segments), requiring independent models for different structures. This stems from treating instance attributes (category, structure) and geometric attributes (point coordinates, connectivity) independently, limiting the ability to capture complex structures. Inspired by how the human brain simultaneously employs semantic and spatial interactions in visual perception, the authors propose UniVector, a unified VE framework that extracts multiple vector types within a single model through instance-geometry interaction. UniVector encodes vectors as structured queries containing instance-level and geometry-level information, iteratively updated through interaction modules to achieve cross-level context exchange. Dynamic shape constraints further refine global structure and keypoints.

Research Background and Motivation

Problem Definition

Vector extraction is a core task in computer vision aimed at extracting structured vector information from raster images. Vector data offers advantages over raster data including lightweight storage, high fidelity, and easy editability, with widespread applications in graphic design, cartography, and autonomous driving.

Limitations of Existing Methods

Single Structure Limitation: Existing methods are typically designed for one vector type (polygons, polylines, or line segments), so multi-structure scenes require several independent models
Cascaded Architecture Issues: Traditional methods use cascaded pipelines that process instance attributes and geometric attributes separately, leaving an information gap between the two
Topological Errors: Without instance-level constraints, multi-structure scenarios are prone to topological errors

Research Motivation

Inspired by how the human brain employs both semantic understanding and spatial understanding in visual perception, the authors propose modeling explicit cross-level information fusion through instance-geometry interaction, enabling global structure priors and fine-grained semantic-structure cues to complement each other.

Core Contributions

Unified Representation and Framework: Proposes a structured query representation that unifies different vector structures, and introduces the UniVector instance-geometry interaction learning framework
Instance-Geometry Interaction Modeling: Designs a unified vector encoder and an instance-geometry interaction decoder for adaptive initialization and refinement of structured queries
Dynamic Shape Constraint (DSC): Introduces DSC to dynamically optimize global structure consistency and local shape precision
Multi-Vector Dataset: Constructs the first multi-structure VE dataset containing polygons, polylines, and line segments

Method Details

Task Definition

Given a raster image, simultaneously extract multiple vector structures (polygons, polylines, line segments) within it, outputting instance categories, bounding boxes, point coordinates, and point categories.
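The four outputs named in the task definition can be pictured as one record per extracted vector. The sketch below is only an illustration: the class name `VectorInstance`, its field names, the helper `make_instance`, and the convention that label 1 marks a valid keypoint are all assumptions, not part of the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class VectorInstance:
    """One extracted vector. All field names are illustrative assumptions."""
    category: str                              # instance category, e.g. "building"
    structure: str                             # "polygon", "polyline" or "line_segment"
    bbox: Tuple[float, float, float, float]    # (x_min, y_min, x_max, y_max)
    points: List[Point]                        # ordered point coordinates
    point_labels: List[int]                    # one category per point


def make_instance(category: str, structure: str, points: List[Point]) -> VectorInstance:
    """Build an instance and derive its bounding box from the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    bbox = (min(xs), min(ys), max(xs), max(ys))
    # Assumption: label 1 marks every listed point as a valid keypoint.
    return VectorInstance(category, structure, bbox, points, [1] * len(points))


building = make_instance("building", "polygon", [(0, 0), (4, 0), (4, 3), (0, 3)])
print(building.bbox)
```

Keeping the structure type next to the category lets one container hold polygons, polylines, and line segments, which is the property a single unified model needs from its output format.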

Model Architecture

1. Overall Framework

The UniVector framework contains three main components:

Unified Vector Encoding: Encodes different vector structures as structured queries
Instance-Geometry Interaction Decoding: Iteratively refines queries
Dynamic Shape Constraint: Ensures global structure consistency and local geometric precision

2. Unified Vector Encoding

Structured Query Representation:

Query set $Q_s \in \mathbb{R}^{N \times (M+1) \times C}$ , where N is the maximum number of vector instances, M is the maximum number of points per vector, and C is the channel dimension
Each vector $Q_s^i$ contains instance query $Q_{ins}^i \in \mathbb{R}^C$ and geometry query $Q_{geo}^i \in \mathbb{R}^{M \times C}$
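The query layout can be made concrete with a small shape check. This is a minimal sketch using nested Python lists in place of tensors; the random initialization, the function names, and the choice to store the instance query in slot 0 of each vector are assumptions made for illustration.

```python
import random


def init_structured_queries(n_instances, m_points, channels, seed=0):
    """Return a nested list shaped N x (M+1) x C, filled with Gaussian noise."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(channels)]
             for _ in range(m_points + 1)]
            for _ in range(n_instances)]


def split_query(q_i):
    """Split one vector's query into (instance query, geometry queries).

    Assumption: slot 0 holds the instance query, slots 1..M the geometry queries.
    """
    return q_i[0], q_i[1:]


N, M, C = 4, 6, 8
Q_s = init_structured_queries(N, M, C)
q_ins, q_geo = split_query(Q_s[0])
print(len(Q_s), len(Q_s[0]), len(Q_s[0][0]))   # N, M+1, C
print(len(q_ins), len(q_geo), len(q_geo[0]))   # C, M, C
```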

Query Encoding Process:

Instance-level encoding: Employs a coarse-to-fine strategy, first selecting the image tokens with the highest scores to form coarse queries, then refining them through an instance detection module
Geometry-level encoding: Captures detailed structure through a shape deformation module, using intra-frame attention to refine the geometry queries

3. Instance-Geometry Interaction Decoding

Structured Feature Extraction: Extends deformable attention by assigning instance reference points and geometry reference points to each vector:

$\begin{cases} R_{geo}^l = \text{Sigmoid}(\text{Sigmoid}^{-1}(R_{ins}^l) + \text{MLP}(Q_{geo}^l)), & l = 0 \\ R_{geo}^l = \text{Sigmoid}(\text{Sigmoid}^{-1}(R_{geo}^{l-1}) + \text{MLP}(Q_{geo}^l)), & l \geq 1 \end{cases}$
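The update rule can be read as "move a normalized reference point by an offset in logit space, then squash it back into the unit square". The sketch below follows that reading under stated assumptions: the per-layer offsets are passed in directly as a stand-in for $\text{MLP}(Q_{geo}^l)$, layers after the first refine the previous layer's geometry reference point, and the clamping constant `eps` is an implementation convenience, not something taken from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def inverse_sigmoid(p, eps=1e-5):
    """Logit of p, clamped away from 0 and 1 for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def refine_reference(prev_ref, offset):
    """Shift a normalized (x, y) reference point by an offset given in logit space."""
    return tuple(sigmoid(inverse_sigmoid(r) + d) for r, d in zip(prev_ref, offset))


def geometry_references(instance_ref, offsets_per_layer):
    """Layer 0 starts from the instance reference point; later layers
    refine the geometry reference point produced by the previous layer."""
    refs = []
    current = instance_ref
    for offset in offsets_per_layer:
        current = refine_reference(current, offset)
        refs.append(current)
    return refs


refs = geometry_references((0.5, 0.5), [(0.4, -0.2), (0.1, 0.0)])
for layer, ref in enumerate(refs):
    print(layer, ref)
```

Working in logit space keeps every refined point strictly inside (0, 1), so a reference point can never leave the image no matter how large the predicted offset is.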

Instance-Geometry Interaction:

Single-level interaction: Uses self-attention mechanism
Cross-level refinement: Uses cross-attention mechanism

$Q_{ins}^{''} = \text{Concat}(\text{CA}(Q_{ins}^{i'}, Q_{geo}^{i'}), i \in [1, ..., N])$

$Q_{geo}^{''} = \text{Concat}(\text{CA}(Q_{geo}^{i'}, Q_{ins}^{i'}), i \in [1, ..., N])$
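The cross-level step can be sketched per vector: the instance query attends to its own geometry queries, and each geometry query attends to its instance query. The code below is a simplified illustration, not the authors' decoder. It uses single-head scaled dot-product attention without learned projections, and the residual addition is an assumption borrowed from standard Transformer decoders. Note that with a single key the attention weight is 1, so each point simply receives its instance vector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product attention with no learned projections."""
    scale = math.sqrt(len(context[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in context]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, context))
                        for c in range(len(q))])
    return outputs


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def interact(q_ins, q_geo):
    """q_ins: N instance vectors; q_geo: N lists of M point vectors."""
    new_ins, new_geo = [], []
    for ins_i, geo_i in zip(q_ins, q_geo):
        ins_update = cross_attention([ins_i], geo_i)[0]   # instance reads its points
        geo_update = cross_attention(geo_i, [ins_i])      # points read their instance
        new_ins.append(add(ins_i, ins_update))            # residual add (assumption)
        new_geo.append([add(g, u) for g, u in zip(geo_i, geo_update)])
    return new_ins, new_geo


q_ins = [[1.0, 0.0], [0.0, 1.0]]
q_geo = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
         [[0.5, 0.5], [0.0, 2.0], [2.0, 0.0]]]
new_ins, new_geo = interact(q_ins, q_geo)
print(len(new_ins), len(new_geo), len(new_geo[0]))
```

Restricting attention to one vector at a time mirrors the per-index $i$ in the formulas, with the results for all $N$ vectors collected back into one list.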

4. Dynamic Shape Constraint (DSC)

Keypoint Dynamic Matching: Solves bipartite matching between predicted vector $\hat{P} = \{\hat{p}_i\}_{i=1}^M$ and ground truth $P = \{p_i\}_{i=1}^T$ :

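$L_{match}(\hat{P}, P, \beta) = \frac{1}{T}\sum_{i=1}^T(\alpha_p \cdot l_1(p_i, \hat{p}_{\beta(i)}) + \alpha_c \cdot l_1(c_i, \hat{c}_{\beta(i)}))$

Here $\beta(i)$ is the index of the predicted point assigned to ground-truth point $p_i$, and $c_i$, $\hat{c}_{\beta(i)}$ are the ground-truth and predicted point categories. The optimal assignment is:

$\beta^* = \arg\min_\beta L_{match}(\hat{P}, P, \beta)$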
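For small point sets the optimal assignment can be found by exhaustive search, which makes the objective easy to inspect. The sketch below is a brute-force illustration under assumptions: it requires $T \leq M$, treats the point category as a scalar score, and uses default weights of 1. A practical implementation would more likely use the Hungarian algorithm, and the function names here are hypothetical.

```python
from itertools import permutations


def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def match_cost(pred_pts, pred_cls, gt_pts, gt_cls, beta, alpha_p=1.0, alpha_c=1.0):
    """Average cost of pairing ground-truth point i with prediction beta[i]."""
    total = 0.0
    for i, j in enumerate(beta):
        total += alpha_p * l1(gt_pts[i], pred_pts[j])
        total += alpha_c * abs(gt_cls[i] - pred_cls[j])
    return total / len(gt_pts)


def best_matching(pred_pts, pred_cls, gt_pts, gt_cls, alpha_p=1.0, alpha_c=1.0):
    """Try every one-to-one assignment of T ground-truth points to M predictions."""
    m, t = len(pred_pts), len(gt_pts)
    if t > m:
        raise ValueError("this sketch assumes T <= M")

    def cost(beta):
        return match_cost(pred_pts, pred_cls, gt_pts, gt_cls, beta, alpha_p, alpha_c)

    beta_star = min(permutations(range(m), t), key=cost)
    return beta_star, cost(beta_star)


pred_pts = [(0.9, 0.9), (0.1, 0.1), (0.5, 0.5)]
pred_cls = [1.0, 1.0, 0.0]
gt_pts = [(0.1, 0.1), (0.9, 0.9)]
gt_cls = [1.0, 1.0]
beta_star, cost_star = best_matching(pred_pts, pred_cls, gt_pts, gt_cls)
print(beta_star, cost_star)
```

Exhaustive search grows factorially with the number of points, so it is only suitable for checking a faster solver on toy inputs.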

Vector Shape Supervision: Comprehensive constraint includes direction loss, keypoint loss, and classification loss:

$L_{VSL} = \alpha_1 \cdot L_{dir} + \alpha_2 \cdot L_{kp} + \alpha_3 \cdot L_{cls}$
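The weighted sum is simple to compute once the three terms are defined. The summary does not give their exact forms, so every definition below is an assumption chosen for illustration: direction loss as one minus the cosine similarity of corresponding edges, keypoint loss as mean L1 distance, and classification loss as binary cross-entropy on per-point scores. Predictions are assumed to be already matched to the ground truth, in the same order and with the same length.

```python
import math


def direction_loss(pred, gt):
    """Mean (1 - cosine similarity) between corresponding edges (assumed form)."""
    losses = []
    for k in range(len(gt) - 1):
        pv = (pred[k + 1][0] - pred[k][0], pred[k + 1][1] - pred[k][1])
        gv = (gt[k + 1][0] - gt[k][0], gt[k + 1][1] - gt[k][1])
        norm = math.hypot(*pv) * math.hypot(*gv)
        cos = (pv[0] * gv[0] + pv[1] * gv[1]) / norm if norm > 0 else 0.0
        losses.append(1.0 - cos)
    return sum(losses) / len(losses) if losses else 0.0


def keypoint_loss(pred, gt):
    """Mean L1 distance between matched points (assumed form)."""
    return sum(abs(p[0] - g[0]) + abs(p[1] - g[1]) for p, g in zip(pred, gt)) / len(gt)


def classification_loss(scores, labels, eps=1e-7):
    """Binary cross-entropy on per-point scores (assumed form)."""
    total = 0.0
    for s, y in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)
        total -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return total / len(labels)


def vector_shape_loss(pred, gt, scores, labels, a1=1.0, a2=1.0, a3=1.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    return (a1 * direction_loss(pred, gt)
            + a2 * keypoint_loss(pred, gt)
            + a3 * classification_loss(scores, labels))


gt = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
pred = [(0.0, 0.1), (0.9, 0.0), (1.0, 1.1)]
print(vector_shape_loss(pred, gt, [0.9, 0.8, 0.95], [1, 1, 1]))
```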

Technical Innovations

Unified Representation: Proposes the first unified structured-query representation for different vector types
Interaction Mechanism: Designs explicit instance-geometry interaction to bridge the information gap between the two levels
Dynamic Constraint: Introduces a dynamic shape constraint that adapts to the shape variations of different vectors

Experimental Setup

Datasets

Multi-Vector Dataset:

First multi-structure vector extraction dataset
20,000 training images, 3,734 validation images
Three semantic categories: buildings (70.6%), road boundaries (18.9%), centerlines (10.5%)
Buildings as polygons, road boundaries as polylines, centerlines as line segments

Single-Structure Datasets:

CrowdAI: 280k+ training images, 60k test images for building extraction
Structured3D: Synthetic 3D house dataset
Topo-Boundary: 25k aerial images for road boundary extraction
Wireframe and York Urban: Standard line segment detection datasets

Evaluation Metrics

Buildings: mAP, IoU, CIoU, PoLiS

Road Boundaries and Centerlines:

Pixel-level: Precision, Recall, F1-score (10-pixel tolerance)
Geometry-level: ECM (Entropy-based Connectivity Metric), APLS (Average Path Length Similarity)
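The pixel-level metrics with a distance tolerance can be sketched as follows. This is one common "relaxed" formulation, given here as an assumption: a predicted pixel counts as correct if any ground-truth pixel lies within the tolerance, and a ground-truth pixel counts as recovered if any predicted pixel lies within the tolerance. The exact protocol of the benchmark may differ, and the function name is hypothetical.

```python
import math


def relaxed_prf(pred_pixels, gt_pixels, tolerance=10.0):
    """Relaxed precision, recall and F1 for two sets of (x, y) pixels."""

    def hit_rate(src, dst):
        # Fraction of src pixels that have some dst pixel within the tolerance.
        if not src:
            return 0.0
        hits = sum(1 for p in src if any(math.dist(p, q) <= tolerance for q in dst))
        return hits / len(src)

    precision = hit_rate(pred_pixels, gt_pixels)
    recall = hit_rate(gt_pixels, pred_pixels)
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


pred = [(0, 0), (0, 5), (100, 100)]
gt = [(0, 3), (50, 50)]
precision, recall, f1 = relaxed_prf(pred, gt)
print(precision, recall, f1)
```

In this toy case two of three predicted pixels fall within 10 pixels of the ground truth, and one of two ground-truth pixels is recovered. The pairwise search is quadratic, so real evaluations usually rely on a distance transform or a spatial index instead.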

Comparison Methods

Baselines include FFL, HiSup, and PolyR-CNN (polygons), Sat2Graph and RNGDet++ (polylines), HAWP and LETR (line segments), and other representative methods.

Experimental Results

Main Results

Multi-Vector Dataset Performance:

Buildings: mAP 49.8% (ResNet-50), 53.4% (Swin-L)
Road Boundaries: F1-score 88.4% (ResNet-50), 90.4% (Swin-L)
Centerlines: F1-score 87.8% (ResNet-50), 88.2% (Swin-L)

SOTA Performance on Single-Structure Datasets:

CrowdAI: AP 72.8% (ResNet-50), 79.9% (Swin-B)
Topo-Boundary: F1-score 90.3%
Wireframe: sAP10 64.5% (ResNet-50), 69.8% (Swin-L)

Ablation Study

Component	Multi-Vector Buildings	CrowdAI	Topo-Boundary
Baseline	39.6	63.9	78.8
+IGID	45.2 (+5.6)	69.3 (+5.4)	85.6 (+6.8)
+UVE	47.6 (+2.4)	71.5 (+2.2)	87.5 (+1.9)
+DSC	49.4 (+1.8)	72.8 (+1.3)	90.3 (+2.8)

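Components are added cumulatively, and each bracketed gain is measured against the row above. Instance-Geometry Interaction Decoding (IGID) provides the largest gain on all three benchmarks (+5.4 to +6.8), while Unified Vector Encoding (UVE) and Dynamic Shape Constraint (DSC) each add a further 1.3 to 2.8 points.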
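The bracketed gains in the ablation table can be recomputed from the scores themselves. The short script below copies the numbers from the table as given in this summary and derives the per-component and total improvements; the variable and function names are illustrative.

```python
# Scores copied from the ablation table: Baseline, +IGID, +UVE, +DSC.
ablation = {
    "Multi-Vector Buildings": [39.6, 45.2, 47.6, 49.4],
    "CrowdAI": [63.9, 69.3, 71.5, 72.8],
    "Topo-Boundary": [78.8, 85.6, 87.5, 90.3],
}
steps = ["+IGID", "+UVE", "+DSC"]


def incremental_gains(scores):
    """Gain of each row over the row above it, rounded to one decimal."""
    return [round(b - a, 1) for a, b in zip(scores, scores[1:])]


def total_gain(scores):
    """Gain of the full model over the baseline."""
    return round(scores[-1] - scores[0], 1)


for name, scores in ablation.items():
    gains = dict(zip(steps, incremental_gains(scores)))
    print(name, gains, "total:", total_gain(scores))
```

Recomputing the differences reproduces every bracketed value in the table, with total gains over the baseline of 9.8, 8.9, and 11.5 points respectively.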

Experimental Findings

Training Efficiency: Training and inference are 2 to 20 times faster than with cascaded multi-model methods
Geometric Precision: Demonstrates more accurate shapes and fewer false detections in complex scenes
Cross-Domain Generalization: Maintains stable performance across different datasets

Vector Extraction Method Classification

Instance-to-Geometry Framework:

First predicts instance representation (bounding box or mask), then infers vector geometry
Representative methods: Mask R-CNN, PolyR-CNN, LETR
Limitations: Depends on instance quality, prone to distortion in dense scenes

Geometry-to-Instance Framework:

First detects geometric points, then predicts connectivity relationships
Representative methods: PolyWorld, GraphMapper, RoadTracer
Limitations: Lacks instance-level priors, prone to topological errors

Advantages of This Work

By explicitly modeling instance-geometry interaction, UniVector combines the advantages of both frameworks and achieves more accurate multi-structure vector extraction.

Conclusions and Discussion

Main Conclusions

UniVector unifies multi-structure vector extraction in a single model and reaches state-of-the-art results on both single-structure and multi-structure tasks
The instance-geometry interaction mechanism bridges the information gap between the two levels
The dynamic shape constraint adapts to the shape variations of different vector types

Limitations

Fixed maximum point number setting may limit representation of extremely complex shapes
Computational complexity increases compared to single-structure methods
Still faces challenges with extremely small-scale or severely occluded vectors

Future Directions

The authors propose developing zero-shot vector extraction foundation models and applying vector representation to downstream tasks such as visual localization and path planning.

In-Depth Evaluation

Strengths

Strong Innovation: First proposes unified multi-structure vector extraction framework, addressing a long-standing problem in the field
Reasonable Methodology: Instance-geometry interaction design inspired by human cognition has strong theoretical foundation
Comprehensive Experiments: Thorough evaluation on multiple datasets demonstrates method effectiveness
High Practical Value: Significantly improves training efficiency with important application value

Weaknesses

Computational Overhead: Computational complexity increases compared to single-structure methods
Parameter Sensitivity: Weight parameters in dynamic shape constraint require careful tuning
Extreme Cases: Limited handling capability for extremely small targets or severely occluded situations

Impact

Academic Contribution: Pioneering solution to multi-structure unified extraction problem, providing new insights for field development
Practical Value: Important significance in applications such as geographic information systems and autonomous driving
Reproducibility: Commits to open-sourcing code and datasets, facilitating subsequent research

Applicable Scenarios

High-precision map construction
Remote sensing image analysis
Building information extraction
Autonomous driving path planning
Graphic design automation

References

The paper cites 75 related references covering multiple relevant fields including vector extraction, object detection, semantic segmentation, and graph neural networks, providing solid theoretical foundation for this research.

Overall Evaluation: This is a high-quality computer vision paper that makes a significant advance in vector extraction. The method is novel, the experimental design is sound, the results are convincing, and the work has both academic and practical value.