Vector extraction retrieves structured vector geometry from raster images, offering high-fidelity representation and broad applicability. Existing methods, however, are usually tailored to a single vector type (e.g., polygons, polylines, line segments), requiring separate models for different structures. This stems from treating instance attributes (category, structure) and geometric attributes (point coordinates, connections) independently, limiting the ability to capture complex structures. Inspired by the human brain's simultaneous use of semantic and spatial interactions in visual perception, we propose UniVector, a unified VE framework that leverages instance-geometry interaction to extract multiple vector types within a single model. UniVector encodes vectors as structured queries containing both instance- and geometry-level information, and iteratively updates them through an interaction module for cross-level context exchange. A dynamic shape constraint further refines global structures and key points. To benchmark multi-structure scenarios, we introduce the Multi-Vector dataset with diverse polygons, polylines, and line segments. Experiments show UniVector sets a new state of the art on both single- and multi-structure VE tasks. Code and dataset will be released at https://github.com/yyyyll0ss/UniVector.
- Paper ID: 2510.13234
- Title: UniVector: Unified Vector Extraction via Instance-Geometry Interaction
- Authors: Yinglong Yan, Jun Yue, Shaobo Xia, Hanmeng Sun, Tianxu Ying, Chengcheng Wu, Sifan Lan, Min He, Pedram Ghamisi, Leyuan Fang
- Category: cs.CV (Computer Vision)
- Publication Date: October 15, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.13234v1
Vector extraction (VE) retrieves structured vector geometric information from raster images, providing high-fidelity representations and broad applicability. However, existing methods are typically customized for single vector types (e.g., polygons, polylines, line segments), requiring independent models for different structures. This stems from treating instance attributes (category, structure) and geometric attributes (point coordinates, connectivity) independently, limiting the ability to capture complex structures. Inspired by how the human brain simultaneously employs semantic and spatial interactions in visual perception, the authors propose UniVector, a unified VE framework that extracts multiple vector types within a single model through instance-geometry interaction. UniVector encodes vectors as structured queries containing instance-level and geometry-level information, iteratively updated through interaction modules to achieve cross-level context exchange. Dynamic shape constraints further refine global structure and keypoints.
Vector extraction is a core task in computer vision aimed at extracting structured vector information from raster images. Vector data offers advantages over raster data including lightweight storage, high fidelity, and easy editability, with widespread applications in graphic design, cartography, and autonomous driving.
- Single Structure Limitation: Existing methods are typically designed specifically for particular vector types (polygons, polylines, or line segments), requiring multiple independent models
- Cascaded Architecture Issues: Traditional methods employ cascaded pipelines that process instance attributes and geometric attributes separately, resulting in information gaps
- Topological Errors: Lack of instance-level constraints easily produces topological errors in multi-structure scenarios
Inspired by how the human brain employs both semantic understanding and spatial understanding in visual perception, the authors propose modeling explicit cross-level information fusion through instance-geometry interaction, enabling global structure priors and fine-grained semantic-structure cues to complement each other.
- Unified Representation and Framework: Proposes structured query representation to unify different vector structures and introduces the UniVector instance-geometry interaction learning framework
- Instance-Geometry Interaction Modeling: Designs unified vector encoder and instance-geometry interaction decoder for adaptive initialization and refinement of structured queries
- Dynamic Shape Constraint (DSC): Introduces DSC to dynamically optimize global structure consistency and local shape precision
- Multi-Vector Dataset: Constructs the first multi-structure VE dataset containing polygons, polylines, and line segments
Given a raster image, the task is to simultaneously extract the multiple vector structures it contains (polygons, polylines, line segments), outputting for each instance its category, bounding box, point coordinates, and point categories.
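As a rough illustration of the per-instance output listed above, one extracted vector could be held in a small record. The class name `VectorInstance`, its field names, and the 1/0 point-label convention are hypothetical choices made for this sketch; the summary does not describe the authors' actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


# Hypothetical container for one extracted vector. Field names and the
# label convention are illustrative assumptions, not the authors' format.
@dataclass
class VectorInstance:
    category: str                             # instance category, e.g. "building"
    bbox: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
    points: List[Tuple[float, float]]         # ordered point coordinates
    point_labels: List[int]                   # assumed: 1 = valid keypoint, 0 = padding

    def keypoints(self) -> List[Tuple[float, float]]:
        """Return only the points labeled as valid keypoints."""
        return [p for p, c in zip(self.points, self.point_labels) if c == 1]


# A four-corner building padded with one unused point slot.
building = VectorInstance(
    category="building",
    bbox=(10.0, 10.0, 50.0, 40.0),
    points=[(10.0, 10.0), (50.0, 10.0), (50.0, 40.0), (10.0, 40.0), (0.0, 0.0)],
    point_labels=[1, 1, 1, 1, 0],
)
```

Padding slots are included here because the method fixes a maximum number of points per vector; how unused slots are actually marked is an assumption.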
The UniVector framework contains three main components:
- Unified Vector Encoding: Encodes different vector structures as structured queries
- Instance-Geometry Interaction Decoding: Iteratively refines queries
- Dynamic Shape Constraint: Ensures global structure consistency and local geometric precision
Structured Query Representation:
- Query set $Q_s \in \mathbb{R}^{N \times (M+1) \times C}$, where $N$ is the maximum number of vector instances, $M$ is the maximum number of points per vector, and $C$ is the channel dimension
- Each vector $Q_s^i$ contains an instance query $Q_{ins}^i \in \mathbb{R}^{C}$ and a geometry query $Q_{geo}^i \in \mathbb{R}^{M \times C}$
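The shape bookkeeping of the structured query can be sketched with plain nested lists. The sizes below are toy values, and placing the instance query in slot 0 of each vector is an assumption made for illustration; the summary only states that each vector holds one instance query and M geometry queries.

```python
# Toy sizes chosen for illustration; the actual N, M, C are not given here.
N, M, C = 3, 4, 2


def make_structured_queries(n, m, c):
    """Build a zero-filled nested list shaped N x (M+1) x C."""
    return [[[0.0] * c for _ in range(m + 1)] for _ in range(n)]


def split_query(vector_query):
    """Split one vector's (M+1) x C block into instance and geometry parts.

    Assumption: slot 0 is the instance query, slots 1..M are geometry queries.
    """
    instance_query = vector_query[0]
    geometry_queries = vector_query[1:]
    return instance_query, geometry_queries


queries = make_structured_queries(N, M, C)
q_ins, q_geo = split_query(queries[0])
```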
Query Encoding Process:
- Instance-level encoding: Employs coarse-to-fine strategy, first selecting image tokens with highest scores to form coarse queries, then refining through instance detection module
- Geometry-level encoding: Captures detailed structure through shape deformation module, using intra-frame attention to refine geometry queries
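The coarse step of the instance-level encoding, picking the highest-scoring image tokens, can be sketched as a top-k selection. The function name `select_coarse_queries` and the toy tokens and scores are hypothetical; the later refinement by the instance detection module is not modeled.

```python
import heapq


def select_coarse_queries(tokens, scores, k):
    """Return the k tokens with the highest scores, best first."""
    top_indices = heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])
    return [tokens[i] for i in top_indices]


# Four toy image tokens (C = 2) with hand-picked scores.
tokens = [[0.1, 0.2], [0.9, 0.8], [0.5, 0.5], [0.3, 0.7]]
scores = [0.2, 0.95, 0.6, 0.4]
coarse = select_coarse_queries(tokens, scores, k=2)
```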
Structured Feature Extraction:
Extends deformable attention by assigning instance reference points and geometry reference points to each vector:
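$$
R_{geo}^{l} =
\begin{cases}
\mathrm{Sigmoid}\big(\mathrm{Sigmoid}^{-1}(R_{ins}^{l}) + \mathrm{MLP}(Q_{geo}^{l})\big), & l = 0 \\
\mathrm{Sigmoid}\big(\mathrm{Sigmoid}^{-1}(R_{geo}^{l}) + \mathrm{MLP}(Q_{geo}^{l})\big), & l \geq 1
\end{cases}
$$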
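The reference-point update above has the form Sigmoid(Sigmoid⁻¹(R) + offset): the reference is mapped to logit space, shifted, and mapped back so that it stays inside the normalized image range. A minimal sketch follows, in which a hand-picked offset stands in for the MLP output; the function names and the clamping constant `eps` are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def inverse_sigmoid(y, eps=1e-6):
    """Logit function, clamped so inputs at 0 or 1 do not give infinities."""
    y = min(max(y, eps), 1.0 - eps)
    return math.log(y / (1.0 - y))


def refine_reference(ref_xy, offset_xy):
    """Apply Sigmoid(Sigmoid^-1(ref) + offset) to each coordinate."""
    return tuple(sigmoid(inverse_sigmoid(r) + o) for r, o in zip(ref_xy, offset_xy))


# At l = 0 the geometry reference starts from the instance reference point.
instance_ref = (0.5, 0.25)
geo_ref = refine_reference(instance_ref, (0.4, -0.2))  # offset stands in for MLP(Q_geo)
```

Working in logit space means that any finite offset yields a point strictly between 0 and 1, so no explicit clipping of the refined coordinates is needed.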
Instance-Geometry Interaction:
- Single-level interaction: Uses self-attention mechanism
- Cross-level refinement: Uses cross-attention mechanism
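$$Q_{ins}^{\prime\prime} = \mathrm{Concat}\big(\mathrm{CA}(Q_{ins}^{i\prime}, Q_{geo}^{i\prime}),\ i \in [1, \dots, N]\big)$$
$$Q_{geo}^{\prime\prime} = \mathrm{Concat}\big(\mathrm{CA}(Q_{geo}^{i\prime}, Q_{ins}^{i\prime}),\ i \in [1, \dots, N]\big)$$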
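The cross-level refinement can be sketched for a single vector as two cross-attention calls, one in each direction. This is a simplified stand-in and not the authors' module: it uses single-head scaled dot-product attention with no learned projections, and the residual addition is an assumption.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    c = len(queries[0])
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(c) for k in keys_values]
        weights = softmax(logits)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(c)])
    return out


def interact(q_ins, q_geo):
    """Exchange context between one instance query and its M geometry queries."""
    ins_update = cross_attention([q_ins], q_geo)[0]  # instance attends to geometry
    geo_update = cross_attention(q_geo, [q_ins])     # geometry attends to instance
    new_ins = [a + b for a, b in zip(q_ins, ins_update)]  # residual add (assumed)
    new_geo = [[a + b for a, b in zip(g, u)] for g, u in zip(q_geo, geo_update)]
    return new_ins, new_geo


def interact_all(all_ins, all_geo):
    """Apply the per-vector interaction to each of the N vectors and collect results."""
    pairs = [interact(qi, qg) for qi, qg in zip(all_ins, all_geo)]
    return [p[0] for p in pairs], [p[1] for p in pairs]


new_ins, new_geo = interact([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]])
```

Because each geometry query attends to a single instance query, its update in this simplified form is just the instance query itself, which shows how instance-level context is broadcast to every point of the vector.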
Keypoint Dynamic Matching:
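Solves bipartite matching between the predicted vector $\hat{P} = \{\hat{p}_i\}_{i=1}^{M}$ and the ground truth $P = \{p_i\}_{i=1}^{T}$:
$$L_{match}(\hat{P}, P, \beta) = \frac{1}{T}\sum_{i=1}^{T}\big(\alpha_p \cdot l_1(p_i, \hat{p}_i) + \alpha_c \cdot l_1(c_i, \hat{c}_i)\big)$$
$$\beta^{*} = \arg\min_{\beta} L_{match}(\hat{P}, P, \beta)$$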
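The matching objective above can be illustrated on a toy example by reading β as an assignment from each ground-truth index to a distinct predicted index. The sketch below searches all assignments by brute force, which is only workable for a handful of points; a practical implementation would more likely use the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`). Function names, default weights, and the toy data are assumptions.

```python
from itertools import permutations


def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def match_keypoints(pred_pts, pred_cls, gt_pts, gt_cls, alpha_p=1.0, alpha_c=1.0):
    """Brute-force search for the assignment with the lowest average cost.

    Returns (beta, cost), where beta[i] is the predicted index matched to
    ground-truth point i. Requires len(pred_pts) >= len(gt_pts).
    """
    T, M = len(gt_pts), len(pred_pts)
    best_beta, best_cost = None, float("inf")
    for beta in permutations(range(M), T):
        cost = sum(
            alpha_p * l1(gt_pts[i], pred_pts[j]) + alpha_c * l1(gt_cls[i], pred_cls[j])
            for i, j in enumerate(beta)
        ) / T
        if cost < best_cost:
            best_beta, best_cost = beta, cost
    return best_beta, best_cost


# Three predicted points (M = 3) against two ground-truth keypoints (T = 2).
pred_pts = [(0.9, 0.9), (0.1, 0.1), (0.5, 0.5)]
pred_cls = [[0.8], [0.9], [0.1]]   # predicted keypoint scores
gt_pts = [(0.0, 0.0), (1.0, 1.0)]
gt_cls = [[1.0], [1.0]]
beta, cost = match_keypoints(pred_pts, pred_cls, gt_pts, gt_cls)
```

In this example the low-scoring middle prediction is left unmatched, so it would be supervised as a non-keypoint under this reading.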
Vector Shape Supervision:
Comprehensive constraint includes direction loss, keypoint loss, and classification loss:
$$L_{VSL} = \alpha_1 \cdot L_{dir} + \alpha_2 \cdot L_{kp} + \alpha_3 \cdot L_{cls}$$
- Unified Representation: First proposes structured query unified representation for different vector types
- Interaction Mechanism: Designs explicit instance-geometry interaction to bridge information gap between two levels
- Dynamic Constraint: Introduces dynamic shape constraint to adapt to shape variations of different vectors
Multi-Vector Dataset:
- First multi-structure vector extraction dataset
- 20,000 training images, 3,734 validation images
- Three semantic categories: buildings (70.6%), road boundaries (18.9%), centerlines (10.5%)
- Buildings as polygons, road boundaries as polylines, centerlines as line segments
Single-Structure Datasets:
- CrowdAI: 280k+ training images, 60k test images for building extraction
- Structured3D: Synthetic 3D house dataset
- Topo-Boundary: 25k aerial images for road boundary extraction
- Wireframe and York Urban: Standard line segment detection datasets
Buildings: mAP, IoU, CIoU, PoLiS
Road Boundaries and Centerlines:
- Pixel-level: Precision, Recall, F1-score (10-pixel tolerance)
- Geometry-level: ECM (Entropy Connectivity Measure), APLS (Average Path Length Similarity)
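The pixel-level metrics with a distance tolerance can be sketched as follows. This is a generic relaxed precision/recall computation using Euclidean distance; the exact evaluation protocol used in the paper is not given in this summary, so the function name and the matching rule are assumptions.

```python
import math


def relaxed_f1(pred_pixels, gt_pixels, tolerance=10.0):
    """Precision, recall, and F1 where a pixel counts as correct if some
    pixel of the other set lies within `tolerance` (Euclidean distance)."""

    def near(p, others):
        return any(math.dist(p, q) <= tolerance for q in others)

    matched_pred = sum(near(p, gt_pixels) for p in pred_pixels)
    matched_gt = sum(near(g, pred_pixels) for g in gt_pixels)
    precision = matched_pred / len(pred_pixels) if pred_pixels else 0.0
    recall = matched_gt / len(gt_pixels) if gt_pixels else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two predictions fall near the ground truth and one is far away;
# one ground-truth pixel has no prediction within 10 pixels.
pred = [(0, 0), (0, 5), (100, 100)]
gt = [(0, 3), (0, 50)]
precision, recall, f1 = relaxed_f1(pred, gt)
```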
Compared methods include FFL, HiSup, and PolyR-CNN (polygons), Sat2Graph and RNGDet++ (polylines), HAWP and LETR (line segments), and other representative methods.
Multi-Vector Dataset Performance:
- Buildings: mAP 49.8% (ResNet-50), 53.4% (Swin-L)
- Road Boundaries: F1-score 88.4% (ResNet-50), 90.4% (Swin-L)
- Centerlines: F1-score 87.8% (ResNet-50), 88.2% (Swin-L)
SOTA Performance on Single-Structure Datasets:
- CrowdAI: AP 72.8% (ResNet-50), 79.9% (Swin-B)
- Topo-Boundary: F1-score 90.3%
- Wireframe: sAP10 64.5% (ResNet-50), 69.8% (Swin-L)
| Component | Multi-Vector Buildings (mAP) | CrowdAI (AP) | Topo-Boundary (F1) |
|---|---|---|---|
| Baseline | 39.6 | 63.9 | 78.8 |
| +IGID | 45.2 (+5.6) | 69.3 (+5.4) | 85.6 (+6.8) |
| +UVE | 47.6 (+2.4) | 71.5 (+2.2) | 87.5 (+1.9) |
| +DSC | 49.4 (+1.8) | 72.8 (+1.3) | 90.3 (+2.8) |
Instance-Geometry Interaction Decoding (IGID) provides the largest gain, while Unified Vector Encoding (UVE) and Dynamic Shape Constraint (DSC) provide additional improvements.
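The stepwise gains in the ablation table can be re-derived from the reported scores with a few lines of arithmetic. The scores are copied from the table; the variable and function names are only for this illustration.

```python
# Scores copied from the ablation table (rows: Baseline, +IGID, +UVE, +DSC).
ablation = {
    "Multi-Vector Buildings": [39.6, 45.2, 47.6, 49.4],
    "CrowdAI": [63.9, 69.3, 71.5, 72.8],
    "Topo-Boundary": [78.8, 85.6, 87.5, 90.3],
}
steps = ["+IGID", "+UVE", "+DSC"]


def stepwise_gains(scores):
    """Difference between each row and the previous one, rounded to 0.1."""
    return [round(b - a, 1) for a, b in zip(scores, scores[1:])]


gains = {name: dict(zip(steps, stepwise_gains(s))) for name, s in ablation.items()}
total = {name: round(s[-1] - s[0], 1) for name, s in ablation.items()}
largest = {name: max(g, key=g.get) for name, g in gains.items()}
```

On these numbers the cumulative improvement over the baseline is 9.8, 8.9, and 11.5 points respectively, and IGID is the largest single step in every column.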
- Training Efficiency: Compared with cascaded multi-model methods, training and inference are 2 to 20 times faster
- Geometric Precision: Demonstrates more accurate shapes and fewer false detections in complex scenes
- Cross-Domain Generalization: Maintains stable performance across different datasets
Instance-to-Geometry Framework:
- First predicts instance representation (bounding box or mask), then infers vector geometry
- Representative methods: Mask R-CNN, PolyR-CNN, LETR
- Limitations: Depends on instance quality, prone to distortion in dense scenes
Geometry-to-Instance Framework:
- First detects geometric points, then predicts connectivity relationships
- Representative methods: PolyWorld, GraphMapper, RoadTracer
- Limitations: Lacks instance-level priors, prone to topological errors
By explicitly modeling instance-geometry interaction, UniVector combines the advantages of both frameworks and achieves more accurate multi-structure vector extraction.
- UniVector successfully achieves unified multi-structure vector extraction, achieving SOTA on both single-structure and multi-structure tasks
- Instance-geometry interaction mechanism effectively bridges information gap between two levels
- Dynamic shape constraint adapts to shape variation requirements of different vector types
- Fixed maximum point number setting may limit representation of extremely complex shapes
- Computational complexity increases compared to single-structure methods
- Still faces challenges with extremely small-scale or severely occluded vectors
The authors propose developing zero-shot vector extraction foundation models and applying vector representation to downstream tasks such as visual localization and path planning.
- Strong Innovation: First proposes unified multi-structure vector extraction framework, addressing a long-standing problem in the field
- Reasonable Methodology: Instance-geometry interaction design inspired by human cognition has strong theoretical foundation
- Comprehensive Experiments: Thorough evaluation on multiple datasets demonstrates method effectiveness
- High Practical Value: Significantly improves training efficiency with important application value
- Computational Overhead: Computational complexity increases compared to single-structure methods
- Parameter Sensitivity: Weight parameters in dynamic shape constraint require careful tuning
- Extreme Cases: Limited handling capability for extremely small targets or severely occluded situations
- Academic Contribution: Pioneering solution to multi-structure unified extraction problem, providing new insights for field development
- Practical Value: Important significance in applications such as geographic information systems and autonomous driving
- Reproducibility: Commits to open-sourcing code and datasets, facilitating subsequent research
- High-precision map construction
- Remote sensing image analysis
- Building information extraction
- Autonomous driving path planning
- Graphic design automation
The paper cites 75 related references covering multiple relevant fields including vector extraction, object detection, semantic segmentation, and graph neural networks, providing solid theoretical foundation for this research.
Overall Evaluation: This is a high-quality computer vision paper that makes a significant advance on the important task of vector extraction. The method is novel, the experimental design is sound, the results are convincing, and the work has both academic value and practical significance.