Text-Enhanced Panoptic Symbol Spotting in CAD Drawings

Liu, Gong, Li et al.

With the widespread adoption of Computer-Aided Design (CAD) drawings in engineering, architecture, and industrial design, the ability to accurately interpret and analyze these drawings has become increasingly critical. Among various subtasks, panoptic symbol spotting plays a vital role in enabling downstream applications such as CAD automation and design retrieval. Existing methods primarily focus on geometric primitives within the CAD drawings to address this task, but they face two major problems: they usually overlook the rich textual annotations present in CAD drawings, and they lack explicit modeling of relationships among primitives, resulting in an incomplete understanding of the drawing as a whole. To fill this gap, we propose a panoptic symbol spotting framework that incorporates textual annotations. The framework constructs unified representations by jointly modeling geometric and textual primitives. Then, using visual features extracted by a pretrained CNN as the initial representations, a Transformer-based backbone is employed, enhanced with a type-aware attention mechanism to explicitly model the different types of spatial dependencies between various primitives. Extensive experiments on a real-world dataset demonstrate that the proposed method outperforms existing approaches on symbol spotting tasks involving textual annotations, and exhibits superior robustness when applied to complex CAD drawings.

Text-Enhanced Panoptic Symbol Spotting in CAD Drawings

Basic Information

Paper ID: 2510.11091
Title: Text-Enhanced Panoptic Symbol Spotting in CAD Drawings
Authors: Xianlin Liu, Yan Gong, Bohao Li, Jiajing Huang, Bowen Du, Junchen Ye, Liyan Xu
Classification: cs.CV cs.AI
Publication Date: October 13, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.11091

Abstract

With the widespread application of Computer-Aided Design (CAD) drawings in engineering, architecture, and industrial design, the ability to accurately interpret and analyze these drawings has become increasingly important. Among various subtasks, panoptic symbol spotting plays a crucial role in supporting downstream applications such as CAD automation and design retrieval. Existing methods primarily focus on geometric primitives in CAD drawings to address this task but face two major challenges: they typically neglect the rich textual annotations present in CAD drawings and lack explicit modeling of relationships between primitives, resulting in incomplete overall drawing comprehension. To address this gap, this paper proposes a panoptic symbol spotting framework that incorporates textual annotations by jointly modeling geometric and textual primitives to construct unified representations. The framework employs a Transformer-based backbone network and type-aware attention mechanisms to explicitly model spatial dependencies between different types of primitives.

Research Background and Motivation

Problem Definition

The core problem addressed in this paper is the panoptic symbol spotting task in CAD drawings, which unifies instance-level symbol detection and semantic recognition. It requires identifying both countable "object" categories (often called "things" in the panoptic literature, such as doors, windows, and furniture) and uncountable "material" categories (often called "stuff", such as walls and railings).

Problem Significance

Industrial Demand: CAD drawings are widely used in mechanical manufacturing, architecture, electronics, and aerospace industries. Accurate symbol recognition is fundamental to enabling intelligent design interpretation, automated modeling, and drawing retrieval.
Technical Challenges: Real-world CAD drawings are large-scale and structurally complex, requiring simultaneous understanding of geometric structure and semantic information.
Application Value: Supports downstream applications such as CAD automation and design retrieval.

Limitations of Existing Methods

Neglect of Textual Information: Existing methods primarily focus on geometric primitives (lines, arcs, circles, etc.) while ignoring the rich textual annotations in CAD drawings, which contain important semantic information such as dimension labels, symbol names, and functional descriptions.
Lack of Relationship Modeling: Absence of explicit modeling of relationships between different types of primitives prevents capturing high-level structural dependencies, limiting representation capacity and model performance.

Research Motivation

Textual annotations in CAD drawings provide semantic clues that complement geometric layouts and serve as an important information source for understanding design intent. By integrating textual annotations with geometric primitives, more comprehensive representations can be constructed, improving recognition accuracy in complex scenarios.

Core Contributions

First Integration of Textual Information into CAD Symbol Recognition: Introduces textual annotations as a key semantic modality in CAD symbol recognition tasks, achieving richer drawing content understanding by combining text and geometric primitives.
Proposes Type-Aware Attention Mechanism: Designs a type-aware attention mechanism to explicitly model spatial relationships between different types of primitives, enhancing the model's understanding of layout structure.
Achieves State-of-the-Art Performance on Real Datasets: Achieves leading performance on the FloorPlanCAD dataset containing textual annotations, validating the practical utility and robustness of the method.

Methodology Details

Task Definition

Input: Vectorized CAD drawing D containing geometric primitives (lines, arcs, circles, ellipses) and textual annotations
Primitive Representation: Each primitive e_i is associated with semantic category l_i and instance index z_i
Output: Predict semantic label l̂_i and instance index ẑ_i for each primitive
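
The task definition above can be pictured with a small data structure. This is only an illustrative sketch: the `Primitive` class, its field names, and the `group_instances` helper are hypothetical and do not come from the paper. It shows how per-primitive predictions (l̂_i, ẑ_i) could be grouped into symbol instances, with primitives of uncountable classes carrying no instance index.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Primitive:
    """One vertex of the drawing: a geometric primitive or a text annotation."""
    kind: str                       # "line", "arc", "circle", "ellipse", or "text"
    center: Tuple[float, float]     # representative point (an assumption of this sketch)
    semantic: Optional[int] = None  # semantic label l_i
    instance: Optional[int] = None  # instance index z_i; None for uncountable classes


def group_instances(prims: List[Primitive]) -> Dict[Tuple[int, int], List[int]]:
    """Group primitive indices by (semantic label, instance index)."""
    groups: Dict[Tuple[int, int], List[int]] = {}
    for idx, p in enumerate(prims):
        if p.instance is None:
            continue  # uncountable classes only carry a semantic label
        groups.setdefault((p.semantic, p.instance), []).append(idx)
    return groups
```

Each key of the returned dictionary corresponds to one predicted symbol, and its value lists the primitives that make it up.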

Model Architecture

1. Graph Construction Module

Decomposes CAD drawings into a set of basic graphic primitives D = {p_k}, including geometric primitives and textual annotations, serving as vertices in the graph. Introduces a text integration module to process diverse textual primitives while retaining high-quality annotations with meaningful semantics.

2. Feature Initialization

Visual Feature Extraction: Uses pre-trained CNN (HRNetV2-W48) to extract feature maps F from rasterized CAD images
Primitive Features: Obtains initial feature embeddings through bilinear interpolation from feature maps: f_i^0 = ε_CNN(F, c_i)
Edge Feature Construction: Manually constructs edge features describing spatial relationships between different types of primitives
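
The bilinear interpolation step can be sketched as follows. This is a minimal pure-Python illustration, assuming the feature map F is stored as nested lists indexed `[row][col][channel]` and that c_i is a continuous pixel coordinate of the primitive; the function name `bilinear_sample` is hypothetical and the paper's actual sampling code may differ.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Sample a C-channel feature vector at continuous pixel location (x, y).

    feature_map[row][col] is a list of C floats. Coordinates are clamped
    to the map borders (a choice made for this sketch).
    """
    h, w = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0  # fractional offsets inside the pixel cell
    out = []
    for c in range(len(feature_map[0][0])):
        top = (1 - ax) * feature_map[y0][x0][c] + ax * feature_map[y0][x1][c]
        bottom = (1 - ax) * feature_map[y1][x0][c] + ax * feature_map[y1][x1][c]
        out.append((1 - ay) * top + ay * bottom)
    return out
```

Sampling at a primitive's location rather than pooling over a region keeps one feature vector per primitive, which matches the per-vertex embedding f_i^0 described above.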

3. Type-Aware Attention Mechanism

Edge Feature Encoding:

Type indicator t: Represents node pair categories (geometry-geometry, geometry-text, text-text)
Geometric relationship vector e ∈ ℝ^7: Captures relative distance, position, and angle
Complete edge feature: E = (t∥e) ∈ ℝ^{N×k×8}
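
To make the 8-dimensional edge feature concrete, here is a hedged sketch. The summary only states that e captures relative distance, position, and angle; the seven components chosen below (dx, dy, distance, cos and sin of the connecting direction, orientation difference, length ratio), the numeric type codes, and the dictionary keys are all assumptions made for illustration, not the paper's definition.

```python
import math

# Hypothetical numeric codes for the three node-pair categories.
TYPE_CODES = {("geom", "geom"): 0.0, ("geom", "text"): 1.0, ("text", "text"): 2.0}


def pair_type(kind_i, kind_j):
    """Type indicator t for an unordered pair of primitives."""
    a = "text" if kind_i == "text" else "geom"
    b = "text" if kind_j == "text" else "geom"
    return TYPE_CODES[tuple(sorted((a, b)))]


def edge_feature(pi, pj):
    """Return (t || e) with 1 + 7 = 8 entries for primitives pi and pj.

    Each primitive is a dict with keys kind, x, y, theta, length
    (a layout assumed for this sketch).
    """
    dx, dy = pj["x"] - pi["x"], pj["y"] - pi["y"]
    dist = math.hypot(dx, dy)
    phi = math.atan2(dy, dx)  # direction from pi to pj
    dtheta = pj["theta"] - pi["theta"]
    ratio = pj["length"] / pi["length"] if pi["length"] else 0.0
    e = [dx, dy, dist, math.cos(phi), math.sin(phi), dtheta, ratio]
    return [pair_type(pi["kind"], pj["kind"])] + e


def knn_edge_features(prims, k=16):
    """Edge features to the k nearest neighbors of every primitive."""
    feats = []
    for i, pi in enumerate(prims):
        others = [j for j in range(len(prims)) if j != i]
        others.sort(key=lambda j: math.hypot(prims[j]["x"] - pi["x"],
                                             prims[j]["y"] - pi["y"]))
        feats.append([edge_feature(pi, prims[j]) for j in others[:k]])
    return feats
```

The result has shape N × min(k, N−1) × 8; reaching the fixed N × k × 8 shape quoted above would additionally require padding, which this sketch omits.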

Attention Computation:

Raw attention scores: α_ij^l = (q_i^l · k_j^l) / √(d/h)
Multi-head attention: A^s = Concat(α_ij^1, α_ij^2, ..., α_ij^h)
Structure embedding: T^s = MLP(E)
Enhanced attention: f^s = Softmax(A^s + T^s) f^{s-1}
Here l indexes the h attention heads, d is the feature dimension, and s indexes the layer.
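
The idea of adding a structure embedding to the attention logits before the softmax can be sketched for a single head and a single query primitive. This is an illustration under stated assumptions: the `bias` argument stands in for the MLP output T^s (no MLP is implemented here), the neighbor features of the previous layer play the role of values, and the scaling uses the per-head dimension, which for one head is simply the query length.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(q, keys, values, bias):
    """Attend from one query to its neighbors with an additive per-neighbor bias.

    q: query vector; keys/values: one vector per neighbor;
    bias: one scalar per neighbor (stand-in for the structure embedding).
    Returns the updated feature and the attention weights.
    """
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale + t
              for k, t in zip(keys, bias)]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(weights[j] * values[j][c] for j in range(len(values)))
           for c in range(dim)]
    return out, weights
```

Because the bias is added before normalization, two neighbors with identical content scores can still receive different weights when their spatial relationship or pair type differs.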

4. Loss Function

Jointly optimizes semantic classification and instance segmentation:

L = λ_sem · L_sem + λ_ins · L_ins
L_ins = (1/Σm_i) Σ_i ∥o_i - (c_i - p_i)∥ · m_i

where L_sem is cross-entropy loss and L_ins is instance center regression loss.
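
A minimal sketch of the combined objective follows. The summary does not define the symbols in L_ins, so this code assumes the common reading: o_i is a predicted 2D offset, c_i the ground-truth instance center, p_i the primitive position, and m_i a 0/1 mask selecting primitives that belong to countable instances. The choice of the Euclidean norm and all function names are likewise assumptions.

```python
import math


def instance_offset_loss(offsets, centers, positions, mask):
    """Masked mean of ||o_i - (c_i - p_i)|| over primitives with m_i = 1."""
    count = sum(mask)
    if count == 0:
        return 0.0  # no countable primitives in this sample
    total = 0.0
    for o, c, p, m in zip(offsets, centers, positions, mask):
        if m:
            tx, ty = c[0] - p[0], c[1] - p[1]  # target offset to the center
            total += math.hypot(o[0] - tx, o[1] - ty)
    return total / count


def cross_entropy(probs, labels):
    """Mean negative log-likelihood; probs[i] is a class distribution."""
    return -sum(math.log(p[l]) for p, l in zip(probs, labels)) / len(labels)


def total_loss(l_sem, l_ins, lam_sem=1.0, lam_ins=0.3):
    """L = lambda_sem * L_sem + lambda_ins * L_ins (defaults from the reported weights)."""
    return lam_sem * l_sem + lam_ins * l_ins
```

The mask keeps primitives of uncountable classes, which have no instance center, from contributing to the regression term.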

Technical Innovations

Textual Primitive Integration: First incorporates textual annotations as an independent primitive type in the graph structure, providing semantic guidance.
Type-Aware Modeling: Explicitly distinguishes relationship types between different primitive pairs through type indicators.
Structured Attention: Integrates edge features as bias terms in attention computation, enhancing spatial relationship modeling.

Experimental Setup

Dataset

FloorPlanCAD Dataset: 15,663 CAD drawings with rich textual annotations
Categories: 35 categories, divided into countable "object" classes and uncountable "material" classes
Annotations: Line-level annotations with category labels and instance indices for object classes, semantic categories only for material classes
Partitioning: Drawings are cut into regular 14 m × 14 m blocks for convenient training and evaluation

Evaluation Metrics

Employs specialized CAD symbol recognition evaluation metrics:

Recognition Quality (RQ): RQ = |TP|/(|TP| + 0.5|FP| + 0.5|FN|)
Segmentation Quality (SQ): SQ = Σ_{(s_p,s_g)∈TP} IoU(s_p,s_g) / |TP|
Panoptic Quality (PQ): PQ = RQ × SQ
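
The three metrics can be computed directly from a list of matched IoUs and the counts of unmatched predictions and ground-truth symbols. This is a small sketch; the function name is hypothetical, and how predictions are matched to ground truth (for example, which IoU threshold makes a pair a true positive) is left to the caller because the summary does not specify it.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, RQ, SQ) from the IoUs of true-positive matches.

    matched_ious: IoU(s_p, s_g) for every (prediction, ground truth) pair in TP
    num_fp: number of unmatched predictions
    num_fn: number of unmatched ground-truth symbols
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    rq = tp / denom if denom else 0.0
    sq = sum(matched_ious) / tp if tp else 0.0
    return rq * sq, rq, sq
```

RQ behaves like an F1 score over symbols, while SQ averages how well the matched symbols overlap, so PQ penalizes both missed symbols and imprecise ones.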

Comparison Methods

CADTransformer: Transformer-based baseline method
CADTransformer + text: Baseline variant with text addition

Implementation Details

Optimizer: Adam (β₁=0.9, β₂=0.99, lr=2.5×10⁻⁵)
Architecture: 6 attention heads, maximum 16 neighbors per primitive
Training: 50 epochs, batch size 2, on 2 RTX 3090 GPUs
Loss Weights: λ_sem=1, λ_ins=0.3
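
For reproduction purposes, the reported settings could be gathered in a single configuration object. The class and field names below are hypothetical; only the values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Reported hyperparameters; field names are illustrative only."""
    optimizer: str = "adam"
    beta1: float = 0.9
    beta2: float = 0.99
    learning_rate: float = 2.5e-5
    num_heads: int = 6
    max_neighbors: int = 16   # neighbors per primitive
    epochs: int = 50
    batch_size: int = 2
    num_gpus: int = 2         # RTX 3090
    lambda_sem: float = 1.0
    lambda_ins: float = 0.3


cfg = TrainConfig()
```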

Experimental Results

Main Results

Method	PQ	RQ	SQ	F1
CADTransformer	0.7152	0.8298	0.8619	0.7754
CADTransformer + text	0.7352	0.8404	0.8748	0.7834
Our Method	0.7371	0.8381	0.8794	0.7877

Key Findings:

Text integration improves PQ from 0.7152 to 0.7352, demonstrating the positive contribution of semantic features.
The type-aware attention mechanism further improves PQ to 0.7371.
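Outperforms the CADTransformer baseline on all four metrics and the text-augmented variant on PQ, SQ, and F1; its RQ (0.8381) is slightly below that of CADTransformer + text (0.8404).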
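
As a sanity check, the reported values can be compared against the identity PQ = RQ × SQ. The short script below uses only the numbers from the results table; the variable and function names are illustrative.

```python
# (PQ, RQ, SQ) as reported in the results table.
rows = {
    "CADTransformer": (0.7152, 0.8298, 0.8619),
    "CADTransformer + text": (0.7352, 0.8404, 0.8748),
    "Our Method": (0.7371, 0.8381, 0.8794),
}


def pq_gap(pq, rq, sq):
    """Absolute difference between the reported PQ and RQ * SQ."""
    return abs(pq - rq * sq)


for name, (pq, rq, sq) in rows.items():
    print(f"{name}: reported PQ={pq:.4f}, RQ*SQ={rq * sq:.4f}")

max_gap = max(pq_gap(*vals) for vals in rows.values())
```

All three rows agree with the identity to within rounding of the four-decimal figures, so the table is internally consistent.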

Category-Level Analysis

The paper provides detailed performance analysis across 32 categories, with main findings:

Advantageous Categories: Significant improvements in door types (single doors, double doors, sliding doors) and furniture categories (sofas, beds, chairs).
Challenging Categories: Slight performance degradation on categories with complex geometric appearance and non-standardized annotations, such as bay windows.
Overall Trend: Better performance on most symbol types, demonstrating the method's generalization capability.

Case Analysis

Visualization results show that compared to CADTransformer, the proposed method produces fewer misclassifications in complex regions, particularly demonstrating greater robustness in challenging areas prone to baseline model confusion.

Classification of CAD Symbol Recognition Methods

Pixel-Based Methods: Treat symbol recognition as an image task using object detection or image segmentation techniques, but lose geometric precision and incur high computational costs.
Primitive-Based Methods: Directly operate on geometric primitives using graph neural networks or Transformers to model relationships, preserving structural information but struggling with complex hierarchical relationships.
Point Cloud-Based Methods: Abstract primitives as high-dimensional point cloud structures to capture rich geometric information but often neglect semantic cues.

Paper Positioning

This paper belongs to primitive-based methods but innovatively incorporates textual semantic information, filling the gap in multimodal understanding in existing approaches.

Conclusions and Discussion

Main Conclusions

Textual annotations are an important semantic information source in CAD drawings; incorporating text raises PQ from 0.7152 to 0.7352 on FloorPlanCAD.
Type-aware attention mechanisms effectively model spatial dependencies between different types of primitives.
Joint modeling of geometry and text provides more comprehensive CAD drawing understanding.

Limitations

Text Quality Dependency: Method performance depends on the quality and consistency of textual annotations.
Computational Complexity: Adding textual primitives and type-aware mechanisms may increase computational overhead.
Dataset Limitations: Validation only on architectural floor plan datasets; generalization to other CAD domains remains to be verified.

Future Directions

Extension to other CAD domains (mechanical, electronics, etc.)
Investigation of more efficient multimodal fusion mechanisms
Exploration of self-supervised learning to reduce dependence on annotated data

In-Depth Evaluation

Strengths

Accurate Problem Identification: Precisely identifies the key issue of existing methods neglecting textual information.
Reasonable Method Design: The type-aware attention mechanism is ingeniously designed to explicitly model different types of relationships.
Comprehensive Experiments: Provides thorough comparative experiments, ablation studies, and case analyses.
Significant Performance Improvement: Improves PQ from 0.7152 to 0.7371 over the CADTransformer baseline on a real, large-scale dataset.
Clear Writing: Well-structured paper with accurate technical descriptions.

Weaknesses

Limited Innovation: Main contribution is applying existing techniques (Transformer + text) to a new domain.
Lack of Theoretical Analysis: Insufficient in-depth theoretical analysis of why textual information is effective.
Computational Overhead Not Analyzed: No analysis of computational complexity and runtime.
Insufficient Generalization Verification: Validation on only one dataset; lacks cross-domain experiments.

Impact

Academic Value: Introduces a multimodal perspective to CAD understanding, potentially inspiring subsequent research.
Practical Value: Simple and effective method, easily applicable to industrial applications.
Reproducibility: Detailed implementation descriptions provide good reproducibility.

Applicable Scenarios

Architectural CAD Analysis: Particularly suitable for architectural floor plans with rich textual annotations.
Engineering Drawing Understanding: Extensible to other engineering drawings with textual annotations.
CAD Automation: Provides foundational technology support for CAD automation and intelligent design systems.

References

The paper cites 75 references spanning CAD analysis, computer vision, and deep learning. Key references include the FloorPlanCAD dataset and CADTransformer, both directly related to this work.

Overall Assessment: This is a technically sound application-oriented paper with clear problem definition. While technical innovation is relatively limited, it accurately identifies practical problems and proposes effective solutions, achieving significant improvements on real datasets. The paper contributes meaningfully to the CAD understanding field, particularly providing valuable exploration in multimodal information fusion.