Fast Self-Supervised depth and mask aware Association for Multi-Object Tracking

Khanchi, Amer, Poullis
Multi-object tracking (MOT) methods often rely on Intersection-over-Union (IoU) for association. However, this becomes unreliable when objects are similar or occluded. Also, computing IoU for segmentation masks is computationally expensive. In this work, we use segmentation masks to capture object shapes, but we do not compute segmentation IoU. Instead, we fuse depth and mask features and pass them through a compact encoder trained self-supervised. This encoder produces stable object representations, which we use as an additional similarity cue alongside bounding box IoU and re-identification features for matching. We obtain depth maps from a zero-shot depth estimator and object masks from a promptable visual segmentation model to obtain fine-grained spatial cues. Our MOT method is the first to use the self-supervised encoder to refine segmentation masks without computing mask IoU. MOT can be divided into joint detection-ReID (JDR) and tracking-by-detection (TBD) models. The latter are computationally more efficient. Experiments with our TBD method on challenging benchmarks with non-linear motion, occlusion, and crowded scenes, such as SportsMOT and DanceTrack, show that our method outperforms the TBD state-of-the-art on most metrics, while achieving competitive performance on simpler benchmarks with linear motion, such as MOT17.

Abstract

Multi-object tracking (MOT) methods typically rely on Intersection over Union (IoU) for association, which becomes unreliable when targets are similar or occluded; computing IoU on segmentation masks is also computationally expensive. This paper uses segmentation masks to capture target shape without computing segmentation IoU. Instead, it fuses depth and mask features and passes them through a compact encoder trained in a self-supervised manner, producing stable target representations that serve as an additional similarity cue alongside bounding box IoU and re-identification features. Depth maps come from a zero-shot depth estimator, and target masks come from a promptable visual segmentation model, providing fine-grained spatial cues. The work is the first to use a self-supervised encoder to refine segmentation masks without computing mask IoU. Experiments on challenging benchmarks with non-linear motion, occlusion, and crowded scenes (such as SportsMOT and DanceTrack) show that the method outperforms state-of-the-art tracking-by-detection (TBD) methods on most metrics.

Research Background and Motivation

Problem Definition

The core challenges in multi-object tracking include:

  1. Occlusion Problem: When targets are partially or completely occluded, traditional 2D cues (such as bounding box IoU) become unreliable
  2. Appearance Similarity: Targets with similar appearance are difficult to distinguish, leading to frequent ID switches
  3. Computational Efficiency: Direct computation of segmentation mask IoU incurs high computational costs
  4. Complex Motion: Target association under non-linear motion patterns is challenging

Research Motivation

Existing MOT methods rely primarily on 2D cues for data association and perform poorly in complex scenes. For example, two pedestrians walking side by side at different depths can be indistinguishable in the 2D view. The paper proposes a 3D spatially aware approach that combines depth and segmentation information for more robust target association.

Limitations of Existing Methods

  1. Joint Detection-ReID (JDR) Methods: Computationally demanding, because detection and tracking must be trained jointly
  2. Tracking-by-Detection (TBD) Methods: Rely primarily on appearance embeddings rather than spatially aware cues
  3. Depth-Aware Methods: Treat depth as an auxiliary signal rather than a primary association cue
  4. Self-Supervised ReID Learning: Relies on contrastive or clustering objectives without leveraging fused 3D spatial information

Core Contributions

  1. Self-Supervised Encoder Design: Enhances the temporal stability and discriminability of depth-segmentation features
  2. Novel Approach: First to use a self-supervised encoder to refine segmentation masks and integrate them into the matching score without computing mask IoU
  3. Competitive Performance: Competitive across tracking scenarios, with the largest gains in occluded scenes
  4. Efficient Implementation: Avoids expensive mask IoU computation while retaining fine-grained spatial reasoning

Method Details

Task Definition

  • Input: Consecutive frames in video sequences and target detection bounding boxes
  • Output: Target identity associations across frames, maintaining ID consistency
  • Constraints: Real-time requirements, handling occlusion and appearance similarity

Model Architecture

1. Depth-Segmentation Fusion Module

  • Zero-Shot Depth Estimation: Uses Depth Pro to generate depth maps representing relative spatial information
  • Promptable Visual Segmentation (PVS): Employs SAM2 for spatiotemporal shape alignment
    • For tracked trajectories in frame t-1, uses bounding boxes as prompts to generate precise segmentation masks
    • For new detections in frame t, propagates the prompt backward to frame t-1 for alignment
  • Fusion: Multiplies each mask pixel-wise with the corresponding depth map to generate fused depth-segmentation embeddings
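
The pixel-wise fusion step can be illustrated with a minimal sketch. It assumes a binary mask and a depth map of the same size, stored as nested lists; the function name `fuse_depth_and_mask` and the toy values are hypothetical and are not taken from the paper's code.

```python
def fuse_depth_and_mask(depth, mask):
    """Multiply a depth map by a binary mask, element by element.

    Both inputs are H x W nested lists. Pixels outside the mask become 0,
    so only the depth values on the object remain.
    """
    if len(depth) != len(mask) or any(
        len(d_row) != len(m_row) for d_row, m_row in zip(depth, mask)
    ):
        raise ValueError("depth and mask must have the same shape")
    return [
        [d * m for d, m in zip(d_row, m_row)]
        for d_row, m_row in zip(depth, mask)
    ]


# Toy 2 x 3 example (values are illustrative only).
depth = [[0.2, 0.4, 0.9],
         [0.3, 0.5, 0.8]]
mask = [[0, 1, 1],
        [0, 1, 0]]

fused = fuse_depth_and_mask(depth, mask)
print(fused)  # [[0.0, 0.4, 0.9], [0.0, 0.5, 0.0]]
```

In practice the mask would come from the segmentation model and the depth map from the depth estimator; the sketch only shows the element-wise product itself.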

2. Self-Supervised Depth-Segmentation Encoder

Architecture Design:

  • Encoder: 3 convolutional layers (4×4 kernel, stride 2), channels from 1→32→64→128
  • Batch normalization and ReLU activation
  • Linear layer producing 2048-dimensional bottleneck features
  • Decoder: Mirror structure with transposed convolution upsampling
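
To make the encoder dimensions concrete, the sketch below traces tensor shapes through the three stride-2 convolutions. The summary does not state the padding or the input resolution, so `padding=1` and a 64×64 single-channel input are assumptions chosen only for illustration.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Standard convolution output size: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(height, width, channels=(1, 32, 64, 128)):
    """Return (channels, height, width) after each of the conv layers."""
    shapes = [(channels[0], height, width)]
    for out_channels in channels[1:]:
        height, width = conv_out(height), conv_out(width)
        shapes.append((out_channels, height, width))
    return shapes


# Assumed 64 x 64 input; the real crop size is not given in the summary.
shapes = encoder_shapes(64, 64)
c, h, w = shapes[-1]
flattened = c * h * w

print(shapes)     # [(1, 64, 64), (32, 32, 32), (64, 16, 16), (128, 8, 8)]
print(flattened)  # 8192
```

Under these assumptions each layer halves the spatial resolution, and the linear layer would map the 8192 flattened activations to the 2048-dimensional bottleneck.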

Training Objectives:

L_total = L_recon + L_bottleneck
L_recon = ||f_i - f̂_i||²₂
L_bottleneck = ||b_{t-1} - b_t||²₂
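
The two loss terms can be sketched as follows. The summary does not define the symbols, so the sketch assumes that f_i is the flattened fused input, f̂_i its reconstruction, and b_{t-1}, b_t the bottleneck features of the same target in consecutive frames; the function names are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def total_loss(f, f_hat, b_prev, b_curr):
    """Return (L_total, L_recon, L_bottleneck) for one sample."""
    l_recon = squared_l2(f, f_hat)             # reconstruction error
    l_bottleneck = squared_l2(b_prev, b_curr)  # temporal consistency
    return l_recon + l_bottleneck, l_recon, l_bottleneck


# Toy vectors, illustrative only.
loss, l_recon, l_bottleneck = total_loss(
    f=[1.0, 0.0, 2.0], f_hat=[0.5, 0.0, 2.0],
    b_prev=[1.0, 1.0], b_curr=[1.0, 0.0],
)
print(loss, l_recon, l_bottleneck)  # 1.25 0.25 1.0
```

The reconstruction term keeps the bottleneck informative about the fused input, while the bottleneck term pulls the representations of the same target in adjacent frames together.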

Temporal Consistency Update:

emb_t = C · emb_{t-1} + (1-C) · emb_new
C = T + (1-T) · (1 - (DC-thresh)/(1-thresh))
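
The update rule can be sketched in a few lines. The summary does not define T, DC, or thresh; the sketch assumes that DC is the detection confidence, thresh is the detection confidence threshold, and T is a fixed base weight. The default values 0.95 and 0.6 are hypothetical.

```python
def update_weight(dc, base=0.95, thresh=0.6):
    """C = T + (1 - T) * (1 - (DC - thresh) / (1 - thresh)).

    Assumes thresh <= dc <= 1. A confident detection (dc = 1) gives C = T;
    a borderline detection (dc = thresh) gives C = 1, leaving the stored
    embedding unchanged.
    """
    if not thresh <= dc <= 1.0:
        raise ValueError("expected thresh <= dc <= 1")
    return base + (1.0 - base) * (1.0 - (dc - thresh) / (1.0 - thresh))


def update_embedding(emb_prev, emb_new, c):
    """emb_t = C * emb_{t-1} + (1 - C) * emb_new, element by element."""
    return [c * p + (1.0 - c) * n for p, n in zip(emb_prev, emb_new)]


c_confident = update_weight(1.0)   # close to the base weight
c_borderline = update_weight(0.6)  # close to 1
blended = update_embedding([1.0, 0.0], [0.0, 1.0], 0.5)
print(blended)  # [0.5, 0.5]
```

Under this reading, low-confidence detections contribute little to the track embedding, which is what keeps the representation stable across noisy frames.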

3. Appearance-Motion Module

  • Non-Linear Kalman Filter: Models target motion dynamics and integrates the Observation-Centric Re-Update (ORU) mechanism
  • Motion Matching: Computes S_IoU (spatial overlap) and S_ang (angular consistency)
  • Appearance Matching: Extracts appearance embeddings using FastReID, calculates cosine similarity S_emb
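
For reference, the spatial-overlap term S_IoU is the standard bounding box IoU. A minimal sketch follows, assuming boxes in (x1, y1, x2, y2) corner format; the box format is an assumption, since the summary does not specify it.

```python
def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


print(box_iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0
print(box_iou((0, 0, 2, 2), (5, 5, 6, 6)))  # 0.0
```

Two targets at different depths can still produce a high box IoU when they overlap in the image plane, which is the ambiguity the depth-segmentation cue is meant to resolve.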

Technical Innovations

  1. Avoiding Mask IoU Computation: Replaces expensive mask IoU with cosine similarity of encoder embeddings
  2. Multi-Modal Fusion: Pixel-level fusion of depth and segmentation information provides fine-grained spatial cues
  3. Self-Supervised Optimization: Enhances feature quality through reconstruction and bottleneck consistency losses
  4. Temporal Stability: Dynamic weighted embedding update strategy maintains cross-frame consistency
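
Replacing mask IoU with a cosine similarity between encoder embeddings reduces the comparison to a dot product per track-detection pair. The sketch below shows one way to build the similarity matrix; the function names and the toy two-dimensional embeddings are hypothetical (the bottleneck described above is 2048-dimensional).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(track_embs, det_embs):
    """Rows index tracks, columns index detections."""
    return [[cosine_similarity(t, d) for d in det_embs] for t in track_embs]


tracks = [[1.0, 0.0], [0.0, 1.0]]
detections = [[0.0, 2.0], [3.0, 0.0]]
print(similarity_matrix(tracks, detections))  # [[0.0, 1.0], [1.0, 0.0]]
```

The cost grows with the embedding dimension rather than with the number of mask pixels, which is where the saving over mask IoU comes from.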

Overall Association Strategy

Match_t = S_IoU_t(X̂,D) + S_ang_t(X̂,D) + S_sd_t(X̂,D) + S_emb_t(X̂,D)

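The Hungarian algorithm is then applied to the Match_t scores to obtain the optimal data association. S_sd is the depth-segmentation similarity computed from the encoder embeddings; the remaining terms are the motion and appearance similarities from the Appearance-Motion Module.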
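
The association step can be sketched as summing the four similarity matrices and picking the assignment with the highest total score. To stay dependency-free, the sketch uses a brute-force search over permutations as a stand-in for the Hungarian algorithm; it returns the same optimum on small inputs but is not the Hungarian algorithm and does not scale. Equal weighting follows the formula above; all names and values are hypothetical.

```python
from itertools import permutations


def match_scores(s_iou, s_ang, s_sd, s_emb):
    """Element-wise sum of four tracks x detections similarity matrices."""
    return [
        [sum(cell) for cell in zip(*rows)]
        for rows in zip(s_iou, s_ang, s_sd, s_emb)
    ]


def best_assignment(scores):
    """Exhaustive search for the track-to-detection assignment that
    maximizes the total score. Assumes no more tracks than detections."""
    n_tracks, n_dets = len(scores), len(scores[0])
    if n_tracks > n_dets:
        raise ValueError("sketch assumes n_tracks <= n_dets")
    best, best_total = None, float("-inf")
    for perm in permutations(range(n_dets), n_tracks):
        total = sum(scores[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best, best_total = perm, total
    return list(enumerate(best))


# Box IoU and appearance are ambiguous here; the depth-segmentation
# cue (s_sd) breaks the tie.
s_iou = [[0.5, 0.5], [0.5, 0.5]]
s_ang = [[0.0, 0.0], [0.0, 0.0]]
s_sd = [[0.9, 0.1], [0.2, 0.8]]
s_emb = [[0.5, 0.5], [0.5, 0.5]]

scores = match_scores(s_iou, s_ang, s_sd, s_emb)
print(best_assignment(scores))  # [(0, 0), (1, 1)]
```

A production tracker would replace `best_assignment` with a polynomial-time solver and would typically also reject matches whose score falls below a threshold.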

Experimental Setup

Datasets

  1. SportsMOT: Fast, unpredictable motion with frequent occlusion
  2. DanceTrack: Highly non-linear motion, frequent occlusion, close-range interactions
    • 40 training sequences, 25 validation sequences, 35 test sequences
  3. MOT17: Medium-density crowds, structured pedestrian motion, relatively linear and predictable

Evaluation Metrics

  • HOTA: Higher Order Tracking Accuracy, balancing detection and association accuracy
  • AssA: Association Accuracy, emphasizing identity preservation
  • DetA: Detection Accuracy
  • IDF1: Identity F1 score, focusing on identity preservation and association quality
  • MOTA: Multiple Object Tracking Accuracy, emphasizing detection-level performance
  • FPS: Frame rate based on tracking components
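
The standard definitions behind three of these metrics are short enough to state in code. The sketch below uses the usual formulas for MOTA, IDF1, and HOTA at a single localization threshold; the reported HOTA averages this quantity over a range of thresholds, so it is not exactly the geometric mean of the reported DetA and AssA. The counts are made up for illustration.

```python
import math


def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / number of ground-truth boxes."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


def hota_at_threshold(det_a, ass_a):
    """HOTA at one threshold: geometric mean of DetA and AssA."""
    return math.sqrt(det_a * ass_a)


print(mota(10, 5, 5, 100))  # 0.8
print(idf1(80, 10, 30))     # 0.8
```

Because HOTA multiplies detection and association accuracy, a tracker with lower DetA can still reach a higher HOTA when its AssA is sufficiently higher.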

Comparison Methods

  • TBD Methods: ByteTrack, OC-SORT, Deep OC-SORT, DiffMOT, CMTrack, etc.
  • JDR Methods: FairMOT, TransTrack, MOTRv2, etc.

Implementation Details

  • Detector: YOLOX (consistent with latest MOT methods)
  • Training: Single NVIDIA A100 GPU, batch size 128, 12 epochs
  • Optimizer: Adam, learning rate 1e-3
  • Inference: Batch size 1, association stage exceeds 125 FPS (DanceTrack validation set)

Experimental Results

Main Results

SportsMOT Test Set

Method        | HOTA↑ | IDF1↑ | AssA↑ | MOTA↑ | DetA↑
DiffMOT*      | 76.2  | 76.1  | 65.1  | 97.1  | 89.3
SelfTrEncMOT* | 76.4  | 77.1  | 66.0  | 95.84 | 88.4

DanceTrack Test Set

Method       | HOTA↑ | IDF1↑ | AssA↑ | MOTA↑ | DetA↑
DiffMOT      | 62.3  | 63.0  | 47.2  | 92.8  | 82.5
SelfTrEncMOT | 64.14 | 66.47 | 50.85 | 90.08 | 81.06
MOTRv2 (JDR) | 69.9  | 71.7  | 59.0  | 91.9  | 83.0

MOT17 Test Set

Method       | HOTA↑ | IDF1↑ | AssA↑ | MOTA↑ | IDs↓
CMTrack      | 65.5  | 81.5  | 66.1  | 80.7  | 912
SelfTrEncMOT | 63.48 | 78.12 | 63.25 | 79.16 | 1,008

Ablation Study

Configuration                              | DanceTrack-val                        | MOT17-val
Appearance + Mask IoU                      | HOTA: 54.78, AssA: 38.52, IDF1: 52.71 | HOTA: 68.26, AssA: 66.81, IDF1: 77.20
Appearance + Bbox IoU                      | HOTA: 59.46, AssA: 43.93, IDF1: 59.11 | HOTA: 70.43, AssA: 70.83, IDF1: 80.73
Appearance + Bbox IoU + Depth-Segmentation | HOTA: 60.61, AssA: 47.04, IDF1: 62.34 | HOTA: 72.22, AssA: 71.79, IDF1: 82.52

Experimental Findings

  1. Complementarity: Replacing mask IoU with bounding box IoU improves performance substantially (DanceTrack-val HOTA 54.78 → 59.46), and adding the depth-segmentation cue improves it further (59.46 → 60.61)
  2. Scene Adaptability: Association gains from the depth-segmentation cue are larger on the non-linear motion dataset DanceTrack (AssA +3.11, IDF1 +3.23) than on the more linear MOT17 (AssA +0.96, IDF1 +1.79); the HOTA gains are +1.15 and +1.79, respectively
  3. Association Quality: HOTA, AssA, and IDF1 all improve on both validation sets when the depth-segmentation cue is added, supporting the method's effectiveness

Related Work

Joint Detection-ReID Methods

  • FairMOT: Dual-branch method combining anchor-free detection and appearance embeddings
  • TransCenter: Deformable attention improves occlusion handling
  • AFMTrack: Attention-based feature matching network

Tracking-by-Detection Methods

  • Sequence-Level Tracking: Graph-based methods (Brasó et al.), self-supervised path consistency (Lu et al.)
  • Frame-Level Tracking: Attention models (TrackFormer, MOTRv2), regression methods (OC-SORT, DiffMOT)

Depth-Aware and Self-Supervised Association

  • Depth Integration: Relative depth ordering (Quach et al.), stereo depth combined with pose estimation (Wang et al.)
  • Self-Supervised ReID: Path consistency embeddings (Li et al.)

Conclusions and Discussion

Main Conclusions

  1. Depth-segmentation fusion provides effective 3D spatial awareness capability
  2. Self-supervised encoders successfully enhance temporal stability and discriminability of features
  3. Maintains fine-grained spatial reasoning capability while avoiding expensive mask IoU computation
  4. Demonstrates superior performance in complex scenes (occlusion, non-linear motion)

Limitations

  1. Computational Bottleneck: The depth estimation step (Depth Pro, about 0.3 seconds per frame) is the main performance bottleneck
  2. Linear Motion Scenes: Limited improvements on linear motion datasets like MOT17
  3. Strong Dependency: Relies on the quality of the pre-trained SAM2 and Depth Pro models
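
A back-of-the-envelope calculation shows how strongly depth estimation dominates. It uses the two figures given in this summary (about 0.3 seconds per frame for depth estimation and an association stage running at 125 FPS) and assumes the stages run sequentially; detector and segmentation costs are not reported here and are left out, so the result is an optimistic upper bound.

```python
def pipeline_fps(stage_seconds):
    """Frames per second when the stages run one after another."""
    total = sum(stage_seconds)
    if total <= 0:
        raise ValueError("total stage time must be positive")
    return 1.0 / total


depth_seconds = 0.3                # per-frame depth estimation time
association_seconds = 1.0 / 125.0  # 125 FPS association stage = 8 ms

fps = pipeline_fps([depth_seconds, association_seconds])
depth_share = depth_seconds / (depth_seconds + association_seconds)

print(f"{fps:.2f} FPS, depth share {depth_share:.1%}")
# 3.25 FPS, depth share 97.4%
```

Under these assumptions, speeding up association further would barely change end-to-end throughput, whereas a faster depth estimator would.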

Future Directions

  1. Real-Time Depth Estimation: Research faster depth estimators to improve overall speed
  2. Contrastive Learning: Introduce contrastive objectives for encoders to enhance discriminability and robustness
  3. End-to-End Training: Explore joint optimization of depth estimation and tracking

In-Depth Evaluation

Strengths

  1. Technical Innovation: First to combine depth-segmentation fusion with self-supervised encoders for MOT
  2. Practical Value: Avoids expensive mask IoU computation, providing an efficient solution
  3. Comprehensive Experiments: Validated on multiple challenging datasets with complete ablation studies
  4. Performance Improvement: Outperforms prior TBD methods on association metrics (HOTA, IDF1, AssA) on SportsMOT and DanceTrack, but trails CMTrack on MOT17

Weaknesses

  1. Computational Efficiency: While avoiding mask IoU, depth estimation remains a bottleneck
  2. Limited Applicability: Advantages not apparent in simple linear motion scenarios
  3. Strong Dependencies: Heavily relies on quality and availability of pre-trained models
  4. Theoretical Analysis: Lacks theoretical explanation for effectiveness of depth-segmentation fusion

Impact

  1. Academic Contribution: Introduces a novel multi-modal fusion approach to the MOT field
  2. Practical Application: Demonstrates utility for tracking in complex scenarios such as sports and dance
  3. Reproducibility: Provides code and implementation details for easy reproduction

Applicable Scenarios

  1. Complex Motion Scenes: Non-linear motion tracking in sports competitions and dance performances
  2. High Occlusion Environments: Multi-object tracking in crowded scenes
  3. Similar Appearance Targets: Scenarios requiring additional spatial cues for disambiguation
  4. Moderate Real-Time Requirements: Applications that can tolerate some computational latency

References

The paper cites 41 references covering major works in the MOT field, including classical methods such as ByteTrack, OC-SORT, and FairMOT, as well as recent depth-aware and self-supervised learning approaches.