
Fast Self-Supervised depth and mask aware Association for Multi-Object Tracking

Khanchi, Amer, Poullis

Multi-object tracking (MOT) methods often rely on Intersection-over-Union (IoU) for association. However, this becomes unreliable when objects are similar or occluded. Also, computing IoU for segmentation masks is computationally expensive. In this work, we use segmentation masks to capture object shapes, but we do not compute segmentation IoU. Instead, we fuse depth and mask features and pass them through a compact encoder trained in a self-supervised manner. This encoder produces stable object representations, which we use as an additional similarity cue alongside bounding box IoU and re-identification features for matching. We obtain depth maps from a zero-shot depth estimator and object masks from a promptable visual segmentation model to obtain fine-grained spatial cues. Our MOT method is the first to use a self-supervised encoder to refine segmentation masks without computing mask IoU. MOT can be divided into joint detection-ReID (JDR) and tracking-by-detection (TBD) models. The latter are computationally more efficient. Experiments with our TBD method on challenging benchmarks with non-linear motion, occlusion, and crowded scenes, such as SportsMOT and DanceTrack, show that our method outperforms the TBD state-of-the-art on most metrics, while achieving competitive performance on simpler benchmarks with linear motion, such as MOT17.


Basic Information

Paper ID: 2510.09878
Title: Fast Self-Supervised Depth and Mask Aware Association for Multi-Object Tracking
Authors: Milad Khanchi, Maria Amer, Charalambos Poullis (Concordia University)
Classification: cs.CV (Computer Vision)
Publication Date: October 10, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.09878
Code Link: https://github.com/Milad-Khanchi/SelfTrEncMOT

Abstract

Multi-object tracking (MOT) methods typically rely on Intersection over Union (IoU) for association, which becomes unreliable when targets are similar or occluded; computing IoU on segmentation masks is also computationally expensive. This paper uses segmentation masks to capture target shape without computing segmentation IoU. Instead, it fuses depth and mask features and passes them through a compact encoder trained with self-supervision, producing stable target representations that serve as an additional similarity cue alongside bounding box IoU and re-identification features. Depth maps come from a zero-shot depth estimator, and target masks come from a promptable visual segmentation model that supplies fine-grained spatial cues. The authors describe this work as the first to use a self-supervised encoder to refine segmentation masks without computing mask IoU. Experiments on challenging benchmarks with non-linear motion, occlusion, and crowded scenes (such as SportsMOT and DanceTrack) show that the method outperforms state-of-the-art TBD methods on most metrics, while remaining competitive on simpler, more linear benchmarks such as MOT17.

Research Background and Motivation

Problem Definition

The core challenges in multi-object tracking include:

Occlusion Problem: When targets are partially or completely occluded, traditional 2D cues (such as bounding box IoU) become unreliable
Appearance Similarity: Targets with similar appearance are difficult to distinguish, leading to frequent ID switches
Computational Efficiency: Direct computation of segmentation mask IoU incurs high computational costs
Complex Motion: Target association under non-linear motion patterns is challenging

Research Motivation

Existing MOT methods primarily rely on 2D cues for data association and perform poorly in complex scenes. For example, when two pedestrians walk in parallel at different depths, their bounding boxes can overlap heavily in the 2D image plane even though they are well separated in 3D. This paper proposes a 3D-aware approach that combines depth and segmentation information to make target association more robust.

Limitations of Existing Methods

Joint Detection-ReID (JDR) Methods: Computationally demanding, because detection and re-identification are trained jointly
Tracking-by-Detection (TBD) Methods: Primarily rely on appearance embeddings rather than spatially aware cues
Depth-Aware Methods: Treat depth as an auxiliary signal rather than a primary association cue
Self-Supervised ReID Learning: Relies on contrastive or clustering objectives without leveraging fused 3D spatial information

Core Contributions

Self-Supervised Encoder Design: Enhances temporal stability and discriminability of depth-segmentation features
Novel Approach: First to use a self-supervised encoder to refine segmentation masks and integrate them into matching scores without computing mask IoU
Competitive Performance: Achieves competitive performance across various tracking scenarios, particularly excelling in occluded scenes
Efficient Implementation: Avoids expensive mask IoU computation while maintaining fine-grained spatial reasoning capability

Method Details

Task Definition

Input: Consecutive frames in video sequences and target detection bounding boxes
Output: Target identity associations across frames, maintaining ID consistency
Constraints: Real-time requirements, handling occlusion and appearance similarity

Model Architecture

1. Depth-Segmentation Fusion Module

Zero-Shot Depth Estimation: Uses Depth Pro to generate depth maps representing relative spatial information
Promptable Visual Segmentation (PVS): Employs SAM2 for spatiotemporal shape alignment
- For tracked trajectories in frame t-1, uses bounding boxes as prompts to generate precise segmentation masks
- For new detections in frame t, propagates the mask backward to frame t-1 for alignment (mask propagation by the segmentation model, not gradient back-propagation)
- Performs pixel-wise multiplication of masks with corresponding depth maps to generate fused depth-segmentation embeddings
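The fusion step above is a pixel-wise product, which can be sketched in a few lines. This is a minimal illustration only: the function name `fuse_depth_and_mask`, the nested-list image representation, and the toy values are assumptions for the example, not the authors' code, which would operate on Depth Pro and SAM2 outputs as arrays or tensors.

```python
def fuse_depth_and_mask(depth, mask):
    """Multiply a depth map by a binary mask, pixel by pixel.

    Both inputs are H x W nested lists (hypothetical representation).
    Pixels outside the mask become 0; pixels inside keep their depth.
    """
    if len(depth) != len(mask) or any(
        len(d_row) != len(m_row) for d_row, m_row in zip(depth, mask)
    ):
        raise ValueError("depth and mask must have the same shape")
    return [
        [d * m for d, m in zip(d_row, m_row)]
        for d_row, m_row in zip(depth, mask)
    ]


# Toy 3x4 depth map (arbitrary units) and a mask covering a 2x2 object.
depth = [
    [2.0, 2.1, 5.0, 5.2],
    [2.0, 2.2, 5.1, 5.3],
    [2.1, 2.2, 5.0, 5.1],
]
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

fused = fuse_depth_and_mask(depth, mask)
for row in fused:
    print(row)
```

The result keeps the object's shape (where values are non-zero) and its depth profile (the values themselves) in a single map, which is what the encoder then consumes.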

2. Self-Supervised Depth-Segmentation Encoder

Architecture Design:

Encoder: 3 convolutional layers (4×4 kernel, stride 2), channels from 1→32→64→128
Batch normalization and ReLU activation
Linear layer producing 2048-dimensional bottleneck features
Decoder: Mirror structure with transposed convolution upsampling
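To see what the encoder description implies for tensor sizes, the sketch below applies the standard convolution output-size formula, floor((n + 2p - k) / s) + 1, to the three layers. The padding of 1 and the 64×64 input crop are assumptions chosen for illustration; the summary states neither.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of one convolution (padding=1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(height, width, channels=(1, 32, 64, 128)):
    """Return (C, H, W) after the input and after each of the conv layers."""
    shapes = [(channels[0], height, width)]
    for out_channels in channels[1:]:
        height, width = conv_out(height), conv_out(width)
        shapes.append((out_channels, height, width))
    return shapes


# Hypothetical 64x64 single-channel fused depth-segmentation crop.
shapes = encoder_shapes(64, 64)
flat_features = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
bottleneck_dim = 2048
# Weights plus biases of the linear layer that maps to the bottleneck.
linear_params = flat_features * bottleneck_dim + bottleneck_dim

for shape in shapes:
    print(shape)
print("flattened features:", flat_features)
print("linear layer parameters:", linear_params)
```

Under these assumptions each layer halves the spatial resolution, so the linear layer maps 8192 flattened features to the 2048-dimensional bottleneck. A different input size or padding changes the flattened size but not the bottleneck width.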

Training Objectives:

L_total = L_recon + L_bottleneck
L_recon = ||f_i - f̂_i||²₂
L_bottleneck = ||b_{t-1} - b_t||²₂
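The two loss terms can be written out directly. In this sketch, `f` is read as the fused input, `f_hat` as its reconstruction, and `b_prev` and `b_curr` as the bottleneck features of the same object in frames t-1 and t; that reading follows the notation above but is an interpretation, and the plain-list vectors and helper names are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def total_loss(f, f_hat, b_prev, b_curr):
    """Return (L_total, L_recon, L_bottleneck) with equal weighting."""
    l_recon = squared_l2(f, f_hat)            # reconstruction error
    l_bottleneck = squared_l2(b_prev, b_curr)  # temporal consistency error
    return l_recon + l_bottleneck, l_recon, l_bottleneck


# Toy values: a slightly imperfect reconstruction and a drifting bottleneck.
f, f_hat = [1.0, 0.0, 2.0], [0.5, 0.0, 2.0]
b_prev, b_curr = [1.0, 1.0], [1.0, 0.0]

l_total, l_recon, l_bottleneck = total_loss(f, f_hat, b_prev, b_curr)
print(l_total, l_recon, l_bottleneck)
```

The reconstruction term keeps the bottleneck informative about the fused input, while the bottleneck term pulls the representations of the same object in adjacent frames together. Neither term needs identity labels, which is what makes the training self-supervised.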

Temporal Consistency Update:

emb_t = C · emb_{t-1} + (1-C) · emb_new
C = T + (1-T) · (1 - (DC-thresh)/(1-thresh))
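The update rule can be traced numerically. The summary does not define DC, thresh, or T; the sketch below assumes DC is the detection confidence, thresh is the detector's confidence threshold, and T is a fixed base weight, which matches how similar dynamic-weighting rules are used in other trackers but is not confirmed here. The numeric values are hypothetical.

```python
def update_weight(dc, t_base=0.9, thresh=0.5):
    """C = T + (1 - T) * (1 - (DC - thresh) / (1 - thresh)).

    dc:     assumed to be the detection confidence in [thresh, 1]
    t_base: assumed fixed base weight T (hypothetical value)
    thresh: assumed confidence threshold (hypothetical value)
    """
    return t_base + (1.0 - t_base) * (1.0 - (dc - thresh) / (1.0 - thresh))


def update_embedding(emb_prev, emb_new, c):
    """emb_t = C * emb_{t-1} + (1 - C) * emb_new, element-wise."""
    return [c * p + (1.0 - c) * n for p, n in zip(emb_prev, emb_new)]


c_confident = update_weight(1.0)    # fully confident detection
c_borderline = update_weight(0.5)   # detection right at the threshold
c_middle = update_weight(0.75)

emb_t = update_embedding([1.0, 0.0], [0.0, 1.0], c_middle)
print(c_confident, c_borderline, c_middle, emb_t)
```

Under this reading, a detection at the threshold gives C = 1 and leaves the stored embedding untouched, while a fully confident detection gives C = T, the largest update the rule allows.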

3. Appearance-Motion Module

Non-Linear Kalman Filter: Models target motion dynamics, integrating Observation-Centered Re-Update (ORU) mechanism
Motion Matching: Computes S_IoU (spatial overlap) and S_ang (angular consistency)
Appearance Matching: Extracts appearance embeddings using FastReID, calculates cosine similarity S_emb

Technical Innovations

Avoiding Mask IoU Computation: Replaces expensive mask IoU with cosine similarity of encoder embeddings
Multi-Modal Fusion: Pixel-level fusion of depth and segmentation information provides fine-grained spatial cues
Self-Supervised Optimization: Enhances feature quality through reconstruction and bottleneck consistency losses
Temporal Stability: Dynamic weighted embedding update strategy maintains cross-frame consistency
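Replacing mask IoU with a cosine similarity between encoder embeddings reduces the comparison to a dot product per track-detection pair. The sketch below builds such a similarity matrix; the 2-dimensional toy embeddings stand in for the 2048-dimensional bottleneck features, and the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(track_embs, det_embs):
    """Rows index tracks, columns index detections."""
    return [[cosine_similarity(t, d) for d in det_embs] for t in track_embs]


track_embs = [[1.0, 0.0], [0.0, 1.0]]
det_embs = [[2.0, 0.0], [1.0, 1.0]]

s_sd = similarity_matrix(track_embs, det_embs)
for row in s_sd:
    print([round(x, 4) for x in row])
```

Each entry costs time proportional to the embedding length, independent of image resolution, whereas a mask IoU has to compare two pixel masks per pair.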

Overall Association Strategy

Match_t = S_IoU_t(X̂,D) + S_ang_t(X̂,D) + S_sd_t(X̂,D) + S_emb_t(X̂,D)

Uses Hungarian algorithm for optimal data association.
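As a rough illustration of the association step, the sketch below sums four score matrices and picks the assignment with the highest total. It does not implement the Hungarian algorithm: it uses an exhaustive search over permutations, which yields the same optimum but is only practical for a handful of objects. In practice a solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice. All score values are made up.

```python
from itertools import permutations


def combine_scores(*matrices):
    """Element-wise sum of equally shaped score matrices."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(m[i][j] for m in matrices) for j in range(cols)]
        for i in range(rows)
    ]


def best_assignment(score):
    """Exhaustive stand-in for the Hungarian algorithm (maximizes total score).

    Requires at least as many detections (columns) as tracks (rows).
    Returns a list of (track_index, detection_index) pairs and the total.
    """
    n_tracks, n_dets = len(score), len(score[0])
    if n_tracks > n_dets:
        raise ValueError("this sketch assumes n_tracks <= n_dets")
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n_dets), n_tracks):
        total = sum(score[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(enumerate(best_perm)), best_total


# Hypothetical per-cue scores for 2 tracks (rows) and 2 detections (columns).
s_iou = [[0.8, 0.1], [0.2, 0.7]]
s_ang = [[0.1, 0.0], [0.0, 0.1]]
s_sd = [[0.9, 0.3], [0.4, 0.8]]
s_emb = [[0.7, 0.5], [0.6, 0.6]]

match = combine_scores(s_iou, s_ang, s_sd, s_emb)
pairs, total = best_assignment(match)
print(pairs, round(total, 2))
```

Because the cues are summed, a track-detection pair with an ambiguous bounding box overlap can still be resolved when the depth-segmentation or appearance term clearly favors one candidate.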

Experimental Setup

Datasets

SportsMOT: Fast, unpredictable motion with frequent occlusion
DanceTrack: Highly non-linear motion, frequent occlusion, close-range interactions
- 40 training sequences, 25 validation sequences, 35 test sequences
MOT17: Medium-density crowds, structured pedestrian motion, relatively linear and predictable

Evaluation Metrics

HOTA: Higher Order Tracking Accuracy, balancing detection and association accuracy
AssA: Association Accuracy, emphasizing identity preservation
DetA: Detection Accuracy
IDF1: Identity F1 score, focusing on identity preservation and association quality
MOTA: Multiple Object Tracking Accuracy, emphasizing detection-level performance
FPS: Frames per second, measured over the tracking components
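For reference, the standard definitions behind three of these metrics are short enough to write out: MOTA = 1 - (FN + FP + IDSW) / GT, IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN), and, at a single localization threshold α, HOTA_α = sqrt(DetA_α · AssA_α). The counts below are invented for illustration and are not taken from the paper.

```python
import math


def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy from error counts."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """Identity F1 from identity-level true/false positives and negatives."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


def hota_at_alpha(det_a, ass_a):
    """HOTA at one localization threshold: geometric mean of DetA and AssA."""
    return math.sqrt(det_a * ass_a)


# Hypothetical counts and scores.
example_mota = mota(false_negatives=50, false_positives=30, id_switches=20, num_gt=1000)
example_idf1 = idf1(idtp=80, idfp=10, idfn=30)
example_hota = hota_at_alpha(det_a=0.81, ass_a=0.49)
print(example_mota, example_idf1, example_hota)
```

The reported HOTA is an average of HOTA_α over a range of thresholds, so taking the square root of the reported DetA times the reported AssA only approximates it. The geometric mean also explains why a tracker can gain HOTA through better association even when its DetA is slightly lower.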

Comparison Methods

TBD Methods: ByteTrack, OC-SORT, Deep OC-SORT, DiffMOT, CMTrack, etc.
JDR Methods: FairMOT, TransTrack, MOTRv2, etc.

Implementation Details

Detector: YOLOX (consistent with latest MOT methods)
Training: Single NVIDIA A100 GPU, batch size 128, 12 epochs
Optimizer: Adam, learning rate 1e-3
Inference: Batch size 1, association stage exceeds 125 FPS (DanceTrack validation set)

Experimental Results

Main Results

SportsMOT Test Set

Method	HOTA↑	IDF1↑	AssA↑	MOTA↑	DetA↑
DiffMOT*	76.2	76.1	65.1	97.1	89.3
SelfTrEncMOT*	76.4	77.1	66.0	95.84	88.4

DanceTrack Test Set

Method	HOTA↑	IDF1↑	AssA↑	MOTA↑	DetA↑
DiffMOT	62.3	63.0	47.2	92.8	82.5
SelfTrEncMOT	64.14	66.47	50.85	90.08	81.06
MOTRv2 (JDR)	69.9	71.7	59.0	91.9	83.0

MOT17 Test Set

Method	HOTA↑	IDF1↑	AssA↑	MOTA↑	IDs↓
CMTrack	65.5	81.5	66.1	80.7	912
SelfTrEncMOT	63.48	78.12	63.25	79.16	1,008

Ablation Study

Configuration	DanceTrack-val	MOT17-val
Appearance + Mask IoU	HOTA: 54.78, AssA: 38.52, IDF1: 52.71	HOTA: 68.26, AssA: 66.81, IDF1: 77.20
Appearance + Bbox IoU	HOTA: 59.46, AssA: 43.93, IDF1: 59.11	HOTA: 70.43, AssA: 70.83, IDF1: 80.73
Appearance + Bbox IoU + Depth-Segmentation	HOTA: 60.61, AssA: 47.04, IDF1: 62.34	HOTA: 72.22, AssA: 71.79, IDF1: 82.52

Experimental Findings

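Complementarity: On DanceTrack-val, replacing mask IoU with bounding box IoU raises HOTA from 54.78 to 59.46, and adding the depth-segmentation cue raises it further to 60.61; MOT17-val follows the same order (68.26, 70.43, 72.22)
Scene Adaptability: The depth-segmentation cue improves association metrics more on the non-linear motion dataset DanceTrack (AssA +3.11, IDF1 +3.23) than on the more linear MOT17 (AssA +0.96, IDF1 +1.79), although the HOTA gain is slightly larger on MOT17 (+1.79 versus +1.15)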
Association Quality: Consistently improves on association metrics including HOTA, AssA, and IDF1, validating method effectiveness
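The gain attributable to the depth-segmentation cue can be read off the ablation table by subtracting the "Appearance + Bbox IoU" row from the "Appearance + Bbox IoU + Depth-Segmentation" row. The short script below does that arithmetic; the numbers are copied from the table, while the dictionary layout and key names are just one convenient way to hold them.

```python
# Values copied from the ablation table (validation sets).
ablation = {
    "DanceTrack-val": {
        "bbox_iou": {"HOTA": 59.46, "AssA": 43.93, "IDF1": 59.11},
        "bbox_iou_plus_depth_seg": {"HOTA": 60.61, "AssA": 47.04, "IDF1": 62.34},
    },
    "MOT17-val": {
        "bbox_iou": {"HOTA": 70.43, "AssA": 70.83, "IDF1": 80.73},
        "bbox_iou_plus_depth_seg": {"HOTA": 72.22, "AssA": 71.79, "IDF1": 82.52},
    },
}


def gains(baseline, improved):
    """Per-metric difference, rounded to the table's two decimals."""
    return {name: round(improved[name] - baseline[name], 2) for name in baseline}


depth_seg_gain = {
    dataset: gains(rows["bbox_iou"], rows["bbox_iou_plus_depth_seg"])
    for dataset, rows in ablation.items()
}

for dataset, delta in depth_seg_gain.items():
    print(dataset, delta)
```

The differences show the cue helping AssA and IDF1 most on DanceTrack, while the HOTA gain is of similar size on both datasets.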

Related Work

Joint Detection-ReID Methods

FairMOT: Dual-branch method combining anchor-free detection and appearance embeddings
TransCenter: Deformable attention improves occlusion handling
AFMTrack: Attention-based feature matching network

Tracking-by-Detection Methods

Sequence-Level Tracking: Graph-based methods (Brasó et al.), self-supervised path consistency (Lu et al.)
Frame-Level Tracking: Attention models (TrackFormer, MOTRv2), regression methods (OC-SORT, DiffMOT)

Depth-Aware and Self-Supervised Association

Depth Integration: Relative depth ordering (Quach et al.), stereo depth combined with pose estimation (Wang et al.)
Self-Supervised ReID: Path consistency embeddings (Li et al.)

Conclusions and Discussion

Main Conclusions

Depth-segmentation fusion provides effective 3D spatial awareness capability
Self-supervised encoders successfully enhance temporal stability and discriminability of features
Maintains fine-grained spatial reasoning capability while avoiding expensive mask IoU computation
Demonstrates superior performance in complex scenes (occlusion, non-linear motion)

Limitations

Computational Bottleneck: The depth estimation step (Depth Pro, ~0.3 seconds/frame) is the main performance bottleneck
Linear Motion Scenes: Limited improvements on linear motion datasets like MOT17
Strong Dependency: Relies on the quality of the pre-trained SAM2 and Depth Pro models
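A back-of-the-envelope budget shows how strongly depth estimation dominates. The sketch combines the two timings given in this summary, roughly 0.3 s per frame for depth estimation and an association stage running above 125 FPS (so at most 1/125 s per frame). Detection and segmentation times are not reported here and are left out, so the result is an optimistic upper bound on end-to-end speed.

```python
def fps_from_stage_times(stage_seconds):
    """End-to-end FPS if the stages run sequentially on each frame."""
    return 1.0 / sum(stage_seconds.values())


# Per-frame times in seconds; only these two figures appear in the summary.
stages = {
    "depth_estimation": 0.3,        # ~0.3 s/frame
    "association": 1.0 / 125.0,     # upper bound implied by ">125 FPS"
}

pipeline_fps = fps_from_stage_times(stages)
depth_share = stages["depth_estimation"] / sum(stages.values())

print(f"pipeline upper bound: {pipeline_fps:.2f} FPS")
print(f"share of time spent on depth: {depth_share:.1%}")
```

Even with a fast association stage, the sequential pipeline stays in the low single digits of frames per second, which is why a faster depth estimator is the first item under future directions.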

Future Directions

Real-Time Depth Estimation: Research faster depth estimators to improve overall speed
Contrastive Learning: Introduce contrastive objectives for encoders to enhance discriminability and robustness
End-to-End Training: Explore joint optimization of depth estimation and tracking

In-Depth Evaluation

Strengths

Technical Innovation: First to combine depth-segmentation fusion with self-supervised encoders for MOT
Practical Value: Avoids expensive mask IoU computation, providing efficient solution
Comprehensive Experiments: Validated on multiple challenging datasets with complete ablation studies
Performance Improvement: On SportsMOT and DanceTrack, exceeds the listed TBD baseline (DiffMOT) on HOTA, IDF1, and AssA, but trails it on MOTA and DetA, and trails CMTrack on MOT17

Weaknesses

Computational Efficiency: Although the method avoids mask IoU, depth estimation remains a bottleneck
Limited Applicability: Advantages not apparent in simple linear motion scenarios
Strong Dependencies: Heavily relies on quality and availability of pre-trained models
Theoretical Analysis: Lacks theoretical explanation for effectiveness of depth-segmentation fusion

Impact

Academic Contribution: Introduces a novel multi-modal fusion approach to the MOT field
Practical Application: Demonstrates utility for tracking in complex scenarios such as sports and dance
Reproducibility: Provides code and implementation details that support reproduction

Applicable Scenarios

Complex Motion Scenes: Non-linear motion tracking in sports competitions and dance performances
High Occlusion Environments: Multi-object tracking in crowded scenes
Similar Appearance Targets: Scenarios requiring additional spatial cues for disambiguation
Moderate Real-Time Requirements: Applications tolerating certain computational latency

References

The paper cites 41 references covering major works in the MOT field, including established methods such as ByteTrack, OC-SORT, and FairMOT, as well as recent depth-aware and self-supervised learning approaches.