Fast Self-Supervised depth and mask aware Association for Multi-Object Tracking
Khanchi, Amer, Poullis
Multi-object tracking (MOT) methods often rely on Intersection-over-Union (IoU) for association. However, IoU becomes unreliable when objects are similar or occluded, and computing it for segmentation masks is computationally expensive. In this work, we use segmentation masks to capture object shapes, but we do not compute segmentation IoU. Instead, we fuse depth and mask features and pass them through a compact encoder trained in a self-supervised manner. This encoder produces stable object representations, which we use as an additional similarity cue alongside bounding box IoU and re-identification features for matching. To obtain fine-grained spatial cues, we take depth maps from a zero-shot depth estimator and object masks from a promptable visual segmentation model. Our MOT method is the first to use a self-supervised encoder to refine segmentation masks without computing mask IoU. MOT methods can be divided into joint detection-ReID (JDR) and tracking-by-detection (TBD) models; the latter are computationally more efficient. Experiments with our TBD method on challenging benchmarks with non-linear motion, occlusion, and crowded scenes, such as SportsMOT and DanceTrack, show that it outperforms the TBD state of the art on most metrics, while achieving competitive performance on simpler benchmarks with linear motion, such as MOT17.
Multi-object tracking (MOT) methods typically rely on Intersection over Union (IoU) for association. IoU becomes unreliable when targets look alike or occlude one another, and computing it on segmentation masks is computationally expensive. This paper uses segmentation masks to capture target shape without computing segmentation IoU. Instead, it fuses depth and mask features and passes them through a compact encoder trained with self-supervision, producing stable target representations that serve as an additional similarity cue alongside bounding box IoU and re-identification features. Depth maps come from a zero-shot depth estimator, and target masks come from a promptable visual segmentation model, which together supply fine-grained spatial cues. The authors describe the work as the first to use a self-supervised encoder to refine segmentation masks without computing mask IoU. Experiments on challenging benchmarks with non-linear motion, occlusion, and crowded scenes (such as SportsMOT and DanceTrack) show that the method outperforms state-of-the-art tracking-by-detection (TBD) methods on most metrics.
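The cue fusion described above can be sketched as a weighted sum of three similarities followed by a matching step. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Obj` class, the `dm_embed` field (standing in for the depth-mask encoder output), the weights, the threshold, and the greedy matcher are all hypothetical, since the summary does not say how the cues are weighted or which assignment algorithm is used.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


@dataclass
class Obj:
    box: Box
    reid: List[float]      # re-identification feature
    dm_embed: List[float]  # hypothetical name for the depth-mask encoder output


def fused_similarity(t: Obj, d: Obj, w_iou=0.4, w_reid=0.3, w_dm=0.3) -> float:
    # Weights are illustrative assumptions, not values from the paper.
    return (w_iou * box_iou(t.box, d.box)
            + w_reid * cosine(t.reid, d.reid)
            + w_dm * cosine(t.dm_embed, d.dm_embed))


def greedy_match(tracks: List[Obj], dets: List[Obj], min_sim=0.3):
    """Greedy one-to-one matching by descending fused similarity.

    Used here for brevity; trackers commonly solve this step with
    Hungarian assignment instead.
    """
    scored = sorted(((fused_similarity(t, d), i, j)
                     for i, t in enumerate(tracks)
                     for j, d in enumerate(dets)), reverse=True)
    used_t, used_d, matches = set(), set(), []
    for sim, i, j in scored:
        if sim < min_sim:
            break
        if i in used_t or j in used_d:
            continue
        used_t.add(i)
        used_d.add(j)
        matches.append((i, j))
    return sorted(matches)


# Two heavily overlapping targets with identical ReID features:
# only the depth-mask embedding separates them.
tracks = [Obj((0, 0, 10, 20), [1.0, 0.0], [1.0, 0.0]),
          Obj((1, 0, 11, 20), [1.0, 0.0], [0.0, 1.0])]
dets = [Obj((1, 0, 11, 20), [1.0, 0.0], [0.0, 1.0]),
        Obj((0, 0, 10, 20), [1.0, 0.0], [1.0, 0.0])]
matches = greedy_match(tracks, dets)
```

In this toy case the box IoU between the two targets is high and the ReID features are identical, so the third term is what keeps the identities apart.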
Existing MOT methods rely primarily on 2D cues for data association and perform poorly in complex scenes. For example, when two pedestrians walk in parallel but at different depths, they may be indistinguishable in the 2D view. This paper proposes a 3D spatially aware approach that combines depth and segmentation information to make target association more robust.
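The pedestrian example can be made concrete with a toy depth map. The sketch below averages depth over each target's mask, which is only a hand-crafted stand-in for the learned depth-mask features the paper uses; the function name `masked_mean_depth` and the grid values are assumptions made for illustration.

```python
from typing import List


def masked_mean_depth(depth: List[List[float]], mask: List[List[int]]) -> float:
    """Average the depth values at pixels where the mask is non-zero."""
    total, count = 0.0, 0
    for depth_row, mask_row in zip(depth, mask):
        for d, m in zip(depth_row, mask_row):
            if m:
                total += d
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    return total / count


# Toy 4x6 scene (illustrative values): a near pedestrian at depth 2.0 fills
# columns 0-3, and a far pedestrian at depth 6.0 is visible in columns 4-5.
depth = [[2.0] * 4 + [6.0] * 2 for _ in range(4)]
mask_near = [[1] * 4 + [0] * 2 for _ in range(4)]
mask_far = [[0] * 4 + [1] * 2 for _ in range(4)]

depth_near = masked_mean_depth(depth, mask_near)
depth_far = masked_mean_depth(depth, mask_far)
depth_gap = depth_far - depth_near
```

Even when the two targets sit side by side in the image plane, the per-mask depth separates them clearly, which is the intuition behind adding a depth-aware cue.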
Self-Supervised Encoder Design: Enhances temporal stability and discriminability of depth-segmentation features
Novel Approach: First to use a self-supervised encoder to refine segmentation masks and integrate them into matching scores without computing mask IoU
Competitive Performance: Achieves competitive performance across various tracking scenarios, particularly excelling in occluded scenes
Input: Consecutive frames in video sequences and target detection bounding boxes
Output: Target identity associations across frames, maintaining ID consistency
Constraints: Real-time requirements, handling occlusion and appearance similarity
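The input/output contract above can be sketched as a tracker that receives one list of boxes per frame and returns one identity per box. This is a minimal illustration under stated assumptions: the `Tracker` class, its `step` method, the threshold, and the greedy IoU-only matching are hypothetical simplifications, and unmatched tracks are never retired.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Tracker:
    """Hypothetical minimal tracker: boxes in, identities out."""
    similarity: Callable[[Box, Box], float] = box_iou
    min_sim: float = 0.3  # illustrative threshold
    next_id: int = 1
    tracks: Dict[int, Box] = field(default_factory=dict)

    def step(self, boxes: List[Box]) -> List[int]:
        """Return one track ID per detection in the current frame."""
        scored = sorted(((self.similarity(tb, b), tid, j)
                         for tid, tb in self.tracks.items()
                         for j, b in enumerate(boxes)), reverse=True)
        ids = [0] * len(boxes)  # 0 means "not yet assigned"
        used = set()
        for sim, tid, j in scored:
            if sim < self.min_sim:
                break
            if tid in used or ids[j]:
                continue
            ids[j] = tid
            used.add(tid)
        for j, box in enumerate(boxes):
            if not ids[j]:  # unmatched detection starts a new track
                ids[j] = self.next_id
                self.next_id += 1
            self.tracks[ids[j]] = box  # remember the latest box per track
        return ids


tracker = Tracker()
frame1 = tracker.step([(0, 0, 10, 10), (50, 0, 60, 10)])
frame2 = tracker.step([(51, 0, 61, 10), (1, 0, 11, 10)])      # same targets, order swapped
frame3 = tracker.step([(1, 0, 11, 10), (200, 200, 210, 210)])  # one new target appears
```

ID consistency here means that the same physical target keeps its number across frames regardless of the order in which detections arrive.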
Complementarity: Switching from mask IoU to bounding box IoU significantly improves performance, with depth-segmentation integration providing further improvements
Scene Adaptability: Improvements are more pronounced on non-linear motion datasets like DanceTrack, while relatively modest on linear motion datasets like MOT17
Association Quality: Consistently improves on association metrics including HOTA, AssA, and IDF1, validating method effectiveness
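For readers unfamiliar with these metrics, the sketch below gives the standard definitions of IDF1 and of HOTA at a single localization threshold, where HOTA is the geometric mean of detection accuracy (DetA) and association accuracy (AssA). The function names and the input numbers are illustrative assumptions, not results from the paper; the reported HOTA score additionally averages over a range of thresholds.

```python
import math


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom > 0 else 0.0


def hota_at_threshold(det_a: float, ass_a: float) -> float:
    """HOTA at one localization threshold: sqrt(DetA * AssA)."""
    return math.sqrt(det_a * ass_a)


# Illustrative inputs only.
example_idf1 = idf1(idtp=80, idfp=10, idfn=30)
example_hota = hota_at_threshold(det_a=0.64, ass_a=0.25)
```

Because AssA enters HOTA multiplicatively, a gain in association quality lifts HOTA even when detections are unchanged.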
The paper cites 41 references covering major work in the MOT field, including classical methods such as ByteTrack, OC-SORT, and FairMOT, as well as recent depth-aware and self-supervised learning approaches, giving readers a broad background for related research.