Kernel Representation and Similarity Measure for Incomplete Data
Cao, Yang, He et al.
Measuring similarity between incomplete data is a fundamental challenge in web mining, recommendation systems, and user behavior analysis. Traditional approaches either discard incomplete data or perform imputation as a preprocessing step, leading to information loss and biased similarity estimates. This paper presents the proximity kernel, a new similarity measure that directly computes similarity between incomplete data in kernel feature space without explicit imputation in the original space. The proposed method introduces data-dependent binning combined with proximity assignment to project data into a high-dimensional sparse representation that adapts to local density variations. For missing value handling, we propose a cascading fallback strategy to estimate missing feature distributions. We conduct clustering tasks on the proposed kernel representation across 12 real world incomplete datasets, demonstrating superior performance compared to existing methods while maintaining linear time complexity. All the code are available at https://anonymous.4open.science/r/proximity-kernel-2289.
academic
Kernel Representation and Similarity Measure for Incomplete Data
This paper addresses the fundamental challenge of similarity measurement for incomplete data by proposing the proximity kernel method. Traditional approaches either discard incomplete data or perform imputation preprocessing, leading to information loss and biased similarity estimation. The proximity kernel directly computes similarity between incomplete data samples in the kernel feature space without requiring explicit imputation in the original space. The method introduces a data-dependent binning mechanism combined with proximity assignment, projecting data into high-dimensional sparse representations that adapt to local density variations. For handling missing values, a cascading fallback strategy is proposed to estimate the distribution of missing features. Clustering experiments on 12 real incomplete datasets demonstrate that the method outperforms existing approaches while maintaining linear time complexity.
Similarity measurement for incomplete data is a fundamental challenge in network mining, recommendation systems, and user behavior analysis. Real-world data is inherently incomplete due to user privacy preferences, survey non-response, and voluntary information non-disclosure.
Ubiquity: Users in recommendation systems typically rate only a small number of items, producing highly sparse user-item matrices
Data Heterogeneity: Missing values may simultaneously affect numerical, categorical, or mixed-type features
Impact on Downstream Tasks: Similarity measurement is fundamental to clustering, classification, and anomaly detection; inaccurate similarity estimation significantly degrades task performance
Deletion Methods: Ignoring missing values or completely removing incomplete samples causes severe information loss and bias
Imputation Methods: Using statistical quantities or complex techniques to fill missing values often fails to capture underlying data distributions and may introduce artificial patterns that do not reflect true similarity structures
Deep Learning Methods: While promising, these typically require large datasets and substantial computational resources, lack theoretical guarantees, and are sensitive to hyperparameters
Existing methods adopt a "two-stage" strategy (imputation followed by similarity computation). This paper proposes a new approach that jointly handles imputation and similarity measurement in kernel representation space.
Proximity Kernel Method: Projects data into high-dimensional sparse representations through equal-frequency binning and Voronoi-based proximity assignment, adapting to local density without explicit density estimation
Cascading Fallback Strategy: Proposes a progressive constraint relaxation matching strategy for handling missing values, progressing from intersection to union to global priors
Linear Time Complexity: Achieves linear time complexity, making the method scalable to large-scale datasets
Experimental Validation: Demonstrates superior performance compared to existing methods on clustering tasks across 12 datasets
Given a dataset D = {x₁, x₂, ..., xₙ} containing n samples, where each sample xᵢ ∈ ℝᵈ is a d-dimensional feature vector potentially containing missing values (represented as NaN). The objective is to compute a similarity function s : D × D → 0,1 that quantifies similarity between any two samples for downstream clustering tasks.
This creates a Voronoi diagram of feature space, where each center cⱼ,ₖ defines a Voronoi cell.
Density Adaptation Property:
In dense regions: Consecutive center spacing is small, creating small Voronoi cells; two points at the same distance are more likely to fall into different cells
In sparse regions: Consecutive center spacing is large, creating large Voronoi cells; two points at the same distance are more likely to fall into the same cell
Density Adaptation Without Explicit Estimation: The combination of equal-frequency binning and proximity assignment naturally achieves density-adaptive partitioning
Joint Processing in Kernel Space: Handles missing values in representation space rather than original space, avoiding introduction of artificial patterns
Progressive Matching Strategy: Matching criteria progress from strict to relaxed, maximizing utilization of available information
This kernel satisfies Mercer conditions (symmetry and positive semi-definiteness) and has a probabilistic interpretation: computing the expected probability that two samples fall into the same bin across all features.
Uses Normalized Mutual Information (NMI) to assess clustering quality, employing K-means clustering with cluster number set to the true number of classes.
Overall Performance: Achieves best or second-best performance on 10 out of 12 datasets with highest average NMI (0.4245)
Statistical Significance: Friedman-Nemenyi test shows proximity kernel significantly outperforms all methods except HI-PMK
Stability: Box plots show proximity kernel not only achieves optimal average performance but also more consistent performance across different datasets
With bin numbers varying from 2 to 10, NMI changes are minimal across three datasets (e.g., Mammo dataset fluctuates between 0.30-0.33), demonstrating method insensitivity to hyperparameters.
Proximity kernel successfully enables direct similarity computation for incomplete data in kernel feature space, avoiding explicit imputation in original space
Data-dependent binning combined with proximity assignment creates representations adapting to local density without explicit density estimation
Cascading fallback strategy effectively utilizes available information, progressively relaxing from strict matching to global priors
Method achieves superior performance while maintaining linear time complexity
Missing Mechanism Assumptions: Current evaluation primarily based on MCAR (Missing Completely At Random) mechanisms; real data often exhibits more complex MAR and MNAR patterns
Binning Strategy: Equal-frequency binning may not be optimal for all data distributions
Theoretical Guarantees: Lacks theoretical convergence guarantees for cascading fallback mechanism
The paper cites 21 relevant references covering multiple domains including missing data handling, kernel methods, and deep learning, providing solid theoretical foundation and comparison benchmarks for this research.
Summary: The proximity kernel method proposed in this paper makes important contributions to similarity measurement for incomplete data. Through clever kernel representation design and cascading fallback strategy, it achieves a good balance between performance and efficiency. Despite some limitations, its innovation and practicality make it valuable in related application domains.