2025-11-24T15:01:18.137860

Kernel Representation and Similarity Measure for Incomplete Data

Cao, Yang, He et al.
Measuring similarity between incomplete data is a fundamental challenge in web mining, recommendation systems, and user behavior analysis. Traditional approaches either discard incomplete data or perform imputation as a preprocessing step, leading to information loss and biased similarity estimates. This paper presents the proximity kernel, a new similarity measure that directly computes similarity between incomplete data in kernel feature space without explicit imputation in the original space. The proposed method introduces data-dependent binning combined with proximity assignment to project data into a high-dimensional sparse representation that adapts to local density variations. For missing value handling, we propose a cascading fallback strategy to estimate missing feature distributions. We conduct clustering tasks on the proposed kernel representation across 12 real world incomplete datasets, demonstrating superior performance compared to existing methods while maintaining linear time complexity. All the code are available at https://anonymous.4open.science/r/proximity-kernel-2289.
academic

Kernel Representation and Similarity Measure for Incomplete Data

Basic Information

  • Paper ID: 2510.13352
  • Title: Kernel Representation and Similarity Measure for Incomplete Data
  • Authors: Yang Cao, Sikun Yang, Kai He, Wenjun Ma, Ming Liu, Yujiu Yang, Jian Weng
  • Category: cs.LG (Machine Learning)
  • Publication Date: October 15, 2024 (arXiv submission)
  • Paper Link: https://arxiv.org/abs/2510.13352v1

Abstract

This paper addresses the fundamental challenge of similarity measurement for incomplete data by proposing the proximity kernel method. Traditional approaches either discard incomplete data or perform imputation preprocessing, leading to information loss and biased similarity estimation. The proximity kernel directly computes similarity between incomplete data samples in the kernel feature space without requiring explicit imputation in the original space. The method introduces a data-dependent binning mechanism combined with proximity assignment, projecting data into high-dimensional sparse representations that adapt to local density variations. For handling missing values, a cascading fallback strategy is proposed to estimate the distribution of missing features. Clustering experiments on 12 real incomplete datasets demonstrate that the method outperforms existing approaches while maintaining linear time complexity.

Research Background and Motivation

Core Problem

Similarity measurement for incomplete data is a fundamental challenge in network mining, recommendation systems, and user behavior analysis. Real-world data is inherently incomplete due to user privacy preferences, survey non-response, and voluntary information non-disclosure.

Problem Significance

  1. Ubiquity: Users in recommendation systems typically rate only a small number of items, producing highly sparse user-item matrices
  2. Data Heterogeneity: Missing values may simultaneously affect numerical, categorical, or mixed-type features
  3. Impact on Downstream Tasks: Similarity measurement is fundamental to clustering, classification, and anomaly detection; inaccurate similarity estimation significantly degrades task performance

Limitations of Existing Methods

  1. Deletion Methods: Ignoring missing values or completely removing incomplete samples causes severe information loss and bias
  2. Imputation Methods: Using statistical quantities or complex techniques to fill missing values often fails to capture underlying data distributions and may introduce artificial patterns that do not reflect true similarity structures
  3. Deep Learning Methods: While promising, these typically require large datasets and substantial computational resources, lack theoretical guarantees, and are sensitive to hyperparameters

Research Motivation

Existing methods adopt a "two-stage" strategy (imputation followed by similarity computation). This paper proposes a new approach that jointly handles imputation and similarity measurement in kernel representation space.

Core Contributions

  1. Proximity Kernel Method: Projects data into high-dimensional sparse representations through equal-frequency binning and Voronoi-based proximity assignment, adapting to local density without explicit density estimation
  2. Cascading Fallback Strategy: Proposes a progressive constraint relaxation matching strategy for handling missing values, progressing from intersection to union to global priors
  3. Linear Time Complexity: Achieves linear time complexity, making the method scalable to large-scale datasets
  4. Experimental Validation: Demonstrates superior performance compared to existing methods on clustering tasks across 12 datasets

Method Details

Task Definition

Given a dataset D = {x₁, x₂, ..., xₙ} containing n samples, where each sample xᵢ ∈ ℝᵈ is a d-dimensional feature vector potentially containing missing values (represented as NaN). The objective is to compute a similarity function s : D × D → 0,1 that quantifies similarity between any two samples for downstream clustering tasks.

Model Architecture

1. Data-Dependent Binning and Proximity Assignment

Bin Center Selection: For each feature j, equal-frequency binning selects nₑᵢₙₛ bin centers:

c_{j,k} = percentile(X_{obs,j}, (k-1)×100/(n_{bins}-1))

where k ∈ {1,2,...,nₑᵢₙₛ} and Xₒᵦₛ,ⱼ represents all observed values of feature j.

Proximity Assignment Mechanism: Employs proximity assignment rather than traditional interval membership:

b_{i,j} = argmin_{k∈{1,...,n_{bins}}} |x_{i,j} - c_{j,k}|

This creates a Voronoi diagram of feature space, where each center cⱼ,ₖ defines a Voronoi cell.

Density Adaptation Property:

  • In dense regions: Consecutive center spacing is small, creating small Voronoi cells; two points at the same distance are more likely to fall into different cells
  • In sparse regions: Consecutive center spacing is large, creating large Voronoi cells; two points at the same distance are more likely to fall into the same cell

2. One-Hot Encoding Construction

For each feature j and sample i, construct a binary vector hᵢ,ⱼ ∈ {0,1}^{nₑᵢₙₛ}:

h_{i,j,k} = {1 if b_{i,j} = k; 0 otherwise}

The complete encoding is the concatenation of all features: zᵢ = hᵢ,₁, hᵢ,₂, ..., hᵢ,ₐ

3. Cascading Fallback Strategy for Handling Missing Values

Level 1 - Intersection Matching: Find samples matching on all observed features:

S₁(i) = ∩_{j∈O_i} C_{i,j}

where C_{i,j} = {m : m≠i, b_{m,j} = b_{i,j}}

Level 2 - Union Matching: If S₁(i) = ∅, relax to matching on any observed feature:

S₂(i) = ∪_{j∈O_i} C_{i,j}

Level 3 - Global Prior: If S₂(i) = ∅, use global distribution:

p_{j,k} = count of samples in bin k for feature j / count of samples with observed feature j

For each level, use kernel mean embedding (KME) to estimate missing encodings:

h_{i,j,k} = 1/|S(i)| ∑_{m∈S(i)} h_{m,j,k}

Technical Innovations

  1. Density Adaptation Without Explicit Estimation: The combination of equal-frequency binning and proximity assignment naturally achieves density-adaptive partitioning
  2. Joint Processing in Kernel Space: Handles missing values in representation space rather than original space, avoiding introduction of artificial patterns
  3. Progressive Matching Strategy: Matching criteria progress from strict to relaxed, maximizing utilization of available information

Proximity Kernel Definition

K(x_m, x_n) = 1/d · z_m^T z_n = <z_m, z_n>

This kernel satisfies Mercer conditions (symmetry and positive semi-definiteness) and has a probabilistic interpretation: computing the expected probability that two samples fall into the same bin across all features.

Experimental Setup

Datasets

Uses 12 real datasets from UCI covering multiple domains:

  • Medical Diagnosis: Kidney, Hepatitis, Heart, Tumor, Cancer
  • Biological Classification: Soybean, Mushroom
  • Financial Analysis: Credit
  • Population Prediction: Adult
  • Political Analysis: Voting
  • Others: Mammography, Horse

Datasets range from 155 to 48,842 samples, 5 to 35 dimensions, with missing rates from 0.15% to 19.39%.

Evaluation Metrics

Uses Normalized Mutual Information (NMI) to assess clustering quality, employing K-means clustering with cluster number set to the true number of classes.

Comparison Methods

Nine representative methods:

  1. Simple Imputation: Mean imputation
  2. Statistical Methods: MissForest, MICE, KNN, EM
  3. Deep Learning: GAIN, MIRACLE, MIWAE
  4. Kernel Methods: HI-PMK

Implementation Details

  • Each experiment repeated 10 times with averaged results
  • Hyperparameter tuning performed for all baseline methods
  • Bin numbers for proximity kernel searched in {2,3,4,6,8}

Experimental Results

Main Results

  1. Overall Performance: Achieves best or second-best performance on 10 out of 12 datasets with highest average NMI (0.4245)
  2. Statistical Significance: Friedman-Nemenyi test shows proximity kernel significantly outperforms all methods except HI-PMK
  3. Stability: Box plots show proximity kernel not only achieves optimal average performance but also more consistent performance across different datasets

Missing Rate Robustness Experiments

Testing 10%-50% missing rates on 3L and Messidor datasets:

  • Maintains stable superior performance at low to moderate missing rates (10%-40%)
  • Maintains reasonable performance even at extreme missing rates (50%)
  • In contrast, KNN and MissForest performance nearly drops to zero at 30% missing rate

Scalability Analysis

  • Time Complexity: O(nd + d·n log n), linear in sample size for fixed dimensionality
  • Space Complexity: O(nd), linear in sample size and feature count
  • Actual Runtime: Significantly faster than HI-PMK and MIWAE

Sensitivity Analysis

With bin numbers varying from 2 to 10, NMI changes are minimal across three datasets (e.g., Mammo dataset fluctuates between 0.30-0.33), demonstrating method insensitivity to hyperparameters.

Traditional Imputation Methods

  • Simple Imputation: Mean/mode imputation, computationally efficient but fails to preserve natural data variability
  • KNN Imputation: Based on k most similar samples, but performs poorly on high-dimensional sparse data
  • EM Algorithm: Based on maximum likelihood density estimation, requires strong distributional assumptions
  • MICE: Creates multiple imputed datasets, computationally expensive and requires careful model specification
  • MissForest: Uses random forests for iterative imputation, prone to overfitting and lacks convergence guarantees

Deep Learning Methods

  • GAIN: Uses generative adversarial networks to learn missing data distribution
  • MIWAE: Uses deep latent variable models to maximize evidence lower bound on observed data likelihood
  • MIRACLE: Combines causal inference with deep learning to improve imputation quality

Kernel Methods

  • Traditional Distances: Euclidean distance does not directly apply to incomplete data
  • HI-PMK: Data-dependent kernel method, but with high computational complexity and sensitive hyperparameters

Conclusions and Discussion

Main Conclusions

  1. Proximity kernel successfully enables direct similarity computation for incomplete data in kernel feature space, avoiding explicit imputation in original space
  2. Data-dependent binning combined with proximity assignment creates representations adapting to local density without explicit density estimation
  3. Cascading fallback strategy effectively utilizes available information, progressively relaxing from strict matching to global priors
  4. Method achieves superior performance while maintaining linear time complexity

Limitations

  1. Missing Mechanism Assumptions: Current evaluation primarily based on MCAR (Missing Completely At Random) mechanisms; real data often exhibits more complex MAR and MNAR patterns
  2. Binning Strategy: Equal-frequency binning may not be optimal for all data distributions
  3. Theoretical Guarantees: Lacks theoretical convergence guarantees for cascading fallback mechanism

Future Directions

  1. Investigate method behavior under MAR and MNAR missing mechanisms
  2. Explore adaptive bin selection strategies
  3. Establish theoretical convergence guarantees for cascading fallback mechanism
  4. Extend to more complex data types and structures

In-Depth Evaluation

Strengths

  1. Method Innovation: Unifies imputation and similarity computation in kernel representation space, avoiding problems of traditional two-stage methods
  2. Theoretical Foundation: Provides rigorous proof of kernel validity, satisfying Mercer conditions
  3. Practicality: Linear time complexity makes the method applicable to large-scale applications
  4. Comprehensive Experiments: Thorough evaluation across multiple datasets including statistical significance testing
  5. Robustness: Insensitive to hyperparameters with stable performance across different missing rates

Weaknesses

  1. Missing Mechanism Limitations: Primarily evaluated under MCAR assumptions; adaptability to more complex missing patterns insufficiently verified
  2. Density Estimation: While claiming no explicit density estimation, equal-frequency binning is essentially implicit density estimation
  3. Feature Independence: Modeling of feature dependencies in cascading strategy may be insufficient
  4. Comparison Fairness: Comparisons with deep learning methods may have differences in computational resources and tuning effort

Impact

  1. Academic Contribution: Provides new theoretical framework for similarity measurement of incomplete data
  2. Practical Value: Direct value in recommendation systems, network mining, and other applications
  3. Reproducibility: Provides code links facilitating method verification and dissemination

Applicable Scenarios

  1. Recommendation Systems: Handling sparsity in user-item rating matrices
  2. Network Mining: Processing incompleteness in user behavior data
  3. Medical Data Analysis: Handling missing values in clinical data
  4. Large-Scale Data Processing: Linear complexity suitable for large-scale applications

References

The paper cites 21 relevant references covering multiple domains including missing data handling, kernel methods, and deep learning, providing solid theoretical foundation and comparison benchmarks for this research.


Summary: The proximity kernel method proposed in this paper makes important contributions to similarity measurement for incomplete data. Through clever kernel representation design and cascading fallback strategy, it achieves a good balance between performance and efficiency. Despite some limitations, its innovation and practicality make it valuable in related application domains.