2025-11-24T15:01:18.137860

Kernel Representation and Similarity Measure for Incomplete Data

Cao, Yang, He et al.

Measuring similarity between incomplete data is a fundamental challenge in web mining, recommendation systems, and user behavior analysis. Traditional approaches either discard incomplete data or perform imputation as a preprocessing step, leading to information loss and biased similarity estimates. This paper presents the proximity kernel, a new similarity measure that directly computes similarity between incomplete data in kernel feature space without explicit imputation in the original space. The proposed method introduces data-dependent binning combined with proximity assignment to project data into a high-dimensional sparse representation that adapts to local density variations. For missing value handling, we propose a cascading fallback strategy to estimate missing feature distributions. We conduct clustering tasks on the proposed kernel representation across 12 real world incomplete datasets, demonstrating superior performance compared to existing methods while maintaining linear time complexity. All the code are available at https://anonymous.4open.science/r/proximity-kernel-2289.

academic

Kernel Representation and Similarity Measure for Incomplete Data

Basic Information

Paper ID: 2510.13352
Title: Kernel Representation and Similarity Measure for Incomplete Data
Authors: Yang Cao, Sikun Yang, Kai He, Wenjun Ma, Ming Liu, Yujiu Yang, Jian Weng
Category: cs.LG (Machine Learning)
Publication Date: October 15, 2024 (arXiv submission)
Paper Link: https://arxiv.org/abs/2510.13352v1

Abstract

This paper addresses the fundamental challenge of similarity measurement for incomplete data by proposing the proximity kernel method. Traditional approaches either discard incomplete data or perform imputation preprocessing, leading to information loss and biased similarity estimation. The proximity kernel directly computes similarity between incomplete data samples in the kernel feature space without requiring explicit imputation in the original space. The method introduces a data-dependent binning mechanism combined with proximity assignment, projecting data into high-dimensional sparse representations that adapt to local density variations. For handling missing values, a cascading fallback strategy is proposed to estimate the distribution of missing features. Clustering experiments on 12 real incomplete datasets demonstrate that the method outperforms existing approaches while maintaining linear time complexity.

Research Background and Motivation

Core Problem

Similarity measurement for incomplete data is a fundamental challenge in network mining, recommendation systems, and user behavior analysis. Real-world data is inherently incomplete due to user privacy preferences, survey non-response, and voluntary information non-disclosure.

Problem Significance

Ubiquity: Users in recommendation systems typically rate only a small number of items, producing highly sparse user-item matrices
Data Heterogeneity: Missing values may simultaneously affect numerical, categorical, or mixed-type features
Impact on Downstream Tasks: Similarity measurement is fundamental to clustering, classification, and anomaly detection; inaccurate similarity estimation significantly degrades task performance

Limitations of Existing Methods

Deletion Methods: Ignoring missing values or completely removing incomplete samples causes severe information loss and bias
Imputation Methods: Using statistical quantities or complex techniques to fill missing values often fails to capture underlying data distributions and may introduce artificial patterns that do not reflect true similarity structures
Deep Learning Methods: While promising, these typically require large datasets and substantial computational resources, lack theoretical guarantees, and are sensitive to hyperparameters

Research Motivation

Existing methods adopt a "two-stage" strategy (imputation followed by similarity computation). This paper proposes a new approach that jointly handles imputation and similarity measurement in kernel representation space.

Core Contributions

Proximity Kernel Method: Projects data into high-dimensional sparse representations through equal-frequency binning and Voronoi-based proximity assignment, adapting to local density without explicit density estimation
Cascading Fallback Strategy: Proposes a progressive constraint relaxation matching strategy for handling missing values, progressing from intersection to union to global priors
Linear Time Complexity: Achieves linear time complexity, making the method scalable to large-scale datasets
Experimental Validation: Demonstrates superior performance compared to existing methods on clustering tasks across 12 datasets

Method Details

Task Definition

Given a dataset D = {x₁, x₂, ..., xₙ} containing n samples, where each sample xᵢ ∈ ℝᵈ is a d-dimensional feature vector potentially containing missing values (represented as NaN). The objective is to compute a similarity function s : D × D → 0,1 that quantifies similarity between any two samples for downstream clustering tasks.

Model Architecture

1. Data-Dependent Binning and Proximity Assignment

Bin Center Selection: For each feature j, equal-frequency binning selects nₑᵢₙₛ bin centers:

c_{j,k} = percentile(X_{obs,j}, (k-1)×100/(n_{bins}-1))

where k ∈ {1,2,...,nₑᵢₙₛ} and Xₒᵦₛ,ⱼ represents all observed values of feature j.

Proximity Assignment Mechanism: Employs proximity assignment rather than traditional interval membership:

b_{i,j} = argmin_{k∈{1,...,n_{bins}}} |x_{i,j} - c_{j,k}|

This creates a Voronoi diagram of feature space, where each center cⱼ,ₖ defines a Voronoi cell.

Density Adaptation Property:

In dense regions: Consecutive center spacing is small, creating small Voronoi cells; two points at the same distance are more likely to fall into different cells
In sparse regions: Consecutive center spacing is large, creating large Voronoi cells; two points at the same distance are more likely to fall into the same cell

2. One-Hot Encoding Construction

For each feature j and sample i, construct a binary vector hᵢ,ⱼ ∈ {0,1}^{nₑᵢₙₛ}:

h_{i,j,k} = {1 if b_{i,j} = k; 0 otherwise}

The complete encoding is the concatenation of all features: zᵢ = hᵢ,₁, hᵢ,₂, ..., hᵢ,ₐ

3. Cascading Fallback Strategy for Handling Missing Values

Level 1 - Intersection Matching: Find samples matching on all observed features:

S₁(i) = ∩_{j∈O_i} C_{i,j}

where C_{i,j} = {m : m≠i, b_{m,j} = b_{i,j}}

Level 2 - Union Matching: If S₁(i) = ∅, relax to matching on any observed feature:

S₂(i) = ∪_{j∈O_i} C_{i,j}

Level 3 - Global Prior: If S₂(i) = ∅, use global distribution:

p_{j,k} = count of samples in bin k for feature j / count of samples with observed feature j

For each level, use kernel mean embedding (KME) to estimate missing encodings:

h_{i,j,k} = 1/|S(i)| ∑_{m∈S(i)} h_{m,j,k}

Technical Innovations

Density Adaptation Without Explicit Estimation: The combination of equal-frequency binning and proximity assignment naturally achieves density-adaptive partitioning
Joint Processing in Kernel Space: Handles missing values in representation space rather than original space, avoiding introduction of artificial patterns
Progressive Matching Strategy: Matching criteria progress from strict to relaxed, maximizing utilization of available information

Proximity Kernel Definition

K(x_m, x_n) = 1/d · z_m^T z_n = <z_m, z_n>

This kernel satisfies Mercer conditions (symmetry and positive semi-definiteness) and has a probabilistic interpretation: computing the expected probability that two samples fall into the same bin across all features.

Experimental Setup

Datasets

Uses 12 real datasets from UCI covering multiple domains:

Medical Diagnosis: Kidney, Hepatitis, Heart, Tumor, Cancer
Biological Classification: Soybean, Mushroom
Financial Analysis: Credit
Population Prediction: Adult
Political Analysis: Voting
Others: Mammography, Horse

Datasets range from 155 to 48,842 samples, 5 to 35 dimensions, with missing rates from 0.15% to 19.39%.

Evaluation Metrics

Uses Normalized Mutual Information (NMI) to assess clustering quality, employing K-means clustering with cluster number set to the true number of classes.

Comparison Methods

Nine representative methods:

Simple Imputation: Mean imputation
Statistical Methods: MissForest, MICE, KNN, EM
Deep Learning: GAIN, MIRACLE, MIWAE
Kernel Methods: HI-PMK

Implementation Details

Each experiment repeated 10 times with averaged results
Hyperparameter tuning performed for all baseline methods
Bin numbers for proximity kernel searched in {2,3,4,6,8}

Experimental Results

Main Results

Overall Performance: Achieves best or second-best performance on 10 out of 12 datasets with highest average NMI (0.4245)
Statistical Significance: Friedman-Nemenyi test shows proximity kernel significantly outperforms all methods except HI-PMK
Stability: Box plots show proximity kernel not only achieves optimal average performance but also more consistent performance across different datasets

Missing Rate Robustness Experiments

Testing 10%-50% missing rates on 3L and Messidor datasets:

Maintains stable superior performance at low to moderate missing rates (10%-40%)
Maintains reasonable performance even at extreme missing rates (50%)
In contrast, KNN and MissForest performance nearly drops to zero at 30% missing rate

Scalability Analysis

Time Complexity: O(nd + d·n log n), linear in sample size for fixed dimensionality
Space Complexity: O(nd), linear in sample size and feature count
Actual Runtime: Significantly faster than HI-PMK and MIWAE

Sensitivity Analysis

With bin numbers varying from 2 to 10, NMI changes are minimal across three datasets (e.g., Mammo dataset fluctuates between 0.30-0.33), demonstrating method insensitivity to hyperparameters.

Traditional Imputation Methods

Simple Imputation: Mean/mode imputation, computationally efficient but fails to preserve natural data variability
KNN Imputation: Based on k most similar samples, but performs poorly on high-dimensional sparse data
EM Algorithm: Based on maximum likelihood density estimation, requires strong distributional assumptions
MICE: Creates multiple imputed datasets, computationally expensive and requires careful model specification
MissForest: Uses random forests for iterative imputation, prone to overfitting and lacks convergence guarantees

Deep Learning Methods

GAIN: Uses generative adversarial networks to learn missing data distribution
MIWAE: Uses deep latent variable models to maximize evidence lower bound on observed data likelihood
MIRACLE: Combines causal inference with deep learning to improve imputation quality

Kernel Methods

Traditional Distances: Euclidean distance does not directly apply to incomplete data
HI-PMK: Data-dependent kernel method, but with high computational complexity and sensitive hyperparameters

Conclusions and Discussion

Main Conclusions

Proximity kernel successfully enables direct similarity computation for incomplete data in kernel feature space, avoiding explicit imputation in original space
Data-dependent binning combined with proximity assignment creates representations adapting to local density without explicit density estimation
Cascading fallback strategy effectively utilizes available information, progressively relaxing from strict matching to global priors
Method achieves superior performance while maintaining linear time complexity

Limitations

Missing Mechanism Assumptions: Current evaluation primarily based on MCAR (Missing Completely At Random) mechanisms; real data often exhibits more complex MAR and MNAR patterns
Binning Strategy: Equal-frequency binning may not be optimal for all data distributions
Theoretical Guarantees: Lacks theoretical convergence guarantees for cascading fallback mechanism

Future Directions

Investigate method behavior under MAR and MNAR missing mechanisms
Explore adaptive bin selection strategies
Establish theoretical convergence guarantees for cascading fallback mechanism
Extend to more complex data types and structures

In-Depth Evaluation

Strengths

Method Innovation: Unifies imputation and similarity computation in kernel representation space, avoiding problems of traditional two-stage methods
Theoretical Foundation: Provides rigorous proof of kernel validity, satisfying Mercer conditions
Practicality: Linear time complexity makes the method applicable to large-scale applications
Comprehensive Experiments: Thorough evaluation across multiple datasets including statistical significance testing
Robustness: Insensitive to hyperparameters with stable performance across different missing rates

Weaknesses

Missing Mechanism Limitations: Primarily evaluated under MCAR assumptions; adaptability to more complex missing patterns insufficiently verified
Density Estimation: While claiming no explicit density estimation, equal-frequency binning is essentially implicit density estimation
Feature Independence: Modeling of feature dependencies in cascading strategy may be insufficient
Comparison Fairness: Comparisons with deep learning methods may have differences in computational resources and tuning effort

Impact

Academic Contribution: Provides new theoretical framework for similarity measurement of incomplete data
Practical Value: Direct value in recommendation systems, network mining, and other applications
Reproducibility: Provides code links facilitating method verification and dissemination

Applicable Scenarios

Recommendation Systems: Handling sparsity in user-item rating matrices
Network Mining: Processing incompleteness in user behavior data
Medical Data Analysis: Handling missing values in clinical data
Large-Scale Data Processing: Linear complexity suitable for large-scale applications

References

The paper cites 21 relevant references covering multiple domains including missing data handling, kernel methods, and deep learning, providing solid theoretical foundation and comparison benchmarks for this research.

Summary: The proximity kernel method proposed in this paper makes important contributions to similarity measurement for incomplete data. Through clever kernel representation design and cascading fallback strategy, it achieves a good balance between performance and efficiency. Despite some limitations, its innovation and practicality make it valuable in related application domains.