
Restricted Receptive Fields for Face Verification

Ozturk, Bhatta, Wu et al.

Understanding how deep neural networks make decisions is crucial for analyzing their behavior and diagnosing failure cases. In computer vision, a common approach to improve interpretability is to assign importance to individual pixels using post-hoc methods. Although they are widely used to explain black-box models, their fidelity to the model's actual reasoning is uncertain due to the lack of reliable evaluation metrics. This limitation motivates an alternative approach, which is to design models whose decision processes are inherently interpretable. To this end, we propose a face similarity metric that breaks down global similarity into contributions from restricted receptive fields. Our method defines the similarity between two face images as the sum of patch-level similarity scores, providing a locally additive explanation without relying on post-hoc analysis. We show that the proposed approach achieves competitive verification performance even with patches as small as 28x28 within 112x112 face images, and surpasses state-of-the-art methods when using 56x56 patches.

Basic Information

Paper ID: 2510.10753
Title: Restricted Receptive Fields for Face Verification
Authors: Kagan Ozturk, Aman Bhatta, Haiyu Wu, Patrick Flynn, Kevin W. Bowyer (University of Notre Dame)
Category: cs.CV (Computer Vision)
Publication Date: October 12, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.10753

Abstract

This paper proposes a face verification method based on restricted receptive fields, aimed at making the decisions of deep neural networks interpretable. Traditional methods represent an entire face image with a single global feature vector, whereas this work decomposes global similarity into local contributions from restricted receptive fields. The method defines the similarity between two face images as the sum of patch-level similarity scores, providing locally additive interpretability without relying on post-hoc analysis. Experiments show that the method achieves competitive verification performance even with patches as small as 28×28 in 112×112 face images, and surpasses state-of-the-art methods when using 56×56 patches.

Research Background and Motivation

Core Problem

Deep neural networks achieve excellent performance in face recognition tasks, but their decision-making processes lack interpretability, which is a serious concern in high-risk application scenarios.

Problem Significance

Security Requirements: Face recognition systems are widely deployed in security and medical domains, requiring trustworthy decision processes
Failure Diagnosis: Understanding model decision mechanisms is crucial for analyzing model behavior and diagnosing failure cases
Regulatory Compliance: Many application scenarios require AI systems to be interpretable

Limitations of Existing Methods

Post-hoc Explanation Methods: Existing explainable AI methods primarily rely on post-hoc analysis to generate heatmaps, and there are no reliable metrics for evaluating how faithful those heatmaps are to the model's reasoning
Explanation Credibility: Identical heatmaps may be generated for both correct and incorrect predictions, undermining explanation reliability
Computational Overhead: Post-hoc methods require additional computational resources to generate explanations

Research Motivation

Rather than relying on post-hoc analysis, this paper designs a model whose decision process is inherently interpretable.

Core Contributions

Proposes a face similarity metric based on restricted receptive fields: Decomposes global similarity into a weighted sum of patch-level similarities
Designs the RRFNet architecture: Achieves interpretable verification through patch-level comparisons via minor modifications to ResNet
Validates method effectiveness: Demonstrates competitive and even superior-to-SOTA performance on seven benchmark datasets
Provides intrinsic interpretability: Offers local explanations of decision processes without additional computation

Method Details

Task Definition

Input: Two 112×112 face images A and B
Output: Binary verification decision (same/different identity)
Constraint: Decision process must be interpretable as a combination of local region contributions

Model Architecture

Approach One: Region-Based Similarity Metric

Image Partitioning: Uniformly divide each face image into k local patches of size w×h
Independent Feature Learning: Train independent CNNs for each patch to extract N-dimensional feature vectors
Local Similarity Computation: Calculate similarity between corresponding patches using cosine similarity:
```
S_local(P^A_i, P^B_i) = (f^A_i · f^B_i) / (||f^A_i|| ||f^B_i||)
```
Global Similarity Aggregation: Obtain global similarity through weighted summation:
```
S_global(A,B) = Σ(i=1 to k) w_i · S_local(P^A_i, P^B_i)
```
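
The steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: `split_into_patches`, `cosine`, and `global_similarity` are hypothetical names, the feature vectors are hand-written stand-ins for the per-patch CNN outputs, and the non-overlapping stride and uniform weights are assumptions (this summary does not state how the paper chooses the stride or the weights w_i).

```python
import math

def split_into_patches(image, patch, stride=None):
    """Return square patch x patch crops of a 2D list, in row-major order.

    stride defaults to the patch size (non-overlapping), which is an assumption.
    """
    stride = stride or patch
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            patches.append([row[left:left + patch] for row in image[top:top + patch]])
    return patches

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def global_similarity(feats_a, feats_b, weights):
    """Weighted sum of local similarities between corresponding patches."""
    local = [cosine(fa, fb) for fa, fb in zip(feats_a, feats_b)]
    return sum(w * s for w, s in zip(weights, local)), local

# A blank 112x112 image splits into a 4x4 grid of 28x28 patches.
image = [[0] * 112 for _ in range(112)]
num_patches = len(split_into_patches(image, 28))

# Toy 2-D features for two patch positions (stand-ins for CNN outputs).
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[1.0, 0.0], [1.0, 0.0]]
score, local_scores = global_similarity(feats_a, feats_b, weights=[0.5, 0.5])
```

Because the score is a plain weighted sum, each term w_i · S_local(P^A_i, P^B_i) can be read directly as that patch's contribution to the decision.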

Approach Two: Restricted Receptive Field Network (RRFNet)

Architecture Modification: Modify ResNet by changing the stride of the first block from 2 to 1
Patch-Level Feature Extraction: Extract 512-dimensional features from 28×28 (RRFNet-28) or 56×56 (RRFNet-56) image patches
Global Representation: Define global representation as the mean of patch-level features:
```
F^A = (1/K) Σ(i=1 to K) f^A_i
```
Similarity Computation: Global similarity can be expressed as a combination of patch-level feature dot products
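
One way to read the last step: if the global score is taken to be the cosine similarity between the mean features F^A and F^B (an assumption; this summary does not give the paper's exact normalization), then linearity of the dot product splits it into K×K patch-to-patch terms. The sketch below checks that identity numerically, using hypothetical helper names and toy 2-dimensional features in place of the 512-dimensional ones.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vector(features):
    """Element-wise mean of K feature vectors, i.e. F = (1/K) * sum_i f_i."""
    k = len(features)
    return [sum(f[d] for f in features) / k for d in range(len(features[0]))]

def pairwise_contributions(feats_a, feats_b):
    """K x K table of terms (f^A_i . f^B_j) / (K^2 ||F^A|| ||F^B||).

    Assumes both images have the same number of patches K.
    """
    k = len(feats_a)
    mean_a, mean_b = mean_vector(feats_a), mean_vector(feats_b)
    scale = k * k * math.sqrt(dot(mean_a, mean_a)) * math.sqrt(dot(mean_b, mean_b))
    return [[dot(a, b) / scale for b in feats_b] for a in feats_a]

# Toy patch features for two images with K = 2 patches each.
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[1.0, 1.0], [1.0, 0.0]]

mean_a, mean_b = mean_vector(feats_a), mean_vector(feats_b)
global_cosine = dot(mean_a, mean_b) / (
    math.sqrt(dot(mean_a, mean_a)) * math.sqrt(dot(mean_b, mean_b))
)
table = pairwise_contributions(feats_a, feats_b)
total = sum(sum(row) for row in table)  # equals global_cosine up to rounding
```

Under this reading the table contains cross terms (i ≠ j) as well as same-position terms; whether the paper keeps or discards the cross terms is not stated here.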

Technical Innovations

Intrinsic Interpretability: Unlike post-hoc explanation methods, this approach's interpretability is an inherent component of the decision process
Performance Preservation: Maintains competitive accuracy while improving interpretability, using only a minor modification to ResNet
Flexible Patch Sizes: Supports different sizes of restricted receptive fields, balancing performance and interpretability
Unified Framework: Provides a mathematical framework for decomposing global similarity into local contributions

Experimental Setup

Datasets

Training Data: WebFace4M and CASIA-WebFace
Test Data: Seven benchmark datasets
- LFW: Standard face verification benchmark
- CFP-FP, CPLFW: Pose variation assessment
- AGEDB, CALFW: Age variation assessment
- Eclipse (ECL): Illumination variation assessment
- Hadrian (HAD): Facial hair variation assessment

Evaluation Metrics

Verification accuracy (10-fold cross-validation)
Average accuracy across datasets
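
For readers unfamiliar with the 10-fold protocol, the sketch below shows one common way such an accuracy is computed: a decision threshold is chosen on nine folds and applied to the held-out fold, and the ten fold accuracies are averaged. The function names are hypothetical, and the interleaved fold assignment is an assumption made for brevity; benchmark datasets typically ship their own predefined folds, and this summary does not describe the paper's evaluation code.

```python
def accuracy(scores, labels, threshold):
    """Fraction of pairs where (score >= threshold) matches the same-identity label."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(scores)

def best_threshold(scores, labels):
    """Candidate threshold with the highest accuracy (lowest one wins ties)."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        acc = accuracy(scores, labels, t)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def ten_fold_accuracy(scores, labels, folds=10):
    """Average held-out accuracy; fold f holds out indices f, f+folds, ..."""
    n = len(scores)
    fold_accuracies = []
    for f in range(folds):
        held_out = set(range(f, n, folds))
        train_s = [s for i, s in enumerate(scores) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        threshold = best_threshold(train_s, train_y)
        test_s = [scores[i] for i in sorted(held_out)]
        test_y = [labels[i] for i in sorted(held_out)]
        fold_accuracies.append(accuracy(test_s, test_y, threshold))
    return sum(fold_accuracies) / folds

# Toy data: genuine pairs score 0.9, impostor pairs score 0.1.
labels = [1 if i % 2 == 0 else 0 for i in range(20)]
scores = [0.9 if y else 0.1 for y in labels]
mean_accuracy = ten_fold_accuracy(scores, labels)
```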

Comparison Methods

ArcFace (ResNet50/100)
AdaFace (ResNet50/100)
UniFace (ResNet50)
KP-RPE (ViT)

Implementation Details

Training Epochs: 20-30
Data Augmentation: Horizontal flipping, ±5 pixel vertical and horizontal shifts
Mask Augmentation: Block masking ratios of 20% and 40%
Architecture: ResNet50/100 backbone
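
The augmentations listed above can be illustrated on an image stored as a 2D list. This is a sketch under stated assumptions rather than the paper's pipeline: the function names are hypothetical, shifted-in pixels and masked regions are filled with zeros, and the masking ratio is interpreted as the fraction of 28×28 grid cells that are blanked (the summary does not say whether the ratio applies to patches or to pixels).

```python
import random

def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def shift(image, dx, dy, fill=0):
    """Move content right by dx and down by dy pixels, padding with fill."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < h and 0 <= src_x < w:
                out[y][x] = image[src_y][src_x]
    return out

def mask_patches(image, patch, ratio, rng, fill=0):
    """Blank out round(ratio * number_of_cells) randomly chosen grid cells."""
    h, w = len(image), len(image[0])
    cells = [(top, left) for top in range(0, h, patch) for left in range(0, w, patch)]
    out = [row[:] for row in image]
    for top, left in rng.sample(cells, round(ratio * len(cells))):
        for y in range(top, min(top + patch, h)):
            for x in range(left, min(left + patch, w)):
                out[y][x] = fill
    return out

rng = random.Random(0)
image = [[1] * 112 for _ in range(112)]

# Random shift of up to 5 pixels in each direction, then a 20% patch mask.
shifted = shift(image, rng.randint(-5, 5), rng.randint(-5, 5))
masked = mask_patches(image, patch=28, ratio=0.2, rng=rng)
masked_pixels = sum(row.count(0) for row in masked)
```

With a 4×4 grid of 28×28 cells, a 20% ratio rounds to 3 masked cells and a 40% ratio to 6.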

Experimental Results

Main Results

RRFNet-56 Performance:

Achieves 95.69% average accuracy across seven datasets with WebFace4M+ResNet100 setup
Surpasses SOTA methods including ArcFace (95.09%) and AdaFace (95.28%)
Achieves best performance on most datasets

RRFNet-28 Performance:

Achieves 95.20% average accuracy, competing favorably with SOTA methods
Demonstrates that even 28×28 small patches maintain good performance

Ablation Studies

Individual Patch Performance Analysis:

Central region patches (position 28,28) perform best with 94.41% single-patch accuracy
Lower facial regions typically outperform upper regions
On the Hadrian dataset, upper regions perform better because facial hair varies between images of the same person

Patch Combination Strategies:

Using only 28×28 patches: 93.12% average
Using only 56×56 patches: 95.18% average
Combining both patch sizes: 95.51% average

Mask Augmentation Effects:

20% masking: Achieves best performance in most settings
40% masking: Slight performance decrease but remains competitive
No masking: Baseline performance

Case Analysis

The paper presents visualization results for RRFNet-28:

Similarity scores for each patch pair are intuitively displayed
Heatmaps show spatial distribution of patch similarities
Positive pairs show high similarity concentrated in key facial features
Negative pairs show lower and dispersed similarity distribution
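
A patch-similarity heatmap of this kind is simply the list of per-patch scores laid out on the patch grid. The sketch below prints such a grid as text; the helper names and the score values are made up for illustration and are not taken from the paper's figures.

```python
def to_grid(scores, cols):
    """Arrange a row-major list of per-patch scores into rows of `cols` entries."""
    return [scores[i:i + cols] for i in range(0, len(scores), cols)]

def render(grid):
    """Format the grid as signed two-decimal numbers, one text line per patch row."""
    return "\n".join(" ".join(f"{s:+.2f}" for s in row) for row in grid)

# Hypothetical scores for a 4x4 grid of 28x28 patches (illustrative values only).
patch_scores = [
    0.10, 0.20, 0.15, 0.05,
    0.30, 0.60, 0.55, 0.25,
    0.20, 0.50, 0.45, 0.15,
    0.05, 0.10, 0.10, 0.00,
]
grid = to_grid(patch_scores, cols=4)
print(render(grid))
```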

Experimental Findings

Local vs. Global: Restricted receptive fields do not necessarily harm performance; in some cases they are beneficial
Patch Size Impact: 56×56 patches achieve the best balance between performance and interpretability
Position Importance: Central facial regions are most critical for verification decisions
Cross-Pose Challenge: 28×28 patches show more significant performance degradation on cross-pose datasets

Related Work

Explainable AI Method Classification

Post-hoc Explanation Methods: LIME, SHAP, Grad-CAM, etc., generate pixel-level importance
Intrinsic Interpretability Methods: Design model architectures that are inherently interpretable

Face Recognition Interpretability

Existing work primarily adopts post-hoc explanation methods
Lacks reliable metrics for quantitatively evaluating explanation quality
This paper provides an intrinsically interpretable alternative

ProtoPNet: Prototype-based interpretable classification, limited to closed-set recognition
BagNet: Restricts CNN receptive fields for local explanations, but sacrifices accuracy

Conclusions and Discussion

Main Conclusions

The proposed restricted receptive field method achieves intrinsically interpretable face verification
RRFNet-56 surpasses SOTA methods while maintaining interpretability
Even 28×28 small patches achieve competitive performance
The method provides decision explanations without a separate post-hoc explanation step

Limitations

Computational Overhead: Training time increases 3-7 times compared to baseline methods
Patch Selection: Current uniform patch distribution may not be optimal
Cross-Pose Performance: Small patches show performance degradation with significant pose variation
Architecture Constraints: Primarily validated on ResNet; applicability to other architectures remains unexplored

Future Directions

Adaptive Patch Selection: Automatically select patch sizes and positions based on image content
Architecture Optimization: Explore applicability to other CNN or ViT architectures
Dynamic Patch Strategy: Adjust patch selection based on compared image pairs
Theoretical Analysis: Deepen theoretical understanding of the relationship between restricted receptive fields and performance

In-Depth Evaluation

Strengths

Strong Innovation: Proposes a new paradigm for intrinsically interpretable face verification
Excellent Performance: Achieves or surpasses SOTA while ensuring interpretability
Comprehensive Experiments: Thorough evaluation on multiple benchmark datasets
Simple Method: Achieves complex objectives through simple architectural modifications
Practical Value: Provides trustworthy solutions for high-risk applications

Weaknesses

Computational Efficiency: Significantly increased training time may limit practical deployment
Theoretical Analysis: Lacks in-depth theoretical explanation for why restricted receptive fields improve performance
Generalization: Primarily validated on face verification; applicability to other vision tasks is unknown
Patch Strategy: Fixed patch division may not suit all scenarios

Impact

Academic Contribution: Provides new research directions for explainable AI
Practical Value: Has important application prospects in security, medical, and other high-risk domains
Reproducibility: Clear method description facilitates reproduction and extension
Inspirational Value: May inspire research on more intrinsically interpretable models

Applicable Scenarios

High-Risk Applications: Security systems requiring decision process explanations
Regulatory Environments: Commercial applications requiring explainability compliance
Research Tools: For analyzing face recognition model behavior
Educational Settings: Helping understand deep learning model mechanisms

References

The paper cites 68 related references, primarily covering:

Explainable AI methods (Rudin 2019, Chen et al. 2019)
Face recognition techniques (Deng et al. 2019, Kim et al. 2022)
Deep learning architectures (He et al. 2016)
Evaluation benchmark datasets (Huang et al. 2007, Wu et al. 2024)

Summary: This paper proposes a face verification method based on restricted receptive fields that achieves intrinsic interpretability while maintaining high performance. The work offers new insights for the explainable AI field and is particularly suited to high-risk application scenarios that require decision transparency. Despite limitations such as increased training time and limited theoretical analysis, its novelty and practical value make it an important contribution to the field.