
Restricted Receptive Fields for Face Verification

Ozturk, Bhatta, Wu et al.
Understanding how deep neural networks make decisions is crucial for analyzing their behavior and diagnosing failure cases. In computer vision, a common approach to improve interpretability is to assign importance to individual pixels using post-hoc methods. Although they are widely used to explain black-box models, their fidelity to the model's actual reasoning is uncertain due to the lack of reliable evaluation metrics. This limitation motivates an alternative approach, which is to design models whose decision processes are inherently interpretable. To this end, we propose a face similarity metric that breaks down global similarity into contributions from restricted receptive fields. Our method defines the similarity between two face images as the sum of patch-level similarity scores, providing a locally additive explanation without relying on post-hoc analysis. We show that the proposed approach achieves competitive verification performance even with patches as small as 28x28 within 112x112 face images, and surpasses state-of-the-art methods when using 56x56 patches.

Basic Information

  • Paper ID: 2510.10753
  • Title: Restricted Receptive Fields for Face Verification
  • Authors: Kagan Ozturk, Aman Bhatta, Haiyu Wu, Patrick Flynn, Kevin W. Bowyer (University of Notre Dame)
  • Category: cs.CV (Computer Vision)
  • Publication Date: October 12, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.10753

Abstract

This paper proposes a face verification method based on restricted receptive fields, with the goal of making the decision process of a deep face matcher interpretable. Conventional methods represent an entire face image with a single global feature vector; this work instead decomposes global similarity into local contributions from restricted receptive fields. The similarity between two face images is defined as the sum of patch-level similarity scores, which gives a locally additive explanation without any post-hoc analysis. Experiments show that the method achieves competitive verification performance even with patches as small as 28×28 within 112×112 face images, and surpasses state-of-the-art methods when using 56×56 patches.

Research Background and Motivation

Core Problem

Deep neural networks achieve excellent performance in face recognition tasks, but their decision-making processes lack interpretability, which is a serious concern in high-risk application scenarios.

Problem Significance

  1. Security Requirements: Face recognition systems are widely deployed in security and medical domains, requiring trustworthy decision processes
  2. Failure Diagnosis: Understanding model decision mechanisms is crucial for analyzing model behavior and diagnosing failure cases
  3. Regulatory Compliance: Many application scenarios require AI systems to be interpretable

Limitations of Existing Methods

  1. Post-hoc Explanation Methods: Existing explainable AI methods primarily rely on post-hoc analysis to generate heatmaps, and there are no reliable metrics for evaluating how faithful those heatmaps are to the model's actual reasoning
  2. Explanation Credibility: Identical heatmaps may be generated for both correct and incorrect predictions, undermining explanation reliability
  3. Computational Overhead: Post-hoc methods require additional computational resources to generate explanations

Research Motivation

Rather than relying on post-hoc analysis, this paper designs a model whose decision process is interpretable by construction.

Core Contributions

  1. Proposes a face similarity metric based on restricted receptive fields: Decomposes global similarity into a weighted sum of patch-level similarities
  2. Designs the RRFNet architecture: Achieves interpretable verification through patch-level comparisons via minor modifications to ResNet
  3. Validates method effectiveness: Demonstrates performance that is competitive with, and in some settings better than, SOTA methods on seven benchmark datasets
  4. Provides intrinsic interpretability: Offers local explanations of the decision process without a separate post-hoc explanation step

Method Details

Task Definition

  • Input: Two 112×112 face images A and B
  • Output: Binary verification decision (same/different identity)
  • Constraint: Decision process must be interpretable as a combination of local region contributions

Model Architecture

Approach One: Region-Based Similarity Metric

  1. Image Partitioning: Uniformly divide each face image into k local patches of size w×h
  2. Independent Feature Learning: Train independent CNNs for each patch to extract N-dimensional feature vectors
  3. Local Similarity Computation: Calculate similarity between corresponding patches using cosine similarity:
    S_local(P^A_i, P^B_i) = (f^A_i · f^B_i) / (||f^A_i|| ||f^B_i||)
    
  4. Global Similarity Aggregation: Obtain global similarity through weighted summation:
    S_global(A,B) = Σ(i=1 to k) w_i · S_local(P^A_i, P^B_i)
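
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes non-overlapping patches, uniform weights w_i = 1/k, and replaces the per-patch CNNs with fixed random projections (`make_toy_extractors` is a hypothetical stand-in). The images are random arrays, so the scores carry no meaning beyond showing the additive structure.

```python
import numpy as np

def split_into_patches(image, patch):
    """Split an H x W image into non-overlapping patch x patch tiles (row-major)."""
    h, w = image.shape[:2]
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return [image[r:r + patch, c:c + patch]
            for r in range(0, h, patch)
            for c in range(0, w, patch)]

def make_toy_extractors(k, patch, dim=16, seed=0):
    """Hypothetical stand-ins for the k per-patch CNNs: one fixed random projection each."""
    rng = np.random.default_rng(seed)
    mats = [rng.standard_normal((dim, patch * patch)) for _ in range(k)]
    return [lambda p, m=m: m @ p.reshape(-1) for m in mats]

def cosine(u, v, eps=1e-12):
    """S_local: cosine similarity between two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def global_similarity(feats_a, feats_b, weights=None):
    """Return S_global and the per-patch terms w_i * S_local_i that sum to it."""
    k = len(feats_a)
    if weights is None:
        weights = np.full(k, 1.0 / k)  # assumption: uniform weights
    local = np.array([cosine(fa, fb) for fa, fb in zip(feats_a, feats_b)])
    contributions = weights * local
    return float(contributions.sum()), contributions

rng = np.random.default_rng(1)
img_a = rng.random((112, 112))  # placeholder grayscale "faces"
img_b = rng.random((112, 112))

patches_a = split_into_patches(img_a, 28)  # 16 patches of 28x28
patches_b = split_into_patches(img_b, 28)
extractors = make_toy_extractors(len(patches_a), 28)
feats_a = [f(p) for f, p in zip(extractors, patches_a)]
feats_b = [f(p) for f, p in zip(extractors, patches_b)]

score, contributions = global_similarity(feats_a, feats_b)
heatmap = contributions.reshape(4, 4)  # one cell per patch position
print(f"S_global = {score:.4f}")
print(heatmap.round(3))
```

Because the global score is exactly the sum of the entries of `heatmap`, the grid itself serves as the explanation: each cell is the amount that patch position added to or subtracted from the final score.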
    

Approach Two: Restricted Receptive Field Network (RRFNet)

  1. Architecture Modification: Modify ResNet by changing the stride of the first block from 2 to 1
  2. Patch-Level Feature Extraction: Extract 512-dimensional features from 28×28 (RRFNet-28) or 56×56 (RRFNet-56) image patches
  3. Global Representation: Define the global representation as the mean of the patch-level features:
    F^A = (1/K) Σ(i=1 to K) f^A_i
    
  4. Similarity Computation: Because the global representation is a mean, the global similarity can be written as a combination of dot products between patch-level features
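
The decomposition in step 4 follows from linearity of the mean: F^A · F^B = (1/K²) Σ_i Σ_j f^A_i · f^B_j. The check below verifies this identity numerically. It is a sketch under assumptions: K = 16 and the random features are placeholders, and this summary does not state whether the paper reports all pairwise terms or only same-position pairs, so the full K×K table is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 16, 512  # K patches (placeholder value), 512-dimensional features
feats_a = rng.standard_normal((K, D))  # rows are f^A_i
feats_b = rng.standard_normal((K, D))  # rows are f^B_j

# Global representations: mean of the patch-level features.
global_a = feats_a.mean(axis=0)
global_b = feats_b.mean(axis=0)

# Entry (i, j) is (f^A_i . f^B_j) / K^2.
pairwise = feats_a @ feats_b.T / (K * K)

dot_direct = float(global_a @ global_b)
dot_from_patches = float(pairwise.sum())

# Dividing every term by the same norm product turns the table into
# additive contributions to the cosine similarity of the global vectors.
norm = np.linalg.norm(global_a) * np.linalg.norm(global_b)
cosine_global = dot_direct / norm
contributions = pairwise / norm

print(np.isclose(dot_direct, dot_from_patches))
print(np.isclose(contributions.sum(), cosine_global))
```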

Technical Innovations

  1. Intrinsic Interpretability: Unlike post-hoc explanation methods, this approach's interpretability is an inherent component of the decision process
  2. Performance Preservation: Maintains competitive performance while improving interpretability, using only minor modifications to a standard ResNet
  3. Flexible Patch Sizes: Supports different sizes of restricted receptive fields, balancing performance and interpretability
  4. Unified Framework: Provides a mathematical framework for decomposing global similarity into local contributions

Experimental Setup

Datasets

  • Training Data: WebFace4M and CASIA-WebFace
  • Test Data: Seven benchmark datasets
    • LFW: Standard face verification benchmark
    • CFP-FP, CPLFW: Pose variation assessment
    • AGEDB, CALFW: Age variation assessment
    • Eclipse (ECL): Illumination variation assessment
    • Hadrian (HAD): Facial hair variation assessment

Evaluation Metrics

  • Verification accuracy (10-fold cross-validation)
  • Average accuracy across datasets
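
The summary only says "10-fold cross-validation". The sketch below assumes the common LFW-style protocol, in which a decision threshold is chosen on nine folds and accuracy is measured on the held-out fold; the function names and the synthetic scores are illustrative, not taken from the paper.

```python
import numpy as np

def accuracy_at(threshold, scores, labels):
    """Fraction of pairs classified correctly when score >= threshold means 'same'."""
    return float(np.mean((scores >= threshold) == labels))

def ten_fold_accuracy(scores, labels, n_folds=10):
    """Mean held-out accuracy, with the threshold picked on the remaining folds."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=bool)
    folds = np.array_split(np.arange(len(scores)), n_folds)
    accs = []
    for test_idx in folds:
        train_mask = np.ones(len(scores), dtype=bool)
        train_mask[test_idx] = False
        # Candidate thresholds: every distinct training score.
        candidates = np.unique(scores[train_mask])
        best = max(candidates,
                   key=lambda t: accuracy_at(t, scores[train_mask], labels[train_mask]))
        accs.append(accuracy_at(best, scores[test_idx], labels[test_idx]))
    return float(np.mean(accs))

# Synthetic similarity scores: genuine pairs score higher than impostor pairs.
rng = np.random.default_rng(0)
n_pairs = 600
labels = rng.random(n_pairs) < 0.5
scores = np.where(labels,
                  rng.normal(0.6, 0.1, n_pairs),
                  rng.normal(0.1, 0.1, n_pairs))

acc = ten_fold_accuracy(scores, labels)
print(f"10-fold verification accuracy: {acc:.4f}")
```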

Comparison Methods

  • ArcFace (ResNet50/100)
  • AdaFace (ResNet50/100)
  • UniFace (ResNet50)
  • KP-RPE (ViT)

Implementation Details

  • Training Epochs: 20-30
  • Data Augmentation: Horizontal flipping, ±5 pixel vertical and horizontal shifts
  • Mask Augmentation: Block masking ratios of 20% and 40%
  • Architecture: ResNet50/100 backbone
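
A rough sketch of the augmentations listed above. The summary does not say how the masking ratio is applied, so this assumes the ratio is the fraction of non-overlapping 28×28 blocks set to zero, that shifts pad with zeros, and that flips occur with probability 0.5; all function names are hypothetical.

```python
import numpy as np

def shift(image, dy, dx):
    """Translate by (dy, dx) pixels, filling uncovered areas with zeros."""
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    src_r = slice(max(0, -dy), min(h, h - dy))
    dst_r = slice(max(0, dy), min(h, h + dy))
    src_c = slice(max(0, -dx), min(w, w - dx))
    dst_c = slice(max(0, dx), min(w, w + dx))
    out[dst_r, dst_c] = image[src_r, src_c]
    return out

def mask_blocks(image, rng, ratio, block=28):
    """Zero out round(ratio * number_of_blocks) randomly chosen blocks."""
    h, w = image.shape[:2]
    rows, cols = h // block, w // block
    n_mask = int(round(ratio * rows * cols))
    out = image.copy()
    for idx in rng.choice(rows * cols, size=n_mask, replace=False):
        r, c = divmod(int(idx), cols)
        out[r * block:(r + 1) * block, c * block:(c + 1) * block] = 0
    return out

def augment(image, rng, max_shift=5, mask_ratio=0.2, block=28):
    """Horizontal flip (p = 0.5), shift of up to +/- max_shift pixels, block masking."""
    out = image[:, ::-1] if rng.random() < 0.5 else image
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    out = shift(out, dy, dx)
    return mask_blocks(out, rng, mask_ratio, block)

rng = np.random.default_rng(0)
image = np.ones((112, 112))
augmented = augment(image, rng, mask_ratio=0.2)
print(augmented.shape, int((augmented == 0).sum()), "zeroed pixels")
```

Under these assumptions a 112×112 image has 16 blocks of 28×28, so a 20% ratio masks 3 blocks and a 40% ratio masks 6.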

Experimental Results

Main Results

RRFNet-56 Performance:

  • Achieves 95.69% average accuracy across seven datasets with WebFace4M+ResNet100 setup
  • Surpasses SOTA methods including ArcFace (95.09%) and AdaFace (95.28%)
  • Achieves best performance on most datasets

RRFNet-28 Performance:

  • Achieves 95.20% average accuracy, competing favorably with SOTA methods
  • Demonstrates that even 28×28 small patches maintain good performance

Ablation Studies

Individual Patch Performance Analysis:

  • Central region patches (position 28,28) perform best with 94.41% single-patch accuracy
  • Lower facial regions typically outperform upper regions
  • On Hadrian dataset, upper regions perform better due to beard variation effects

Patch Combination Strategies:

  • Using only 28×28 patches: 93.12% average
  • Using only 56×56 patches: 95.18% average
  • Combining both patch sizes: 95.51% average

Mask Augmentation Effects:

  • 20% masking: Achieves best performance in most settings
  • 40% masking: Slight performance decrease but remains competitive
  • No masking: Baseline performance

Case Analysis

The paper presents visualization results for RRFNet-28:

  • Similarity scores for each patch pair are intuitively displayed
  • Heatmaps show spatial distribution of patch similarities
  • Positive pairs show high similarity concentrated in key facial features
  • Negative pairs show lower and dispersed similarity distribution

Experimental Findings

  1. Local vs. Global: Restricted receptive fields do not necessarily harm performance; in some cases they are beneficial
  2. Patch Size Impact: 56×56 patches achieve the best balance between performance and interpretability
  3. Position Importance: Central facial regions are most critical for verification decisions
  4. Cross-Pose Challenge: 28×28 patches show more significant performance degradation on cross-pose datasets

Related Work

Explainable AI Method Classification

  1. Post-hoc Explanation Methods: LIME, SHAP, Grad-CAM, etc., generate pixel-level importance
  2. Intrinsic Interpretability Methods: Design model architectures that are inherently interpretable

Face Recognition Interpretability

  • Existing work primarily adopts post-hoc explanation methods
  • Lacks reliable metrics for quantitatively evaluating explanation quality
  • This paper provides an intrinsically interpretable alternative
  • ProtoPNet: Prototype-based interpretable classification, limited to closed-set recognition
  • BagNet: Restricts CNN receptive fields for local explanations, but sacrifices accuracy

Conclusions and Discussion

Main Conclusions

  1. The proposed restricted receptive field method achieves intrinsically interpretable face verification
  2. RRFNet-56 surpasses SOTA methods while maintaining interpretability
  3. Even 28×28 small patches achieve competitive performance
  4. The method provides decision explanations as a by-product of computing the similarity score, with no separate post-hoc explanation step

Limitations

  1. Training Cost: Training takes 3-7 times longer than for baseline methods
  2. Patch Selection: Current uniform patch distribution may not be optimal
  3. Cross-Pose Performance: Small patches show performance degradation with significant pose variation
  4. Architecture Constraints: Primarily validated on ResNet; applicability to other architectures remains unexplored

Future Directions

  1. Adaptive Patch Selection: Automatically select patch sizes and positions based on image content
  2. Architecture Optimization: Explore applicability to other CNN or ViT architectures
  3. Dynamic Patch Strategy: Adjust patch selection based on compared image pairs
  4. Theoretical Analysis: Deepen theoretical understanding of the relationship between restricted receptive fields and performance

In-Depth Evaluation

Strengths

  1. Strong Innovation: Proposes a new paradigm for intrinsically interpretable face verification
  2. Excellent Performance: Achieves or surpasses SOTA while ensuring interpretability
  3. Comprehensive Experiments: Thorough evaluation on multiple benchmark datasets
  4. Simple Method: Achieves complex objectives through simple architectural modifications
  5. Practical Value: Provides trustworthy solutions for high-risk applications

Weaknesses

  1. Computational Efficiency: Significantly increased training time may limit practical deployment
  2. Theoretical Analysis: Lacks in-depth theoretical explanation for why restricted receptive fields improve performance
  3. Generalization: Primarily validated on face verification; applicability to other vision tasks is unknown
  4. Patch Strategy: Fixed patch division may not suit all scenarios

Impact

  1. Academic Contribution: Provides new research directions for explainable AI
  2. Practical Value: Has important application prospects in security, medical, and other high-risk domains
  3. Reproducibility: Clear method description facilitates reproduction and extension
  4. Inspirational Value: May inspire research on more intrinsically interpretable models

Applicable Scenarios

  1. High-Risk Applications: Security systems requiring decision process explanations
  2. Regulatory Environments: Commercial applications requiring explainability compliance
  3. Research Tools: For analyzing face recognition model behavior
  4. Educational Settings: Helping understand deep learning model mechanisms

References

The paper cites 68 related references, primarily covering:

  • Explainable AI methods (Rudin 2019, Chen et al. 2019)
  • Face recognition techniques (Deng et al. 2019, Kim et al. 2022)
  • Deep learning architectures (He et al. 2016)
  • Evaluation benchmark datasets (Huang et al. 2007, Wu et al. 2024)

Summary: This paper proposes a face verification method based on restricted receptive fields that achieves intrinsic interpretability while maintaining high performance. The work offers new insights for the explainable AI field and is particularly suited to high-risk application scenarios that require decision transparency. Despite limitations such as increased training cost and limited theoretical analysis, its novelty and practical value make it an important contribution to the field.