Comparative Explanations via Counterfactual Reasoning in Recommendations

Yu, Hu
Explainable recommendation through counterfactual reasoning seeks to identify the influential aspects of items in recommendations, which can then be used as explanations. However, state-of-the-art approaches, which aim to minimize changes in product aspects while reversing their recommended decisions according to an aggregated decision boundary score, often lead to factual inaccuracies in explanations. To solve this problem, in this work we propose a novel method of Comparative Counterfactual Explanations for Recommendation (CoCountER). CoCountER creates counterfactual data based on soft swap operations, enabling explanations for recommendations of arbitrary pairs of comparative items. Empirical experiments validate the effectiveness of our approach.

Basic Information

  • Paper ID: 2510.10920
  • Title: Comparative Explanations via Counterfactual Reasoning in Recommendations
  • Authors: Yi Yu (Huawei Technologies Co., Ltd.), Zhenxing Hu (Huawei Technologies Co., Ltd.)
  • Classification: cs.IR (Information Retrieval), cs.AI (Artificial Intelligence)
  • Submission Date/Conference: Submitted to Conference in 2025 (specific conference pending)
  • Paper Link: https://arxiv.org/abs/2510.10920

Abstract

Explainable recommendation systems seek to identify influential factors of recommended items through counterfactual reasoning, which can serve as explanations. However, existing state-of-the-art methods aim to minimize changes in product attributes while reversing recommendation decisions based on aggregated decision boundary scores, often resulting in factual inaccuracies in explanations. To address this issue, this paper proposes a novel comparative counterfactual explanation method for recommendations (CoCountER). CoCountER creates counterfactual data based on soft swap operations, enabling explanations for recommendations of arbitrary comparative item pairs. Empirical experiments validate the effectiveness of the proposed method.

Research Background and Motivation

Problem Definition

Explainable recommendation systems aim to provide users with high-quality recommendations while offering clear explanations to help users understand the logic behind recommendations, thereby increasing user trust and satisfaction with the system.

Limitations of Existing Methods

  1. Issues with Matching-based Methods: Template-based explanation methods (e.g., EFM, MTER, A2CF) optimize recommendations and attribute representations through tensor factorization techniques. As a result, they may select as explanations attributes that have high matching scores even though the item actually performs poorly on them.
  2. Defects of Existing Counterfactual Methods: Methods such as CountER reverse the recommendation decision with a minimal reduction of attribute values, but the resulting explanations can be factually inaccurate. In the paper's headphone example, CountER might identify comfort as the explanation for recommending headphone A, when in fact headphone A performs worse than headphone B in terms of comfort.
  3. Root Cause: Existing methods push the sum of all attribute reductions toward the decision boundary score, rather than pushing each attribute toward the boundary individually. An attribute can therefore be selected even when the recommended item is weaker on it than the alternative, which produces counterintuitive explanations.

Research Motivation

This paper proposes addressing the above issues through comparative counterfactual reasoning by performing attribute-level swap operations between item pairs to generate more faithful and intuitive explanations.

Core Contributions

  1. Novel counterfactual data creation: A counterfactual data generation mechanism based on soft swap operations
  2. Comparative counterfactual explanation framework: CoCountER can provide explanations for recommendations of arbitrary comparative item pairs
  3. Experimental validation: Validates the method's effectiveness on three Amazon review datasets, surpassing existing methods on the counterfactual metrics PN and PS

Method Details

Task Definition

Given a target user u, an explained item i, and a reference item j, where the recommendation scores satisfy r_{u,i} > r_{u,j}, the goal is to identify the key attributes behind the recommendation decision by finding a minimal set of attribute swaps between i and j that reverses this ranking.

Model Architecture

1. Data Preprocessing

Uses the Sentires tool to extract (user, item, attribute, sentiment) quadruples from user reviews, constructing:

  • User-attribute attention matrix X: X_{u,a} represents user u's attention to attribute a
  • Item-attribute quality matrix Y: Y_{i,a} represents item i's performance on attribute a

Calculation formulas:

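X_{u,a} = {
  0, if user u did not mention attribute a
  1 + (N-1) · (1 - exp(-t_{u,a})) / (1 + exp(-t_{u,a})), otherwise
}

Y_{i,a} = {
  0, if attribute a was not mentioned for item i
  1 + (N-1) / (1 + exp(-t_{i,a} · s_{i,a})), otherwise
}

Here N is the maximum of the rating scale, t_{u,a} is the number of times user u mentions attribute a, t_{i,a} is the number of times attribute a is mentioned for item i, and s_{i,a} is the average sentiment of those mentions. Both mappings rescale non-zero entries into the range (1, N).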
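The two mappings can be sketched in a few lines of Python. This is an illustration rather than the paper's code: the helper names `user_attention` and `item_quality` are hypothetical, N = 5 is an assumed rating scale, and t and s are assumed to be a mention count and an average sentiment in [-1, 1].

```python
import math

N = 5  # assumed rating-scale maximum; the summary does not state the value


def user_attention(t_ua: int, n: int = N) -> float:
    """X_{u,a}: 0 if the user never mentions the attribute, else a value in (1, n)."""
    if t_ua == 0:
        return 0.0
    e = math.exp(-t_ua)
    return 1.0 + (n - 1) * (1.0 - e) / (1.0 + e)


def item_quality(t_ia: int, s_ia: float, n: int = N) -> float:
    """Y_{i,a}: 0 if the attribute is never mentioned for the item, else a value in (1, n)."""
    if t_ia == 0:
        return 0.0
    return 1.0 + (n - 1) / (1.0 + math.exp(-t_ia * s_ia))


x = user_attention(3)          # user mentioned the attribute three times
y_pos = item_quality(4, 0.5)   # four mentions, positive average sentiment
y_neg = item_quality(4, -0.5)  # four mentions, negative average sentiment
print(round(x, 3), round(y_pos, 3), round(y_neg, 3))
```

More mentions push X_{u,a} toward N, while the sign of the sentiment decides whether Y_{i,a} lands in the upper or lower half of the scale.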

2. Recommendation Model

Adopts a simple fusion layer architecture:

r_{u,i} = g_θ(X_u, Y_i)

Implemented through a three-layer fully connected network with ReLU activation and Sigmoid output.
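A minimal forward pass for such a scorer is sketched below in plain Python. The fusion step is assumed to be a concatenation of X_u and Y_i, and the layer widths (8 and 4), the random initialization, and the helper names are all hypothetical; the summary does not specify any of them.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(v, weights, bias):
    """One fully connected layer: weights is a list of rows, one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def score(x_u, y_i, layers):
    """r_{u,i} = g_theta(X_u, Y_i): two ReLU layers followed by a sigmoid output."""
    h = list(x_u) + list(y_i)  # assumed fusion: concatenation
    for weights, bias in layers[:-1]:
        h = relu(dense(h, weights, bias))
    weights, bias = layers[-1]
    return sigmoid(dense(h, weights, bias)[0])


rng = random.Random(0)
n_attr = 4
layers = [init_layer(rng, 2 * n_attr, 8), init_layer(rng, 8, 4), init_layer(rng, 4, 1)]
r_ui = score([4.6, 0.0, 3.1, 2.9], [4.5, 1.5, 0.0, 3.0], layers)
print(r_ui)  # an untrained score, strictly between 0 and 1
```

The weights here are untrained, so the score is meaningless as a recommendation; the sketch only shows the shape of the computation that the explanation step later differentiates through.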

3. Comparative Counterfactual Explanation Core

Swap Function Design:

f(Y_i, Y_j, ψ) = (1-σ(ψ)) ⊙ Y_i + σ(ψ) ⊙ Y_j

where σ is the sigmoid function, ⊙ denotes element-wise multiplication, and ψ is a trainable vector of swap variables with one entry per attribute.
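The swap function is a per-attribute convex combination of the two items, which a short Python sketch makes concrete. The function name `soft_swap` and the attribute values are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_swap(y_i, y_j, psi):
    """f(Y_i, Y_j, psi) = (1 - sigmoid(psi)) * Y_i + sigmoid(psi) * Y_j, element-wise."""
    return [(1.0 - sigmoid(p)) * a + sigmoid(p) * b for a, b, p in zip(y_i, y_j, psi)]


y_i = [4.5, 2.0, 3.0]
y_j = [3.0, 4.0, 3.5]

no_swap = soft_swap(y_i, y_j, [-20.0, -20.0, -20.0])  # sigmoid ~ 0: stays at Y_i
full_swap = soft_swap(y_i, y_j, [20.0, 20.0, 20.0])   # sigmoid ~ 1: becomes Y_j
half_swap = soft_swap(y_i, y_j, [0.0, 0.0, 0.0])      # sigmoid = 0.5: midpoint
print(no_swap, full_swap, half_swap)
```

Because every operation is smooth in ψ, the amount of swapping per attribute can be tuned by gradient-based optimization.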

Optimization Objective:

min_ψ ||σ(ψ)||_1 + λL(r_{u,i*}, r_{u,j*})

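where r_{u,i*} and r_{u,j*} are the scores of the counterfactual items obtained after the swap, λ balances the sparsity term against the ranking term, and L is the margin ranking loss:

L(r_{u,i*}, r_{u,j*}) = max(0, (r_{u,i*} - r_{u,j*}) + m)

The loss is zero once r_{u,j*} exceeds r_{u,i*} by at least the margin m, that is, once the original ranking has been reversed.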
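To make the objective concrete, the sketch below evaluates it for two hand-picked swap vectors. Several pieces are assumptions: `toy_score` is a hypothetical stand-in for the trained recommender g_θ, the counterfactual reference item j* is assumed to be the mirror swap f(Y_j, Y_i, ψ), and the values λ = 20 and m = 0.1 are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_swap(y_a, y_b, psi):
    return [(1.0 - sigmoid(p)) * a + sigmoid(p) * b for a, b, p in zip(y_a, y_b, psi)]


def toy_score(x_u, y):
    """Hypothetical stand-in for g_theta: a sigmoid over a scaled dot product."""
    return sigmoid(sum(a * b for a, b in zip(x_u, y)) / 10.0 - 1.0)


def margin_ranking_loss(r_i_star, r_j_star, m):
    return max(0.0, (r_i_star - r_j_star) + m)


def objective(psi, x_u, y_i, y_j, lam, m):
    y_i_star = soft_swap(y_i, y_j, psi)
    y_j_star = soft_swap(y_j, y_i, psi)      # assumption: j* is the mirror swap
    sparsity = sum(sigmoid(p) for p in psi)  # L1 norm of sigmoid(psi), which is non-negative
    ranking = margin_ranking_loss(toy_score(x_u, y_i_star), toy_score(x_u, y_j_star), m)
    return sparsity + lam * ranking


x_u = [4.0, 1.0]
y_i = [4.5, 2.0]  # item i is better on attribute 0
y_j = [2.0, 4.0]  # item j is better on attribute 1

obj_no_swap = objective([-20.0, -20.0], x_u, y_i, y_j, lam=20.0, m=0.1)
obj_swap_first = objective([20.0, -20.0], x_u, y_i, y_j, lam=20.0, m=0.1)
print(obj_no_swap, obj_swap_first)
```

Swapping only the attribute the user cares about most reverses the ranking, so the ranking term vanishes and the objective is dominated by the cost of a single swap.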

Technical Innovations

  1. Soft Swap Operation: Implements differentiable swap operations through sigmoid functions, where swap weights σ(ψ) close to 0 indicate no swap and weights close to 1 indicate a complete swap
  2. Comparative Framework: Unlike traditional single-item explanations, provides comparative explanations between items
  3. Generality: When the reference item is fixed and only the first term of the swap function is computed, the method reduces to reduction-based counterfactual methods
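One way to read the generality claim: keeping only the first term, (1 - σ(ψ)) ⊙ Y_i, scales each attribute of item i down, which matches a soft swap against an all-zero reference. The sketch below checks that equivalence numerically; `soft_reduce` is a hypothetical name, and the zero-reference interpretation is an assumption rather than something the summary states.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_swap(y_i, y_j, psi):
    return [(1.0 - sigmoid(p)) * a + sigmoid(p) * b for a, b, p in zip(y_i, y_j, psi)]


def soft_reduce(y_i, psi):
    """First term of the swap function only: each attribute of item i is scaled down."""
    return [(1.0 - sigmoid(p)) * a for a, p in zip(y_i, psi)]


y_i = [4.5, 2.0, 3.0]
psi = [2.0, -3.0, 0.0]
zero_reference = [0.0] * len(y_i)  # assumed interpretation of "first part only"

reduced = soft_reduce(y_i, psi)
swapped_with_zero = soft_swap(y_i, zero_reference, psi)
print(reduced, swapped_with_zero)
```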

Experimental Setup

Datasets

Uses three categories from the Amazon review dataset:

  • Electronics: 963 users, 1,112 items, 19,418 reviews, 877 attributes
  • CDs & Vinyl: 2,129 users, 2,907 items, 56,045 reviews, 810 attributes
  • Movies: 5,586 users, 6,703 items, 187,490 reviews, 1,530 attributes

Data preprocessing: Filters out users and items with fewer than 10 interactions, then splits the data into training/validation/test sets in an 8:1:1 ratio.
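The filtering and splitting step might look like the sketch below. It assumes a single filtering pass and a global random split; the summary does not say whether filtering is repeated until stable or whether the split is done per user. The function names and the synthetic data are hypothetical.

```python
import random
from collections import Counter


def filter_min_interactions(interactions, min_count=10):
    """Keep (user, item) pairs whose user and item both have at least min_count interactions."""
    user_counts = Counter(u for u, _ in interactions)
    item_counts = Counter(i for _, i in interactions)
    return [
        (u, i)
        for u, i in interactions
        if user_counts[u] >= min_count and item_counts[i] >= min_count
    ]


def split_8_1_1(records, seed=0):
    """Shuffle once with a fixed seed, then cut into 80% / 10% / remainder."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_valid = int(0.1 * len(shuffled))
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


# Synthetic data: 12 users x 12 items, plus one user with a single interaction.
interactions = [(f"u{u}", f"i{i}") for u in range(12) for i in range(12)]
interactions.append(("cold_user", "i0"))

kept = filter_min_interactions(interactions)
train, valid, test = split_8_1_1(kept)
print(len(kept), len(train), len(valid), len(test))
```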

Evaluation Metrics

  • User-oriented Metrics: Precision and Recall
  • Model-oriented Metrics: Probability of Necessity (PN) and Probability of Sufficiency (PS)
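For the user-oriented metrics, a set-based computation is sketched below. It assumes the ground truth is the set of attributes the user actually mentions in the review of the item, which is a common convention but is not spelled out in this summary; the attribute names are made up.

```python
def precision_recall(predicted, ground_truth):
    """Set-based precision and recall of explanation attributes against a ground-truth set."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    hits = len(predicted & ground_truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall


explanation = {"sound", "battery", "price"}  # attributes selected by the explainer
mentioned = {"sound", "comfort"}             # attributes the user wrote about (assumed ground truth)

precision, recall = precision_recall(explanation, mentioned)
print(precision, recall)
```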

Baseline Methods

  1. Random Method: Random
  2. Ranking Methods: Sort-i (ranks by item attribute performance), Sort-u (ranks by user attention)
  3. Matching-based Methods: EFM, A2CF
  4. Counterfactual Methods: CountER, CoCountER (proposed method)

Implementation Details

  • The swap variables ψ are optimized by gradient descent with learning rate η
  • Margin m is used in the ranking loss
  • Balance factor λ trades off the sparsity term against the ranking loss
  • Swap threshold is set to 0.5: attributes whose swap weight σ(ψ) exceeds the threshold are reported as explanation attributes
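Putting these pieces together, the sketch below runs gradient descent on ψ for a three-attribute toy case and then thresholds σ(ψ) at 0.5. It is a rough illustration under several assumptions: `toy_score` is a hypothetical stand-in for the trained recommender, j* is assumed to be the mirror swap, gradients are approximated by central finite differences instead of automatic differentiation, and the values of λ, m, η, and the step count are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_swap(y_a, y_b, psi):
    return [(1.0 - sigmoid(p)) * a + sigmoid(p) * b for a, b, p in zip(y_a, y_b, psi)]


def toy_score(x_u, y):
    """Hypothetical stand-in for g_theta."""
    return sigmoid(sum(a * b for a, b in zip(x_u, y)) / 10.0 - 1.0)


def counterfactual_scores(psi, x_u, y_i, y_j):
    r_i_star = toy_score(x_u, soft_swap(y_i, y_j, psi))
    r_j_star = toy_score(x_u, soft_swap(y_j, y_i, psi))  # assumed mirror swap
    return r_i_star, r_j_star


def objective(psi, x_u, y_i, y_j, lam, m):
    r_i_star, r_j_star = counterfactual_scores(psi, x_u, y_i, y_j)
    return sum(sigmoid(p) for p in psi) + lam * max(0.0, r_i_star - r_j_star + m)


def optimize_psi(x_u, y_i, y_j, lam=20.0, m=0.1, eta=0.1, steps=300, eps=1e-5):
    psi = [0.0] * len(y_i)
    for _ in range(steps):
        grad = []
        for k in range(len(psi)):
            up = psi[:k] + [psi[k] + eps] + psi[k + 1:]
            down = psi[:k] + [psi[k] - eps] + psi[k + 1:]
            diff = objective(up, x_u, y_i, y_j, lam, m) - objective(down, x_u, y_i, y_j, lam, m)
            grad.append(diff / (2.0 * eps))
        psi = [p - eta * g for p, g in zip(psi, grad)]  # plain gradient descent step
    return psi


x_u = [4.0, 1.0, 0.5]  # the user cares most about attribute 0
y_i = [4.5, 2.0, 3.0]  # explained item: stronger on attribute 0
y_j = [2.0, 4.0, 3.0]  # reference item: stronger on attribute 1, tied on attribute 2

psi = optimize_psi(x_u, y_i, y_j)
explanation = [k for k, p in enumerate(psi) if sigmoid(p) > 0.5]
r_i_star, r_j_star = counterfactual_scores(psi, x_u, y_i, y_j)
print(explanation, r_i_star, r_j_star)
```

In this toy setup the swap weight concentrates on the attribute where item i beats item j and the user's attention is highest; the attribute where item i is weaker is pushed toward no swap, which is the behavior the comparative design is meant to encourage.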

Experimental Results

Main Results

On all three datasets, CoCountER consistently surpasses all baseline methods on counterfactual-related metrics PN and PS:

Electronics Dataset:

  • PN: 0.734 (vs. CountER's 0.511)
  • PS: 0.931 (vs. CountER's 0.894)

CDs & Vinyl Dataset:

  • PN: 0.773 (vs. CountER's 0.526)
  • PS: 0.936 (vs. CountER's 0.921)

Movies Dataset:

  • PN: 0.744 (vs. CountER's 0.496)
  • PS: 0.928 (vs. CountER's 0.889)
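The size of these gaps is easier to compare as absolute and relative gains, which the short script below derives from the figures listed above. Only the reported numbers are used; the dictionary layout and the `gains` helper are illustrative.

```python
# (CoCountER, CountER) pairs as reported above
results = {
    "Electronics": {"PN": (0.734, 0.511), "PS": (0.931, 0.894)},
    "CDs & Vinyl": {"PN": (0.773, 0.526), "PS": (0.936, 0.921)},
    "Movies": {"PN": (0.744, 0.496), "PS": (0.928, 0.889)},
}


def gains(ours, baseline):
    """Return (absolute gain, relative gain) of ours over baseline."""
    return ours - baseline, (ours - baseline) / baseline


summary = {
    dataset: {metric: gains(*pair) for metric, pair in metrics.items()}
    for dataset, metrics in results.items()
}

for dataset, metrics in summary.items():
    for metric, (abs_gain, rel_gain) in metrics.items():
        print(f"{dataset:12s} {metric}: +{abs_gain:.3f} absolute, +{rel_gain:.1%} relative")
```

The improvement is concentrated in PN, where the relative gain exceeds 40% on every dataset, while the PS gains stay below 5%.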

Hyperparameter Analysis

  1. Impact of Reference Item Ranking: Lower-ranked reference items impose fewer optimization constraints, enabling discovery of more effective counterfactual attributes, improving PN and PS performance
  2. Impact of Reference Item Quantity: Appropriately increasing the number of reference items improves performance, but excessive quantities introduce noise leading to slight performance degradation

Experimental Findings

  • CoCountER provides more faithful and context-aware explanations than CountER through attribute-level swap operations
  • The comparative counterfactual design captures the true causal attributes behind recommendations
  • The method maintains stability across diverse settings, demonstrating good robustness

Related Work

Explainable Recommendation Systems

  1. Attribute-based Methods: EFM, MTER, A2CF and others use tensor factorization techniques to construct template-based explanations
  2. Counterfactual Reasoning Methods: CountER first introduced counterfactual reasoning into explainable recommendations
  3. Text Generation Methods: Combine pre-trained language models such as BERT to generate textual explanations

Comparative Explanations

Yang et al. proposed the concept of comparative explanations but adopted autoregressive decoders to generate textual explanations, differing from this paper's counterfactual reasoning perspective.

Causal Reasoning Applications in Recommendations

In recent years, causal reasoning has been widely applied to data augmentation and fairness improvement in recommendation systems.

Conclusions and Discussion

Main Conclusions

  1. Proposes the CoCountER framework that generates more faithful recommendation explanations through comparative counterfactual reasoning
  2. Soft swap operations effectively identify key attributes influencing recommendation decisions
  3. Experiments demonstrate that the method significantly outperforms existing methods on counterfactual metrics

Limitations

  1. Simplified Recommendation Model: To focus on explainability, a relatively simple recommendation model architecture was adopted
  2. Computational Complexity: Requires optimization for each reference item, increasing computational costs
  3. Attribute Dependency: The method depends on attributes extracted from reviews and is sensitive to attribute quality

Future Directions

The paper proposes combining counterfactual reasoning with generative models to produce natural language explanations of counterfactual scenarios.

In-Depth Evaluation

Strengths

  1. Strong Innovation: First to propose a comparative counterfactual explanation framework, addressing the factual inaccuracy problem of existing methods
  2. Solid Theoretical Foundation: Clearly illustrates problems with existing methods through concrete examples and provides theoretical analysis
  3. Reasonable Method Design: The soft swap operation design ensures both differentiability and provides intuitive explanations
  4. Comprehensive Experiments: Validates on multiple datasets with hyperparameter sensitivity analysis

Weaknesses

  1. Limited Evaluation Metrics: Primarily focuses on counterfactual metrics, lacking user studies to verify practical usability of explanations
  2. Limited Baseline Methods: While including major comparative methods, lacks more recent counterfactual explanation approaches
  3. Scalability Issues: As item quantity increases, the number of item pairs to consider grows quadratically
  4. Insufficient Practical Deployment Considerations: Lacks discussion of efficiency and scalability when deploying in actual recommendation systems

Impact

  1. Academic Contribution: Provides a new research direction for the explainable recommendation field
  2. Practical Value: Generates more intuitive explanations, helping improve user experience
  3. Reproducibility: Provides detailed algorithm descriptions and implementation details

Applicable Scenarios

  1. E-commerce Recommendations: Particularly suitable for scenarios requiring explanation of why a specific product is recommended over others
  2. Content Recommendations: Applicable to movie, music, and other content recommendation systems
  3. High-risk Decisions: Suitable for recommendation scenarios requiring high explainability

References

The paper cites 30 relevant references covering multiple related fields including explainable recommendations, counterfactual reasoning, and causal inference, providing a solid theoretical foundation for the research.


Overall Assessment: This is a high-quality research paper that proposes an innovative comparative counterfactual explanation framework, addressing important problems with existing methods. The method design is reasonable, experimental validation is comprehensive, and it makes significant contributions to the explainable recommendation field. While some limitations exist, overall it represents valuable research work.