When Are Learning Biases Equivalent? A Unifying Framework for Fairness, Robustness, and Distribution Shift
Mehta
Machine learning systems exhibit diverse failure modes (unfairness toward protected groups, brittleness to spurious correlations, and poor performance on minority sub-populations) that are typically studied in isolation by distinct research communities. We propose a unifying theoretical framework that characterizes when different bias mechanisms produce quantitatively equivalent effects on model performance. By formalizing biases as violations of conditional independence through information-theoretic measures, we prove formal equivalence conditions relating spurious correlations, subpopulation shift, class imbalance, and fairness violations. Our theory predicts that a spurious correlation of strength $\alpha$ produces the same worst-group accuracy degradation as a sub-population imbalance ratio $r \approx (1+\alpha)/(1-\alpha)$ under feature overlap assumptions. Empirical validation on six datasets and three architectures confirms that the predicted equivalences hold to within 3\% worst-group accuracy, enabling the principled transfer of debiasing methods across problem domains. This work bridges the literature on fairness, robustness, and distribution shift under a common perspective.
Machine learning systems exhibit multiple failure modes: unfairness toward protected groups, vulnerability to spurious correlations, and poor performance on minority subgroups. These issues are typically studied independently by different research communities. This paper proposes a unified theoretical framework that characterizes when different bias mechanisms produce quantitatively equivalent effects on model performance. By formalizing biases as violations of conditional independence (using information-theoretic measures), the authors prove formal equivalence conditions between spurious correlations, subgroup shifts, class imbalance, and fairness violations. The theory predicts that spurious correlations of strength α produce worst-group accuracy drops equivalent to subgroup imbalance ratios r ≈ (1+α)/(1-α). Empirical validation on six datasets and three architectures confirms that predicted equivalences hold within 3% error on worst-group accuracy, enabling principled transfer of debiasing methods across problem domains.
Deep learning systems frequently exhibit systematic failures with degraded performance on specific subgroups despite high average accuracy. Specifically:
Algorithmic Unfairness: Medical diagnostic models are accurate for majority populations but fail catastrophically for minority groups
Shortcut Learning: Image classifiers exploit spurious background correlations rather than learning robust features
Subgroup Shift: Recommendation systems amplify existing societal biases
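The gap between average and subgroup performance described above can be made concrete with a small sketch. The function names and the toy labels are illustrative assumptions, not taken from the paper; the sketch only shows how a model with 90% average accuracy can still score 0% on a minority group.

```python
from collections import defaultdict

def group_accuracies(y_true, y_pred, groups):
    """Return accuracy per group as a dict {group: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for yt, yp, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(yt == yp)
    return {g: correct[g] / total[g] for g in total}

def worst_group_accuracy(y_true, y_pred, groups):
    """Minimum accuracy over all groups (the worst-group metric)."""
    return min(group_accuracies(y_true, y_pred, groups).values())

# Toy data (hypothetical): 9 majority samples, 1 minority sample.
y_true = [1] * 10
y_pred = [1] * 9 + [0]  # the single minority sample is misclassified
groups = ["majority"] * 9 + ["minority"]

average = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
worst = worst_group_accuracy(y_true, y_pred, groups)
print(average, worst)  # 0.9 0.0
```

Reporting the minimum over groups rather than the mean is what makes the minority failure visible.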
Unified Theoretical Framework: Treats all biases as violations of conditional independence between predictions and protected/spurious attributes given true labels, formalized through information-theoretic measures
Formal Equivalence Conditions: Proves when spurious correlations, subgroup shifts, and fairness violations produce quantitatively equivalent effects (Theorem 2)
Predictive Theory: Framework predicts worst-group performance from distribution properties, empirically validated on 18 problem configurations
Method Transfer Verification: Demonstrates that debiasing techniques transfer across theoretically equivalent problems, reaching within 5% of the performance of training from scratch
Literature Bridging: Establishes unified perspective across fairness, robustness, and generalization research communities
Theorem 2 (Bias Equivalence):
Consider two learning problems (D₁, A₁) and (D₂, A₂) with identical feature space X and label space Y but different attributes A₁, A₂. Under smoothness assumptions on loss function ℓ and feature overlap condition:
η = min_y ∫ min(p₁(x|y), p₂(x|y))dx > τ
If the bias mechanisms satisfy ϵ-equivalence, where B(f; D) denotes the bias measure of model f on D:
|B(f; D₁) - B(f; D₂)| ≤ ϵ
then the worst-group accuracy of f on the two problems differs by at most δ(ϵ, η), where:
δ(ϵ, η) = O(√ϵ/η)
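To make the two quantities in Theorem 2 tangible, here is a minimal sketch that evaluates the overlap η and the shape of the bound δ. It assumes the features have been discretized into bins, so the integral becomes a sum, and it uses a placeholder constant `c` for the constant hidden in the O(·) notation; neither the discretization nor `c` comes from the paper.

```python
from math import sqrt

def feature_overlap(p1, p2):
    """eta = min over labels y of sum_x min(p1(x|y), p2(x|y)).

    p1, p2: dicts mapping each label y to a list of probabilities over the
    same discrete feature bins (an assumed discretization of the integral).
    """
    return min(
        sum(min(u, v) for u, v in zip(p1[y], p2[y]))
        for y in p1
    )

def delta_bound(eps, eta, c=1.0):
    """Evaluate c * sqrt(eps) / eta; c stands in for the unknown hidden constant."""
    return c * sqrt(eps) / eta

# Hypothetical class-conditional feature histograms for two problems.
p1 = {0: [0.5, 0.5], 1: [0.9, 0.1]}
p2 = {0: [0.5, 0.5], 1: [0.6, 0.4]}

eta = feature_overlap(p1, p2)
print(round(eta, 3))                     # 0.7
print(round(delta_bound(0.01, eta), 3))  # 0.143
```

The bound grows as η shrinks, which matches the role of the threshold τ: with little overlap between the two problems, the guarantee becomes vacuous.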
Corollary 3 (Spurious Correlation ↔ Imbalance):
A spurious correlation of strength α is equivalent to a subgroup imbalance ratio r when:
r ≈ (1+α)/(1-α)
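The relation r ≈ (1+α)/(1-α) from the abstract is simple enough to evaluate directly. The sketch below assumes α lies in [0, 1); the function names and the algebraic inverse α = (r-1)/(r+1) are added here for illustration and are not stated in the paper.

```python
def equivalent_imbalance_ratio(alpha):
    """Imbalance ratio r predicted to match a spurious correlation of strength alpha."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1)")
    return (1.0 + alpha) / (1.0 - alpha)

def equivalent_correlation_strength(r):
    """Inverse mapping alpha = (r - 1) / (r + 1), obtained by solving for alpha."""
    if r < 1.0:
        raise ValueError("r is assumed to be at least 1 (majority:minority)")
    return (r - 1.0) / (r + 1.0)

for alpha in (0.7, 0.9, 0.99):
    print(alpha, round(equivalent_imbalance_ratio(alpha), 1))
# 0.7 5.7
# 0.9 19.0
# 0.99 199.0
```

The endpoints reproduce the 5.7:1 and 199:1 ratios reported in the correlation-strength experiment.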
Step 1: Relating Bias to Worst-Group Loss
Via Fano's inequality, worst-group error rate satisfies:
Err_worst ≤ [H(Y|A) + B(f; D)] / log 2
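As a rough numerical illustration of the Step 1 bound, the sketch below estimates H(Y|A) from samples with a plug-in estimator and evaluates the right-hand side. It assumes entropies are measured in nats and that the bias term B(f; D) is supplied by the caller; clipping the result at 1 is a convenience added here, since an error rate cannot exceed 1.

```python
from collections import Counter
from math import log

def conditional_entropy(y, a):
    """Plug-in estimate of H(Y | A) in nats from paired discrete samples."""
    n = len(y)
    joint = Counter(zip(y, a))
    count_a = Counter(a)
    # H(Y|A) = sum over (y, a) of p(y, a) * log(1 / p(y | a))
    return sum((c / n) * log(count_a[av] / c) for (_, av), c in joint.items())

def worst_group_error_bound(y, a, bias):
    """Evaluate [H(Y|A) + B] / log 2, clipped to 1 (the clipping is an added assumption)."""
    return min(1.0, (conditional_entropy(y, a) + bias) / log(2))

# Hypothetical labels and attributes.
y = [0, 0, 1, 1]
a_informative = [0, 0, 1, 1]    # A determines Y, so H(Y|A) = 0
a_uninformative = [0, 1, 0, 1]  # A independent of Y, so H(Y|A) = log 2

print(round(worst_group_error_bound(y, a_informative, bias=0.05), 3))  # 0.072
print(worst_group_error_bound(y, a_uninformative, bias=0.05))          # 1.0
```

In the second case the bound is vacuous, which shows that it is only informative when H(Y|A) and the bias term are both small.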
Step 2: Feature Overlap and Loss Distribution
Under the feature overlap condition η > τ, the coupling lemma and Lipschitz continuity of the loss bound the Wasserstein-1 distance between the loss distributions L₁ and L₂:
|B(f; D₁) - B(f; D₂)| ≤ ϵ ⟹ W₁(L₁, L₂) ≤ C√ϵ/η
Step 3: Bounding Accuracy Difference
Via Kantorovich-Rubinstein duality:
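The displayed inequality for this step is not shown. As a hedged sketch of how the argument could close: Kantorovich-Rubinstein duality expresses W₁ as a supremum over 1-Lipschitz test functions, so if worst-group accuracy can be written as the expectation of a K-Lipschitz function g of the loss, the Step 2 bound carries over directly. The function g and the constant K are assumptions made here for illustration and do not appear in the source.

```latex
\begin{align*}
W_1(L_1, L_2)
  &= \sup_{\|h\|_{\mathrm{Lip}} \le 1}
     \left| \mathbb{E}_{\ell \sim L_1}[h(\ell)] - \mathbb{E}_{\ell \sim L_2}[h(\ell)] \right|, \\
\left| \mathrm{Acc}_{\mathrm{worst}}(f; D_1) - \mathrm{Acc}_{\mathrm{worst}}(f; D_2) \right|
  &= \left| \mathbb{E}_{\ell \sim L_1}[g(\ell)] - \mathbb{E}_{\ell \sim L_2}[g(\ell)] \right|
  \le K \, W_1(L_1, L_2)
  \le \frac{K C \sqrt{\epsilon}}{\eta}
  = O\!\left( \frac{\sqrt{\epsilon}}{\eta} \right).
\end{align*}
```

Under these assumptions the final expression agrees with δ(ϵ, η) = O(√ϵ/η) in Theorem 2.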
Information-Theoretic Unified View: First to use conditional mutual information I(Ŷ; A | Y) to characterize fairness, robustness, and distribution shift in a uniform way
Quantitative Equivalence Prediction: Provides computable formulas predicting equivalent bias configurations, rather than merely qualitative analysis
Feature Overlap Conditions: Explicitly identifies boundary conditions for equivalence (η > τ), explaining when equivalence fails
Operationality: The theory's predictions can be applied directly by measuring α and the label marginals, with no complex computation
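The conditional mutual information I(Ŷ; A | Y) mentioned above can be estimated from discrete samples with a plug-in estimator. The following is a minimal sketch under that assumption (discrete predictions, attributes, and labels; result in nats); it is not the paper's estimator, and the function name is illustrative.

```python
from collections import Counter
from math import log

def conditional_mutual_information(y_hat, a, y):
    """Plug-in estimate of I(Y_hat; A | Y) in nats from discrete samples."""
    n = len(y)
    joint = Counter(zip(y_hat, a, y))
    count_y = Counter(y)
    count_yhat_y = Counter(zip(y_hat, y))
    count_a_y = Counter(zip(a, y))
    cmi = 0.0
    for (yh, av, yv), c in joint.items():
        # p(yh, a, y) * p(y) / (p(yh, y) * p(a, y)); the factors of n cancel
        ratio = c * count_y[yv] / (count_yhat_y[(yh, yv)] * count_a_y[(av, yv)])
        cmi += (c / n) * log(ratio)
    return cmi

# Hypothetical data: attribute A is independent of label Y.
y = [0, 0, 1, 1]
a = [0, 1, 0, 1]

# A predictor that outputs the true label carries no extra information about A.
print(round(conditional_mutual_information(y, a, y), 3))  # 0.0
# A predictor that copies the attribute is maximally dependent on A given Y.
print(round(conditional_mutual_information(a, a, y), 3))  # 0.693
```

A value of zero corresponds to conditional independence of Ŷ and A given Y; larger values indicate that predictions depend on the protected or spurious attribute beyond what the label explains.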
Finding: Equivalence is consistent across architectures (average variation 0.8%), indicating that the phenomenon is fundamentally distributional
Correlation Strength:
Spurious correlation strength α is varied systematically from 0.7 to 0.99, giving predicted equivalent imbalance ratios from 5.7:1 to 199:1. All predictions are verified within 4% worst-group accuracy, confirming Corollary 3 across the entire range of correlation strengths.
Binary Classification Assumption: The current theory is limited to binary classification, though it extends naturally to multi-class problems via one-vs-rest decomposition
Bound Looseness: δ(ϵ, η) bound may be loose in practice; tighter characterization via concentration inequalities remains open
Worst-Group Metric: Focuses on worst-group metrics; connections to calibration and individual fairness merit exploration
Sagawa et al. (2020) - GroupDRO method and Waterbirds benchmark
Geirhos et al. (2020) - Shortcut learning in deep networks
Hardt et al. (2016) - Equalized odds in supervised learning
Koh et al. (2021) - WILDS wild distribution shift benchmark
Kirichenko et al. (2022) - Deep Feature Reweighting (DFR)
Liu et al. (2021) - Just Train Twice (JTT) method
Overall Assessment: This is a high-quality work that combines theory and experiments and makes pioneering contributions to research on machine learning bias. The theoretical framework is elegant and practical, and the experimental validation is sufficient. The main limitations are the binary classification assumption and the missing multi-class extension. For a top-tier venue such as NeurIPS, this is a strong paper that merits acceptance, and it is likely to have significant impact and to inspire subsequent research. The authors should supplement the final version with additional method transfer experiments, failure case analysis, and practical guidance on selecting the feature overlap threshold.