2025-11-14T22:04:10.870857

Characterizing the Multiclass Learnability of Forgiving 0-1 Loss Functions

Trauger, Trauger, Tewari

In this paper we will give a characterization of the learnability of forgiving 0-1 loss functions in the finite label multiclass setting. To do this, we create a new combinatorial dimension that is based off of the Natarajan Dimension and we show that a hypothesis class is learnable in our setting if and only if this Generalized Natarajan Dimension is finite. We also show a connection to learning with set-valued feedback. Through our results we show that the learnability of a set learning problem is characterized by the Natarajan Dimension.

academic

Characterizing the Multiclass Learnability of Forgiving 0-1 Loss Functions

Basic Information

Paper ID: 2510.08382
Title: Characterizing the Multiclass Learnability of Forgiving 0-1 Loss Functions
Authors: Jacob Trauger (University of Michigan), Tyson Trauger (The Ohio State University), Ambuj Tewari (University of Michigan)
Classification: cs.LG (Machine Learning), stat.ML (Statistics - Machine Learning)
Publication Date: October 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.08382

Abstract

This paper provides a characterization of the learnability of forgiving 0-1 loss functions in the finite-label multiclass classification setting. To this end, the authors create a new combinatorial dimension based on the Natarajan dimension and prove that a hypothesis class is learnable in this setting if and only if this generalized Natarajan dimension is finite. The paper also demonstrates connections to set-valued feedback learning, showing that the learnability of set learning problems is characterized by the Natarajan dimension.

Research Background and Motivation

Problem Background

In machine learning theory, characterizing the learnability of classification tasks is a central problem. For binary classification, VC dimension completely characterizes PAC learnability; for multiclass classification in the finite-label case, the Natarajan dimension plays an analogous role. However, these theories are based on the standard 0-1 loss function, which possesses the "identity of indiscernibles" property—loss is zero if and only if two labels are equal.

Research Motivation

In practical applications, more "forgiving" loss functions are often needed, such as:

Sentence paraphrasing tasks: Multiple different sentences may all be correct paraphrases
Threshold-based metrics: Outputs within a certain threshold range are all acceptable
Set-valued feedback learning: Predicted results only need to be in a given set

In these scenarios, multiple different outputs may produce zero loss for the same true label, breaking the foundational assumptions of traditional theory.

Limitations of Existing Methods

Existing learnability theories (VC dimension, Natarajan dimension, etc.) implicitly link label equality to loss values. When loss functions do not satisfy the identity of indiscernibles property, these theories no longer apply, necessitating new theoretical frameworks to characterize learnability.

Core Contributions

Proposes Generalized Natarajan Dimension: Creates a new combinatorial dimension based on the Natarajan dimension, applicable to forgiving 0-1 loss functions
Complete Learnability Characterization: Proves that a hypothesis class is PAC learnable under forgiving 0-1 loss if and only if the generalized Natarajan dimension is finite
Resolution of Set Learning Problems: First characterizes the learnability of set-valued feedback learning in the batch setting
Establishment of Theoretical Framework: Establishes systematic learning theory for loss functions that do not satisfy the identity of indiscernibles property

Methodology Details

Task Definition

Input Space: $X$ (arbitrary input space) Output Space: $Y = [k]$ (finite label set, $|Y| = k$ ) Hypothesis Class: $H \subset Y^X$ Loss Function: $\ell: Y \times Y \to \{0,1\}$ , satisfying the following constraints:

Binary-valuedness: $\forall y_1, y_2 \in Y, \ell(y_1, y_2) \in \{0,1\}$
Symmetry: $\forall y_1, y_2 \in Y, \ell(y_1, y_2) = \ell(y_2, y_1)$
Non-containment: $\forall y_1, y_2 \in Y, \sigma(y_1) \not\subset \sigma(y_2)$
Reflexivity: $\forall y \in Y, \ell(y, y) = 0$

where $\sigma(y) = \{y' | \ell(y, y') = 0\}$ denotes the set of all labels that produce zero loss with $y$ .

Core Theoretical Construction

1. Generalized Natarajan Dimension

Definition 4 (Generalized Natarajan Dimension): A hypothesis class $H$ and loss function $\ell$ generalized Natarajan-shatter a set $S = \{s_1, ..., s_n\}$ if there exist $h_1, h_2 \in H$ such that:

Separation Condition: $\forall s_i \in S, \sigma(h_1(s_i)) \neq \sigma(h_2(s_i))$
Realization Condition: $\forall S' \subseteq S$ $\forall S^{'} \subseteq S$ , there exists $h \in H$ $h \in H$ such that:
- $\forall s \in S': \sigma(h(s)) = \sigma(h_1(s))$
- $\forall s \in S \setminus S': \sigma(h(s)) = \sigma(h_2(s))$

The generalized Natarajan dimension $\text{GNdim}(H, \ell)$ is the cardinality of the largest set that can be generalized Natarajan-shattered by $H$ .

2. Construction of Equivalent Learning Problems

Key Insight: By defining an equivalence relation $y \sim y' \Leftrightarrow \sigma(y) = \sigma(y')$ through the $\sigma$ function, the original problem is transformed into a standard multiclass learning problem on the quotient space $Y_C = Y/\sim$ .

Lemma 1: The loss function respects the equivalence relation, i.e., if $y_1 \sim y_1'$ and $y_2 \sim y_2'$ , then $\ell(y_1, y_2) = \ell(y_1', y_2')$ .

Corollary 1: The original learning problem $(X, Y, H, \ell)$ is equivalent to the quotient space learning problem $(X, Y_C, H_C, \ell_C)$ .

Corollary 2: $\text{GNdim}(H, \ell) = \text{Ndim}(H_C)$

Main Theorem

Theorem 1 (Main Result): The learning problem $(X, Y, H, \ell)$ is PAC learnable if and only if $\text{GNdim}(H, \ell) < \infty$ .

Proof Strategy

Necessity (Lemma 2):

Employs proof by contradiction, assuming $\text{GNdim}(H, \ell) = \infty$
Constructs adversarial distribution families such that any learning algorithm performs poorly on some distribution
Utilizes the shattering property to construct complex function families on $2^m$ points
Proves via probabilistic arguments that there exists a distribution such that any algorithm's loss is at least $\frac{1}{2k}$

Sufficiency (Lemma 3):

Leverages the construction of equivalent learning problems
Decomposes the original loss as a linear combination of $k$ standard 0-1 losses: $1 - L_D(h) = \sum_{i=1}^k (1 - L_{D_i}(h))$
Since $H_C$ has finite Natarajan dimension, it possesses uniform convergence properties
Proves the effectiveness of the ERM algorithm through union bounds

Experimental Setup

This paper is purely theoretical work, primarily verifying theoretical results through mathematical proofs rather than traditional experimental setups. Theoretical verification is conducted through:

Theoretical Verification Methods

Constructive Proofs: Proves necessity by constructing specific adversarial distributions
Reduction Proofs: Reduces complex problems to known standard multiclass learning problems
Dimension Analysis: Characterizes learnability through the finiteness of combinatorial dimensions

Application Example Analysis

The paper validates the theoretical effectiveness through set learning problems, constructing specific loss matrices to illustrate theoretical applicability.

Experimental Results

Main Theoretical Results

Completion of Theorem 1's Proof: Successfully proves that the generalized Natarajan dimension completely characterizes PAC learnability under forgiving 0-1 loss functions.

Characterization of Set Learning (Corollary 3): For set learning problems where the loss function is defined as: $\ell(y, v) = \begin{cases} 0 & y \in v \\ 1 & \text{otherwise} \end{cases}$

It is proven that the learnability of such problems is characterized by the standard Natarajan dimension, i.e., $\text{GNdim}(H, \ell) = \text{Ndim}(H)$ .

Theoretical Insights

Dimension Preservation Property: In many cases, even when loss functions become more forgiving, learning complexity (measured by dimension) may remain unchanged
Existence of Adversarial Distributions: The rigor of PAC learning implies that even if a loss function is mostly zero, there may still exist distributions that make learning difficult
Importance of Quotient Spaces: Through appropriate equivalence relations, complex learning problems can be reduced to standard problems

Classical Learning Theory

VC Theory: Vapnik & Chervonenkis (1974) established learnability theory for binary classification
Natarajan Dimension: Natarajan (1989) extended the theory to multiclass classification
DS Dimension: Daniely & Shalev-Shwartz (2014) addressed infinite-label cases

Extended Learning Settings

Partial Concept Classes: Alon et al. (2022) studied partially defined concept classes
Multi-output Learning: Raman et al. (2024) characterized multi-output learning problems
Online Set Learning: Raman et al. (2024) studied set-valued feedback in online settings

Positioning of This Paper's Contribution

This paper fills a gap in the theory of forgiving loss functions in the batch setting, particularly being the first to systematically address loss functions that do not satisfy the identity of indiscernibles property.

Conclusions and Discussion

Main Conclusions

Complete Characterization: The generalized Natarajan dimension completely characterizes PAC learnability under forgiving 0-1 loss functions
Resolution of Set Learning: First complete characterization of set learning learnability in the batch setting
Theoretical Unification: Establishes a unified theoretical framework for loss functions that do not satisfy the identity of indiscernibles property

Limitations

Symmetry Assumption: Current theory requires loss functions to be symmetric, which may be overly restrictive
Finite Label Restriction: Theory only applies to finite-label cases; extension to infinite labels remains open
Learning Rates: While learnability is characterized, the paper does not analyze how learning rates vary with the forgivingness of the loss function

Future Directions

Removing Symmetry Assumptions: Study learnability of asymmetric loss functions
Infinite Label Extension: Similar to DS dimension's extension of Natarajan dimension
Learning Rate Analysis: Investigate how sample complexity depends on the forgivingness of the loss function
Algorithm Design: Design efficient learning algorithms specifically tailored for forgiving loss functions

In-Depth Evaluation

Strengths

Strong Theoretical Innovation: First systematic treatment of loss functions that do not satisfy the identity of indiscernibles property, filling an important theoretical gap
Mathematical Rigor: Complete and rigorous proofs, particularly elegant in the technique of reducing complex problems to known problems through quotient space construction
High Practical Value: Provides theoretical foundations for set learning and other practical problems with significant application value
Clear Presentation: Well-structured paper with accurate mathematical exposition, easy to understand and verify

Weaknesses

Restrictive Assumptions: Symmetry and finite-label assumptions limit the applicability of the theory
Lack of Algorithm Analysis: While proving ERM effectiveness, the paper lacks in-depth analysis of specific algorithm design and optimization
Insufficient Experimental Validation: As a purely theoretical work, it lacks verification on real datasets and application case studies
Incomplete Complexity Analysis: Does not provide concrete sample complexity bounds

Impact

Significant Theoretical Contribution: Opens new research directions in learning theory, expected to inspire substantial subsequent research
Substantial Practical Value: Provides theoretical foundations for set learning, multi-label learning, and other practical problems
Excellent Reproducibility: Theoretical results are entirely based on mathematical proofs with perfect reproducibility
Strong Inspirational Value: Proposed techniques (quotient space construction, equivalence relations, etc.) may apply to other learning theory problems

Applicable Scenarios

Set-valued Prediction: Recommendation systems, information retrieval, and other scenarios requiring candidate set returns
Multi-label Learning: Text classification, image annotation, and other tasks allowing multiple correct answers
Robust Learning: Learning scenarios requiring tolerance to label noise
Approximate Learning: Prediction tasks allowing certain degrees of approximation

References

The paper cites important literature in learning theory, including:

Vapnik & Chervonenkis (1974): Foundational work in VC theory
Natarajan (1989): Important contributions to multiclass learning theory
Shalev-Shwartz & Ben-David (2014): Modern learning theory textbook
Recent related work such as Daniely et al., Brukhim et al., Raman et al.

Overall Assessment: This is a high-quality theoretical paper making important contributions to learning theory. While subject to certain assumption limitations, it opens new research directions with significant theoretical value and practical implications.