2025-11-10T02:33:59.960416

Active Learning of General Halfspaces: Label Queries vs Membership Queries

Diakonikolas, Kane, Ma

We study the problem of learning general (i.e., not necessarily homogeneous) halfspaces under the Gaussian distribution on $R^d$ in the presence of some form of query access. In the classical pool-based active learning model, where the algorithm is allowed to make adaptive label queries to previously sampled points, we establish a strong information-theoretic lower bound ruling out non-trivial improvements over the passive setting. Specifically, we show that any active learner requires label complexity of $\tildeÎ©(d/(\log(m)Îµ))$, where $m$ is the number of unlabeled examples. Specifically, to beat the passive label complexity of $\tilde{O} (d/Îµ)$, an active learner requires a pool of $2^{poly(d)}$ unlabeled samples. On the positive side, we show that this lower bound can be circumvented with membership query access, even in the agnostic model. Specifically, we give a computationally efficient learner with query complexity of $\tilde{O}(\min\{1/p, 1/Îµ\} + d\cdot polylog(1/Îµ))$ achieving error guarantee of $O(opt)+Îµ$. Here $p \in [0, 1/2]$ is the bias and $opt$ is the 0-1 loss of the optimal halfspace. As a corollary, we obtain a strong separation between the active and membership query models. Taken together, our results characterize the complexity of learning general halfspaces under Gaussian marginals in these models.

academic

Active Learning of General Halfspaces: Label Queries vs Membership Queries

Basic Information

Paper ID: 2501.00508
Title: Active Learning of General Halfspaces: Label Queries vs Membership Queries
Authors: Ilias Diakonikolas (University of Wisconsin-Madison), Daniel M. Kane (University of California, San Diego), Mingchen Ma (University of Wisconsin-Madison)
Classification: cs.LG (Machine Learning)
Submission Date: December 31, 2024
Paper Link: https://arxiv.org/abs/2501.00508

Abstract

This paper investigates the problem of learning general (non-homogeneous) halfspaces over Gaussian distributions in $\mathbb{R}^d$ , considering two query access models. In the classical pool-based active learning model, algorithms can perform adaptive label queries on pre-sampled points. The authors establish strong information-theoretic lower bounds that preclude non-trivial improvements over passive settings. Specifically, any active learner requires $\tilde{\Omega}(d/(\log(m)\epsilon))$ label complexity, where $m$ is the number of unlabeled samples. To surpass the passive learning label complexity of $\tilde{O}(d/\epsilon)$ , active learners require $2^{\text{poly}(d)}$ unlabeled samples. On the positive side, the authors demonstrate that this lower bound can be circumvented through membership query access, even in the agnostic model. Specifically, they provide a computationally efficient learner with query complexity $\tilde{O}(\min\{1/p, 1/\epsilon\} + d \cdot \text{polylog}(1/\epsilon))$ , achieving error guarantee $O(\text{opt})+\epsilon$ .

Research Background and Motivation

Problem Definition

This paper studies the problem of learning general halfspaces under Gaussian distributions. Halfspaces (or linear threshold functions, LTFs) are functions of the form $h(x) = \text{sign}(w \cdot x + t)$ , where $w \in S^{d-1}$ is a weight vector and $t$ is a threshold. When $t=0$ , these are called homogeneous halfspaces.

Research Motivation

Theoretical Gap: For homogeneous halfspaces, active learning is known to achieve $O(d\log(1/\epsilon))$ label complexity, but whether similar improvements exist for general halfspaces remains an open question.
Practical Importance: Halfspace learning is a classical problem in machine learning with significant impact from the perceptron algorithm to SVMs and AdaBoost.
Query Model Comparison: The capability differences between active learning (label queries) and membership queries require deeper understanding.

Limitations of Existing Methods

For general halfspaces with bias $p$ , at least $1/p$ labeled samples are needed to observe the first point of the minority class
Existing information-theoretic lower bounds are $\Omega(\min\{1/p, 1/\epsilon\} + d\log(1/\epsilon))$
Lack of rigorous characterization of differences between active learning and membership query models

Core Contributions

Strong Information-Theoretic Lower Bounds: Proves that any active learning algorithm requires $\tilde{\Omega}(d/(\log(m)\epsilon))$ label complexity, where $m$ is the number of unlabeled samples
Membership Query Upper Bounds: Provides a computationally efficient algorithm with query complexity $\tilde{O}(\min\{1/p, 1/\epsilon\} + d \cdot \text{polylog}(1/\epsilon))$
Model Separation: Establishes a strong separation between active learning and membership query models
Complexity Characterization: Completely characterizes the complexity of learning general halfspaces under Gaussian marginal distributions

Methodology Details

Task Definition

Input: Access to labeled function $y(x): \mathbb{R}^d \to \{\pm 1\}$ with target distribution $\mathcal{N}(0,I)$ Output: Halfspace $\hat{h}(x) = \text{sign}(\hat{w} \cdot x + \hat{t})$ Objective: Minimize error rate $\text{err}(\hat{h}) = \Pr_{x \sim \mathcal{N}(0,I)}(\hat{h}(x) \neq y(x))$

Lower Bound Proof Strategy

Core Idea

If a small number of queries can learn a halfspace with error rate $p/2$ , then by randomly partitioning the sample set, one can use the first part to learn the halfspace and the second part to find $d$ negative samples with $O(d)$ expected queries.

Key Lemmas

Lemma 2.1: If there exists an active learning algorithm that uses $r$ label queries to learn a halfspace with bias $p$ to error rate $p/2$ , then there exists an algorithm using $r+O(d)$ queries to find $d$ negative samples from $2m$ samples.

Lemma 2.2: For matrix $A \in \mathbb{R}^{k \times d}$ , if $\|AA^T - dI\|_2 \leq O(d/(t^*)^2)$ , then the probability that a random halfspace labels all $k$ samples as negative is at most $O(p\log(1/p))^k$ .

Upper Bound Algorithm Design

Overall Framework (Algorithm 1)

Bias Estimation: Use $\tilde{O}(\min\{1/p, 1/\epsilon\})$ queries to estimate bias $p$
Threshold Grid: Construct threshold grid $\{t_0, t_1, \ldots, t_\psi\}$ with spacing $1/(2\log(1/\epsilon))$
Initialization and Refinement: Run initialization and refinement algorithms for each grid point
Candidate Selection: Use tournament method to select the best hypothesis from candidates

Refinement Algorithm (Algorithm 3)

Uses projected gradient descent:

Gradient Construction: $G_i := \text{proj}_{w_i^{\perp}} zy(A_i^{1/2}z - \tilde{t}w_i)$
Update Rule: $w_{i+1} = \text{proj}_{S^{d-1}}(w_i + \mu_i\hat{g}_i)$
Localization Technique: Find correct $\tilde{t}$ through binary search

Key Lemma 3.1: If gradient estimates satisfy certain conditions, then $\sin(\theta_{i+1}/2) \leq (1-1/C_2)\sigma_i$

Initialization Algorithm (Algorithm 2)

Uses label smoothing technique:

Smoothed Labels: $\tilde{y}(x) := y(\sqrt{1-\rho^2}x + \rho z)$ , where $z \sim \mathcal{N}(0,I)$
Chow Parameter Estimation: Estimate $\mathbb{E}[z\tilde{y}(x_0)]$ to obtain direction of $w^*$

Experimental Setup

Theoretical Analysis Framework

This is primarily a theoretical work establishing complexity bounds through mathematical proofs rather than empirical experiments.

Analysis Tools

Information-Theoretic Methods: Yao's minimax principle
Geometric Analysis: Concentration phenomena on high-dimensional spheres
Probabilistic Tools: Tail bounds and concentration inequalities for Gaussian distributions

Main Results

Lower Bound Results (Theorem 1.1)

Theorem 1.1: For any active learning algorithm $A$ , there exists a halfspace $h^*$ with bias $p$ such that if $A$ performs fewer than $\tilde{\Omega}(d/(p\log(m)))$ label queries on $m$ i.i.d. Gaussian samples, then with probability at least 2/3 it outputs a hypothesis with error rate exceeding $p/2$ .

Corollary: To surpass passive learning's $\tilde{O}(d/\epsilon)$ complexity, $2^{\text{poly}(d)}$ unlabeled samples are required.

Upper Bound Results (Theorem 1.2)

Theorem 1.2: There exists an algorithm using $\tilde{O}(\min\{1/p, 1/\epsilon\} + d \cdot \text{polylog}(1/\epsilon))$ membership queries with runtime $\text{poly}(d,M)$ , outputting a hypothesis with error rate at most $O(\text{opt}) + \epsilon$ .

Complexity Analysis

Initialization: $\tilde{O}(1/p + d\log(1/\epsilon))$ queries
Refinement: $\tilde{O}(d \cdot \text{polylog}(1/\epsilon))$ queries
Total Complexity: $\tilde{O}(\min\{1/p, 1/\epsilon\} + d \cdot \text{polylog}(1/\epsilon))$

Active Learning Theory

Early work by Dasgupta et al. establishing $O(d\log(1/\epsilon))$ complexity for homogeneous halfspaces
Balcan-Hanneke-Vaughan providing $\tilde{O}((1/p)d^{3/2}\log(1/\epsilon))$ upper bound for general halfspaces

Membership Query Learning

Angluin's introduction of the membership query model
Algorithm design for robust learning under noise

Halfspace Learning

Development from perceptron to modern SVMs
Learning algorithms under adversarial noise

Technical Innovations

Lower Bound Proof Techniques

Decision Tree Analysis: Modeling query algorithms as binary decision trees
Geometric Conditions: Establishing matrix condition $\|AA^T - dI\|_2 \leq O(d/(t^*)^2)$
Probabilistic Analysis: Leveraging tail bounds of Beta distributions

Upper Bound Algorithm Techniques

Localization Technique: Finding correct threshold through bias checking
Label Smoothing: Overcoming difficulties in Chow parameter estimation under small bias
Robustness Analysis: Maintaining algorithm performance under noise

Conclusions and Discussion

Main Conclusions

Active learning provides no significant advantage for general halfspaces (unless exponential unlabeled samples are available)
Membership queries achieve near-optimal query complexity
Exponential separation exists between the two query models

Limitations

Only considers Gaussian distributions; results for other distributions remain unknown
Algorithm implementation requires precise numerical computation
Constant factors may be large

Future Directions

Extension to other distribution families
Improving practical efficiency of algorithms
Studying query complexity for other geometric concept classes

In-Depth Evaluation

Strengths

Significant Theoretical Contribution: First to establish rigorous separation between active learning and membership queries
Advanced Technical Methods: Combines multiple mathematical tools with sophisticated proof techniques
Complete Results: Provides both upper and lower bounds, completely characterizing problem complexity
Clear Presentation: Well-organized technical details with rigorous logical arguments

Weaknesses

Limited Practical Applicability: Polylogarithmic factors in algorithm complexity may be large
Distribution Restrictions: Only considers Gaussian distributions; real-world distributions may differ
Missing Experiments: Lacks empirical validation of theoretical results

Impact

Theoretical Significance: Provides important negative results for active learning theory
Methodological Value: Proof techniques applicable to other learning problems
Practical Guidance: Offers theoretical guidance for practical system design

Applicable Scenarios

Theoretical machine learning research
Query complexity analysis
Online learning system design
Practical applications involving halfspaces

This paper represents an important contribution to active learning theory, revealing through rigorous mathematical analysis the fundamental differences between different query models, establishing a solid foundation for theoretical development in this field.