
LLMDistill4Ads: Using Cross-Encoders to Distill from LLM Signals for Advertiser Keyphrase Recommendations

Dey, Braun, Ravipati et al.

E-commerce sellers are advised to bid on keyphrases to boost their advertising campaigns. These keyphrases must be relevant to prevent irrelevant items from cluttering search systems and to maintain positive seller perception. It is vital that keyphrase suggestions align with seller, search and buyer judgments. Given the challenges in collecting negative feedback in these systems, LLMs have been used as a scalable proxy to human judgments. This paper presents an empirical study on a major ecommerce platform of a distillation framework involving an LLM teacher, a cross-encoder assistant and a bi-encoder Embedding Based Retrieval (EBR) student model, aimed at mitigating click-induced biases in keyphrase recommendations.

Basic Information

Paper ID: 2508.03628
Title: LLMDistill4Ads: Using Cross-Encoders to Distill from LLM Signals for Advertiser Keyphrase Recommendations
Authors: Soumik Dey, Benjamin Braun, Naveen Ravipati, Hansi Wu, Binbin Li (eBay Inc)
Categories: cs.IR (Information Retrieval), cs.AI, cs.LG
Publication Date: arXiv v5, November 20, 2025
Paper Link: https://arxiv.org/abs/2508.03628v5

Abstract

E-commerce sellers need to bid on keywords to improve ad performance, and these keywords must be relevant to prevent unrelated products from polluting the search system and maintain seller satisfaction. Due to the difficulty of collecting negative feedback, this paper proposes using LLMs as scalable proxies for human judgment. The research implements a knowledge distillation framework on a large-scale e-commerce platform: LLM teacher model → cross-encoder assistant → dual-encoder EBR student model, aimed at mitigating click bias issues in keyphrase recommendations.

Research Background and Motivation

1. Core Problem

The task is to recommend relevant keywords (buyer query terms) that sellers in e-commerce advertising systems can bid on. Key challenges include:

Unreliability of click data: High clicks/sales indicate relevance, but lack of clicks does not indicate irrelevance
MNAR bias (Missing Not At Random): Unpopular products rank low, receiving fewer impressions and clicks
Middleman bias: Training data only contains keywords filtered through search relevance, leading to sample selection bias

2. Problem Significance

Keyword relevance directly impacts seller strategy and search system quality
Irrelevant recommendations reduce seller satisfaction, waste resources, and harm ad performance
Need to simultaneously satisfy judgment criteria of sellers, advertising systems, and search systems

3. Limitations of Existing Methods

CTR-only training: Easily replicates popularity and exposure bias in training data
Unreliable negative samples: Negative samples in click logs cannot truly reflect irrelevance
Manual annotation difficulty: High cost, limited scale, and modal bias (annotators can see images but models cannot)

4. Research Motivation

The paper uses the LLM's world knowledge and judgment as a proxy for human judgment. Through multi-task learning and a knowledge distillation framework, it combines CTR, search relevance, and LLM signals to train an efficient dual-encoder retrieval model.

Core Contributions

Proposes Teacher-Assistant-Student distillation framework: Three-level architecture of LLM teacher → cross-encoder assistant → dual-encoder student
Multi-signal fusion training strategy: Integrates CTR, search relevance (SR), and LLM labels in a multi-task learning paradigm
Systematic loss function comparative study: Evaluates 8 knowledge distillation loss functions, finding Pearson correlation loss optimal
Production environment evaluation protocol: Proposes offline evaluation method simulating real ad auction scenarios
Significant business impact: A/B testing shows 51.26% GMB increase, 38.69% ROAS increase, 11.75% keyphrase adoption rate increase

Method Details

Task Definition

Input: Item title + category and buyer query keyphrase
Output: Relevance judgment (binary classification or continuous similarity score)
Objective: Retrieve Top-K most relevant keywords for each product for ad bidding
Constraints: Requires low latency (suitable for production), high accuracy (aligned with multiple stakeholders)
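
As a rough illustration of the Top-K objective above, the sketch below picks the K best keyphrases from a set of precomputed similarity scores. The function name `top_k_keyphrases` and the toy scores are assumptions for illustration only; a production system would query an ANN index rather than scan a dictionary.

```python
import heapq

def top_k_keyphrases(scores, k):
    """Return the k keyphrases with the highest similarity, best first.

    `scores` maps keyphrase -> similarity score (hypothetical toy input).
    """
    return heapq.nlargest(k, scores, key=scores.get)

# Toy similarity scores for one item (made-up values).
example_scores = {
    "microsoft surface charger": 0.91,
    "laptop charger": 0.74,
    "yellow iphone": 0.05,
}
best = top_k_keyphrases(example_scores, k=2)
print(best)
```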

Model Architecture

1. Dataset Construction (Three Label Sources)

CTR Labels (10,702,747 samples):

Calculate click-to-impression ratio over past 30 days
CTR > 0.05 marked as positive samples
Positive samples are reliable; negative samples are not, so this source is used only with the MNR (Multiple Negatives Ranking) loss
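
The CTR labeling rule can be sketched as follows. The 0.05 threshold comes from the text; the function name `ctr_label` and the choice to return `None` for pairs that are not positive (rather than a negative label) are assumptions, made here to reflect that click logs do not give trustworthy negatives.

```python
def ctr_label(clicks, impressions, threshold=0.05):
    """Return 1 for a positive pair, or None when no reliable label exists.

    A pair is positive only when its click-to-impression ratio is strictly
    above the threshold. Low CTR is NOT treated as a negative label.
    """
    if impressions <= 0:
        return None  # no exposure, so nothing can be inferred
    ctr = clicks / impressions
    return 1 if ctr > threshold else None

print(ctr_label(clicks=6, impressions=100))  # above threshold
print(ctr_label(clicks=1, impressions=100))  # below threshold, unlabeled
```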

Search Relevance (SR) Labels (18,721,682 samples):

Collected from auction process over 3 months using SR model scores
Exceeding business threshold marked as positive samples
Free of middleman bias and sample selection bias

LLM Labels (50,078,315 training set, 3,524,414 test set):

Generated using Mixtral 8X7B Instruct-v0.1
90% consistency with click data
Prompt design:

Given an item with title: "{title}", 
determine whether the keyphrase: "{keyphrase}", 
is relevant for cpc targeting or not by giving 
ONLY yes or no answer
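
A minimal sketch of how the prompt above might be filled in and its answer turned into a binary label. The template text is taken from the prompt; the helper names `build_prompt` and `parse_answer`, and the parsing rule (look at the first word only), are assumptions and not the authors' pipeline. No LLM call is made here.

```python
PROMPT_TEMPLATE = (
    'Given an item with title: "{title}", '
    'determine whether the keyphrase: "{keyphrase}", '
    "is relevant for cpc targeting or not by giving "
    "ONLY yes or no answer"
)

def build_prompt(title, keyphrase):
    """Fill the template with one item title and one candidate keyphrase."""
    return PROMPT_TEMPLATE.format(title=title, keyphrase=keyphrase)

def parse_answer(text):
    """Map a raw completion to 1 (yes), 0 (no), or None if unparseable."""
    words = text.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,!:;\"'")
    if first == "yes":
        return 1
    if first == "no":
        return 0
    return None

prompt = build_prompt(
    "Genuine 15V 4A Power AC Adapter Laptop Charger For Surface Pro 3 4 5 6",
    "microsoft surface charger",
)
print(prompt)
```

Returning `None` for anything other than a clear yes/no keeps malformed completions out of the label set instead of silently counting them as negatives.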

2. Cross-Encoder (Assistant)

Base Model: microBERT (distilled version of eBERT)

4.3x smaller than eBERT, 5.5x faster
Pre-trained on eBay product data

Input Format:

query [SEP] category name [SEP] item title

Training:

Fine-tuned on 50M LLM labels using cross-entropy loss
Test set F1=96% (7.5M samples)

Role: Serves as intermediate assistant model, providing soft labels for distillation

3. Dual-Encoder (Student)

Base Model: microBERT dual-tower architecture

Input Processing:

Item tower: item title [SEP] category name
Keyphrase tower: buyer query
Compute cosine similarity after independent encoding
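
The input formats listed for the assistant and the student, plus the cosine score the student computes, can be sketched as below. Joining fields with a literal "[SEP]" string is an assumption (real tokenizers usually insert separator tokens themselves), and the helper names are hypothetical. The vectors stand in for encoder outputs and are assumed to be non-zero.

```python
import math

SEP = "[SEP]"

def cross_encoder_input(query, category, title):
    """Assistant input: query, category name, and item title in one sequence."""
    return f"{query} {SEP} {category} {SEP} {title}"

def item_tower_input(title, category):
    """Student item tower input: item title followed by category name."""
    return f"{title} {SEP} {category}"

def cosine_similarity(a, b):
    """Cosine similarity between two independently encoded vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cross_encoder_input("surface charger", "Laptop Adapters", "15V 4A Charger"))
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```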

Output Dimension Optimization:

Train with Matryoshka loss so that embeddings can be truncated to 64 dimensions (reducing ANN latency)
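
At inference time, using a Matryoshka-trained embedding at a smaller size amounts to keeping its leading components. The sketch below shows that step only, not the training loss. The function name `truncate_embedding` and the renormalization to unit length are assumptions; with cosine similarity the renormalization does not change rankings, but it keeps dot products and cosine scores interchangeable.

```python
import math

def truncate_embedding(vec, dim=64):
    """Keep the first `dim` components and rescale to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]

full = [0.5] * 128  # stand-in for a full-size encoder output
small = truncate_embedding(full, dim=64)
print(len(small))
```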

4. Multi-Task Training Paradigm

Core Idea: Each batch contains samples from only one dataset, and the dataset for each batch is chosen with probability proportional to its size
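
One way to read this batching rule is sketched below: for every training step, pick a single source dataset with probability proportional to its size. The dataset sizes are the ones reported in the dataset construction section; the function name `sample_batch_sources`, the fixed seed, and the choice to leave the distillation data out of the draw are assumptions.

```python
import random

# Sizes of the three label sources as reported in the text.
DATASET_SIZES = {"CTR": 10_702_747, "SR": 18_721_682, "LLM": 50_078_315}

def sample_batch_sources(sizes, num_batches, seed=0):
    """Pick one source dataset per batch, weighted by dataset size."""
    rng = random.Random(seed)
    names = list(sizes)
    weights = [sizes[name] for name in names]
    return rng.choices(names, weights=weights, k=num_batches)

sources = sample_batch_sources(DATASET_SIZES, num_batches=10_000)
share_llm = sources.count("LLM") / len(sources)
print(round(share_llm, 2))
```

Under this scheme roughly 63% of batches come from the LLM labels, since they make up about 50M of the roughly 79.5M labeled pairs.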

Loss Function Combination:

Data Source	Loss Function	Rationale
CTR Labels	MNR Loss	Only reliable positive samples; negatives generated via IRNS (in-batch random negative sampling)
SR Labels	Contrastive Loss	Clear positive and negative samples
LLM Labels	Contrastive Loss	Clear positive and negative samples
Cross-Encoder Distillation	Pearson Correlation Loss	Align ranking order
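
To make the CTR row concrete, here is a minimal sketch of a multiple-negatives-style ranking loss over a batch of positive (item, keyphrase) pairs, where every other keyphrase in the batch acts as a negative for a given item. This is the common formulation of such a loss, not code from the paper. The function name `mnr_loss` and the scale factor of 20 are assumptions.

```python
import math

def mnr_loss(sim_matrix, scale=20.0):
    """Mean cross-entropy where row i's correct column is i.

    sim_matrix[i][j] is the cosine similarity between item i and keyphrase j.
    The diagonal holds the positive pairs; off-diagonal entries are the
    in-batch negatives.
    """
    total = 0.0
    for i, row in enumerate(sim_matrix):
        logits = [scale * s for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]
    return total / len(sim_matrix)

aligned = [[0.9, 0.1], [0.2, 0.8]]     # positives score highest
misaligned = [[0.1, 0.9], [0.8, 0.2]]  # negatives score highest
print(mnr_loss(aligned) < mnr_loss(misaligned))
```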

Technical Innovations

1. Necessity of Teacher-Assistant Architecture

Direct distillation from LLM to dual-encoder performs poorly (F1=0.66 with a softmax objective, versus 0.88 through the cross-encoder assistant)
Cross-encoder as intermediate bridge:
- Stronger learning capacity than dual-encoder (can jointly encode)
- More efficient than LLM (can generate large-scale soft labels)
- Enables progressive knowledge transfer

2. Rationality of Multi-Signal Fusion

LLM+CTR+KD model achieves optimal performance:
- Median keyphrase count: 12
- LLM pass rate: 71%
- Search pass rate: >99%

Design Principles:

CTR provides real interaction signals (reliable positive samples)
LLM provides unbiased judgment (covers unexposed samples)
SR ensures search system acceptance
Cross-encoder provides fine-grained ranking signals

3. Superiority of Pearson Loss

Experimental comparison (Table 1):

KD Loss	F1	Precision	Recall	ρ (Pearson Correlation)
MSE	0.81	0.77	0.86	0.78
CoSENT	0.87	0.86	0.88	0.82
Pearson	0.88	0.87	0.88	0.87
MSEmar	0.86	0.84	0.88	0.80
KL-Div	0.85	0.83	0.88	0.66

Reason Analysis:

MSE is pointwise loss, cannot capture ranking relationships
CoSENT is pairwise ranking loss, has calibration ability
Pearson is batch ranking loss, optimizes overall linear correlation
Pearson loss achieves the highest Pearson correlation with the cross-encoder (0.87)
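
The contrast between a pointwise loss and a correlation loss can be seen on a toy batch. In the sketch below the student reproduces the teacher's scores at half the scale: the order and spacing are preserved, so a Pearson-based loss is near zero while MSE still penalizes the scale gap. Writing the loss as 1 - ρ is an assumption about the exact form, and the scores are made-up values. The batch is assumed to have non-constant scores, otherwise the correlation is undefined.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def pearson_loss(student, teacher):
    """Batch-level loss: zero when scores are perfectly linearly related."""
    return 1.0 - pearson(student, teacher)

def mse_loss(student, teacher):
    """Pointwise loss: penalizes any difference in absolute score."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

teacher = [0.9, 0.7, 0.2, 0.1]         # cross-encoder scores (toy values)
student = [0.5 * t for t in teacher]   # same ordering, half the scale
print(pearson_loss(student, teacher), mse_loss(student, teacher))
```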

Experimental Setup

Datasets

Platform Scale: 2.3 billion products
Training Set:
- CTR: 10.7M
- SR: 18.7M
- LLM: 50M (training) + 3.5M (test)
Evaluation Set: 10,000 samples (per model)
A/B Testing: US market for 12 days

Evaluation Metrics

Offline Metrics:

F1, Precision, Recall: Classification performance
ρ (Pearson Correlation): Alignment with cross-encoder
KP (Keyphrase Count): Median keyphrase count after relevance filtering
PR (Pass Rate): LLM/SR pass rate at different ranking positions
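
For reference, the classification metric and a simple reading of pass rate at a rank cutoff can be computed as below. F1 is the standard harmonic mean of precision and recall. The `pass_rate_at_k` definition (share of the top-k ranked keyphrases judged relevant) is an assumption about how the paper computes it, and both function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def pass_rate_at_k(ranked_judgments, k):
    """Share of the top-k keyphrases (best first) that pass the relevance check."""
    top = ranked_judgments[:k]
    return sum(top) / len(top) if top else 0.0

# Precision and recall from the CoSENT row of Table 1.
print(round(f1_score(0.86, 0.88), 2))
# Toy judgments for one item's ranked keyphrases.
print(pass_rate_at_k([True, True, False, True, False], k=5))
```

Applied to the CoSENT row of Table 1 (precision 0.86, recall 0.88), the formula reproduces the reported F1 of 0.87.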

Online Metrics:

GMB (Gross Merchandise Bought): Sales volume
ROAS (Return on Ad Spend): Advertising ROI
Adoption Rate: Rate at which sellers actually adopt recommended keywords

Baseline Methods

CTR-only: Baseline trained only on CTR
LLM: Only LLM labels + Contrastive Loss
LLM+KD: LLM labels + cross-encoder distillation
LLM+SR+KD: LLM + SR labels + distillation
LLM+CTR+KD: Optimal combination
LLM+SR+CTR+KD: All signals combination

Implementation Details

Base Model: microBERT (selection rationale in Table 3)
Training Framework: PyTorch + Transformers
Batch Sampling: Proportional to dataset size
Production Deployment:
- Batch inference: PySpark (1500 executors)
- NRT inference: Triton + ONNX (V100 GPU)
- Daily increment latency: 35 minutes (20M products)
- ANN retrieval: Additional 2.5 hours

Experimental Results

Main Results

Table 2: Label Ablation Experiment

Model	KP	PR	Pass@5	Pass@10	Pass@15	Pass@20
LLM+CTR+KD	12.0	71	68	60	55	52
LLM+SR+CTR+KD	11.0	70	67	59	54	51
LLM+SR+KD	12.0	51	47	42	41	39
LLM+KD	11.0	49	36	35	33	32
LLM	11.0	61	45	41	38	35
CTR	7.0	60	51	42	37	34

Key Findings:

LLM+CTR+KD Optimal: Achieves best balance between efficiency (KP=12) and quality (PR=71%)
CTR-only Inefficient: Median of only 7 keyphrases, limiting coverage
Distillation Alone Does Not Help: LLM → LLM+KD lowers both PR (61% → 49%) and Pass@5 (45% → 36%) in Table 2; the gains appear once CTR labels are added (LLM+CTR+KD, PR=71%)
SR Signal's Role: Improves search pass rate to >99%

Ablation Experiments

1. Knowledge Distillation Loss Comparison (Table 1)

Pearson Loss Optimal: F1=0.88, ρ=0.87
CoSENT Second Best: F1=0.87, ρ=0.82
MSE Fails: Validates CUPID paper findings
Direct Distillation (LLM→BE) Poor: Contrastive F1=0.83, Softmax F1=0.66

2. Base Model Selection (Table 3)

Base Model	Recall	Precision	F1
eBERT	0.92	0.81	0.86
microBERT	0.92	0.78	0.85
ModernBERT	0.91	0.76	0.83

Rationale for microBERT Selection:

Performance close to eBERT (F1 difference only 0.01)
30% faster inference speed
Pre-trained on platform data (ModernBERT is not pre-trained on platform data)

3. Progressive Multi-Task Framework Construction

CTR (F1=0.66) 
→ CTR+LLM (F1=0.83) 
→ LLM+CTR+KD (F1=0.88)

Each component brings gains

A/B Testing Results (Online Validation)

Test Setup: US market, 12 days, replacing CTR-only EBR model

Business Metric Improvements:

GMB +51.26% (p=0.01) - Significant sales growth
ROAS +38.69% (p=0.02) - Significant ROI improvement
Adoption Rate +11.75% (p=0.03) - Sellers more willing to use recommendations

Significance: Demonstrates offline metric improvements translate to real business value

Case Analysis

Positive Case (LLM and model agreement):

Product: "Genuine 15V 4A Power AC Adapter Laptop Charger For Surface Pro 3 4 5 6"
Keyphrase: "microsoft surface charger"
Judgment: Relevant ✓

Negative Case (Fine-tuned LLM failure):

Product: "iPhone 11 64GB 128G Unlocked..."
Keyphrase: "yellow iphone" (image shows yellow color)
General LLM: Irrelevant (text-only basis)
Fine-tuned LLM: Relevant (affected by modal bias)

Experimental Findings

General LLM Superior to Fine-tuned LLM:
- General LLM: Removes 68% of keywords, sales +10%
- Fine-tuned LLM: Retains 75% of keywords, sales -20%
- Reason: Human annotations have modal bias
Teacher-Assistant Necessity:
- Cross-encoder has better calibration
- Can handle large-scale data for soft label generation
Multi-Signal Complementarity:
- CTR: Reliable positive samples
- LLM: Long-tail coverage
- SR: Search system alignment
- All three indispensable

Related Work

1. Embedding-Based Retrieval (EBR)

Dual-Encoder vs Cross-Encoder:
- Dual-encoder: Independent encoding, supports ANN, low latency
- Cross-encoder: Joint encoding, better performance, high latency
Paper Contribution: Combines both advantages through distillation

2. Click Bias Problem

MNAR Bias: Chen et al. (2023)
Middleman Bias: Dey et al. (2025b) - Prior work by paper authors
Paper Solution: Supplements click data with LLM and SR signals

3. Knowledge Distillation Methods

TwinBERT (Lu et al., 2020): Cross-encoder to dual-tower BERT
ERNIE-search (Lu et al., 2022): Teacher-Assistant architecture
PROD (Lin et al., 2023): Progressive distillation
D2LLM (Liao et al., 2024): Pearson loss for LLM distillation
Paper Contribution: Combines multi-task learning with Teacher-Assistant architecture

4. LLM as Judge

GPT-4 Evaluation: Zheng et al. (2023) - MT-Bench
Search Application: Wang et al. (2024) - Pinterest
Paper Contribution:
- Large-scale application (50M labels)
- Systematic evaluation of general LLM vs fine-tuned LLM
- Discovery of modal bias problem

Conclusions and Discussion

Main Conclusions

LLM Signals Effectively Mitigate Click Bias: In ad keyphrase recommendation scenarios, LLM-generated labels significantly outperform CTR-only approaches
Teacher-Assistant Architecture Superior to Direct Distillation: Cross-encoder as intermediate bridge is critical
Pearson Loss Most Suitable for Ranking Distillation: Batch ranking loss outperforms pointwise and pairwise losses
Multi-Signal Fusion Produces Synergistic Effects: CTR+LLM+KD combination achieves best business results
General LLM Superior to Fine-tuned LLM: When human-annotated data has modal bias

Limitations

Domain Specificity:
- Research limited to e-commerce advertising scenarios
- Method transferability requires validation
Human Annotation Quality Issues:
- Annotators can see images but models cannot (modal bias)
- Label granularity too fine (excellent/good/fair/bad)
- Sample size insufficient to cover 2.3 billion products
Simple Negative Sample Mining Strategy:
- CTR data only uses IRNS (In-batch Random Negative Sampling)
- Advanced methods like ANCE, N-Game not explored
- Left for future research
Limited LLM Choices:
- Uses Mixtral 8X7B (open-source, medium-scale)
- Larger models (GPT-4) limited by API constraints
- No LLM fine-tuning (due to human data quality issues)
Evaluation Limitations:
- Offline evaluation only on LLM label test set
- A/B testing only in US market
- Long-term effects not evaluated

Future Directions

Better Human Judgment Data Collection:
- Unified input modality (text-only or multimodal)
- Simplified labels (binary classification)
- Expanded sample scale
Advanced Negative Sample Mining:
- Explore ANCE, N-Game methods
- Balance computational cost and performance
Multimodal Extension:
- Incorporate image information into models
- Address modal bias problem
Fine-tuned LLM Exploration:
- Fine-tune on high-quality data
- Potentially further improve performance
Cross-Domain Transfer:
- Validate method on other e-commerce platforms
- Extend to non-advertising scenarios

In-Depth Evaluation

Strengths

1. Method Innovation ⭐⭐⭐⭐⭐

Three-Level Teacher-Assistant-Student Architecture: Innovatively combines LLM, cross-encoder, and dual-encoder
Multi-Task Hybrid Training: Skillfully fuses three heterogeneous signal sources
Systematic Loss Function Study: Compares 8 KD losses, provides clear guidance

2. Experimental Sufficiency ⭐⭐⭐⭐⭐

Large-Scale Real Data: 50M LLM labels, 2.3 billion products
Comprehensive Ablation Studies: Labels, losses, base models, architectures
Online Validation: A/B testing proves business value
Detailed Appendix: LLM evaluation, loss function derivations, system architecture

3. Practical Value ⭐⭐⭐⭐⭐

Significant Business Improvements: GMB +51%, ROAS +39%
Production Deployment Details: Complete system architecture and latency analysis
Strong Reproducibility: Open-source models (Mixtral), clear method description

4. Insight Depth ⭐⭐⭐⭐

Modal Bias Discovery: Reveals hidden problems in human annotations
General LLM Advantages: Challenges conventional "fine-tuning is always better" wisdom
Middleman Bias: Builds on the bias type introduced in the authors' prior work (Dey et al., 2025b) and provides a solution

5. Writing Quality ⭐⭐⭐⭐

Clear structure, rigorous logic
Rich figures (auction mechanism diagram, architecture diagram, production system diagram)
Complete mathematical formulas (detailed derivations in Appendix 8.3)

Weaknesses

1. Method Limitations

Computational Cost Not Quantified: GPU time/cost for generating 50M LLM labels not reported
Hyperparameter Sensitivity: No analysis of learning rate, batch size, temperature parameter impacts
Limited LLM Choice: Mixtral 8X7B not optimal, but constrained by open-source and cost

2. Experimental Setup Flaws

Single Test Set Evaluation: Offline experiments only on LLM label test set, not validated on SR/CTR test sets
Short A/B Test Duration: 12 days may be insufficient to observe long-term effects (e.g., seller fatigue)
Geographic Limitation: Only US market, effects in other countries unknown

3. Insufficient Analysis

Few Failure Case Analyses: Only 1 modal bias example provided
Ranking Quality Not Evaluated: No NDCG, MRR and other ranking metrics
Diversity Not Quantified: While uniqueness and diversity mentioned, no specific metrics

4. Reproducibility Issues

Proprietary Models: eBay-specific eBERT/microBERT are not publicly accessible
Data Not Public: Commercial data cannot be shared
Complete Code Not Open-Sourced: Only method description provided

5. Missing Theoretical Analysis

Why Pearson Optimal: Lacks theoretical explanation, only experimental validation
Teacher-Assistant Gain Sources: Contribution of each level not quantified
Multi-Task Learning Theory: No analysis of task interference/synergy

Impact Assessment

Contribution to Field ⭐⭐⭐⭐⭐

E-Commerce Ad System Bias: Systematically clarifies middleman bias, provides solution paradigm
Knowledge Distillation: Validates Teacher-Assistant architecture effectiveness in retrieval tasks
LLM Application: Successful large-scale LLM label generation case study (50M)
Industrial Practice: Complete production system design reference

Academic Impact

High Citation Potential: Solves practical problems, method transferable
Future Research Directions: Multimodal LLMs, better human annotation protocols
Benchmark Role: Pearson loss may become distillation standard

Industrial Impact

Direct Business Value: GMB +51% significant for eBay
Strong Replicability: Other e-commerce platforms can learn (Amazon, Alibaba)
Significant Cost-Benefit: LLM labels replace large-scale manual annotation

Applicable Scenarios

Highly Applicable ✅

E-Commerce Ad Recommendations: Keywords, product recommendations
Search Relevance: Query-document matching
Information Retrieval: Any scenario requiring multi-stakeholder judgment alignment
Bias Mitigation: Recommendation systems with click/exposure bias

Moderately Applicable ⚠️

Other Recommendation Scenarios: Requires signal source adjustment (e.g., video recommendations)
Cross-Lingual Retrieval: Requires multilingual LLM and pre-trained models
Real-Time Systems: Requires NRT inference latency optimization

Not Applicable ❌

Small-Scale Data: Method requires large data volumes (million-level)
Unbiased Scenarios: If click data reliable, method gains limited
Pure Exploration Tasks: Requires diversity rather than relevance

Reproduction Recommendations

To reproduce this work:

Replace LLM: Use Llama 3.1 70B or Qwen 2.5 72B
Replace Base Model: Use public sentence-transformers models
Simplified Version: First validate LLM+CTR+Pearson Loss (no SR data needed)
Evaluation Protocol: Reference Appendix 8.2 offline evaluation procedure
Start Scale: Begin with million-level data, gradually scale up

Selected References

D2LLM (Liao et al., 2024): Uses Pearson loss for LLM→dual-encoder distillation
CUPID (Bhattacharya et al., 2023): Reports that MSE loss is unsuitable for cross→dual-encoder distillation
ERNIE-search (Lu et al., 2022): Early exploration of Teacher-Assistant architecture
Middleman Bias (Dey et al., 2025b): Paper authors' proposed middleman bias theory

Bias and Recommendation

Chen et al. (2023): Recommendation system bias survey
Joachims et al. (2017): Unbiased learning from biased feedback

LLM Evaluation

Zheng et al. (2023): MT-Bench and LLM-as-a-judge
Gu et al. (2025): LLM as judge survey

Overall Rating: ⭐⭐⭐⭐⭐ (5/5)

This is an excellent industrial application paper that validates the effectiveness of LLM-assisted training in real large-scale scenarios, providing a complete solution from theory to practice. Despite some limitations (such as insufficient theoretical analysis, single-market testing), its practical value, method innovation, and experimental sufficiency all reach top-tier levels. Particularly noteworthy is the authors' in-depth analysis of general LLM vs fine-tuned LLM, revealing the modal bias problem in human annotations and providing important warnings for the field.