LLMDistill4Ads: Using Cross-Encoders to Distill from LLM Signals for Advertiser Keyphrase Recommendations

Dey, Braun, Ravipati et al.
E-commerce sellers are advised to bid on keyphrases to boost their advertising campaigns. These keyphrases must be relevant to prevent irrelevant items from cluttering search systems and to maintain positive seller perception. It is vital that keyphrase suggestions align with seller, search and buyer judgments. Given the challenges in collecting negative feedback in these systems, LLMs have been used as a scalable proxy to human judgments. This paper presents an empirical study on a major ecommerce platform of a distillation framework involving an LLM teacher, a cross-encoder assistant and a bi-encoder Embedding Based Retrieval (EBR) student model, aimed at mitigating click-induced biases in keyphrase recommendations.

Basic Information

  • Paper ID: 2508.03628
  • Title: LLMDistill4Ads: Using Cross-Encoders to Distill from LLM Signals for Advertiser Keyphrase Recommendations
  • Authors: Soumik Dey, Benjamin Braun, Naveen Ravipati, Hansi Wu, Binbin Li (eBay Inc)
  • Categories: cs.IR (Information Retrieval), cs.AI, cs.LG
  • Publication Date: arXiv v5, November 20, 2025
  • Paper Link: https://arxiv.org/abs/2508.03628v5

Abstract

E-commerce sellers need to bid on keywords to improve ad performance, and these keywords must be relevant to prevent unrelated products from polluting the search system and maintain seller satisfaction. Due to the difficulty of collecting negative feedback, this paper proposes using LLMs as scalable proxies for human judgment. The research implements a knowledge distillation framework on a large-scale e-commerce platform: LLM teacher model → cross-encoder assistant → dual-encoder EBR student model, aimed at mitigating click bias issues in keyphrase recommendations.

Research Background and Motivation

1. Core Problem

The task is to recommend relevant keywords (buyer query terms) to sellers in an e-commerce advertising system so that they can bid on them. Key challenges include:

  • Unreliability of click data: High clicks/sales indicate relevance, but lack of clicks does not indicate irrelevance
  • MNAR bias (Missing Not At Random): Unpopular products rank low, receiving fewer impressions and clicks
  • Middleman bias: Training data only contains keywords filtered through search relevance, leading to sample selection bias

2. Problem Significance

  • Keyword relevance directly impacts seller strategy and search system quality
  • Irrelevant recommendations reduce seller satisfaction, waste resources, and harm ad performance
  • Need to simultaneously satisfy judgment criteria of sellers, advertising systems, and search systems

3. Limitations of Existing Methods

  • CTR-only training: Easily replicates popularity and exposure bias in training data
  • Unreliable negative samples: Negative samples in click logs cannot truly reflect irrelevance
  • Manual annotation difficulty: High cost, limited scale, and modal bias (annotators can see images but models cannot)

4. Research Motivation

Use the LLM's world knowledge and judgment as a proxy for human judgment, and combine CTR, search relevance, and LLM signals through multi-task learning and knowledge distillation to train an efficient dual-encoder retrieval model.

Core Contributions

  1. Proposes Teacher-Assistant-Student distillation framework: Three-level architecture of LLM teacher → cross-encoder assistant → dual-encoder student
  2. Multi-signal fusion training strategy: Integrates CTR, search relevance (SR), and LLM labels in a multi-task learning paradigm
  3. Systematic loss function comparative study: Evaluates 8 knowledge distillation loss functions, finding Pearson correlation loss optimal
  4. Production environment evaluation protocol: Proposes offline evaluation method simulating real ad auction scenarios
  5. Significant business impact: A/B testing shows 51.26% GMB increase, 38.69% ROAS increase, 11.75% keyphrase adoption rate increase

Method Details

Task Definition

Input: Item title + category and buyer query keyphrase
Output: Relevance judgment (binary classification or continuous similarity score)
Objective: Retrieve Top-K most relevant keywords for each product for ad bidding
Constraints: Requires low latency (suitable for production), high accuracy (aligned with multiple stakeholders)

Model Architecture

1. Dataset Construction (Three Label Sources)

CTR Labels (10,702,747 samples):

  • Calculate click-to-impression ratio over past 30 days
  • CTR > 0.05 marked as positive samples
  • Positive samples reliable, negative samples unreliable (used only for MNR loss)
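
The CTR labeling rule above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `ClickLog` record, its field names, and the sample rows are hypothetical. Only the 0.05 threshold and the positives-only policy come from the summary.

```python
from dataclasses import dataclass

CTR_THRESHOLD = 0.05  # threshold stated in the summary


@dataclass
class ClickLog:
    """Hypothetical 30-day aggregate for one (item, keyphrase) pair."""
    item_id: str
    keyphrase: str
    impressions: int
    clicks: int


def ctr_positives(logs, threshold=CTR_THRESHOLD):
    """Return (item_id, keyphrase) pairs whose CTR exceeds the threshold.

    Pairs at or below the threshold are not emitted as negatives, because a
    missing click does not imply irrelevance.
    """
    positives = []
    for log in logs:
        if log.impressions <= 0:
            continue  # never shown, so CTR is undefined
        if log.clicks / log.impressions > threshold:
            positives.append((log.item_id, log.keyphrase))
    return positives


logs = [
    ClickLog("item-1", "surface charger", impressions=200, clicks=20),  # CTR 0.10
    ClickLog("item-1", "laptop bag", impressions=500, clicks=5),        # CTR 0.01
    ClickLog("item-2", "yellow iphone", impressions=0, clicks=0),       # never shown
]
print(ctr_positives(logs))
```

The low-CTR and never-shown pairs are simply dropped rather than labeled negative, which is why this source is paired with a loss that needs only positives.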

Search Relevance (SR) Labels (18,721,682 samples):

  • Collected from auction process over 3 months using SR model scores
  • Exceeding business threshold marked as positive samples
  • Free of middleman bias and sample selection bias

LLM Labels (50,078,315 training set, 3,524,414 test set):

  • Generated using Mixtral 8X7B Instruct-v0.1
  • 90% consistency with click data
  • Prompt design:
Given an item with title: "{title}", 
determine whether the keyphrase: "{keyphrase}", 
is relevant for cpc targeting or not by giving 
ONLY yes or no answer
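
As a rough sketch of how the prompt above could be filled in and its answer turned into a binary label: the template text is taken from the summary, while `build_prompt`, `parse_label`, and the lenient parsing rule (first word, punctuation stripped) are illustrative assumptions. The call to the LLM itself is deliberately left out.

```python
PROMPT_TEMPLATE = (
    'Given an item with title: "{title}", '
    'determine whether the keyphrase: "{keyphrase}", '
    "is relevant for cpc targeting or not by giving "
    "ONLY yes or no answer"
)


def build_prompt(title, keyphrase):
    """Fill the template with one item title and one candidate keyphrase."""
    return PROMPT_TEMPLATE.format(title=title, keyphrase=keyphrase)


def parse_label(response):
    """Map a raw LLM reply to 1 (relevant), 0 (irrelevant), or None (unparseable)."""
    words = response.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,!:;\"'")
    if first == "yes":
        return 1
    if first == "no":
        return 0
    return None


prompt = build_prompt(
    "Genuine 15V 4A Power AC Adapter Laptop Charger For Surface Pro 3 4 5 6",
    "microsoft surface charger",
)
print(prompt)
print(parse_label("Yes."), parse_label("no"), parse_label("It depends"))
```

Returning `None` for unparseable replies lets a labeling job discard or retry them instead of silently treating them as negatives.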

2. Cross-Encoder (Assistant)

Base Model: microBERT (distilled version of eBERT)

  • 4.3x smaller than eBERT, 5.5x faster
  • Pre-trained on eBay product data

Input Format:

query [SEP] category name [SEP] item title

Training:

  • Fine-tuned on 50M LLM labels using cross-entropy loss
  • Test set F1=96% (7.5M samples)

Role: Serves as intermediate assistant model, providing soft labels for distillation

3. Dual-Encoder (Student)

Base Model: microBERT dual-tower architecture

Input Processing:

  • Item tower: item title [SEP] category name
  • Keyphrase tower: buyer query
  • Compute cosine similarity after independent encoding
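
The two-tower scoring described above can be illustrated with a toy stand-in for the encoders. The `toy_encode` function (bag-of-words counts) is purely hypothetical and only replaces microBERT so that the example runs without a model. The input layout and the cosine scoring follow the summary.

```python
import math
from collections import Counter

SEP = " [SEP] "


def item_tower_text(title, category):
    """Item tower input: item title [SEP] category name."""
    return f"{title}{SEP}{category}"


def toy_encode(text):
    """Hypothetical encoder: lowercase token counts, ignoring the separator."""
    return Counter(tok for tok in text.lower().split() if tok != "[sep]")


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Each side is encoded independently, so item vectors can be precomputed
# and indexed for ANN search before any query arrives.
item_vec = toy_encode(item_tower_text("Laptop Charger For Surface Pro", "Laptop Power Adapters"))
relevant = cosine(item_vec, toy_encode("surface charger"))
irrelevant = cosine(item_vec, toy_encode("yellow iphone"))
print(relevant, irrelevant)
```

Because neither tower sees the other's input, the score is cheap to compute at scale, which is the latency advantage over the cross-encoder's joint encoding.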

Output Dimension Optimization:

  • Train with Matryoshka Loss so that embeddings can be truncated to 64 dimensions (reducing ANN latency)
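
At inference time, a Matryoshka-trained embedding is shortened by keeping its leading dimensions. The sketch below assumes a 256-dimensional full embedding and re-normalization after truncation. Both are illustrative assumptions, since the summary states only the 64-dimension target.

```python
import math


def truncate_and_normalize(embedding, dim=64):
    """Keep the first `dim` components and rescale to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head
    return [x / norm for x in head]


# Hypothetical 256-dimensional embedding used only to exercise the function.
full = [math.cos(i) for i in range(256)]
short = truncate_and_normalize(full)
print(len(full), len(short))
```

Re-normalizing keeps dot products equal to cosine similarities after truncation, which is convenient for inner-product ANN indexes.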

4. Multi-Task Training Paradigm

Core Idea: Each batch contains samples from only one dataset, sampled proportionally by dataset size
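
One way to realize this batching rule is to draw the source dataset for every batch with probability proportional to its size. The sketch below uses the dataset sizes reported in this summary. The function name, the seed, and the use of `random.choices` are illustrative choices, not details from the paper.

```python
import random
from collections import Counter

# Sizes as reported in the dataset construction section.
DATASET_SIZES = {"ctr": 10_702_747, "sr": 18_721_682, "llm": 50_078_315}


def batch_source_schedule(sizes, num_batches, seed=0):
    """Pick one source dataset per batch, weighted by dataset size."""
    rng = random.Random(seed)
    names = list(sizes)
    weights = [sizes[name] for name in names]
    return rng.choices(names, weights=weights, k=num_batches)


schedule = batch_source_schedule(DATASET_SIZES, num_batches=10_000)
counts = Counter(schedule)
print(counts)
```

Since every batch is homogeneous, the training loop can select the loss function from the batch's source name alone.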

Loss Function Combination:

| Data Source | Loss Function | Rationale |
|---|---|---|
| CTR Labels | MNR Loss | Only reliable positive samples; negatives generated via IRNS |
| SR Labels | Contrastive Loss | Clear positive and negative samples |
| LLM Labels | Contrastive Loss | Clear positive and negative samples |
| Cross-Encoder Distillation | Pearson Correlation Loss | Align ranking order |
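
To make the CTR row concrete, here is a minimal sketch of a ranking loss with in-batch negatives. It assumes "MNR" refers to the Multiple Negatives Ranking loss (the summary does not expand the acronym) and uses an illustrative scale of 20. The similarity matrices are made-up numbers.

```python
import math


def mnr_loss(sim, scale=20.0):
    """Cross-entropy over each row of a similarity matrix.

    sim[i][j] is the cosine similarity between item i and keyphrase j.
    The diagonal holds the positive pairs; every other keyphrase in the
    batch serves as a negative for item i.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)


well_aligned = [[0.9, 0.1], [0.2, 0.8]]  # positives score highest
misaligned = [[0.1, 0.9], [0.8, 0.2]]    # in-batch negatives score highest
print(mnr_loss(well_aligned), mnr_loss(misaligned))
```

The loss needs no labeled negatives at all, which fits a data source whose negatives are considered unreliable.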

Technical Innovations

1. Necessity of Teacher-Assistant Architecture

  • Direct distillation from LLM to dual-encoder performs poorly (F1=0.66 vs 0.88)
  • Cross-encoder as intermediate bridge:
    • Stronger learning capacity than dual-encoder (can jointly encode)
    • More efficient than LLM (can generate large-scale soft labels)
    • Enables progressive knowledge transfer

2. Rationality of Multi-Signal Fusion

LLM+CTR+KD model achieves optimal performance:
  • Median keyphrase count: 12
  • LLM pass rate: 71%
  • Search pass rate: >99%

Design Principles:

  • CTR provides real interaction signals (reliable positive samples)
  • LLM provides unbiased judgment (covers unexposed samples)
  • SR ensures search system acceptance
  • Cross-encoder provides fine-grained ranking signals

3. Superiority of Pearson Loss

Experimental comparison (Table 1):

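| KD Loss | F1 | Precision | Recall | ρ (Pearson Correlation) |
|---|---|---|---|---|
| MSE | 0.81 | 0.77 | 0.86 | 0.78 |
| CoSENT | 0.87 | 0.86 | 0.88 | 0.82 |
| Pearson | 0.88 | 0.87 | 0.88 | 0.87 |
| MSEmar | 0.86 | 0.84 | 0.88 | 0.80 |
| KL-Div | 0.85 | 0.83 | 0.88 | 0.66 |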

Reason Analysis:

  • MSE is pointwise loss, cannot capture ranking relationships
  • CoSENT is pairwise ranking loss, has calibration ability
  • Pearson is batch ranking loss, optimizes overall linear correlation
  • Achieves highest Pearson correlation with cross-encoder (0.87)
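
The contrast between a pointwise loss and a correlation loss can be seen on a tiny batch. The sketch below assumes the Pearson loss takes the common form 1 − r between student and teacher scores within a batch. The exact formulation used in the paper is not given in this summary, and the scores are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def pearson_loss(student, teacher):
    """Assumed form: 1 - r, so perfect linear agreement gives zero loss."""
    return 1.0 - pearson(student, teacher)


def mse_loss(student, teacher):
    """Pointwise mean squared error between the two score lists."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


teacher = [0.9, 0.7, 0.2, 0.1]               # cross-encoder scores
rescaled = [0.5 * t + 0.1 for t in teacher]  # same ordering, different scale
shuffled = [0.2, 0.9, 0.1, 0.7]              # ordering disagrees with teacher

print(pearson_loss(rescaled, teacher), mse_loss(rescaled, teacher))
print(pearson_loss(shuffled, teacher), mse_loss(shuffled, teacher))
```

A student whose cosine scores are a linear rescaling of the teacher's incurs essentially no Pearson loss but a nonzero MSE, which illustrates why a pointwise loss can penalize a student that already ranks correctly.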

Experimental Setup

Datasets

  • Platform Scale: 2.3 billion products
  • Training Set:
    • CTR: 10.7M
    • SR: 18.7M
    • LLM: 50M (training) + 3.5M (test)
  • Evaluation Set: 10,000 samples (per model)
  • A/B Testing: US market for 12 days

Evaluation Metrics

Offline Metrics:

  • F1, Precision, Recall: Classification performance
  • ρ (Pearson Correlation): Alignment with cross-encoder
  • KP (Keyphrase Count): Median keyphrase count after relevance filtering
  • PR (Pass Rate): LLM/SR pass rate at different ranking positions
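
The offline metrics can be computed along these lines. The definition of pass rate at k used here (share of the top-k ranked keyphrases that a judge accepts) is an assumption based on the description above, and the labels and judgments are invented for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 from 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def pass_rate_at_k(ranked_passes, k):
    """Share of the top-k ranked keyphrases that the judge accepted."""
    top = ranked_passes[:k]
    return sum(top) / len(top) if top else 0.0


precision, recall, f1 = precision_recall_f1([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])
ranked = [True, True, False, True, False]  # judge verdicts in rank order
print(precision, recall, f1)
print(pass_rate_at_k(ranked, 2), pass_rate_at_k(ranked, 5))
```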

Online Metrics:

  • GMB (Gross Merchandise Bought): Sales volume
  • ROAS (Return on Ad Spend): Advertising ROI
  • Adoption Rate: Rate at which sellers actually adopt the recommended keywords

Baseline Methods

  1. CTR-only: Baseline trained only on CTR
  2. LLM: Only LLM labels + Contrastive Loss
  3. LLM+KD: LLM labels + cross-encoder distillation
  4. LLM+SR+KD: LLM + SR labels + distillation
  5. LLM+CTR+KD: Optimal combination
  6. LLM+SR+CTR+KD: All signals combination

Implementation Details

  • Base Model: microBERT (selection rationale in Table 3)
  • Training Framework: PyTorch + Transformers
  • Batch Sampling: Proportional to dataset size
  • Production Deployment:
    • Batch inference: PySpark (1500 executors)
    • NRT inference: Triton + ONNX (V100 GPU)
    • Daily increment latency: 35 minutes (20M products)
    • ANN retrieval: Additional 2.5 hours

Experimental Results

Main Results

Table 2: Label Ablation Experiment

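| Model | KP | PR | Pass@5 | Pass@10 | Pass@15 | Pass@20 |
|---|---|---|---|---|---|---|
| LLM+CTR+KD | 12.0 | 71 | 68 | 60 | 55 | 52 |
| LLM+SR+CTR+KD | 11.0 | 70 | 67 | 59 | 54 | 51 |
| LLM+SR+KD | 12.0 | 51 | 47 | 42 | 41 | 39 |
| LLM+KD | 11.0 | 49 | 36 | 35 | 33 | 32 |
| LLM | 11.0 | 61 | 45 | 41 | 38 | 35 |
| CTR | 7 | 60 | 51 | 42 | 37 | 34 |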

Key Findings:

  1. LLM+CTR+KD Optimal: Achieves best balance between efficiency (KP=12) and quality (PR=71%)
  2. CTR-only Inefficient: Only 7 keywords, limiting coverage
  3. Distillation Alone Is Not Sufficient: LLM → LLM+KD lowers PR (61% → 49%) and Pass@5 (45% → 36%) in Table 2; the gains appear once CTR is added (LLM+CTR+KD: PR=71%, Pass@5=68%)
  4. SR Signal's Role: Improves search pass rate to >99%

Ablation Experiments

1. Knowledge Distillation Loss Comparison (Table 1)

  • Pearson Loss Optimal: F1=0.88, ρ=0.87
  • CoSENT Second Best: F1=0.87, ρ=0.82
  • MSE Fails: Validates CUPID paper findings
  • Direct Distillation (LLM→BE) Poor: Contrastive F1=0.83, Softmax F1=0.66

2. Base Model Selection (Table 3)

| Base Model | Recall | Precision | F1 |
|---|---|---|---|
| eBERT | 0.92 | 0.81 | 0.86 |
| microBERT | 0.92 | 0.78 | 0.85 |
| ModernBERT | 0.91 | 0.76 | 0.83 |

Rationale for microBERT Selection:

  • Performance close to eBERT (F1 difference only 0.01)
  • 30% faster inference speed
  • Pre-trained on platform data (ModernBERT not pre-trained)

3. Progressive Multi-Task Framework Construction

CTR (F1=0.66) 
→ CTR+LLM (F1=0.83) 
→ LLM+CTR+KD (F1=0.88)

Each component brings gains

A/B Testing Results (Online Validation)

Test Setup: US market, 12 days, replacing CTR-only EBR model

Business Metric Improvements:

  • GMB +51.26% (p=0.01) - Significant sales growth
  • ROAS +38.69% (p=0.02) - Significant ROI improvement
  • Adoption Rate +11.75% (p=0.03) - Sellers more willing to use recommendations

Significance: Demonstrates offline metric improvements translate to real business value

Case Analysis

Positive Case (LLM and model agreement):

  • Product: "Genuine 15V 4A Power AC Adapter Laptop Charger For Surface Pro 3 4 5 6"
  • Keyphrase: "microsoft surface charger"
  • Judgment: Relevant ✓

Negative Case (Fine-tuned LLM failure):

  • Product: "iPhone 11 64GB 128G Unlocked..."
  • Keyphrase: "yellow iphone" (image shows yellow color)
  • General LLM: Irrelevant (text-only basis)
  • Fine-tuned LLM: Relevant (affected by modal bias)

Experimental Findings

  1. General LLM Superior to Fine-tuned LLM:
    • General LLM: Filters out 68% of keywords, sales +10%
    • Fine-tuned LLM: Retains 75% of keywords, sales -20%
    • Reason: Human annotations have modal bias
  2. Teacher-Assistant Necessity:
    • Cross-encoder has better calibration
    • Can handle large-scale data for soft label generation
  3. Multi-Signal Complementarity:
    • CTR: Reliable positive samples
    • LLM: Long-tail coverage
    • SR: Search system alignment
    • All three indispensable

Related Work

1. Embedding-Based Retrieval (EBR)

  • Dual-Encoder vs Cross-Encoder:
    • Dual-encoder: Independent encoding, supports ANN, low latency
    • Cross-encoder: Joint encoding, better performance, high latency
  • Paper Contribution: Combines both advantages through distillation

2. Click Bias Problem

  • MNAR Bias: Chen et al. (2023)
  • Middleman Bias: Dey et al. (2025b) - Prior work by paper authors
  • Paper Solution: Supplements click data with LLM and SR signals

3. Knowledge Distillation Methods

  • TwinBERT (Lu et al., 2020): Cross-encoder to dual-tower BERT
  • ERNIE-search (Lu et al., 2022): Teacher-Assistant architecture
  • PROD (Lin et al., 2023): Progressive distillation
  • D2LLM (Liao et al., 2024): Pearson loss for LLM distillation
  • Paper Contribution: Combines multi-task learning with Teacher-Assistant architecture

4. LLM as Judge

  • GPT-4 Evaluation: Zheng et al. (2023) - MT-Bench
  • Search Application: Wang et al. (2024) - Pinterest
  • Paper Contribution:
    • Large-scale application (50M labels)
    • Systematic evaluation of general LLM vs fine-tuned LLM
    • Discovery of modal bias problem

Conclusions and Discussion

Main Conclusions

  1. LLM Signals Effectively Mitigate Click Bias: In ad keyphrase recommendation scenarios, LLM-generated labels significantly outperform CTR-only approaches
  2. Teacher-Assistant Architecture Superior to Direct Distillation: Cross-encoder as intermediate bridge is critical
  3. Pearson Loss Most Suitable for Ranking Distillation: Batch ranking loss outperforms pointwise and pairwise losses
  4. Multi-Signal Fusion Produces Synergistic Effects: CTR+LLM+KD combination achieves best business results
  5. General LLM Superior to Fine-tuned LLM: When human-annotated data has modal bias

Limitations

  1. Domain Specificity:
    • Research limited to e-commerce advertising scenarios
    • Method transferability requires validation
  2. Human Annotation Quality Issues:
    • Annotators can see images but models cannot (modal bias)
    • Label granularity too fine (excellent/good/fair/bad)
    • Sample size insufficient to cover 2.3 billion products
  3. Simple Negative Sample Mining Strategy:
    • CTR data only uses IRNS (In-batch Random Negative Sampling)
    • Advanced methods like ANCE, N-Game not explored
    • Left for future research
  4. Limited LLM Choices:
    • Uses Mixtral 8X7B (open-source, medium-scale)
    • Larger models (GPT-4) limited by API constraints
    • No LLM fine-tuning (due to human data quality issues)
  5. Evaluation Limitations:
    • Offline evaluation only on LLM label test set
    • A/B testing only in US market
    • Long-term effects not evaluated

Future Directions

  1. Better Human Judgment Data Collection:
    • Unified input modality (text-only or multimodal)
    • Simplified labels (binary classification)
    • Expanded sample scale
  2. Advanced Negative Sample Mining:
    • Explore ANCE, N-Game methods
    • Balance computational cost and performance
  3. Multimodal Extension:
    • Incorporate image information into models
    • Address modal bias problem
  4. Fine-tuned LLM Exploration:
    • Fine-tune on high-quality data
    • Potentially further improve performance
  5. Cross-Domain Transfer:
    • Validate method on other e-commerce platforms
    • Extend to non-advertising scenarios

In-Depth Evaluation

Strengths

1. Method Innovation ⭐⭐⭐⭐⭐

  • Three-Level Teacher-Assistant-Student Architecture: Innovatively combines LLM, cross-encoder, and dual-encoder
  • Multi-Task Hybrid Training: Skillfully fuses three heterogeneous signal sources
  • Systematic Loss Function Study: Compares 8 KD losses, provides clear guidance

2. Experimental Sufficiency ⭐⭐⭐⭐⭐

  • Large-Scale Real Data: 50M LLM labels, 2.3 billion products
  • Comprehensive Ablation Studies: Labels, losses, base models, architectures
  • Online Validation: A/B testing proves business value
  • Detailed Appendix: LLM evaluation, loss function derivations, system architecture

3. Practical Value ⭐⭐⭐⭐⭐

  • Significant Business Improvements: GMB +51%, ROAS +39%
  • Production Deployment Details: Complete system architecture and latency analysis
  • Strong Reproducibility: Open-source models (Mixtral), clear method description

4. Insight Depth ⭐⭐⭐⭐

  • Modal Bias Discovery: Reveals hidden problems in human annotations
  • General LLM Advantages: Challenges conventional "fine-tuning is always better" wisdom
  • Middleman Bias: Proposes new bias type and provides solution

5. Writing Quality ⭐⭐⭐⭐

  • Clear structure, rigorous logic
  • Rich figures (auction mechanism diagram, architecture diagram, production system diagram)
  • Complete mathematical formulas (detailed derivations in Appendix 8.3)

Weaknesses

1. Method Limitations

  • Computational Cost Not Quantified: GPU time/cost for generating 50M LLM labels not reported
  • Hyperparameter Sensitivity: No analysis of learning rate, batch size, temperature parameter impacts
  • Limited LLM Choice: Mixtral 8X7B not optimal, but constrained by open-source and cost

2. Experimental Setup Flaws

  • Single Test Set Evaluation: Offline experiments only on LLM label test set, not validated on SR/CTR test sets
  • Short A/B Test Duration: 12 days may be insufficient to observe long-term effects (e.g., seller fatigue)
  • Geographic Limitation: Only US market, effects in other countries unknown

3. Insufficient Analysis

  • Few Failure Case Analyses: Only 1 modal bias example provided
  • Ranking Quality Not Evaluated: No NDCG, MRR and other ranking metrics
  • Diversity Not Quantified: While uniqueness and diversity mentioned, no specific metrics

4. Reproducibility Issues

  • Proprietary Models: eBay-specific eBERT/microBERT are not publicly accessible
  • Data Not Public: Commercial data cannot be shared
  • Complete Code Not Open-Sourced: Only method description provided

5. Missing Theoretical Analysis

  • Why Pearson Optimal: Lacks theoretical explanation, only experimental validation
  • Teacher-Assistant Gain Sources: Contribution of each level not quantified
  • Multi-Task Learning Theory: No analysis of task interference/synergy

Impact Assessment

Contribution to Field ⭐⭐⭐⭐⭐

  1. E-Commerce Ad System Bias: Systematically clarifies middleman bias, provides solution paradigm
  2. Knowledge Distillation: Validates Teacher-Assistant architecture effectiveness in retrieval tasks
  3. LLM Application: Successful large-scale LLM label generation case study (50M)
  4. Industrial Practice: Complete production system design reference

Academic Impact

  • High Citation Potential: Solves practical problems, method transferable
  • Future Research Directions: Multimodal LLMs, better human annotation protocols
  • Benchmark Role: Pearson loss may become distillation standard

Industrial Impact

  • Direct Business Value: GMB +51% significant for eBay
  • Strong Replicability: Other e-commerce platforms (e.g., Amazon, Alibaba) can adopt the approach
  • Significant Cost-Benefit: LLM labels replace large-scale manual annotation

Applicable Scenarios

Highly Applicable ✅

  1. E-Commerce Ad Recommendations: Keywords, product recommendations
  2. Search Relevance: Query-document matching
  3. Information Retrieval: Any scenario requiring multi-stakeholder judgment alignment
  4. Bias Mitigation: Recommendation systems with click/exposure bias

Moderately Applicable ⚠️

  1. Other Recommendation Scenarios: Requires signal source adjustment (e.g., video recommendations)
  2. Cross-Lingual Retrieval: Requires multilingual LLM and pre-trained models
  3. Real-Time Systems: Requires NRT inference latency optimization

Not Applicable ❌

  1. Small-Scale Data: Method requires large data volumes (million-level)
  2. Unbiased Scenarios: If click data is already reliable, the method offers limited gains
  3. Pure Exploration Tasks: Requires diversity rather than relevance

Reproduction Recommendations

To reproduce this work:

  1. Replace LLM: Use Llama 3.1 70B or Qwen 2.5 72B
  2. Replace Base Model: Use public sentence-transformers models
  3. Simplified Version: First validate LLM+CTR+Pearson Loss (no SR data needed)
  4. Evaluation Protocol: Reference Appendix 8.2 offline evaluation procedure
  5. Start Scale: Begin with million-level data, gradually scale up

Selected References

  1. D2LLM (Liao et al., 2024): First proposes Pearson loss for LLM→dual-encoder distillation
  2. CUPID (Bhattacharya et al., 2023): Proves MSE loss unsuitable for cross→dual-encoder distillation
  3. ERNIE-search (Lu et al., 2022): Early exploration of Teacher-Assistant architecture
  4. Middleman Bias (Dey et al., 2025b): Paper authors' proposed middleman bias theory

Bias and Recommendation

  1. Chen et al. (2023): Recommendation system bias survey
  2. Joachims et al. (2017): Unbiased learning from biased feedback

LLM Evaluation

  1. Zheng et al. (2023): MT-Bench and LLM-as-a-judge
  2. Gu et al. (2025): LLM as judge survey

Overall Rating: ⭐⭐⭐⭐⭐ (5/5)

This is an excellent industrial application paper that validates the effectiveness of LLM-assisted training in real large-scale scenarios, providing a complete solution from theory to practice. Despite some limitations (such as insufficient theoretical analysis, single-market testing), its practical value, method innovation, and experimental sufficiency all reach top-tier levels. Particularly noteworthy is the authors' in-depth analysis of general LLM vs fine-tuned LLM, revealing the modal bias problem in human annotations and providing important warnings for the field.