E-commerce sellers are advised to bid on keyphrases to boost their advertising campaigns. These keyphrases must be relevant to prevent irrelevant items from cluttering search systems and to maintain positive seller perception. It is vital that keyphrase suggestions align with seller, search and buyer judgments. Given the challenges in collecting negative feedback in these systems, LLMs have been used as a scalable proxy to human judgments. This paper presents an empirical study on a major ecommerce platform of a distillation framework involving an LLM teacher, a cross-encoder assistant and a bi-encoder Embedding Based Retrieval (EBR) student model, aimed at mitigating click-induced biases in keyphrase recommendations.
Paper ID: 2508.03628
Title: LLMDistill4Ads: Using Cross-Encoders to Distill from LLM Signals for Advertiser Keyphrase Recommendations
Authors: Soumik Dey, Benjamin Braun, Naveen Ravipati, Hansi Wu, Binbin Li (eBay Inc)
Categories: cs.IR (Information Retrieval), cs.AI, cs.LG
Publication Date: arXiv v5, November 20, 2025
Paper Link: https://arxiv.org/abs/2508.03628v5

E-commerce sellers need to bid on keywords to improve ad performance, and these keywords must be relevant to prevent unrelated products from polluting the search system and to maintain seller satisfaction. Because negative feedback is difficult to collect, this paper proposes using LLMs as scalable proxies for human judgment. The research implements a knowledge distillation framework on a large-scale e-commerce platform (LLM teacher model → cross-encoder assistant → dual-encoder EBR student model), aimed at mitigating click bias in keyphrase recommendations.
In e-commerce advertising systems, the task is to recommend relevant keywords (buyer query terms) for sellers to bid on. Key challenges include:
- Unreliability of click data: High clicks or sales indicate relevance, but a lack of clicks does not indicate irrelevance
- MNAR bias (Missing Not At Random): Unpopular products rank low, receiving fewer impressions and clicks
- Middleman bias: Training data only contains keywords filtered through search relevance, leading to sample selection bias

Why relevance matters:
- Keyword relevance directly impacts seller strategy and search system quality
- Irrelevant recommendations reduce seller satisfaction, waste resources, and harm ad performance
- Recommendations need to simultaneously satisfy the judgment criteria of sellers, advertising systems, and search systems

Limitations of existing approaches:
- CTR-only training: Easily replicates the popularity and exposure bias in the training data
- Unreliable negative samples: Negative samples in click logs cannot truly reflect irrelevance
- Manual annotation difficulty: High cost, limited scale, and modal bias (annotators can see images but models cannot)

The paper leverages the LLM's world knowledge and judgment capabilities as a proxy for human judgment. Through a multi-task learning and knowledge distillation framework, it combines CTR, search relevance, and LLM signals to train efficient dual-encoder retrieval models.
- Proposes a Teacher-Assistant-Student distillation framework: Three-level architecture of LLM teacher → cross-encoder assistant → dual-encoder student
- Multi-signal fusion training strategy: Integrates CTR, search relevance (SR), and LLM labels in a multi-task learning paradigm
- Systematic loss function comparative study: Evaluates 8 knowledge distillation loss functions, finding Pearson correlation loss optimal
- Production environment evaluation protocol: Proposes an offline evaluation method simulating real ad auction scenarios
- Significant business impact: A/B testing shows a 51.26% GMB increase, a 38.69% ROAS increase, and an 11.75% keyphrase adoption rate increase

- Input: Item title + category and buyer query keyphrase
- Output: Relevance judgment (binary classification or continuous similarity score)
- Objective: Retrieve the Top-K most relevant keywords for each product for ad bidding
- Constraints: Requires low latency (suitable for production) and high accuracy (aligned with multiple stakeholders)
CTR Labels (10,702,747 samples):
- Calculated as the click-to-impression ratio over the past 30 days
- CTR > 0.05 marked as positive samples
- Positive samples are reliable; negative samples are unreliable (used only for MNR loss)

Search Relevance (SR) Labels (18,721,682 samples):
- Collected from the auction process over 3 months using SR model scores
- Scores exceeding the business threshold marked as positive samples
- No middleman bias or sample selection bias

LLM Labels (50,078,315 training set, 3,524,414 test set):
- Generated using Mixtral 8X7B Instruct-v0.1
- 90% consistency with click data
- Prompt design:

Given an item with title: "{title}",
determine whether the keyphrase: "{keyphrase}",
is relevant for cpc targeting or not by giving
ONLY yes or no answer
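
The labeling rules above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names (`ctr_label`, `build_prompt`, `parse_llm_answer`) are hypothetical, and the handling of zero impressions and of unparseable LLM answers is an assumption that the summary does not specify.

```python
CTR_THRESHOLD = 0.05  # positive-label cutoff quoted in the summary

# Prompt text as quoted above, joined into a single string.
PROMPT_TEMPLATE = (
    'Given an item with title: "{title}", '
    'determine whether the keyphrase: "{keyphrase}", '
    "is relevant for cpc targeting or not by giving "
    "ONLY yes or no answer"
)


def ctr_label(clicks: int, impressions: int) -> bool:
    """Return True when the click-to-impression ratio exceeds the threshold."""
    if impressions <= 0:
        # Assumption: an item-keyphrase pair with no impressions is not a positive.
        return False
    return clicks / impressions > CTR_THRESHOLD


def build_prompt(title: str, keyphrase: str) -> str:
    """Fill the relevance prompt for one item-keyphrase pair."""
    return PROMPT_TEMPLATE.format(title=title, keyphrase=keyphrase)


def parse_llm_answer(text: str):
    """Map a raw LLM answer to 1 (yes), 0 (no), or None when it is neither."""
    tokens = text.strip().lower().split()
    first = tokens[0].strip(".,!\"'") if tokens else ""
    if first == "yes":
        return 1
    if first == "no":
        return 0
    return None  # assumption: unparseable answers are dropped rather than guessed
```

Returning `None` for anything other than a clear yes or no keeps ambiguous generations out of the label set instead of silently counting them as negatives.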
Base Model: microBERT (distilled version of eBERT)
- 4.3x smaller than eBERT, 5.5x faster
- Pre-trained on eBay product data

Input Format:
query [SEP] category name [SEP] item title

Training:
- Fine-tuned on 50M LLM labels using cross-entropy loss
- Test set F1=96% (7.5M samples)

Role: Serves as the intermediate assistant model, providing soft labels for distillation
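
As a rough sketch of how the assistant's input and soft labels might be produced: the helper names below are hypothetical, and the two-logit (no/yes) classification head is an assumption based only on the cross-entropy training mentioned above. Tokenization and the model itself are omitted.

```python
import math


def format_cross_encoder_input(query: str, category: str, title: str) -> str:
    """Join the three fields in the order given in the summary."""
    return f"{query} [SEP] {category} [SEP] {title}"


def soft_label(logit_no: float, logit_yes: float) -> float:
    """Softmax probability of the 'relevant' class from two logits.

    Assumes a binary head; subtracting the max keeps exp() numerically stable.
    """
    m = max(logit_no, logit_yes)
    e_no = math.exp(logit_no - m)
    e_yes = math.exp(logit_yes - m)
    return e_yes / (e_no + e_yes)
```

The resulting probability is a continuous score in [0, 1], which is what makes it usable as a distillation target for the student rather than a hard yes/no label.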
Base Model: microBERT dual-tower architecture

Input Processing:
- Item tower: item title [SEP] category name
- Keyphrase tower: buyer query
- Cosine similarity is computed after independent encoding

Output Dimension Optimization:
- Use Matryoshka Loss to truncate embeddings to 64 dimensions (reducing ANN latency)

Core Idea: Each batch contains samples from only one dataset, sampled proportionally by dataset size

Loss Function Combination:

| Data Source | Loss Function | Rationale |
|---|---|---|
| CTR Labels | MNR Loss | Only reliable positive samples; negatives generated via IRNS |
| SR Labels | Contrastive Loss | Clear positive and negative samples |
| LLM Labels | Contrastive Loss | Clear positive and negative samples |
| Cross-Encoder Distillation | Pearson Correlation Loss | Align ranking order |
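
The single-dataset batching idea can be illustrated with a small sampler. This is a sketch under stated assumptions, not the authors' training loop: `sample_batches` and `LOSS_BY_SOURCE` are hypothetical names, datasets are plain Python lists, and whether the distillation targets form their own dataset is not stated in the summary.

```python
import random

# Loss used for each label source, following the loss table in this summary.
LOSS_BY_SOURCE = {
    "ctr": "mnr",
    "sr": "contrastive",
    "llm": "contrastive",
    "kd": "pearson",
}


def sample_batches(datasets, batch_size, num_batches, seed=0):
    """Yield (source_name, batch) pairs where each batch comes from one dataset.

    The source of each batch is drawn with probability proportional to the
    dataset size, so larger datasets contribute more batches.
    """
    rng = random.Random(seed)
    names = list(datasets)
    weights = [len(datasets[name]) for name in names]
    for _ in range(num_batches):
        name = rng.choices(names, weights=weights, k=1)[0]
        rows = datasets[name]
        batch = rng.sample(rows, k=min(batch_size, len(rows)))
        yield name, batch
```

A training step would then look up `LOSS_BY_SOURCE[name]` for the sampled batch, so each batch is scored by exactly one loss and in-batch negatives never mix label sources.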
Direct distillation from the LLM to the dual-encoder performs poorly (F1=0.66 vs 0.88).

Cross-encoder as intermediate bridge:
- Stronger learning capacity than the dual-encoder (can jointly encode)
- More efficient than the LLM (can generate large-scale soft labels)
- Enables progressive knowledge transfer

LLM+CTR+KD model achieves optimal performance:
- Median keyphrase count: 12
- LLM pass rate: 71%
- Search pass rate: >99%
Design Principles:
- CTR provides real interaction signals (reliable positive samples)
- LLM provides unbiased judgment (covers unexposed samples)
- SR ensures search system acceptance
- Cross-encoder provides fine-grained ranking signals

Experimental comparison (Table 1):

| KD Loss | F1 | Precision | Recall | ρ (Pearson Correlation) |
|---|---|---|---|---|
| MSE | 0.81 | 0.77 | 0.86 | 0.78 |
| CoSENT | 0.87 | 0.86 | 0.88 | 0.82 |
| Pearson | 0.88 | 0.87 | 0.88 | 0.87 |
| MSEmar | 0.86 | 0.84 | 0.88 | 0.80 |
| KL-Div | 0.85 | 0.83 | 0.88 | 0.66 |
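
To make the difference between a pointwise loss and a correlation loss concrete, here is a small pure-Python sketch. It assumes the common formulation loss = 1 − ρ over a batch of student and teacher scores; the paper's exact formulation is in its appendix and may differ. The function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # constant scores carry no ranking information
    return cov / math.sqrt(vx * vy)


def pearson_loss(student, teacher):
    """Assumed form: 1 - rho, so 0 is perfect agreement and 2 is perfect reversal."""
    return 1.0 - pearson(student, teacher)


def mse_loss(student, teacher):
    """Pointwise mean squared error, for comparison."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


teacher = [0.1, 0.4, 0.7, 0.9]
# Student scores on a different scale but in the same linear relationship.
student = [0.5 * t + 0.2 for t in teacher]
```

With these values, `pearson_loss(student, teacher)` is zero up to floating-point error while `mse_loss(student, teacher)` is positive: the correlation loss does not penalize the student for using a different score scale, which matters when cosine similarities must track cross-encoder probabilities.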
Reason Analysis:
- MSE is a pointwise loss and cannot capture ranking relationships
- CoSENT is a pairwise ranking loss with calibration ability
- Pearson is a batch ranking loss that optimizes overall linear correlation and achieves the highest Pearson correlation with the cross-encoder (0.87)

Platform Scale: 2.3 billion products

Training Set:
- CTR: 10.7M
- SR: 18.7M
- LLM: 50M (training) + 3.5M (test)

Evaluation Set: 10,000 samples (per model)

A/B Testing: US market for 12 days

Offline Metrics:
- F1, Precision, Recall: Classification performance
- ρ (Pearson Correlation): Alignment with the cross-encoder
- KP (Keyphrase Count): Median keyphrase count after relevance filtering
- PR (Pass Rate): LLM/SR pass rate at different ranking positions

Online Metrics:
- GMB (Gross Merchandise Bought): Sales volume
- ROAS (Return on Ad Spend): Advertising ROI
- Adoption Rate: Number of keywords actually used by sellers

Compared Models:
- CTR-only: Baseline trained only on CTR
- LLM: Only LLM labels + Contrastive Loss
- LLM+KD: LLM labels + cross-encoder distillation
- LLM+SR+KD: LLM + SR labels + distillation
- LLM+CTR+KD: Optimal combination
- LLM+SR+CTR+KD: All signals combined

Implementation Details:
- Base Model: microBERT (selection rationale in Table 3)
- Training Framework: PyTorch + Transformers
- Batch Sampling: Proportional to dataset size
- Production Deployment:
  - Batch inference: PySpark (1500 executors)
  - NRT inference: Triton + ONNX (V100 GPU)
  - Daily increment latency: 35 minutes (20M products)
  - ANN retrieval: Additional 2.5 hours

Table 2: Label Ablation Experiment
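| Model | KP | PR | Pass@5 | Pass@10 | Pass@15 | Pass@20 |
|---|---|---|---|---|---|---|
| LLM+CTR+KD | 12.0 | 71 | 68 | 60 | 55 | 52 |
| LLM+SR+CTR+KD | 11.0 | 70 | 67 | 59 | 54 | 51 |
| LLM+SR+KD | 12.0 | 51 | 47 | 42 | 41 | 39 |
| LLM+KD | 11.0 | 49 | 36 | 35 | 33 | 32 |
| LLM | 11.0 | 61 | 45 | 41 | 38 | 35 |
| CTR | 7 | 60 | 51 | 42 | 37 | 34 |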
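
One plausible reading of the KP and Pass@K columns can be written down as follows. This is an interpretation for illustration only: the summary does not give the exact definitions, so the pooling over items, the per-item truncation at K, and the function names are all assumptions. Each item is represented by a list of booleans in rank order, where `True` means the judge (LLM or SR) accepted that recommended keyphrase.

```python
from statistics import median


def pass_rate_at_k(ranked_flags, k):
    """Percentage of top-k recommendations, pooled over items, that pass the judge."""
    total = 0
    passed = 0
    for flags in ranked_flags:
        top = flags[:k]  # items with fewer than k keyphrases contribute what they have
        total += len(top)
        passed += sum(top)
    return 100.0 * passed / total if total else 0.0


def median_keyphrase_count(ranked_flags):
    """Median number of keyphrases per item that survive relevance filtering."""
    return median(sum(flags) for flags in ranked_flags)


# Two toy items: four ranked keyphrases for the first, two for the second.
flags = [[True, True, False, False], [True, False]]
```

On the toy data, three of the four top-2 recommendations pass, and the pass rate falls as K grows, which mirrors the downward trend from Pass@5 to Pass@20 in the table.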
Key Findings:
- LLM+CTR+KD Optimal: Achieves the best balance between efficiency (KP=12) and quality (PR=71%)
- CTR-only Inefficient: Only 7 keywords, limiting coverage
- Distillation Needs CTR Labels: On its own, LLM → LLM+KD lowers PR from 61% to 49% and Pass@5 from 45 to 36 in Table 2; the gains appear once distillation is combined with CTR labels (LLM+CTR+KD: PR 71, Pass@5 68)
- SR Signal's Role: Improves search pass rate to >99%

- Pearson Loss Optimal: F1=0.88, ρ=0.87
- CoSENT Second Best: F1=0.87, ρ=0.82
- MSE Fails: Validates the CUPID paper's findings
- Direct Distillation (LLM→BE) Poor: Contrastive F1=0.83, Softmax F1=0.66

Table 3: Base Model Comparison

| Base Model | Recall | Precision | F1 |
|---|---|---|---|
| eBERT | 0.92 | 0.81 | 0.86 |
| microBERT | 0.92 | 0.78 | 0.85 |
| ModernBERT | 0.91 | 0.76 | 0.83 |
Rationale for microBERT Selection:
- Performance close to eBERT (F1 difference only 0.01)
- 30% faster inference speed
- Pre-trained on platform data (ModernBERT not pre-trained)

CTR (F1=0.66)
→ CTR+LLM (F1=0.83)
→ LLM+CTR+KD (F1=0.88)
Each component brings gains.
Test Setup : US market, 12 days, replacing CTR-only EBR model
Business Metric Improvements :
- GMB +51.26% (p=0.01): Significant sales growth
- ROAS +38.69% (p=0.02): Significant ROI improvement
- Adoption Rate +11.75% (p=0.03): Sellers more willing to use recommendations

Significance: Demonstrates that offline metric improvements translate to real business value
Positive Case (LLM and model agreement):
- Product: "Genuine 15V 4A Power AC Adapter Laptop Charger For Surface Pro 3 4 5 6"
- Keyphrase: "microsoft surface charger"
- Judgment: Relevant ✓

Negative Case (Fine-tuned LLM failure):
- Product: "iPhone 11 64GB 128G Unlocked..."
- Keyphrase: "yellow iphone" (image shows yellow color)
- General LLM: Irrelevant (text-only basis)
- Fine-tuned LLM: Relevant (affected by modal bias)

General LLM Superior to Fine-tuned LLM:
- General LLM: Reduces keywords by 68%, sales +10%
- Fine-tuned LLM: Retains 75% of keywords, sales -20%
- Reason: Human annotations have modal bias

Teacher-Assistant Necessity:
- Cross-encoder has better calibration
- Can handle large-scale data for soft label generation

Multi-Signal Complementarity:
- CTR: Reliable positive samples
- LLM: Long-tail coverage
- SR: Search system alignment
- All three are indispensable

Dual-Encoder vs Cross-Encoder:
- Dual-encoder: Independent encoding, supports ANN, low latency
- Cross-encoder: Joint encoding, better performance, high latency
- Paper Contribution: Combines both advantages through distillation

Bias in Click Data:
- MNAR Bias: Chen et al. (2023)
- Middleman Bias: Dey et al. (2025b), prior work by the paper's authors
- Paper Solution: Supplements click data with LLM and SR signals

Knowledge Distillation:
- TwinBERT (Lu et al., 2020): Cross-encoder to dual-tower BERT
- ERNIE-search (Lu et al., 2022): Teacher-Assistant architecture
- PROD (Lin et al., 2023): Progressive distillation
- D2LLM (Liao et al., 2024): Pearson loss for LLM distillation
- Paper Contribution: Combines multi-task learning with the Teacher-Assistant architecture

LLM-as-a-Judge:
- GPT-4 Evaluation: Zheng et al. (2023), MT-Bench
- Search Application: Wang et al. (2024), Pinterest
- Paper Contribution:
  - Large-scale application (50M labels)
  - Systematic evaluation of general LLM vs fine-tuned LLM
  - Discovery of the modal bias problem

Main Conclusions:
- LLM Signals Effectively Mitigate Click Bias: In ad keyphrase recommendation scenarios, LLM-generated labels significantly outperform CTR-only approaches
- Teacher-Assistant Architecture Superior to Direct Distillation: The cross-encoder as an intermediate bridge is critical
- Pearson Loss Most Suitable for Ranking Distillation: Batch ranking loss outperforms pointwise and pairwise losses
- Multi-Signal Fusion Produces Synergistic Effects: The CTR+LLM+KD combination achieves the best business results
- General LLM Superior to Fine-tuned LLM: When human-annotated data has modal bias

Limitations:
- Domain Specificity:
  - Research limited to e-commerce advertising scenarios
  - Method transferability requires validation
- Human Annotation Quality Issues:
  - Annotators can see images but models cannot (modal bias)
  - Label granularity too fine (excellent/good/fair/bad)
  - Sample size insufficient to cover 2.3 billion products
- Simple Negative Sample Mining Strategy:
  - CTR data only uses IRNS (In-batch Random Negative Sampling)
  - Advanced methods like ANCE and N-Game not explored
  - Left for future research
- Limited LLM Choices:
  - Uses Mixtral 8X7B (open-source, medium-scale)
  - Larger models (GPT-4) limited by API constraints
  - No LLM fine-tuning (due to human data quality issues)
- Evaluation Limitations:
  - Offline evaluation only on the LLM label test set
  - A/B testing only in the US market
  - Long-term effects not evaluated

Future Directions:
- Better Human Judgment Data Collection:
  - Unified input modality (text-only or multimodal)
  - Simplified labels (binary classification)
  - Expanded sample scale
- Advanced Negative Sample Mining:
  - Explore ANCE and N-Game methods
  - Balance computational cost and performance
- Multimodal Extension:
  - Incorporate image information into models
  - Address the modal bias problem
- Fine-tuned LLM Exploration:
  - Fine-tune on high-quality data
  - Potentially further improve performance
- Cross-Domain Transfer:
  - Validate the method on other e-commerce platforms
  - Extend to non-advertising scenarios

Strengths:
- Three-Level Teacher-Assistant-Student Architecture: Innovatively combines LLM, cross-encoder, and dual-encoder
- Multi-Task Hybrid Training: Skillfully fuses three heterogeneous signal sources
- Systematic Loss Function Study: Compares 8 KD losses, provides clear guidance

- Large-Scale Real Data: 50M LLM labels, 2.3 billion products
- Comprehensive Ablation Studies: Labels, losses, base models, architectures
- Online Validation: A/B testing proves business value
- Detailed Appendix: LLM evaluation, loss function derivations, system architecture

- Significant Business Improvements: GMB +51%, ROAS +39%
- Production Deployment Details: Complete system architecture and latency analysis
- Strong Reproducibility: Open-source models (Mixtral), clear method description

- Modal Bias Discovery: Reveals hidden problems in human annotations
- General LLM Advantages: Challenges the conventional "fine-tuning is always better" wisdom
- Middleman Bias: Proposes a new bias type and provides a solution

- Clear structure, rigorous logic
- Rich figures (auction mechanism diagram, architecture diagram, production system diagram)
- Complete mathematical formulas (detailed derivations in Appendix 8.3)

Weaknesses:
- Computational Cost Not Quantified: GPU time/cost for generating 50M LLM labels not reported
- Hyperparameter Sensitivity: No analysis of learning rate, batch size, or temperature parameter impacts
- Limited LLM Choice: Mixtral 8X7B not optimal, but constrained by open-source availability and cost

- Single Test Set Evaluation: Offline experiments only on the LLM label test set, not validated on SR/CTR test sets
- Short A/B Test Duration: 12 days may be insufficient to observe long-term effects (e.g., seller fatigue)
- Geographic Limitation: Only the US market, effects in other countries unknown

- Few Failure Case Analyses: Only 1 modal bias example provided
- Ranking Quality Not Evaluated: No NDCG, MRR, or other ranking metrics
- Diversity Not Quantified: Uniqueness and diversity are mentioned, but no specific metrics are given

- Platform Anonymity: Cannot access eBay-specific eBERT/microBERT
- Data Not Public: Commercial data cannot be shared
- Complete Code Not Open-Sourced: Only a method description is provided

- Why Pearson Optimal: Lacks theoretical explanation, only experimental validation
- Teacher-Assistant Gain Sources: Contribution of each level not quantified
- Multi-Task Learning Theory: No analysis of task interference/synergy

Impact:
- E-Commerce Ad System Bias: Systematically clarifies middleman bias, provides a solution paradigm
- Knowledge Distillation: Validates Teacher-Assistant architecture effectiveness in retrieval tasks
- LLM Application: Successful large-scale LLM label generation case study (50M)
- Industrial Practice: Complete production system design reference

- High Citation Potential: Solves practical problems, method transferable
- Future Research Directions: Multimodal LLMs, better human annotation protocols
- Benchmark Role: Pearson loss may become a distillation standard

- Direct Business Value: GMB +51% is significant for eBay
- Strong Replicability: Other e-commerce platforms (Amazon, Alibaba) can learn from it
- Significant Cost-Benefit: LLM labels replace large-scale manual annotation

Applicable Scenarios:
- E-Commerce Ad Recommendations: Keywords, product recommendations
- Search Relevance: Query-document matching
- Information Retrieval: Any scenario requiring multi-stakeholder judgment alignment
- Bias Mitigation: Recommendation systems with click/exposure bias

Applicable with Adaptation:
- Other Recommendation Scenarios: Requires signal source adjustment (e.g., video recommendations)
- Cross-Lingual Retrieval: Requires a multilingual LLM and pre-trained models
- Real-Time Systems: Requires NRT inference latency optimization

Less Suitable:
- Small-Scale Data: Method requires large data volumes (million-level)
- Unbiased Scenarios: If click data is reliable, the method's gains are limited
- Pure Exploration Tasks: Require diversity rather than relevance

To reproduce this work:
- Replace LLM: Use Llama 3.1 70B or Qwen 2.5 72B
- Replace Base Model: Use public sentence-transformers models
- Simplified Version: First validate LLM+CTR+Pearson Loss (no SR data needed)
- Evaluation Protocol: Reference the Appendix 8.2 offline evaluation procedure
- Start Scale: Begin with million-level data, gradually scale up

- D2LLM (Liao et al., 2024): First proposes Pearson loss for LLM→dual-encoder distillation
- CUPID (Bhattacharya et al., 2023): Proves MSE loss unsuitable for cross→dual-encoder distillation
- ERNIE-search (Lu et al., 2022): Early exploration of the Teacher-Assistant architecture
- Middleman Bias (Dey et al., 2025b): The paper authors' proposed middleman bias theory
- Chen et al. (2023): Recommendation system bias survey
- Joachims et al. (2017): Unbiased learning from biased feedback
- Zheng et al. (2023): MT-Bench and LLM-as-a-judge
- Gu et al. (2025): LLM-as-a-judge survey

Overall Rating: ⭐⭐⭐⭐⭐ (5/5)
This is an excellent industrial application paper that validates the effectiveness of LLM-assisted training in real large-scale scenarios, providing a complete solution from theory to practice. Despite some limitations (such as insufficient theoretical analysis, single-market testing), its practical value, method innovation, and experimental sufficiency all reach top-tier levels. Particularly noteworthy is the authors' in-depth analysis of general LLM vs fine-tuned LLM, revealing the modal bias problem in human annotations and providing important warnings for the field.