Introducing Semantic Capability in LinkedIn's Content Search Engine

Yang, Zheng, Mohan et al.
In the past, most search queries issued to a search engine were short and simple. A keyword based search engine was able to answer such queries quite well. However, members are now developing the habit of issuing long and complex natural language queries. Answering such queries requires evolution of a search engine to have semantic capability. In this paper we present the design of LinkedIn's new content search engine with semantic capability, and its impact on metrics.

Basic Information

  • Paper ID: 2412.20366
  • Title: Introducing Semantic Capability in LinkedIn's Content Search Engine
  • Authors: Xin Yang, Chujie Zheng, Madhumitha Mohan, Sonali Bhadra, Pansul Bhatt, Lingyu (Claire) Zhang, Rupesh Gupta
  • Institution: LinkedIn Corporation, Mountain View, CA, USA
  • Classification: cs.IR (Information Retrieval)
  • Publication Date: December 2024
  • Paper Link: https://arxiv.org/abs/2412.20366

Abstract

As user search behavior evolves, traditional keyword-based search engines can no longer satisfy the increasingly complex demands of natural language queries. This paper introduces LinkedIn's newly designed content search engine with semantic understanding capabilities and demonstrates its significant improvements in core metrics.

Research Background and Motivation

Problem Definition

  1. Increasing Complexity of Search Queries: Users have transitioned from short keyword queries to complex natural language queries, such as "how to ask for a raise?" and "dropout in AI"
  2. Limitations of Traditional Search: Keyword matching-based search engines face two primary challenges when handling complex queries:
    • Returning empty results when not all query keywords exist in any post
    • Failing to correctly answer questions even when posts containing all keywords exist, due to lack of conceptual understanding

Research Motivation

LinkedIn's analysis revealed that posts capable of correctly answering queries actually exist in the search index, but may not contain all keywords from the query. This motivated the team to develop a content search engine with semantic matching capabilities to better understand query intent and return relevant content.

Core Contributions

  1. Designed a Two-layer Architecture for Semantic Search Engine: Incorporating a retrieval layer and multi-stage ranking layer, effectively combining keyword matching and semantic understanding
  2. Implemented a Hybrid Retrieval Strategy: Simultaneously utilizing term-based retriever (TBR) and embedding-based retriever (EBR)
  3. Established a Multi-objective Optimization Framework: Simultaneously optimizing the on-topic rate and long-dwell metrics
  4. Achieved Significant Performance Improvements: Both on-topic rate and long-dwell metrics improved by over 10%

Methodology Details

Task Definition

Return high-quality, engaging post content for each search query, evaluated through two quantitative metrics:

  • On-topic Rate: Assessing the quality and relevance of returned posts using GPT
  • Long-dwells: Counting the returned posts on which the user dwells long enough to be labeled as a long dwell

Model Architecture

1. Retrieval Layer

The retrieval layer contains two parallel retrievers:

Term-Based Retriever (TBR):

  • Maintains inverted indices mapping keywords to posts containing those terms
  • Finds posts containing all query keywords through intersection operations
  • Suitable for navigational queries, such as finding specific posts
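
The term-based retrieval steps above can be sketched in a few lines. This is a minimal illustration, not LinkedIn's implementation: the whitespace tokenizer, the function names, and the toy posts are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(posts):
    """Map each lowercase term to the set of post ids that contain it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for term in text.lower().split():  # naive whitespace tokenizer (assumption)
            index[term].add(post_id)
    return index

def term_based_retrieve(index, query):
    """Return the ids of posts that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Intersect posting sets, smallest first, so the working set stays small.
    postings = sorted((index.get(t, set()) for t in terms), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result

# Toy corpus (hypothetical posts).
posts = {
    1: "how to negotiate a salary raise",
    2: "dropout regularization in neural networks",
    3: "ask your manager for a raise",
}
index = build_inverted_index(posts)
```

With this toy corpus, `term_based_retrieve(index, "a raise")` matches posts 1 and 3, while the full query `"how to ask for a raise"` matches nothing, because no single post contains every term. That is the empty-result failure mode described under Problem Definition.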

Embedding-Based Retriever (EBR):

  • Adopts a two-tower model architecture
  • Query embedding tower: Processes query text and user features to generate query embeddings
  • Post embedding tower: Processes post text and author features to generate post embeddings
  • Utilizes the multilingual-e5 model for text embedding
  • Computes matching scores between queries and posts using cosine similarity
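
The scoring step of the embedding-based retriever can be sketched as follows. The sketch assumes the two towers have already produced the embeddings; the three-dimensional vectors, post ids, and function names are hypothetical stand-ins, and a production system would use an approximate nearest neighbor index rather than scoring every post.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_u * norm_v)

def embedding_based_retrieve(query_embedding, post_embeddings, k):
    """Return the ids of the k posts most similar to the query embedding."""
    scored = [
        (cosine_similarity(query_embedding, emb), post_id)
        for post_id, emb in post_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [post_id for _, post_id in scored[:k]]

# Hand-made stand-ins for the output of the post embedding tower.
post_embeddings = {
    "salary_post": [0.9, 0.1, 0.0],
    "ml_post": [0.0, 0.2, 0.9],
    "hiring_post": [0.6, 0.6, 0.1],
}
# Hand-made stand-in for the output of the query embedding tower.
query_embedding = [1.0, 0.2, 0.0]
top = embedding_based_retrieve(query_embedding, post_embeddings, k=2)
```

Because post embeddings do not depend on the query, they can be computed ahead of time and only the query tower needs to run at request time.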

Key Advantages of EBR:

  • Semantic Matching: Based on concepts rather than exact keyword matching
  • Personalization: Can return personalized results based on searcher characteristics
  • Objective Optimization: Supports optimization of arbitrary objective functions
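
Since the two retrievers run in parallel, their candidate sets have to be combined before ranking. The summary does not say how the merge is done, so the sketch below simply assumes an order-preserving, de-duplicated union; `merge_candidates` and its `limit` parameter are hypothetical.

```python
def merge_candidates(tbr_ids, ebr_ids, limit=None):
    """Union two candidate lists, keeping first-seen order and dropping duplicates."""
    seen = set()
    merged = []
    for post_id in list(tbr_ids) + list(ebr_ids):
        if post_id not in seen:
            seen.add(post_id)
            merged.append(post_id)
    # Optionally cap the number of candidates handed to the ranking layer.
    return merged if limit is None else merged[:limit]
```

A post found by both retrievers appears once in the merged list, so the ranking layer scores it only once.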

2. Multi-stage Ranking Layer

The ranking layer employs a two-stage design to balance effectiveness and efficiency:

L1 Ranking Stage:

  • Uses simple models to initially rank thousands of candidate posts
  • Selects the top hundreds of candidates for the next stage

L2 Ranking Stage:

  • Uses complex models for fine-grained ranking of candidates
  • Generates final search results
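
The two-stage cascade can be illustrated with a small sketch. The scoring functions here are placeholders for the simple L1 model and the complex L2 model; the field names, cut-off sizes, and scores are invented for the example.

```python
def rank_in_two_stages(candidates, l1_score, l2_score, l1_keep, final_k):
    """Rank with a cheap scorer first, then re-rank the survivors with a costly one."""
    # L1: score every candidate with the cheap model and keep the best l1_keep.
    l1_survivors = sorted(candidates, key=l1_score, reverse=True)[:l1_keep]
    # L2: spend the expensive model only on the survivors.
    l2_ranked = sorted(l1_survivors, key=l2_score, reverse=True)
    return l2_ranked[:final_k]

# Toy candidates with precomputed scores standing in for model outputs.
candidates = [
    {"id": "a", "cheap": 0.9, "expensive": 0.2},
    {"id": "b", "cheap": 0.8, "expensive": 0.7},
    {"id": "c", "cheap": 0.1, "expensive": 0.95},
    {"id": "d", "cheap": 0.7, "expensive": 0.5},
]
results = rank_in_two_stages(
    candidates,
    l1_score=lambda post: post["cheap"],
    l2_score=lambda post: post["expensive"],
    l1_keep=3,
    final_k=2,
)
```

In this toy run, post `c` is dropped at L1 even though the expensive scorer would have ranked it first. That is the cost of the cascade: the L2 model only sees what L1 lets through, in exchange for running the expensive model on far fewer posts.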

The ranking model architecture contains two prediction models:

  • On-topic Relevance Prediction Model: Takes query and post text as input, outputs relevance scores
  • Long-dwell Prediction Model: Takes a richer feature set as input, including:
    • Query and post text
    • Paired features such as BM25 matching scores
    • Query features (e.g., whether containing job titles)
    • Post features (e.g., post popularity)
    • User features (e.g., job-seeking intent)
    • Author features (e.g., author influence)
    • User-author relationship features (e.g., connection status)

Final score calculation formula:

score = α × on-topicness_score + (1-α) × long-dwell_score

where α serves as a tuning parameter, with optimal values determined through online experiments.
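
The blending formula above translates directly into code. This is a minimal sketch; restricting α to the range [0, 1] is an assumption made here so that the result is a weighted average, and the example scores are invented.

```python
def final_score(on_topic_score, long_dwell_score, alpha):
    """Blend the two model outputs: alpha * on-topicness + (1 - alpha) * long-dwell."""
    if not 0.0 <= alpha <= 1.0:
        # Assumption: alpha is a mixing weight, so it is kept within [0, 1].
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * on_topic_score + (1.0 - alpha) * long_dwell_score
```

Setting `alpha=1.0` ranks purely by predicted on-topic relevance and `alpha=0.0` purely by predicted long dwell; values in between trade one objective against the other.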

Technical Innovations

  1. Hybrid Retrieval Strategy: Combines advantages of exact matching and semantic matching
  2. Two-tower Model Design: Supports pre-computation of post embeddings, significantly improving retrieval efficiency
  3. Multi-objective Optimization: Simultaneously considers content quality and user engagement
  4. Hierarchical Architecture: Achieves good balance between efficiency and effectiveness

Experimental Setup

Dataset

  • Uses historical data from LinkedIn's content search engine
  • Training data format: (query, post, label) triplets
  • Labels combine both on-topic rate and long-dwell metrics

Evaluation Metrics

  1. On-topic Rate:
    • Uses GPT to score the top 10 returned posts (1 indicates relevant and high-quality, 0 indicates irrelevant)
    • Calculates the proportion of posts labeled as 1
  2. Long-dwells:
    • Binary classification based on user dwell time on posts
    • Counts posts labeled as 1
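
Both metrics reduce to simple aggregations once the labels exist. The sketch below assumes the GPT judgments are already available as 0/1 labels; the dwell-time threshold is a hypothetical parameter, since the summary does not state the cut-off used.

```python
def on_topic_rate(labels):
    """Share of judged posts labeled 1 (relevant and high-quality)."""
    if not labels:
        return 0.0
    return sum(labels) / len(labels)

def long_dwell_count(dwell_seconds, threshold_seconds):
    """Number of posts whose dwell time reaches the (assumed) threshold."""
    return sum(1 for seconds in dwell_seconds if seconds >= threshold_seconds)
```

For example, if 7 of the top 10 posts for a query are labeled 1, the on-topic rate for that query is 0.7.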

Implementation Details

  • Text embedding model: multilingual-e5
  • Embedding storage: Venice key-value storage system
  • Approximate nearest neighbor search: Limits scanned posts to control latency
  • Pre-computation optimization: Offline and near-line computation of post embeddings
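
The effect of capping the number of scanned posts can be shown with a toy search over precomputed embeddings. This is not a real approximate nearest neighbor index and not the Venice API: a plain dict stands in for the key-value store, the scan order is simply insertion order, and `capped_top_k` and `max_scanned` are hypothetical names.

```python
import heapq
from itertools import islice

def capped_top_k(query_vec, stored_embeddings, k, max_scanned):
    """Score at most max_scanned stored posts and return the ids of the best k."""
    heap = []  # min-heap of (score, post_id); the weakest kept candidate is on top
    for post_id, emb in islice(stored_embeddings.items(), max_scanned):
        score = sum(a * b for a, b in zip(query_vec, emb))  # dot-product score
        if len(heap) < k:
            heapq.heappush(heap, (score, post_id))
        else:
            heapq.heappushpop(heap, (score, post_id))
    return [post_id for _, post_id in sorted(heap, reverse=True)]

# Dict standing in for a store of precomputed post embeddings.
store = {
    "p1": [1.0, 0.0],
    "p2": [0.0, 1.0],
    "p3": [0.9, 0.1],
    "p4": [2.0, 0.0],
}
query = [1.0, 0.0]
```

With `max_scanned=3` the best match `p4` is never examined and the result is `["p1", "p3"]`; with `max_scanned=4` the result is `["p4", "p1"]`. A lower cap bounds latency at the cost of recall, which is the trade-off that the scan limit controls.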

Experimental Results

Main Results

The new semantic search engine achieved significant performance improvements:

  • On-topic Rate: Improved by over 10%
  • Long-dwells: Improved by over 10%
  • Site-level Impact: Positive impact on LinkedIn's overall session count

Typical Cases

The search engine now effectively handles complex natural language queries, such as:

  • "how to ask for a raise?"
  • "dropout in AI"

These queries typically yielded unsatisfactory results in traditional keyword-based systems.

The paper focuses on the practical application of industrial-scale search systems. The relevant technologies include:

  • Text embedding techniques (multilingual-e5)
  • Two-tower model architecture
  • Multi-stage ranking systems
  • Large-scale retrieval system optimization

Conclusions and Discussion

Main Conclusions

  1. Semantic understanding capability is crucial for modern search engines
  2. Hybrid retrieval strategies effectively balance exact matching and semantic matching requirements
  3. Multi-objective optimization frameworks effectively enhance user experience

Limitations

  1. The current definition of on-topic rate is relatively simple and cannot fully capture quality expectations across different query types
  2. Reliance on GPT for quality assessment may introduce bias into the on-topic rate metric

Future Directions

The team plans to:

  1. Improve on-topic rate assessment metrics
  2. Introduce large language models (LLMs) in the ranking layer so that the model can attend jointly over query and post text
  3. Further enhance deep language understanding capabilities

In-depth Evaluation

Strengths

  1. High Practical Value: Addresses important real-world business problems
  2. Reasonable Architecture Design: Two-layer architecture effectively balances effectiveness and efficiency
  3. Mature Technical Solution: Fully considers engineering challenges of large-scale deployment
  4. Comprehensive Evaluation Framework: Establishes dual evaluation framework for quality and engagement
  5. Significant Results: Achieves metric improvements exceeding 10%

Weaknesses

  1. Limited Technical Innovation: Primarily engineering application of existing techniques
  2. Evaluation Method Limitations: GPT-based evaluation may introduce bias
  3. Insufficient Comparative Experiments: Lacks detailed comparisons with other semantic search methods
  4. Missing Theoretical Analysis: Lacks in-depth theoretical analysis and ablation studies

Impact

  1. Industrial Value: Provides practical reference for large-scale semantic search systems
  2. Technology Promotion: Demonstrates practical application effectiveness of semantic understanding in search engines
  3. Experience Sharing: Provides valuable engineering practice experience

Applicable Scenarios

This method is suitable for:

  • Large-scale content search platforms
  • Search systems requiring complex natural language query processing
  • Search applications with high personalization requirements
  • Search scenarios requiring balance among multiple optimization objectives

References

The paper cites the following key technologies and tools:

  1. Apache Samza - Stream processing framework
  2. MTEB Leaderboard - Text embedding evaluation benchmark
  3. Venice - LinkedIn's data storage platform
  4. Multilingual E5 - Multilingual text embedding model

Summary: This is a typical industrial technical report focusing on sharing LinkedIn's engineering practice experience in semantic search. While technical innovation is relatively limited, its complete system design, significant performance improvements, and in-depth consideration of engineering challenges make it an important reference for the industry.