Introducing Semantic Capability in LinkedIn's Content Search Engine

Yang, Zheng, Mohan et al.

In the past, most search queries issued to a search engine were short and simple. A keyword based search engine was able to answer such queries quite well. However, members are now developing the habit of issuing long and complex natural language queries. Answering such queries requires evolution of a search engine to have semantic capability. In this paper we present the design of LinkedIn's new content search engine with semantic capability, and its impact on metrics.

Basic Information

Paper ID: 2412.20366
Title: Introducing Semantic Capability in LinkedIn's Content Search Engine
Authors: Xin Yang, Chujie Zheng, Madhumitha Mohan, Sonali Bhadra, Pansul Bhatt, Lingyu (Claire) Zhang, Rupesh Gupta
Institution: LinkedIn Corporation, Mountain View, CA, USA
Classification: cs.IR (Information Retrieval)
Publication Date: December 2024
Paper Link: https://arxiv.org/abs/2412.20366

Abstract

As user search behavior evolves, traditional keyword-based search engines can no longer satisfy the increasingly complex demands of natural language queries. This paper introduces LinkedIn's newly designed content search engine with semantic understanding capabilities and demonstrates its significant improvements in core metrics.

Research Background and Motivation

Problem Definition

Increasing Complexity of Search Queries: Users have transitioned from short keyword queries to complex natural language queries, such as "how to ask for a raise?" and "dropout in AI"
Limitations of Traditional Search: Keyword matching-based search engines face two primary challenges when handling complex queries:
- Returning empty results when not all query keywords exist in any post
- Failing to correctly answer questions even when posts containing all keywords exist, due to lack of conceptual understanding

Research Motivation

LinkedIn's analysis revealed that posts capable of correctly answering queries actually exist in the search index, but may not contain all keywords from the query. This motivated the team to develop a content search engine with semantic matching capabilities to better understand query intent and return relevant content.

Core Contributions

Designed a Two-layer Architecture for the Semantic Search Engine: A retrieval layer followed by a multi-stage ranking layer, combining keyword matching with semantic understanding
Implemented a Hybrid Retrieval Strategy: Runs a term-based retriever (TBR) and an embedding-based retriever (EBR) in parallel
Established a Multi-objective Optimization Framework: Jointly optimizes on-topic rate and long-dwells
Achieved Significant Performance Improvements: Both on-topic rate and long-dwell metrics improved by over 10%

Methodology Details

Task Definition

Return high-quality, engaging post content for each search query, evaluated through two quantitative metrics:

On-topic Rate: Assessing the quality and relevance of returned posts using GPT
Long-dwells: Counting posts on which the user's dwell time is long enough to be labeled a long dwell

Model Architecture

1. Retrieval Layer

The retrieval layer contains two parallel retrievers:

Term-Based Retriever (TBR):

Maintains inverted indices mapping keywords to posts containing those terms
Finds posts containing all query keywords through intersection operations
Suitable for navigational queries, such as finding specific posts
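The TBR steps above can be sketched as follows. This is a minimal illustration, not LinkedIn's implementation: whitespace tokenization, the toy posts, and the function names build_inverted_index and retrieve_all_terms are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(posts):
    """Map each lowercase term to the set of post ids containing it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for term in text.lower().split():
            index[term].add(post_id)
    return index


def retrieve_all_terms(index, query):
    """Return the ids of posts that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Intersect the smallest posting sets first to keep intermediates small.
    postings = sorted((index.get(term, set()) for term in terms), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


# Hypothetical posts, used only to exercise the sketch.
posts = {
    1: "how to negotiate a salary raise",
    2: "dropout regularization in neural networks",
    3: "ask your manager for a raise",
}
index = build_inverted_index(posts)

short_query_hits = retrieve_all_terms(index, "a raise")
long_query_hits = retrieve_all_terms(index, "how to ask for a raise")
```

Here the short query matches posts 1 and 3, while the longer natural language query matches nothing because no single post contains all of its terms. This is the empty-result failure described under Problem Definition.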

Embedding-Based Retriever (EBR):

Adopts a two-tower model architecture
Query embedding tower: Processes query text and user features to generate query embeddings
Post embedding tower: Processes post text and author features to generate post embeddings
Utilizes the multilingual-e5 model for text embedding
Computes matching scores between queries and posts using cosine similarity
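A minimal sketch of the scoring step, assuming both towers have already produced their vectors. The three-dimensional vectors, the post ids, and the top_k helper are hypothetical stand-ins; real multilingual-e5 embeddings have far more dimensions, and the summary does not describe the towers at this level of detail.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, post_vecs, k):
    """Return the ids of the k posts most similar to the query vector."""
    scored = [(cosine(query_vec, vec), post_id) for post_id, vec in post_vecs.items()]
    scored.sort(reverse=True)
    return [post_id for _, post_id in scored[:k]]


# Hypothetical outputs of the post embedding tower.
post_vecs = {
    "salary_negotiation": [0.9, 0.1, 0.0],
    "neural_net_dropout": [0.0, 0.2, 0.9],
    "team_offsite_photos": [0.1, 0.9, 0.1],
}
# Hypothetical output of the query tower for "how to ask for a raise".
query_vec = [0.8, 0.2, 0.1]

best = top_k(query_vec, post_vecs, 2)
```

Because the score depends only on the two vectors, a post can rank first without sharing a single keyword with the query.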

Key Advantages of EBR:

Semantic Matching: Based on concepts rather than exact keyword matching
Personalization: Can return personalized results based on searcher characteristics
Objective Optimization: Supports optimization of arbitrary objective functions

2. Multi-stage Ranking Layer

The ranking layer employs a two-stage design to balance effectiveness and efficiency:

L1 Ranking Stage:

Uses simple models to initially rank thousands of candidate posts
Selects the top hundreds of candidates for the next stage

L2 Ranking Stage:

Uses complex models for fine-grained ranking of candidates
Generates final search results
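The two stages can be sketched as a cheap filter followed by an expensive re-rank. The scoring functions, field names ("bm25", "quality"), and cut-offs below are assumptions chosen for illustration; the summary only states that L1 uses simple models and L2 uses complex ones.

```python
def rank_two_stage(candidates, cheap_score, expensive_score, l1_keep, final_k):
    """Rank with a cheap model first, then re-rank the survivors."""
    # L1: score every candidate with the cheap model and keep the best l1_keep.
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:l1_keep]
    # L2: run the expensive model only on the shortlist.
    return sorted(shortlist, key=expensive_score, reverse=True)[:final_k]


l2_calls = {"count": 0}


def cheap(post):
    # Hypothetical inexpensive signal.
    return post["bm25"]


def expensive(post):
    # Hypothetical costly model; the counter shows how often it runs.
    l2_calls["count"] += 1
    return post["quality"]


candidates = [
    {"id": "a", "bm25": 0.9, "quality": 0.2},
    {"id": "b", "bm25": 0.8, "quality": 0.7},
    {"id": "c", "bm25": 0.7, "quality": 0.9},
    {"id": "d", "bm25": 0.1, "quality": 1.0},
]
results = rank_two_stage(candidates, cheap, expensive, l1_keep=3, final_k=2)
result_ids = [post["id"] for post in results]
```

The expensive model runs on three posts instead of four. Post "d" never reaches L2 even though it has the highest L2 score, which is the effectiveness cost paid for the efficiency gain.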

The ranking model architecture contains two prediction models:

On-topic Relevance Prediction Model: Takes query and post text as input, outputs relevance scores
Long-dwell Prediction Model: Takes a richer feature set as input, including:
- Query and post text
- Paired features such as BM25 matching scores
- Query features (e.g., whether containing job titles)
- Post features (e.g., post popularity)
- User features (e.g., job-seeking intent)
- Author features (e.g., author influence)
- User-author relationship features (e.g., friendship status)

Final score calculation formula:

score = α × on-topicness_score + (1-α) × long-dwell_score

where α serves as a tuning parameter, with optimal values determined through online experiments.
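As a small worked example of the formula, the sketch below ranks two posts under two values of α. The scores and the α values are made up for illustration; the summary does not report the value chosen in production.

```python
def blended_score(on_topic, long_dwell, alpha):
    """score = alpha * on_topic + (1 - alpha) * long_dwell."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * on_topic + (1.0 - alpha) * long_dwell


def rank(posts, alpha):
    """Order post names by blended score, best first."""
    return sorted(
        posts,
        key=lambda name: blended_score(*posts[name], alpha),
        reverse=True,
    )


# Hypothetical (on-topicness_score, long-dwell_score) pairs.
posts = {
    "post_a": (0.9, 0.2),  # very relevant, rarely dwelt on
    "post_b": (0.5, 0.8),  # less relevant, highly engaging
}

relevance_heavy = rank(posts, alpha=0.8)
engagement_heavy = rank(posts, alpha=0.2)
```

With α = 0.8 post_a scores 0.76 against 0.56 and ranks first; with α = 0.2 post_b scores 0.74 against 0.34 and the order flips. This sensitivity is why α has to be tuned.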

Technical Innovations

Hybrid Retrieval Strategy: Combines advantages of exact matching and semantic matching
Two-tower Model Design: Supports pre-computation of post embeddings, significantly improving retrieval efficiency
Multi-objective Optimization: Simultaneously considers content quality and user engagement
Hierarchical Architecture: Achieves good balance between efficiency and effectiveness

Experimental Setup

Dataset

Uses historical data from LinkedIn's content search engine
Training data format: (query, post, label) triplets
Labels combine both on-topic rate and long-dwell metrics

Evaluation Metrics

On-topic Rate:
- Uses GPT to label each of the top 10 returned posts (1 indicates relevant and high-quality, 0 otherwise)
- Calculates the proportion of posts labeled as 1
Long-dwells:
- Binary classification based on user dwell time on posts
- Counts posts labeled as 1
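Both metrics reduce to simple arithmetic once the labels exist. The sketch below assumes the GPT labels are already available as a list of 0/1 values and that a long dwell is a dwell time at or above some threshold; the 30-second threshold and the sample numbers are hypothetical, since the summary does not state them.

```python
def on_topic_rate(labels):
    """Proportion of the top-10 posts labeled 1 (relevant and high-quality)."""
    top = labels[:10]
    return sum(top) / len(top) if top else 0.0


def count_long_dwells(dwell_seconds, threshold_seconds):
    """Count the posts whose dwell time reaches the long-dwell threshold."""
    return sum(1 for seconds in dwell_seconds if seconds >= threshold_seconds)


# Hypothetical labels for the top 10 posts of one query.
labels = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
rate = on_topic_rate(labels)

# Hypothetical dwell times in seconds, with an assumed 30-second threshold.
long_dwells = count_long_dwells([3.0, 45.0, 12.5, 60.0], threshold_seconds=30.0)
```

For these inputs the on-topic rate is 7 out of 10 and two posts count as long dwells.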

Implementation Details

Text embedding model: multilingual-e5
Embedding storage: Venice key-value storage system
Approximate nearest neighbor search: Limits scanned posts to control latency
Pre-computation optimization: Offline and near-line computation of post embeddings
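The latency control can be illustrated with a scan budget over precomputed embeddings. This is a deliberately simplified sketch: a plain dict stands in for the key-value store, the vectors are toy values, and capping a linear scan is only a stand-in for a real approximate nearest neighbor index, whose details the summary does not give.

```python
import heapq
from itertools import islice


def budgeted_search(query_vec, store, max_scanned, k):
    """Score at most max_scanned stored embeddings and return the best k ids."""
    scanned = islice(store.items(), max_scanned)
    scored = (
        (sum(q * p for q, p in zip(query_vec, vec)), post_id)
        for post_id, vec in scanned
    )
    return [post_id for _, post_id in heapq.nlargest(k, scored)]


# Hypothetical post embeddings, computed ahead of query time.
store = {
    "p1": [1.0, 0.0],
    "p2": [0.6, 0.8],
    "p3": [0.0, 1.0],
}
query_vec = [0.0, 1.0]

capped = budgeted_search(query_vec, store, max_scanned=2, k=1)
full = budgeted_search(query_vec, store, max_scanned=3, k=1)
```

With a budget of two the search returns p2 and never sees p3, the true best match; raising the budget to three finds it. A smaller budget bounds latency at the cost of recall.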

Experimental Results

Main Results

The new semantic search engine achieved significant performance improvements:

On-topic Rate: Improved by over 10%
Long-dwells: Improved by over 10%
Site-level Impact: Positive impact on LinkedIn's overall session count

Typical Cases

The search engine now effectively handles complex natural language queries, such as:

"how to ask for a raise?"
"dropout in AI"

These queries typically yielded unsatisfactory results in traditional keyword-based systems.

The paper focuses on practical applications of industrial-grade search systems, with relevant technologies including:

Text embedding techniques (multilingual-e5)
Two-tower model architecture
Multi-stage ranking systems
Large-scale retrieval system optimization

Conclusions and Discussion

Main Conclusions

Semantic understanding capability is crucial for modern search engines
Hybrid retrieval strategies effectively balance exact matching and semantic matching requirements
Multi-objective optimization frameworks effectively enhance user experience

Limitations

The current definition of on-topic rate is relatively simple and cannot fully capture quality expectations across different query types
Reliance on GPT for quality assessment may introduce bias into the on-topic rate

Future Directions

The team plans to:

Improve on-topic rate assessment metrics
Introduce large language models (LLMs) in the ranking layer so that query and post text can attend to each other jointly
Further enhance deep language understanding capabilities

In-depth Evaluation

Strengths

High Practical Value: Addresses important real-world business problems
Reasonable Architecture Design: Two-layer architecture effectively balances effectiveness and efficiency
Mature Technical Solution: Fully considers engineering challenges of large-scale deployment
Comprehensive Evaluation Framework: Establishes dual evaluation framework for quality and engagement
Significant Results: Achieves metric improvements exceeding 10%

Weaknesses

Limited Technical Innovation: Primarily engineering application of existing techniques
Evaluation Method Limitations: GPT-based evaluation may introduce bias
Insufficient Comparative Experiments: Lacks detailed comparisons with other semantic search methods
Missing Theoretical Analysis: Lacks in-depth theoretical analysis and ablation studies

Impact

Industrial Value: Provides practical reference for large-scale semantic search systems
Technology Promotion: Demonstrates that semantic understanding is effective in a production search engine
Experience Sharing: Shares practical engineering experience from a large-scale deployment

Applicable Scenarios

This method is suitable for:

Large-scale content search platforms
Search systems requiring complex natural language query processing
Search applications with high personalization requirements
Search scenarios requiring balance among multiple optimization objectives

References

The paper cites the following key technologies and tools:

Apache Samza - Stream processing framework
MTEB Leaderboard - Text embedding evaluation benchmark
Venice - LinkedIn's data storage platform
Multilingual E5 - Multilingual text embedding model

Summary: This is a typical industrial technical report that shares LinkedIn's engineering experience with semantic search. Although its technical novelty is limited, the complete system design, the metric improvements of over 10%, and the attention to engineering challenges make it a useful reference for industry practitioners.