Introducing Semantic Capability in LinkedIn's Content Search Engine
Yang, Zheng, Mohan et al.
In the past, most search queries issued to a search engine were short and simple. A keyword based search engine was able to answer such queries quite well. However, members are now developing the habit of issuing long and complex natural language queries. Answering such queries requires evolution of a search engine to have semantic capability. In this paper we present the design of LinkedIn's new content search engine with semantic capability, and its impact on metrics.
As user search behavior evolves, traditional keyword-based search engines can no longer satisfy the increasingly complex demands of natural language queries. This paper introduces LinkedIn's newly designed content search engine with semantic understanding capabilities and demonstrates its significant improvements in core metrics.
Increasing Complexity of Search Queries: Users have transitioned from short keyword queries to complex natural language queries, such as "how to ask for a raise?" and "dropout in AI"
Limitations of Traditional Search: Keyword matching-based search engines face two primary challenges when handling complex queries:
Returning empty results when not all query keywords exist in any post
Failing to correctly answer questions even when posts containing all keywords exist, due to lack of conceptual understanding
LinkedIn's analysis revealed that posts capable of correctly answering queries actually exist in the search index, but may not contain all keywords from the query. This motivated the team to develop a content search engine with semantic matching capabilities to better understand query intent and return relevant content.
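The two failure modes above can be illustrated with a minimal sketch of strict keyword (AND) matching. Everything here is hypothetical: the `tokenize` and `keyword_and_match` helpers and the three sample posts are invented for illustration and are not taken from the paper or from LinkedIn's system. The second query also assumes, purely for the sake of the example, that the member means the regularization technique rather than a person who left school.

```python
def tokenize(text):
    """Lowercase the text and split on whitespace, ignoring basic punctuation."""
    for ch in "?.,!":
        text = text.replace(ch, " ")
    return set(text.lower().split())


def keyword_and_match(query, posts):
    """Return the posts that contain every term of the query (strict AND)."""
    terms = tokenize(query)
    return [post for post in posts if terms <= tokenize(post)]


# Hypothetical posts, for illustration only.
posts = [
    "Tips for negotiating a salary increase with your manager",
    "Regularization techniques such as dropout reduce overfitting in neural networks",
    "A college dropout built a career in AI",
]

# Failure 1: a relevant post exists (the first one), but no post contains
# every query term, so strict matching returns nothing.
empty_result = keyword_and_match("how to ask for a raise?", posts)

# Failure 2: the only post containing all query terms uses "dropout" in a
# different sense, while the post about the technique lacks the term "AI".
literal_result = keyword_and_match("dropout in AI", posts)
```

In both cases a post that plausibly answers the query is present in the collection but is never surfaced, which is the gap that semantic matching is meant to close.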
Designed a Two-layer Architecture for the Semantic Search Engine: A retrieval layer and a multi-stage ranking layer that together combine keyword matching with semantic understanding
Implemented a Hybrid Retrieval Strategy: Using a term-based retriever (TBR) and an embedding-based retriever (EBR) side by side
Established a Multi-objective Optimization Framework: Optimizing on-topic rate and long-dwell metrics at the same time
Achieved Significant Performance Improvements: Both on-topic rate and long-dwell metrics improved by over 10%
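A toy sketch may help make the hybrid retrieval and multi-objective ranking ideas concrete. It is not the paper's implementation. The function names, the hand-written 2-D embeddings, the per-post `on_topic` and `long_dwell` scores, the equal weights, and the choice of a simple weighted sum are all assumptions made for illustration. This summary does not state how the real system merges candidates or combines objectives.

```python
import math


def term_based_retrieve(query, posts):
    """TBR stand-in: ids of posts that contain every query term."""
    terms = set(query.lower().split())
    return {pid for pid, text in posts.items() if terms <= set(text.lower().split())}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_based_retrieve(query_vec, post_vecs, k):
    """EBR stand-in: ids of the k posts closest to the query embedding."""
    ranked = sorted(post_vecs, key=lambda pid: cosine(query_vec, post_vecs[pid]), reverse=True)
    return set(ranked[:k])


def hybrid_retrieve(query, query_vec, posts, post_vecs, k=2):
    """Union of the candidates from both retrievers (assumed merge strategy)."""
    return term_based_retrieve(query, posts) | embedding_based_retrieve(query_vec, post_vecs, k)


def rank(candidates, on_topic, long_dwell, w_topic=0.5, w_dwell=0.5):
    """Order candidates by a weighted sum of two predicted objectives (assumed form)."""
    def score(pid):
        return w_topic * on_topic[pid] + w_dwell * long_dwell[pid]
    return sorted(candidates, key=score, reverse=True)


# Hypothetical data, for illustration only.
posts = {
    1: "how to ask for a raise at work",
    2: "negotiating a salary increase",
    3: "baking bread at home",
}
post_vecs = {1: [0.9, 0.1], 2: [0.8, 0.2], 3: [0.0, 1.0]}
query, query_vec = "ask for a raise", [1.0, 0.0]

tbr_only = term_based_retrieve(query, posts)
candidates = hybrid_retrieve(query, query_vec, posts, post_vecs, k=2)

on_topic = {1: 0.9, 2: 0.8, 3: 0.1}    # assumed predicted on-topic probability
long_dwell = {1: 0.2, 2: 0.7, 3: 0.1}  # assumed predicted long-dwell probability
ranking = rank(candidates, on_topic, long_dwell)
```

In this toy setup the term-based retriever alone finds only post 1, while the hybrid candidate set also includes post 2, which shares almost no keywords with the query. The ranking step then trades the two objectives against each other through the weights.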
The paper cites the following key technologies and tools:
Apache Samza - Stream processing framework
MTEB Leaderboard - Text embedding evaluation benchmark
Venice - LinkedIn's data storage platform
Multilingual E5 - Multilingual text embedding model
Summary: This is a typical industrial technical report that shares LinkedIn's practical engineering experience with semantic search. Although its technical novelty is relatively limited, the complete system design, the reported gains of over 10% in on-topic rate and long-dwell metrics, and the careful treatment of engineering challenges make it a useful reference for industry practitioners.