In this paper, @LinkedIn describes their updated hybrid search platform for content search. It’s a nice overview of a concrete implementation of some modern search techniques.
My favorite parts are the various personalization approaches integrated into the re-ranking layers. It would be neat if they shared a little more detail about those vectors and the corresponding MLP. Concentrating the complexity and the personalization vectors in the two re-ranking layers is a smart design: it's complex, but relatively easy and low cost to change those layers as new ideas are identified.
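The paper doesn't give dimensions or architecture details for that re-ranking MLP, but the general shape of the idea is simple: concatenate query, document, and per-member personalization vectors and let a small network produce a relevance score. A minimal sketch, with made-up sizes and random weights standing in for a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions -- the paper does not specify sizes; these are assumptions.
D_QUERY, D_DOC, D_PERSONAL = 8, 8, 4
D_IN = D_QUERY + D_DOC + D_PERSONAL
D_HIDDEN = 16

# Randomly initialized weights stand in for a trained re-ranking MLP.
W1 = rng.normal(size=(D_IN, D_HIDDEN))
b1 = np.zeros(D_HIDDEN)
W2 = rng.normal(size=(D_HIDDEN, 1))
b2 = np.zeros(1)

def rerank_score(query_vec, doc_vec, personal_vec):
    """Score one (query, document) pair, conditioned on a member's
    personalization vector, with a tiny two-layer MLP."""
    x = np.concatenate([query_vec, doc_vec, personal_vec])
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer
    return float(h @ W2 + b2)         # scalar relevance score

# Re-rank a handful of candidate documents for one member.
query = rng.normal(size=D_QUERY)
personal = rng.normal(size=D_PERSONAL)
docs = rng.normal(size=(5, D_DOC))
scores = [rerank_score(query, d, personal) for d in docs]
order = np.argsort(scores)[::-1]  # best-first ordering of candidates
```

The appeal of this layer is exactly that it's cheap to iterate on: new personalization signals just widen the input vector, without touching the retrieval index.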
As a critique, I wonder how beneficial the personalization on the two-tower retrieval side really is. There's a lot of compute, complexity, and time in that part of the platform, and retraining and regenerating those parts of the representation will be costly. How much better is the initial retrieval than just using the text embeddings directly? It would be nice to know what the evaluations showed about the value of that work.
Overall, a nice paper and an easy read. Thanks to LinkedIn for sharing their internal details, down to specifics like their use of the e5 embeddings.
Introducing Semantic Capability in LinkedIn's Content Search Engine
https://arxiv.org/abs/2412.20366
P.S.
Some notes on the ever-evolving lexicon of this field:
The paper uses the phrase token-based retrieval (TBR) for what most people call traditional lexical or keyword-based search (TF-IDF, BM25, etc.).
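For readers newer to the lexical side, BM25 is just a term-frequency/document-frequency scoring formula. A minimal sketch over a toy corpus (the k1 and b values are the usual defaults, not anything from the paper):

```python
import math
from collections import Counter

# Standard BM25 free parameters.
K1, B = 1.5, 0.75

corpus = [
    "hybrid search combines lexical and semantic retrieval",
    "bm25 is a classic lexical ranking function",
    "embeddings capture semantic similarity between texts",
]
docs = [d.split() for d in corpus]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N           # average document length
df = Counter(t for d in docs for t in set(d))   # document frequency per term

def bm25(query, doc):
    """Sum the BM25 contribution of each query term found in the document."""
    tf = Counter(doc)
    score = 0.0
    for term in query.split():
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        num = tf[term] * (K1 + 1)
        den = tf[term] + K1 * (1 - B + B * len(doc) / avgdl)
        score += idf * num / den
    return score

scores = [bm25("lexical retrieval", d) for d in docs]
best = max(range(N), key=lambda i: scores[i])  # doc 0 matches both terms
```

Exact token matching is both BM25's strength (precision, explainability) and its weakness: it can't see that "embeddings" and "vectors" are related, which is the gap EBR fills.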
The paper uses embedding-based retrieval (EBR), which is fairly close to the common terms semantic search or vector search. Perhaps that's OK, since they add some other components to the text embeddings.
I’d argue they should use the term hybrid search in the title, but messy lexicons are normal at the leading edge of a field.