How are Embeddings Affecting Traditional Text Search?

An informal explanation of lexical, semantic, and hybrid search for text documents

By Cody Collier on May 6, 2024

Overview

Modern semantic embeddings have opened new doors in text-based search. You may have recently seen discussions involving terms like embeddings, vectors, vector databases, semantic search, and more. What does all this mean?

In this post I’ll present key terms to understand, along with an overview for non-technical readers.

Traditional Lexical Search

What we think of as search began in academia, and continues today, in a field known as information retrieval. Over the past few decades, search algorithms and engines have become commoditized in industry. We take for granted that almost any application involving documents, email, or other text will have search available as a feature.

Most implementations are based on lexical algorithms such as TF-IDF or BM25, which is why they're called "lexical search". Essentially, these algorithms keep track of the vocabulary and word statistics for a set of documents, and then match a user query to the corresponding documents for those words.
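To make the idea concrete, here is a minimal sketch of lexical scoring in plain Python. It uses a simplified TF-IDF weighting, not BM25 and not what any particular engine does. The documents, function names, and exact formula are assumptions made up for illustration.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
DOCS = {
    "doc1": "the cat sat on the mat",
    "doc2": "dogs and cats make good pets",
    "doc3": "the stock market fell sharply today",
}

def tokenize(text):
    # Naive tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def build_index(docs):
    """Record word counts per document and how many documents contain each word."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return term_counts, doc_freq

def search(query, term_counts, doc_freq):
    """Score each document by summing (term frequency * inverse document frequency)."""
    n_docs = len(term_counts)
    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for term in tokenize(query):
            if term in counts:
                # Rare words get a higher weight than common ones.
                idf = math.log(1 + n_docs / doc_freq[term])
                score += counts[term] * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

TERM_COUNTS, DOC_FREQ = build_index(DOCS)
print(search("cat", TERM_COUNTS, DOC_FREQ))     # only doc1 matches; "cats" in doc2 is a different word
print(search("feline", TERM_COUNTS, DOC_FREQ))  # no matches at all
```

Notice that the query "feline" finds nothing, even though two of the documents are about cats. Matching happens on exact words only.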

While we take it for granted, there’s a deep history of optimization and detail behind lexical search. Such technical details are beyond the scope of this article, but we see the results in our everyday use of search. Of course we also see where search fails, when we struggle to remember the exact words used in a document, don’t know exactly what we’re seeking, or want to search for concepts. Enter semantic embeddings!

Using Embeddings for Semantic Search

In 2012 and 2013, seeds were planted which would eventually bloom into significant disruptions for the search industry. A new algorithmic approach called Word2Vec emerged, alongside advancements in learned representations and deep learning. This ignited a decade of new ideas for using numeric vectors to represent the meaning of a group of words. These numeric vectors are referred to as embeddings. The details can be complex, with innovations happening regularly, but the intuitions are fairly straightforward.

Imagine two paragraphs which talk about the same topic. Paragraphs A and B can cover the same topic and meaning, while using both similar and different words. If you create an embedding for each paragraph, the embeddings will be "close" to each other, mathematically speaking, even if the wording is different. So you can now measure semantic similarity with a simple math operation!
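One common choice for that math operation is cosine similarity, which measures the angle between two vectors. Here is a rough sketch. The three-number vectors are invented for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return a value near 1.0 when two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings: A and B stand for two paragraphs on the same topic,
# C stands for a paragraph on an unrelated topic.
paragraph_a = [0.9, 0.1, 0.3]
paragraph_b = [0.8, 0.2, 0.4]
paragraph_c = [0.1, 0.9, 0.0]

print(cosine_similarity(paragraph_a, paragraph_b))  # high: similar meaning
print(cosine_similarity(paragraph_a, paragraph_c))  # low: different meaning
```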

Now we can apply this concept in search. You can make an embedding for each paragraph or document, and store it in a vector database. Then you make an embedding for the user query, and see which documents are most similar to the query. This is the fundamental core of semantic search.
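The store-then-compare flow might look something like the sketch below. A plain Python list stands in for the vector database, and the vectors are hand-written placeholders for what an embedding model would produce. The document names and the semantic_search function are hypothetical. Real vector databases also use approximate indexes rather than comparing against every document.

```python
import math

def normalize(vector):
    """Scale a vector to length 1 so a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]

# Stand-in for a vector database: (document id, unit-length embedding).
# The numbers are invented; a real system would get them from a model.
INDEX = [
    ("pet-care", normalize([0.9, 0.1, 0.2])),
    ("stock-report", normalize([0.1, 0.9, 0.1])),
    ("cat-breeds", normalize([0.8, 0.0, 0.5])),
]

def semantic_search(query_vector, index, top_k=2):
    """Compare the query embedding to every stored embedding and keep the closest."""
    q = normalize(query_vector)
    scored = [
        (doc_id, sum(a * b for a, b in zip(q, vector)))
        for doc_id, vector in index
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Pretend this is the embedding of a query such as "how do I look after a kitten".
query_vector = [0.85, 0.05, 0.3]
for doc_id, score in semantic_search(query_vector, INDEX):
    print(doc_id, round(score, 3))
```

The query never has to share a word with a document. Ranking depends only on how close the vectors are.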

Semantic search has been growing significantly over the last couple of years. It’s a big part of why we’ve seen an explosion in the release of new vector databases. This is especially true for people approaching search through the lens of machine learning in the new domain of retrieval-augmented generation (RAG).

Semantic search presents interesting and valuable ideas for search. However, it sometimes over-corrects and misses important ideas from traditional lexical search. What if you want to find only documents with an exact phrase? What if you still want to filter documents by metadata like traditional search?

Combining the Two as Hybrid Search

You might already be wondering, why not both?

That’s exactly what hybrid search does. If you combine traditional lexical search with modern semantic search, each approach can cover the other’s gaps, so the results are more likely to address the user’s query and information needs. How does this work?

Some people combine results from two separate engines (one lexical engine and one vector engine). There are algorithms for merging two separate result sets, such as reciprocal rank fusion (RRF) and others. Some traditional search engines already support vector indexes. Vespa is a strong leader in this area, while Lucene-based engines such as Elasticsearch and OpenSearch are close on its heels. Finally, some new vector engines are adding basic lexical search as a supplemental feature.
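As a rough illustration of the merging step, here is a small sketch of reciprocal rank fusion. Each document earns 1 / (k + rank) from every result list it appears in, and the totals decide the final order. The value k = 60 is a commonly used default, and the document ids and result lists are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranked list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list contribute the most.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from two separate engines for the same query.
lexical_results = ["doc_a", "doc_b", "doc_c"]
semantic_results = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([lexical_results, semantic_results])
print(fused)
```

Because it works on ranks rather than raw scores, this approach does not need the two engines to produce comparable score values. Documents that both engines agree on rise to the top.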

Looking Ahead

Traditional search and BM25 aren't going anywhere. At the same time, semantic search is a great way to start filling gaps and even extend functionality into document similarity and multimodal search. Using the two together is an understandably appealing concept.

Implementation approaches are fairly diverse right now, as hybrid search is nascent and best practices are still developing. In some cases, you can even make search results worse if you’re not careful! Even so, it’s fairly clear that hybrid search is emerging as the top pathway for state-of-the-art results in text-based search.

Please reach out if you’re interested in further conversation, comments, or corrections!

© 2024 Cody Collier. All rights reserved.