What if embedding retrieval and semantic search are only worth it for code?
Code is a nice domain for semantic search because there’s less of a 1:1 mapping between words and semantics.
Code is of course more structured, but there’s a subtle layer of indirection between the string of tokens and the meaning. The meaning is also more distributed across lines. A line of code isn’t always like a sentence. Natural language and code target different recipients.
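To make that indirection concrete, here is a minimal sketch. The two snippets and every name in them are made up for illustration. They do the same thing, yet share almost no tokens, so a purely lexical match would rate them as barely related.

```python
import re

# Two hypothetical snippets with identical behavior but little shared vocabulary.
SNIPPET_A = """
def total(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc
"""

SNIPPET_B = """
def sum_values(numbers):
    return sum(numbers)
"""


def tokens(source: str) -> set[str]:
    """Return the set of identifier and keyword tokens found in the source."""
    return set(re.findall(r"[A-Za-z_]\w*", source))


def jaccard(a: set[str], b: set[str]) -> float:
    """Lexical overlap: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def run(source: str, name: str, arg):
    """Execute a snippet in its own namespace and call the named function."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace[name](arg)


# Only `def` and `return` are shared: 2 of 11 distinct tokens.
overlap = jaccard(tokens(SNIPPET_A), tokens(SNIPPET_B))

# Same meaning despite the low lexical overlap.
same_behavior = run(SNIPPET_A, "total", [1, 2, 3]) == run(SNIPPET_B, "sum_values", [1, 2, 3])

print(f"token overlap: {overlap:.2f}, same behavior: {same_behavior}")
```

Token-set Jaccard is only a crude stand-in for keyword search here, not a claim about how any real search system scores matches. The point is just that the meaning lives in how the lines work together, not in the words themselves.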
I suspect this is part of why we’re seeing more success for semantic search with code than with natural language documents.