By the fastCRW team · Benchmark figures verified 2026-05-18 · Search bench is a single point-in-time run · Verify independently before quoting.
Vector embeddings vs keyword search: the core difference
Vector embeddings power semantic search: they map text to points in a high-dimensional space so that "how do I cancel my plan" and "stop my subscription" land close together even though they share no content words. Keyword search (also called lexical search) takes a different approach: it matches the literal terms in your query against the literal terms in the documents, ranking by how often those terms appear in a document and how rare they are across the corpus. The two answer different questions. Keyword search asks "which documents contain these words?" Vector search asks "which documents mean the same thing?" Most teams building search or retrieval-augmented generation (RAG) eventually need both, but understanding the split first saves you from over-engineering a problem that a plain inverted index would have solved.
Lexical matching (BM25/TF-IDF)
Classic keyword search is built on TF-IDF and its modern successor BM25. TF-IDF scores a document by term frequency (how often the query word appears here) weighted by inverse document frequency (how rare the word is across the whole corpus) — so "embeddings" counts for more than "the." BM25 refines this with saturation (the tenth occurrence of a word matters less than the second) and document-length normalization. The result is fast, explainable, and exact: if a user searches for an error code, a SKU, or a person's name, BM25 returns documents that literally contain that token. It runs on an inverted index — a map from each term to the list of documents containing it — which is decades-mature and cheap to operate.
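To make the saturation and length-normalization terms concrete, here is a minimal BM25 sketch in plain Python. It is an illustration under assumptions, not a production scorer: the whitespace tokenizer, the sample documents, the function names, and the parameter defaults (k1 = 1.5, b = 0.75) are all choices made for this example, and the IDF uses the "+1 inside the log" variant that keeps scores non-negative.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a larger IDF weight than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 controls term-frequency saturation; b controls how much
            # longer-than-average documents are penalized.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "error E4021 raised when the payment token expires",
    "the plan can be cancelled from the billing page",
    "embeddings map text to vectors",
]
scores = bm25_scores("error E4021", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
```

Only the first document contains the literal token, so it is the only one with a non-zero score. That is the exact-match behavior described above.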
Semantic similarity with embeddings
An embedding model (such as a sentence transformer) reads a chunk of text and emits a fixed-length vector, often 384 to 1,536 numbers. Texts with similar meaning produce vectors that point in similar directions. You store these vectors in a vector database, and at query time you embed the query the same way and find the nearest vectors. The payoff is recall on paraphrase: synonyms, reworded questions, and conceptually related passages surface even with zero word overlap. The cost is that the match is fuzzy and opaque — you cannot point at a word and say "this is why it matched," and exact identifiers can get blurred into nearby-but-wrong neighbors.
How vector embeddings work
From text to high-dimensional vectors
Embedding models are trained so that semantically related text ends up nearby in vector space. During indexing you split documents into chunks (the chunking strategy matters a lot for quality), run each chunk through the model, and store the resulting vector alongside the original text. The vector is the searchable artifact; the text is what you return to the user or feed to an LLM. Because the model is fixed at index time, every chunk and every query must be embedded by the same model — mixing models breaks the geometry.
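As one illustration of the chunking step, the sketch below splits text into fixed-size word windows with overlap. This is only one of many chunking strategies and is not necessarily the one you should use; the function name and the size and overlap defaults are assumptions chosen for the example.

```python
def chunk_words(text, size=50, overlap=10):
    """Split `text` into chunks of up to `size` words, with `overlap`
    words shared between consecutive chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once a chunk has reached the end of the text.
        if start + size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, size=5, overlap=2)
```

Each chunk would then be passed to the embedding model and stored next to its vector. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.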
Cosine similarity and nearest neighbors
To compare two vectors you typically use cosine similarity, which measures the angle between them and ignores magnitude — so two passages about the same topic score high regardless of length. Finding the closest vectors to a query is a nearest-neighbor search. Doing it exactly across millions of vectors is slow, so vector databases use approximate nearest-neighbor (ANN) indexes like HNSW to trade a tiny bit of recall for orders-of-magnitude speed. This is the part people underestimate: a vector index is an approximation with tunable accuracy, not a lookup table.
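The following sketch shows cosine similarity and an exact (brute-force) nearest-neighbor search in plain Python. The three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions, and a vector database would replace the linear scan with an ANN index such as HNSW.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths:
    # depends only on the angle between a and b, not their magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=2):
    """Exact top-k search: compare the query against every stored vector."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings" (illustrative values only).
vectors = [
    [1.0, 0.0, 0.0],
    [2.0, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query = [3.0, 0.0, 0.0]
top = nearest(query, vectors, k=2)
```

The scan costs one comparison per stored vector, which is why it does not scale to millions of entries and why approximate indexes exist.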
When keyword search still wins
Exact terms, codes, and names
Embeddings are the wrong tool when the query is the answer's identity. Error codes, product SKUs, function names, legal citations, ticket IDs, and proper nouns all demand exact matching, and a semantic model will happily return "close" neighbors that are simply wrong. A query for HTTP 422 should match documents about HTTP 422, not HTTP 429 just because the two codes sit near each other in embedding space. For these, BM25 is not a fallback; it is the correct primary tool.
Cost and latency advantages
Keyword search needs no GPU, no embedding-model inference at query time, and no vector store to maintain. The inverted index is typically small relative to a vector index, updates incrementally, and can return results in single-digit milliseconds on a well-provisioned deployment. For a corpus that is mostly searched by exact terms, such as logs, code, or structured catalogs, adding an embedding pipeline buys complexity and recurring inference cost for marginal gain. Reach for embeddings when paraphrase recall is the actual bottleneck, not by default.
Hybrid search: best of both
Combining lexical and semantic signals
In practice the strongest retrieval systems run both and merge the results; this is hybrid search. You issue the query against a BM25 index and a vector index in parallel, then fuse the two ranked lists. A common fusion method is Reciprocal Rank Fusion (RRF), which gives each document a score equal to the sum of 1/(k + rank) over the lists it appears in, where k is a small constant (60 in the original paper). Because only rank positions are used, a document that ranks well in either signal floats up without you having to hand-tune weights or reconcile BM25 scores with cosine similarities. Hybrid search recovers the exact-match precision of BM25 and the paraphrase recall of embeddings in one result set, which is why most production RAG stacks default to it.
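A minimal sketch of Reciprocal Rank Fusion, assuming each retriever returns a list of document IDs ordered best-first. The document IDs and function name are hypothetical, and k = 60 follows the constant used in the Cormack et al. paper.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first rankings into one.

    Each document's score is the sum of 1 / (k + rank) over every list
    it appears in, with ranks starting at 1.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical outputs of the two retrievers for the same query.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
# fused == ["doc_b", "doc_a", "doc_d", "doc_c"]
```

Documents found by both retrievers (doc_a and doc_b here) outrank documents found by only one, and the raw BM25 and cosine scores never need to be put on a common scale.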
Reranking the merged set
Fusion gets you a good candidate set; a reranker makes it great. A cross-encoder reranker takes the query and each candidate document together and scores true relevance, rather than comparing pre-computed vectors in isolation. It is too expensive to run over the whole corpus, but cheap enough to run over the top 20-50 candidates that hybrid search surfaced. The typical pipeline is: BM25 + vector retrieve hundreds → fuse to tens → rerank to the handful you actually send to the LLM. Each stage trades a little latency for a lot of precision.
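The rerank stage can be sketched as a function that scores each (query, candidate) pair jointly and keeps the top few. In the example below, toy_score_pair is a deliberately crude stand-in based on word overlap, used only so the sketch runs without a model; in a real pipeline you would swap in a cross-encoder's scoring call. All names and sample texts are hypothetical.

```python
def rerank(query, candidates, score_pair, top_n=3):
    """Re-order `candidates` (a list of (doc_id, text) pairs) by a
    pairwise relevance score and keep the best `top_n` IDs."""
    scored = [(score_pair(query, text), doc_id) for doc_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_n]]


def toy_score_pair(query, text):
    # Stand-in for a cross-encoder: fraction of query words found in the text.
    # A real reranker would run the query and text through a model together.
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


# Hypothetical candidates surfaced by the fused hybrid search.
candidates = [
    ("doc_a", "how to reset a password"),
    ("doc_b", "cancel your subscription from the billing page"),
    ("doc_c", "subscription pricing tiers"),
]
best = rerank("cancel subscription", candidates, toy_score_pair, top_n=2)
```

The scorer is called once per candidate at query time, which is why this stage is applied to tens of documents rather than to the whole corpus.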
Where live web search fits
Indexed embeddings can be stale
Both BM25 and vector search assume the documents already exist in your index. That assumption breaks the moment freshness matters. An embedding index is a snapshot: the day after you build it, a price changed, a doc was rewritten, a new release shipped — and your vectors still describe yesterday's world. Re-embedding a large corpus is expensive, so most teams re-index on a cadence, which means there is always a staleness window. For questions about "right now," even a perfect semantic match over a stale index returns a confidently wrong answer.
Fresh retrieval for real-time signals
This is where live web retrieval is a different layer, not a competing one. Rather than searching a pre-built index of your own, you query the live web at request time and pull back current full-page content. fastCRW's /v1/search is built for exactly this: it runs the query, returns ranked results, and can optionally scrape each result's content in the same call so an agent gets fresh text without a second round trip. In our search benchmark (100 queries across 10 categories, run concurrently against three providers; triple-bench.ts, single point-in-time run, verified 2026-05-18), fastCRW search averaged 880 ms with a 785 ms median and had the lowest latency on 73 of 100 queries, compared with Firecrawl and Tavily. We cite the raw numbers rather than a speed multiple, and this is the search benchmark only; it does not measure scrape accuracy.
Disclosure: we build fastCRW, so weigh this accordingly — and here is the honest scope line. fastCRW is not embeddings-native. It does not train an embedding model, host a vector database, or do neural ranking the way a dedicated semantic-search engine like Exa does. fastCRW's /v1/search is retrieval-plus-scrape: it fetches fresh results and clean page content, and you bring your own embedding and vector layer on top if you want semantic ranking. We would rather state that plainly than imply semantic parity we do not have.
Where Exa and dedicated vector tools genuinely win
If your core problem is "find the conceptually most relevant pages on the open web by meaning," a neural-search engine like Exa is purpose-built for it — its index is embeddings, so semantic recall over the web is its native strength, not an add-on. Likewise, if you need to store and re-search millions of your own embeddings, a managed vector database does that job far better than any scraper. fastCRW's role is the fresh-content layer feeding those systems, not a replacement for them. A common, honest architecture is: live web search and scrape with fastCRW → chunk → embed → store in your vector DB → query with hybrid search at runtime.
Putting it together for RAG
The mental model that holds up: keyword search for exactness and identifiers, vector search for meaning and paraphrase, hybrid + rerank when you want both, and live web retrieval as the freshness layer that keeps any index from going stale. None of these replaces the others. The mistake we see most is teams reaching for embeddings first when their queries are 80% exact-term lookups, paying for inference and a vector store to solve a problem BM25 already solved — and the opposite mistake, bolting a semantic layer onto a stale index and never asking whether the underlying data is current. Decide the query shape first, then the freshness requirement, then pick the layers.
Sources
- BM25 / Okapi ranking function: en.wikipedia.org/wiki/Okapi_BM25
- HNSW approximate nearest neighbor: Malkov & Yashunin, 2016
- Reciprocal Rank Fusion: Cormack et al., SIGIR 2009
- fastCRW repo and search API: github.com/us/crw · /benchmarks
