By the fastCRW team · Benchmark + footprint figures verified 2026-05-18 (scrape run 2026-05-08) · fastCRW launch pricing expires 2026-06-01 · Verify independently before relying on these numbers.
Weaviate semantic search needs clean source text
Weaviate semantic search from the web is only as good as the text you put into it. Weaviate handles the vectors, the indexing, and the hybrid ranking — but if you feed it raw HTML littered with navigation, cookie banners, ad markup, and boilerplate, the embeddings encode that noise and your similarity scores drift. The first job in any web-to-Weaviate pipeline is not the vectorizer; it is the extraction layer that turns a live page into clean, structured text worth embedding.
That is where fastCRW fits. It is a Firecrawl-compatible open-core scraper that returns clean markdown out of the box, so the content you import into Weaviate carries the page's real structure — headings, lists, code blocks — instead of DOM debris. This guide walks the full path: collect web content with fastCRW, define a Weaviate class and vectorize it, run hybrid (BM25 + vector) queries over fresh data, and keep the index current on a schedule.
Why raw HTML hurts vectorization
Embedding models compress meaning into a fixed-length vector. Every token they see competes for that budget. When a page's main article shares the input with a 40-link footer and three "related posts" widgets, the vector becomes a blend of the content you want and the chrome you don't. Two genuinely different articles that share the same site template can end up closer in vector space than they should be, because the shared boilerplate dominates. Clean extraction removes that confound before it ever reaches the vectorizer.
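The effect is easy to see with a toy stand-in. The sketch below uses plain word-count cosine similarity rather than a real embedding model (an assumption made so it runs anywhere), and the boilerplate and article strings are invented for illustration. Two articles with no words in common score zero when clean, and score much higher once the same site chrome is prepended to both.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Invented example text: shared site chrome plus two unrelated articles.
BOILERPLATE = "home products pricing blog login accept cookies subscribe newsletter related posts"
article_a = "postgres vacuum reclaims dead tuples from tables"
article_b = "kubernetes schedules pods onto cluster nodes"

clean = cosine(article_a, article_b)
noisy = cosine(f"{BOILERPLATE} {article_a}", f"{BOILERPLATE} {article_b}")
print(f"clean={clean:.2f} noisy={noisy:.2f}")
```

A real embedding model is far more nuanced than word counts, but the direction of the bias is the same: shared template text pulls unrelated pages together.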
The web-to-Weaviate flow
The pipeline has four stages, and only the first belongs to fastCRW:
- Collect — crawl or scrape source pages into clean markdown (fastCRW).
- Chunk — split each document into embedding-sized passages.
- Vectorize + import — define a Weaviate class and load objects with metadata (Weaviate).
- Query — run hybrid search and refresh on a schedule.
fastCRW owns the ingestion edge; Weaviate owns storage, vectorization, and retrieval. Keeping that boundary explicit makes the cost and accuracy trade-offs easy to reason about.
Step 1: Collect web content with fastCRW
fastCRW exposes a Firecrawl-compatible REST surface, so if you already call Firecrawl you can point the same client at fastCRW by swapping the base URL. The two endpoints you need here are /v1/map (discover URLs) and /v1/crawl (pull pages); for one-off pages use /v1/scrape.
Crawl a site or scrape target pages
For a whole documentation site, map first to see the URL set, then crawl:
- POST /v1/map returns every URL fastCRW can discover on the site, which is useful for scoping the crawl before you spend credits.
- POST /v1/crawl starts an async BFS crawl and returns a job ID; poll GET /v1/crawl/:id for status and results. crawl accepts maxDepth and maxPages (caps: maxDepth 10, maxPages 1000) so you can bound the run.
- POST /v1/scrape handles single pages (a changelog, a pricing page, one API reference) when you don't need a full crawl.
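A minimal sketch of starting a bounded crawl, using only the Python standard library. The endpoint path and the maxDepth/maxPages names and caps come from this guide; the "url" body field, the Bearer auth header, and the helper names are assumptions based on the Firecrawl-compatible surface, so check them against your deployment.

```python
import json
import urllib.request

# Caps stated in this guide.
MAX_DEPTH_CAP = 10
MAX_PAGES_CAP = 1000

def build_crawl_body(url: str, max_depth: int = 3, max_pages: int = 200) -> dict:
    """Build a /v1/crawl request body, clamping the bounds to the documented caps."""
    return {
        "url": url,  # field name assumed from Firecrawl compatibility
        "maxDepth": max(0, min(max_depth, MAX_DEPTH_CAP)),
        "maxPages": max(1, min(max_pages, MAX_PAGES_CAP)),
    }

def post_json(base_url: str, path: str, body: dict, api_key: str) -> dict:
    """POST a JSON body and decode the JSON reply. Bearer auth is an assumption."""
    req = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

A call would look like post_json(base_url, "/v1/crawl", build_crawl_body("https://docs.example.com"), api_key), after which you poll GET /v1/crawl/:id with the job ID from the response. Clamping client-side keeps an over-eager config value from being rejected or silently truncated by the server.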
A crawl costs 1 credit per page (2 per page when chrome-rendered), and a standalone scrape costs 1 credit (2 with the chrome renderer). There is no per-page JS-render multiplier beyond that flat chrome bump, which keeps a 500-page docs crawl forecastable.
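The pricing rule above reduces to one line of arithmetic. The function below is a sketch for forecasting a run; its name and the split into plain versus chrome-rendered pages are illustrative, not part of any fastCRW API.

```python
def estimate_crawl_credits(pages: int, chrome_rendered: int = 0) -> int:
    """1 credit per page, 2 for each page that needs the chrome renderer."""
    if not 0 <= chrome_rendered <= pages:
        raise ValueError("chrome_rendered must be between 0 and pages")
    return (pages - chrome_rendered) * 1 + chrome_rendered * 2

print(estimate_crawl_credits(500))       # 500
print(estimate_crawl_credits(500, 120))  # 620
```

The worst case for any crawl is therefore twice the page count, which gives you a hard ceiling to budget against.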
Clean markdown for embedding
fastCRW's extraction is the part that matters for retrieval quality. On Firecrawl's own public scrape-content-dataset-v1 (1,000 URLs, of which 819 carry labeled ground truth), fastCRW reached 63.74% truth-recall across the 819 labeled URLs (diagnose_3way.py, 2026-05-08), the highest of the three tools tested (Crawl4AI 59.95%, Firecrawl 56.04%). Higher recall means more of each page's real content survives extraction, so the markdown you chunk and vectorize is more complete. We disclose the trade-off plainly: that recall comes partly from a chrome-stealth fallback that recovers hard pages, and the same mechanism gives fastCRW a p90 latency of 14157 ms, the worst of the three in that run. Its p50 of 1914 ms does beat Firecrawl's 2305 ms. For a batch ingestion job feeding an index, the tail latency is rarely the binding constraint; recall is.
Step 2: Define a Weaviate class and vectorize
With clean markdown in hand, the work moves to Weaviate. fastCRW does no vectorization itself — that is Weaviate's job, via whichever vectorizer module you configure.
Schema and vectorizer choices
Define a class (collection) with the properties you want to store and query: the chunk text, plus metadata like sourceUrl, title, and crawledAt. Pick a vectorizer that matches your needs — a Weaviate inference module (e.g. a text2vec module) or none if you bring your own embeddings. If you embed outside Weaviate, fastCRW still only produces the text; the embedding step is yours.
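As a sketch, a class definition in the shape Weaviate's REST schema endpoint accepts might look like the following. The class name WebChunk and the text and chunkIndex properties are illustrative choices; sourceUrl, title, and crawledAt are the metadata fields named above. Verify the field names against your Weaviate version before relying on them.

```python
import json

# "WebChunk", "text", and "chunkIndex" are hypothetical names chosen for this example.
web_chunk_class = {
    "class": "WebChunk",
    # "none" means you supply vectors yourself; use a text2vec module name instead
    # if Weaviate should embed the text for you.
    "vectorizer": "none",
    "properties": [
        {"name": "text", "dataType": ["text"]},
        {"name": "sourceUrl", "dataType": ["text"]},
        {"name": "title", "dataType": ["text"]},
        {"name": "chunkIndex", "dataType": ["int"]},
        {"name": "crawledAt", "dataType": ["date"]},  # RFC 3339 timestamp
    ],
}

print(json.dumps(web_chunk_class, indent=2))
```

Keeping the definition as plain data makes it easy to version alongside the crawl config and to submit with whichever client you prefer.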
One honest constraint to plan around: if you use fastCRW's own LLM extraction (formats: ["json"] with a jsonSchema) to pull structured fields before import, that extraction supports OpenAI and Anthropic providers only. For plain markdown ingestion — the common case here — no LLM is involved at all, so this limit doesn't apply to most Weaviate pipelines.
Chunking before import
Embedding models have a context limit, and retrieval works best when each vector represents a focused passage. Split each markdown document into chunks — by heading section, by token count with overlap, or with a structure-aware splitter — and store one Weaviate object per chunk. Because fastCRW preserves heading hierarchy in the markdown, heading-based chunking tends to produce cleaner, self-contained passages than chunking raw HTML would. See best chunking strategies for RAG for the trade-offs between fixed-size, recursive, and semantic chunking.
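Heading-based chunking can be sketched in a few lines. This is one simple approach, not fastCRW's or Weaviate's own splitter: it starts a new chunk at every markdown heading, keeps the heading line inside the chunk, and ignores lines that look like headings inside fenced code blocks. Long sections would still need a token-count split on top.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # markdown code-fence marker

def chunk_by_heading(markdown: str) -> list[dict]:
    """Split markdown into one chunk per heading section.

    Each chunk keeps its heading line so the passage stays self-contained.
    Lines inside fenced code blocks are never treated as headings.
    """
    chunks: list[dict] = []
    heading, lines, in_fence = "", [], False

    def flush() -> None:
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"heading": heading, "text": body})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            heading, lines = match.group(2).strip(), [line]
        else:
            lines.append(line)
    flush()
    return chunks
```

Each returned dict maps naturally onto one Weaviate object: the text goes into the vectorized property and the heading can be stored as extra metadata.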
Step 3: Hybrid search over fresh data
Once objects are imported, Weaviate's hybrid search is where the freshness pays off.
Importing objects with metadata
Load each chunk as an object with its metadata attached. Use a deterministic ID derived from the source URL plus a chunk index so re-imports upsert in place rather than duplicating. Keeping sourceUrl and crawledAt on every object lets you filter queries by source and prune stale content later.
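One way to get such an ID is a name-based UUID from the standard library, since Weaviate object IDs are UUIDs. The function name and the "#chunk-N" key format below are illustrative; any stable string built from the URL and index works.

```python
import uuid

def chunk_object_id(source_url: str, chunk_index: int) -> str:
    """Derive a stable UUID from the source URL and the chunk's position."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source_url}#chunk-{chunk_index}"))

first = chunk_object_id("https://docs.example.com/install", 0)
again = chunk_object_id("https://docs.example.com/install", 0)
print(first == again)  # True
```

Because the ID depends only on the URL and index, a later crawl of the same page addresses the same objects without any lookup table.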
Hybrid (BM25 + vector) queries
Hybrid search blends keyword (BM25) and vector similarity, controlled by an alpha weight. It is the reason clean text matters twice over: the vector half rewards semantically coherent passages, and the BM25 half rewards exact term matches — and both degrade when boilerplate dilutes the chunk. With clean fastCRW markdown, a query for an error message can hit the exact string via BM25 while still surfacing conceptually related passages via the vector side. If you are weighing Weaviate's hybrid mode against hosted search APIs, see best semantic search APIs.
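A sketch of building a hybrid query as a Weaviate GraphQL string follows. In Weaviate, alpha = 0 is pure keyword search and alpha = 1 is pure vector search. The helper name, the class name passed in, and the text and sourceUrl fields are assumptions for illustration; adjust them to your schema and verify the query shape against your Weaviate version.

```python
import json

def build_hybrid_query(class_name: str, query: str, alpha: float = 0.5, limit: int = 5) -> str:
    """Build a Weaviate GraphQL hybrid query. alpha=0 is pure BM25, alpha=1 is pure vector."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # json.dumps yields a double-quoted, escaped string literal that GraphQL accepts.
    return (
        "{ Get { %s(hybrid: {query: %s, alpha: %s}, limit: %d) "
        "{ text sourceUrl _additional { score } } } }"
        % (class_name, json.dumps(query), alpha, limit)
    )

print(build_hybrid_query("WebChunk", 'error "ECONNRESET"', alpha=0.25))
```

For exact-string lookups such as error messages, a lower alpha leans on BM25; for conceptual questions, a higher alpha leans on the vectors.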
Scheduled refresh
Web content changes; a static index drifts. Re-run the same crawl on a schedule (nightly, weekly — whatever your sources' change rate justifies) and upsert by your deterministic IDs so only changed chunks are rewritten. This is the same pattern any web-backed RAG system needs — covered end to end in building a RAG pipeline with CRW and scheduled crawls with cron.
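Deciding which chunks changed needs some record of what was imported last time. The sketch below assumes you keep a content hash per object ID (for example as an extra property or in a side table); that bookkeeping is an assumption of this example, not something fastCRW or Weaviate does for you.

```python
import hashlib

def content_hash(text: str) -> str:
    """SHA-256 of the chunk text, used to detect edits between crawls."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_refresh(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes with a fresh crawl.

    previous: object ID -> content hash stored at the last import
    current:  object ID -> chunk text from the new crawl
    """
    # New or edited chunks: ID is unknown, or its stored hash no longer matches.
    upsert = [oid for oid, text in current.items() if previous.get(oid) != content_hash(text)]
    # Chunks that disappeared from the source and should be pruned.
    delete = [oid for oid in previous if oid not in current]
    return {"upsert": sorted(upsert), "delete": sorted(delete)}
```

Unchanged chunks appear in neither list, so a nightly run on a mostly static site touches only a handful of objects and re-embeds only what moved.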
Cost, footprint, and self-hosting
The cost of this pipeline splits cleanly across the two layers.
Crawl credits vs Weaviate hosting
On the ingestion side you pay fastCRW credits: 1 per crawled page (2 if chrome-rendered), 1 per scrape, 1 per map call. On the storage side you pay for Weaviate — whether that's Weaviate Cloud or your own instance, plus the embedding cost if you use a paid vectorizer. The two are independent: a 1,000-page crawl is ~1,000 credits regardless of how Weaviate is hosted. For current credit allowances per plan, see /pricing (launch tiers revert to regular price on 2026-06-01).
Run the open-core engine alongside self-hosted Weaviate for $0
This is where the footprint story matters. fastCRW's engine is a single static Rust binary, shipped as a roughly 8 MB Docker image in one container, with no Redis or Node.js required. Weaviate runs as its own container. That means the entire ingestion-plus-vectors stack can sit on one box: self-hosted Weaviate next to the self-hosted fastCRW engine, with scraped content never leaving your infrastructure. Self-hosting the AGPL-3.0 engine is free (you pay only for the server), so for a self-managed Weaviate deployment the web-ingestion layer adds $0 in per-page bills. A heavy cloud-only scraper cannot offer that co-location.
Honest limits
Three boundaries are worth stating before you build:
- fastCRW does ingestion, not vector indexing. It produces clean markdown; it does not embed, store, or rank. All vectorization, ANN indexing, and hybrid search live in Weaviate. If you want a turnkey vector store, fastCRW is not it — it is deliberately the layer in front of one.
- LLM extraction is OpenAI/Anthropic only. If you use formats: ["json"] to pull structured fields during ingestion, you are limited to those two providers. Plain markdown ingestion uses no LLM, so this only bites structured-extraction pipelines.
- Single-URL extraction and stateless requests. There is no multi-URL batch /v1/extract; for many URLs you iterate /v1/scrape concurrently or use /v1/crawl. Each request is stateless, and screenshot output is not supported (a formats: ["screenshot"] request returns HTTP 422), which is irrelevant for text ingestion but worth knowing.
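Iterating /v1/scrape concurrently can be sketched with a thread pool. The scrape_one callable is a placeholder you supply (for example, a function that POSTs one URL to /v1/scrape and returns its markdown); its name and signature are hypothetical, which also keeps the sketch independent of any particular HTTP client.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def scrape_many(urls: list[str], scrape_one: Callable[[str], str], max_workers: int = 8) -> dict:
    """Call scrape_one for each URL in parallel.

    Returns a dict of URL -> markdown, or URL -> the exception that page raised,
    so one failing page does not abort the whole batch.
    """
    def safe(url: str):
        try:
            return scrape_one(url)
        except Exception as exc:  # record the failure and keep going
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls.
        return dict(zip(urls, pool.map(safe, urls)))
```

Because each request is stateless, retries are simple: collect the URLs whose value is an exception and pass them through scrape_many again.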
Within those boundaries, the pairing is clean: fastCRW gets accurate, well-structured text into your pipeline, and Weaviate turns it into fast hybrid semantic search. For other vector stores, see best vector databases, and for a managed-Pinecone variant of this same pattern, Pinecone + fastCRW.
Sources
- Scrape benchmark (truth-recall, p50/p90): bench/server-runs/RESULT_3WAY_1000_FULL.md · diagnose_3way.py, 819 labeled URLs, 2026-05-08; see /benchmarks
- fastCRW repo and pricing: github.com/us/crw · /pricing
- Weaviate hybrid search docs: weaviate.io (verify against current Weaviate version)
Related: Best semantic search APIs · Best vector databases · RAG pipeline with CRW · Best chunking strategies for RAG
