How do I import web pages into Weaviate?

Use fastCRW to collect the pages first: call POST /v1/map to discover a site's URLs, then POST /v1/crawl (or /v1/scrape for single pages) to pull them as clean markdown. Chunk each document into embedding-sized passages, then define a Weaviate class and import one object per chunk with metadata like sourceUrl and crawledAt. Weaviate handles vectorization and indexing; fastCRW only produces the text.

Does clean markdown improve Weaviate semantic search?

Yes. Embedding models have a fixed vector budget, so navigation, ads, and boilerplate in raw HTML dilute the meaning encoded for each chunk and can pull unrelated pages closer in vector space. fastCRW returns clean markdown with the page's real heading structure intact — and recovered 63.74% truth-recall on Firecrawl's 819 labeled URLs (diagnose_3way.py, 2026-05-08), the highest of the three tools tested — so both the vector and BM25 halves of Weaviate's hybrid search see cleaner input.

Can I run fastCRW and Weaviate on the same self-hosted box?

Yes. fastCRW's engine is a single static Rust binary (~8 MB image, one container, no Redis or Node.js), so it co-locates comfortably with a self-hosted Weaviate container on one server. Self-hosting the AGPL-3.0 engine is free — you pay only for the server — and scraped content never leaves your infrastructure, which a cloud-only scraper cannot match.

How much does a crawl-to-Weaviate pipeline cost?

On the fastCRW side, a crawl costs 1 credit per page (any renderer — no chrome premium), a single scrape costs 1 credit, and a map call costs 1 credit — so a 1,000-page crawl is roughly 1,000 credits. Weaviate hosting and any paid vectorizer are billed separately. Self-hosting the fastCRW engine removes the per-page credit cost entirely. See /pricing for current per-plan allowances.

Which LLM does fastCRW use for extraction?

fastCRW's LLM-based structured extraction (formats: ["json"] with a jsonSchema) runs on fastCRW's managed LLM on the managed cloud and requires a paid plan; self-hosted, it runs on the model endpoint you configure. For a standard Weaviate ingestion pipeline you usually import plain markdown, which involves no LLM at all, so this limit only affects pipelines that pull structured fields during scraping.

Weaviate + fastCRW: Semantic Search From Web

By the fastCRW team · Benchmark + footprint figures verified 2026-05-18 (scrape run 2026-05-08) · Verify independently before relying on these numbers.

Weaviate semantic search needs clean source text

Weaviate semantic search from the web is only as good as the text you put into it. Weaviate handles the vectors, the indexing, and the hybrid ranking — but if you feed it raw HTML littered with navigation, cookie banners, ad markup, and boilerplate, the embeddings encode that noise and your similarity scores drift. The first job in any web-to-Weaviate pipeline is not the vectorizer; it is the extraction layer that turns a live page into clean, structured text worth embedding.

That is where fastCRW fits. It is a Firecrawl-compatible open-core scraper that returns clean markdown out of the box, so the content you import into Weaviate carries the page's real structure — headings, lists, code blocks — instead of DOM debris. This guide walks the full path: collect web content with fastCRW, define a Weaviate class and vectorize it, run hybrid (BM25 + vector) queries over fresh data, and keep the index current on a schedule.

Why raw HTML hurts vectorization

Embedding models compress meaning into a fixed-length vector. Every token they see competes for that budget. When a page's main article shares the input with a 40-link footer and three "related posts" widgets, the vector becomes a blend of the content you want and the chrome you don't. Two genuinely different articles that share the same site template can end up closer in vector space than they should be, because the shared boilerplate dominates. Clean extraction removes that confound before it ever reaches the vectorizer.

The web-to-Weaviate flow

The pipeline has four stages, and only the first belongs to fastCRW:

Collect — crawl or scrape source pages into clean markdown (fastCRW).
Chunk — split each document into embedding-sized passages.
Vectorize + import — define a Weaviate class and load objects with metadata (Weaviate).
Query — run hybrid search and refresh on a schedule.

fastCRW owns the ingestion edge; Weaviate owns storage, vectorization, and retrieval. Keeping that boundary explicit makes the cost and accuracy trade-offs easy to reason about.

Step 1: Collect web content with fastCRW

fastCRW exposes a Firecrawl-compatible REST surface, so if you already call Firecrawl you can keep the same client and change only the base URL to point at fastCRW. The two endpoints you need here are /v1/map (discover URLs) and /v1/crawl (pull pages); for one-off pages use /v1/scrape.

Crawl a site or scrape target pages

For a whole documentation site, map first to see the URL set, then crawl:

POST /v1/map returns every URL fastCRW can discover on the site — useful to scope the crawl before you spend credits.
POST /v1/crawl starts an async BFS crawl and returns a job ID; poll GET /v1/crawl/:id for status and results. It accepts maxDepth and maxPages (capped at 10 and 1000 respectively) so you can bound the run.
POST /v1/scrape handles single pages — a changelog, a pricing page, one API reference — when you don't need a full crawl.
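The crawl-and-poll part of that sequence might be sketched as follows. This is a minimal sketch rather than an official fastCRW client: the base URL, the bearer-token header, and the response fields (`id`, `status`, and the `completed`/`failed` values) are assumptions borrowed from Firecrawl's v1 conventions, so verify them against your own deployment. Only the endpoint paths, the maxDepth/maxPages names, and the caps come from the text above.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:3000"  # assumption: wherever your fastCRW instance listens
MAX_DEPTH_CAP = 10                  # caps stated in the text above
MAX_PAGES_CAP = 1000


def build_crawl_body(url, max_depth=3, max_pages=100):
    """Build a /v1/crawl request body, clamped to the documented caps."""
    return {
        "url": url,
        "maxDepth": max(0, min(max_depth, MAX_DEPTH_CAP)),
        "maxPages": max(1, min(max_pages, MAX_PAGES_CAP)),
    }


def call_api(method, path, body=None, api_key=None):
    """Send one JSON request and decode the JSON response."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"  # assumption: bearer auth
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_crawl(job_id, fetch=call_api, interval=2.0, max_polls=300):
    """Poll GET /v1/crawl/:id until the job reports a terminal status."""
    for _ in range(max_polls):
        job = fetch("GET", f"/v1/crawl/{job_id}")
        if job.get("status") in ("completed", "failed"):  # assumption: status values
            return job
        time.sleep(interval)
    raise TimeoutError(f"crawl {job_id} did not finish after {max_polls} polls")
```

A full run would POST the site URL to /v1/map to scope the crawl, POST `build_crawl_body(...)` to /v1/crawl, and hand the returned job ID to `wait_for_crawl`. Passing `fetch` as a parameter keeps the polling loop testable without a live server.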

A crawl costs 1 credit per page and a standalone scrape costs 1 credit, whichever renderer handles the page (chrome, lightpanda, or http). There is no JS-render multiplier or chrome surcharge, which keeps a 500-page docs crawl forecastable.

Clean markdown for embedding

fastCRW's extraction is the part that matters for retrieval quality. On Firecrawl's own public scrape-content-dataset-v1 (1,000 URLs, of which 819 carry labeled ground truth), fastCRW reached 63.74% truth-recall on the 819 labeled URLs (diagnose_3way.py, 2026-05-08), the highest of the three tools tested (Crawl4AI 59.95%, Firecrawl 56.04%). Higher recall means more of each page's real content survives extraction, so the markdown you chunk and vectorize is more complete. We disclose the trade-off plainly: that recall comes partly from a chrome-stealth fallback that recovers hard pages, and a fallback adds time on the pages that trigger it. Even so, fastCRW's p50 of 1914 ms beats Firecrawl's 2305 ms, and in fast mode the p90 of 4348 ms is the lowest of the three. For a batch ingestion job feeding an index, tail latency is rarely the binding constraint; recall is.

Step 2: Define a Weaviate class and vectorize

With clean markdown in hand, the work moves to Weaviate. fastCRW does no vectorization itself — that is Weaviate's job, via whichever vectorizer module you configure.

Schema and vectorizer choices

Define a class (collection) with the properties you want to store and query: the chunk text, plus metadata like sourceUrl, title, and crawledAt. Pick a vectorizer that matches your needs — a Weaviate inference module (e.g. a text2vec module) or none if you bring your own embeddings. If you embed outside Weaviate, fastCRW still only produces the text; the embedding step is yours.
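As an illustration, the class described above could be written as a Weaviate class definition and sent to Weaviate's REST schema endpoint. The class name `WebChunk`, the `chunkIndex` property, and the choice of `text2vec-openai` are placeholders for this sketch; substitute whichever vectorizer module your instance has enabled, or `none` if you bring your own vectors.

```python
import json


def build_class_definition(vectorizer="none"):
    """Return a Weaviate class definition for crawled markdown chunks."""
    return {
        "class": "WebChunk",       # hypothetical class name
        "vectorizer": vectorizer,  # e.g. "text2vec-openai", or "none" for your own vectors
        "properties": [
            {"name": "text", "dataType": ["text"]},
            {"name": "title", "dataType": ["text"]},
            {"name": "sourceUrl", "dataType": ["text"]},
            {"name": "chunkIndex", "dataType": ["int"]},
            {"name": "crawledAt", "dataType": ["date"]},
        ],
    }


# The JSON body you would POST to <weaviate-host>/v1/schema
print(json.dumps(build_class_definition("text2vec-openai"), indent=2))
```

Storing crawledAt as a `date` property means values must be RFC 3339 timestamps, which in turn lets you filter on crawl time when pruning stale content.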

One honest constraint to plan around: if you use fastCRW's own LLM extraction (formats: ["json"] with a jsonSchema) to pull structured fields before import, that extraction runs on fastCRW's managed LLM on the managed cloud and requires a paid plan; self-hosted, it runs on the model endpoint you configure. For plain markdown ingestion, the common case here, no LLM is involved at all, so this limit doesn't apply to most Weaviate pipelines.

Chunking before import

Embedding models have a context limit, and retrieval works best when each vector represents a focused passage. Split each markdown document into chunks — by heading section, by token count with overlap, or with a structure-aware splitter — and store one Weaviate object per chunk. Because fastCRW preserves heading hierarchy in the markdown, heading-based chunking tends to produce cleaner, self-contained passages than chunking raw HTML would. See best chunking strategies for RAG for the trade-offs between fixed-size, recursive, and semantic chunking.
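Heading-based chunking can be sketched in a few lines of standard-library Python. This is one possible splitter, not a fastCRW or Weaviate feature: it cuts at ATX headings (`#` through `######`), ignores `#` lines inside fenced code blocks, and falls back to paragraph packing when a section exceeds `max_chars`. The character limit is a rough stand-in for a token budget and is an assumption of this sketch.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_oversize(text, max_chars):
    """Greedily pack paragraphs into pieces of at most max_chars.

    A single paragraph longer than max_chars is kept whole.
    """
    pieces, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            pieces.append(current)
            current = para
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def chunk_by_heading(markdown, max_chars=2000):
    """Split markdown into one chunk per heading section."""
    sections, title, lines, in_fence = [], "", [], False

    def flush():
        body = "\n".join(lines).strip()
        if body:
            sections.extend(
                {"title": title, "text": piece}
                for piece in split_oversize(body, max_chars)
            )

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on opening and closing fences
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section
            title, lines = match.group(2).strip(), [line]
        else:
            lines.append(line)
    flush()
    return sections
```

Each returned dict maps to one Weaviate object. The paragraph fallback splits on blank lines, so a very long code block containing blank lines can be divided; a structure-aware splitter would handle that case better.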

Step 3: Hybrid search over fresh data

Once objects are imported, Weaviate's hybrid search is where the freshness pays off.

Importing objects with metadata

Load each chunk as an object with its metadata attached. Use a deterministic ID derived from the source URL plus a chunk index so re-imports upsert in place rather than duplicating. Keeping sourceUrl and crawledAt on every object lets you filter queries by source and prune stale content later.
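One way to derive such an ID is a name-based UUID (version 5) over the source URL and chunk index, since the same inputs always produce the same UUID. The sketch below uses only the standard library; the `WebChunk` class name, the `chunkIndex` property, and the `#chunk-N` naming scheme are illustrative assumptions.

```python
import uuid
from datetime import datetime, timezone


def chunk_uuid(source_url, chunk_index):
    """Derive a stable UUIDv5 from the source URL and chunk position."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source_url}#chunk-{chunk_index}"))


def to_weaviate_object(chunk_text, title, source_url, chunk_index, crawled_at=None):
    """Shape one chunk as an object body for Weaviate's REST API."""
    crawled_at = crawled_at or datetime.now(timezone.utc)
    return {
        "class": "WebChunk",  # hypothetical class name
        "id": chunk_uuid(source_url, chunk_index),
        "properties": {
            "text": chunk_text,
            "title": title,
            "sourceUrl": source_url,
            "chunkIndex": chunk_index,
            "crawledAt": crawled_at.isoformat(),  # RFC 3339 timestamp
        },
    }
```

How an import call treats an ID that already exists depends on the endpoint or client method you use, so confirm that your chosen path replaces rather than rejects before relying on it for refreshes.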

Hybrid (BM25 + vector) queries

Hybrid search blends keyword (BM25) and vector similarity, controlled by an alpha weight. It is the reason clean text matters twice over: the vector half rewards semantically coherent passages, and the BM25 half rewards exact term matches — and both degrade when boilerplate dilutes the chunk. With clean fastCRW markdown, a query for an error message can hit the exact string via BM25 while still surfacing conceptually related passages via the vector side. If you are weighing Weaviate's hybrid mode against hosted search APIs, see best semantic search APIs.
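A hybrid query can be sent to Weaviate's GraphQL endpoint as a `Get` query with a `hybrid` argument. The helper below is a sketch that only builds the request body; the `WebChunk` class and its property names are the hypothetical ones used for illustration in this guide. In Weaviate, alpha 0 is pure keyword (BM25) search and alpha 1 is pure vector search.

```python
import json


def build_hybrid_query(query_text, alpha=0.5, limit=5, class_name="WebChunk"):
    """Build the JSON body for a hybrid search against POST /v1/graphql."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    quoted = json.dumps(query_text)  # escapes quotes and backslashes in the query
    graphql = (
        "{ Get { %s(hybrid: {query: %s, alpha: %s}, limit: %d) "
        "{ text sourceUrl crawledAt _additional { score } } } }"
        % (class_name, quoted, alpha, limit)
    )
    return {"query": graphql}


# Lean toward exact-string matching for an error-message lookup
print(build_hybrid_query('connection refused "ECONNREFUSED"', alpha=0.25)["query"])
```

Lower alpha values suit lookups for exact strings such as error codes; higher values suit conceptual questions where wording varies between the query and the source.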

Scheduled refresh

Web content changes; a static index drifts. Re-run the same crawl on a schedule (nightly, weekly — whatever your sources' change rate justifies) and upsert by your deterministic IDs so only changed chunks are rewritten. This is the same pattern any web-backed RAG system needs — covered end to end in building a RAG pipeline with CRW and scheduled crawls with cron.
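To rewrite only changed chunks, one option is to keep a content hash per object ID between runs and compare it against the fresh crawl. The hash store is an addition of this sketch, not something fastCRW or Weaviate provides; where you persist it (a file, a table, a property on the object) is up to you.

```python
import hashlib


def content_hash(text):
    """Fingerprint a chunk so unchanged text can be skipped."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(previous_hashes, fresh_chunks):
    """Compare a fresh crawl against the last run.

    previous_hashes: {object_id: content_hash} saved from the last run
    fresh_chunks:    {object_id: chunk_text} from the crawl just completed
    Returns (ids to upsert, ids to delete, hashes to save for next time).
    """
    fresh_hashes = {oid: content_hash(text) for oid, text in fresh_chunks.items()}
    to_upsert = sorted(
        oid for oid, digest in fresh_hashes.items()
        if previous_hashes.get(oid) != digest
    )
    to_delete = sorted(oid for oid in previous_hashes if oid not in fresh_hashes)
    return to_upsert, to_delete, fresh_hashes
```

Treat the delete list with care: it assumes the fresh crawl covered the same scope as the previous one, so a partial or failed crawl would wrongly mark live content as gone.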

Cost, footprint, and self-hosting

The cost of this pipeline splits cleanly across the two layers.

Crawl credits vs Weaviate hosting

On the ingestion side you pay fastCRW credits: 1 per crawled page (any renderer — no chrome surcharge), 1 per scrape, 1 per map call. On the storage side you pay for Weaviate — whether that's Weaviate Cloud or your own instance, plus the embedding cost if you use a paid vectorizer. The two are independent: a 1,000-page crawl is ~1,000 credits regardless of how Weaviate is hosted. For current credit allowances per plan, see /pricing.
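The flat pricing makes the ingestion side easy to forecast. The arithmetic below is a sketch using the per-call rates stated above; the function names are hypothetical, and it assumes every scheduled run recrawls the full page set and issues one map call first.

```python
def estimate_credits(crawled_pages=0, scrapes=0, map_calls=0):
    """Flat rates from the text: 1 credit per crawled page, scrape, or map call."""
    return crawled_pages + scrapes + map_calls


def monthly_credits(pages_per_run, runs_per_month, map_calls_per_run=1):
    """Credits for a recurring crawl that maps the site before each run."""
    per_run = estimate_credits(crawled_pages=pages_per_run, map_calls=map_calls_per_run)
    return per_run * runs_per_month


print(estimate_credits(crawled_pages=1000, map_calls=1))       # 1001
print(monthly_credits(pages_per_run=1000, runs_per_month=30))  # 30030
```

Weaviate hosting and vectorizer costs sit outside this estimate and depend on your provider.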

Run the open-core engine alongside self-hosted Weaviate for $0

This is where the footprint story matters. fastCRW's engine is a single static Rust binary: a roughly 8 MB Docker image in one container, with no Redis or Node.js required. Weaviate runs as its own container. That means the entire ingestion-plus-vectors stack can sit on one box: self-hosted Weaviate next to the self-hosted fastCRW engine, with scraped content never leaving your infrastructure. Self-hosting the AGPL-3.0 engine is free (you pay only for the server), so for a self-managed Weaviate deployment the web-ingestion layer adds $0 in per-page bills. A heavy cloud-only scraper cannot offer that co-location.

Honest limits

Three boundaries are worth stating before you build:

fastCRW does ingestion, not vector indexing. It produces clean markdown; it does not embed, store, or rank. All vectorization, ANN indexing, and hybrid search live in Weaviate. If you want a turnkey vector store, fastCRW is not it — it is deliberately the layer in front of one.
LLM extraction is managed and requires a paid plan. If you use formats: ["json"] to pull structured fields during ingestion, that runs on fastCRW's managed LLM on a paid plan (or your own model endpoint if you self-host). Plain markdown ingestion uses no LLM, so this only bites structured-extraction pipelines.
Single-URL extraction and stateless requests. There is no multi-URL batch /v1/extract — for many URLs you iterate /v1/scrape concurrently or use /v1/crawl. Each request is stateless, which is irrelevant for text ingestion but worth knowing. (Screenshots are supported, if you ever need them: a formats: ["screenshot"] request returns data.screenshot as a base64 PNG data URL.)
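Because each request is stateless, iterating /v1/scrape concurrently is a plain fan-out. The sketch below uses a thread pool and takes the per-URL call as a parameter: `scrape_one` is a hypothetical function you would write to POST one URL to /v1/scrape and return its markdown. The worker count is an arbitrary default, so size it to whatever rate your plan or server tolerates.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_many(urls, scrape_one, max_workers=8):
    """Run scrape_one(url) concurrently; return (results, errors) keyed by URL."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {url: pool.submit(scrape_one, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed page should not abort the batch
                errors[url] = exc
    return results, errors
```

Collecting errors separately lets a refresh job retry only the failed URLs instead of rerunning the whole list.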

Within those boundaries, the pairing is clean: fastCRW gets accurate, well-structured text into your pipeline, and Weaviate turns it into fast hybrid semantic search. For other vector stores, see best vector databases, and for a managed-Pinecone variant of this same pattern, Pinecone + fastCRW.

Sources

Scrape benchmark (truth-recall, p50/p90): bench/server-runs/RESULT_3WAY_1000_FULL.md · diagnose_3way.py, 819 labeled URLs, 2026-05-08 — see /benchmarks
fastCRW repo and pricing: github.com/us/crw · /pricing
Weaviate hybrid search docs: weaviate.io (verify against current Weaviate version)