
Vector Database Ingestion with fastCRW — Pinecone, Chroma, Weaviate, Qdrant, pgvector, Milvus

Crawl any domain into clean markdown with fastCRW, chunk it, embed it, and bulk-insert into your vector database of choice — Pinecone, Chroma, Weaviate, Qdrant, pgvector/Supabase, or Milvus. One hub, six stores.

Published: May 27, 2026

Updated: June 24, 2026

What this hub covers

This hub is for teams that need to turn a documentation site, knowledge base, product catalog, or public web corpus into a vector index, not one URL at a time but the entire domain in one ingestion run. The bottleneck is rarely the embedding model or the vector store. It is the scrape quality, the queue between crawl and embedder, and the single-row insert pattern that turns a 30-minute job into a 6-hour one.

This page covers the full pipeline from the fastCRW crawl step through to six specific vector stores, with real runnable code for each. If you already have a Firecrawl-based ingestion job, every example below is a base-URL swap — the API shape is identical.

The scrape → chunk → embed → store pipeline

Why clean text is the ceiling

An embedding model can only encode the text it receives. When a page's main article shares the input with a 40-link footer, a cookie banner, three "related posts" widgets, and nav markup, the vector becomes a blend of the content you want and the boilerplate you don't. Two genuinely different articles that share a site template can end up closer in vector space than they should be, because the shared DOM chrome dominates.
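
The effect can be illustrated without an embedding model. The sketch below uses bag-of-words cosine similarity as a crude stand-in for an embedding (an assumption for illustration only; real embedding models behave differently in detail), and the page texts are invented. It compares two unrelated articles with and without a shared block of site chrome.

```python
import math
import re
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Hypothetical site chrome (nav, cookie banner, footer) repeated on every page
CHROME = "home pricing docs blog login accept cookies related posts privacy terms contact " * 3
# Two invented articles with no vocabulary in common
ARTICLE_A = "rotate api keys every ninety days and store secrets in a vault"
ARTICLE_B = "tune hnsw parameters to trade recall against query latency"

raw = bow_cosine(CHROME + ARTICLE_A, CHROME + ARTICLE_B)   # boilerplate left in
clean = bow_cosine(ARTICLE_A, ARTICLE_B)                   # boilerplate stripped
print(f"with shared chrome: {raw:.2f}  chrome stripped: {clean:.2f}")
```

With the chrome left in, the two unrelated articles look nearly identical; with it stripped, they share nothing. That is the gap the extraction layer is meant to close.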

fastCRW's role is narrow and deliberate: it turns a live URL into clean, structured markdown with the heading hierarchy intact and boilerplate stripped. It does not store vectors, run similarity search, or replace any of the stores below — it is the extraction layer that feeds them good text.

Accuracy numbers (source: bench/server-runs/RESULT_3WAY_1000_FULL.md, diagnose_3way.py, 2026-05-08):

Tool	Truth-recall (819 labeled URLs)	p50 latency	p90 latency (fast mode)
fastCRW	63.74% (522)	1,914 ms	4,348 ms
Crawl4AI	59.95% (491)	1,916 ms	4,754 ms
Firecrawl	56.04% (459)	2,305 ms	6,937 ms

All three tools threw 0 errors across the 3,000-request run. fastCRW's 91.8% scrape-success (of reachable URLs) pairs with the highest truth-recall and the lowest p90 in fast mode. The chrome-stealth fallback recovers the 34 URLs only fastCRW reaches — 70% more unique recoveries than Crawl4AI and Firecrawl combined — and in fast mode that same mechanism delivers the best p90 of the three. For scheduled batch ingestion, this is a direct quality advantage.
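
As a quick check on the table, the truth-recall percentages follow directly from the hit counts in parentheses divided by the 819 labeled URLs. A minimal sketch of that arithmetic (the function name is illustrative):

```python
LABELED_URLS = 819  # size of the labeled set reported in the table

# Hit counts as reported in the table above
hits = {"fastCRW": 522, "Crawl4AI": 491, "Firecrawl": 459}

def recall_pct(found: int, total: int = LABELED_URLS) -> float:
    """Truth-recall as a percentage, rounded to two decimals."""
    return round(100 * found / total, 2)

for tool, found in hits.items():
    print(f"{tool}: {recall_pct(found)}%")
```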

API shape — Firecrawl-compatible REST

POST /v1/map    → discover all URLs on a site (1 credit)
POST /v1/crawl  → start async BFS crawl, returns job_id (1 credit/page, any renderer)
GET  /v1/crawl/:id → status + per-page markdown results
POST /v1/scrape → single-URL fetch (1 credit, any renderer)

Caps: maxDepth up to 10, maxPages up to 1,000 per crawl job. For corpora larger than 1,000 pages, partition by subdomain or sitemap segment and run multiple jobs.
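
One way to plan that partitioning is to group the URL list returned by /v1/map by host and first path segment, then check which groups still exceed the cap. The sketch below is an assumption about how you might organize this, not part of the fastCRW API; the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

MAX_PAGES_PER_JOB = 1000  # the per-job cap stated above

def group_by_section(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs under a seed of the form scheme://host/first-path-segment."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        parsed = urlparse(url)
        segments = [s for s in parsed.path.split("/") if s]
        first = segments[0] if segments else ""
        groups[f"{parsed.scheme}://{parsed.netloc}/{first}"].append(url)
    return dict(groups)

def oversized_sections(groups: dict[str, list[str]], cap: int = MAX_PAGES_PER_JOB) -> list[str]:
    """Seeds whose group is still too large for one job and needs a finer split."""
    return sorted(seed for seed, members in groups.items() if len(members) > cap)
```

Each seed that fits under the cap can become one bounded crawl job; seeds reported by oversized_sections need a further split, for example on the second path segment.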

Shared crawl + chunk helper

Every vendor section below assumes you have pages in hand from this shared crawl step. Set the CRW_BASE_URL environment variable to switch between the managed cloud (the default) and a self-hosted engine:

# crw_helpers.py — shared across all vendor examples
import os
import time
import hashlib
import requests

BASE_URL = os.environ.get("CRW_BASE_URL", "https://api.fastcrw.com/v1")
HEADERS = {"Authorization": f"Bearer {os.environ['CRW_API_KEY']}"}

def map_site(root_url: str) -> list[str]:
    """Discover all URLs on a site (1 credit)."""
    r = requests.post(f"{BASE_URL}/map", json={"url": root_url}, headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json().get("urls", [])

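def crawl(seed: str, max_depth: int = 3, max_pages: int = 200,
          timeout_s: float = 3600.0) -> list[dict]:
    """Async BFS crawl. Returns page dicts with 'markdown' and 'metadata' (incl. sourceURL)."""
    r = requests.post(
        f"{BASE_URL}/crawl",
        json={"url": seed, "maxDepth": max_depth, "maxPages": max_pages,
              "scrapeOptions": {"formats": ["markdown"]}},
        headers=HEADERS, timeout=30,
    )
    r.raise_for_status()
    job_id = r.json()["data"]["id"]

    # Poll until the job finishes, but give up after timeout_s instead of looping forever
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        r = requests.get(f"{BASE_URL}/crawl/{job_id}", headers=HEADERS, timeout=30)
        r.raise_for_status()
        body = r.json()["data"]
        if body["status"] in ("completed", "failed"):
            # A failed job returns whatever pages were scraped before the failure
            return body.get("data", [])
        time.sleep(6)
    raise TimeoutError(f"crawl job {job_id} did not finish within {timeout_s}s")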

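def chunk_markdown(text: str, max_tokens: int = 800, overlap: int = 80) -> list[str]:
    """
    Split markdown on ## headings first, then enforce a size cap per chunk.
    Heading-aware splitting keeps sections topically coherent.
    The cap counts whitespace-separated words as a rough proxy for tokens.
    Oversized sections are re-joined with single spaces, so their line breaks are lost.
    """
    import re
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    sections = re.split(r"\n(?=## )", text)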
    chunks: list[str] = []
    for section in sections:
        words = section.split()
        if len(words) <= max_tokens:
            if section.strip():
                chunks.append(section.strip())
        else:
            for i in range(0, len(words), max_tokens - overlap):
                part = " ".join(words[i : i + max_tokens])
                if part.strip():
                    chunks.append(part.strip())
    return chunks

def content_hash(text: str) -> str:
    return hashlib.md5(text.encode()).hexdigest()
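
To see what the overlap window in chunk_markdown does to an oversized section, it helps to compute where each window starts. This is a small sketch of the same stepping arithmetic (step = max_tokens - overlap); window_starts is a hypothetical name used only here.

```python
def window_starts(n_words: int, max_tokens: int = 800, overlap: int = 80) -> list[int]:
    """Start offsets of the sliding windows used for a section of n_words words."""
    step = max_tokens - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than max_tokens")
    return list(range(0, n_words, step))

print(window_starts(2000))  # three windows for a 2,000-word section
print(window_starts(801))   # two windows; the second adds a single new word
```

The second call shows an edge case worth knowing about: a section just over the cap yields a trailing chunk that is almost entirely overlap. If that matters for your retriever, merging short tails into the previous chunk is one option.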

Pinecone — managed, serverless vector storage

Pinecone is the lowest-ops path: no cluster to manage, no server to provision, serverless autoscaling. The main discipline is freshness: a one-time bulk import captures a snapshot, and retrieval quality decays at exactly the rate your sources update.

When to use Pinecone

You want the fastest possible time-to-production for a vector search feature
Ops overhead of running a vector DB yourself is not acceptable
You need serverless autoscaling without cluster management

Pinecone ingest + refresh

# pinecone_ingest.py
import os
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec
from crw_helpers import crawl, chunk_markdown, content_hash

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
oai = OpenAI()

INDEX_NAME = "web-docs"
EMBED_MODEL = "text-embedding-3-small"
EMBED_DIM = 1536

def get_or_create_index():
    if INDEX_NAME not in [i.name for i in pc.list_indexes()]:
        pc.create_index(
            name=INDEX_NAME,
            dimension=EMBED_DIM,
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),
        )
    return pc.Index(INDEX_NAME)

def embed_texts(texts: list[str]) -> list[list[float]]:
    if not texts:  # whitespace-only pages produce no chunks; skip the API call
        return []
    resp = oai.embeddings.create(model=EMBED_MODEL, input=texts)
    return [d.embedding for d in resp.data]

def upsert_to_pinecone(index, pages: list[dict]) -> int:
    vectors = []
    for page in pages:
        url = page.get("metadata", {}).get("sourceURL", "")
        body = page.get("markdown", "")
        if not body:
            continue
        chunks = chunk_markdown(body)
        embeddings = embed_texts(chunks)
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
            vec_id = f"{content_hash(url)}-{i}"
            vectors.append({
                "id": vec_id,
                "values": emb,
                "metadata": {
                    "source_url": url,
                    "chunk_index": i,
                    "content": chunk[:1000],       # Pinecone metadata has a size limit
                    "content_hash": content_hash(chunk),
                },
            })
    # Batch upsert in groups of 100
    for start in range(0, len(vectors), 100):
        index.upsert(vectors=vectors[start : start + 100])
    return len(vectors)

def main(seed: str) -> None:
    index = get_or_create_index()
    pages = crawl(seed, max_depth=3, max_pages=200)
    count = upsert_to_pinecone(index, pages)
    print(f"Upserted {count} chunks from {len(pages)} pages into Pinecone index '{INDEX_NAME}'")

if __name__ == "__main__":
    main("https://docs.fastcrw.com")

Dedup on scheduled refresh: On each re-crawl, compare the content_hash stored in each vector's Pinecone metadata against the hash of the freshly crawled chunk with the same ID, and skip re-embedding chunks whose hashes match. A nightly refresh of a 200-page docs site typically re-embeds fewer than 10 pages.

Chroma — embedded, in-process, runs on a $5 VPS

Chroma's embedded mode runs as an in-process Python library writing to local files. No Docker Compose, no external service — combine it with a self-hosted fastCRW binary and the entire RAG stack fits on a 1 GB VPS.

When to use Chroma

Local development or hobby projects with a fixed, modest corpus
You want to evaluate RAG quality before committing to a managed service
Budget constraint makes $0-per-page self-host attractive
You want on-disk persistence without running an external database service

Chroma ingest

# chroma_ingest.py — runs on the same box as the fastCRW binary
import os
import chromadb
from openai import OpenAI
from crw_helpers import crawl, chunk_markdown, content_hash

# Embedded mode: PersistentClient writes to a local directory
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection(
    name="web_docs",
    metadata={"hnsw:space": "cosine"},
)
oai = OpenAI()

def embed_texts(texts: list[str]) -> list[list[float]]:
    if not texts:  # nothing to embed
        return []
    resp = oai.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def ingest_pages(pages: list[dict]) -> int:
    all_ids, all_docs, all_metas, all_embs = [], [], [], []
    for page in pages:
        url = page.get("metadata", {}).get("sourceURL", "")
        body = page.get("markdown", "")
        if not body:
            continue
        chunks = chunk_markdown(body)
        embeddings = embed_texts(chunks)
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
            chunk_id = f"{content_hash(url)}-{i}"
            all_ids.append(chunk_id)
            all_docs.append(chunk)
            all_metas.append({"url": url, "chunk_index": i,
                               "content_hash": content_hash(chunk)})
            all_embs.append(emb)

    # Chroma upserts: existing IDs are overwritten, new IDs are inserted
    if all_ids:
        collection.upsert(
            ids=all_ids,
            documents=all_docs,
            metadatas=all_metas,
            embeddings=all_embs,
        )
    return len(all_ids)

def query(question: str, n_results: int = 5) -> list[dict]:
    q_emb = embed_texts([question])[0]
    results = collection.query(query_embeddings=[q_emb], n_results=n_results)
    return [
        {"text": doc, "url": meta["url"]}
        for doc, meta in zip(results["documents"][0], results["metadatas"][0])
    ]

def main(seed: str) -> None:
    pages = crawl(seed, max_depth=2, max_pages=100)
    count = ingest_pages(pages)
    print(f"Ingested {count} chunks from {len(pages)} pages into Chroma")

    # Demo query
    hits = query("how do I authenticate with the API?")
    for hit in hits:
        print(f"[{hit['url']}] {hit['text'][:120]}...")

if __name__ == "__main__":
    main("https://docs.fastcrw.com")

VPS sizing note: fastCRW's engine is a single ~8 MB Docker image in one container. Chroma in embedded mode adds only its on-disk index. The real memory ceiling is the embedding step — a hosted API (OpenAI, etc.) keeps the VPS load low; a local transformer model may push you to 2 GB RAM.
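
For a rough sense of whether a corpus fits on a small box, the raw vector payload can be estimated from chunk count and embedding dimension. The sketch below assumes 4-byte float32 values and counts only the vectors themselves; index graph overhead, document text, and metadata come on top, so treat the result as a lower bound.

```python
def raw_vector_mib(n_chunks: int, dim: int = 1536, bytes_per_float: int = 4) -> float:
    """Size of the raw embedding values in MiB (no index or metadata overhead)."""
    return n_chunks * dim * bytes_per_float / (1024 ** 2)

# 100,000 chunks of 1536-dimensional embeddings
print(f"{raw_vector_mib(100_000):.1f} MiB")
```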

Weaviate — hybrid BM25 + vector search from web sources

Weaviate's hybrid search blends keyword (BM25) and vector similarity, controlled by an alpha weight. Clean extraction matters doubly here: the vector half degrades when boilerplate dilutes the chunk, and the BM25 half degrades when exact terms are hidden in DOM noise.
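
To build intuition for the alpha weight, here is a simplified score-blending sketch. It is a stand-in for illustration, not Weaviate's implementation: Weaviate applies its own fusion algorithms server-side, and the min-max normalization and function names below are assumptions.

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale scores to [0, 1]; a constant score set maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def blend(bm25: dict[str, float], vector: dict[str, float], alpha: float = 0.5) -> list[tuple[str, float]]:
    """Weighted blend: alpha=0 uses only keyword scores, alpha=1 only vector scores."""
    b, v = normalize(bm25), normalize(vector)
    fused = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * b.get(doc_id, 0.0)
        for doc_id in b.keys() | v.keys()
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Invented scores: "a" wins on keywords, "b" wins on semantic similarity
bm25_scores = {"a": 10.0, "b": 2.0}
vector_scores = {"a": 0.2, "b": 0.9}
print(blend(bm25_scores, vector_scores, alpha=0.0)[0][0])
print(blend(bm25_scores, vector_scores, alpha=1.0)[0][0])
```

Sliding alpha between the extremes moves the top result from the keyword winner to the semantic winner, which is why noisy text hurts at both ends of the range.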

When to use Weaviate

You need hybrid search (keyword + semantic) in a single query path
Your corpus includes heterogeneous sources where keyword recall matters alongside semantic recall
You want to co-locate ingestion and vector storage on one self-hosted box

Weaviate ingest

# weaviate_ingest.py
import os
import weaviate
import weaviate.classes as wvc
from openai import OpenAI
from crw_helpers import crawl, chunk_markdown, content_hash

client = weaviate.connect_to_local()  # or weaviate.connect_to_weaviate_cloud(...)
oai = OpenAI()

COLLECTION_NAME = "WebDoc"

def get_or_create_collection():
    if client.collections.exists(COLLECTION_NAME):
        return client.collections.get(COLLECTION_NAME)
    return client.collections.create(
        name=COLLECTION_NAME,
        vectorizer_config=wvc.config.Configure.Vectorizer.none(),  # BYO embeddings
        properties=[
            wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT),
            wvc.config.Property(name="source_url", data_type=wvc.config.DataType.TEXT),
            wvc.config.Property(name="chunk_index", data_type=wvc.config.DataType.INT),
            wvc.config.Property(name="content_hash", data_type=wvc.config.DataType.TEXT),
            wvc.config.Property(name="crawled_at", data_type=wvc.config.DataType.DATE),
        ],
    )

def embed_texts(texts: list[str]) -> list[list[float]]:
    if not texts:  # nothing to embed
        return []
    resp = oai.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def ingest_pages(pages: list[dict]) -> int:
    import uuid
    from datetime import datetime, timezone
    collection = get_or_create_collection()
    now = datetime.now(timezone.utc).isoformat()
    total = 0
    with collection.batch.dynamic() as batch:
        for page in pages:
            url = page.get("metadata", {}).get("sourceURL", "")
            body = page.get("markdown", "")
            if not body:
                continue
            chunks = chunk_markdown(body)
            embeddings = embed_texts(chunks)
            for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
                # Deterministic UUID from URL + chunk_index for upsert-safe IDs
                obj_uuid = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{url}#{i}"))
                batch.add_object(
                    properties={
                        "text": chunk,
                        "source_url": url,
                        "chunk_index": i,
                        "content_hash": content_hash(chunk),
                        "crawled_at": now,
                    },
                    vector=emb,
                    uuid=obj_uuid,
                )
                total += 1
    return total

def hybrid_query(question: str, alpha: float = 0.5, limit: int = 5) -> list[dict]:
    """alpha=0 is pure BM25, alpha=1 is pure vector, 0.5 is balanced hybrid."""
    q_emb = embed_texts([question])[0]
    collection = client.collections.get(COLLECTION_NAME)
    response = collection.query.hybrid(
        query=question,
        vector=q_emb,
        alpha=alpha,
        limit=limit,
        return_properties=["text", "source_url"],
    )
    return [{"text": o.properties["text"], "url": o.properties["source_url"]}
            for o in response.objects]

def main(seed: str) -> None:
    try:
        pages = crawl(seed, max_depth=3, max_pages=150)
        count = ingest_pages(pages)
        print(f"Imported {count} chunks into Weaviate collection '{COLLECTION_NAME}'")

        hits = hybrid_query("authentication token expiry")
        for hit in hits:
            print(f"[{hit['url']}] {hit['text'][:120]}...")
    finally:
        client.close()  # release the client's open connections

if __name__ == "__main__":
    main("https://docs.fastcrw.com")

Qdrant — air-gapped, zero data egress, fully self-hosted

When scraped content or even target URLs cannot touch a third-party cloud — defense, health, finance, on-prem enterprise — the usual "crawl with a hosted API" recipe is a non-starter at step one. The blocker is data egress. Qdrant and fastCRW together close that loop: both are self-hostable, and neither requires an outbound vendor call in the critical path.

When to use Qdrant

Data-residency or air-gapped requirements make hosted APIs impossible
You want HNSW indexing with rich payload filtering in a self-hosted service
Internal intranet crawl where target URLs are themselves sensitive

Qdrant ingest

# qdrant_ingest.py — all traffic stays on your infra when self-hosted
import os
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, UpdateStatus
)
from crw_helpers import crawl, chunk_markdown, content_hash

# Self-hosted: point at your local Qdrant instance
qdrant = QdrantClient(url=os.environ.get("QDRANT_URL", "http://localhost:6333"))

COLLECTION_NAME = "web_docs"
EMBED_DIM = 1536

# For local/air-gapped embeddings, swap out for a local model via sentence-transformers
# or Ollama. The example below uses OpenAI for illustration; for true air-gap use a
# locally-served embedding endpoint.
from openai import OpenAI
oai = OpenAI(base_url=os.environ.get("EMBED_BASE_URL"))  # point at local Ollama if needed

def ensure_collection():
    existing = [c.name for c in qdrant.get_collections().collections]
    if COLLECTION_NAME not in existing:
        qdrant.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=EMBED_DIM, distance=Distance.COSINE),
        )

def embed_texts(texts: list[str]) -> list[list[float]]:
    if not texts:  # nothing to embed
        return []
    resp = oai.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def ingest_pages(pages: list[dict]) -> int:
    import time
    ensure_collection()
    crawled_at = time.time()  # epoch seconds; query() filters on this payload field
    points: list[PointStruct] = []
    for page in pages:
        url = page.get("metadata", {}).get("sourceURL", "")
        body = page.get("markdown", "")
        if not body:
            continue
        chunks = chunk_markdown(body)
        embeddings = embed_texts(chunks)
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
            # Deterministic int ID from an MD5 digest so re-crawls overwrite in place.
            # Python's built-in hash() is salted per process and would change between runs.
            point_id = int(content_hash(f"{url}#{i}")[:15], 16)  # 60 bits, fits a 64-bit ID
            points.append(PointStruct(
                id=point_id,
                vector=emb,
                payload={
                    "source_url": url,
                    "chunk_index": i,
                    "content": chunk,
                    "content_hash": content_hash(chunk),
                    "crawled_at": crawled_at,
                },
            ))

    # Upsert in batches of 200
    for start in range(0, len(points), 200):
        batch = points[start : start + 200]
        result = qdrant.upsert(collection_name=COLLECTION_NAME, points=batch)
        assert result.status == UpdateStatus.COMPLETED
    return len(points)

def query(question: str, limit: int = 5, days_fresh: int = 30) -> list[dict]:
    from qdrant_client.models import Filter, FieldCondition, Range
    import time
    q_emb = embed_texts([question])[0]
    cutoff = time.time() - days_fresh * 86400
    results = qdrant.search(
        collection_name=COLLECTION_NAME,
        query_vector=q_emb,
        limit=limit,
        query_filter=Filter(
            must=[FieldCondition(key="crawled_at", range=Range(gte=cutoff))]
        ) if days_fresh else None,
    )
    return [{"text": r.payload["content"], "url": r.payload["source_url"]}
            for r in results]

def main(seed: str) -> None:
    pages = crawl(seed, max_depth=3, max_pages=200)
    count = ingest_pages(pages)
    print(f"Upserted {count} chunks from {len(pages)} pages into Qdrant")

if __name__ == "__main__":
    main("https://wiki.internal/docs")

Air-gap note: For a fully air-gapped stack, set CRW_BASE_URL to point at your self-hosted fastCRW engine (e.g. http://fastcrw-internal:3002/v1) and set EMBED_BASE_URL to a locally-served Ollama or vLLM embedding endpoint. Also change the model name in embed_texts and EMBED_DIM to match the local embedding model, since the example defaults are OpenAI's. The crawl target can be an internal hostname. With those settings, no outbound vendor call is made anywhere in this path.
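
Before running an air-gapped job, it can be worth checking that none of the configured endpoints point outside your network. The sketch below is a heuristic pre-flight check under stated assumptions: the ".internal" and ".local" suffixes are placeholders for your own internal domains, single-label hostnames are assumed to resolve only on the local network, and unset variables are flagged because the defaults may point at hosted services. It does not replace network-level egress controls.

```python
import ipaddress
from urllib.parse import urlparse

def is_internal(url: str, internal_suffixes: tuple[str, ...] = (".internal", ".local")) -> bool:
    """Heuristic: True if the URL's host looks like it stays on the local network."""
    host = urlparse(url).hostname or ""
    if not host:
        return False  # unset or unparsable: treat as not proven internal
    try:
        return ipaddress.ip_address(host).is_private  # literal IP: private or loopback ranges
    except ValueError:
        pass  # not an IP literal, fall through to hostname rules
    return host == "localhost" or "." not in host or host.endswith(internal_suffixes)

def external_endpoints(
    env: dict[str, str],
    keys: tuple[str, ...] = ("CRW_BASE_URL", "EMBED_BASE_URL", "QDRANT_URL"),
) -> list[str]:
    """Names of the endpoint variables that are unset or look external."""
    return [key for key in keys if not is_internal(env.get(key, ""))]
```

In a job entrypoint you might call external_endpoints(dict(os.environ)) and refuse to start if the returned list is non-empty.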

pgvector / Supabase — vectors in your existing Postgres database

If you already run Postgres or Supabase, you do not need a separate vector service. The pgvector extension adds a vector column type and cosine-distance operators, so retrieval is a single SQL SELECT: metadata filtering in the WHERE clause and vector ranking in the ORDER BY of the same query, transactionally, with no second system to sync.

When to use pgvector

You already run Postgres or Supabase
You want to filter vectors by tenant, recency, or access level in the same query
Under a few million vectors — pgvector handles this well without dedicated vector DB overhead

pgvector / Supabase ingest

# pgvector_ingest.py
import os
import psycopg
from openai import OpenAI
from crw_helpers import crawl, chunk_markdown, content_hash

conn_str = os.environ["DATABASE_URL"]  # postgres://user:pass@host:5432/db
oai = OpenAI()

SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS web_chunks (
    id           BIGSERIAL PRIMARY KEY,
    source_url   TEXT NOT NULL,
    chunk_index  INT NOT NULL,
    content      TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    crawled_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    embedding    vector(1536),
    UNIQUE (content_hash)
);

-- Build HNSW index after first bulk load, not before
-- CREATE INDEX ON web_chunks USING hnsw (embedding vector_cosine_ops);
"""

def setup_db() -> None:
    with psycopg.connect(conn_str) as conn:
        conn.execute(SETUP_SQL)

def embed_texts(texts: list[str]) -> list[list[float]]:
    if not texts:  # nothing to embed
        return []
    resp = oai.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def ingest_pages(pages: list[dict]) -> int:
    setup_db()
    rows: list[tuple] = []
    for page in pages:
        url = page.get("metadata", {}).get("sourceURL", "")
        body = page.get("markdown", "")
        if not body:
            continue
        chunks = chunk_markdown(body)
        embeddings = embed_texts(chunks)
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
            rows.append((url, i, chunk, content_hash(chunk), emb))

    with psycopg.connect(conn_str) as conn:
        conn.executemany(
            """
            INSERT INTO web_chunks (source_url, chunk_index, content, content_hash, embedding)
            VALUES (%s, %s, %s, %s, %s::vector)
            ON CONFLICT (content_hash) DO UPDATE
                SET source_url   = EXCLUDED.source_url,
                    chunk_index  = EXCLUDED.chunk_index,
                    content      = EXCLUDED.content,
                    crawled_at   = NOW(),
                    embedding    = EXCLUDED.embedding
            """,
            rows,
        )
        conn.commit()
    return len(rows)

def query(question: str, limit: int = 8) -> list[dict]:
    q_emb = embed_texts([question])[0]
    # <=> is pgvector's cosine-distance operator (smaller is closer);
    # add WHERE clauses on columns such as crawled_at for metadata filtering
    sql = """
        SELECT content, source_url
        FROM web_chunks
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    with psycopg.connect(conn_str) as conn:
        rows = conn.execute(sql, (str(q_emb), limit)).fetchall()
    return [{"text": r[0], "url": r[1]} for r in rows]

def main(seed: str) -> None:
    pages = crawl(seed, max_depth=3, max_pages=300)
    count = ingest_pages(pages)
    print(f"Upserted {count} chunks from {len(pages)} pages into pgvector")

    hits = query("rate limiting and quotas")
    for hit in hits:
        print(f"[{hit['url']}] {hit['text'][:120]}...")

if __name__ == "__main__":
    main("https://docs.fastcrw.com")

Supabase note: Enable pgvector once with CREATE EXTENSION IF NOT EXISTS vector; in the Supabase SQL editor. Build the HNSW index (CREATE INDEX ON web_chunks USING hnsw (embedding vector_cosine_ops)) after the first bulk load, not before. If the index already exists, every inserted row also has to update it, which slows the load.

Milvus — enterprise scale, partitioned collections, billions of vectors

Milvus is built for enterprise-scale RAG: sharding, partitions, HNSW/IVF indexes, and a bulk-insert API optimized for millions-of-rows loads. The ingestion bottleneck at this scale is upstream quality — a perfect HNSW index over garbage chunks still returns garbage. fastCRW's recall lead directly translates to higher signal in the chunks Milvus indexes.

When to use Milvus

Corpus is in the hundreds of millions to billions of vectors
You need partitioned collections for multi-tenant or multi-source isolation
Your team already runs Milvus or Zilliz Cloud

Milvus ingest

# milvus_ingest.py
import os
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema,
    DataType, utility
)
from openai import OpenAI
from crw_helpers import crawl, chunk_markdown, content_hash

connections.connect(
    alias="default",
    uri=os.environ.get("MILVUS_URI", "http://localhost:19530"),
    token=os.environ.get("MILVUS_TOKEN", ""),
)
oai = OpenAI()

COLLECTION_NAME = "web_docs"
EMBED_DIM = 1536

def get_or_create_collection() -> Collection:
    if utility.has_collection(COLLECTION_NAME):
        coll = Collection(COLLECTION_NAME)
        coll.load()  # a collection must be loaded before it can be searched
        return coll

    fields = [
        FieldSchema(name="id",           dtype=DataType.INT64,   is_primary=True, auto_id=True),
        FieldSchema(name="source_url",   dtype=DataType.VARCHAR, max_length=2048),
        FieldSchema(name="chunk_index",  dtype=DataType.INT64),
        FieldSchema(name="content",      dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="content_hash", dtype=DataType.VARCHAR, max_length=64),
        FieldSchema(name="embedding",    dtype=DataType.FLOAT_VECTOR, dim=EMBED_DIM),
    ]
    schema = CollectionSchema(fields, description="Web corpus for RAG")
    coll = Collection(name=COLLECTION_NAME, schema=schema)
    # Build HNSW index on the vector field
    coll.create_index(
        field_name="embedding",
        index_params={"metric_type": "COSINE", "index_type": "HNSW",
                      "params": {"M": 16, "efConstruction": 200}},
    )
    coll.load()
    return coll

def embed_texts(texts: list[str]) -> list[list[float]]:
    if not texts:  # nothing to embed
        return []
    resp = oai.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def ingest_pages(pages: list[dict]) -> int:
    coll = get_or_create_collection()
    source_urls, chunk_indices, contents, hashes, embeddings = [], [], [], [], []

    for page in pages:
        url = page.get("metadata", {}).get("sourceURL", "")
        body = page.get("markdown", "")
        if not body:
            continue
        chunks = chunk_markdown(body)
        embs = embed_texts(chunks)
        for i, (chunk, emb) in enumerate(zip(chunks, embs)):
            source_urls.append(url)
            chunk_indices.append(i)
            contents.append(chunk[:65000])  # Milvus VARCHAR limit
            hashes.append(content_hash(chunk))
            embeddings.append(emb)

    # One column-oriented batch insert instead of row-by-row calls.
    # With auto_id=True, re-running this script appends duplicates rather than overwriting.
    if source_urls:
        data = [source_urls, chunk_indices, contents, hashes, embeddings]
        coll.insert(data)
        coll.flush()
    return len(source_urls)

def query(question: str, limit: int = 10, partition: str | None = None) -> list[dict]:
    coll = get_or_create_collection()
    q_emb = embed_texts([question])[0]
    search_params = {"metric_type": "COSINE", "params": {"ef": 100}}
    results = coll.search(
        data=[q_emb],
        anns_field="embedding",
        param=search_params,
        limit=limit,
        output_fields=["content", "source_url"],
        partition_names=[partition] if partition else None,
    )
    return [{"text": hit.entity.get("content"), "url": hit.entity.get("source_url")}
            for hit in results[0]]

def main(seed: str) -> None:
    pages = crawl(seed, max_depth=4, max_pages=500)
    count = ingest_pages(pages)
    print(f"Bulk-inserted {count} chunks from {len(pages)} pages into Milvus")

if __name__ == "__main__":
    main("https://docs.fastcrw.com")

Partitioning tip: For multi-tenant or multi-source enterprise RAG, create Milvus partitions keyed by source_domain or tenant_id. Drop and re-ingest a single partition without touching the rest of the index — essential when one source has a full re-crawl and others do not.
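
A partition key derived from the source domain needs to be a valid partition name. The sketch below assumes names should contain only letters, digits, and underscores, and the "src_" prefix and function names are illustrative choices, not a Milvus convention.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

def partition_name_for(url: str) -> str:
    """Map a source URL to a partition name built from its hostname."""
    host = urlparse(url).hostname or "unknown"
    return "src_" + re.sub(r"[^0-9A-Za-z_]", "_", host)

def group_urls_by_partition(urls: list[str]) -> dict[str, list[str]]:
    """Bucket URLs so each bucket can be inserted into its own partition."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        buckets[partition_name_for(url)].append(url)
    return dict(buckets)

print(partition_name_for("https://docs.fastcrw.com/api-reference/crawl/"))
```

With pymilvus, each bucket would then go through the collection's create_partition and insert calls with the partition name, and a full re-crawl of one source becomes a drop and re-insert of that partition only. Check the release and drop requirements for partitions in your Milvus version before relying on this.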

Vector DB comparison

	Pinecone	Chroma	Weaviate	Qdrant	pgvector	Milvus
Deployment	Managed-only	Embedded / self-host	Managed + self-host	Self-host / managed	Self-host (Postgres extension)	Self-host / Zilliz Cloud
Best for	Fastest ops path	Local / hobby RAG	Hybrid BM25+vector	Air-gapped / regulated	SQL-native, existing Postgres	Billions-of-vectors enterprise
Scale ceiling	Serverless autoscale	Millions of vecs	Tens of millions	Hundreds of millions	Tens of millions	Billions+
Metadata filtering	Yes (metadata map)	Yes (where clauses)	Yes (GraphQL / gRPC)	Yes (payload filters)	Yes (SQL WHERE)	Yes (bool expr)
Hybrid search	Sparse + dense vectors (not used here)	No	Yes (BM25 + vector)	Sparse + dense vectors (not used here)	Partial (full-text + vector)	Sparse + dense in recent versions (not used here)
Self-host license	N/A	Apache-2.0	BSD-3	Apache-2.0	PostgreSQL	Apache-2.0
Ops complexity	Very low	Very low	Medium	Medium	Low (if Postgres already runs)	High

All six accept the same clean markdown from fastCRW's /v1/crawl — only the storage call changes between examples.

Freshness and incremental re-crawl

A vector index built once and never refreshed drifts away from the live web at exactly the rate the source pages update. For anything customer-facing or agent-facing this is a slow-motion accuracy bug.

The dedup pattern is the same across all six stores:

Store a content_hash (MD5 or SHA-256 of the chunk text) alongside every vector.
On each scheduled re-crawl, fetch the current page and compute fresh hashes.
Skip re-embedding unchanged chunks (hashes match) — these tend to be the majority.
Re-embed and upsert only changed or new chunks.
Delete vectors whose source URL disappeared from the latest /v1/map output.
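
For a single page, the steps above reduce to comparing stored hashes with fresh ones. The following is a minimal sketch that assumes chunk IDs are position-based (URL plus chunk index, as in the examples on this page) and that you can read back a mapping of chunk index to stored content_hash; plan_page_refresh is a hypothetical helper, and SHA-256 is used here although MD5 works the same way.

```python
import hashlib

def chunk_hash(text: str) -> str:
    """SHA-256 hex digest of the chunk text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_page_refresh(stored: dict[int, str], fresh_chunks: list[str]) -> dict[str, list[int]]:
    """
    Decide what to do with each chunk position of one page.
    stored maps chunk_index -> content_hash read from the vector store's metadata.
    """
    embed, skip = [], []
    for i, chunk in enumerate(fresh_chunks):
        (skip if stored.get(i) == chunk_hash(chunk) else embed).append(i)
    # Positions that existed before but not in the fresh crawl are stale
    delete = sorted(i for i in stored if i >= len(fresh_chunks))
    return {"embed": embed, "skip": skip, "delete": delete}

stored = {0: chunk_hash("intro"), 1: chunk_hash("old body"), 2: chunk_hash("old tail")}
plan = plan_page_refresh(stored, ["intro", "new body"])
print(plan)
```

The delete list matters with position-based IDs: when a page shrinks, the trailing chunk positions are never overwritten by an upsert and would otherwise linger in the index.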

A 1,000-page corpus typically re-embeds fewer than 5% of pages per nightly refresh. The crawl still costs 1 credit per page on the managed cloud — but you can reduce even that by comparing /v1/map output against a stored URL-set and skipping known-unchanged sections before crawling.
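
The URL-set comparison can be sketched with plain set arithmetic. The function names are hypothetical, and the credit estimate assumes the pricing stated earlier on this page (1 credit for the map call, 1 credit per scraped page) and that only newly discovered URLs are fetched. A URL diff cannot tell you whether an existing page's content changed, so known URLs still need a periodic re-crawl.

```python
def diff_url_sets(previous: set[str], current: set[str]) -> dict[str, set[str]]:
    """Compare the stored URL set with the latest /v1/map output."""
    return {
        "added": current - previous,    # scrape and embed these
        "removed": previous - current,  # delete their vectors
        "kept": current & previous,     # candidates for hash-based dedup
    }

def credits_if_only_new(diff: dict[str, set[str]]) -> int:
    """One map call plus one scrape per newly discovered URL."""
    return 1 + len(diff["added"])

previous = {"https://docs.example.com/a", "https://docs.example.com/b"}
current = {"https://docs.example.com/b", "https://docs.example.com/c"}
diff = diff_url_sets(previous, current)
print(diff["added"], diff["removed"], credits_if_only_new(diff))
```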

Self-host vs managed cloud

	Self-hosted fastCRW	Managed fastcrw.com
Per-page cost	$0 (AGPL-3.0, pay only server)	1 credit/page (any renderer)
Ops burden	You manage binary updates and renderer	Zero (fully managed)
Data egress	None — stays on your infra	Pages processed on fastCRW cloud
Anti-bot	lightpanda (default) + chrome opt-in	Same + managed IP rotation
Best for	Air-gap / budget / high-volume batch	Fastest start, no-ops preference

For comparison, Firecrawl's hosted scraping runs $0.83–5.33 per 1,000 scrapes across its tiers. fastCRW's managed cloud uses the same Firecrawl-compatible API shape, so switching from one to the other is a base-URL swap.

RAG pipelines — broader retrieval-augmented generation patterns
Web scraping for RAG training data — turning crawls into fine-tuning datasets
Firecrawl alternatives — comparing managed scraping APIs
Pricing — per-plan credit allocations for managed crawls (verify current tiers before budgeting)


Sources

fastCRW /v1/crawl reference

https://docs.fastcrw.com/api-reference/crawl/

pgvector — open-source vector extension for Postgres

https://github.com/pgvector/pgvector

Pinecone batch upsert reference

https://docs.pinecone.io/guides/data/upsert-data

Chroma documentation

https://docs.trychroma.com

Weaviate hybrid search docs

https://weaviate.io/developers/weaviate/search/hybrid

Qdrant documentation

https://qdrant.tech/documentation/

Milvus documentation — bulk insert, partitions, index types

https://milvus.io/docs

FAQ

How big a crawl can fastCRW handle in one job?

/v1/crawl accepts maxDepth up to 10 and maxPages up to 1,000 per job. For larger corpora, partition by subdomain or sitemap segment and run multiple bounded jobs concurrently. The engine is light enough (single ~8 MB image, one container) to run several workers on a single VM.

Why crawl async instead of synchronous scrape loops?

A single HTTP request that takes 30+ minutes to return is fragile — proxies time out, load balancers reset, your worker process dies. The async crawl pattern returns a job id immediately, runs the BFS server-side, and lets your worker poll. fastCRW's Firecrawl-compatible API shape is designed for exactly this pattern.
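
The worker side of that pattern is a polling loop with a deadline. Below is a generic sketch with exponential backoff; the status fetch is injected as a callable so the loop can be exercised without a network. The function name, the default delays, and the "running" placeholder status are assumptions, while "completed" and "failed" are the terminal states used by the shared helper on this page.

```python
import time
from typing import Callable

def poll_until_done(
    fetch_status: Callable[[], dict],
    *,
    timeout_s: float = 1800.0,
    initial_delay_s: float = 2.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll a job until it reaches a terminal status or the deadline passes."""
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        body = fetch_status()
        if body.get("status") in ("completed", "failed"):
            return body
        if clock() + delay > deadline:
            raise TimeoutError(f"job still {body.get('status')!r} after {timeout_s}s")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # back off: 2s, 4s, 8s, ... capped

# Dry run with canned responses instead of HTTP calls
responses = iter([{"status": "running"}, {"status": "running"}, {"status": "completed", "data": []}])
waits: list[float] = []
result = poll_until_done(lambda: next(responses), sleep=waits.append)
print(result["status"], waits)
```

In a real worker, fetch_status would wrap the GET /v1/crawl/:id call and return the parsed job body.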

How accurate is the markdown that lands in the vector store?

fastCRW achieved 63.74% truth-recall on Firecrawl's public 819-labeled-URL scrape dataset (`diagnose_3way.py`, 2026-05-08) — the highest of the three tools tested (Crawl4AI 59.95%, Firecrawl 56.04%) — with 91.8% scrape-success of reachable URLs and 0 thrown errors across the 3,000-request run. Higher truth-recall at the scrape stage means fewer empty chunks and fewer junk vectors clogging your retriever.

Which vector database should I choose for my RAG pipeline?

It depends on your existing stack and scale. Pinecone is the managed path with the least ops overhead. pgvector/Supabase is best if you already run Postgres and want vectors in the same database as your relational data. Chroma is ideal for local or hobby stacks on tight budgets — it runs embedded in-process. Qdrant excels at air-gapped or data-residency-constrained pipelines. Weaviate adds hybrid (BM25 + vector) search out of the box. Milvus is the choice for enterprise scale (billions of vectors, partitioned collections). The ingestion code with fastCRW is almost identical across all six; only the storage call changes.
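The claim that only the storage call changes can be illustrated with a small store-agnostic loop. Everything here is a sketch: `Record`, `VectorStore`, `InMemoryStore`, and `ingest` are hypothetical names, and a real adapter would wrap one vendor client's batch-insert call behind the same `upsert` method.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Protocol, Sequence


@dataclass
class Record:
    id: str
    vector: Sequence[float]
    text: str
    metadata: dict = field(default_factory=dict)


class VectorStore(Protocol):
    """The only store-specific surface: one batch upsert call."""

    def upsert(self, records: Sequence[Record]) -> None: ...


class InMemoryStore:
    """Stand-in store for local testing; swap in a real adapter per database."""

    def __init__(self) -> None:
        self.rows: dict[str, Record] = {}
        self.calls = 0

    def upsert(self, records: Sequence[Record]) -> None:
        self.calls += 1
        for record in records:
            self.rows[record.id] = record


def batched(items: Iterable[Record], size: int) -> Iterator[list[Record]]:
    """Yield lists of up to `size` records, preserving order."""
    batch: list[Record] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def ingest(records: Iterable[Record], store: VectorStore, batch_size: int = 100) -> int:
    """Write records in batches instead of one row per call; return the count written."""
    total = 0
    for batch in batched(records, batch_size):
        store.upsert(batch)
        total += len(batch)
    return total
```

With this shape, moving from one database to another means writing a new class with an `upsert` method, while the crawl, chunk, and embed steps stay untouched.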

Can I self-host the whole stack for free?

Yes for AGPL-compatible projects. fastCRW self-hosts as a single ~8 MB Rust binary under AGPL-3.0 — no license fee, $0 per 1,000 scrapes, you pay only for the server. Chroma, Qdrant, Weaviate, Milvus, and pgvector are all open-source and self-hostable. The entire scrape → chunk → embed → store loop can run on your own infra.

What is the latency story for fastCRW at pipeline scale?

fastCRW's median scrape latency is competitive: p50 of 1,914 ms, beating Firecrawl's 2,305 ms in the 2026-05-08 run. In fast mode, fastCRW's p90 is 4,348 ms — the lowest of the three (Crawl4AI 4,754 ms, Firecrawl 6,937 ms). The chrome-stealth fallback that recovers hard URLs the other tools miss is what produces both the recall lead and the p90 win. For scheduled batch ingestion into a vector store, this combination of higher recall and low p90 is a clear advantage.
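To compare these figures against your own runs, you can compute p50 and p90 from recorded per-request timings. The sketch below uses the nearest-rank definition, which is one common convention. The benchmark's own percentile method is not stated here, so small differences from its numbers would not be surprising.

```python
def percentile(samples_ms: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("samples_ms must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range 1..100")
    ordered = sorted(samples_ms)
    # Integer ceiling of p * n / 100, avoiding float rounding.
    rank = -(-p * len(ordered) // 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    timings = [1200.0, 1850.0, 1900.0, 2100.0, 4300.0]
    print("p50:", percentile(timings, 50), "ms")
    print("p90:", percentile(timings, 90), "ms")
```

Record timings per URL rather than per batch, since a single slow page is what moves the p90.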
