Bulk Vector Database Ingestion with fastCRW

Crawl a whole domain into clean markdown, embed in batches, and bulk-insert into Pinecone, pgvector, or Qdrant — fastCRW's /v1/crawl makes the front of the vector pipeline a single async job.

Published: May 27, 2026
Updated: May 27, 2026
Category: use cases
Async /v1/crawl returns a job id you can poll, with no long-lived HTTP connection
Single static Rust binary, ~50 MB RAM idle (`crw-opencore/README.md`, structural footprint)
63.74% truth-recall on Firecrawl's 1,000-URL dataset (`diagnose_3way.py`, 2026-05-08)

Who this is for

Teams that need to turn a whole documentation site, a knowledge base, or a product catalog into a vector index — not one URL at a time, but the entire corpus in one ingestion run. The bottleneck is rarely the embedding model; it is the crawl, the queue between crawl and embedder, and the per-row insert pattern that turns a 30-minute job into a 6-hour one.

fastCRW's /v1/crawl is built for exactly this shape: hand it a seed URL, poll a job id, and your worker pool drains the results into the vector store in batches.

Why fastCRW for bulk pipelines

Three properties matter for bulk ingestion: the crawler returns clean text, the API is async so a flaky pipe does not lose the whole job, and the runtime is light enough to scale horizontally without infrastructure gymnastics.

POST /v1/crawl (docs.fastcrw.com/api-reference/crawl/) starts an async BFS that returns a job id immediately. GET /v1/crawl/{id} returns status and accumulated results; DELETE /v1/crawl/{id} cancels. This matches the Firecrawl shape exactly, so any Firecrawl-targeting ingestion job works after a base-URL swap. maxDepth caps at 10 and maxPages at 1,000 per job (per marketing/CANONICAL-FACTS.md §4) — for larger corpora, partition by subdomain and run multiple jobs.
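
The partitioning advice above can be sketched as a small helper that turns a list of seed URLs into one request body per host. This is a minimal sketch, not fastCRW's own tooling: `build_crawl_jobs` is a hypothetical name, the payload fields are the ones used in the script later in this article, and it assumes a crawl stays on the host of its seed URL.

```python
from urllib.parse import urlparse

# Per-job caps quoted in the paragraph above.
MAX_DEPTH_CAP = 10
MAX_PAGES_CAP = 1000


def build_crawl_jobs(seeds, max_depth=3, max_pages=1000):
    """Return one /v1/crawl request body per distinct host in `seeds`.

    Hypothetical helper: it only builds payloads and sends nothing.
    """
    jobs = {}
    for seed in seeds:
        host = urlparse(seed).netloc.lower()
        if not host:
            raise ValueError(f"seed has no host: {seed!r}")
        # Keep the first seed seen for each host; later duplicates are ignored.
        jobs.setdefault(host, {
            "url": seed,
            "maxDepth": min(max_depth, MAX_DEPTH_CAP),   # clamp to the documented cap
            "maxPages": min(max_pages, MAX_PAGES_CAP),   # clamp to the documented cap
            "scrapeOptions": {"formats": ["markdown"]},
        })
    return list(jobs.values())
```

Each returned body can then be POSTed as its own job, so a corpus larger than the per-job page cap is covered by several smaller crawls that are polled independently.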

The accuracy story matters at bulk scale: fastCRW achieved 63.74% truth-recall on Firecrawl's public 1,000-URL labeled dataset (diagnose_3way.py, 2026-05-08), the highest of the three tools tested. Higher recall at the scrape stage means fewer empty chunks and fewer useless vectors clogging the retriever.

The 5-step recipe

  1. Start the crawl with /v1/crawl. POST /v1/crawl with the seed URL, maxDepth, and maxPages (capped at 10 and 1000 respectively). The endpoint returns a job id immediately — the crawl runs server-side as an async BFS.
  2. Poll the job until it completes. GET /v1/crawl/{id} returns status (scraping, completed, failed) and the accumulated results. Poll every 5-10 seconds; cancel with DELETE /v1/crawl/{id} if you change your mind.
  3. Stream results into a batch embedder. For each completed page, push the markdown into an embedding queue. Batch 96-256 chunks per OpenAI embeddings call to keep token throughput high and per-request overhead low.
  4. Bulk insert into your vector store. Use the store's native bulk path — Pinecone's batched upsert, pgvector's COPY or executemany, or Qdrant's points/batch endpoint. Single-row inserts will bottleneck the whole pipeline.
  5. Record a content hash for incremental re-crawls. Store an MD5 of each page beside its vectors. Next run, compare hashes and only re-embed pages that actually changed. A 1,000-page corpus typically re-embeds <5% of pages per refresh.
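
Step 5 can be sketched as a hash comparison that runs before the embedder. This is an illustrative sketch rather than part of fastCRW: `content_hash` and `pages_to_reembed` are hypothetical helpers, `stored_hashes` is assumed to be a url-to-hash mapping loaded from your store, and the page shape (`markdown`, `metadata.sourceURL`) follows the script in this article.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """MD5 hex digest of a page body, matching the hash stored beside the vectors."""
    return hashlib.md5(markdown.encode("utf-8")).hexdigest()


def pages_to_reembed(pages, stored_hashes):
    """Return (url, body, hash) for pages that are new or whose content changed.

    pages: crawl results, each with "markdown" and metadata["sourceURL"].
    stored_hashes: dict mapping url -> hash recorded on the previous run.
    """
    changed = []
    for page in pages:
        body = page.get("markdown")
        if not body:
            continue  # nothing to embed for empty pages
        url = page["metadata"]["sourceURL"]
        digest = content_hash(body)
        # A missing url (new page) or a different digest (edited page) both qualify.
        if stored_hashes.get(url) != digest:
            changed.append((url, body, digest))
    return changed
```

With the pgvector table used in bulk_ingest.py, `stored_hashes` could be built from the `url` and `content_hash` columns of `corpus_chunks`; only the tuples returned here then need to go through the embedder and the upsert.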

# bulk_ingest.py (run with: python3 bulk_ingest.py)
import os
import time
import hashlib
import requests
import psycopg
from openai import OpenAI

CRW = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRW_API_KEY']}"}
oai = OpenAI()

def start_crawl(seed: str) -> str:
    r = requests.post(
        f"{CRW}/crawl",
        json={"url": seed, "maxDepth": 3, "maxPages": 500,
              "scrapeOptions": {"formats": ["markdown"]}},
        headers=HEADERS, timeout=30,
    )
    r.raise_for_status()
    return r.json()["data"]["id"]

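def wait(job_id: str) -> list[dict]:
    # Poll every 5 seconds until the job reaches a terminal state.
    while True:
        r = requests.get(f"{CRW}/crawl/{job_id}", headers=HEADERS, timeout=30)
        r.raise_for_status()
        body = r.json()["data"]
        # A failed job still returns whatever pages were accumulated before the failure.
        if body["status"] in ("completed", "failed"):
            return body.get("data", [])
        time.sleep(5)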

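def embed_batch(texts: list[str], batch_size: int = 128) -> list[list[float]]:
    # Send inputs in slices (step 3 suggests 96-256 per call) instead of one oversized request.
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        out = oai.embeddings.create(
            model="text-embedding-3-small", input=texts[i:i + batch_size]
        ).data
        vectors.extend(d.embedding for d in out)
    return vectors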

def bulk_upsert(rows: list[tuple]) -> None:
    # ON CONFLICT (url) requires a unique constraint or index on corpus_chunks.url.
    with psycopg.connect(os.environ["DATABASE_URL"]) as conn, conn.cursor() as cur:
        # One executemany call covers the whole batch of rows.
        cur.executemany(
            "INSERT INTO corpus_chunks (url, body, content_hash, embedding) "
            "VALUES (%s, %s, %s, %s) "
            "ON CONFLICT (url) DO UPDATE SET body = EXCLUDED.body, "
            "content_hash = EXCLUDED.content_hash, embedding = EXCLUDED.embedding",
            rows,
        )
        conn.commit()

def main(seed: str) -> None:
    # Drop pages without markdown once, so urls and bodies stay aligned.
    pages = [p for p in wait(start_crawl(seed)) if p.get("markdown")]
    bodies = [p["markdown"] for p in pages]
    urls = [p["metadata"]["sourceURL"] for p in pages]
    embeddings = embed_batch(bodies)
    rows = [
        (u, b, hashlib.md5(b.encode()).hexdigest(), e)
        for u, b, e in zip(urls, bodies, embeddings)
    ]
    bulk_upsert(rows)
    print(f"Ingested {len(rows)} chunks from {seed}")

if __name__ == "__main__":
    main("https://docs.fastcrw.com")
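
The script above embeds each page as a single input, while step 3 of the recipe talks about chunks. Long pages may exceed an embedding model's per-input token limit, so a splitting step is usually needed in between. The following is a minimal sketch under stated assumptions: `split_markdown` and `batched` are hypothetical helpers, and the character limit is only a rough stand-in for a token budget.

```python
def split_markdown(text, max_chars=2000):
    """Pack blank-line-separated paragraphs into chunks of at most max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Hard-split any paragraph that is longer than the limit on its own.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate      # paragraph still fits in the open chunk
        else:
            chunks.append(current)   # close the open chunk and start a new one
            current = para
    if current:
        chunks.append(current)
    return chunks


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

In a pipeline, each page's markdown would pass through `split_markdown`, and the resulting chunks would be grouped with `batched(chunks, 128)` before each embeddings call. Storing one row per chunk would also need a key other than `url` alone, for example the URL plus a chunk index.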

Next steps

The async crawl reference and scrapeOptions matrix live at docs.fastcrw.com/api-reference/crawl/; managed-cloud per-page pricing is on fastcrw.com/pricing. Self-host the binary to run bulk jobs at $0 per 1,000 scrapes under AGPL-3.0 — the most common pattern is one fastCRW container per worker VM, scaled horizontally.
