Bulk Vector Database Ingestion with fastCRW
Crawl a whole domain into clean markdown, embed in batches, and bulk-insert into Pinecone, pgvector, or Qdrant — fastCRW's /v1/crawl makes the front of the vector pipeline a single async job.
Who this is for
Teams that need to turn a whole documentation site, a knowledge base, or a product catalog into a vector index — not one URL at a time, but the entire corpus in one ingestion run. The bottleneck is rarely the embedding model; it is the crawl, the queue between crawl and embedder, and the per-row insert pattern that turns a 30-minute job into a 6-hour one.
fastCRW's /v1/crawl is built for exactly this shape: hand it a seed URL,
poll a job id, and your worker pool drains the results into the vector store
in batches.
Why fastCRW for bulk pipelines
Three properties matter for bulk ingestion: the crawler returns clean text, the API is async so a flaky pipe does not lose the whole job, and the runtime is light enough to scale horizontally without infrastructure gymnastics.
POST /v1/crawl
(docs.fastcrw.com/api-reference/crawl/)
starts an async BFS that returns a job id immediately. GET /v1/crawl/{id}
returns status and accumulated results; DELETE /v1/crawl/{id} cancels.
This matches the Firecrawl shape exactly, so any Firecrawl-targeting
ingestion job works after a base-URL swap. maxDepth caps at 10 and
maxPages at 1,000 per job (per marketing/CANONICAL-FACTS.md §4) — for
larger corpora, partition by subdomain and run multiple jobs.
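The partitioning advice can be sketched as a small helper that builds one request body per subdomain and clamps each to the per-job caps quoted above. This is a minimal sketch: the function name `crawl_payloads` and the example hosts are hypothetical, and the payload fields simply mirror the ones used in the ingestion script on this page.

```python
from urllib.parse import urlsplit

MAX_DEPTH_CAP = 10    # per-job caps quoted in the text above
MAX_PAGES_CAP = 1000


def crawl_payloads(seeds: list[str], max_depth: int = 3,
                   max_pages: int = MAX_PAGES_CAP) -> list[dict]:
    """Build one /v1/crawl request body per distinct host, clamped to the caps."""
    seen_hosts: set[str] = set()
    payloads: list[dict] = []
    for seed in seeds:
        host = urlsplit(seed).netloc.lower()
        if not host or host in seen_hosts:
            continue  # skip seeds without a host and repeated subdomains
        seen_hosts.add(host)
        payloads.append({
            "url": seed,
            "maxDepth": min(max_depth, MAX_DEPTH_CAP),
            "maxPages": min(max_pages, MAX_PAGES_CAP),
            "scrapeOptions": {"formats": ["markdown"]},
        })
    return payloads
```

Each returned body would then be POSTed to /v1/crawl as its own job, giving one job id per subdomain to poll independently.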
The accuracy story matters at bulk scale: fastCRW achieved 63.74%
truth-recall on Firecrawl's public 1,000-URL labeled dataset
(diagnose_3way.py, 2026-05-08), the highest of the three tools tested.
Higher recall at the scrape stage means fewer empty chunks and fewer
useless vectors clogging the retriever.
The 5-step recipe
- Start the crawl with /v1/crawl. POST /v1/crawl with the seed URL, maxDepth, and maxPages (capped at 10 and 1000 respectively). The endpoint returns a job id immediately — the crawl runs server-side as an async BFS.
- Poll the job until it completes. GET /v1/crawl/{id} returns status (scraping, completed, failed) and the accumulated results. Poll every 5-10 seconds; cancel with DELETE /v1/crawl/{id} if you change your mind.
- Stream results into a batch embedder. For each completed page, push the markdown into an embedding queue. Batch 96-256 chunks per OpenAI embeddings call to keep token throughput high and per-request overhead low.
- Bulk insert into your vector store. Use the store's native bulk path — Pinecone's batched upsert, pgvector's COPY or executemany, or Qdrant's points/batch endpoint. Single-row inserts will bottleneck the whole pipeline.
- Record a content hash for incremental re-crawls. Store an MD5 of each page beside its vectors. Next run, compare hashes and only re-embed pages that actually changed. A 1,000-page corpus typically re-embeds <5% of pages per refresh.
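The last step, comparing hashes between runs, could look roughly like the sketch below. It assumes pages carry `markdown` and `metadata.sourceURL` fields, as the ingestion script on this page does, and that the previous run's hashes have been loaded into a plain dict keyed by URL (for example from the `url` and `content_hash` columns). The helper names `content_hash` and `changed_pages` are hypothetical.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """MD5 hex digest of a page body, used only as a change detector."""
    return hashlib.md5(markdown.encode("utf-8")).hexdigest()


def changed_pages(pages: list[dict], stored: dict[str, str]) -> list[dict]:
    """Return pages that are new or whose body hash differs from the stored one.

    `stored` maps url -> content_hash from the previous run (assumed shape).
    """
    changed = []
    for page in pages:
        body = page.get("markdown")
        if not body:
            continue  # nothing to embed for empty pages
        url = page["metadata"]["sourceURL"]
        if stored.get(url) != content_hash(body):
            changed.append(page)
    return changed
```

Only the pages returned by `changed_pages` need to go through the embedder and the upsert; unchanged pages keep their existing vectors.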
# bulk_ingest.py (run with: python3 bulk_ingest.py)
import hashlib
import os
import time

import psycopg
import requests
from openai import OpenAI

CRW = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRW_API_KEY']}"}
EMBED_BATCH = 128  # inputs per embeddings request, inside the recipe's 96-256 range
oai = OpenAI()  # reads OPENAI_API_KEY from the environment


def start_crawl(seed: str) -> str:
    """Start an async crawl and return its job id."""
    r = requests.post(
        f"{CRW}/crawl",
        json={"url": seed, "maxDepth": 3, "maxPages": 500,
              "scrapeOptions": {"formats": ["markdown"]}},
        headers=HEADERS, timeout=30,
    )
    r.raise_for_status()
    return r.json()["data"]["id"]


def wait(job_id: str) -> list[dict]:
    """Poll every 5 seconds until the job finishes, then return its pages."""
    while True:
        r = requests.get(f"{CRW}/crawl/{job_id}", headers=HEADERS, timeout=30)
        r.raise_for_status()
        body = r.json()["data"]
        if body["status"] in ("completed", "failed"):
            # On failure, continue with whatever pages the response carries (possibly none).
            return body.get("data", [])
        time.sleep(5)


def embed_batch(texts: list[str]) -> list[list[float]]:
    """Embed texts in slices of EMBED_BATCH rather than one unbounded request."""
    vectors: list[list[float]] = []
    for i in range(0, len(texts), EMBED_BATCH):
        out = oai.embeddings.create(
            model="text-embedding-3-small", input=texts[i:i + EMBED_BATCH]
        ).data
        vectors.extend(d.embedding for d in out)
    return vectors


def bulk_upsert(rows: list[tuple]) -> None:
    """Upsert all rows in one executemany call; needs a unique constraint on url."""
    with psycopg.connect(os.environ["DATABASE_URL"]) as conn, conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO corpus_chunks (url, body, content_hash, embedding) "
            "VALUES (%s, %s, %s, %s) "
            "ON CONFLICT (url) DO UPDATE SET body = EXCLUDED.body, "
            "content_hash = EXCLUDED.content_hash, embedding = EXCLUDED.embedding",
            rows,
        )
        conn.commit()


def main(seed: str) -> None:
    # Keep only pages that produced markdown so URLs and bodies stay aligned.
    pages = [p for p in wait(start_crawl(seed)) if p.get("markdown")]
    bodies = [p["markdown"] for p in pages]
    urls = [p["metadata"]["sourceURL"] for p in pages]
    embeddings = embed_batch(bodies)
    rows = [
        (u, b, hashlib.md5(b.encode()).hexdigest(), e)
        for u, b, e in zip(urls, bodies, embeddings)
    ]
    bulk_upsert(rows)
    print(f"Ingested {len(rows)} chunks from {seed}")


if __name__ == "__main__":
    main("https://docs.fastcrw.com")
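
The script above embeds each page as a single input, so a very long page may exceed the embedding model's per-input limit. One way to get from pages to the chunk batches described in the recipe is sketched below. It is a rough illustration, not part of fastCRW: the helper names, the 4,000-character chunk size (a crude stand-in for a token budget), and the paragraph-based splitting are all assumptions.

```python
from typing import Iterator


def chunk_markdown(text: str, max_chars: int = 4000) -> list[str]:
    """Pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is hard-split at the limit.
    """
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        while len(para) > max_chars:
            if current:  # flush what we have before emitting the hard split
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def batched(items: list[str], size: int = 128) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items for one embeddings call each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With these helpers, each page's markdown would be split with `chunk_markdown`, and the resulting chunks fed to the embedder one `batched` slice at a time. The table's conflict key would then need to identify a chunk (for example URL plus chunk index) rather than a whole page.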
Next steps
The async crawl reference and scrapeOptions matrix live at
docs.fastcrw.com/api-reference/crawl/;
managed-cloud per-page pricing is on
fastcrw.com/pricing. Self-host the binary
to run bulk jobs at $0 per 1,000 scrapes under AGPL-3.0 — the most common
pattern is one fastCRW container per worker VM, scaled horizontally.