Use Cases/Use Case / RAG

Web Scraping for RAG Pipelines

Turn any website into chunked, embedded, retrieval-ready vectors with fastCRW — clean markdown, predictable JSON, and a single binary you can self-host.

Published

March 11, 2026

Updated

May 27, 2026

Who this is for

Teams shipping a RAG application — a docs chatbot, an internal-knowledge assistant, a research copilot — that need to turn live websites into vectors the retriever can actually trust. The hard part is rarely the embedding model; it is the front of the pipeline, where messy HTML produces messy chunks that produce messy retrieval.

fastCRW exists for exactly that front half. You hand it a URL, it hands you back clean markdown plus the structured fields you asked for, and the rest of your pipeline gets to stay simple.

Why fastCRW for RAG

RAG ingestion has three failure modes: too much navigation chrome ends up in chunks, JavaScript-rendered pages return empty, and re-crawling at scale becomes its own infra project. fastCRW addresses all three at the source.

The POST /v1/scrape endpoint (docs.fastcrw.com/api-reference/scrape/) returns a single, clean markdown body per URL — the same shape Firecrawl returns, so any LangChain or LlamaIndex loader that targets Firecrawl works against fastCRW after a base-URL swap. For sites that need JS, the renderer field picks between http, lightpanda, and chrome automatically. And because the engine is one static Rust binary (~50 MB RAM idle, per crw-opencore/README.md structural footprint), running it next to your worker pool is cheap enough to crawl continuously rather than nightly.

For bulk corpus work, pair POST /v1/map (docs.fastcrw.com/api-reference/map/) to enumerate URLs with /v1/scrape to fetch them. That keeps discovery and extraction observable as separate stages, which is what you want when a chunk goes wrong and you need to trace it back to a URL.

The 5-step recipe

Scrape the source page into clean markdown. Call POST /v1/scrape with formats ["markdown"]. fastCRW strips navigation, ads, and boilerplate so the body of the page is what reaches your chunker.
Split the markdown into retrieval-sized chunks. Run a recursive markdown splitter (LangChain RecursiveCharacterTextSplitter, LlamaIndex MarkdownNodeParser, or your own) with ~800-1,200 token chunks and 10-15% overlap. Markdown headings give the splitter natural seams.
Embed each chunk with your model of choice. Pass chunks through an embedding model (OpenAI text-embedding-3-small, Voyage voyage-3, or a local bge-small) and keep the source URL plus heading path as metadata.
Upsert vectors into your store. Write the embeddings to Pinecone, Postgres pgvector, Qdrant, or Weaviate. Include the canonical URL, content hash, and updatedAt so re-crawls cleanly replace stale rows.
Query at retrieval time and cite the source. At inference, embed the user question, top-k against the same index, stitch the chunks into the prompt, and surface the source URL in your answer so the LLM can be audited.

# scrape_and_index.py — run with: python3 scrape_and_index.py
import os
import hashlib
import requests
import psycopg
from openai import OpenAI

CRW = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRW_API_KEY']}"}
client = OpenAI()

def scrape(url: str) -> str:
    r = requests.post(
        f"{CRW}/scrape",
        json={"url": url, "formats": ["markdown"]},
        headers=HEADERS,
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["data"]["markdown"]

def chunk(md: str, size: int = 1000, overlap: int = 150) -> list[str]:
    # Replace with LangChain RecursiveCharacterTextSplitter in production.
    out, i = [], 0
    while i < len(md):
        out.append(md[i : i + size])
        i += size - overlap
    return out

def upsert(url: str, chunks: list[str]) -> None:
    embeddings = client.embeddings.create(
        model="text-embedding-3-small", input=chunks
    ).data
    with psycopg.connect(os.environ["DATABASE_URL"]) as conn, conn.cursor() as cur:
        for idx, (text, emb) in enumerate(zip(chunks, embeddings)):
            cur.execute(
                "INSERT INTO rag_chunks (url, ord, body, content_hash, embedding) "
                "VALUES (%s, %s, %s, %s, %s) "
                "ON CONFLICT (url, ord) DO UPDATE SET body = EXCLUDED.body, "
                "content_hash = EXCLUDED.content_hash, embedding = EXCLUDED.embedding",
                (url, idx, text, hashlib.md5(text.encode()).hexdigest(), emb.embedding),
            )
        conn.commit()

if __name__ == "__main__":
    target = "https://docs.fastcrw.com/api-reference/scrape/"
    upsert(target, chunk(scrape(target)))

The example uses Postgres + pgvector because it is the path of least resistance for a self-hosted RAG stack, but the same five steps map directly onto Pinecone, Qdrant, or Weaviate — only the upsert body changes.

Next steps

The full /v1/scrape and /v1/map reference lives at docs.fastcrw.com, and managed-cloud pricing for teams that would rather not run the binary themselves is on fastcrw.com/pricing. Self-hosters get the same endpoints for $0 per 1,000 scrapes under AGPL-3.0; managed-cloud users add hosted convenience (BYOK extraction, search) on top.

Sources

fastCRW /v1/scrape reference

https://docs.fastcrw.com/api-reference/scrape/

fastCRW canonical fact sheet (marketing/CANONICAL-FACTS.md)

https://github.com/us/crw-saas/blob/main/marketing/CANONICAL-FACTS.md

LangChain RecursiveCharacterTextSplitter

https://python.langchain.com/docs/concepts/text_splitters/

pgvector — open-source vector extension for Postgres

https://github.com/pgvector/pgvector

FAQ

Why scrape into markdown instead of raw HTML?

Markdown is denser per token, retains heading structure that splitters key off, and removes the navigation/ad noise that pollutes retrieval. fastCRW's markdown output is produced server-side so your worker does not pay the HTML-parsing tax on every page.

Can I run the whole pipeline self-hosted?

Yes. fastCRW ships as a single static Rust binary under AGPL-3.0 (`github.com/us/crw`). Pair it with a self-hosted Postgres pgvector or Qdrant and the entire ingestion path stays inside your network.

How do I keep the index fresh as pages change?

Store the page's content hash alongside each vector. Re-scrape on a schedule, compare hashes, and only re-embed pages that actually changed. fastCRW's /v1/map endpoint is the cheap discovery step before the targeted /v1/scrape pass.

Recommended next step

Claim an API key and start shipping.

Move from evaluation to implementation with credits, docs, and a compatibility-first API.

Create Account

Continue exploring

More from Use Cases

View all use cases

Previous in Use Cases

Web Scraping for Lead Enrichment

Next in Use Cases

AI-Powered Structured Extraction from the Web

Use Cases

Web Scraping for RAG and AI Agent Training Data

Collect, clean, and normalize web corpora for RAG knowledge bases and AI agent training datasets with fastCRW — high-fidelity markdown, 63.74% truth-recall, Firecrawl-compatible API, single Rust binary.

web scraping for rag training data63.74% truth-recall on Firecrawl's public 1,000-URL benchmark (`diagnose_3way.py`, 2026-05-08) — highest of three tools tested

Use Cases

Web Scraping for Market Research

Monitor competitors, track pricing changes, harvest customer sentiment, and map market landscapes at scale with fastCRW — structured, timestamped data for repeatable quantitative analysis without manual analyst work.

web scraping for market researchMonitor competitor websites for pricing and feature changes on a schedule

Use Cases

Self-Hosted Web Scraping API

Run fastCRW on your own infrastructure — a single ~8 MB Docker image, no Redis or Node.js required, full Firecrawl-compatible API. Deploy on a $5 VPS or inside your own VPC for complete data control, privacy, and zero per-scrape fees.

self hosted web scraping apiSingle ~8 MB Docker image — no Redis, no Node.js, no Kafka

Related hubs

Keep the crawl path moving

Alternatives

Compare fastCRW against adjacent tools for the same workload.

Benchmarks

Check where internal performance claims start and stop.

Docs

Move into route-level implementation guidance for this workflow.