Web Scraping for RAG Pipelines

Turn any website into chunked, embedded, retrieval-ready vectors with fastCRW — clean markdown, predictable JSON, and a single binary you can self-host.

Published: March 11, 2026
Updated: May 27, 2026
Category: use cases
  - Clean markdown output that chunks predictably for retrieval
  - One binary, ~50 MB RAM idle (`crw-opencore/README.md`, structural footprint)
  - Drop-in path from URL to Pinecone / pgvector / Qdrant

Who this is for

Teams shipping a RAG application — a docs chatbot, an internal-knowledge assistant, a research copilot — that need to turn live websites into vectors the retriever can actually trust. The hard part is rarely the embedding model; it is the front of the pipeline, where messy HTML produces messy chunks that produce messy retrieval.

fastCRW exists for exactly that front half. You hand it a URL, it hands you back clean markdown plus the structured fields you asked for, and the rest of your pipeline gets to stay simple.

Why fastCRW for RAG

RAG ingestion has three failure modes: too much navigation chrome ends up in chunks, JavaScript-rendered pages return empty, and re-crawling at scale becomes its own infra project. fastCRW addresses all three at the source.

The POST /v1/scrape endpoint (docs.fastcrw.com/api-reference/scrape/) returns a single, clean markdown body per URL — the same shape Firecrawl returns, so any LangChain or LlamaIndex loader that targets Firecrawl works against fastCRW after a base-URL swap. For sites that need JS, the renderer field picks between http, lightpanda, and chrome automatically. And because the engine is one static Rust binary (~50 MB RAM idle, per crw-opencore/README.md structural footprint), running it next to your worker pool is cheap enough to crawl continuously rather than nightly.

For bulk corpus work, pair POST /v1/map (docs.fastcrw.com/api-reference/map/) to enumerate URLs with /v1/scrape to fetch them. That keeps discovery and extraction observable as separate stages, which is what you want when a chunk goes wrong and you need to trace it back to a URL.
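
The two-stage flow described above might be sketched as follows. This is a minimal illustration, not the documented client: it assumes /v1/map accepts `{"url": ...}` and returns its URLs under a `links` key (top-level or inside `data`), which should be checked against the map reference. The helper names `post_json`, `discover`, and `fetch_all` are hypothetical.

```python
import json
import os
import urllib.request
from typing import Callable

CRW = "https://api.fastcrw.com/v1"
PostFn = Callable[[str, dict], dict]


def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload to fastCRW and decode the JSON response."""
    req = urllib.request.Request(
        f"{CRW}{path}",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['CRW_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def discover(site: str, post: PostFn = post_json) -> list[str]:
    """Stage 1: enumerate URLs. The response shape here is an assumption."""
    body = post("/map", {"url": site})
    links = body.get("links")
    if links is None and isinstance(body.get("data"), dict):
        links = body["data"].get("links")
    return sorted(set(links or []))  # de-duplicate, stable order


def fetch_all(urls: list[str], post: PostFn = post_json) -> tuple[dict, dict]:
    """Stage 2: scrape each URL, recording failures per URL instead of aborting."""
    pages: dict[str, str] = {}
    failures: dict[str, str] = {}
    for url in urls:
        try:
            body = post("/scrape", {"url": url, "formats": ["markdown"]})
        except OSError as exc:  # urllib's URLError, HTTPError, and timeouts
            failures[url] = f"request failed: {exc}"
            continue
        markdown = (body.get("data") or {}).get("markdown")
        if markdown:
            pages[url] = markdown
        else:
            failures[url] = "no markdown in response"
    return pages, failures
```

Calling `discover("https://docs.fastcrw.com")` and passing the result to `fetch_all` yields one dictionary of markdown keyed by URL and one of failures keyed by URL, so a bad chunk can be traced to the stage and page that produced it.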

The 5-step recipe

  1. Scrape the source page into clean markdown. Call POST /v1/scrape with formats ["markdown"]. fastCRW strips navigation, ads, and boilerplate so the body of the page is what reaches your chunker.
  2. Split the markdown into retrieval-sized chunks. Run a recursive markdown splitter (LangChain RecursiveCharacterTextSplitter, LlamaIndex MarkdownNodeParser, or your own) with ~800-1,200 token chunks and 10-15% overlap. Markdown headings give the splitter natural seams.
  3. Embed each chunk with your model of choice. Pass chunks through an embedding model (OpenAI text-embedding-3-small, Voyage voyage-3, or a local bge-small) and keep the source URL plus heading path as metadata.
  4. Upsert vectors into your store. Write the embeddings to Pinecone, Postgres pgvector, Qdrant, or Weaviate. Include the canonical URL, content hash, and updatedAt so re-crawls cleanly replace stale rows.
  5. Query at retrieval time and cite the source. At inference, embed the user question, run a top-k search against the same index, stitch the returned chunks into the prompt, and surface each source URL in the answer so the response can be audited.
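
```python
# scrape_and_index.py: run with `python3 scrape_and_index.py`
# Assumes a rag_chunks table with a unique constraint on (url, ord) and a
# pgvector embedding column sized for the model (1536 dimensions for
# text-embedding-3-small).
import hashlib
import os

import psycopg
import requests
from openai import OpenAI

CRW = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRW_API_KEY']}"}
client = OpenAI()  # reads OPENAI_API_KEY from the environment


def scrape(url: str) -> str:
    """Fetch one URL through fastCRW and return its markdown body."""
    r = requests.post(
        f"{CRW}/scrape",
        json={"url": url, "formats": ["markdown"]},
        headers=HEADERS,
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["data"]["markdown"]


def chunk(md: str, size: int = 1000, overlap: int = 150) -> list[str]:
    """Fixed-width splitter; size and overlap are in characters, not tokens."""
    # Replace with LangChain RecursiveCharacterTextSplitter in production.
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [md[i : i + size] for i in range(0, len(md), step)]


def upsert(url: str, chunks: list[str]) -> None:
    """Embed the chunks and write them to Postgres, keyed by (url, ord)."""
    if not chunks:
        return  # nothing to embed; the embeddings API rejects empty input
    embeddings = client.embeddings.create(
        model="text-embedding-3-small", input=chunks
    ).data
    with psycopg.connect(os.environ["DATABASE_URL"]) as conn, conn.cursor() as cur:
        for idx, (text, emb) in enumerate(zip(chunks, embeddings)):
            cur.execute(
                "INSERT INTO rag_chunks (url, ord, body, content_hash, embedding) "
                "VALUES (%s, %s, %s, %s, %s) "
                "ON CONFLICT (url, ord) DO UPDATE SET body = EXCLUDED.body, "
                "content_hash = EXCLUDED.content_hash, embedding = EXCLUDED.embedding",
                (url, idx, text, hashlib.md5(text.encode()).hexdigest(), emb.embedding),
            )
        # Drop leftover rows when a re-crawl yields fewer chunks than before.
        cur.execute(
            "DELETE FROM rag_chunks WHERE url = %s AND ord >= %s", (url, len(chunks))
        )
        conn.commit()


if __name__ == "__main__":
    target = "https://docs.fastcrw.com/api-reference/scrape/"
    upsert(target, chunk(scrape(target)))
```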
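
The `chunk` helper in scrape_and_index.py cuts on raw character offsets. Steps 2 and 3 of the recipe call for splitting on markdown headings and keeping the heading path as metadata; a minimal sketch of that idea follows. The names `Chunk` and `split_by_headings` are hypothetical, sizes are in characters (4,000 is only a rough stand-in for about 1,000 tokens of English text), and oversized sections are hard-cut without overlap, which a production splitter would handle more carefully.

```python
import re
from dataclasses import dataclass

# ATX headings only ("# Title" through "###### Title").
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    heading_path: str  # e.g. "API reference > Scrape > Parameters"
    text: str          # section body; the heading itself lives in heading_path


def split_by_headings(md: str, max_chars: int = 4000) -> list[Chunk]:
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # open headings as (level, title)
    buf: list[str] = []
    in_fence = False

    def flush() -> None:
        body = "\n".join(buf).strip()
        buf.clear()
        if not body:
            return
        label = " > ".join(title for _, title in path)
        # Hard-cut sections that exceed the budget.
        for start in range(0, len(body), max_chars):
            chunks.append(Chunk(label, body[start : start + max_chars]))

    for line in md.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # "#" lines inside code fences are not headings
        match = None if in_fence else HEADING.match(line)
        if match is None:
            buf.append(line)
            continue
        flush()  # close the previous section under its own heading path
        level, title = len(match.group(1)), match.group(2)
        while path and path[-1][0] >= level:
            path.pop()  # leave sibling and deeper sections
        path.append((level, title))
    flush()
    return chunks
```

Each `Chunk.heading_path` can be stored next to the source URL at upsert time, so a retrieved chunk points back to a section of the page and not only to the page.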

scrape_and_index.py uses Postgres + pgvector because it is the path of least resistance for a self-hosted RAG stack, but the same five steps map directly onto Pinecone, Qdrant, or Weaviate; only the upsert body changes.
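
Step 5, retrieval with citations, is not covered by the script. A hedged sketch against the same `rag_chunks` table could look like the following. It assumes `conn` is an open psycopg connection, that the question embedding comes from the same model used at indexing time, and that ranking uses pgvector's `<=>` cosine-distance operator. The names `to_vector_literal`, `top_k`, and `build_prompt` are hypothetical.

```python
from typing import Sequence

# pgvector: "<=>" is cosine distance, so ascending order puts the closest chunks first.
TOP_K_SQL = (
    "SELECT url, body FROM rag_chunks "
    "ORDER BY embedding <=> %s::vector "
    "LIMIT %s"
)


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Render an embedding in pgvector's text format, e.g. "[0.1,0.2]"."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def top_k(conn, query_embedding: Sequence[float], k: int = 5) -> list[tuple[str, str]]:
    """Return the k nearest (url, body) rows for an already-embedded question."""
    with conn.cursor() as cur:
        cur.execute(TOP_K_SQL, (to_vector_literal(query_embedding), k))
        return cur.fetchall()


def build_prompt(question: str, rows: Sequence[tuple[str, str]]) -> tuple[str, list[str]]:
    """Stitch retrieved chunks into a prompt and return the cited URLs in order."""
    sources: list[str] = []
    blocks: list[str] = []
    for url, body in rows:
        if url not in sources:
            sources.append(url)
        number = sources.index(url) + 1  # chunks from one page share a citation number
        blocks.append(f"[{number}] {url}\n{body}")
    context = "\n\n".join(blocks)
    prompt = (
        "Answer using only the numbered sources below and cite them as [n].\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return prompt, sources
```

Returning the ordered `sources` list alongside the prompt lets the application print the cited URLs under the answer without parsing the model output.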

Next steps

The full /v1/scrape and /v1/map reference lives at docs.fastcrw.com, and managed-cloud pricing for teams that would rather not run the binary themselves is on fastcrw.com/pricing. Self-hosters get the same endpoints for $0 per 1,000 scrapes under AGPL-3.0; managed-cloud users add hosted convenience (BYOK extraction, search) on top.
