Web Scraping for RAG Pipelines
Turn any website into chunked, embedded, retrieval-ready vectors with fastCRW — clean markdown, predictable JSON, and a single binary you can self-host.
Who this is for
Teams shipping a RAG application — a docs chatbot, an internal-knowledge assistant, a research copilot — that need to turn live websites into vectors the retriever can actually trust. The hard part is rarely the embedding model; it is the front of the pipeline, where messy HTML produces messy chunks that produce messy retrieval.
fastCRW exists for exactly that front half. You hand it a URL, it hands you back clean markdown plus the structured fields you asked for, and the rest of your pipeline gets to stay simple.
Why fastCRW for RAG
RAG ingestion has three failure modes: too much navigation chrome ends up in chunks, JavaScript-rendered pages return empty, and re-crawling at scale becomes its own infra project. fastCRW addresses all three at the source.
The POST /v1/scrape endpoint (docs.fastcrw.com/api-reference/scrape/) returns a single, clean markdown body per URL. This is the same shape Firecrawl returns, so any LangChain or LlamaIndex loader that targets Firecrawl works against fastCRW after a base-URL swap. For sites that need JS, the renderer field picks between http, lightpanda, and chrome automatically. And because the engine is one static Rust binary (about 50 MB of RAM at idle, per the structural footprint in crw-opencore/README.md), running it next to your worker pool is cheap enough to crawl continuously rather than nightly.
For bulk corpus work, pair POST /v1/map (docs.fastcrw.com/api-reference/map/), which enumerates URLs, with /v1/scrape, which fetches them. That keeps discovery and extraction observable as separate stages, which is what you want when a chunk goes wrong and you need to trace it back to a URL.
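The two-stage split can be sketched as a pair of small functions. This is a minimal illustration, not the fastCRW client: `discover`, `extract`, and the injected `post` callable are hypothetical names, and the sketch assumes /v1/map returns its URLs under a `links` key (Firecrawl's shape) or under `data.links`. Check the map reference for the actual response body before relying on it.

```python
from typing import Callable, Iterator, Optional

# Hypothetical transport: post(path, payload) -> parsed JSON dict.
Post = Callable[[str, dict], dict]


def discover(root_url: str, post: Post) -> list[str]:
    """Stage 1: enumerate candidate URLs via /map, deduplicated in order."""
    body = post("/map", {"url": root_url})
    # Assumption: links live under "links" or under "data" -> "links".
    links = body.get("links")
    if links is None:
        data = body.get("data")
        links = data.get("links", []) if isinstance(data, dict) else []
    return list(dict.fromkeys(links))


def extract(
    urls: list[str], post: Post
) -> Iterator[tuple[str, Optional[str], Optional[str]]]:
    """Stage 2: scrape each URL and yield (url, markdown, error)."""
    for url in urls:
        try:
            body = post("/scrape", {"url": url, "formats": ["markdown"]})
            yield url, body["data"]["markdown"], None
        except Exception as exc:
            # Record the failure against its URL and keep going.
            yield url, None, repr(exc)
```

In practice `post` would wrap `requests.post` against your fastCRW base URL with the Authorization header. Because every result carries its URL and an error slot, a bad chunk can be traced back to the page, and the stage, that produced it.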
The 5-step recipe
- Scrape the source page into clean markdown. Call POST /v1/scrape with formats ["markdown"]. fastCRW strips navigation, ads, and boilerplate so the body of the page is what reaches your chunker.
- Split the markdown into retrieval-sized chunks. Run a recursive markdown splitter (LangChain RecursiveCharacterTextSplitter, LlamaIndex MarkdownNodeParser, or your own) with ~800-1,200 token chunks and 10-15% overlap. Markdown headings give the splitter natural seams.
- Embed each chunk with your model of choice. Pass chunks through an embedding model (OpenAI text-embedding-3-small, Voyage voyage-3, or a local bge-small) and keep the source URL plus heading path as metadata.
- Upsert vectors into your store. Write the embeddings to Pinecone, Postgres pgvector, Qdrant, or Weaviate. Include the canonical URL, content hash, and updatedAt so re-crawls cleanly replace stale rows.
- Query at retrieval time and cite the source. At inference, embed the user question, retrieve the top-k chunks from the same index, stitch them into the prompt, and surface each source URL in the answer so the response can be audited.
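Steps 2 and 3 mention splitting on markdown headings and keeping the heading path as metadata. A minimal sketch of that idea in plain Python follows. The names `split_by_heading`, `max_chars`, and `heading_path` are illustrative, sizes are measured in characters as a rough stand-in for tokens, and a library splitter remains the better choice in production.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker; lines inside fences are never headings


def split_by_heading(md: str, max_chars: int = 4000) -> list[dict]:
    """Split markdown at headings; each chunk carries its heading path."""
    chunks: list[dict] = []
    path: list[str] = []  # current heading trail, e.g. ["Guide", "Install"]
    buf: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(buf).strip()
        buf.clear()
        # Oversized sections fall back to fixed-size windows; empty ones vanish.
        for i in range(0, len(text), max_chars):
            chunks.append(
                {"heading_path": " > ".join(path), "text": text[i : i + max_chars]}
            )

    for line in md.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section under the old path
            level = len(match.group(1))
            del path[level - 1 :]  # drop headings at this depth or deeper
            path.append(match.group(2))
        else:
            buf.append(line)
    flush()
    return chunks
```

Storing `heading_path` next to the source URL gives the retriever a cheap breadcrumb such as "Guide > Install" to show alongside each citation.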
# scrape_and_index.py (run with: python3 scrape_and_index.py)
import hashlib
import os

import psycopg
import requests
from openai import OpenAI

CRW = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRW_API_KEY']}"}
client = OpenAI()  # reads OPENAI_API_KEY from the environment


def scrape(url: str) -> str:
    """Fetch one page as clean markdown via POST /v1/scrape."""
    r = requests.post(
        f"{CRW}/scrape",
        json={"url": url, "formats": ["markdown"]},
        headers=HEADERS,
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["data"]["markdown"]


def chunk(md: str, size: int = 1000, overlap: int = 150) -> list[str]:
    # Naive fixed-width splitter; size and overlap are in characters, not tokens.
    # Replace with LangChain RecursiveCharacterTextSplitter in production.
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    out, i = [], 0
    while i < len(md):
        out.append(md[i : i + size])
        i += size - overlap
    return out


def upsert(url: str, chunks: list[str]) -> None:
    # Assumes rag_chunks has a unique (url, ord) constraint and a pgvector
    # embedding column. The embeddings endpoint rejects empty input, so skip it.
    embeddings = (
        client.embeddings.create(model="text-embedding-3-small", input=chunks).data
        if chunks
        else []
    )
    with psycopg.connect(os.environ["DATABASE_URL"]) as conn, conn.cursor() as cur:
        for idx, (text, emb) in enumerate(zip(chunks, embeddings)):
            cur.execute(
                "INSERT INTO rag_chunks (url, ord, body, content_hash, embedding) "
                "VALUES (%s, %s, %s, %s, %s) "
                "ON CONFLICT (url, ord) DO UPDATE SET body = EXCLUDED.body, "
                "content_hash = EXCLUDED.content_hash, embedding = EXCLUDED.embedding",
                # MD5 is used for change detection here, not for security.
                (url, idx, text, hashlib.md5(text.encode()).hexdigest(), emb.embedding),
            )
        # Drop leftover rows when the page now yields fewer chunks than before.
        cur.execute(
            "DELETE FROM rag_chunks WHERE url = %s AND ord >= %s", (url, len(chunks))
        )
        conn.commit()


if __name__ == "__main__":
    target = "https://docs.fastcrw.com/api-reference/scrape/"
    upsert(target, chunk(scrape(target)))
The example uses Postgres + pgvector because it is the path of least resistance for a self-hosted RAG stack, but the same five steps map directly onto Pinecone, Qdrant, or Weaviate; only the upsert body changes.
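The script stops at step 4. Step 5, retrieval with citations, might look like the sketch below against the same rag_chunks table. It assumes a pgvector `embedding` column and uses pgvector's `<=>` cosine-distance operator. The names `TOP_K_SQL`, `to_vector_literal`, and `build_prompt` are illustrative, and the prompt wording is only a placeholder.

```python
TOP_K_SQL = (
    "SELECT url, body FROM rag_chunks "
    "ORDER BY embedding <=> %s::vector "  # pgvector cosine distance
    "LIMIT %s"
)


def to_vector_literal(embedding: list[float]) -> str:
    """Format floats as pgvector text input, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_prompt(question: str, rows: list[tuple[str, str]]) -> str:
    """Number each retrieved (url, body) row so the answer can cite [n]."""
    context = "\n\n".join(
        f"[{n}] {body}" for n, (_, body) in enumerate(rows, start=1)
    )
    sources = "\n".join(
        f"[{n}] {url}" for n, (url, _) in enumerate(rows, start=1)
    )
    return (
        "Answer using only the numbered context and cite sources as [n].\n\n"
        f"Context:\n{context}\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )
```

To use it, embed the question with the same model used at ingest, run `cur.execute(TOP_K_SQL, (to_vector_literal(question_embedding), 5)).fetchall()`, and pass the rows to `build_prompt`. Keeping the numbered source list in the prompt is what lets the final answer point back to specific URLs.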
Next steps
The full /v1/scrape and /v1/map reference lives at docs.fastcrw.com, and managed-cloud pricing for teams that would rather not run the binary themselves is on fastcrw.com/pricing. Self-hosters get the same endpoints for $0 per 1,000 scrapes under AGPL-3.0; managed-cloud users add hosted convenience (BYOK extraction, search) on top.