Web Scraping for RAG and AI Agent Training Data

Collect, clean, and normalize web corpora for RAG knowledge bases and AI agent training datasets with fastCRW — high-fidelity markdown, 63.74% truth-recall, Firecrawl-compatible API, single Rust binary.

Published: June 13, 2026
Updated: June 24, 2026

Who this is for

ML engineers and AI teams who need to collect large web corpora for:

RAG knowledge bases — crawling documentation sites, wikis, or domain-specific article collections to build the retrieval index behind an LLM chatbot
AI agent training datasets — gathering diverse, high-quality web text for instruction tuning, preference learning, or evaluation harness construction
Domain-specific pretraining or continued pretraining — assembling topic-focused corpora from authoritative web sources

The hard problem at this phase is not embedding or retrieval — it is getting clean, faithful body text out of the web at scale without navigation noise, truncated content, or JavaScript-rendered pages that return empty. That is what fastCRW is built for.

This page is about data collection upstream of your index. For inference-time retrieval (chunking, embedding, querying), see RAG pipelines.

Why scraper accuracy is the bottleneck for RAG corpus quality

When you build a RAG knowledge base, every stage of the pipeline inherits from the one before it. A scraper that returns truncated body text, navigation chrome, or cookie-consent boilerplate produces chunks full of that noise. Those chunks get embedded, upserted, and retrieved — and the LLM surfaces them as answers.

The industry has no standard metric for this, so fastCRW commissioned a benchmark against Firecrawl's own public scrape-content-dataset-v1 (1,000 URLs, 819 with labeled ground truth). The harness (diagnose_3way.py, run 2026-05-08) compares each scraper's markdown output to the labeled body text and measures what fraction of labeled URLs produced a faithful extraction.

Metric	fastCRW	Crawl4AI	Firecrawl
Truth-recall (of 819 labeled URLs)	63.74% (522)	59.95% (491)	56.04% (459)
Scrape-success (of reachable URLs)	~92% (91.8%)	—	—
p50 latency	1,914 ms	1,916 ms	2,305 ms
p90 latency (fast mode)	4,348 ms	4,754 ms	6,937 ms
Thrown errors	0	0	0

Source: bench/server-runs/RESULT_3WAY_1000_FULL.md, diagnose_3way.py, 2026-05-08.

What this means for corpus collection:

fastCRW's 63.74% truth-recall is +3.79 percentage points over Crawl4AI and +7.70 pp over Firecrawl on the same 819 labeled URLs. At corpus scale (100,000 pages), that difference is thousands of pages where your RAG index has faithful content versus navigation noise or empty bodies.
In fast mode, fastCRW's p90 is 4,348 ms — the lowest of the three. The chrome-stealth fallback that recovers the URLs others miss is the same mechanism that powers both the recall lead and the p90 win. fastCRW also recovers 34 URLs that neither competitor reached — 70% more unique recoveries than Crawl4AI and Firecrawl combined.
fastCRW's 91.8% scrape-success (of reachable URLs) with 0 thrown errors across 3,000 requests rounds out the accuracy picture alongside the truth-recall lead.
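
The percentage-point gaps quoted above can be reproduced from the raw counts in the table. The sketch below does that arithmetic; the corpus-scale projection assumes your pages behave like the benchmark's 819 labeled URLs, which may not hold for your domains. The helper names are illustrative only.

```python
# Faithful extractions out of 819 labeled URLs, copied from the benchmark table.
LABELED_URLS = 819
FAITHFUL = {"fastCRW": 522, "Crawl4AI": 491, "Firecrawl": 459}


def recall_pct(tool: str) -> float:
    """Truth-recall as a percentage, rounded to two decimals."""
    return round(100 * FAITHFUL[tool] / LABELED_URLS, 2)


def gap_pp(a: str, b: str) -> float:
    """Percentage-point gap between two tools (a minus b)."""
    return round(recall_pct(a) - recall_pct(b), 2)


def extra_pages(a: str, b: str, corpus_pages: int) -> int:
    """Projected extra faithful pages for tool a over tool b.

    Assumption: the corpus has the same recall profile as the benchmark.
    """
    return round(corpus_pages * (FAITHFUL[a] - FAITHFUL[b]) / LABELED_URLS)


if __name__ == "__main__":
    for tool in FAITHFUL:
        print(tool, recall_pct(tool))
    print(gap_pp("fastCRW", "Crawl4AI"), gap_pp("fastCRW", "Firecrawl"))
    print(extra_pages("fastCRW", "Crawl4AI", 100_000))
```

Under that assumption, a 3.79 pp gap on a 100,000-page corpus is a few thousand pages, which is where the "thousands of pages" figure comes from.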

Publish the full p50/p90/p99 split in your own benchmarks. A single average hides the tail behavior that matters for scheduler planning.
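
To see why an average hides the tail, here is a small sketch that reports p50/p90/p99 next to the mean. It uses the nearest-rank percentile definition, which is one common convention and not necessarily the one used by diagnose_3way.py; the function names are hypothetical.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples_ms: list[float]) -> dict[str, float]:
    """Summarize latencies with the p50/p90/p99 split plus the mean for contrast."""
    report = {f"p{p}": percentile(samples_ms, p) for p in (50, 90, 99)}
    report["mean"] = sum(samples_ms) / len(samples_ms)
    return report


if __name__ == "__main__":
    # Nine fast requests and one slow outlier: the mean lands far from every real request.
    print(latency_report([100] * 9 + [5000]))
```

In the sample run, the mean sits between the fast requests and the outlier and matches neither, while the percentile split shows both the typical case and the tail.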

Differentiating RAG corpus collection from fine-tuning datasets

These two use cases share a crawl step but diverge immediately after:

Concern	RAG corpus collection	Fine-tuning dataset
Output format	Chunked markdown + vector embeddings	JSONL (prompt/completion, instruction/input/output)
Scale	Millions of pages, continuous refresh	Thousands of curated examples, one-time
Quality filter	Dedup + length filter; some noise acceptable	Strict curation; noise degrades model weights
Freshness	Must stay current (re-crawl on schedule)	Static snapshot is fine after training run
Chunk metadata	Source URL + heading path required	Source attribution optional
Primary fastCRW endpoint	`/v1/crawl` for bulk, `/v1/scrape` for targeted	`/v1/crawl` + `/v1/scrape`

For fine-tuning and JSONL pipeline details, see LLM training data. For general ML dataset curation, see dataset curation.

Choosing a web scraping API for RAG corpus collection: fastCRW vs Firecrawl vs Apify

When comparing APIs for this specific use case, the relevant axes are: content fidelity (truth-recall), self-host availability, pricing at corpus scale, and Firecrawl compatibility for existing loaders.

	fastCRW	Firecrawl	Apify
Truth-recall	63.74% (819 labeled URLs, `diagnose_3way.py`, 2026-05-08)	56.04% (same benchmark)	Not benchmarked on this dataset
API style	Firecrawl-compatible REST	Native	Proprietary (Actors)
Self-host	Yes — AGPL-3.0 single binary, $0/page	No	No
Cloud pricing / 1,000 pages	Hobby: ~$2.60 at 5,000 credits/$13 · Scale: ~$0.55 at 1M credits/$549 (source: `PLAN_DISPLAY`, `src/lib/plans-client.ts`)	$0.83–$5.33 per 1,000 across tiers	Varies by Actor; compute-time billing
LLM extraction	Yes — `formats: ["json"]` + `jsonSchema`; 1 scrape credit + metered managed-LLM cost per call	Yes	Actor-dependent
MCP integration	Yes — `crw-mcp` npm package	Partial	No native MCP
Markdown output	Clean server-side stripping of nav/ads	Yes	Actor-dependent
Drop-in migration	—	Swap base URL from Firecrawl → fastCRW	Full rewrite required
p50 latency	1,914 ms	2,305 ms	Not benchmarked

Qualitative notes:

Firecrawl is the market leader and has a mature managed cloud. If you are already on Firecrawl, fastCRW is a drop-in alternative (base-URL swap) with higher truth-recall on the same benchmark dataset. See Firecrawl vs fastCRW.
Apify is the broadest actor marketplace — useful when you need site-specific scrapers (e.g., a dedicated Amazon actor). For general web corpus collection with clean markdown output, fastCRW's uniform API surface is simpler to operate at scale. See Apify alternatives.
Self-host advantage: At corpus scale (millions of pages), managed-API per-page costs dominate the budget. fastCRW's AGPL-3.0 binary lets you run the scraper on your own servers — $0 per page, only compute cost.
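
The per-1,000-page figures in the pricing row follow directly from plan price divided by plan credits. The sketch below reproduces them and adds a break-even helper for the self-host comparison. The server cost passed to the break-even helper is a number you supply, not a fastCRW figure, and both function names are hypothetical.

```python
import math


def cost_per_1000_pages(plan_price_usd: float, plan_credits: int, credits_per_page: int = 1) -> float:
    """Managed-cloud cost per 1,000 pages, assuming a flat credits-per-page rate."""
    return round(1000 * credits_per_page * plan_price_usd / plan_credits, 2)


def breakeven_pages(server_cost_usd: float, managed_cost_per_1000: float) -> int:
    """Monthly page count above which a self-hosted server is cheaper than the managed plan.

    server_cost_usd is your own monthly compute bill (an input, not a quoted price).
    """
    return math.ceil(server_cost_usd / managed_cost_per_1000 * 1000)


if __name__ == "__main__":
    print(cost_per_1000_pages(13, 5_000))        # Hobby tier from the table
    print(cost_per_1000_pages(549, 1_000_000))   # Scale tier from the table
```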

Architecture: web corpus collection pipeline for RAG

A production RAG corpus collection pipeline has five distinct stages:

Stage 1 — URL discovery

Use /v1/map to enumerate all reachable URLs from a seed domain. Most documentation sites and knowledge bases have predictable URL patterns; /v1/map also follows sitemaps.

curl -X POST https://api.fastcrw.com/v1/map \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com"}'

/v1/map costs 1 credit per call and returns the full URL list — use it as a cheap discovery step before any scraping credits are spent.
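
Before spending scrape credits on the discovered list, it usually pays to narrow it. The sketch below drops off-domain links, non-HTTP schemes, fragment-only variants, and trailing-slash duplicates. It works on a plain list of URL strings and makes no assumption about the /v1/map response shape; filter_discovered_urls is a hypothetical helper name.

```python
from urllib.parse import urldefrag, urlsplit


def filter_discovered_urls(urls: list[str], seed_url: str, path_prefix: str = "/") -> list[str]:
    """Keep same-host HTTP(S) URLs under path_prefix, deduplicated, in first-seen order."""
    seed_host = urlsplit(seed_url).netloc.lower()
    seen: set[str] = set()
    kept: list[str] = []
    for raw in urls:
        url, _fragment = urldefrag(raw.strip())   # "#section" variants point at the same page
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue
        if parts.netloc.lower() != seed_host:
            continue
        if not (parts.path or "/").startswith(path_prefix):
            continue
        key = url.rstrip("/")                      # treat /a and /a/ as one page
        if key in seen:
            continue
        seen.add(key)
        kept.append(url)
    return kept
```

Passing path_prefix="/docs" restricts the crawl list to one section of the site, which also helps when splitting a large domain into several jobs.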

Stage 2 — Bulk crawl with markdown normalization

For domains under 1,000 pages, use /v1/crawl to fetch the entire site asynchronously. For larger domains, iterate /v1/scrape concurrently across the URL list from Stage 1.

# Start async crawl
curl -X POST https://api.fastcrw.com/v1/crawl \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "maxPages": 1000,
    "maxDepth": 5,
    "formats": ["markdown"]
  }'

/v1/crawl returns a job ID. Poll /v1/crawl/:id for status and results.
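
For domains too large for a single crawl job, the "iterate /v1/scrape concurrently" path can be sketched with a thread pool. This is a minimal sketch under stated assumptions: the request body mirrors the cURL examples on this page, the markdown is assumed to sit at data.markdown in the response (as the jq filter in the cURL one-liner suggests), and max_workers=8 is an arbitrary placeholder rather than a documented rate limit. fetch_markdown and scrape_many are hypothetical helper names.

```python
import json
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable

CRW_API = "https://api.fastcrw.com/v1"


def fetch_markdown(url: str) -> str:
    """POST one URL to /v1/scrape and return its markdown (assumed at data.markdown)."""
    req = urllib.request.Request(
        f"{CRW_API}/scrape",
        data=json.dumps({"url": url, "formats": ["markdown"]}).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['CRW_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["data"]["markdown"]


def scrape_many(
    urls: list[str],
    fetch: Callable[[str], str] = fetch_markdown,
    max_workers: int = 8,
) -> tuple[dict[str, str], dict[str, str]]:
    """Scrape URLs concurrently; return (markdown by URL, error text by URL)."""
    pages: dict[str, str] = {}
    errors: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                pages[url] = fut.result()
            except Exception as exc:  # keep going; failed URLs can be retried later
                errors[url] = repr(exc)
    return pages, errors
```

Collecting failures separately, rather than raising on the first one, lets a batch job finish and retry only the URLs in errors.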

Stage 3 — Deduplication and quality filtering

After crawling, deduplicate pages and filter low-quality content before chunking:

import hashlib
import re

def content_hash(markdown: str) -> str:
    # Normalize whitespace before hashing so pages that differ only in spacing share a hash
    normalized = re.sub(r'\s+', ' ', markdown.strip())
    return hashlib.sha256(normalized.encode()).hexdigest()

def quality_filter(markdown: str) -> bool:
    # Reject pages that are too short or mostly non-body content
    word_count = len(markdown.split())
    if word_count < 150:
        return False
    # Reject pages where headings dominate (navigation dumps)
    heading_lines = sum(1 for line in markdown.splitlines() if line.startswith('#'))
    total_lines = max(len(markdown.splitlines()), 1)
    if heading_lines / total_lines > 0.4:
        return False
    return True
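
The SHA-256 hash above only catches pages that are identical after whitespace normalization. Pages that differ by a footer date or a breadcrumb still hash differently. As a complement (a different technique, not part of the pipeline above), the sketch below uses word shingles and Jaccard similarity to drop near-duplicates. The threshold of 0.9 and shingle size of 5 are illustrative defaults, and the pairwise comparison is quadratic, so treat it as a starting point for small batches.

```python
import re


def shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
    """Set of overlapping k-word windows from lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(pages: list[str], threshold: float = 0.9, k: int = 5) -> list[str]:
    """Keep the first page of each near-duplicate group (O(n^2) comparisons)."""
    kept: list[str] = []
    kept_shingles: list[set] = []
    for markdown in pages:
        current = shingles(markdown, k)
        if any(jaccard(current, earlier) >= threshold for earlier in kept_shingles):
            continue
        kept.append(markdown)
        kept_shingles.append(current)
    return kept
```

For corpora beyond a few thousand pages, a MinHash or locality-sensitive hashing index avoids the pairwise cost; that is outside the scope of this sketch.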

Stage 4 — Chunking for retrieval

Split markdown at heading boundaries. The heading structure fastCRW preserves in its output is directly usable as chunk seam points:

# Current LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n## ", "\n### ", "\n#### ", "\n\n", "\n", " "],
)

def chunk_page(url: str, markdown: str, h1: str = "") -> list[dict]:
    chunks = splitter.split_text(markdown)
    return [
        {
            "text": chunk,
            "metadata": {
                "source_url": url,
                "page_title": h1,
                "chunk_index": idx,
            }
        }
        for idx, chunk in enumerate(chunks)
    ]
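
The comparison table earlier lists a heading path as required chunk metadata, which the chunker above does not yet record. One way to get it, sketched here in plain Python, is to split the markdown into sections first and carry the stack of enclosing headings with each section. The function name and the " > " separator are illustrative choices, and lines inside fenced code blocks are deliberately not treated as headings.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def sections_with_heading_path(markdown: str) -> list[dict]:
    """Split markdown at headings and attach the enclosing heading path to each section."""
    sections: list[dict] = []
    path: list[tuple[int, str]] = []  # stack of (heading level, heading title)
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append({
                "heading_path": " > ".join(title for _, title in path),
                "text": text,
            })
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # '#' lines inside code fences are comments, not headings
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()  # close the previous section under the old path
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections
```

Each section's text can then go through the splitter, with heading_path copied into every resulting chunk's metadata next to source_url.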

Stage 5 — Embedding and upsert

Embed each chunk and upsert to your vector store with source metadata for citation:

from openai import OpenAI
import psycopg

client = OpenAI()

def embed_and_upsert(chunks: list[dict], conn) -> None:
    texts = [c["text"] for c in chunks]
    embeddings = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    ).data

    with conn.cursor() as cur:
        for chunk, emb in zip(chunks, embeddings):
            cur.execute(
                """
                INSERT INTO rag_corpus
                  (source_url, page_title, chunk_index, body, content_hash, embedding)
                VALUES (%s, %s, %s, %s, md5(%s), %s)
                ON CONFLICT (source_url, chunk_index)
                DO UPDATE SET
                  body = EXCLUDED.body,
                  content_hash = EXCLUDED.content_hash,
                  embedding = EXCLUDED.embedding,
                  updated_at = now()
                """,
                (
                    chunk["metadata"]["source_url"],
                    chunk["metadata"]["page_title"],
                    chunk["metadata"]["chunk_index"],
                    chunk["text"],
                    chunk["text"],
                    emb.embedding,
                )
            )
        conn.commit()
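
The upsert above relies on a rag_corpus table with a unique constraint on (source_url, chunk_index), which this page does not define. The sketch below shows one table layout that would satisfy that statement. Column types are assumptions inferred from the INSERT, and the 1536 dimension matches the default output size of text-embedding-3-small; adjust it if you request a different size or model.

```python
# One possible schema for the upsert above; names match the INSERT, types are assumptions.
SCHEMA_STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS rag_corpus (
        id            bigserial PRIMARY KEY,
        source_url    text        NOT NULL,
        page_title    text,
        chunk_index   integer     NOT NULL,
        body          text        NOT NULL,
        content_hash  text        NOT NULL,
        embedding     vector(1536),
        updated_at    timestamptz NOT NULL DEFAULT now(),
        UNIQUE (source_url, chunk_index)
    )
    """,
]


def ensure_schema(conn) -> None:
    """Create the pgvector extension and the table if they do not exist yet."""
    with conn.cursor() as cur:
        for statement in SCHEMA_STATEMENTS:
            cur.execute(statement)
    conn.commit()
```

Note that an upsert keyed on chunk_index does not remove trailing chunks when a page shrinks on re-crawl, so a refresh job may also need to delete rows whose chunk_index is beyond the new chunk count.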

Full Python pipeline

Here is a complete working pipeline that ties together all five stages:

"""
rag_corpus_builder.py — Build a RAG knowledge base from a web domain.
Uses fastCRW /v1/map + /v1/crawl, deduplicates, chunks, and upserts to pgvector.
Run with: uv run python rag_corpus_builder.py
"""

import os
import time
import hashlib
import re
import requests
import psycopg
from openai import OpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter  # pip install langchain-text-splitters

CRW_API = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRW_API_KEY']}"}
openai_client = OpenAI()

# ── Stage 1: URL discovery ─────────────────────────────────────────────────

def discover_urls(seed_url: str) -> list[str]:
    resp = requests.post(f"{CRW_API}/map", json={"url": seed_url}, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    return resp.json().get("urls", [])

# ── Stage 2: Async crawl ───────────────────────────────────────────────────

def start_crawl(seed_url: str, max_pages: int = 500) -> str:
    payload = {
        "url": seed_url,
        "maxPages": max_pages,
        "maxDepth": 5,
        "formats": ["markdown"],
    }
    resp = requests.post(f"{CRW_API}/crawl", json=payload, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()["id"]

def poll_crawl(job_id: str, poll_interval: int = 5, max_wait: int = 3600) -> list[dict]:
    # Poll until the job finishes; give up after max_wait seconds instead of looping forever
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        resp = requests.get(f"{CRW_API}/crawl/{job_id}", headers=HEADERS, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        status = data.get("status")
        if status == "completed":
            return data.get("data", [])
        elif status in ("failed", "cancelled"):
            raise RuntimeError(f"Crawl {job_id} ended with status: {status}")
        print(f"  Crawl status: {status} ({data.get('completed', 0)}/{data.get('total', '?')} pages)")
        time.sleep(poll_interval)
    raise TimeoutError(f"Crawl {job_id} did not finish within {max_wait} seconds")

# ── Stage 3: Dedup + quality filter ───────────────────────────────────────

def content_hash(text: str) -> str:
    normalized = re.sub(r'\s+', ' ', text.strip())
    return hashlib.sha256(normalized.encode()).hexdigest()

def is_quality(markdown: str, min_words: int = 150) -> bool:
    words = len(markdown.split())
    if words < min_words:
        return False
    lines = markdown.splitlines()
    headings = sum(1 for l in lines if l.startswith('#'))
    if lines and headings / len(lines) > 0.4:
        return False
    return True

def deduplicate(pages: list[dict]) -> list[dict]:
    seen: set[str] = set()
    out: list[dict] = []
    for page in pages:
        md = page.get("markdown", "")
        if not md or not is_quality(md):
            continue
        h = content_hash(md)
        if h not in seen:
            seen.add(h)
            page["_hash"] = h
            out.append(page)
    return out

# ── Stage 4: Chunking ──────────────────────────────────────────────────────

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],
)

def chunk_page(page: dict) -> list[dict]:
    url = page.get("metadata", {}).get("url", page.get("url", ""))
    title = page.get("metadata", {}).get("title", "")
    md = page.get("markdown", "")
    return [
        {"text": c, "url": url, "title": title, "idx": i}
        for i, c in enumerate(splitter.split_text(md))
    ]

# ── Stage 5: Embed + upsert ────────────────────────────────────────────────

def embed_chunks(chunks: list[dict]) -> None:
    texts = [c["text"] for c in chunks]
    embeddings = openai_client.embeddings.create(
        model="text-embedding-3-small", input=texts
    ).data
    with psycopg.connect(os.environ["DATABASE_URL"]) as conn, conn.cursor() as cur:
        for chunk, emb in zip(chunks, embeddings):
            cur.execute(
                """
                INSERT INTO rag_corpus
                  (source_url, page_title, chunk_index, body, content_hash, embedding)
                VALUES (%s, %s, %s, %s, md5(%s), %s)
                ON CONFLICT (source_url, chunk_index) DO UPDATE
                  SET body = EXCLUDED.body,
                      content_hash = EXCLUDED.content_hash,
                      embedding = EXCLUDED.embedding,
                      updated_at = now()
                """,
                (chunk["url"], chunk["title"], chunk["idx"],
                 chunk["text"], chunk["text"], emb.embedding),
            )
        conn.commit()

# ── Main ───────────────────────────────────────────────────────────────────

if __name__ == "__main__":
    seed = "https://docs.example.com"

    print(f"[1/5] Discovering URLs on {seed}...")
    urls = discover_urls(seed)
    print(f"  Found {len(urls)} URLs")

    print("[2/5] Starting async crawl (up to 500 pages)...")
    job_id = start_crawl(seed, max_pages=500)
    print(f"  Crawl job: {job_id}")
    pages = poll_crawl(job_id)
    print(f"  Crawled {len(pages)} pages")

    print("[3/5] Deduplicating and quality-filtering...")
    clean_pages = deduplicate(pages)
    print(f"  {len(clean_pages)} unique quality pages (from {len(pages)} raw)")

    print("[4/5] Chunking...")
    all_chunks = []
    for page in clean_pages:
        all_chunks.extend(chunk_page(page))
    print(f"  {len(all_chunks)} chunks")

    print("[5/5] Embedding and upserting to pgvector...")
    batch_size = 100
    for i in range(0, len(all_chunks), batch_size):
        batch = all_chunks[i : i + batch_size]
        embed_chunks(batch)
        print(f"  Upserted {min(i + batch_size, len(all_chunks))}/{len(all_chunks)} chunks")

    print("Done. RAG corpus ready.")

JavaScript / TypeScript example

For teams using Node.js or Deno with LangChain:

// rag-corpus-builder.ts
// Drop-in: same endpoints as Firecrawl — swap base URL only.
import { FirecrawlLoader } from "@langchain/community/document_loaders/web/firecrawl";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const loader = new FirecrawlLoader({
  url: "https://docs.example.com",
  apiKey: process.env.CRW_API_KEY!,
  apiUrl: "https://api.fastcrw.com", // ← only change from Firecrawl
  mode: "crawl",
  params: {
    maxPages: 500,
    formats: ["markdown"],
  },
});

const docs = await loader.load();
console.log(`Loaded ${docs.length} pages`);

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 150,
  separators: ["\n## ", "\n### ", "\n\n", "\n", " "],
});

const chunks = await splitter.splitDocuments(docs);
console.log(`Split into ${chunks.length} chunks`);

// Upsert to your vector store here (Pinecone, pgvector, Qdrant, Weaviate)

cURL one-liner: scrape a single page for RAG

curl -X POST https://api.fastcrw.com/v1/scrape \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com/concepts/overview", "formats": ["markdown"]}' \
  | jq '.data.markdown'

This returns clean body markdown — no navigation, no ads, no cookie banners — ready to pipe into your chunker.

Good fits for this approach

Documentation sites crawled into a support chatbot corpus (product docs, API references, knowledge bases)
Research paper aggregation — crawl arXiv abstracts, conference proceedings, or domain-specific journals into a retrieval index for a research assistant
Internal enterprise knowledge — crawl internal wikis, Confluence spaces, or intranet sites and index them for an employee-facing AI assistant
Domain-specific AI agents — build the grounding corpus for an agent that needs to answer questions about a specific industry (legal, medical, financial)
Competitor intelligence bases — crawl and index public competitor documentation, blog posts, and release notes for retrieval-augmented analysis

Incremental re-crawl: keeping the corpus fresh

A RAG knowledge base built on web content becomes stale as the source pages change. Implement incremental refresh:

Weekly URL audit — re-run /v1/map on each seed domain. Diff against your stored URL list to detect added and removed pages.
Changed-page detection — re-scrape all URLs and compare content hashes. fastCRW's 91.8% scrape-success (of reachable URLs, RESULT_3WAY_1000_FULL.md, 2026-05-08) means unreachable pages are the rare case; plan for retry on any genuinely unreachable URLs.
Selective re-embedding — only re-chunk and re-embed pages where the content hash changed. On a 100,000-page corpus with a typical 5–10% weekly change rate, this is 5,000–10,000 pages/week, not 100,000.
Soft delete removed pages — when a URL disappears from /v1/map, mark its chunks as inactive rather than deleting immediately. This preserves the vector rows until you confirm the page is gone, not just temporarily unreachable.
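
The four refresh steps reduce to set and hash comparisons. The sketch below takes the stored state (URL to content hash), the latest URL list, and the hashes from the latest re-scrape, and returns which URLs to add, re-embed, or soft-delete. It is storage-agnostic and plan_refresh is a hypothetical name. URLs that failed to re-scrape are absent from fresh_hashes and are left untouched so they can be retried.

```python
def plan_refresh(
    stored: dict[str, str],
    mapped_urls: set[str],
    fresh_hashes: dict[str, str],
) -> dict[str, list[str]]:
    """Classify URLs for an incremental refresh.

    stored:        URL -> content hash currently in the index
    mapped_urls:   URLs returned by the latest discovery pass
    fresh_hashes:  URL -> content hash from the latest re-scrape
    """
    stored_urls = set(stored)
    return {
        # New pages: scrape, chunk, and embed.
        "added": sorted(mapped_urls - stored_urls),
        # Missing from discovery: mark chunks inactive rather than deleting.
        "removed": sorted(stored_urls - mapped_urls),
        # Hash differs: re-chunk and re-embed only these.
        "changed": sorted(
            u for u in stored_urls & mapped_urls
            if u in fresh_hashes and fresh_hashes[u] != stored[u]
        ),
    }
```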

Pricing for corpus collection at scale

Credit costs (source: PLAN_DISPLAY, src/lib/plans-client.ts):

/v1/map — 1 credit per call (covers an entire domain)
/v1/scrape — 1 credit per page (any renderer: auto, http, lightpanda, or chrome)
/v1/crawl — 1 credit per page crawled (any renderer)

Example: 10,000-page corpus

URL discovery: 10 domains × 1 credit = 10 credits (negligible)
Crawl 10,000 pages × 1 credit = 10,000 credits
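Weekly refresh (8% change rate, 800 pages) × 1 credit = 800 credits/week ≈ 3,200 credits/month. This assumes only the changed pages are re-scraped; re-scraping all 10,000 URLs for hash comparison costs 10,000 credits per pass.
Total (first month, including the initial crawl): ~13,200 credits → Standard plan ($69/mo, 100,000 credits; source: PLAN_DISPLAY)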

Example: 100,000-page corpus, mixed rendering

Crawl 100,000 pages × 1 credit = 100,000 credits (flat rate, any renderer)
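Weekly refresh (8% change rate, 8,000 pages) = 8,000 credits/week ≈ 32,000 credits/month. This assumes only the changed pages are re-scraped; re-scraping all 100,000 URLs for hash comparison costs 100,000 credits per pass.
Total (first month, including the initial crawl): ~132,000 credits → Growth plan ($279/mo, 500,000 credits; source: PLAN_DISPLAY)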
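
The two worked examples follow one formula, sketched below so you can plug in your own corpus size. It assumes the flat 1-credit rates listed above and four refresh passes per month; monthly_credits is a hypothetical helper, not part of any fastCRW SDK.

```python
def monthly_credits(
    pages: int,
    change_rate: float,
    domains: int = 0,
    refreshes_per_month: int = 4,
    include_initial_crawl: bool = True,
) -> int:
    """Estimate credits for one month: discovery + initial crawl + refresh passes.

    Assumes 1 credit per map call and 1 credit per page scraped or crawled.
    """
    discovery = domains                                   # one map call per seed domain
    initial = pages if include_initial_crawl else 0       # first full crawl
    refresh = round(pages * change_rate) * refreshes_per_month
    return discovery + initial + refresh


if __name__ == "__main__":
    print(monthly_credits(10_000, 0.08, domains=10))   # first example above
    print(monthly_credits(100_000, 0.08))              # second example above
```

Setting change_rate=1.0 models a refresh that re-scrapes every URL for hash comparison, and include_initial_crawl=False gives the steady-state cost for later months.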

Self-hosting: AGPL-3.0, single binary. Crawl at $0/page on your own server. Only cost is compute. See self-hosting.

Comparison: RAG corpus collection vs inference-time retrieval

	Corpus collection (this page)	Inference-time RAG (→ RAG pipelines)
When it runs	Scheduled batch job (daily / weekly)	Every user query (real-time)
Primary cost driver	Scrape credits (per page crawled)	Embedding API calls + vector query latency
Bottleneck	Content fidelity and dedup quality	Chunk quality and retrieval precision
fastCRW role	`/v1/crawl` + `/v1/map` for bulk collection	`/v1/scrape` for on-demand page fetch
Freshness pattern	Incremental re-crawl on schedule	Always live (or near-real-time)
Scale	Millions of pages per scheduled run	One page per turn, per user

FAQ

Q: How does fastCRW's p90 latency compare to the other tools?

A: In fast mode, fastCRW's p90 is 4,348 ms — the lowest of the three tools tested (Crawl4AI 4,754 ms, Firecrawl 6,937 ms) (RESULT_3WAY_1000_FULL.md, 2026-05-08). The chrome-stealth fallback that recovers the URLs others miss is the same mechanism behind both the recall lead and the p90 win. For bulk corpus collection (batch jobs, not real-time), this combination of high recall and competitive p90 is a direct advantage.

Q: Can I use fastCRW with LangChain, LlamaIndex, or other RAG frameworks?

A: Yes. fastCRW is Firecrawl-compatible: it uses the same request shape, endpoint names, and response fields. Any LangChain FirecrawlLoader or LlamaIndex FirecrawlWebReader that targets Firecrawl works against fastCRW after a base-URL swap (https://api.fastcrw.com). See the TypeScript example above.

Q: How do I extract structured metadata (title, author, publish date) from corpus pages?

A: Pass formats: ["json"] and a jsonSchema to /v1/scrape. fastCRW's LLM extraction (1 scrape credit + metered managed-LLM cost) fills your schema fields automatically from the page HTML. For corpus collection at scale, extract metadata on the pages where it matters (news articles, research papers) and skip extraction on reference docs where structure is less important.
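
A request body for that call might look like the sketch below. The formats and jsonSchema field names come from the answer above; the schema contents (title, author, published_date) are an illustrative example written in standard JSON Schema, and the response shape is not shown because it is not specified here.

```python
import json


def build_extract_payload(url: str) -> dict:
    """Build a /v1/scrape body that requests schema-guided JSON extraction."""
    return {
        "url": url,
        "formats": ["json"],
        "jsonSchema": {  # example schema; replace with the fields your corpus needs
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "author": {"type": "string"},
                "published_date": {"type": "string", "description": "ISO 8601 date"},
            },
            "required": ["title"],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_extract_payload("https://example.com/article"), indent=2))
```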

Q: Is fastCRW suitable for scraping behind authentication?

A: fastCRW does not manage authenticated sessions. For pages behind login, pre-authenticate in a real browser, export cookies, and pass them as request headers via the headers field on /v1/scrape. This works for session-cookie-based auth; OAuth flows requiring redirects need to be completed outside fastCRW.
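
As a sketch of the cookie approach: many browser cookie exporters produce a list of objects with name and value keys (an assumption about your export format). Those can be folded into a single Cookie header and passed through the headers field mentioned above. Both helper names are hypothetical.

```python
def cookie_header(cookies: list[dict]) -> str:
    """Join exported cookies into the value of an HTTP Cookie header."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def build_authenticated_payload(url: str, cookies: list[dict]) -> dict:
    """Build a /v1/scrape body that forwards session cookies via the headers field."""
    return {
        "url": url,
        "formats": ["markdown"],
        "headers": {"Cookie": cookie_header(cookies)},
    }
```

Session cookies expire, so a long-running corpus job should treat a sudden run of login pages in the output as a signal to re-export them.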

Q: What is the maximum corpus size supported by /v1/crawl?

A: /v1/crawl accepts maxPages up to 1,000 (and maxDepth up to 10) per job (crw-opencore/README.md). For domains larger than 1,000 pages, break the crawl into multiple jobs by path prefix (e.g., /docs/api/, /docs/guides/ as separate seeds), or use /v1/map to discover all URLs and iterate /v1/scrape concurrently across the full list.
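
To decide where to split, you can group a discovered URL list by path prefix and check which groups still exceed the per-job cap. A minimal sketch follows; the depth of 2 path segments is an arbitrary starting point, and group_by_prefix and oversized are hypothetical names.

```python
from collections import defaultdict
from urllib.parse import urlsplit

MAX_PAGES_PER_JOB = 1000  # per-job cap stated in the answer above


def group_by_prefix(urls: list[str], depth: int = 2) -> dict[str, list[str]]:
    """Group URLs by their first `depth` path segments, e.g. /docs/api."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        prefix = "/" + "/".join(segments[:depth])
        groups[prefix].append(url)
    return dict(groups)


def oversized(groups: dict[str, list[str]], cap: int = MAX_PAGES_PER_JOB) -> list[str]:
    """Prefixes whose group still exceeds the per-job cap and needs a deeper split."""
    return sorted(prefix for prefix, members in groups.items() if len(members) > cap)
```

Groups under the cap can each seed one crawl job; for any prefix returned by oversized, regroup that subset with a larger depth or fall back to per-URL scraping.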

Q: How do I validate that fastCRW captured a page correctly before embedding it?

A: Sample 50–100 pages from your crawl and manually compare the markdown output to the live page. Look for: (1) body text present, (2) code blocks preserved, (3) headings intact, (4) no navigation/footer artifacts dominating the output. The truth-recall benchmark (63.74% of 819 labeled URLs, diagnose_3way.py, 2026-05-08) gives you a baseline expectation — on a typical web corpus, you should see faithful extraction on 60–65% of pages; the remainder may need manual review or a different renderer.
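
Manual review goes faster if you draw a reproducible sample and run a few cheap automated checks first. The sketch below covers the four checks from the answer in heuristic form. The thresholds (150 words, 50% link-only lines) are illustrative assumptions, and passing them does not replace comparing against the live page.

```python
import random
import re

LINK_ONLY_RE = re.compile(r"\s*[-*]?\s*\[[^\]]*\]\([^)]*\)\s*")


def sample_pages(pages: dict[str, str], n: int = 50, seed: int = 0) -> list[str]:
    """Pick up to n URLs for manual review; a fixed seed keeps the sample reproducible."""
    rng = random.Random(seed)
    urls = sorted(pages)
    return rng.sample(urls, min(n, len(urls)))


def quick_checks(markdown: str) -> dict[str, bool]:
    """Cheap heuristics that flag pages worth a closer manual look."""
    lines = markdown.splitlines()
    non_empty = [line for line in lines if line.strip()]
    fence_count = sum(1 for line in lines if line.lstrip().startswith("```"))
    link_only = sum(1 for line in lines if LINK_ONLY_RE.fullmatch(line))
    return {
        "has_body_text": len(markdown.split()) >= 150,
        "has_heading": any(line.startswith("#") for line in lines),
        "code_fences_balanced": fence_count % 2 == 0,
        "not_link_dominated": link_only <= 0.5 * max(len(non_empty), 1),
    }
```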

Related resources

RAG pipelines — inference-time retrieval: chunking, embedding, and querying the corpus you built here
LLM training data — fine-tuning and JSONL output from web content
Dataset curation — general ML dataset assembly from the open web
Firecrawl alternatives — drop-in migration guide from Firecrawl to fastCRW
Apify alternatives — when to use Apify actors vs a uniform scraping API
Benchmarks — full 3-way benchmark methodology and raw results
Pricing — current plan pricing and credits


Sources

fastCRW 3-way benchmark result (RESULT_3WAY_1000_FULL.md, 2026-05-08)

https://github.com/us/crw-saas/blob/main/bench/server-runs/RESULT_3WAY_1000_FULL.md

fastCRW /v1/scrape API reference

https://docs.fastcrw.com/api-reference/scrape/

LangChain RecursiveCharacterTextSplitter

https://python.langchain.com/docs/concepts/text_splitters/

pgvector — open-source vector extension for Postgres

https://github.com/pgvector/pgvector

FAQ

Why does scraper accuracy matter for RAG and AI agent training data?

The quality of your retrieval index is bounded by the fidelity of the underlying text. If the scraper captures navigation chrome, cookie banners, or truncated body text, those artifacts appear in retrieved chunks and degrade LLM answer quality. fastCRW's truth-recall benchmark — 63.74% of 819 labeled URLs on Firecrawl's public scrape-content-dataset-v1 (`diagnose_3way.py`, 2026-05-08) — measures exactly that: whether the meaningful body text made it through intact, compared to 59.95% for Crawl4AI and 56.04% for Firecrawl on the same dataset.

How is this different from the RAG pipelines page?

The RAG pipelines page covers inference-time retrieval: embedding a page and querying it at chat time. This page covers the data-collection phase upstream of that: how to crawl, clean, deduplicate, and normalize large web corpora that become the retrieval index. You typically do corpus collection once (and incrementally re-crawl); inference-time retrieval happens on every user query.

Can I use fastCRW to self-host the entire corpus collection pipeline?

Yes. fastCRW ships as a single static Rust binary under AGPL-3.0 (`github.com/us/crw`). Run it on any server — no Redis, no Node.js, no containers required. Pair it with a self-hosted vector store (Qdrant, pgvector) and the full pipeline — crawl → chunk → embed → store — stays inside your network at $0 per 1,000 pages on your own infrastructure.

How do I handle JavaScript-rendered pages in my corpus?

Pass renderer: "lightpanda" or renderer: "chrome" on the /v1/scrape or /v1/crawl request to force a JavaScript-capable renderer. Left on auto, fastCRW escalates through an http → lightpanda → chrome fallback, so the heavier chrome path runs only when the lighter renderers fail. All renderers (auto, http, lightpanda, chrome) cost the same 1 credit per page. For most documentation and article sites, http or lightpanda is sufficient; reserve chrome rendering for heavily interactive pages.

What is the recommended chunk size for RAG corpora built from web content?

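For web content, 800–1,200 tokens with 10–15% overlap works well with most embedding models. Note that RecursiveCharacterTextSplitter measures chunk_size in characters by default, so the chunk_size=1000 setting in the examples on this page produces chunks well below that token range; raise it or supply a token-based length function if you want to match it. Markdown heading structure from fastCRW output gives splitters natural seam points (## and ### boundaries). Include the H1 → H2 heading path in chunk metadata so the retriever can cite the exact section, not just the page URL.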

How do I keep my RAG corpus fresh as pages change?

Store a SHA-256 hash of each page's markdown alongside the vectors. Re-run /v1/map weekly to catch new or removed URLs. Re-scrape the existing URL list and compare hashes — only re-embed pages that actually changed. This incremental approach keeps costs proportional to the rate of change, not the size of the corpus.
