
Web Scraping for RAG and AI Agent Training Data

Collect, clean, and normalize web corpora for RAG knowledge bases and AI agent training datasets with fastCRW — high-fidelity markdown, 63.74% truth-recall, Firecrawl-compatible API, single Rust binary.

Published: June 13, 2026
Updated: June 13, 2026
Category: use cases
  • 63.74% truth-recall on Firecrawl's public 1,000-URL benchmark (`diagnose_3way.py`, 2026-05-08), the highest of three tools tested
  • Clean markdown output that chunks predictably without navigation noise
  • Crawl entire domains at scale with `/v1/crawl`, up to 1,000 pages per job
  • Firecrawl-compatible API: swap base URL, keep your existing loaders

Who this is for

ML engineers and AI teams who need to collect large web corpora for:

  • RAG knowledge bases — crawling documentation sites, wikis, or domain-specific article collections to build the retrieval index behind an LLM chatbot
  • AI agent training datasets — gathering diverse, high-quality web text for instruction tuning, preference learning, or evaluation harness construction
  • Domain-specific pretraining or continued pretraining — assembling topic-focused corpora from authoritative web sources

The hard problem at this phase is not embedding or retrieval — it is getting clean, faithful body text out of the web at scale without navigation noise, truncated content, or JavaScript-rendered pages that return empty. That is what fastCRW is built for.

This page is about data collection upstream of your index. For inference-time retrieval (chunking, embedding, querying), see RAG pipelines.


Why scraper accuracy is the bottleneck for RAG corpus quality

When you build a RAG knowledge base, every stage of the pipeline inherits from the one before it. A scraper that returns truncated body text, navigation chrome, or cookie-consent boilerplate produces chunks full of that noise. Those chunks get embedded, upserted, and retrieved — and the LLM surfaces them as answers.

The industry has no standard metric for this, so fastCRW commissioned a benchmark against Firecrawl's own public scrape-content-dataset-v1 (1,000 URLs, 819 with labeled ground truth). The harness (diagnose_3way.py, run 2026-05-08) compares each scraper's markdown output to the labeled body text and measures what fraction of labeled URLs produced a faithful extraction.

| Metric | fastCRW | Crawl4AI | Firecrawl |
| --- | --- | --- | --- |
| Truth-recall (of 819 labeled URLs) | 63.74% (522) | 59.95% (491) | 56.04% (459) |
| Scrape-success (of 1,000 URLs) | 87.7% (877) | 83.5% (835) | 89.7% (897) |
| p50 latency | 1,914 ms | 1,916 ms | 2,305 ms |
| p90 latency | 14,157 ms | 4,754 ms | 6,937 ms |
| p99 latency | 15,012 ms | 13,749 ms | 21,107 ms |
| Thrown errors | 0 | 0 | 0 |

Source: bench/server-runs/RESULT_3WAY_1000_FULL.md, diagnose_3way.py, 2026-05-08.

What this means for corpus collection:

  • fastCRW's 63.74% truth-recall is +3.79 percentage points over Crawl4AI and +7.70 pp over Firecrawl on the same 819 labeled URLs. At corpus scale (100,000 pages), that difference is thousands of pages where your RAG index has faithful content versus navigation noise or empty bodies.
  • fastCRW's p90 latency (14,157 ms) is the highest of the three. This is a deliberate trade-off: the chrome-stealth fallback that recovers the URLs others miss is the same mechanism that produces a slow tail on complex pages. For bulk corpus collection (not real-time scraping), this trade-off favors accuracy.
  • Firecrawl has higher scrape-success (89.7% vs 87.7%) but lower truth-recall. It fetches the page more often but extracts the meaningful body text less faithfully.
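
The gaps quoted above can be reproduced from the raw counts in the table. The following is a minimal sketch; the function names are illustrative, and the 100,000-page projection assumes the benchmark's recall carries over to your own corpus, which it may not. Computed from raw counts, the Firecrawl gap comes out at about 7.69 pp; the 7.70 figure is the difference of the two rounded percentages.

```python
# Faithful extractions out of 819 labeled URLs, taken from the benchmark table.
LABELED_URLS = 819
FAITHFUL = {"fastCRW": 522, "Crawl4AI": 491, "Firecrawl": 459}

def truth_recall(tool: str) -> float:
    """Truth-recall as a percentage of labeled URLs."""
    return 100 * FAITHFUL[tool] / LABELED_URLS

def gap_pp(a: str, b: str) -> float:
    """Difference between two tools, in percentage points."""
    return truth_recall(a) - truth_recall(b)

def extra_faithful_pages(a: str, b: str, corpus_pages: int) -> int:
    """Projected extra faithful pages, ASSUMING benchmark recall holds on your corpus."""
    return round(corpus_pages * gap_pp(a, b) / 100)

for other in ("Crawl4AI", "Firecrawl"):
    print(other,
          round(gap_pp("fastCRW", other), 2),
          extra_faithful_pages("fastCRW", other, 100_000))
```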

Publish the full p50/p90/p99 split in your own benchmarks. A single average hides the tail behavior that matters for scheduler planning.
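
To report the same split from your own runs, a nearest-rank percentile over the recorded latencies is enough. This is a sketch, not the method used by diagnose_3way.py (the source does not say how its percentiles were computed); `percentile` and `latency_report` are hypothetical helper names.

```python
import math

def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # p * n / 100 keeps the arithmetic exact for integer p (avoids 0.9 * 100 rounding up).
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def latency_report(latencies_ms: list[float]) -> dict[str, float]:
    """Return the p50/p90/p99 split for a list of per-request latencies."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}

# Nine fast requests and one slow one: the mean (1,090 ms) hides the tail.
samples = [100.0] * 9 + [10_000.0]
print(sum(samples) / len(samples), latency_report(samples))
```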


Differentiating RAG corpus collection from fine-tuning datasets

These two use cases share a crawl step but diverge immediately after:

| Concern | RAG corpus collection | Fine-tuning dataset |
| --- | --- | --- |
| Output format | Chunked markdown + vector embeddings | JSONL (prompt/completion, instruction/input/output) |
| Scale | Millions of pages, continuous refresh | Thousands of curated examples, one-time |
| Quality filter | Dedup + length filter; some noise acceptable | Strict curation; noise degrades model weights |
| Freshness | Must stay current (re-crawl on schedule) | Static snapshot is fine after training run |
| Chunk metadata | Source URL + heading path required | Source attribution optional |
| Primary fastCRW endpoint | /v1/crawl for bulk, /v1/scrape for targeted | /v1/crawl + /v1/scrape |

For fine-tuning and JSONL pipeline details, see LLM training data. For general ML dataset curation, see dataset curation.


Choosing a web scraping API for RAG corpus collection: fastCRW vs Firecrawl vs Apify

When comparing APIs for this specific use case, the relevant axes are: content fidelity (truth-recall), self-host availability, pricing at corpus scale, and Firecrawl compatibility for existing loaders.

| | fastCRW | Firecrawl | Apify |
| --- | --- | --- | --- |
| Truth-recall | 63.74% (819 labeled URLs, diagnose_3way.py, 2026-05-08) | 56.04% (same benchmark) | Not benchmarked on this dataset |
| API style | Firecrawl-compatible REST | Native | Proprietary (Actors) |
| Self-host | Yes (AGPL-3.0 single binary, $0/page) | No | No |
| Cloud pricing / 1,000 pages | Hobby: ~$4.33 at 3,000 credits/$13 · Scale: ~$0.55 at 1M credits/$549 (source: PLAN_DISPLAY, src/lib/plans-client.ts) | $0.83–$5.33 per 1,000 across tiers (source: marketing/competitor-prices.lock.md, verified 2026-05-18) | Varies by Actor; compute-time billing |
| LLM extraction | Yes (formats: ["json"] + jsonSchema; 5 credits per call) | Yes | Actor-dependent |
| MCP integration | Yes (crw-mcp npm package) | Partial | No native MCP |
| Markdown output | Clean server-side stripping of nav/ads | Yes | Actor-dependent |
| Drop-in migration | Swap base URL (Firecrawl → fastCRW) | | Full rewrite required |
| p50 latency | 1,914 ms | 2,305 ms | Not benchmarked |

Qualitative notes:

  • Firecrawl is the market leader and has a mature managed cloud. If you are already on Firecrawl, fastCRW is a drop-in alternative (base-URL swap) with higher truth-recall on the same benchmark dataset. See Firecrawl vs fastCRW.
  • Apify is the broadest actor marketplace — useful when you need site-specific scrapers (e.g., a dedicated Amazon actor). For general web corpus collection with clean markdown output, fastCRW's uniform API surface is simpler to operate at scale. See Apify alternatives.
  • Self-host advantage: At corpus scale (millions of pages), managed-API per-page costs dominate the budget. fastCRW's AGPL-3.0 binary lets you run the scraper on your own servers — $0 per page, only compute cost.

Architecture: web corpus collection pipeline for RAG

A production RAG corpus collection pipeline has five distinct stages:

Stage 1 — URL discovery

Use /v1/map to enumerate all reachable URLs from a seed domain. Most documentation sites and knowledge bases have predictable URL patterns; /v1/map also follows sitemaps.

curl -X POST https://api.fastcrw.com/v1/map \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com"}'

/v1/map costs 1 credit per call and returns the full URL list — use it as a cheap discovery step before any scraping credits are spent.
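
Before spending scrape credits, it can help to trim the discovered list to the host and path you actually want. The sketch below assumes the /v1/map response has already been parsed into a plain list of URL strings; `select_urls` is a hypothetical helper, not part of fastCRW.

```python
from urllib.parse import urldefrag, urlsplit

def select_urls(urls: list[str], seed_host: str, path_prefix: str = "/") -> list[str]:
    """Keep same-host URLs under path_prefix, drop #fragments, preserve first-seen order."""
    seen: set[str] = set()
    kept: list[str] = []
    for raw in urls:
        # Two links that differ only by fragment point at the same page.
        url, _fragment = urldefrag(raw.strip())
        parts = urlsplit(url)
        if parts.hostname != seed_host or not parts.path.startswith(path_prefix):
            continue
        if url not in seen:
            seen.add(url)
            kept.append(url)
    return kept

discovered = [
    "https://docs.example.com/guide/a",
    "https://docs.example.com/guide/a#intro",
    "https://blog.example.com/post",
    "https://docs.example.com/api/x",
]
print(select_urls(discovered, "docs.example.com", "/guide/"))
```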

Stage 2 — Bulk crawl with markdown normalization

For domains of up to 1,000 pages, use /v1/crawl to fetch the entire site asynchronously. For larger domains, iterate /v1/scrape concurrently across the URL list from Stage 1.

# Start async crawl
curl -X POST https://api.fastcrw.com/v1/crawl \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "maxPages": 1000,
    "maxDepth": 5,
    "formats": ["markdown"]
  }'

/v1/crawl returns a job ID. Poll /v1/crawl/:id for status and results.
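
For domains above the per-job limit, the concurrent /v1/scrape iteration could be sketched as follows. `scrape_one` is a hypothetical callable you supply (for example, a function that POSTs one URL to /v1/scrape and returns the parsed page); the worker count and retry count are illustrative defaults, not fastCRW recommendations.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

def scrape_all(
    urls: list[str],
    scrape_one: Callable[[str], dict],
    max_workers: int = 8,
    max_attempts: int = 3,
) -> tuple[list[dict], list[str]]:
    """Scrape urls concurrently; return (pages, urls that failed every attempt)."""

    def attempt(url: str) -> Optional[dict]:
        for _ in range(max_attempts):
            try:
                return scrape_one(url)
            except Exception:
                continue  # a real pipeline would back off between attempts
        return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls.
        results = list(pool.map(attempt, urls))

    pages = [page for page in results if page is not None]
    failed = [url for url, page in zip(urls, results) if page is None]
    return pages, failed
```

Keeping the failed URLs separate lets you re-queue them later rather than silently losing pages from the corpus.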

Stage 3 — Deduplication and quality filtering

After crawling, deduplicate pages and filter low-quality content before chunking:

import hashlib
import re

def content_hash(markdown: str) -> str:
    # Normalize whitespace before hashing so pages that differ only in spacing get the same hash
    normalized = re.sub(r'\s+', ' ', markdown.strip())
    return hashlib.sha256(normalized.encode()).hexdigest()

def quality_filter(markdown: str) -> bool:
    # Reject pages that are too short or mostly non-body content
    word_count = len(markdown.split())
    if word_count < 150:
        return False
    # Reject pages where headings dominate (navigation dumps)
    heading_lines = sum(1 for line in markdown.splitlines() if line.startswith('#'))
    total_lines = max(len(markdown.splitlines()), 1)
    if heading_lines / total_lines > 0.4:
        return False
    return True
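
An exact hash only catches pages that match after whitespace normalization. If you also want to catch pages that differ by a banner or footer line, one common option (not something the pipeline above does) is word-shingle Jaccard similarity. A minimal sketch, with hypothetical helper names and an illustrative 0.9 threshold:

```python
import re

def shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
    """Set of overlapping k-word windows, lowercased."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(md_a: str, md_b: str, threshold: float = 0.9) -> bool:
    """True when two pages share at least `threshold` of their shingles."""
    return jaccard(shingles(md_a), shingles(md_b)) >= threshold
```

Comparing every pair of pages is quadratic in corpus size, so for large corpora this check is usually paired with an approximate index such as MinHash with locality-sensitive hashing.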

Stage 4 — Chunking for retrieval

Split markdown at heading boundaries. The heading structure fastCRW preserves in its output is directly usable as chunk seam points:

# Current LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n## ", "\n### ", "\n#### ", "\n\n", "\n", " "],
)

def chunk_page(url: str, markdown: str, h1: str = "") -> list[dict]:
    chunks = splitter.split_text(markdown)
    return [
        {
            "text": chunk,
            "metadata": {
                "source_url": url,
                "page_title": h1,
                "chunk_index": idx,
            }
        }
        for idx, chunk in enumerate(chunks)
    ]
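
The comparison table earlier lists a heading path as required chunk metadata, while the snippet above records only the page title. One way to capture the path is to walk the markdown line by line before handing each section to the splitter. This is a sketch in plain Python; `split_with_heading_path` is a hypothetical helper and the " > " separator is an arbitrary choice.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # markdown code-fence marker

def split_with_heading_path(markdown: str) -> list[dict]:
    """Split markdown into sections, tagging each with its chain of parent headings."""
    sections: list[dict] = []
    path: list[tuple[int, str]] = []  # stack of (level, title) for open headings
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append({
                "heading_path": " > ".join(title for _, title in path),
                "text": text,
            })
        body.clear()

    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines starting with '#' inside code fences are comments, not headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections
```

Each section's text can then go through the splitter as before, with `heading_path` copied into every resulting chunk's metadata.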

Stage 5 — Embedding and upsert

Embed each chunk and upsert to your vector store with source metadata for citation:

from openai import OpenAI
import psycopg

client = OpenAI()

def embed_and_upsert(chunks: list[dict], conn) -> None:
    texts = [c["text"] for c in chunks]
    embeddings = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    ).data

    with conn.cursor() as cur:
        for chunk, emb in zip(chunks, embeddings):
            cur.execute(
                """
                INSERT INTO rag_corpus
                  (source_url, page_title, chunk_index, body, content_hash, embedding)
                VALUES (%s, %s, %s, %s, md5(%s), %s)
                ON CONFLICT (source_url, chunk_index)
                DO UPDATE SET
                  body = EXCLUDED.body,
                  content_hash = EXCLUDED.content_hash,
                  embedding = EXCLUDED.embedding,
                  updated_at = now()
                """,
                (
                    chunk["metadata"]["source_url"],
                    chunk["metadata"]["page_title"],
                    chunk["metadata"]["chunk_index"],
                    chunk["text"],
                    chunk["text"],
                    emb.embedding,
                )
            )
        conn.commit()
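
The INSERT above presupposes a table with a unique constraint on (source_url, chunk_index). The source does not show the schema, so the DDL below is an assumption inferred from the column list; the `id` column, the defaults, and the `ensure_schema` helper are illustrative. The vector width matches the 1,536 dimensions that text-embedding-3-small returns by default.

```python
# Assumed schema, inferred from the INSERT statement; adjust names and types to your setup.
RAG_CORPUS_DDL = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS rag_corpus (
        id           bigserial    PRIMARY KEY,
        source_url   text         NOT NULL,
        page_title   text         NOT NULL DEFAULT '',
        chunk_index  integer      NOT NULL,
        body         text         NOT NULL,
        content_hash text         NOT NULL,
        embedding    vector(1536) NOT NULL,
        updated_at   timestamptz  NOT NULL DEFAULT now(),
        UNIQUE (source_url, chunk_index)  -- target of the ON CONFLICT clause
    )
    """,
]

def ensure_schema(conn) -> None:
    """Create the extension and table if missing; conn is an open psycopg connection."""
    with conn.cursor() as cur:
        for statement in RAG_CORPUS_DDL:
            cur.execute(statement)
    conn.commit()
```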

Full Python pipeline

Here is a complete working pipeline that ties together all five stages:

"""
rag_corpus_builder.py — Build a RAG knowledge base from a web domain.
Uses fastCRW /v1/map + /v1/crawl, deduplicates, chunks, and upserts to pgvector.
Run with: uv run python rag_corpus_builder.py
"""

import os
import time
import hashlib
import re
import requests
import psycopg
from openai import OpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter  # pip install langchain-text-splitters

CRW_API = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRW_API_KEY']}"}
openai_client = OpenAI()

# ── Stage 1: URL discovery ─────────────────────────────────────────────────

def discover_urls(seed_url: str) -> list[str]:
    resp = requests.post(f"{CRW_API}/map", json={"url": seed_url}, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    return resp.json().get("urls", [])

# ── Stage 2: Async crawl ───────────────────────────────────────────────────

def start_crawl(seed_url: str, max_pages: int = 500) -> str:
    payload = {
        "url": seed_url,
        "maxPages": max_pages,
        "maxDepth": 5,
        "formats": ["markdown"],
    }
    resp = requests.post(f"{CRW_API}/crawl", json=payload, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()["id"]

def poll_crawl(job_id: str, poll_interval: int = 5, max_wait: int = 3600) -> list[dict]:
    # Poll until the job finishes; give up after max_wait seconds instead of looping forever.
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        resp = requests.get(f"{CRW_API}/crawl/{job_id}", headers=HEADERS, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        status = data.get("status")
        if status == "completed":
            return data.get("data", [])
        if status in ("failed", "cancelled"):
            raise RuntimeError(f"Crawl {job_id} ended with status: {status}")
        print(f"  Crawl status: {status} ({data.get('completed', 0)}/{data.get('total', '?')} pages)")
        time.sleep(poll_interval)
    raise TimeoutError(f"Crawl {job_id} did not finish within {max_wait}s")

# ── Stage 3: Dedup + quality filter ───────────────────────────────────────

def content_hash(text: str) -> str:
    normalized = re.sub(r'\s+', ' ', text.strip())
    return hashlib.sha256(normalized.encode()).hexdigest()

def is_quality(markdown: str, min_words: int = 150) -> bool:
    words = len(markdown.split())
    if words < min_words:
        return False
    lines = markdown.splitlines()
    headings = sum(1 for l in lines if l.startswith('#'))
    if lines and headings / len(lines) > 0.4:
        return False
    return True

def deduplicate(pages: list[dict]) -> list[dict]:
    seen: set[str] = set()
    out: list[dict] = []
    for page in pages:
        md = page.get("markdown", "")
        if not md or not is_quality(md):
            continue
        h = content_hash(md)
        if h not in seen:
            seen.add(h)
            page["_hash"] = h
            out.append(page)
    return out

# ── Stage 4: Chunking ──────────────────────────────────────────────────────

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],
)

def chunk_page(page: dict) -> list[dict]:
    url = page.get("metadata", {}).get("url", page.get("url", ""))
    title = page.get("metadata", {}).get("title", "")
    md = page.get("markdown", "")
    return [
        {"text": c, "url": url, "title": title, "idx": i}
        for i, c in enumerate(splitter.split_text(md))
    ]

# ── Stage 5: Embed + upsert ────────────────────────────────────────────────

def embed_chunks(chunks: list[dict]) -> None:
    texts = [c["text"] for c in chunks]
    embeddings = openai_client.embeddings.create(
        model="text-embedding-3-small", input=texts
    ).data
    with psycopg.connect(os.environ["DATABASE_URL"]) as conn, conn.cursor() as cur:
        for chunk, emb in zip(chunks, embeddings):
            cur.execute(
                """
                INSERT INTO rag_corpus
                  (source_url, page_title, chunk_index, body, content_hash, embedding)
                VALUES (%s, %s, %s, %s, md5(%s), %s)
                ON CONFLICT (source_url, chunk_index) DO UPDATE
                  SET body = EXCLUDED.body,
                      content_hash = EXCLUDED.content_hash,
                      embedding = EXCLUDED.embedding,
                      updated_at = now()
                """,
                (chunk["url"], chunk["title"], chunk["idx"],
                 chunk["text"], chunk["text"], emb.embedding),
            )
        conn.commit()

# ── Main ───────────────────────────────────────────────────────────────────

if __name__ == "__main__":
    seed = "https://docs.example.com"

    print(f"[1/5] Discovering URLs on {seed}...")
    urls = discover_urls(seed)
    print(f"  Found {len(urls)} URLs")

    print("[2/5] Starting async crawl (up to 500 pages)...")
    job_id = start_crawl(seed, max_pages=500)
    print(f"  Crawl job: {job_id}")
    pages = poll_crawl(job_id)
    print(f"  Crawled {len(pages)} pages")

    print("[3/5] Deduplicating and quality-filtering...")
    clean_pages = deduplicate(pages)
    print(f"  {len(clean_pages)} unique quality pages (from {len(pages)} raw)")

    print("[4/5] Chunking...")
    all_chunks = []
    for page in clean_pages:
        all_chunks.extend(chunk_page(page))
    print(f"  {len(all_chunks)} chunks")

    print("[5/5] Embedding and upserting to pgvector...")
    batch_size = 100
    for i in range(0, len(all_chunks), batch_size):
        batch = all_chunks[i : i + batch_size]
        embed_chunks(batch)
        print(f"  Upserted {min(i + batch_size, len(all_chunks))}/{len(all_chunks)} chunks")

    print("Done. RAG corpus ready.")

JavaScript / TypeScript example

For teams using Node.js or Deno with LangChain:

// rag-corpus-builder.ts
// Drop-in: same endpoints as Firecrawl — swap base URL only.
import { FireCrawlLoader } from "@langchain/community/document_loaders/web/firecrawl";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const loader = new FireCrawlLoader({
  url: "https://docs.example.com",
  apiKey: process.env.CRW_API_KEY!,
  apiUrl: "https://api.fastcrw.com",  // ← only change from Firecrawl
  mode: "crawl",
  params: {
    maxPages: 500,
    formats: ["markdown"],
  },
});

const docs = await loader.load();
console.log(`Loaded ${docs.length} pages`);

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 150,
  separators: ["\n## ", "\n### ", "\n\n", "\n", " "],
});

const chunks = await splitter.splitDocuments(docs);
console.log(`Split into ${chunks.length} chunks`);

// Upsert to your vector store here (Pinecone, pgvector, Qdrant, Weaviate)

cURL one-liner: scrape a single page for RAG

curl -X POST https://api.fastcrw.com/v1/scrape \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com/concepts/overview", "formats": ["markdown"]}' \
  | jq '.data.markdown'

This returns clean body markdown — no navigation, no ads, no cookie banners — ready to pipe into your chunker.


Good fits for this approach

  • Documentation sites crawled into a support chatbot corpus (product docs, API references, knowledge bases)
  • Research paper aggregation — crawl arXiv abstracts, conference proceedings, or domain-specific journals into a retrieval index for a research assistant
  • Internal enterprise knowledge — crawl internal wikis, Confluence spaces, or intranet sites and index them for an employee-facing AI assistant
  • Domain-specific AI agents — build the grounding corpus for an agent that needs to answer questions about a specific industry (legal, medical, financial)
  • Competitor intelligence bases — crawl and index public competitor documentation, blog posts, and release notes for retrieval-augmented analysis

Incremental re-crawl: keeping the corpus fresh

A RAG knowledge base built on web content becomes stale as the source pages change. Implement incremental refresh:

  1. Weekly URL audit — re-run /v1/map on each seed domain. Diff against your stored URL list to detect added and removed pages.
  2. Changed-page detection — re-scrape all URLs and compare content hashes. Scrape-success rate on the Firecrawl benchmark was 87.7% (RESULT_3WAY_1000_FULL.md, 2026-05-08), so plan for ~12% of pages needing retry on any given crawl.
  3. Selective re-embedding — only re-chunk and re-embed pages where the content hash changed. On a 100,000-page corpus with a typical 5–10% weekly change rate, this is 5,000–10,000 pages/week, not 100,000.
  4. Soft delete removed pages — when a URL disappears from /v1/map, mark its chunks as inactive rather than deleting immediately. This preserves the vector rows until you confirm the page is gone, not just temporarily unreachable.
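
The four steps above reduce to a few set operations once you have the stored hashes, the latest /v1/map output, and the hashes from the latest scrape. A minimal sketch; `plan_refresh` and its argument shapes are assumptions, not a fastCRW API.

```python
def plan_refresh(
    stored: dict[str, str],        # url -> content hash from the previous crawl
    mapped_urls: set[str],         # URLs returned by the latest /v1/map run
    fresh_hashes: dict[str, str],  # url -> content hash from the latest scrape
) -> dict[str, set[str]]:
    """Decide which URLs to re-embed, retry, or soft-delete."""
    known = set(stored)
    added = mapped_urls - known
    removed = known - mapped_urls  # mark chunks inactive, do not delete yet
    changed = {
        url for url in mapped_urls & known
        if url in fresh_hashes and fresh_hashes[url] != stored[url]
    }
    retry = {url for url in mapped_urls if url not in fresh_hashes}  # scrape failed
    re_embed = (added & set(fresh_hashes)) | changed
    return {
        "added": added,
        "soft_delete": removed,
        "changed": changed,
        "retry": retry,
        "re_embed": re_embed,
    }
```

Only the `re_embed` set goes back through chunking and embedding; unchanged pages keep their existing rows.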

Pricing for corpus collection at scale

Credit costs (source: PLAN_DISPLAY, src/lib/plans-client.ts):

  • /v1/map — 1 credit per call (covers an entire domain)
  • /v1/scrape with http/lightpanda renderer — 1 credit per page
  • /v1/scrape with chrome renderer — 2 credits per page
  • /v1/crawl — 1 credit per page crawled (2 per page with chrome)

Example: 10,000-page corpus, all http/lightpanda renderer

  • URL discovery: 10 domains × 1 credit = 10 credits (negligible)
  • Crawl 10,000 pages × 1 credit = 10,000 credits
  • Weekly refresh (8% change rate, 800 pages) × 1 credit = 800 credits/week ≈ 3,200 credits/month
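  • Total: ~13,200 credits in the first month, since the 10,000-credit initial crawl is one-time (about 3,200 credits/month for refresh afterward) → Standard plan ($69/mo launch price, $99/mo regular, 100,000 credits; source: PLAN_DISPLAY)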

Example: 100,000-page corpus, mixed rendering

  • Crawl 100,000 pages (90% lightpanda @ 1 cr, 10% chrome @ 2 cr) = 110,000 credits
  • Weekly refresh (8% change rate, 8,000 pages) = 8,000 credits/week ≈ 32,000 credits/month
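  • Total: ~142,000 credits in the first month, since the 110,000-credit initial crawl is one-time (about 32,000 credits/month for refresh afterward) → Growth plan ($279/mo launch price, $399/mo regular, 500,000 credits; source: PLAN_DISPLAY)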
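
Both examples follow the same arithmetic, which can be sketched as a small estimator. The per-page credit costs come from the list above; the function name and parameters are illustrative, a month is taken as four weeks, and refreshed pages are billed at 1 credit each because that is what both examples do.

```python
CREDITS_HTTP = 1    # http/lightpanda renderer, per page
CREDITS_CHROME = 2  # chrome renderer, per page
CREDITS_MAP = 1     # per /v1/map call

def first_month_credits(
    pages: int,
    domains: int = 0,
    chrome_pct: int = 0,
    weekly_change_pct: int = 8,
    weeks: int = 4,
) -> dict[str, int]:
    """Estimate credits for the initial crawl plus one month of weekly refreshes."""
    chrome_pages = pages * chrome_pct // 100
    crawl = (pages - chrome_pages) * CREDITS_HTTP + chrome_pages * CREDITS_CHROME
    # Assumption carried over from the examples: refreshed pages cost 1 credit each.
    refresh = pages * weekly_change_pct // 100 * CREDITS_HTTP * weeks
    discovery = domains * CREDITS_MAP
    return {
        "discovery": discovery,
        "crawl": crawl,
        "refresh": refresh,
        "total": discovery + crawl + refresh,
    }

print(first_month_credits(10_000, domains=10))
print(first_month_credits(100_000, chrome_pct=10))
```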

Self-hosting: AGPL-3.0, single binary. Crawl at $0/page on your own server. Only cost is compute. See self-hosting.

Launch pricing ends 2026-06-01; prices revert to regular after that date.


Comparison: RAG corpus collection vs inference-time retrieval

| | Corpus collection (this page) | Inference-time RAG (→ RAG pipelines) |
| --- | --- | --- |
| When it runs | Scheduled batch job (daily / weekly) | Every user query (real-time) |
| Primary cost driver | Scrape credits (per page crawled) | Embedding API calls + vector query latency |
| Bottleneck | Content fidelity and dedup quality | Chunk quality and retrieval precision |
| fastCRW role | /v1/crawl + /v1/map for bulk collection | /v1/scrape for on-demand page fetch |
| Freshness pattern | Incremental re-crawl on schedule | Always live (or near-real-time) |
| Scale | Millions of pages, once | One page per turn, per user |

FAQ

Q: Why does the benchmark show fastCRW with the worst p90 latency if it has the best truth-recall?

A: These metrics are causally linked. fastCRW's chrome-stealth fallback is what recovers pages that simpler renderers fail on — and those are the slow pages. The p90 of 14,157 ms (RESULT_3WAY_1000_FULL.md, 2026-05-08) reflects that tail of complex, JS-heavy pages. For bulk corpus collection (batch jobs, not real-time), that latency is acceptable. For real-time scraping on a user's request, plan your timeout budget accordingly or filter to known simple-render sites.

Q: Can I use fastCRW with LangChain, LlamaIndex, or other RAG frameworks?

A: Yes. fastCRW is Firecrawl-compatible: it exposes the same endpoint names, request shape, and response fields. Any LangChain FireCrawlLoader or LlamaIndex FireCrawlWebReader that targets Firecrawl works against fastCRW after a base-URL swap (https://api.fastcrw.com). See the TypeScript example above.

Q: How do I extract structured metadata (title, author, publish date) from corpus pages?

A: Pass formats: ["json"] and a jsonSchema to /v1/scrape. fastCRW's LLM extraction (5 credits per call) fills your schema fields automatically from the page HTML. For corpus collection at scale, extract metadata on the pages where it matters (news articles, research papers) and skip extraction on reference docs where structure is less important.
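
A request body for this kind of extraction might look like the sketch below. The `formats` and `jsonSchema` field names come from the answer above; placing `jsonSchema` at the top level of the body, and the three schema fields themselves, are assumptions to check against the API reference.

```python
import json

# Hypothetical schema for article-style pages; rename or extend the fields as needed.
ARTICLE_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "publish_date": {"type": "string", "description": "ISO 8601 date if present"},
    },
    "required": ["title"],
}

def extraction_payload(url: str, schema: dict = ARTICLE_SCHEMA) -> dict:
    """Build a /v1/scrape body that requests LLM extraction into `schema`."""
    return {"url": url, "formats": ["json"], "jsonSchema": schema}

print(json.dumps(extraction_payload("https://news.example.com/story"), indent=2))
```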

Q: Is fastCRW suitable for scraping behind authentication?

A: fastCRW does not manage authenticated sessions. For pages behind login, pre-authenticate in a real browser, export cookies, and pass them as request headers via the headers field on /v1/scrape. This works for session-cookie-based auth; OAuth flows requiring redirects need to be completed outside fastCRW.
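
Exported cookies can be folded into a single Cookie header and sent through the `headers` field mentioned above. A minimal sketch; the helper names and cookie values are made up, and it assumes `headers` takes a plain name-to-value mapping.

```python
def cookie_header(cookies: dict[str, str]) -> str:
    """Join exported cookies into the value of a Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())

def authenticated_scrape_payload(url: str, cookies: dict[str, str]) -> dict:
    """Build a /v1/scrape body that forwards session cookies to the target site."""
    return {
        "url": url,
        "formats": ["markdown"],
        "headers": {"Cookie": cookie_header(cookies)},
    }

payload = authenticated_scrape_payload(
    "https://wiki.example.com/private/page",
    {"session": "abc123", "theme": "dark"},  # placeholder values
)
print(payload["headers"]["Cookie"])
```

Session cookies expire, so a long-running crawl needs a way to refresh them and re-queue pages that came back as a login screen.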

Q: What is the maximum corpus size supported by /v1/crawl?

A: /v1/crawl accepts maxPages up to 1,000 (and maxDepth up to 10) per job (crw-opencore/README.md). For domains larger than 1,000 pages, break the crawl into multiple jobs by path prefix (e.g., /docs/api/, /docs/guides/ as separate seeds), or use /v1/map to discover all URLs and iterate /v1/scrape concurrently across the full list.
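
To pick path prefixes that each fit in one job, you can bucket the /v1/map output by leading path segments and look for buckets above the limit. A sketch under the assumption that the URL list is already in memory; `group_by_prefix` and `oversized` are hypothetical helpers.

```python
from collections import defaultdict
from urllib.parse import urlsplit

MAX_PAGES_PER_JOB = 1000  # per-job maxPages ceiling stated above

def group_by_prefix(urls: list[str], depth: int = 2) -> dict[str, list[str]]:
    """Bucket URLs by their first `depth` path segments, e.g. /docs/api."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        prefix = "/" + "/".join(segments[:depth])
        groups[prefix].append(url)
    return dict(groups)

def oversized(groups: dict[str, list[str]], limit: int = MAX_PAGES_PER_JOB) -> list[str]:
    """Prefixes with more pages than one crawl job accepts; split these deeper."""
    return sorted(prefix for prefix, urls in groups.items() if len(urls) > limit)
```

Any prefix reported by `oversized` can be regrouped with a larger `depth`, or handed to the concurrent /v1/scrape path instead.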

Q: How do I validate that fastCRW captured a page correctly before embedding it?

A: Sample 50–100 pages from your crawl and manually compare the markdown output to the live page. Look for: (1) body text present, (2) code blocks preserved, (3) headings intact, (4) no navigation/footer artifacts dominating the output. The truth-recall benchmark (63.74% of 819 labeled URLs, diagnose_3way.py, 2026-05-08) gives you a baseline expectation: if your corpus resembles the benchmark's URL mix, expect faithful extraction on roughly 60–65% of pages; the remainder may need manual review or a different renderer.
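
Drawing the sample reproducibly and running a few automated checks first can narrow down which pages need a manual look. The checks below are rough heuristics loosely mirroring the four points above, not a substitute for comparing against the live page; the helper names and thresholds are illustrative.

```python
import random

FENCE = "`" * 3  # markdown code-fence marker

def sample_pages(pages: list[dict], k: int = 50, seed: int = 0) -> list[dict]:
    """Reproducible random sample of up to k pages."""
    rng = random.Random(seed)
    return rng.sample(pages, min(k, len(pages)))

def quick_checks(markdown: str) -> dict[str, bool]:
    """Cheap signals that a page deserves a closer manual look when any is False."""
    lines = [line for line in markdown.splitlines() if line.strip()]
    link_lines = sum(
        1 for line in lines if line.lstrip().startswith(("[", "- [", "* ["))
    )
    return {
        "has_body_text": len(markdown.split()) >= 150,
        "has_headings": any(line.startswith("#") for line in lines),
        "balanced_code_fences": markdown.count(FENCE) % 2 == 0,
        "not_link_dominated": bool(lines) and link_lines / len(lines) <= 0.5,
    }
```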

