Web Scraping for RAG and AI Agent Training Data
Collect, clean, and normalize web corpora for RAG knowledge bases and AI agent training datasets with fastCRW — high-fidelity markdown, 63.74% truth-recall, Firecrawl-compatible API, single Rust binary.
Who this is for
ML engineers and AI teams who need to collect large web corpora for:
- RAG knowledge bases — crawling documentation sites, wikis, or domain-specific article collections to build the retrieval index behind an LLM chatbot
- AI agent training datasets — gathering diverse, high-quality web text for instruction tuning, preference learning, or evaluation harness construction
- Domain-specific pretraining or continued pretraining — assembling topic-focused corpora from authoritative web sources
The hard problem at this phase is not embedding or retrieval — it is getting clean, faithful body text out of the web at scale without navigation noise, truncated content, or JavaScript-rendered pages that return empty. That is what fastCRW is built for.
This page is about data collection upstream of your index. For inference-time retrieval (chunking, embedding, querying), see RAG pipelines.
Why scraper accuracy is the bottleneck for RAG corpus quality
When you build a RAG knowledge base, every stage of the pipeline inherits from the one before it. A scraper that returns truncated body text, navigation chrome, or cookie-consent boilerplate produces chunks full of that noise. Those chunks get embedded, upserted, and retrieved — and the LLM surfaces them as answers.
The industry has no standard metric for this, so fastCRW commissioned a benchmark against Firecrawl's own public scrape-content-dataset-v1 (1,000 URLs, 819 with labeled ground truth). The harness (diagnose_3way.py, run 2026-05-08) compares each scraper's markdown output to the labeled body text and measures what fraction of labeled URLs produced a faithful extraction.
| Metric | fastCRW | Crawl4AI | Firecrawl |
|---|---|---|---|
| Truth-recall (of 819 labeled URLs) | 63.74% (522) | 59.95% (491) | 56.04% (459) |
| Scrape-success (of 1,000 URLs) | 87.7% (877) | 83.5% (835) | 89.7% (897) |
| p50 latency | 1,914 ms | 1,916 ms | 2,305 ms |
| p90 latency | 14,157 ms | 4,754 ms | 6,937 ms |
| p99 latency | 15,012 ms | 13,749 ms | 21,107 ms |
| Thrown errors | 0 | 0 | 0 |
Source: bench/server-runs/RESULT_3WAY_1000_FULL.md, diagnose_3way.py, 2026-05-08.
What this means for corpus collection:
- fastCRW's 63.74% truth-recall is +3.79 percentage points over Crawl4AI and +7.70 pp over Firecrawl on the same 819 labeled URLs. At corpus scale (100,000 pages), that difference is thousands of pages where your RAG index has faithful content versus navigation noise or empty bodies.
- fastCRW's p90 latency (14,157 ms) is the highest of the three. This is a deliberate trade-off: the chrome-stealth fallback that recovers the URLs others miss is the same mechanism that produces a slow tail on complex pages. For bulk corpus collection (not real-time scraping), this trade-off favors accuracy.
- Firecrawl has higher scrape-success (89.7% vs 87.7%) but lower truth-recall. It fetches the page more often but extracts the meaningful body text less faithfully.
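The percentages and the "thousands of pages" projection above can be re-derived from the raw counts in the table. The sketch below is only arithmetic on those counts; the function names are illustrative, and the projection assumes the benchmark rates carry over unchanged to your own corpus, which is a strong assumption.

```python
# Counts copied from the benchmark table: faithful extractions out of 819 labeled URLs.
LABELED_URLS = 819
FAITHFUL = {"fastCRW": 522, "Crawl4AI": 491, "Firecrawl": 459}


def truth_recall(faithful: int, labeled: int = LABELED_URLS) -> float:
    """Truth-recall as a percentage of labeled URLs."""
    return 100.0 * faithful / labeled


def extra_faithful_pages(a: str, b: str, corpus_pages: int) -> int:
    """Projected extra faithful pages for scraper `a` over scraper `b`,
    assuming the benchmark rates hold on a corpus of `corpus_pages` pages."""
    delta = (FAITHFUL[a] - FAITHFUL[b]) / LABELED_URLS
    return round(delta * corpus_pages)


for name, count in FAITHFUL.items():
    print(f"{name}: {truth_recall(count):.2f}%")

print(extra_faithful_pages("fastCRW", "Crawl4AI", 100_000))
print(extra_faithful_pages("fastCRW", "Firecrawl", 100_000))
```

Because the projection uses unrounded rates, it lands slightly below what you get by multiplying the rounded percentage-point gaps by 100,000.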
Publish the full p50/p90/p99 split in your own benchmarks. A single average hides the tail behavior that matters for scheduler planning.
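To see why a single average hides the tail, here is a minimal sketch that computes p50/p90/p99 with the nearest-rank method. The sample latencies are made up for illustration and are not benchmark data; the benchmark harness may use a different percentile definition.

```python
import math


def percentile(samples_ms: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """p50/p90/p99 in one dictionary."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 90, 99)}


# Hypothetical latencies: mostly fast pages plus a slow rendered tail.
samples = [1_900.0] * 85 + [14_000.0] * 15
summary = latency_summary(samples)
mean = sum(samples) / len(samples)
print(summary, round(mean))
```

In this made-up sample the mean sits far from both the median and the tail, so a scheduler sized on the mean would underestimate how long the slowest pages hold a worker.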
Differentiating RAG corpus collection from fine-tuning datasets
These two use cases share a crawl step but diverge immediately after:
| Concern | RAG corpus collection | Fine-tuning dataset |
|---|---|---|
| Output format | Chunked markdown + vector embeddings | JSONL (prompt/completion, instruction/input/output) |
| Scale | Millions of pages, continuous refresh | Thousands of curated examples, one-time |
| Quality filter | Dedup + length filter; some noise acceptable | Strict curation; noise degrades model weights |
| Freshness | Must stay current (re-crawl on schedule) | Static snapshot is fine after training run |
| Chunk metadata | Source URL + heading path required | Source attribution optional |
| Primary fastCRW endpoint | /v1/crawl for bulk, /v1/scrape for targeted | /v1/crawl + /v1/scrape |
For fine-tuning and JSONL pipeline details, see LLM training data. For general ML dataset curation, see dataset curation.
Choosing a web scraping API for RAG corpus collection: fastCRW vs Firecrawl vs Apify
When comparing APIs for this specific use case, the relevant axes are: content fidelity (truth-recall), self-host availability, pricing at corpus scale, and Firecrawl compatibility for existing loaders.
| | fastCRW | Firecrawl | Apify |
|---|---|---|---|
| Truth-recall | 63.74% (819 labeled URLs, diagnose_3way.py, 2026-05-08) | 56.04% (same benchmark) | Not benchmarked on this dataset |
| API style | Firecrawl-compatible REST | Native | Proprietary (Actors) |
| Self-host | Yes — AGPL-3.0 single binary, $0/page | No | No |
| Cloud pricing / 1,000 pages | Hobby: ~$4.33 at 3,000 credits/$13 · Scale: ~$0.55 at 1M credits/$549 (source: PLAN_DISPLAY, src/lib/plans-client.ts) | $0.83–$5.33 per 1,000 across tiers (source: marketing/competitor-prices.lock.md, verified 2026-05-18) | Varies by Actor; compute-time billing |
| LLM extraction | Yes — formats: ["json"] + jsonSchema; 5 credits per call | Yes | Actor-dependent |
| MCP integration | Yes — crw-mcp npm package | Partial | No native MCP |
| Markdown output | Clean server-side stripping of nav/ads | Yes | Actor-dependent |
| Migration to fastCRW | n/a | Swap base URL (Firecrawl → fastCRW) | Full rewrite required |
| p50 latency | 1,914 ms | 2,305 ms | Not benchmarked |
Qualitative notes:
- Firecrawl is the market leader and has a mature managed cloud. If you are already on Firecrawl, fastCRW is a drop-in alternative (base-URL swap) with higher truth-recall on the same benchmark dataset. See Firecrawl vs fastCRW.
- Apify is the broadest actor marketplace — useful when you need site-specific scrapers (e.g., a dedicated Amazon actor). For general web corpus collection with clean markdown output, fastCRW's uniform API surface is simpler to operate at scale. See Apify alternatives.
- Self-host advantage: At corpus scale (millions of pages), managed-API per-page costs dominate the budget. fastCRW's AGPL-3.0 binary lets you run the scraper on your own servers — $0 per page, only compute cost.
Architecture: web corpus collection pipeline for RAG
A production RAG corpus collection pipeline has five distinct stages:
Stage 1 — URL discovery
Use /v1/map to enumerate all reachable URLs from a seed domain. Most documentation sites and knowledge bases have predictable URL patterns; /v1/map also follows sitemaps.
curl -X POST https://api.fastcrw.com/v1/map \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com"}'
/v1/map costs 1 credit per call and returns the full URL list — use it as a cheap discovery step before any scraping credits are spent.
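Before spending scrape credits on the discovered list, it can help to normalize and scope it. The sketch below assumes the list may contain fragments, trailing-slash variants, and off-domain links; that is an assumption about typical sitemap and link data, not documented /v1/map behavior. All function names are illustrative.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, strip a trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def in_scope(url: str, seed: str) -> bool:
    """Keep URLs on the seed's host and at or under the seed's path."""
    u, s = urlsplit(url), urlsplit(seed)
    seed_path = s.path.rstrip("/")
    same_host = u.netloc.lower() == s.netloc.lower()
    return same_host and (u.path == seed_path or u.path.startswith(seed_path + "/"))


def clean_url_list(urls: list[str], seed: str) -> list[str]:
    """Normalize, drop out-of-scope URLs, and de-duplicate while keeping order."""
    seen: set[str] = set()
    out: list[str] = []
    for raw in urls:
        url = normalize_url(raw)
        if in_scope(url, seed) and url not in seen:
            seen.add(url)
            out.append(url)
    return out


discovered = [
    "https://docs.example.com/guide/",
    "https://docs.example.com/guide#install",
    "https://DOCS.example.com/api",
    "https://blog.example.com/post",
]
cleaned = clean_url_list(discovered, "https://docs.example.com")
print(cleaned)
```

Stripping the trailing slash treats `/guide` and `/guide/` as the same page. Most documentation sites serve them identically, but drop that step if your target does not.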
Stage 2 — Bulk crawl with markdown normalization
For domains under 1,000 pages, use /v1/crawl to fetch the entire site asynchronously. For larger domains, iterate /v1/scrape concurrently across the URL list from Stage 1.
# Start async crawl
curl -X POST https://api.fastcrw.com/v1/crawl \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "maxPages": 1000,
    "maxDepth": 5,
    "formats": ["markdown"]
  }'
/v1/crawl returns a job ID. Poll /v1/crawl/:id for status and results.
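For the larger-domain path mentioned at the start of this stage (iterating /v1/scrape over the Stage 1 URL list), a bounded worker pool keeps concurrency under control and records failures for retry. This is a sketch, not fastCRW client code: `scrape_one` is a hypothetical callable you would implement around your own /v1/scrape request, and `fake_scrape` only exists so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def scrape_many(
    urls: list[str],
    scrape_one: Callable[[str], str],
    max_workers: int = 8,
) -> tuple[dict[str, str], dict[str, str]]:
    """Run scrape_one over urls with bounded concurrency.

    Returns (markdown_by_url, error_by_url) so failures can be retried later.
    """
    results: dict[str, str] = {}
    errors: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[url] = repr(exc)
    return results, errors


def fake_scrape(url: str) -> str:
    """Offline stand-in for a real /v1/scrape call."""
    if url.endswith("/broken"):
        raise RuntimeError("simulated scrape failure")
    return f"# Page at {url}"


ok, failed = scrape_many(
    ["https://docs.example.com/a", "https://docs.example.com/broken"],
    fake_scrape,
)
print(len(ok), "scraped,", len(failed), "failed")
```

Pick `max_workers` to stay within whatever rate limit applies to your plan or self-hosted instance; the default of 8 here is arbitrary.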
Stage 3 — Deduplication and quality filtering
After crawling, deduplicate pages and filter low-quality content before chunking:
import hashlib
import re


def content_hash(markdown: str) -> str:
    # Collapse whitespace before hashing so pages that differ only in
    # spacing or line breaks produce the same hash
    normalized = re.sub(r'\s+', ' ', markdown.strip())
    return hashlib.sha256(normalized.encode()).hexdigest()


def quality_filter(markdown: str) -> bool:
    # Reject pages that are too short or mostly non-body content
    word_count = len(markdown.split())
    if word_count < 150:
        return False
    # Reject pages where headings dominate (navigation dumps)
    heading_lines = sum(1 for line in markdown.splitlines() if line.startswith('#'))
    total_lines = max(len(markdown.splitlines()), 1)
    if heading_lines / total_lines > 0.4:
        return False
    return True
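A whitespace-normalized hash only matches pages that are otherwise identical. If your corpus has pages that differ by a timestamp or a footer line, one common complement is word-shingle Jaccard similarity. The sketch below is a swapped-in technique, not part of fastCRW; the shingle size and the 0.9 threshold are assumptions to tune on your own data.

```python
import re


def shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
    """Set of k-word shingles built from lowercased word tokens."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i : i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Intersection over union; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(md_a: str, md_b: str, threshold: float = 0.9) -> bool:
    """True when the two pages share at least `threshold` of their shingles."""
    return jaccard(shingles(md_a), shingles(md_b)) >= threshold


body = " ".join(f"word{i}" for i in range(200))
page_a = body + " Last updated 2026-01-01"
page_b = body + " Last updated 2026-02-01"
print(is_near_duplicate(page_a, page_b))
```

Pairwise comparison is quadratic in the number of pages, so at corpus scale you would typically compare only within a domain or move to a sketching scheme such as MinHash.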
Stage 4 — Chunking for retrieval
Split markdown at heading boundaries. The heading structure fastCRW preserves in its output is directly usable as chunk seam points:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n## ", "\n### ", "\n#### ", "\n\n", "\n", " "],
)


def chunk_page(url: str, markdown: str, h1: str = "") -> list[dict]:
    chunks = splitter.split_text(markdown)
    return [
        {
            "text": chunk,
            "metadata": {
                "source_url": url,
                "page_title": h1,
                "chunk_index": idx,
            }
        }
        for idx, chunk in enumerate(chunks)
    ]
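The comparison table earlier lists a heading path as required chunk metadata for RAG, while the splitter above records only the page title. One way to capture the path is to split at headings yourself before size-based chunking. This is a dependency-free sketch with illustrative names; it assumes ATX-style (`#`) headings in the markdown and skips lines inside code fences.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def split_with_heading_path(markdown: str) -> list[dict]:
    """Split markdown at headings and record each section's heading path."""
    sections: list[dict] = []
    path: list[tuple[int, str]] = []  # stack of (level, title)
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append({"heading_path": [t for _, t in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' inside a code fence is not a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section under its own path
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections


doc = "# Guide\nIntro text.\n## Install\nRun the installer.\n### Linux\nUse apt.\n## Usage\nCall the API."
sections = split_with_heading_path(doc)
for s in sections:
    print(" > ".join(s["heading_path"]), "|", s["text"])
```

Each section can then go through a size-based splitter, with `heading_path` copied into every resulting chunk's metadata.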
Stage 5 — Embedding and upsert
Embed each chunk and upsert to your vector store with source metadata for citation:
from openai import OpenAI
import psycopg

client = OpenAI()


def embed_and_upsert(chunks: list[dict], conn) -> None:
    texts = [c["text"] for c in chunks]
    embeddings = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    ).data
    with conn.cursor() as cur:
        for chunk, emb in zip(chunks, embeddings):
            cur.execute(
                """
                INSERT INTO rag_corpus
                    (source_url, page_title, chunk_index, body, content_hash, embedding)
                VALUES (%s, %s, %s, %s, md5(%s), %s)
                ON CONFLICT (source_url, chunk_index)
                DO UPDATE SET
                    body = EXCLUDED.body,
                    content_hash = EXCLUDED.content_hash,
                    embedding = EXCLUDED.embedding,
                    updated_at = now()
                """,
                (
                    chunk["metadata"]["source_url"],
                    chunk["metadata"]["page_title"],
                    chunk["metadata"]["chunk_index"],
                    chunk["text"],
                    chunk["text"],
                    emb.embedding,
                )
            )
    conn.commit()
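The upsert above relies on `ON CONFLICT (source_url, chunk_index)`, which PostgreSQL only accepts when a unique constraint or index covers those two columns. The page does not show the table definition, so the schema below is hypothetical: the column types are guesses that match the INSERT, and the 1536 dimension is the default output size of text-embedding-3-small.

```python
# Hypothetical schema for the rag_corpus table used by the upsert; adjust to taste.
RAG_CORPUS_STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS rag_corpus (
        id            bigserial    PRIMARY KEY,
        source_url    text         NOT NULL,
        page_title    text         NOT NULL DEFAULT '',
        chunk_index   integer      NOT NULL,
        body          text         NOT NULL,
        content_hash  text         NOT NULL,
        embedding     vector(1536) NOT NULL,
        updated_at    timestamptz  NOT NULL DEFAULT now(),
        UNIQUE (source_url, chunk_index)
    )
    """,
]


def ensure_schema(conn) -> None:
    """Create the pgvector extension and the table if they are missing.

    `conn` is expected to be an open psycopg connection.
    """
    with conn.cursor() as cur:
        for statement in RAG_CORPUS_STATEMENTS:
            cur.execute(statement)
    conn.commit()
```

Call `ensure_schema(conn)` once before the first upsert. If you pass a smaller `dimensions` value to the embeddings API, change `vector(1536)` to match.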
Full Python pipeline
Here is a complete working pipeline that ties together all five stages:
"""
rag_corpus_builder.py: Build a RAG knowledge base from a web domain.
Uses fastCRW /v1/map + /v1/crawl, deduplicates, chunks, and upserts to pgvector.
Run with: uv run python rag_corpus_builder.py
"""
import os
import time
import hashlib
import re

import requests
import psycopg
from openai import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter

CRW_API = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRW_API_KEY']}"}
openai_client = OpenAI()


# ── Stage 1: URL discovery ─────────────────────────────────────────────────
def discover_urls(seed_url: str) -> list[str]:
    resp = requests.post(f"{CRW_API}/map", json={"url": seed_url}, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    return resp.json().get("urls", [])


# ── Stage 2: Async crawl ───────────────────────────────────────────────────
def start_crawl(seed_url: str, max_pages: int = 500) -> str:
    payload = {
        "url": seed_url,
        "maxPages": max_pages,
        "maxDepth": 5,
        "formats": ["markdown"],
    }
    resp = requests.post(f"{CRW_API}/crawl", json=payload, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()["id"]


def poll_crawl(job_id: str, poll_interval: int = 5) -> list[dict]:
    while True:
        resp = requests.get(f"{CRW_API}/crawl/{job_id}", headers=HEADERS, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        status = data.get("status")
        if status == "completed":
            return data.get("data", [])
        elif status in ("failed", "cancelled"):
            raise RuntimeError(f"Crawl {job_id} ended with status: {status}")
        print(f" Crawl status: {status} ({data.get('completed', 0)}/{data.get('total', '?')} pages)")
        time.sleep(poll_interval)


# ── Stage 3: Dedup + quality filter ───────────────────────────────────────
def content_hash(text: str) -> str:
    normalized = re.sub(r'\s+', ' ', text.strip())
    return hashlib.sha256(normalized.encode()).hexdigest()


def is_quality(markdown: str, min_words: int = 150) -> bool:
    words = len(markdown.split())
    if words < min_words:
        return False
    lines = markdown.splitlines()
    headings = sum(1 for l in lines if l.startswith('#'))
    if lines and headings / len(lines) > 0.4:
        return False
    return True


def deduplicate(pages: list[dict]) -> list[dict]:
    seen: set[str] = set()
    out: list[dict] = []
    for page in pages:
        md = page.get("markdown", "")
        if not md or not is_quality(md):
            continue
        h = content_hash(md)
        if h not in seen:
            seen.add(h)
            page["_hash"] = h
            out.append(page)
    return out


# ── Stage 4: Chunking ──────────────────────────────────────────────────────
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],
)


def chunk_page(page: dict) -> list[dict]:
    url = page.get("metadata", {}).get("url", page.get("url", ""))
    title = page.get("metadata", {}).get("title", "")
    md = page.get("markdown", "")
    return [
        {"text": c, "url": url, "title": title, "idx": i}
        for i, c in enumerate(splitter.split_text(md))
    ]


# ── Stage 5: Embed + upsert ────────────────────────────────────────────────
def embed_chunks(chunks: list[dict]) -> None:
    texts = [c["text"] for c in chunks]
    embeddings = openai_client.embeddings.create(
        model="text-embedding-3-small", input=texts
    ).data
    with psycopg.connect(os.environ["DATABASE_URL"]) as conn, conn.cursor() as cur:
        for chunk, emb in zip(chunks, embeddings):
            cur.execute(
                """
                INSERT INTO rag_corpus
                    (source_url, page_title, chunk_index, body, content_hash, embedding)
                VALUES (%s, %s, %s, %s, md5(%s), %s)
                ON CONFLICT (source_url, chunk_index) DO UPDATE
                SET body = EXCLUDED.body,
                    content_hash = EXCLUDED.content_hash,
                    embedding = EXCLUDED.embedding,
                    updated_at = now()
                """,
                (chunk["url"], chunk["title"], chunk["idx"],
                 chunk["text"], chunk["text"], emb.embedding),
            )
        conn.commit()


# ── Main ───────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    seed = "https://docs.example.com"

    print(f"[1/5] Discovering URLs on {seed}...")
    urls = discover_urls(seed)
    print(f" Found {len(urls)} URLs")

    print("[2/5] Starting async crawl (up to 500 pages)...")
    job_id = start_crawl(seed, max_pages=500)
    print(f" Crawl job: {job_id}")
    pages = poll_crawl(job_id)
    print(f" Crawled {len(pages)} pages")

    print("[3/5] Deduplicating and quality-filtering...")
    clean_pages = deduplicate(pages)
    print(f" {len(clean_pages)} unique quality pages (from {len(pages)} raw)")

    print("[4/5] Chunking...")
    all_chunks = []
    for page in clean_pages:
        all_chunks.extend(chunk_page(page))
    print(f" {len(all_chunks)} chunks")

    print("[5/5] Embedding and upserting to pgvector...")
    batch_size = 100
    for i in range(0, len(all_chunks), batch_size):
        batch = all_chunks[i : i + batch_size]
        embed_chunks(batch)
        print(f" Upserted {min(i + batch_size, len(all_chunks))}/{len(all_chunks)} chunks")

    print("Done. RAG corpus ready.")
JavaScript / TypeScript example
For teams using Node.js or Deno with LangChain:
// rag-corpus-builder.ts
// Drop-in: same endpoints as Firecrawl — swap base URL only.
import { FirecrawlLoader } from "@langchain/community/document_loaders/web/firecrawl";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const loader = new FirecrawlLoader({
  url: "https://docs.example.com",
  apiKey: process.env.CRW_API_KEY!,
  apiUrl: "https://api.fastcrw.com", // ← only change from Firecrawl
  mode: "crawl",
  params: {
    maxPages: 500,
    formats: ["markdown"],
  },
});

const docs = await loader.load();
console.log(`Loaded ${docs.length} pages`);

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 150,
  separators: ["\n## ", "\n### ", "\n\n", "\n", " "],
});

const chunks = await splitter.splitDocuments(docs);
console.log(`Split into ${chunks.length} chunks`);

// Upsert to your vector store here (Pinecone, pgvector, Qdrant, Weaviate)
cURL one-liner: scrape a single page for RAG
curl -X POST https://api.fastcrw.com/v1/scrape \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com/concepts/overview", "formats": ["markdown"]}' \
  | jq '.data.markdown'
This returns clean body markdown — no navigation, no ads, no cookie banners — ready to pipe into your chunker.
Good fits for this approach
- Documentation sites crawled into a support chatbot corpus (product docs, API references, knowledge bases)
- Research paper aggregation — crawl arXiv abstracts, conference proceedings, or domain-specific journals into a retrieval index for a research assistant
- Internal enterprise knowledge — crawl internal wikis, Confluence spaces, or intranet sites and index them for an employee-facing AI assistant
- Domain-specific AI agents — build the grounding corpus for an agent that needs to answer questions about a specific industry (legal, medical, financial)
- Competitor intelligence bases — crawl and index public competitor documentation, blog posts, and release notes for retrieval-augmented analysis
Incremental re-crawl: keeping the corpus fresh
A RAG knowledge base built on web content becomes stale as the source pages change. Implement incremental refresh:
- Weekly URL audit: re-run /v1/map on each seed domain. Diff against your stored URL list to detect added and removed pages.
- Changed-page detection: re-scrape all URLs and compare content hashes. Scrape-success rate on the Firecrawl benchmark was 87.7% (RESULT_3WAY_1000_FULL.md, 2026-05-08), so plan for ~12% of pages needing retry on any given crawl.
- Selective re-embedding: only re-chunk and re-embed pages where the content hash changed. On a 100,000-page corpus with a typical 5–10% weekly change rate, this is 5,000–10,000 pages/week, not 100,000.
- Soft delete removed pages: when a URL disappears from /v1/map, mark its chunks as inactive rather than deleting immediately. This preserves the vector rows until you confirm the page is gone, not just temporarily unreachable.
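The four refresh steps above reduce to a set diff plus a hash comparison. The sketch below assumes you keep a URL-to-hash mapping from the previous run; `RefreshPlan` and `plan_refresh` are illustrative names, and URLs whose scrape failed this run are left alone so their existing chunks survive until a retry succeeds.

```python
from dataclasses import dataclass, field


@dataclass
class RefreshPlan:
    added: list[str] = field(default_factory=list)      # new URLs to scrape and embed
    removed: list[str] = field(default_factory=list)    # soft-delete candidates
    changed: list[str] = field(default_factory=list)    # re-chunk and re-embed
    unchanged: list[str] = field(default_factory=list)  # nothing to do


def plan_refresh(
    stored: dict[str, str],
    fresh: dict[str, str],
    mapped_urls: set[str],
) -> RefreshPlan:
    """Classify URLs for an incremental refresh.

    stored: URL -> content hash from the previous run.
    fresh: URL -> content hash from this run's successful scrapes.
    mapped_urls: URLs returned by the latest discovery pass.
    """
    plan = RefreshPlan()
    for url in sorted(mapped_urls | set(stored)):
        if url not in mapped_urls:
            plan.removed.append(url)
        elif url not in stored:
            plan.added.append(url)
        elif url not in fresh:
            continue  # scrape failed this run: keep old chunks and retry later
        elif fresh[url] != stored[url]:
            plan.changed.append(url)
        else:
            plan.unchanged.append(url)
    return plan


stored = {"u/a": "h1", "u/b": "h2", "u/c": "h3", "u/e": "h5"}
mapped = {"u/a", "u/b", "u/d", "u/e"}
fresh = {"u/a": "h1", "u/b": "h2-new", "u/d": "h4"}  # u/e failed to scrape
plan = plan_refresh(stored, fresh, mapped)
print(plan)
```

Only `plan.added` and `plan.changed` need embedding calls, which is where the weekly savings come from.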
Pricing for corpus collection at scale
Credit costs (source: PLAN_DISPLAY, src/lib/plans-client.ts):
- /v1/map: 1 credit per call (covers an entire domain)
- /v1/scrape with http/lightpanda renderer: 1 credit per page
- /v1/scrape with chrome renderer: 2 credits per page
- /v1/crawl: 1 credit per page crawled (2 per page with chrome)
Example: 10,000-page corpus, all http/lightpanda renderer
- URL discovery: 10 domains × 1 credit = 10 credits (negligible)
- Crawl 10,000 pages × 1 credit = 10,000 credits
- Weekly refresh (8% change rate, 800 pages) × 1 credit = 800 credits/week ≈ 3,200 credits/month
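- Total: ~13,200 credits in the first month (initial crawl plus four weekly refreshes), then ~3,200 credits/month for refresh alone → Standard plan ($69/mo launch price, $99/mo regular, 100,000 credits; source: PLAN_DISPLAY)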
Example: 100,000-page corpus, mixed rendering
- Crawl 100,000 pages (90% lightpanda @ 1 cr, 10% chrome @ 2 cr) = 110,000 credits
- Weekly refresh (8% change rate, 8,000 pages) = 8,000 credits/week ≈ 32,000 credits/month
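- Total: ~142,000 credits in the first month (initial crawl plus four weekly refreshes), then ~32,000 credits/month for refresh alone → the first month needs the Growth plan ($279/mo launch price, $399/mo regular, 500,000 credits; source: PLAN_DISPLAY)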
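The two examples above follow the same arithmetic, which can be sketched as a small estimator. It mirrors the examples' simplifications (a month is four weeks, refreshed pages are billed at 1 credit each) and uses only the credit allowances quoted on this page; the plan list may not be exhaustive, and the function names are illustrative.

```python
from typing import Optional

# Credit allowances quoted on this page; the real plan list may differ.
PLANS = [("Hobby", 3_000), ("Standard", 100_000), ("Growth", 500_000), ("Scale", 1_000_000)]


def monthly_credits(
    pages: int,
    chrome_share: float = 0.0,
    weekly_change_rate: float = 0.08,
    weeks_per_month: int = 4,
    domains: int = 0,
) -> dict[str, int]:
    """First-month estimate: one full crawl plus weekly refreshes."""
    light_pages = round(pages * (1 - chrome_share))
    chrome_pages = pages - light_pages
    initial = light_pages * 1 + chrome_pages * 2  # 1 credit light, 2 credits chrome
    refresh = round(pages * weekly_change_rate) * weeks_per_month
    discovery = domains  # one discovery call per domain at 1 credit each
    return {
        "initial": initial,
        "refresh": refresh,
        "discovery": discovery,
        "total": initial + refresh + discovery,
    }


def smallest_plan(credits: int) -> Optional[str]:
    """Smallest listed plan whose allowance covers `credits`, if any."""
    for name, allowance in PLANS:
        if credits <= allowance:
            return name
    return None


small = monthly_credits(10_000)
large = monthly_credits(100_000, chrome_share=0.10)
print(small["total"], smallest_plan(small["total"]))
print(large["total"], smallest_plan(large["total"]))
```

From the second month on, only the `refresh` figure recurs, so it is worth re-running `smallest_plan` on that number alone.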
Self-hosting: AGPL-3.0, single binary. Crawl at $0/page on your own server. Only cost is compute. See self-hosting.
Launch pricing ends 2026-06-01; prices revert to regular after that date.
Comparison: RAG corpus collection vs inference-time retrieval
| | Corpus collection (this page) | Inference-time RAG (→ RAG pipelines) |
|---|---|---|
| When it runs | Scheduled batch job (daily / weekly) | Every user query (real-time) |
| Primary cost driver | Scrape credits (per page crawled) | Embedding API calls + vector query latency |
| Bottleneck | Content fidelity and dedup quality | Chunk quality and retrieval precision |
| fastCRW role | /v1/crawl + /v1/map for bulk collection | /v1/scrape for on-demand page fetch |
| Freshness pattern | Incremental re-crawl on schedule | Always live (or near-real-time) |
| Scale | Millions of pages per batch run | One page per turn, per user |
FAQ
Q: Why does the benchmark show fastCRW with the worst p90 latency if it has the best truth-recall?
A: These metrics are causally linked. fastCRW's chrome-stealth fallback is what recovers pages that simpler renderers fail on — and those are the slow pages. The p90 of 14,157 ms (RESULT_3WAY_1000_FULL.md, 2026-05-08) reflects that tail of complex, JS-heavy pages. For bulk corpus collection (batch jobs, not real-time), that latency is acceptable. For real-time scraping on a user's request, plan your timeout budget accordingly or filter to known simple-render sites.
Q: Can I use fastCRW with LangChain, LlamaIndex, or other RAG frameworks?
A: Yes. fastCRW is Firecrawl-compatible: the same request shape, same endpoint names, same response fields. Any LangChain FirecrawlLoader or LlamaIndex FireCrawlWebReader that targets Firecrawl works against fastCRW after a base-URL swap (https://api.fastcrw.com). See the TypeScript example above.
Q: How do I extract structured metadata (title, author, publish date) from corpus pages?
A: Pass formats: ["json"] and a jsonSchema to /v1/scrape. fastCRW's LLM extraction (5 credits per call) fills your schema fields automatically from the page HTML. For corpus collection at scale, extract metadata on the pages where it matters (news articles, research papers) and skip extraction on reference docs where structure is less important.
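As a rough illustration of that request, the sketch below builds a /v1/scrape body using the `formats` and `jsonSchema` fields named in the answer. The schema itself is hypothetical, and where the extracted object appears in the response is not documented on this page, so the sketch stops at the request body.

```python
import json

# Hypothetical schema: these field names are illustrative, not a fastCRW requirement.
ARTICLE_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "publish_date": {"type": "string", "description": "ISO 8601 date if present"},
    },
    "required": ["title"],
}


def build_extract_payload(url: str, schema: dict) -> dict:
    """Request body for a scrape with LLM extraction (5 credits per call)."""
    return {"url": url, "formats": ["json"], "jsonSchema": schema}


payload = build_extract_payload("https://news.example.com/story", ARTICLE_SCHEMA)
print(json.dumps(payload, indent=2))
```

POST the resulting body to /v1/scrape with the same Authorization header used in the other examples.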
Q: Is fastCRW suitable for scraping behind authentication?
A: fastCRW does not manage authenticated sessions. For pages behind login, pre-authenticate in a real browser, export cookies, and pass them as request headers via the headers field on /v1/scrape. This works for session-cookie-based auth; OAuth flows requiring redirects need to be completed outside fastCRW.
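A minimal sketch of that cookie hand-off is shown below. It assumes the `headers` field is a JSON object mapping header names to values, which this page does not spell out; the cookie names and values are placeholders.

```python
def cookie_header(cookies: dict[str, str]) -> str:
    """Serialize name/value pairs into a single Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def build_authenticated_payload(url: str, cookies: dict[str, str]) -> dict:
    """Scrape request body that forwards browser-exported session cookies.

    Assumes `headers` is an object of header name -> value.
    """
    return {
        "url": url,
        "formats": ["markdown"],
        "headers": {"Cookie": cookie_header(cookies)},
    }


# Placeholder values standing in for cookies exported from a logged-in browser.
exported = {"sessionid": "abc123", "csrftoken": "xyz"}
payload = build_authenticated_payload("https://wiki.example.com/private/page", exported)
print(payload["headers"]["Cookie"])
```

Treat exported cookies as credentials: load them from a secret store rather than committing them alongside the crawler code.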
Q: What is the maximum corpus size supported by /v1/crawl?
A: /v1/crawl accepts maxPages up to 1,000 (and maxDepth up to 10) per job (crw-opencore/README.md). For domains larger than 1,000 pages, break the crawl into multiple jobs by path prefix (e.g., /docs/api/, /docs/guides/ as separate seeds), or use /v1/map to discover all URLs and iterate /v1/scrape concurrently across the full list.
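To decide how to split a large domain, you can group the discovered URL list by path prefix and check which groups still exceed the per-job limit. This is a sketch with illustrative function names; the prefix depth of 2 is an assumption that fits layouts like /docs/api/.

```python
from collections import defaultdict
from urllib.parse import urlsplit

MAX_PAGES_PER_JOB = 1_000  # per-job maxPages ceiling quoted above


def group_by_prefix(urls: list[str], depth: int = 2) -> dict[str, list[str]]:
    """Group URLs by their first `depth` path segments, e.g. /docs/api."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        groups["/" + "/".join(segments[:depth])].append(url)
    return dict(groups)


def oversized_prefixes(
    groups: dict[str, list[str]], limit: int = MAX_PAGES_PER_JOB
) -> list[str]:
    """Prefixes with more URLs than one crawl job can take."""
    return sorted(p for p, members in groups.items() if len(members) > limit)


urls = [
    "https://docs.example.com/docs/api/a",
    "https://docs.example.com/docs/api/b",
    "https://docs.example.com/docs/guides/x",
    "https://docs.example.com/",
]
groups = group_by_prefix(urls)
print({prefix: len(members) for prefix, members in groups.items()})
```

Prefixes flagged by `oversized_prefixes` can be regrouped with a larger `depth`, or handed to per-URL scraping instead of a crawl job.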
Q: How do I validate that fastCRW captured a page correctly before embedding it?
A: Sample 50–100 pages from your crawl and manually compare the markdown output to the live page. Look for: (1) body text present, (2) code blocks preserved, (3) headings intact, (4) no navigation/footer artifacts dominating the output. The truth-recall benchmark (63.74% of 819 labeled URLs, diagnose_3way.py, 2026-05-08) gives you a baseline expectation — on a typical web corpus, you should see faithful extraction on 60–65% of pages; the remainder may need manual review or a different renderer.
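A reproducible sample plus a few cheap automated signals can narrow down which pages deserve a manual look. The checks below are heuristics chosen for this sketch (the 150-word floor matches the quality filter earlier on this page; the link-ratio cutoff is an assumption) and do not replace reading the page.

```python
import random

FENCE = "`" * 3  # code-fence marker


def sample_for_review(pages: list[dict], k: int = 50, seed: int = 0) -> list[dict]:
    """Reproducible random sample of pages for manual comparison."""
    rng = random.Random(seed)
    return rng.sample(pages, min(k, len(pages)))


def quick_checks(markdown: str) -> dict[str, bool]:
    """Cheap signals for the four review points: body, code, headings, nav noise."""
    lines = [l for l in markdown.splitlines() if l.strip()]
    link_only = sum(1 for l in lines if l.lstrip().startswith(("- [", "* [", "[")))
    return {
        "has_body_text": len(markdown.split()) >= 150,
        "has_headings": any(l.startswith("#") for l in lines),
        "code_fences_balanced": markdown.count(FENCE) % 2 == 0,
        "not_link_dominated": bool(lines) and link_only / len(lines) <= 0.5,
    }


good_page = "# Title\n" + "word " * 200
nav_dump = "\n".join(f"- [Link {i}](/p{i})" for i in range(20))
print(quick_checks(good_page))
print(quick_checks(nav_dump))
```

Pages that fail any check are good candidates for the manual comparison, or for a retry with a different renderer.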
Related resources
- RAG pipelines — inference-time retrieval: chunking, embedding, and querying the corpus you built here
- LLM training data — fine-tuning and JSONL output from web content
- Dataset curation — general ML dataset assembly from the open web
- Firecrawl alternatives — drop-in migration guide from Firecrawl to fastCRW
- Apify alternatives — when to use Apify actors vs a uniform scraping API
- Benchmarks — full 3-way fastCRW facts and raw results
- Pricing — current plan pricing and credits