
Training Data Collection via Web Scraping

Collect an LLM training corpus by scraping the web responsibly: coverage, provenance, robots.txt, and the legal questions you must settle before you start.

By Recep · June 23, 2026 · 8 min read · Last updated: June 2, 2026

By the fastCRW team · Benchmarks and self-host cost verified 2026-05-18 · Numbers trace to the fastCRW canonical fact sheet · Verify independently before relying on these figures.

Training data collection via web scraping starts with coverage and provenance, not volume

If you are assembling an LLM training corpus from the web, the failure mode is rarely "we did not scrape enough pages." It is "we cannot prove where half of this came from," or "the pages we did keep were extracted badly and now the model has learned the navigation chrome." Training data collection via web scraping is mostly a data-governance and accuracy problem wearing a crawler costume. This guide is scoped to the collection stage — discovery, gathering, legal discipline, and provenance — and hands off instruction-pair structuring to its sibling guide, LLM fine-tuning data pipelines from the web.

The honest framing up front: we build fastCRW, an open-core Rust scraper with a Firecrawl-compatible API, so weigh this accordingly. We have kept the trade-offs and the gaps explicit because a corpus you cannot defend later is worse than no corpus at all.

What good corpus collection requires

Coverage, provenance, and source accuracy

Three properties decide whether a scraped corpus is usable for training, and they are not the same thing:

  • Coverage — did you actually reach the documents you intended to, or did a JS-heavy site silently hand you empty shells? Gaps here become silent holes in what the model never sees.
  • Source accuracy — for each page you did fetch, did you extract the real content, or boilerplate, cookie banners, and broken markdown? Bad extraction poisons every downstream step.
  • Provenance — can you answer, for any single document, "where did this come from, when, and under what license?" Without it, you cannot honor a takedown, audit a bias report, or pass a data-governance review.

Why accuracy at the source matters more than volume

Volume is the easy lever and the wrong one to pull first. A larger corpus of badly extracted pages amplifies noise; the model dutifully learns the artifacts. Extraction quality is the variable that caps everything downstream, which is why we benchmark it directly. On Firecrawl's own public scrape-content dataset (1,000 URLs, of which 819 carry labeled ground truth), fastCRW posted the highest truth-recall of the three tools tested: 63.74% of the 819 labeled URLs, ahead of Crawl4AI (59.95%) and Firecrawl (56.04%) (diagnose_3way.py, single run, 2026-05-08). Higher truth-recall means each document you keep starts from cleaner, more faithful content, which is the same reason it matters for AI data preparation generally. Chase coverage and accuracy first; scale volume only once those hold.
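To put those percentages in document terms, here is a small arithmetic sketch. The per-tool counts it prints are derived from the stated percentages and the 819-URL denominator; they are implied values, not figures reported by the benchmark itself.

```python
# Convert the stated truth-recall percentages into approximate document counts.
LABELED_URLS = 819  # labeled ground-truth URLs, as stated above

TRUTH_RECALL_PCT = {"fastCRW": 63.74, "Crawl4AI": 59.95, "Firecrawl": 56.04}


def implied_hits(pct: float, total: int = LABELED_URLS) -> int:
    """Round a percentage of `total` to the nearest whole document."""
    return round(total * pct / 100)


implied = {tool: implied_hits(pct) for tool, pct in TRUTH_RECALL_PCT.items()}

for tool, hits in implied.items():
    print(f"{tool}: roughly {hits} of {LABELED_URLS} labeled URLs")
```

Read this way, a few percentage points of recall translate into dozens of documents per thousand that start from faithful content instead of noise.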

Building the collection pipeline

Discover with map, gather with crawl

A corpus build is two distinct phases, and conflating them wastes credits and politeness budget. First discover: POST /v1/map returns the URLs on a site so you can see the shape of what you are about to ingest, filter out the obvious junk (login walls, pagination loops, faceted-search explosions), and estimate scale before you commit. Then gather: POST /v1/crawl runs an async BFS crawl and returns a job ID; you poll GET /v1/crawl/:id for status and results. Crawl honors maxDepth (cap 10) and maxPages (cap 1000), so you bound each run explicitly rather than letting it wander — essential for both cost and politeness. For whole-site collection patterns, see crawling an entire website by sitemap.
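A minimal sketch of the two phases, assuming a JSON request body with a `url` field and a job object carrying a `status` field whose terminal values are `completed` and `failed`. Those names, the junk heuristics, and the helper functions are illustrative assumptions, not the documented fastCRW schema; only `maxDepth`, `maxPages`, and their caps come from the text above. The HTTP call is injected as a callable so the logic stays independent of any client library.

```python
import time
from typing import Callable, Dict
from urllib.parse import parse_qs, urlsplit

MAX_DEPTH_CAP = 10    # cap stated in this guide
MAX_PAGES_CAP = 1000  # cap stated in this guide

# Hypothetical junk heuristics; tune them per site after inspecting the map output.
JUNK_PATH_PARTS = ("/login", "/signin", "/cart", "/search")
MAX_QUERY_PARAMS = 1  # several query params often signal faceted search


def keep_url(url: str) -> bool:
    """Return True if a discovered URL looks worth crawling."""
    parts = urlsplit(url)
    path = parts.path.lower()
    if any(part in path for part in JUNK_PATH_PARTS):
        return False
    query = parse_qs(parts.query)
    if "page" in query:  # likely a pagination loop
        return False
    return len(query) <= MAX_QUERY_PARAMS


def crawl_request(url: str, max_depth: int, max_pages: int) -> Dict[str, object]:
    """Build a bounded crawl body, clamping to the stated caps."""
    return {
        "url": url,  # assumed field name
        "maxDepth": min(max_depth, MAX_DEPTH_CAP),
        "maxPages": min(max_pages, MAX_PAGES_CAP),
    }


def poll_until_done(get_status: Callable[[str], dict], job_id: str,
                    interval_s: float = 2.0, max_polls: int = 300) -> dict:
    """Poll a status callable (wrapping GET /v1/crawl/:id) until the job ends."""
    for _ in range(max_polls):
        job = get_status(job_id)
        if job.get("status") in ("completed", "failed"):  # assumed status values
            return job
        time.sleep(interval_s)
    raise TimeoutError(f"crawl {job_id} did not finish after {max_polls} polls")
```

In use, you would filter the map output with `keep_url`, send `crawl_request(...)` as the body of POST /v1/crawl, and pass a small function that performs the GET to `poll_until_done`.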

Clean markdown so downstream processing is cheaper

Collect to clean markdown, not raw HTML. Markdown strips the navigation, ad slots, and script tags that would otherwise survive into your dataset and waste tokens in every later pass — dedup, language detection, quality filtering, chunking, or instruction-pair construction. The substrate you store at collection time is the substrate every downstream stage pays for, so paying once for clean extraction is cheaper than re-cleaning the same noise N times. This is the same reasoning behind treating the corpus build as the front of a longer LLM training data pipeline rather than a one-off dump.

Legal, ethical, and provenance discipline

robots.txt and when overrides are legitimate

fastCRW respects robots.txt by default. It can be explicitly overridden only when the caller has the legal right to do so — for example, you own the site, you have a contract or written permission, or a jurisdiction-specific exemption clearly applies. That is a real constraint, not marketing copy: the default is to honor the site's stated crawl policy, and the override is a deliberate, accountable decision you make with legal cover, not a convenience flag to flip when a crawl is slow.
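If you want to pre-check a URL list against a site's policy in your own pipeline, Python's standard-library `urllib.robotparser` is enough for a sketch. This is not fastCRW's internal implementation; the `override_reason` parameter is a hypothetical way to force an override to carry a written justification, and the robots.txt content and user agent are made up for illustration.

```python
from typing import Iterable, List, Optional
from urllib.robotparser import RobotFileParser

# Example policy for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt: str, user_agent: str, urls: Iterable[str],
                 override_reason: Optional[str] = None) -> List[str]:
    """Filter URLs by robots.txt unless an explicit override reason is given.

    `override_reason` is a hypothetical audit field: a non-empty string such as
    "site owner" or a contract reference. Log it wherever you log the crawl.
    """
    urls = list(urls)
    if override_reason:
        return urls
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]
```

Making the override a required string rather than a boolean keeps the decision accountable: there is no way to bypass the policy without leaving a reason behind.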

Treat robots.txt as the floor, not the ceiling. Terms of service, copyright, and emerging regional rules on text-and-data-mining can all impose obligations stricter than a permissive robots file suggests. We are not your lawyers and this is not legal advice — settle the licensing question with counsel before you scrape, because un-collecting data after the fact is far harder than not collecting it.

Tracking source and license per document

The discipline that separates a defensible corpus from a liability is per-document attribution. For every document you keep, record the source URL and, where the site declares one, the license or usage terms. This is what lets you later exclude a source category wholesale, respond to a removal request without re-deriving where a string came from, or prove to a reviewer that no disallowed source slipped in. Capture it at collection time; reconstructing provenance after a corpus is merged and shuffled is usually impossible.

Provenance metadata you must keep

Recording URL, fetch time, and license per doc

At minimum, attach to each collected document: the canonical URL, the fetch timestamp, the HTTP status and renderer used, a content hash for dedup and change detection, and any declared license. The fetch timestamp matters more than people expect — web content is mutable, so "we trained on the version of this page as of 2026-06-02" is a meaningfully different claim than "we trained on this page." A content hash also lets a later re-collection detect what actually changed instead of re-ingesting everything.
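One way to hold those fields is a small immutable record written out as JSON Lines. The class name, field names, and the choice of SHA-256 over the extracted markdown are assumptions made for this sketch, not a fastCRW output format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    canonical_url: str
    fetched_at: str          # ISO 8601 timestamp in UTC
    http_status: int
    renderer: str            # whichever renderer label your pipeline reports
    content_sha256: str      # hash of the stored markdown
    license: Optional[str]   # None when the site declares nothing


def make_record(url: str, markdown: str, http_status: int, renderer: str,
                license: Optional[str] = None,
                fetched_at: Optional[datetime] = None) -> ProvenanceRecord:
    """Build a provenance record at collection time."""
    when = fetched_at or datetime.now(timezone.utc)
    digest = hashlib.sha256(markdown.encode("utf-8")).hexdigest()
    return ProvenanceRecord(url, when.isoformat(), http_status, renderer, digest, license)


def to_jsonl_line(record: ProvenanceRecord) -> str:
    """Serialize one record as a single JSON Lines entry."""
    return json.dumps(asdict(record), sort_keys=True)
```

Storing the record next to the document, rather than in a separate system that can drift, is what keeps the two from being separated when the corpus is later merged and shuffled.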

Auditability for a regulated corpus

If your corpus might ever face a data-governance review — model cards, bias audits, regulatory inquiry, or a customer's procurement questionnaire — provenance metadata is the audit trail. The practical test: pick a random training document a year from now and answer "where did this come from, when, and were we allowed to use it?" in under a minute. If your collection pipeline cannot do that, fix it before you scale, not after. Keeping the data and its provenance on infrastructure you control also simplifies this enormously, which is the core argument for local-first scraping and data privacy.
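The one-minute test can be sketched as a lookup over provenance metadata stored as JSON Lines. The field names (`content_sha256`, `canonical_url`, `fetched_at`, `license`) and the license allowlist are hypothetical; substitute whatever schema and policy your pipeline actually records.

```python
import json
from typing import Dict, Iterable

# Hypothetical policy: licenses your counsel has cleared for training use.
ALLOWED_LICENSES = {"CC-BY-4.0", "CC0-1.0"}


def build_index(jsonl_lines: Iterable[str]) -> Dict[str, dict]:
    """Index provenance records by the content hash of the document."""
    index: Dict[str, dict] = {}
    for line in jsonl_lines:
        if line.strip():
            record = json.loads(line)
            index[record["content_sha256"]] = record
    return index


def audit(index: Dict[str, dict], content_sha256: str) -> dict:
    """Answer: where did this come from, when, and is its license on the allowlist?"""
    record = index.get(content_sha256)
    if record is None:
        return {"found": False}
    declared = record.get("license")
    return {
        "found": True,
        "where": record["canonical_url"],
        "when": record["fetched_at"],
        "license": declared,
        "license_on_allowlist": declared in ALLOWED_LICENSES,
    }
```

A `found: False` result is itself an audit finding: it means a document reached the corpus without provenance attached.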

Scaling the collection cheaply

Self-hosting to remove per-page cost

At training scale, per-page cost dominates the economics, and metered APIs make a large corpus expensive precisely when you want it cheap. Self-hosting the AGPL-3.0 engine is $0 per 1,000 scrapes: you pay only for your own server, with no per-page meter. fastCRW's engine is a single static Rust binary, so a corpus collection run can sit on a modest box rather than a multi-service stack. For a regulated corpus, self-hosting has a second payoff beyond cost: the scraped content and the target URLs never leave your infrastructure, which keeps the entire provenance story inside your own audit boundary. When you do want the managed option for convenience, check live pricing at /pricing rather than relying on a figure that may have changed.

Re-running collection on a schedule

A training corpus is rarely collected once. Web content drifts, new pages appear, and a stale snapshot quietly degrades. Because fastCRW is stateless per request, change detection and run history are yours to own — there is no built-in monitor watching for diffs. In practice that means a plain scheduled job: re-map the site, re-crawl within the same bounds, compare content hashes against your last run, and ingest only what changed. That is the honest trade versus a managed monitoring product, and for most corpus refreshes it is enough.
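The compare step of that scheduled job can be sketched as a diff between two URL-to-hash maps, one from the previous run and one from the current run. The function name and the shape of the maps are assumptions for illustration; how you persist the previous run is up to you.

```python
from typing import Dict, List


def plan_refresh(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """Classify URLs by comparing content hashes between two collection runs.

    Both arguments map canonical URL -> content hash.
    """
    new = sorted(u for u in current if u not in previous)
    changed = sorted(u for u in current if u in previous and current[u] != previous[u])
    unchanged = sorted(u for u in current if u in previous and current[u] == previous[u])
    removed = sorted(u for u in previous if u not in current)
    return {"new": new, "changed": changed, "unchanged": unchanged, "removed": removed}


def urls_to_ingest(plan: Dict[str, List[str]]) -> List[str]:
    """Only new and changed pages need to be re-ingested."""
    return plan["new"] + plan["changed"]
```

Tracking `removed` separately is worth the extra line: a page that disappeared may need to be flagged in the corpus, not silently kept.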

Honest gaps you should plan around

State these plainly so your collection design accounts for them:

  • No multi-URL batch extraction. There is no /v1/batch/scrape endpoint; for many URLs you iterate /v1/scrape concurrently or use /v1/crawl. Structured per-page extraction is single-URL.
  • LLM extraction is OpenAI/Anthropic only. If your collection step uses schema-driven JSON extraction (5 credits per request), the extraction LLM is limited to those two providers.
  • No screenshot output (a formats: ["screenshot"] request returns HTTP 422), no Fire-engine anti-bot, and no managed proxy depth. For hostile, heavily-defended targets, a proxy-network specialist is the right tool, not fastCRW.
  • Longer tail latency in recall mode. The chrome-stealth fallback that recovers pages others miss extends the tail in recall mode, which rarely matters for a batch corpus build. For reference, fastCRW's p50 of 1,914 ms beats Firecrawl's 2,305 ms, and in fast mode the p90 is 4,348 ms, the lowest of the three tools tested (Crawl4AI 4,754 ms, Firecrawl 6,937 ms).
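For the first gap, iterating single-URL scrapes concurrently can be as simple as a bounded thread pool. In this sketch `scrape_one` is a hypothetical callable you supply that wraps one POST /v1/scrape request and returns the extracted markdown; it is not a fastCRW client function, and the default worker count is an arbitrary, polite starting point.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def scrape_many(urls: Iterable[str], scrape_one: Callable[[str], str],
                max_workers: int = 4) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run a single-URL scrape callable over many URLs with bounded concurrency.

    Returns (results, errors), both keyed by URL, so one failing page
    does not abort the rest of the run.
    """
    results: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[url] = repr(exc)
    return results, errors
```

Keeping `max_workers` small per target host is part of the politeness budget; raise it only across many different hosts.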

Sources

  • fastCRW canonical fact sheet — scrape benchmark (truth-recall 63.74% of 819 labeled URLs, diagnose_3way.py, 2026-05-08), self-hosted $0 per 1,000 scrapes, robots.txt and honest gaps, /v1/map and /v1/crawl.
  • Benchmark of record: bench/server-runs/RESULT_3WAY_1000_FULL.md.
  • fastCRW repo and pricing: github.com/us/crw · fastcrw.com.

Related: LLM training data pipeline with CRW · LLM fine-tuning data pipelines from the web · AI data preparation guide · Crawl an entire website by sitemap

Frequently asked questions

How do I collect a training corpus by scraping the web?

Treat it as two phases. First discover with POST /v1/map to list a site's URLs, filter out junk, and estimate scale. Then gather with POST /v1/crawl, an async BFS crawl bounded by maxDepth (cap 10) and maxPages (cap 1000); poll GET /v1/crawl/:id for results. Collect to clean markdown rather than raw HTML so every downstream pass (dedup, filtering, chunking) is cheaper, and attach provenance metadata to each document as you go.

Is web-scraped training data legal to use?

It depends on the source's terms, copyright, and your jurisdiction's text-and-data-mining rules; robots.txt is the floor, not the full answer. This is not legal advice; settle licensing with counsel before scraping, because removing data from a merged corpus afterward is far harder than never collecting it. Recording per-document license and source at collection time is what makes later compliance possible.

How does fastCRW handle robots.txt during collection?

fastCRW respects robots.txt by default. It can be explicitly overridden only when the caller has the legal right to do so: for example, you own the site, hold a contract or written permission, or a clear jurisdictional exemption applies. The default is to honor the site's stated crawl policy; the override is a deliberate, accountable decision, not a convenience toggle.

How do I track provenance and licensing for a scraped corpus?

Attach metadata to every kept document at collection time: canonical URL, fetch timestamp, HTTP status and renderer, a content hash for dedup and change detection, and any declared license. The fetch timestamp pins which version of a mutable page you used; the hash lets a later re-crawl detect real changes. The test of a defensible corpus is answering 'where did this come from, when, and were we allowed to use it?' for any random document in under a minute.

How much does large-scale training data collection cost?

Self-hosting the AGPL-3.0 fastCRW engine is $0 per 1,000 scrapes: you pay only for your own server, with no per-page meter. Because the engine is a single static Rust binary, a corpus run can sit on a modest box. Self-hosting also keeps scraped content and target URLs inside your own audit boundary. For the managed option, check live pricing at /pricing rather than a figure that may have changed.
