By the fastCRW team · Performance figures from diagnose_3way.py on Firecrawl's public 819-labeled-URL dataset, verified 2026-05-18 · Credit costs current as of 2026-05-18 · Verify independently before building.
Sitemap to web crawl: the scale problem
Going from a sitemap to a web crawl is trivial on a 50-page marketing site and genuinely hard on a 40,000-page e-commerce catalog or docs portal. This post is scale-only: it assumes you already know how to fire a single crawl and want to control budget, depth, and the latency tail when the URL count climbs. If you need the fundamentals first (what a crawl job is, how to start one, how to read results), start with our "Crawl an entire website from its sitemap" walkthrough and come back here when you hit a wall.
The core insight for large sites: do not point a blind breadth-first crawl at a domain root and hope the budget holds. Seed discovery from the URLs you actually want, then crawl with explicit caps. fastCRW gives you two primitives for exactly this — /v1/map for discovery and /v1/crawl for the async traversal — and the discipline is in how you chain them.
sitemap.xml, /v1/map, and /v1/crawl are three different things
People conflate these constantly, and at scale the conflation costs money. They are distinct layers:
- sitemap.xml is a file the site publishes. It is a hint, often incomplete, sometimes stale, and not guaranteed to exist.
- POST /v1/map discovers all URLs on a site and returns them as a list, at 1 credit for the whole call. It is your inventory step: cheap, fast, no page content.
- POST /v1/crawl runs an asynchronous breadth-first traversal, fetches each page, converts it to clean Markdown (or JSON), and bills per page.
The mistake is treating /v1/crawl as both discovery and collection on a big site. Separating them — map to plan, crawl to collect — is what makes large jobs predictable.
Seeding a large crawl from discovered URLs
The map-first pattern is the single most important habit for crawling at scale. One /v1/map call costs 1 credit and returns the URL universe. You then inspect that list before spending 1 credit per page on a crawl that might be 10x larger than you expected.
Map first, then crawl the URLs you actually want
A typical large-site flow:
- Call /v1/map on the domain. Get back, say, 38,000 URLs for 1 credit.
- Filter that list locally (by path prefix, by query-string presence, by language directory) down to the subset you care about.
- Crawl only that subset, or crawl with a tight maxPages ceiling, so the bill matches the plan you actually had.
The economics are stark. Discovering 38,000 URLs to learn that 30,000 are paginated filter permutations you do not need costs you 1 credit via map. Discovering the same thing by crawling costs you up to 38,000 credits. The map step is the cheapest insurance you will buy all week.
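A minimal sketch of the discovery step, using only the Python standard library. Several details here are assumptions rather than documented fastCRW behavior: the base URL and API key are placeholders, the bearer-token header is a guess, and the request body (`url`) and response field (`links`) follow the Firecrawl-style shape, so check them against the endpoint reference before relying on them.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical; substitute your fastCRW endpoint
API_KEY = "YOUR_API_KEY"              # hypothetical placeholder


def build_map_request(site_url: str) -> urllib.request.Request:
    """Build the POST /v1/map request (body shape is an assumption)."""
    body = json.dumps({"url": site_url}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/map",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",  # auth scheme is an assumption
        },
    )


def extract_links(map_response: dict) -> list[str]:
    """Pull the URL list out of a map response (assumes a 'links' array)."""
    return list(map_response.get("links", []))


def discover(site_url: str) -> list[str]:
    """One network call: return the URL inventory for a site."""
    with urllib.request.urlopen(build_map_request(site_url), timeout=60) as resp:
        return extract_links(json.load(resp))
```

Keeping request construction and response parsing separate from the network call makes both easy to check offline before any credits are spent.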
Filtering URL patterns before you spend credits at scale
Large sites are full of crawl traps: faceted-search permutations (?color=red&size=m&sort=price), infinite calendars, session-tagged URLs, and printer-friendly duplicates. Map gives you the list; you decide what is signal. Drop query-heavy URLs you do not need, collapse trailing-slash duplicates, and exclude changelog/search/tag pages before the crawl ever starts. Every URL you remove here is a credit (or two) you do not spend later.
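The filtering itself is plain local code. The sketch below is one way to do it, assuming the map output is a list of absolute URLs; the excluded path segments and the `/products` prefix are illustrative choices, not anything fastCRW prescribes.

```python
from urllib.parse import urlsplit, urlunsplit

# Illustrative exclusions; tune these to the site you are crawling.
EXCLUDED_SEGMENTS = {"changelog", "search", "tag"}


def normalize(url: str) -> str:
    """Collapse trailing-slash duplicates and drop fragments."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


def keep(url: str, prefix: str) -> bool:
    """Return True if the URL looks like signal rather than a crawl trap."""
    parts = urlsplit(url)
    if parts.query:  # faceted-search and session-tagged permutations
        return False
    segments = {s for s in parts.path.split("/") if s}
    if segments & EXCLUDED_SEGMENTS:
        return False
    return parts.path.startswith(prefix)


def filter_urls(urls: list[str], prefix: str = "/products") -> list[str]:
    """Normalize, de-duplicate, and filter while preserving input order."""
    seen: set[str] = set()
    kept: list[str] = []
    for url in urls:
        norm = normalize(url)
        if norm in seen or not keep(norm, prefix):
            continue
        seen.add(norm)
        kept.append(norm)
    return kept
```

Dropping every URL with a query string is deliberately blunt. If some parameters carry real content on your target (a `?page=` index, for example), allow-list those keys instead of rejecting all queries.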
Controlling scope at scale: maxDepth and maxPages
When you do crawl, the two knobs that keep a large job bounded are maxDepth and maxPages. These are hard ceilings in fastCRW, not suggestions.
The maxDepth cap of 10 and the maxPages cap of 1000
fastCRW caps maxDepth at 10 and maxPages at 1000 per crawl job (source: crw-opencore README endpoint table, verified 2026-05-18). On a large site this matters in two ways:
- Depth. Most useful content on a well-structured site sits within 3–5 hops of an entry point. A maxDepth of 3–4 often captures the catalog while skipping the deep tail of pagination. The cap of 10 means you cannot accidentally chase an infinite link structure forever.
- Pages. The 1000-page ceiling per job means a 40,000-page site is not one crawl; it is a series of scoped crawls, each seeded from a filtered map slice. Plan around batches, not one monster job.
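Planning around batches can be sketched as a simple slicing step over the filtered URL list. The two caps come from the limits stated above; the helper names are hypothetical, and how each slice is submitted to /v1/crawl is left out because it depends on request fields not covered here.

```python
MAX_DEPTH_CAP = 10    # fastCRW per-job ceiling for maxDepth
MAX_PAGES_CAP = 1000  # fastCRW per-job ceiling for maxPages


def slice_into_jobs(urls: list[str], pages_per_job: int = MAX_PAGES_CAP) -> list[list[str]]:
    """Split a filtered URL list into slices that each fit one crawl job."""
    if not 1 <= pages_per_job <= MAX_PAGES_CAP:
        raise ValueError(f"pages_per_job must be between 1 and {MAX_PAGES_CAP}")
    return [urls[i:i + pages_per_job] for i in range(0, len(urls), pages_per_job)]


def scoped_caps(max_depth: int, max_pages: int) -> dict:
    """Clamp requested limits to the service ceilings before sending them."""
    return {
        "maxDepth": min(max_depth, MAX_DEPTH_CAP),
        "maxPages": min(max_pages, MAX_PAGES_CAP),
    }
```

Clamping on the client side is a convenience, not a requirement: it keeps your own logs and budget forecasts in line with what the service can actually run.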
limit and max_pages as accepted aliases
If you are porting existing code or following a Firecrawl example, you do not have to rename fields. limit and max_pages are accepted serde aliases of maxPages, so all three resolve to the same cap. This is part of the Firecrawl-compatible surface: most crawl request bodies work unchanged after a base-URL swap.
Async job model: start a crawl, poll the job ID
/v1/crawl is asynchronous by design — exactly right for large sites where a job can run for minutes. The contract:
- POST /v1/crawl returns a job ID immediately. It does not block.
- GET /v1/crawl/:id returns status plus results as they complete; poll this.
- DELETE /v1/crawl/:id cancels a running job, which matters when you realize the scope was wrong three minutes in.
For scale, the cancel endpoint is underrated: it is your circuit breaker. If a crawl is consuming faster than expected, kill it, re-filter the map, and restart with a tighter cap rather than letting it run to the ceiling.
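The circuit-breaker idea can be sketched as a polling loop that cancels once a page budget is exceeded. The two callables stand in for your own wrappers around GET /v1/crawl/:id and DELETE /v1/crawl/:id. The status payload shape (`status`, `completed`) and the terminal status names are assumptions borrowed from the Firecrawl-style response, so adjust them to what your jobs actually return.

```python
import time
from typing import Callable

# Assumed terminal status names; verify against real job responses.
TERMINAL = {"completed", "failed", "cancelled"}


def poll_with_budget(
    get_status: Callable[[], dict],  # wraps GET /v1/crawl/:id
    cancel: Callable[[], None],      # wraps DELETE /v1/crawl/:id
    page_budget: int,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll a crawl job and cancel it if it runs past the page budget."""
    while True:
        state = get_status()
        if state.get("status") in TERMINAL:
            return state
        if state.get("completed", 0) > page_budget:
            cancel()  # circuit breaker: stop spending, then re-plan
            return {**state, "status": "cancelled_by_client"}
        sleep(interval_s)
```

Passing the sleep function in as a parameter keeps the loop testable without real waiting, and makes it easy to swap in a backoff schedule later.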
Concurrency, latency, and the honest tail
At scale, aggregate throughput is dominated by per-page latency, and we are going to be straight with you about ours rather than quote a single flattering average.
Median latency is fast; the p90 tail is the worst of three — disclosed
On Firecrawl's own public dataset (819 labeled URLs, diagnose_3way.py, 2026-05-08), fastCRW's p50 latency was 1914 ms, lower than Firecrawl's 2305 ms and effectively tied with Crawl4AI's 1916 ms. That is the good half. The honest half: fastCRW's p90 was 14157 ms, the worst of the three tools tested (Crawl4AI 4754 ms, Firecrawl 6937 ms). We publish the full split because a single average would hide exactly the number a large-site planner needs.
Why the slow tail exists and how to plan around it
The tail is causal, not incidental. fastCRW's chrome-stealth fallback is the same mechanism that recovers the difficult pages other tools miss — it is why fastCRW also posted the highest truth-recall of the three (63.74% of 819 labeled URLs) and a 0-error run paired with 87.7% scrape-success. The pages that fall to the stealth fallback are the slow ones. For a large crawl, that means:
- Size timeouts for the tail, not the median. A timeout tuned to 2 seconds will kill the very pages the fallback exists to rescue.
- Run pages concurrently so the p90 outliers overlap with fast p50 fetches instead of serializing behind them.
- Treat wall-clock estimates as p90-bound on hard sites: a 1000-page crawl is not 1000 × 1914 ms, because on the benchmark one page in ten took 14157 ms or longer.
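To make the last point concrete, here is a deliberately crude estimate that uses only the two published percentiles. It assumes a fixed share of pages costs the p90 latency and the rest cost the p50, and that concurrency divides the total evenly. Real distributions are messier (some pages exceed p90, and many fall between the two figures), so treat the result as a planning floor, not a forecast.

```python
P50_MS = 1914   # fastCRW median latency from the benchmark above
P90_MS = 14157  # fastCRW p90 latency from the benchmark above


def naive_seconds(pages: int) -> float:
    """The misleading estimate: every page at the median, run serially."""
    return pages * P50_MS / 1000


def tail_aware_seconds(pages: int, tail_fraction: float = 0.10, concurrency: int = 1) -> float:
    """Two-point model: tail_fraction of pages at p90, the rest at p50."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    tail_pages = round(pages * tail_fraction)
    total_ms = (pages - tail_pages) * P50_MS + tail_pages * P90_MS
    return total_ms / 1000 / concurrency
```

For 1000 pages run serially, the naive figure is 1914 seconds while the two-point model gives about 3138 seconds, roughly 64% longer. At a concurrency of 10, the same model lands near 314 seconds.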
If you want the full mechanics of where scrape latency comes from, see "Scraping latency explained"; the raw numbers live at /benchmarks.
Credit math for a large-site pass
The pricing is deliberately simple, which makes large jobs easy to forecast.
| Operation | Credits | Notes |
|---|---|---|
| /v1/map | 1 | Whole-site URL discovery, one call |
| /v1/crawl (http / lightpanda) | 1 per page | Default renderers |
| /v1/crawl (chrome-rendered) | 2 per page | When JS execution is required |
A worked example: map a 40,000-URL site (1 credit), filter to 8,000 product pages, crawl them across eight scoped jobs at the 1000-page cap. At the default http/lightpanda renderer that is 8,000 credits plus the 1 map credit = 8,001 credits. If those pages need JavaScript and fall to the chrome renderer, double the per-page cost to roughly 16,000 credits. The chrome multiplier is the single biggest lever on a large bill — confirm your target actually needs it before defaulting to it.
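The worked example reduces to a few lines of arithmetic. This sketch hard-codes the credit costs from the table above, which may change, and the function names are illustrative.

```python
import math

MAP_CREDITS = 1          # per /v1/map call
HTTP_PAGE_CREDITS = 1    # http / lightpanda renderers
CHROME_PAGE_CREDITS = 2  # chrome-rendered pages
MAX_PAGES_PER_JOB = 1000


def jobs_needed(pages: int) -> int:
    """Number of scoped crawl jobs at the 1000-page cap."""
    return math.ceil(pages / MAX_PAGES_PER_JOB)


def total_credits(pages: int, chrome: bool = False, map_calls: int = 1) -> int:
    """Map credits plus per-page crawl credits for one pass."""
    per_page = CHROME_PAGE_CREDITS if chrome else HTTP_PAGE_CREDITS
    return map_calls * MAP_CREDITS + pages * per_page


print(jobs_needed(8000))                 # 8
print(total_credits(8000))               # 8001
print(total_credits(8000, chrome=True))  # 16001
```

A mixed pass, where only some pages need the chrome renderer, is the sum of two calls with `map_calls=0` on the second.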
Self-host for $0 vs managed credits at scale
If your crawl volume is large and recurring, the credit math changes shape entirely. fastCRW's engine is AGPL-3.0 and self-hostable — a single static Rust binary — so you can run unlimited crawls for the cost of your own server, no per-page credits at all. The managed cloud is the right call when you want zero ops and predictable monthly pricing; self-host wins when volume is high enough that the server price beats the credits. Compare live managed tiers on the pricing page rather than hard-coding numbers that move. The API is identical either way, so you can prototype on managed credits and move the heavy recurring crawls to a self-hosted binary later with the same request bodies.
Sources
- fastCRW endpoint table, renderer selection, and crawl caps: github.com/us/crw (crw-opencore README, verified 2026-05-18)
- Scrape/latency benchmark of record: diagnose_3way.py on Firecrawl's public scrape-content-dataset-v1, 819 labeled URLs, 2026-05-08 (bench/server-runs/RESULT_3WAY_1000_FULL.md)
- Live pricing and credit costs: fastcrw.com/pricing
Related: Crawl an entire website from its sitemap · The /v1/map endpoint deep dive · The /v1/crawl endpoint deep dive · Scraping latency explained
