Sitemap to Crawl: Optimized Discovery at Scale

Go from sitemap to a full crawl on large sites: seed with /v1/map, then cap maxDepth and maxPages. Discovery patterns, caps, concurrency, and credit costs.

June 9, 2026 · 9 min read · Last updated: June 2, 2026

By the fastCRW team · Performance figures from diagnose_3way.py on Firecrawl's public 819-labeled-URL dataset, verified 2026-05-18 · Credit costs current as of 2026-05-18 · Verify independently before building.

Sitemap to web crawl: the scale problem

Going from a sitemap to a web crawl is trivial on a 50-page marketing site and genuinely hard on a 40,000-page e-commerce catalog or docs portal. This post is scale-only: it assumes you already know how to fire a single crawl and want to control budget, depth, and the latency tail when the URL count climbs. If you need the fundamentals first — what a crawl job is, how to start one, how to read results — start with our crawl an entire website from its sitemap walkthrough and come back here when you hit a wall.

The core insight for large sites: do not point a blind breadth-first crawl at a domain root and hope the budget holds. Seed discovery from the URLs you actually want, then crawl with explicit caps. fastCRW gives you two primitives for exactly this — /v1/map for discovery and /v1/crawl for the async traversal — and the discipline is in how you chain them.

sitemap.xml, /v1/map, and /v1/crawl are three different things

People conflate these constantly, and at scale the conflation costs money. They are distinct layers:

  • sitemap.xml is a file the site publishes. It is a hint, often incomplete, sometimes stale, and not guaranteed to exist.
  • POST /v1/map discovers all URLs on a site and returns them as a list — 1 credit for the whole call. It is your inventory step: cheap, fast, no page content.
  • POST /v1/crawl runs an asynchronous breadth-first traversal, fetches each page, converts it to clean Markdown (or JSON), and bills per page.

The mistake is treating /v1/crawl as both discovery and collection on a big site. Separating them — map to plan, crawl to collect — is what makes large jobs predictable.

Seeding a large crawl from discovered URLs

The map-first pattern is the single most important habit for crawling at scale. One /v1/map call costs 1 credit and returns the URL universe. You then inspect that list before spending 1 credit per page on a crawl that might be 10x larger than you expected.

Map first, then crawl the URLs you actually want

A typical large-site flow:

  1. Call /v1/map on the domain. Get back, say, 38,000 URLs for 1 credit.
  2. Filter that list locally — by path prefix, by query-string presence, by language directory — to the subset you care about.
  3. Crawl only that subset, or crawl with a tight maxPages ceiling, so the bill matches the plan you actually had.
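The first two steps of that flow can be sketched in a few lines of Python. This is a minimal illustration, not an official client: the base URL, the Bearer-token header, the `url` request field, and the `links` response key are all assumptions modeled on the Firecrawl-compatible surface, so check them against the endpoint docs before relying on them.

```python
import json
import urllib.request

BASE_URL = "https://api.example.invalid"  # hypothetical: use your fastCRW base URL
API_KEY = "YOUR_API_KEY"                  # hypothetical: Bearer auth is assumed


def post_json(path, body):
    """POST a JSON body to the API and return the decoded JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def plan_crawl(domain_url, keep, post=post_json):
    """Step 1: map the site (1 credit). Step 2: filter the list locally."""
    mapped = post("/v1/map", {"url": domain_url})  # "url" field is assumed
    urls = mapped.get("links", [])                 # "links" key is assumed
    wanted = [u for u in urls if keep(u)]
    return {"discovered": len(urls), "wanted": wanted}
```

Calling `plan_crawl("https://shop.example", lambda u: "/products/" in u)` would return the discovered count and the product URLs only, which is the number to budget the crawl against. The `post` parameter is injectable so the planning logic can be exercised without network access.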

The economics are stark. Discovering 38,000 URLs to learn that 30,000 are paginated filter permutations you do not need costs you 1 credit via map. Discovering the same thing by crawling costs 38,000 credits at the default renderers, and up to 76,000 if every page is chrome-rendered. The map step is the cheapest insurance you will buy all week.

Filtering URL patterns before you spend credits at scale

Large sites are full of crawl traps: faceted-search permutations (?color=red&size=m&sort=price), infinite calendars, session-tagged URLs, and printer-friendly duplicates. Map gives you the list; you decide what is signal. Drop query-heavy URLs you do not need, collapse trailing-slash duplicates, and exclude changelog/search/tag pages before the crawl ever starts. Every URL you remove here is a credit (or two) you do not spend later.
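A local filter along those lines might look like the sketch below. It uses only the Python standard library; the excluded path segments and the query-parameter threshold are illustrative choices, not fastCRW settings, and should be tuned per site.

```python
from urllib.parse import urlsplit

# Illustrative path segments to exclude; adjust for the site you are crawling.
EXCLUDED_SEGMENTS = {"search", "tag", "changelog"}


def keep_url(url, max_query_params=0):
    """Return False for query-heavy URLs and for excluded sections."""
    parts = urlsplit(url)
    n_params = len([p for p in parts.query.split("&") if p])
    if n_params > max_query_params:
        return False  # faceted-search permutations usually live in the query string
    segments = [s for s in parts.path.split("/") if s]
    return not any(s in EXCLUDED_SEGMENTS for s in segments)


def normalize(url):
    """Collapse trailing-slash duplicates and drop fragments."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme}://{parts.netloc}{path}{query}"


def filter_map(urls):
    """Apply keep_url, then de-duplicate on the normalized form, keeping order."""
    seen, kept = set(), []
    for url in urls:
        if not keep_url(url):
            continue
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept
```

Running `filter_map` over the map output before any crawl starts gives you the list whose length is the real page budget.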

Controlling scope at scale: maxDepth and maxPages

When you do crawl, the two knobs that keep a large job bounded are maxDepth and maxPages. These are hard ceilings in fastCRW, not suggestions.

The maxDepth cap of 10 and the maxPages cap of 1000

fastCRW caps maxDepth at 10 and maxPages at 1000 per crawl job (source: crw-opencore README endpoint table, verified 2026-05-18). On a large site this matters in two ways:

  • Depth. Most useful content on a well-structured site sits within 3–5 hops of an entry point. A maxDepth of 3–4 often captures the catalog while skipping the deep tail of pagination. The cap of 10 means you cannot accidentally chase an infinite link structure forever.
  • Pages. The 1000-page ceiling per job means a 40,000-page site is not one crawl — it is a series of scoped crawls, each seeded from a filtered map slice. Plan around batches, not one monster job.
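To make the batching concrete, here is a small sketch that slices a filtered URL list into groups no larger than the per-job page cap. It only does the planning arithmetic; how each slice is handed to a crawl job depends on the request fields your deployment accepts, which this example does not assume.

```python
MAX_PAGES_PER_JOB = 1000  # per-job cap stated in this post


def batches(urls, size=MAX_PAGES_PER_JOB):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def jobs_needed(url_count, size=MAX_PAGES_PER_JOB):
    """Number of scoped jobs required to cover url_count pages (ceiling division)."""
    return -(-url_count // size)
```

For example, `jobs_needed(40_000)` is 40, and a filtered list of 2,500 URLs splits into slices of 1000, 1000, and 500.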

limit and max_pages as accepted aliases

If you are porting existing code or following a Firecrawl example, you do not have to rename fields. limit and max_pages are accepted serde aliases of maxPages, so all three resolve to the same cap. This is part of the Firecrawl-compatible surface: most crawl request bodies work unchanged after a base-URL swap.
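As an illustration of the alias behavior, the sketch below folds the three spellings into one canonical body on the client side and clamps the values to the documented caps. The `url` field name is an assumption, and so is the clamping: the post does not say whether the server rejects or clamps over-cap values, so this helper simply keeps requests inside the limits before they are sent.

```python
MAX_DEPTH_CAP = 10    # cap stated in this post
MAX_PAGES_CAP = 1000  # cap stated in this post
PAGE_ALIASES = ("maxPages", "limit", "max_pages")


def normalize_crawl_body(body):
    """Return a copy of body with the page cap under `maxPages`, within the caps."""
    out = dict(body)
    present = [k for k in PAGE_ALIASES if k in out]
    if len(present) > 1:
        raise ValueError(f"conflicting page-cap fields: {present}")
    if present:
        out["maxPages"] = min(int(out.pop(present[0])), MAX_PAGES_CAP)
    if "maxDepth" in out:
        out["maxDepth"] = min(int(out["maxDepth"]), MAX_DEPTH_CAP)
    return out


# Three request bodies that differ only in the alias used for the page cap.
a = normalize_crawl_body({"url": "https://docs.example", "maxDepth": 3, "maxPages": 500})
b = normalize_crawl_body({"url": "https://docs.example", "maxDepth": 3, "limit": 500})
c = normalize_crawl_body({"url": "https://docs.example", "maxDepth": 3, "max_pages": 500})
```

After normalization `a`, `b`, and `c` are identical dictionaries, which mirrors the claim that all three names resolve to the same cap.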

Async job model: start a crawl, poll the job ID

/v1/crawl is asynchronous by design — exactly right for large sites where a job can run for minutes. The contract:

  • POST /v1/crawl returns a job ID immediately. It does not block.
  • GET /v1/crawl/:id returns status plus results as they complete — poll this.
  • DELETE /v1/crawl/:id cancels a running job, which matters when you realize the scope was wrong three minutes in.

For scale, the cancel endpoint is underrated: it is your circuit breaker. If a crawl is consuming faster than expected, kill it, re-filter the map, and restart with a tighter cap rather than letting it run to the ceiling.
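The start, poll, cancel contract can be sketched as a loop with a page budget acting as the circuit breaker. Everything about the `client` object here is hypothetical: it stands for any thin wrapper whose `start`, `status`, and `cancel` methods call POST /v1/crawl, GET /v1/crawl/:id, and DELETE /v1/crawl/:id. The `status` and `completed` fields in the status payload are also assumptions modeled on the Firecrawl response shape.

```python
import time


def run_crawl(client, body, page_budget, poll_seconds=5.0, sleep=time.sleep):
    """Start a crawl, poll until it finishes, and cancel if it exceeds page_budget."""
    job_id = client.start(body)                # wraps POST /v1/crawl
    while True:
        job = client.status(job_id)            # wraps GET /v1/crawl/:id
        if job["status"] in ("completed", "failed"):  # assumed status values
            return job
        if job.get("completed", 0) > page_budget:     # assumed page counter
            client.cancel(job_id)              # wraps DELETE /v1/crawl/:id
            return {**job, "status": "cancelled"}
        sleep(poll_seconds)
```

The budget check runs on every poll, so a job that is consuming pages faster than planned is stopped at the next status read instead of running to the maxPages ceiling. Injecting `sleep` keeps the loop testable without real waiting.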

Concurrency, latency, and the honest tail

At scale, aggregate throughput is dominated by per-page latency, and we are going to be straight with you about ours rather than quote a single flattering average.

Median latency is fast; the p90 tail is the worst of three — disclosed

On Firecrawl's own public dataset (819 labeled URLs, diagnose_3way.py, 2026-05-08), fastCRW's p50 latency was 1914 ms: below Firecrawl's 2305 ms and effectively tied with Crawl4AI's 1916 ms. That is the good half. The honest half: fastCRW's p90 was 14157 ms, the worst of the three tools tested (Crawl4AI 4754 ms, Firecrawl 6937 ms). We publish the full split because a single average would hide exactly the number a large-site planner needs.

Why the slow tail exists and how to plan around it

The tail is causal, not incidental. fastCRW's chrome-stealth fallback is the same mechanism that recovers the difficult pages other tools miss — it is why fastCRW also posted the highest truth-recall of the three (63.74% of 819 labeled URLs) and a 0-error run paired with 87.7% scrape-success. The pages that fall to the stealth fallback are the slow ones. For a large crawl, that means:

  • Size timeouts for the tail, not the median. A timeout tuned to 2 seconds will kill the very pages the fallback exists to rescue.
  • Run pages concurrently so the p90 outliers overlap with fast p50 fetches instead of serializing behind them.
  • Treat wall-clock estimates as p90-bound on hard sites: a 1000-page crawl is not 1000 × 1914 ms, because in the benchmark roughly one page in ten took 14 seconds or longer.
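A rough way to see why the median understates a large job is to compute a floor from the two published percentiles alone. The sketch below assumes your target site behaves like the benchmark distribution, which it may not: roughly 10% of pages take at least the p90, a further 40% take at least the p50, and the fastest half is counted as zero. The result is a lower bound under that assumption, not a forecast, and the concurrency division assumes perfect packing.

```python
P50_MS = 1914   # fastCRW p50 from the benchmark cited in this post
P90_MS = 14157  # fastCRW p90 from the same run


def serial_floor_ms(pages, p50=P50_MS, p90=P90_MS):
    """Lower bound on summed fetch time: 40% of pages at p50, 10% at p90, rest at 0."""
    return pages * (4 * p50 + 1 * p90) / 10


def wall_clock_floor_s(pages, concurrency, p50=P50_MS, p90=P90_MS):
    """Ideal-case floor on wall-clock seconds with `concurrency` parallel fetches."""
    return serial_floor_ms(pages, p50, p90) / concurrency / 1000


naive_ms = 1000 * P50_MS            # the estimate the text warns against
floor_ms = serial_floor_ms(1000)    # already higher, with half the pages counted as free
```

Even with the fastest half of the pages treated as instantaneous, the floor for 1000 pages (2,181,300 ms) exceeds the naive 1000 × 1914 ms figure, which is why timeouts and concurrency should be sized for the tail.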

If you want the full mechanics of where scrape latency comes from, see scraping latency explained, and the raw numbers live at /benchmarks.

Credit math for a large-site pass

The pricing is deliberately simple, which makes large jobs easy to forecast.

| Operation | Credits | Notes |
| --- | --- | --- |
| /v1/map | 1 | Whole-site URL discovery, one call |
| /v1/crawl (http / lightpanda) | 1 per page | Default renderers |
| /v1/crawl (chrome-rendered) | 2 per page | When JS execution is required |

A worked example: map a 40,000-URL site (1 credit), filter to 8,000 product pages, crawl them across eight scoped jobs at the 1000-page cap. At the default http/lightpanda renderer that is 8,000 credits plus the 1 map credit = 8,001 credits. If those pages need JavaScript and fall to the chrome renderer, double the per-page cost to roughly 16,000 credits. The chrome multiplier is the single biggest lever on a large bill — confirm your target actually needs it before defaulting to it.
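The same arithmetic as a small helper, using only the per-operation costs from the table. The function and parameter names are illustrative, and the rates should be re-checked against the live pricing page before you budget with them.

```python
MAP_CREDITS = 1          # per /v1/map call
DEFAULT_PAGE_CREDITS = 1 # http / lightpanda renderers
CHROME_PAGE_CREDITS = 2  # chrome-rendered pages


def pass_cost(default_pages, chrome_pages=0, map_calls=1):
    """Total credits for one large-site pass: map calls plus per-page crawl cost."""
    return (
        map_calls * MAP_CREDITS
        + default_pages * DEFAULT_PAGE_CREDITS
        + chrome_pages * CHROME_PAGE_CREDITS
    )


all_default = pass_cost(8000)                 # 8,000 pages on the default renderers
all_chrome = pass_cost(0, chrome_pages=8000)  # the same pages, all chrome-rendered
mixed = pass_cost(6000, chrome_pages=2000)    # a hypothetical 75/25 split
```

`all_default` comes to 8,001 credits and `all_chrome` to 16,001, matching the worked example; the mixed case (10,001) shows how much a partial chrome fallback moves the bill.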

Self-host for $0 vs managed credits at scale

If your crawl volume is large and recurring, the credit math changes shape entirely. fastCRW's engine is AGPL-3.0 and self-hostable — a single static Rust binary — so you can run unlimited crawls for the cost of your own server, no per-page credits at all. The managed cloud is the right call when you want zero ops and predictable monthly pricing; self-host wins when volume is high enough that the server price beats the credits. Compare live managed tiers on the pricing page rather than hard-coding numbers that move. The API is identical either way, so you can prototype on managed credits and move the heavy recurring crawls to a self-hosted binary later with the same request bodies.

Sources

  • fastCRW endpoint table, renderer selection, and crawl caps: github.com/us/crw (crw-opencore README, verified 2026-05-18)
  • Scrape/latency benchmark of record: diagnose_3way.py on Firecrawl's public scrape-content-dataset-v1, 819 labeled URLs, 2026-05-08 (bench/server-runs/RESULT_3WAY_1000_FULL.md)
  • Live pricing and credit costs: fastcrw.com/pricing

Related: Crawl an entire website from its sitemap · The /v1/map endpoint deep dive · The /v1/crawl endpoint deep dive · Scraping latency explained

Frequently asked questions

How do I crawl a large website efficiently from its sitemap?
Map first, then crawl. Call POST /v1/map once (1 credit) to get the full URL inventory, filter that list locally to the subset you actually want — dropping faceted-search permutations and crawl traps — then run POST /v1/crawl only on that subset with a tight maxPages ceiling. This avoids spending 1 credit per page discovering URLs you do not need.
What is the maximum crawl depth and page count fastCRW allows?
fastCRW caps maxDepth at 10 and maxPages at 1000 per crawl job (crw-opencore README, verified 2026-05-18). limit and max_pages are accepted aliases of maxPages. A site larger than 1000 pages becomes a series of scoped crawls, each seeded from a filtered map slice, rather than one monster job.
Should I map first or crawl directly at scale?
Map first at scale. A blind crawl pays 1 credit per page to discover URLs; a single /v1/map call discovers the whole URL universe for 1 credit so you can filter and forecast cost before committing. Crawling directly is fine for small sites but burns budget on large ones where most discovered URLs turn out to be noise.
How fast is a large-site crawl, and why does the tail latency spike?
On Firecrawl's public 819-labeled-URL dataset (diagnose_3way.py, 2026-05-08), fastCRW's p50 was 1914 ms (below Firecrawl's 2305 ms), but its p90 was 14157 ms, the worst of the three tools tested. The tail is causal: the chrome-stealth fallback that recovers hard pages (and drives the highest truth-recall, 63.74%) is also the slow path. Size timeouts and concurrency for the p90, not the median.
What does crawling thousands of pages cost in credits?
Crawling costs 1 credit per page on the http/lightpanda renderers, or 2 credits per page when a page is chrome-rendered, plus 1 credit for the initial /v1/map call. So 8,000 default-renderer pages cost about 8,001 credits; the same pages requiring JavaScript double to roughly 16,000. For high recurring volume, self-hosting the AGPL-3.0 binary removes per-page credits entirely — you pay only your server.
