URL Mapping vs Sitemap Parsing for Discovery

URL mapping vs sitemap.xml parsing for site discovery: coverage, freshness, and cost. When /v1/map beats a stale sitemap and feeds a crawl for 1 credit.

fastcrw
June 3, 2026 · 8 min read · Last updated: June 2, 2026

By the fastCRW team · Credit and endpoint facts verified 2026-05-18 against the canonical fact sheet · Verify independently before you build on them.

URL mapping vs sitemap parsing: two ways to discover a site's URLs

Before you can crawl or extract a site, you need to know what pages exist. There are two ways to build that URL inventory, and the choice between them shapes both how complete your coverage is and how much discovery costs. The first is to read the site's declared sitemap.xml. The second is active URL mapping: asking a discovery endpoint to enumerate the live URLs it can actually reach. They sound interchangeable. In practice they succeed and fail in different places, and a robust crawler usually uses both.

Parsing sitemap.xml

A sitemap is an XML file (usually at /sitemap.xml or referenced from robots.txt) where the site author declares the URLs they want indexed, often with a lastmod timestamp and a priority hint. Parsing it is cheap and instant: one HTTP fetch, one XML parse, and you have a list. When the sitemap is fresh and complete, it is the single best source of a site's intended URL set, because it is the author telling you directly what to look at.
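As a rough illustration of the "one fetch, one parse" claim, here is a minimal sketch of the parse half using Python's standard library. The function name parse_sitemap and the sample document are made up for this example, and the sketch only handles a plain urlset file, not a sitemap index that points at other sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (loc, lastmod) pairs from a <urlset> sitemap document.

    lastmod is None when the entry does not declare one.
    """
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc, lastmod.strip() if lastmod else None))
    return entries


# Hypothetical sample; a real pipeline would fetch this text over HTTP first.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2026-05-01</lastmod></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>"""

entries = parse_sitemap(SAMPLE)
```

Keeping lastmod alongside each URL is a deliberate choice: it is the only freshness signal the sitemap carries, so it is worth preserving for later comparison against what is actually live.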

Active URL mapping

Active mapping does not trust a declared file. It discovers URLs the way a crawler sees them — following links, expanding from the homepage, and returning the set of URLs that are actually live and reachable right now. fastCRW exposes this as POST /v1/map, which returns a site's URL inventory for 1 credit (canonical fact sheet §3-4, verified 2026-05-18). Mapping costs a little more than reading a static file, but it does not inherit the sitemap's blind spots.

Where sitemap parsing falls short

Sitemaps are a declaration of intent, not a guarantee of reality, and the gap between the two is where discovery pipelines quietly lose pages.

Stale or partial sitemaps

Most sitemaps are generated on a schedule, not on every publish. A page added an hour ago may not appear for a day; a page deleted last week may still be listed. Many CMSs also cap the sitemap at a fixed page count or only include "important" routes, so deep or paginated sections never make it in. If your pipeline trusts the sitemap as ground truth, you silently miss whatever the generator skipped.

Sites with no sitemap at all

Plenty of sites — internal tools, smaller marketing sites, hand-built pages — ship no sitemap whatsoever. A discovery step that depends on sitemap.xml returns an empty list and your crawl never starts. Active mapping degrades gracefully here: with no sitemap to read, it falls back to following links from the entry URL, so you still get a usable inventory.

Sitemaps that omit dynamic routes

Search-result pages, filtered listings, faceted navigation, and JavaScript-generated routes are frequently absent from sitemaps because the author never enumerated them. These are often exactly the pages you want for a data pull. Reading the sitemap alone will never surface them; you have to discover them by traversal.

POST /v1/map: active URL discovery

The map endpoint is fastCRW's first-class discovery primitive. It is a deliberate, separate step rather than a side effect of crawling — you map first to see the shape of a site, then decide what to crawl.

What map returns and how

You send a single URL to POST /v1/map and get back a list of URLs discovered for that site. Because it is synchronous and returns a flat inventory rather than page content, it is fast and cheap to run — the right tool when the question is "what URLs exist here?" rather than "what is on each page?". The endpoint is Firecrawl-compatible, so if you already call Firecrawl's map, the request shape is a drop-in after a base-URL swap (canonical fact sheet §1, §4).
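A map call might be assembled along these lines. Treat this as a sketch under stated assumptions rather than the documented client: the base URL, the Bearer authorization header, the "url" body field, and the "links" response field are all assumptions modelled on the Firecrawl-style request shape mentioned above, so check them against the actual API reference before relying on them.

```python
import json
import urllib.request


def build_map_request(base_url, api_key, site_url):
    """Build (but do not send) a POST /v1/map request.

    Assumptions: JSON body with a "url" field and Bearer-token auth.
    """
    body = json.dumps({"url": site_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/map",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_links(response_json):
    """Pull the flat URL list out of a decoded response.

    Assumption: the inventory lives under a "links" key.
    """
    return list(response_json.get("links", []))


# Placeholder host and key, for illustration only.
request = build_map_request("https://api.example.invalid", "YOUR_API_KEY", "https://example.com")

# Sending it would look like this (left commented out so the sketch stays offline):
# with urllib.request.urlopen(request) as resp:
#     links = extract_links(json.load(resp))
```

Separating request construction from sending keeps the shape easy to inspect and test without spending a credit.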

1 credit per map call

Mapping a site costs 1 credit, flat (canonical fact sheet §3, verified 2026-05-18). That is the same price as a single HTTP scrape and a fraction of a JSON extraction, which costs 5 credits. Discovery is intentionally cheap so that mapping before you crawl is never the expensive part of the pipeline — the crawl that follows is where the page-by-page cost lives.

Coverage beyond the sitemap

Because map discovers reachable URLs rather than reading a declared file, it can surface routes the sitemap omits — recently published pages, dynamic listings, and link-reachable sections a generator skipped. It is not magic: if a page is reachable only behind a login or is never linked from anywhere crawlable, mapping will not invent it. But for the common case of "the sitemap is stale or incomplete," active mapping closes most of the gap.

Map then crawl: the discovery-to-extraction pipeline

Mapping is rarely the end goal. The usual pattern is map to discover, then POST /v1/crawl to extract content from the URLs worth reading. Crawl is an asynchronous breadth-first job that returns a job ID you poll for results (canonical fact sheet §4).

maxDepth and maxPages caps

A crawl is governed by two explicit limits so a discovery pass never runs away: maxDepth (how many link-hops from the entry URL, capped at 10) and maxPages (the total page budget, capped at 1000) — the fields limit and max_pages are accepted aliases of maxPages (canonical fact sheet §4). Setting both is the difference between a bounded job and an open-ended one that burns credits crawling pagination forever.
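One way to make sure both limits are always set is to build the crawl payload through a small helper. This is a sketch, not the service's own client: the "url" field name is an assumption, and whether the server rejects or silently clamps out-of-range values is not stated here, so the helper clamps on the client side to the caps quoted above.

```python
MAX_DEPTH_CAP = 10    # maxDepth cap quoted in this article
MAX_PAGES_CAP = 1000  # maxPages cap quoted in this article


def build_crawl_payload(url, max_depth, max_pages):
    """Return a crawl request body with both limits set and within the caps.

    The "url" key is an assumed field name; maxDepth and maxPages come from the text.
    """
    if max_depth < 0:
        raise ValueError("max_depth must be zero or greater")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return {
        "url": url,
        "maxDepth": min(max_depth, MAX_DEPTH_CAP),
        "maxPages": min(max_pages, MAX_PAGES_CAP),
    }


payload = build_crawl_payload("https://example.com/docs", max_depth=3, max_pages=200)
```

Requiring both arguments, with no defaults, means a caller cannot start a job without deciding on a bound.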

Why map first saves crawl budget

If you crawl blind, you discover and fetch in the same pass, so you only learn the site is 40,000 pages deep after you have already spent credits getting there. If you map first, you see the URL count up front for 1 credit, then crawl a filtered subset — only the /docs tree, say, or only URLs matching a path prefix. Mapping turns "crawl and hope" into "scope, then crawl," which is how you keep the bill predictable. Crawl is billed at 1 credit per page (2 per page when Chrome-rendered), so trimming the page set before you start is the single biggest lever on crawl cost.
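The "scope, then crawl" arithmetic can be sketched in a few lines. The credit figures are the ones quoted in this article; the helper names and the sample URL list are hypothetical.

```python
from urllib.parse import urlsplit

MAP_CREDITS = 1              # one map call
CREDITS_PER_PAGE = 1         # crawl, plain HTTP
CREDITS_PER_PAGE_CHROME = 2  # crawl, Chrome-rendered


def scope_by_prefix(urls, prefix):
    """Keep URLs whose path is the prefix itself or sits beneath it."""
    base = prefix.rstrip("/")
    scoped = []
    for url in urls:
        path = urlsplit(url).path
        if path == base or path.startswith(base + "/"):
            scoped.append(url)
    return scoped


def estimate_credits(page_count, chrome_rendered=False):
    """Estimate one map call plus a crawl of page_count pages."""
    per_page = CREDITS_PER_PAGE_CHROME if chrome_rendered else CREDITS_PER_PAGE
    return MAP_CREDITS + page_count * per_page


# Hypothetical mapped inventory.
mapped = [
    "https://example.com/",
    "https://example.com/docs",
    "https://example.com/docs/install",
    "https://example.com/docs-archive/old",
    "https://example.com/blog/post-1",
]
docs_only = scope_by_prefix(mapped, "/docs")
```

Matching on the path segment boundary, rather than a bare string prefix, keeps a sibling tree such as /docs-archive out of the /docs scope.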

Feeding an llms.txt or RAG index

A mapped URL inventory is also a useful artifact on its own. It is the natural input for generating an llms.txt manifest, seeding a retrieval index, or diffing a site over time to detect new and removed pages. In those cases you may not crawl every URL at all — the map is the deliverable, and the 1-credit cost makes recurring re-mapping for freshness cheap.
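Diffing two inventories is plain set arithmetic. A minimal sketch, assuming each map run has already been saved as a list of URL strings (the function name and sample data are illustrative):

```python
def diff_inventories(previous, current):
    """Compare two URL inventories and report what appeared and what vanished."""
    prev, curr = set(previous), set(current)
    return {
        "added": sorted(curr - prev),
        "removed": sorted(prev - curr),
    }


# Hypothetical inventories from two map runs.
last_week = ["https://example.com/", "https://example.com/pricing", "https://example.com/old-page"]
today = ["https://example.com/", "https://example.com/pricing", "https://example.com/changelog"]

changes = diff_inventories(last_week, today)
```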

Choosing map vs sitemap by goal

The decision is not "one is better" — it is which failure mode you can tolerate for the job at hand.

Freshness vs completeness

If you need the most current view of what is live right now — newly published pages, dynamic routes — active mapping wins, because it discovers reality rather than a declaration. If you need exactly the author's intended index and the sitemap is well maintained, reading the sitemap is faster and free of traversal noise. The strongest pipelines do both: read the sitemap for the declared set, then map to catch what the sitemap missed, and union the two.
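The union step is worth a small sketch, because the two sources rarely spell a URL identically. The normalization rules below (lowercase scheme and host, drop the fragment, drop a trailing slash) are assumptions chosen for illustration; some sites treat /docs and /docs/ as different pages, so adjust to the site at hand.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Canonicalize a URL so sitemap and map entries compare equal.

    Assumed rules: lowercase scheme and host, no fragment, no trailing slash.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def union_inventories(sitemap_urls, mapped_urls):
    """Merge both sources, recording which source(s) reported each URL."""
    merged = {}
    for source, urls in (("sitemap", sitemap_urls), ("map", mapped_urls)):
        for url in urls:
            merged.setdefault(normalize(url), set()).add(source)
    return merged


# Hypothetical inputs.
merged = union_inventories(
    ["https://Example.com/docs/", "https://example.com/retired"],
    ["https://example.com/docs#intro", "https://example.com/search?q=a"],
)
map_only = sorted(u for u, sources in merged.items() if sources == {"map"})
```

Tracking the source per URL makes the gap visible: entries seen only by map are what the sitemap missed, and entries seen only in the sitemap are candidates for stale or unlinked pages.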

Decision checklist

| Your situation | Use |
| --- | --- |
| Site has a fresh, complete sitemap and you want the author's index | Parse sitemap.xml (free, instant) |
| Sitemap is stale, partial, or capped | /v1/map to catch the gap |
| Site has no sitemap at all | /v1/map (falls back to link traversal) |
| You need dynamic, filtered, or recently added routes | /v1/map |
| You want maximum coverage and can spend 1 credit | Both: union sitemap + map |
| You plan to extract content from many pages next | Map first, then /v1/crawl with maxDepth/maxPages |

One non-negotiable across both paths: robots.txt is respected by default, and it should only be overridden when you have the legal right to do so (canonical fact sheet §9). Discovery does not exempt you from a site's crawl directives.
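If you parse sitemap.xml yourself and then fetch those URLs with your own client, the same directives apply to you. A minimal sketch of a client-side check using Python's standard urllib.robotparser follows; the helper name, robots.txt text, and URLs are hypothetical, and this says nothing about how the service enforces robots.txt internally.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="*"):
    """Filter a URL list down to what robots.txt permits for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt and candidate URLs.
ROBOTS = "User-agent: *\nDisallow: /private/\n"
candidates = [
    "https://example.com/docs",
    "https://example.com/private/report",
]
permitted = allowed_urls(ROBOTS, candidates)
```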

The short version

Sitemap parsing is the cheapest, fastest source of a site's declared URLs — and it is only as good as the generator that wrote the file. Active URL mapping via POST /v1/map discovers what is actually reachable for 1 credit, catching the stale, missing, and dynamic routes a sitemap omits. Use the sitemap when it is fresh and you trust it; reach for map when freshness or completeness matters; and map before you crawl so that /v1/crawl, bounded by maxDepth and maxPages, only spends credits on the pages you actually want.

Sources

  • fastCRW canonical fact sheet — credit costs (§3), endpoint surface (§4: /v1/map, /v1/crawl, maxDepth cap 10 / maxPages cap 1000), and honest gaps (§9: robots.txt respected by default). Verified 2026-05-18.
  • fastCRW open-core README endpoint table: github.com/us/crw
  • Pricing reference: /pricing

Frequently asked questions

What is the difference between URL mapping and parsing a sitemap?
Parsing a sitemap reads the XML file a site author declares (usually at /sitemap.xml) — a cheap, instant list of intended URLs that is only as fresh and complete as the generator that wrote it. URL mapping actively discovers the URLs that are reachable on the live site by following links from an entry point. fastCRW exposes active mapping as POST /v1/map for 1 credit. Sitemaps give you the author's declared index; mapping gives you what is actually there now.
Does the map endpoint find URLs not in the sitemap?
Often, yes. Because /v1/map discovers reachable URLs rather than reading a declared file, it can surface recently published pages, dynamic and filtered routes, and link-reachable sections a sitemap generator skipped or capped. It is not unlimited — pages reachable only behind a login or never linked from a crawlable page will not appear — but for the common case of a stale or partial sitemap, mapping closes most of the coverage gap.
How much does mapping a site cost in credits?
A single POST /v1/map call costs 1 credit (fastCRW canonical fact sheet §3, verified 2026-05-18). That is the same price as one HTTP scrape and a fraction of a JSON extraction, which is 5 credits. Discovery is intentionally cheap so that mapping before a crawl is never the expensive step.
Should I map a site before crawling it?
Usually yes. Mapping first shows you the URL count and shape for 1 credit before you commit to fetching pages, so you can crawl a filtered subset instead of crawling blind. Since /v1/crawl bills 1 credit per page (2 per page when Chrome-rendered), trimming the page set up front is the biggest lever on crawl cost. Map to scope, then crawl what is worth reading.
What are the crawl depth and page limits?
POST /v1/crawl accepts maxDepth (link-hops from the entry URL, capped at 10) and maxPages (total page budget, capped at 1000); limit and max_pages are accepted aliases of maxPages (fastCRW canonical fact sheet §4). Setting both keeps a discovery-to-extraction pass bounded so it never runs away crawling pagination indefinitely.