By the fastCRW team · Credit and endpoint facts verified 2026-05-18 against the canonical fact sheet · Verify independently before you build on them.
URL mapping vs sitemap parsing: two ways to discover a site's URLs
Before you can crawl or extract a site, you need to know what pages exist. There are two ways to build that URL inventory, and the choice between them shapes both how complete your coverage is and how much it costs. The first way is to read the site's declared sitemap.xml. The second is active URL mapping: asking a discovery endpoint to enumerate the live URLs it can actually reach. They sound interchangeable. In practice they fail and succeed in different places, and a robust crawler usually uses both.
Parsing sitemap.xml
A sitemap is an XML file (usually at /sitemap.xml or referenced from robots.txt) where the site author declares the URLs they want indexed, often with a lastmod timestamp and a priority hint. Parsing it is cheap and instant: one HTTP fetch, one XML parse, and you have a list. When the sitemap is fresh and complete, it is the single best source of a site's intended URL set, because it is the author telling you directly what to look at.
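The "one XML parse" step can be sketched with Python's standard library. This is a minimal illustration that assumes a plain urlset document following the sitemaps.org schema; it does not handle sitemap index files or gzip-compressed sitemaps, and the sample XML is made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> list[dict]:
    """Return one {"loc", "lastmod"} dict per <url> entry in a sitemap."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        # lastmod is optional in the protocol, so it may be None.
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append({"loc": loc, "lastmod": lastmod})
    return entries


# Illustrative sample, not taken from a real site.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2026-05-01</lastmod></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>"""

entries = parse_sitemap(SAMPLE)
for entry in entries:
    print(entry["loc"], entry["lastmod"])
```

Keeping lastmod alongside each URL is a deliberate choice here: it lets a later step judge how stale the declared set might be.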
Active URL mapping
Active mapping does not trust a declared file. It discovers URLs the way a crawler sees them — following links, expanding from the homepage, and returning the set of URLs that are actually live and reachable right now. fastCRW exposes this as POST /v1/map, which returns a site's URL inventory for 1 credit (canonical fact sheet §3-4, verified 2026-05-18). Mapping costs a little more than reading a static file, but it does not inherit the sitemap's blind spots.
Where sitemap parsing falls short
Sitemaps are a declaration of intent, not a guarantee of reality, and the gap between the two is where discovery pipelines quietly lose pages.
Stale or partial sitemaps
Most sitemaps are generated on a schedule, not on every publish. A page added an hour ago may not appear for a day; a page deleted last week may still be listed. Many CMSs also cap the sitemap at a fixed page count or only include "important" routes, so deep or paginated sections never make it in. If your pipeline trusts the sitemap as ground truth, you silently miss whatever the generator skipped.
Sites with no sitemap at all
Plenty of sites — internal tools, smaller marketing sites, hand-built pages — ship no sitemap whatsoever. A discovery step that depends on sitemap.xml returns an empty list and your crawl never starts. Active mapping degrades gracefully here: with no sitemap to read, it falls back to following links from the entry URL, so you still get a usable inventory.
Sitemaps that omit dynamic routes
Search-result pages, filtered listings, faceted navigation, and JavaScript-generated routes are frequently absent from sitemaps because the author never enumerated them. These are often exactly the pages you want for a data pull. Reading the sitemap alone will never surface them; you have to discover them by traversal.
POST /v1/map: active URL discovery
The map endpoint is fastCRW's first-class discovery primitive. It is a deliberate, separate step rather than a side effect of crawling — you map first to see the shape of a site, then decide what to crawl.
What map returns and how
You send a single URL to POST /v1/map and get back a list of URLs discovered for that site. Because it is synchronous and returns a flat inventory rather than page content, it is fast and cheap to run — the right tool when the question is "what URLs exist here?" rather than "what is on each page?". The endpoint is Firecrawl-compatible, so if you already call Firecrawl's map, the request shape is a drop-in after a base-URL swap (canonical fact sheet §1, §4).
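A rough sketch of what such a call might look like from Python follows. Several details are assumptions rather than confirmed facts: the base URL is a placeholder, Bearer-token authentication is assumed, and the "url" request field and "links" response key are inferred from the Firecrawl-compatibility claim. Check the endpoint reference before relying on any of them. The example only builds the request and does not send it.

```python
import json
import urllib.request


def build_map_request(base_url: str, api_key: str, site_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/map request.

    Assumptions: JSON body with a "url" field and Bearer auth.
    """
    body = json.dumps({"url": site_url}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/map",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def extract_links(response_json: dict) -> list[str]:
    """Pull the URL list out of a decoded response.

    Assumption: the inventory is returned under a "links" key.
    """
    return list(response_json.get("links", []))


# Placeholder base URL and key; substitute your own deployment's values.
req = build_map_request("https://api.example.invalid", "YOUR_API_KEY", "https://example.com")
print(req.get_method(), req.full_url)
```

To send it, pass the request to urllib.request.urlopen, decode the JSON body, and hand the result to extract_links.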
1 credit per map call
Mapping a site costs 1 credit, flat (canonical fact sheet §3, verified 2026-05-18). That is the same price as a single HTTP scrape and a fraction of a JSON extraction, which costs 5 credits. Discovery is intentionally cheap so that mapping before you crawl is never the expensive part of the pipeline — the crawl that follows is where the page-by-page cost lives.
Coverage beyond the sitemap
Because map discovers reachable URLs rather than reading a declared file, it can surface routes the sitemap omits — recently published pages, dynamic listings, and link-reachable sections a generator skipped. It is not magic: if a page is reachable only behind a login or is never linked from anywhere crawlable, mapping will not invent it. But for the common case of "the sitemap is stale or incomplete," active mapping closes most of the gap.
Map then crawl: the discovery-to-extraction pipeline
Mapping is rarely the end goal. The usual pattern is map to discover, then POST /v1/crawl to extract content from the URLs worth reading. Crawl is an asynchronous breadth-first job that returns a job ID you poll for results (canonical fact sheet §4).
maxDepth and maxPages caps
A crawl is governed by two explicit limits so a discovery pass never runs away: maxDepth (how many link-hops from the entry URL, capped at 10) and maxPages (the total page budget, capped at 1000). The fields limit and max_pages are accepted aliases of maxPages (canonical fact sheet §4). Setting both is the difference between a bounded job and an open-ended one that burns credits crawling pagination forever.
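One way to make sure both limits are always set is to build the crawl payload through a small helper that clamps to the documented caps. The maxDepth and maxPages names and their caps come from the text above; the "url" field and the helper itself are hypothetical, shown only to illustrate the idea.

```python
# Caps as stated in the article (canonical fact sheet §4).
MAX_DEPTH_CAP = 10
MAX_PAGES_CAP = 1000


def build_crawl_payload(url: str, max_depth: int, max_pages: int) -> dict:
    """Return a crawl request body with both limits set and clamped.

    Assumption: the entry URL is sent under a "url" field.
    """
    if max_depth < 0 or max_pages < 1:
        raise ValueError("max_depth must be >= 0 and max_pages must be >= 1")
    return {
        "url": url,
        "maxDepth": min(max_depth, MAX_DEPTH_CAP),
        "maxPages": min(max_pages, MAX_PAGES_CAP),
    }


payload = build_crawl_payload("https://example.com/docs", max_depth=25, max_pages=5000)
print(payload)
```

Clamping on the client is a convenience, not a substitute for the server-side caps: it just makes the effective limits visible before the job is submitted.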
Why map first saves crawl budget
If you crawl blind, you discover and fetch in the same pass, so you only learn the site is 40,000 pages deep after you have already spent credits getting there. If you map first, you see the URL count up front for 1 credit, then crawl a filtered subset — only the /docs tree, say, or only URLs matching a path prefix. Mapping turns "crawl and hope" into "scope, then crawl," which is how you keep the bill predictable. Crawl is billed at 1 credit per page (2 per page when Chrome-rendered), so trimming the page set before you start is the single biggest lever on crawl cost.
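The "scope, then crawl" arithmetic can be sketched as follows. The credit rates are the ones quoted in this article; the helper names and the sample URL list are made up for illustration.

```python
from urllib.parse import urlsplit

# Rates quoted in the article; verify against current pricing.
MAP_CREDITS = 1
CRAWL_CREDITS_PER_PAGE = 1
CRAWL_CREDITS_PER_RENDERED_PAGE = 2


def filter_by_prefix(urls: list[str], prefix: str) -> list[str]:
    """Keep only URLs whose path starts with the given prefix."""
    return [u for u in urls if urlsplit(u).path.startswith(prefix)]


def estimate_credits(page_count: int, chrome_rendered: bool = False,
                     mapped_first: bool = True) -> int:
    """Estimate total credits for an optional map call plus a crawl."""
    per_page = CRAWL_CREDITS_PER_RENDERED_PAGE if chrome_rendered else CRAWL_CREDITS_PER_PAGE
    return (MAP_CREDITS if mapped_first else 0) + page_count * per_page


# Hypothetical map result.
mapped = [
    "https://example.com/",
    "https://example.com/docs/install",
    "https://example.com/docs/api",
    "https://example.com/blog/launch",
]

docs_only = filter_by_prefix(mapped, "/docs/")
print(len(docs_only), "pages,", estimate_credits(len(docs_only)), "credits")
```

Filtering on the parsed path rather than the raw string avoids accidental matches against the hostname or query string.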
Feeding an llms.txt or RAG index
A mapped URL inventory is also a useful artifact on its own. It is the natural input for generating an llms.txt manifest, seeding a retrieval index, or diffing a site over time to detect new and removed pages. In those cases you may not crawl every URL at all — the map is the deliverable, and the 1-credit cost makes recurring re-mapping for freshness cheap.
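Diffing two inventories over time reduces to a pair of set differences. A minimal sketch, assuming each map run has been saved as a plain list of URL strings (the sample lists are invented):

```python
def diff_inventories(previous: list[str], current: list[str]) -> dict:
    """Report URLs that appeared or disappeared between two map runs."""
    prev, curr = set(previous), set(current)
    return {
        "added": sorted(curr - prev),
        "removed": sorted(prev - curr),
    }


# Hypothetical inventories from two map runs.
last_week = ["https://example.com/", "https://example.com/pricing", "https://example.com/old-page"]
today = ["https://example.com/", "https://example.com/pricing", "https://example.com/changelog"]

changes = diff_inventories(last_week, today)
print(changes)
```

The result is sorted so that successive reports are stable and easy to compare.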
Choosing map vs sitemap by goal
The decision is not "one is better" — it is which failure mode you can tolerate for the job at hand.
Freshness vs completeness
If you need the most current view of what is live right now — newly published pages, dynamic routes — active mapping wins, because it discovers reality rather than a declaration. If you need exactly the author's intended index and the sitemap is well maintained, reading the sitemap is faster and free of traversal noise. The strongest pipelines do both: read the sitemap for the declared set, then map to catch what the sitemap missed, and union the two.
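The union step is worth normalizing first, since the same page often appears in both sources with cosmetic differences. The sketch below lowercases the scheme and host, drops fragments, and strips trailing slashes. That last rule is an assumption that suits many sites but not all, so adjust it if a site treats /docs and /docs/ as different pages.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Canonicalize a URL so equivalent forms compare equal."""
    parts = urlsplit(url)
    # Assumption: a trailing slash does not change which page is served.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def union_inventories(sitemap_urls: list[str], mapped_urls: list[str]) -> dict:
    """Merge both sources and report what each one uniquely contributed."""
    sitemap_set = {normalize(u) for u in sitemap_urls}
    mapped_set = {normalize(u) for u in mapped_urls}
    return {
        "all": sorted(sitemap_set | mapped_set),
        "only_in_map": sorted(mapped_set - sitemap_set),
        "only_in_sitemap": sorted(sitemap_set - mapped_set),
    }


# Hypothetical inputs.
from_sitemap = ["https://Example.com/", "https://example.com/docs/", "https://example.com/old"]
from_map = ["https://example.com", "https://example.com/docs", "https://example.com/search?q=a#top"]

merged = union_inventories(from_sitemap, from_map)
print(merged["only_in_map"], merged["only_in_sitemap"])
```

The only_in_sitemap bucket deserves a second look: those URLs are declared but were not reached by traversal, so they may be deleted pages or pages that nothing links to.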
Decision checklist
| Your situation | Use |
|---|---|
| Site has a fresh, complete sitemap and you want the author's index | Parse sitemap.xml (free, instant) |
| Sitemap is stale, partial, or capped | /v1/map to catch the gap |
| Site has no sitemap at all | /v1/map (falls back to link traversal) |
| You need dynamic, filtered, or recently added routes | /v1/map |
| You want maximum coverage and can spend 1 credit | Both: union sitemap + map |
| You plan to extract content from many pages next | Map first, then /v1/crawl with maxDepth/maxPages |
One non-negotiable across both paths: robots.txt is respected by default, and it should only be overridden when you have the legal right to do so (canonical fact sheet §9). Discovery does not exempt you from a site's crawl directives.
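If you assemble a URL list yourself (from a sitemap, say) and want to apply the same directives before fetching, Python's standard urllib.robotparser can do the check locally. This is a client-side sketch only, not a description of how fastCRW enforces robots.txt; the user-agent string and the sample file are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content, not from a real site.
ROBOTS_TXT = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, urls: list[str], user_agent: str) -> list[str]:
    """Keep only the URLs this user agent is permitted to fetch."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


parser = load_robots(ROBOTS_TXT)
candidates = ["https://example.com/docs", "https://example.com/admin/settings"]

allowed = allowed_urls(parser, candidates, user_agent="my-crawler")
sitemaps = parser.site_maps()  # Sitemap: lines, available in Python 3.8+
print(allowed, sitemaps)
```

The same parser also exposes any Sitemap: lines, which covers the case where the sitemap is referenced from robots.txt rather than sitting at /sitemap.xml.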
The short version
Sitemap parsing is the cheapest, fastest source of a site's declared URLs — and it is only as good as the generator that wrote the file. Active URL mapping via POST /v1/map discovers what is actually reachable for 1 credit, catching the stale, missing, and dynamic routes a sitemap omits. Use the sitemap when it is fresh and you trust it; reach for map when freshness or completeness matters; and map before you crawl so that /v1/crawl, bounded by maxDepth and maxPages, only spends credits on the pages you actually want.
Sources
- fastCRW canonical fact sheet: credit costs (§3), endpoint surface (§4: /v1/map, /v1/crawl, maxDepth cap 10 / maxPages cap 1000), and honest gaps (§9: robots.txt respected by default). Verified 2026-05-18.
- fastCRW open-core README endpoint table: github.com/us/crw
- Pricing reference: /pricing
