By the fastCRW team · Last reviewed 2026-05-18
Disclosure: fastCRW is a Firecrawl-compatible scraper built by the author. This is an endpoint-level engineering post; every code sample also runs against fastCRW by changing the base URL.
What /scrape is for
Scrape is the workhorse: one URL in, clean machine-usable content out. It's the endpoint behind RAG ingestion, "read this page" agent tools, content monitoring, and enrichment. If you understand scrape deeply, crawl and search mostly fall out of it: crawl is scrape applied across a discovered URL set, and search-then-scrape feeds search results into scrape.
The request anatomy
POST /v1/scrape
Content-Type: application/json
Authorization: Bearer YOUR_KEY
{
"url": "https://example.com/pricing",
"formats": ["markdown"]
}
The decisive field is formats. It is not cosmetic — it changes what work the engine does and, on metered services, what it costs:
- markdown: cleaned, LLM-ready text with structure preserved (headings, lists, links). The default choice for RAG and agent context. Smallest, cheapest, most token-efficient.
- html: the cleaned/rendered DOM. Use when you need to run your own selectors or preserve exact structure.
- rawHtml: pre-clean HTML. Rarely needed; useful for debugging extraction or capturing things the cleaner strips.
- JSON / structured: page → JSON against a schema or natural-language instruction. This is "extraction," and it's the field that matters most for billing (see below).
Rule of thumb: request the narrowest format that satisfies the consumer. Asking for markdown + html + rawHtml + json "just in case" multiplies work and, on some plans, cost.
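As a minimal sketch of the request above, the following builds (but does not send) the same POST using only the standard library. The helper name `build_scrape_request` is hypothetical, and the default of a single `markdown` format simply encodes the rule of thumb; neither comes from an SDK.

```python
import json
import urllib.request


def build_scrape_request(base_url, api_key, url, formats=("markdown",)):
    """Build a POST /v1/scrape request without sending it (hypothetical helper)."""
    body = json.dumps({"url": url, "formats": list(formats)}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/scrape",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


req = build_scrape_request(
    "https://api.firecrawl.dev", "YOUR_KEY", "https://example.com/pricing"
)
# Sending is left to the caller, e.g. urllib.request.urlopen(req, timeout=30)
```

Keeping the format list an explicit argument makes it harder to request extra formats by accident.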
Markdown vs HTML vs JSON: choosing correctly
| Consumer | Best format | Why |
|---|---|---|
| RAG / vector store | markdown | Token-efficient, preserves semantic structure, no parsing needed |
| LLM "read this page" tool | markdown | Fits context windows, model handles structure natively |
| Field extraction (price, author, SKU) | JSON + schema | Deterministic shape, no post-parsing, validate-able |
| Custom DOM selectors / tables | html | You need elements, not prose |
| Debugging the cleaner | rawHtml | See what was stripped |
JavaScript rendering
Modern sites render content client-side, so a scraper that only fetches static HTML returns empty shells for SPAs. Both Firecrawl and fastCRW render JavaScript so the markdown reflects what a browser would see. Two engineering implications:
- Latency: rendered pages are slower than static fetches. Budget p95, not p50, for JS-heavy targets.
- Cost: on some plans render-heavy modes can bill more than the headline 1 credit/page — verify against response credit metadata. fastCRW handles JS rendering within the standard scrape path and credit; there is no separate render tier or multiplier.
You generally don't toggle a browser flag manually for the common case — the engine decides when rendering is needed. Where you do control it, prefer the lightest mode that returns complete content.
Structured extraction: the field that drives the bill
Requesting a JSON format with a schema turns scrape into structured extraction:
{
"url": "https://shop.example.com/item/42",
"formats": ["json"],
"jsonOptions": {
"schema": {
"type": "object",
"properties": {
"name": { "type": "string" },
"price": { "type": "number" },
"in_stock": { "type": "boolean" }
},
"required": ["name", "price"]
}
}
}
This is where the two products diverge economically, and it's the single most important thing to know before adopting Firecrawl heavily. Firecrawl's AI extraction is widely reported to run on a separate token-based subscription stacked on top of your credit plan — a Standard plan plus extraction reportedly totals ~$172–188/mo minimum. fastCRW folds JSON extraction into the same scrape call under the same 1-credit-per-page model — no second subscription. If your pipeline extracts structured data on most pages (typical for enrichment and agents), this difference can roughly halve the bill.
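Whichever backend performs the extraction, it can help to check the returned object against the schema you sent before trusting it downstream. The sketch below is a deliberately small validator covering only the three types used in the example schema; `check_extraction` is a hypothetical helper, not part of any SDK, and a full JSON Schema library would be the sturdier choice in production.

```python
SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["name", "price"],
}

# Only the JSON types used in the example schema are mapped here.
_JSON_TYPES = {"string": str, "number": (int, float), "boolean": bool}


def check_extraction(record: dict, schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    for name in schema.get("required", []):
        if name not in record:
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name not in record:
            continue
        expected = _JSON_TYPES.get(spec.get("type"))
        if expected is None:
            continue  # type not covered by this sketch
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly for "number".
        is_bool_as_number = spec["type"] == "number" and isinstance(value, bool)
        if is_bool_as_number or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems
```

A record that fails this check is better treated as a soft miss than written into an enrichment table.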
The response and defensive parsing
{
"success": true,
"data": {
"markdown": "# Pricing\n\n...",
"metadata": {
"title": "Pricing",
"sourceURL": "https://example.com/pricing",
"statusCode": 200
}
}
}
Write the client to read the specific fields it consumes (data.markdown, data.metadata.statusCode) rather than asserting on the whole envelope. Metadata extras and the error-envelope shape can differ between compatible engines; field-level reads survive that, payload snapshots don't.
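A field-level reader along those lines might look like the sketch below. The function name and the returned keys are illustrative assumptions; the fields it reads are the ones shown in the sample response.

```python
def read_scrape_result(payload: dict) -> dict:
    """Pull out only the fields the client consumes, tolerating missing or extra keys."""
    data = payload.get("data") or {}
    metadata = data.get("metadata") or {}
    return {
        "ok": bool(payload.get("success")),
        "markdown": data.get("markdown") or "",
        "status_code": metadata.get("statusCode"),
        "source_url": metadata.get("sourceURL"),
    }


sample = {
    "success": True,
    "data": {
        "markdown": "# Pricing\n\n...",
        "metadata": {
            "title": "Pricing",
            "sourceURL": "https://example.com/pricing",
            "statusCode": 200,
            "someVendorExtra": "ignored",  # extra metadata does not break the reader
        },
    },
}
result = read_scrape_result(sample)
```

Because unknown keys are ignored and missing keys fall back to neutral values, the same reader can sit in front of either backend.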
Error handling that survives a backend swap
Classify by HTTP status first, payload second:
- 2xx, empty content: often a render gap or a consent/anti-bot wall. Retry once with rendering; then treat as a soft miss.
- 4xx: bad URL, auth, or quota/concurrency. Do not blind-retry 401/402/429 — fix the cause or back off.
- 5xx / timeout: transient. Exponential backoff with a cap and a max-attempts ceiling.
This logic is portable across any Firecrawl-compatible backend because it keys on status codes, not vendor-specific error JSON.
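The classification above can be sketched as a pure function plus a capped backoff schedule. The action labels, the function names, and the default delays are assumptions chosen for illustration; treating any status outside the three listed classes as non-retryable is also an assumption.

```python
from typing import Optional


def classify(status: Optional[int], has_content: bool) -> str:
    """Map an HTTP status (None stands for a timeout) to a next action."""
    if status is None or status >= 500:
        return "retry_with_backoff"  # transient: 5xx or timeout
    if 200 <= status < 300:
        return "ok" if has_content else "retry_rendered_once"
    if status in (401, 402, 429):
        return "fix_or_back_off"  # auth, quota, or concurrency
    return "do_not_retry"  # other 4xx and, by assumption, anything else


def backoff_delays(max_attempts: int = 4, base: float = 0.5, cap: float = 8.0) -> list:
    """Seconds to wait before each retry: exponential growth, capped, bounded in count."""
    return [min(cap, base * 2 ** i) for i in range(max_attempts - 1)]
```

In practice you would usually add random jitter to each delay so that many workers do not retry in lockstep; it is omitted here to keep the schedule easy to read.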
The same call, two backends
from firecrawl import FirecrawlApp
# Firecrawl Cloud
fc = FirecrawlApp(api_key="fc-...", api_url="https://api.firecrawl.dev")
# fastCRW — self-hosted single ~6MB binary, or managed cloud
crw = FirecrawlApp(api_key="key", api_url="https://your-fastcrw-host")
params = {"formats": ["markdown"]}
a = fc.scrape_url("https://example.com", params=params)
b = crw.scrape_url("https://example.com", params=params)
# same SDK, same method, same shape — diff the markdown to validate parity
That symmetry is the entire point of API compatibility: the scrape endpoint you've already coded against keeps working when you change where it points — to a managed cloud or to an open-core engine you run yourself.
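One rough way to run the parity check is to normalize whitespace on both outputs and compute a similarity ratio. The sketch below uses the standard library's `difflib`; the function names are hypothetical, and whatever threshold you accept as "parity" is a judgment call to tune on your own pages.

```python
import difflib
import re


def normalize(markdown: str) -> str:
    """Collapse every whitespace run to a single space so layout noise is ignored."""
    return re.sub(r"\s+", " ", markdown).strip()


def parity_ratio(a_markdown: str, b_markdown: str) -> float:
    """Similarity in [0, 1] between two markdown outputs after normalization."""
    return difflib.SequenceMatcher(
        None, normalize(a_markdown), normalize(b_markdown)
    ).ratio()
```

A ratio of 1.0 means the two backends agree after whitespace normalization; lower values are a prompt to diff the outputs by hand.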
Scrape performance notes
fastCRW's engine is a single Rust binary engineered for speed and a tiny footprint (~6MB binary), which tends to help scrape latency and per-page cost at volume versus a heavier hosted stack. Treat speed as qualitative unless you've run a dated benchmark on your own URL mix — the honest claim is "open-core Rust scraper, local-first, Firecrawl-compatible," and you should measure p50/p95 on your traffic before quoting numbers internally.
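If you do measure, a small summary over your own latency samples is enough to start. This sketch assumes you have already collected per-request durations in milliseconds; `latency_summary` is a hypothetical helper, and it needs at least two samples.

```python
import statistics


def latency_summary(samples_ms):
    """Return p50 and p95 from a list of per-request latencies in milliseconds."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile, index 94 the 95th.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}
```

Record the date and the URL mix alongside the numbers so that later comparisons are like for like.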
The scrape failure modes you must design for
Most scrape "bugs" are not bugs — they are predictable site behaviors a robust client anticipates. The catalog worth coding against, regardless of backend:
- The consent/cookie wall. The page returns 200 but the content is a GDPR banner, not the article. Detect by suspiciously short markdown plus boilerplate keywords; retry with rendering, then mark as a soft miss rather than indexing the banner.
- The skeleton SPA. Static fetch returns an empty app shell; content arrives via XHR after hydration. This is exactly why JS rendering exists — confirm your scrape path renders for these targets and budget the extra latency.
- The soft 404. The site returns 200 with a "not found" page instead of a real 404. Status-code logic alone misses this; add a content heuristic for known soft-404 markers on sites that do it.
- The rate-limit redirect. Under load the target redirects to a challenge or throttle page. You will see content drift, not an error. Sample-monitor content length over time to catch it.
- The truncated giant. Extremely long pages may be capped. If completeness matters, assert on the presence of an expected end-of-content marker rather than assuming the full document came back.
None of these are vendor-specific; they are properties of the open web. A scrape client that treats "200 with content present and plausible" as success — rather than just "200" — survives them. Build that check once, in the adapter, and every downstream consumer inherits it.
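A plausibility check of that kind could be sketched as follows. The keyword lists, the 200-character threshold, and the verdict labels are illustrative assumptions that need tuning per domain; nothing here is specific to either backend.

```python
# Assumed marker lists; real ones should be tuned per domain.
CONSENT_KEYWORDS = ("cookie", "consent", "accept all")
SOFT_404_MARKERS = ("page not found", "doesn't exist", "no longer available")


def assess(status, markdown, min_chars=200, end_marker=None):
    """Return a verdict for a scrape result instead of trusting the status code alone."""
    text = (markdown or "").strip().lower()
    if status != 200:
        return "http_error"
    if not text:
        return "empty"
    is_short = len(text) < min_chars
    if is_short and any(k in text for k in CONSENT_KEYWORDS):
        return "consent_wall"
    if is_short and any(m in text for m in SOFT_404_MARKERS):
        return "soft_404"
    # Only checked when the caller knows what the end of the page should contain.
    if end_marker and end_marker.lower() not in text:
        return "truncated"
    return "ok"
```

The rate-limit redirect is not covered by a single-response check like this one; it needs content length tracked over time.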
Output hygiene before the LLM sees it
Clean markdown from the engine still benefits from a thin normalization pass before it enters a context window or vector store:
- Collapse whitespace runs so chunk boundaries are stable across re-scrapes and engines (this is why parity checks should normalize whitespace).
- Strip residual nav/footer boilerplate that survives on unusual layouts — a short per-domain denylist of repeated lines pays for itself in token savings.
- Preserve and record sourceURL with every chunk for provenance and incremental refresh.
- Hash the normalized content so you can skip re-embedding unchanged pages on the next run; this is the single biggest recurring-cost lever in any ingestion pipeline.
This pass is identical regardless of which Firecrawl-compatible backend produced the markdown, which is the point: standardize the scrape contract and your post-processing becomes a stable, backend-neutral asset rather than something you re-tune per vendor.
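The normalization pass described above can be sketched in a few lines. The function names and the record shape are assumptions. As a design choice, this version trims trailing whitespace and collapses runs of blank lines but leaves leading indentation alone, since indentation carries meaning in markdown.

```python
import hashlib


def normalize_markdown(markdown, denylist=frozenset()):
    """Trim trailing whitespace, drop denylisted lines, and collapse blank-line runs."""
    lines = []
    for raw in markdown.splitlines():
        line = raw.rstrip()
        if line.strip() in denylist and line.strip():
            continue  # repeated nav/footer boilerplate for this domain
        if not line and (not lines or not lines[-1]):
            continue  # skip leading blanks and repeated blank lines
        lines.append(line)
    return "\n".join(lines).strip()


def chunk_record(markdown, source_url, denylist=frozenset()):
    """Bundle normalized text with its provenance and a content hash."""
    text = normalize_markdown(markdown, denylist)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {"sourceURL": source_url, "text": text, "sha256": digest}
```

Comparing `sha256` against the value stored from the previous run tells you whether a page needs re-embedding.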
Why scrape economics decide the architecture
Scrape is the highest-frequency call in almost every pipeline, so its unit economics dominate the bill more than any other endpoint. Two levers move those economics structurally: whether structured extraction is in-credit or a separate subscription (on fastCRW it is in-credit, under the same 1-credit-per-page model), and whether you can remove the per-call meter entirely by self-hosting the same engine. Because fastCRW's scrape path is the same single ~6MB AGPL-3.0 Rust binary whether self-hosted or managed, the scrape code you write today can run metered in the cloud during prototyping and unmetered on your own infrastructure at scale, decided by one config value. Designing the scrape layer to be backend-neutral is therefore not fastidiousness; it is the cheapest insurance you can buy against the line item most likely to balloon.
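The "one config value" can be as simple as an environment variable read at startup. In this sketch the variable names `SCRAPE_API_URL` and `SCRAPE_API_KEY` and the helper name are hypothetical; the default URL is the Firecrawl Cloud address used in the SDK example earlier.

```python
import os


def scrape_backend(env=os.environ):
    """Resolve which Firecrawl-compatible backend to call from configuration."""
    return {
        # Hypothetical variable names; point the URL at a self-hosted engine to switch.
        "api_url": env.get("SCRAPE_API_URL", "https://api.firecrawl.dev"),
        "api_key": env.get("SCRAPE_API_KEY", ""),
    }


settings = scrape_backend({"SCRAPE_API_URL": "https://your-fastcrw-host", "SCRAPE_API_KEY": "key"})
```

The resulting values can be passed straight to the `FirecrawlApp(api_key=..., api_url=...)` constructor shown above, so no call site changes when the backend does.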
Sources
- Firecrawl scrape docs: docs.firecrawl.dev
- fastCRW repo: github.com/us/crw
Related: Firecrawl /crawl deep dive · Firecrawl API compatibility