Web Scraping Accuracy Benchmark: 63.74% vs 56.04%

Web scraping accuracy benchmark: fastCRW returns 63.74% of labeled ground-truth content vs Firecrawl's 56.04% on one shared, public dataset — a 7.70-point gap.

By Recep · June 21, 2026 · 9 min read · Last updated: June 2, 2026

By the fastCRW team · Benchmark numbers from diagnose_3way.py on Firecrawl's public scrape-content dataset, 819 labeled URLs, single run 2026-05-08 · Pricing/features verified 2026-05-18 · Verify independently.

Disclosure: we build fastCRW. This is a vendor-authored benchmark write-up, so weigh it accordingly — but every number here traces to one shared, public dataset you can re-run, and we state plainly the one metric where Firecrawl beats us.

The web scraping accuracy benchmark gap: 63.74% vs 59.95% vs 56.04%

The single most useful number in a web scraping accuracy benchmark is truth-recall: of the labeled ground-truth content a page is supposed to contain, how much does the scraper actually return? On Firecrawl's own public scrape-content dataset — 1,000 URLs, of which 819 carry labeled ground truth — fastCRW returned 63.74% (522 of 819), Crawl4AI returned 59.95% (491), and Firecrawl itself returned 56.04% (459). One harness (diagnose_3way.py), one run of 3,000 requests, one date (2026-05-08), three tools, identical inputs.

That makes fastCRW the highest-recall of the three tested. The gaps are concrete percentage points, not a vibe: +3.79 points over Crawl4AI and +7.70 points over Firecrawl.

fastCRW +3.79 points over Crawl4AI

Crawl4AI is a strong open-source scraper and it also beat Firecrawl on recall in this run. fastCRW edges it by 3.79 points — 522 labeled URLs recovered versus 491. That is 31 additional URLs out of 819 where fastCRW returned content the label says should be there and Crawl4AI did not.

fastCRW +7.70 points over Firecrawl

Against the category reference, the gap is wider: 63.74% versus 56.04%, a 7.70-point spread on the identical set. In raw counts that is 522 versus 459 — 63 more labeled URLs recovered.
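The arithmetic behind these figures is short enough to check by hand. A minimal Python sketch, using only the counts quoted above (the variable and function names are ours, chosen for illustration):

```python
# Counts quoted in this write-up: labeled URLs recovered, out of 819 labeled.
LABELED = 819
recovered = {"fastCRW": 522, "Crawl4AI": 491, "Firecrawl": 459}


def recall_pct(hits: int, labeled: int = LABELED) -> float:
    """Truth-recall as a percentage, rounded to two decimals."""
    return round(100 * hits / labeled, 2)


recall = {tool: recall_pct(hits) for tool, hits in recovered.items()}

# Gaps in percentage points, computed from the rounded recall figures.
gap_vs_crawl4ai = round(recall["fastCRW"] - recall["Crawl4AI"], 2)
gap_vs_firecrawl = round(recall["fastCRW"] - recall["Firecrawl"], 2)

print(recall)
print(gap_vs_crawl4ai, gap_vs_firecrawl)
```

The gaps are differences between percentages, so they are percentage points, not relative improvements.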

What a percentage point of recall buys you

Recall is not an abstract leaderboard stat. A missed page in a labeled set is a real page your downstream system never sees: a product spec that never reaches your extraction step, an article body that never enters your RAG index, a row that silently becomes a gap in your dataset. A 7.70-point gap across 819 URLs is 63 documents, or roughly 77 per thousand, and at agent or pipeline scale those misses compound.

Same dataset, same harness, one run

A benchmark is only fair if every tool ran against the same inputs through the same scoring code. This one did.

Firecrawl's own public 819-labeled-URL set

The dataset is Firecrawl's published scrape-content-dataset-v1 — 1,000 URLs, 819 of them carrying labeled ground truth. We deliberately used Firecrawl's own dataset rather than a hand-picked set of our own, because a vendor that brings its own URLs can quietly bias toward pages it handles well. The accuracy denominator is those 819 labeled URLs — never "of 1,000" and never "of 3,000 requests." If a vendor quotes recall against a denominator that shifts, that is your first signal to ask harder questions.
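To see why the denominator matters, here is a small Python sketch that divides the same fastCRW hit count by each of the three candidate denominators. Only the 819 figure is a valid recall denominator; the other two are shown to illustrate how far a shifted denominator moves the number.

```python
HITS = 522  # fastCRW labeled URLs recovered, as quoted above


def pct(hits: int, denominator: int) -> float:
    """Percentage of `denominator` represented by `hits`, two decimals."""
    return round(100 * hits / denominator, 2)


denominators = {
    "819 labeled URLs (the recall denominator)": 819,
    "1,000 dataset URLs (includes unlabeled)": 1000,
    "3,000 requests (1,000 URLs x 3 tools)": 3000,
}

for name, denom in denominators.items():
    print(f"{name}: {pct(HITS, denom)}%")
```

The same 522 hits read as 63.74%, 52.2%, or 17.4% depending on what they are divided by, which is why a single fixed denominator is stated on every number.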

diagnose_3way.py, 3,000 requests, 2026-05-08

The harness is diagnose_3way.py: it sends each URL to all three tools, scores each returned page against its label, and reports recall, scrape-success, errors, and the full latency distribution. The run was a single point-in-time measurement of 3,000 requests (1,000 URLs × 3 tools) on 2026-05-08. We publish the harness name and date inline on every number precisely so the claim is auditable, not marketed.
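The loop described above can be sketched in a few lines of Python. This is not the source of diagnose_3way.py. The `Scraper` signature, the `run_three_way` name, and especially the substring-match scoring rule are assumptions made for illustration, and the real harness may score pages differently.

```python
from typing import Callable, Dict, Optional

# Hypothetical scraper signature: URL in, extracted text out (None on failure).
Scraper = Callable[[str], Optional[str]]


def run_three_way(
    labels: Dict[str, Optional[str]],
    tools: Dict[str, Scraper],
) -> Dict[str, float]:
    """Send every URL to every tool and return truth-recall (%) per tool.

    `labels` maps each URL to its ground-truth snippet, or None if unlabeled.
    Assumption: a page counts as recovered when the label snippet appears in
    the returned text.
    """
    labeled = {url: truth for url, truth in labels.items() if truth is not None}
    hits = {name: 0 for name in tools}
    for url in labels:  # every URL is requested by every tool
        for name, scrape in tools.items():
            text = scrape(url)
            truth = labeled.get(url)
            if truth is not None and text and truth in text:
                hits[name] += 1  # only labeled URLs can score
    total = len(labeled) or 1  # avoid dividing by zero on an unlabeled set
    return {name: round(100 * n / total, 2) for name, n in hits.items()}


# Toy inputs: two labeled URLs, one unlabeled, and two stand-in "tools".
labels = {
    "https://a.example": "spec sheet",
    "https://b.example": "article body",
    "https://c.example": None,
}
tools = {
    "full": lambda url: "nav | spec sheet | article body",
    "shell": lambda url: "nav | cookie banner",
}
result = run_three_way(labels, tools)
print(result)
```

Note that the unlabeled URL is still requested, which is how 1,000 URLs across 3 tools produce 3,000 requests while the recall denominator stays at the labeled count.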

Why a shared dataset makes the gap fair

When all three tools face identical URLs through identical scoring, the differences are about the engines, not the test. There is no per-tool warm-up advantage, no cherry-picked sample, no separate scoring rubric. That is what lets us state "+7.70 points" as a real comparison rather than two unrelated marketing numbers placed side by side.

Why accuracy, not just success, is the lead metric

It is tempting to lead with "success rate" because it sounds like reliability. But scrape-success and accuracy are different metrics, and conflating them is how benchmarks mislead.

Success without recall returns empty-ish pages

Here is the catch: scrape-success counts whether a request returned something. Truth-recall measures whether it returned the right content. A page can succeed — return a 200 with markdown — and still miss the body the label says matters, because the engine grabbed a cookie wall, a nav shell, or a partially rendered DOM. That is why we treat truth-recall, not success, as the lead metric for anything feeding extraction or retrieval.
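The distinction is easy to make concrete. In the Python sketch below, the `ScrapeResult` type and the substring check are hypothetical stand-ins rather than the harness's actual scoring code. A cookie-wall response passes the success check and fails the recall check.

```python
from dataclasses import dataclass


@dataclass
class ScrapeResult:
    status: int
    markdown: str


def is_success(result: ScrapeResult) -> bool:
    """Scrape-success: the request returned something."""
    return result.status == 200 and bool(result.markdown.strip())


def is_recalled(result: ScrapeResult, truth: str) -> bool:
    """Truth-recall: the returned content contains the labeled snippet.

    Assumption: simple substring matching stands in for the real scoring rule.
    """
    return is_success(result) and truth in result.markdown


truth = "Weight: 2.4 kg"
cookie_wall = ScrapeResult(200, "# We value your privacy\nAccept all cookies")
full_page = ScrapeResult(200, "# Widget X\nWeight: 2.4 kg\nAccept all cookies")

print(is_success(cookie_wall), is_recalled(cookie_wall, truth))
print(is_success(full_page), is_recalled(full_page, truth))
```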

fastCRW: 91.8% scrape-success of reachable URLs, 0 errors

fastCRW's scrape-success was 91.8% of reachable URLs, with 0 thrown errors across all 3,000 requests. We always quote those together. "0 errors" alone would overstate the result — an error-free run can still legitimately fail to recover some URLs. "~92% success of reachable URLs with 0 errors" is the honest pairing: nothing crashed, and the vast majority of reachable requests returned a page.

What the gap costs you downstream

The reason recall is worth optimizing for is that its cost lands far downstream of the scraper, where it is harder to see and more expensive to fix.

Missing content compounds in RAG and extraction

In a RAG pipeline, a page that scraped "successfully" but missed its body becomes a chunk of navigation text in your vector store: it retrieves for the wrong queries and answers nothing. In structured extraction, a missing field forces a re-scrape or, worse, ships a null into a dataset someone trusts. A 7.70-point recall deficit does not announce itself; it shows up later as "why is this answer incomplete" and "why is this column half-empty."

Fewer re-scrapes and manual fixes

Every URL the first pass misses is a candidate for a retry, a manual patch, or a second vendor call. Higher recall on the first pass means fewer of those, which is real engineering time and real spend you do not have to budget for. The accuracy gap is, in practice, a re-scrape-and-cleanup gap.
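As a rough sizing exercise, the number of retry or manual-fix candidates per tool is just the labeled URLs that were not recovered. The Python sketch below uses only the counts quoted in this write-up. How many of those misses you would actually retry, and what each retry costs, depends on your workload and is not modeled here.

```python
LABELED = 819
recovered = {"fastCRW": 522, "Crawl4AI": 491, "Firecrawl": 459}

# Labeled URLs the first pass did not recover: each one is a candidate
# for a retry, a manual patch, or a second vendor call.
missed = {tool: LABELED - hits for tool, hits in recovered.items()}

# Extra cleanup candidates relative to fastCRW on this dataset.
extra_vs_fastcrw = {tool: n - missed["fastCRW"] for tool, n in missed.items()}

print(missed)
print(extra_vs_fastcrw)
```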

The full latency picture

Accuracy is one axis. fastCRW's median (p50) scrape latency of 1,914 ms beats Firecrawl's 2,305 ms and effectively ties Crawl4AI's 1,916 ms. In fast mode, fastCRW's p90 is 4,348 ms — the lowest of the three (Crawl4AI 4,754 ms, Firecrawl 6,937 ms).

The chrome-stealth fallback that recovers URLs the others miss — the same mechanism that drives the highest truth-recall — extends the tail when those harder pages are hit in recall mode. Whether recall mode or fast mode is right for you depends on your workload: recall-first batch and extraction jobs benefit from the higher recall; tail-sensitive synchronous agent loops run well in fast mode. We cover that distinction in scraping latency explained, and you can see the full latency split at /benchmarks.

Reproduce the gap yourself

The whole point of an auditable number is that you do not have to take it on faith.

Run diagnose_3way.py against the public set

The dataset is Firecrawl's own public scrape-content-dataset-v1 and the harness is diagnose_3way.py. Point it at all three tools, run the 1,000 URLs, and score against the 819 labeled entries. One caveat we state up front: an older harness, run_bench.py, produced a 43.7% figure that is quarantined and superseded — do not cite it; diagnose_3way.py is the harness of record. If your re-run lands close to 63.74 / 59.95 / 56.04, the gap reproduces.
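"Lands close" is a judgment call, so it helps to fix a tolerance before re-running. The sketch below is one way to do that in Python. The `reproduces` helper and the 1.0-point default tolerance are our assumptions, not part of the harness, and you may reasonably choose a different threshold.

```python
REPORTED = {"fastCRW": 63.74, "Crawl4AI": 59.95, "Firecrawl": 56.04}


def reproduces(rerun: dict, tolerance_pts: float = 1.0) -> bool:
    """True if every tool's re-run recall is within `tolerance_pts` of the
    reported figure and the ranking of the three tools is unchanged."""
    close = all(abs(rerun[t] - REPORTED[t]) <= tolerance_pts for t in REPORTED)
    same_order = sorted(rerun, key=rerun.get, reverse=True) == sorted(
        REPORTED, key=REPORTED.get, reverse=True
    )
    return close and same_order


# Made-up re-run results, for illustration only.
print(reproduces({"fastCRW": 63.1, "Crawl4AI": 60.4, "Firecrawl": 55.8}))
print(reproduces({"fastCRW": 58.0, "Crawl4AI": 59.9, "Firecrawl": 56.0}))
```

Checking the ranking as well as the distance guards against a re-run that is numerically close but flips the order of two tools.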

Read the full p50/p90/p99 tail too

When you reproduce recall, capture latency in the same pass and report the full percentile split, not a single average. A scraper's distribution is uneven by design here — the median and the tail tell opposite stories — and an "average ms" hides exactly the thing you need to plan for. For a like-for-like vendor view, see fastCRW vs Firecrawl, the broader Firecrawl vs Crawl4AI vs fastCRW three-way, and the fastCRW benchmark write-up.
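A short Python sketch shows how an average hides the tail. The latencies below are made up for illustration and are not benchmark data. The `percentile` helper uses the nearest-rank definition, which is one of several common conventions.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in ms: mostly fast, with a slow tail.
latencies_ms = [900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 9000, 15000]

summary = {
    "mean": round(mean(latencies_ms)),
    "p50": percentile(latencies_ms, 50),
    "p90": percentile(latencies_ms, 90),
    "p99": percentile(latencies_ms, 99),
}
print(summary)
```

Here the mean (3,400 ms) describes no request that actually happened: the median is 1,300 ms and the p90 is 9,000 ms, which is the figure a timeout budget has to plan for.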

Where Firecrawl genuinely wins

An accuracy benchmark that pretends the competitor has no strengths is not a benchmark, it is an ad. Plainly:

  • Latency tail in recall mode. fastCRW's fast-mode p90 of 4,348 ms is the lowest of the three (Firecrawl's p90 in that run was 6,937 ms), but the chrome-stealth fallback extends the tail when harder pages are hit in recall mode. For synchronous tail-sensitive paths, check both tools against your own URL mix.
  • Cloud-only specialties. Heavy anti-bot (Fire-engine) paths, an agent endpoint, and deep-research are Firecrawl-cloud features fastCRW does not have. fastCRW has no screenshot output (a formats: ["screenshot"] request returns HTTP 422), no multi-URL batched extract, and is stateless per request.
  • Ecosystem and maturity. Firecrawl is the category reference with more tutorials, more social proof, and a longer track record.

fastCRW's claim is narrow and specific: the highest truth-recall of the three on a shared public set, a median-latency win, a single ~8 MB binary you can self-host under AGPL-3.0, and a Firecrawl-compatible REST surface that makes the comparison a base-URL swap rather than a rewrite.
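To illustrate what a base-URL swap means in practice, here is a Python sketch that builds the same scrape request for two backends without sending it. It rests on several assumptions: that both services accept a Firecrawl-style POST to a `/v1/scrape` path with a bearer token, and that `http://localhost:3000` stands in for wherever your self-hosted instance listens. Check the API version and address of your own deployment before relying on either.

```python
import json
from urllib.request import Request

# Assumptions: the /v1/scrape path and bearer-token header follow the
# Firecrawl-style request shape; the fastCRW address is a placeholder.
FIRECRAWL_BASE = "https://api.firecrawl.dev"
FASTCRW_BASE = "http://localhost:3000"  # hypothetical self-hosted address


def build_scrape_request(base_url: str, api_key: str, target: str) -> Request:
    """Build (but do not send) a scrape request; only base_url differs per vendor."""
    body = json.dumps({"url": target, "formats": ["markdown"]}).encode()
    return Request(
        f"{base_url.rstrip('/')}/v1/scrape",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


a = build_scrape_request(FIRECRAWL_BASE, "KEY_A", "https://example.com")
b = build_scrape_request(FASTCRW_BASE, "KEY_B", "https://example.com")
print(a.full_url)
print(b.full_url)
```

The request body is identical in both cases, which is what makes a like-for-like benchmark practical. Per the limits listed above, a request asking fastCRW for `formats: ["screenshot"]` would return HTTP 422, so a comparison should stick to formats both tools support.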

Sources

  • 3-way scrape benchmark: diagnose_3way.py on Firecrawl's public scrape-content-dataset-v1 (1,000 URLs / 819 labeled), single run 2026-05-08 — result of record in bench/server-runs/RESULT_3WAY_1000_FULL.md.
  • Truth-recall: fastCRW 63.74% (522), Crawl4AI 59.95% (491), Firecrawl 56.04% (459) of 819 labeled. fastCRW scrape-success: 91.8% of reachable URLs, 0 thrown errors of 3,000 requests.
  • Firecrawl docs and dataset: docs.firecrawl.dev (verified 2026-05-18).

Related: fastCRW vs Firecrawl · fastCRW benchmark · Firecrawl vs Crawl4AI vs fastCRW · /benchmarks

Frequently asked questions

How much more accurate is fastCRW than Firecrawl?
On Firecrawl's own public scrape-content dataset (diagnose_3way.py, 819 labeled URLs, single run 2026-05-08), fastCRW returned 63.74% of labeled ground-truth content (522 of 819) versus Firecrawl's 56.04% (459) — a 7.70-percentage-point truth-recall gap on an identical set. fastCRW was the highest-recall of the three tools tested.
What dataset was the truth-recall gap measured on?
Firecrawl's own published scrape-content-dataset-v1: 1,000 URLs, of which 819 carry labeled ground truth. The 819 labeled URLs are the accuracy denominator — recall is always reported 'of 819 labeled URLs', not 'of 1,000' or 'of 3,000 requests'. Using the competitor's own dataset avoids bias from a vendor bringing its own favorable URLs.
Does Firecrawl win on any benchmark metric?
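Not on the recall or latency figures quoted here. fastCRW leads on truth-recall (63.74% vs 56.04% of 819 labeled URLs), on p50 latency (1,914 ms vs 2,305 ms), and on fast-mode p90 (4,348 ms vs 6,937 ms). Scrape-success is not a head-to-head: we report only fastCRW's standalone figure of 91.8% of reachable URLs with 0 thrown errors. The caveat is the latency tail in recall mode, where fastCRW's chrome-stealth fallback extends the tail on harder pages. Beyond the benchmark, Firecrawl is ahead on cloud-only features (Fire-engine anti-bot paths, an agent endpoint, deep research) and on ecosystem maturity.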
Why is a 7.7-point recall gap significant?
Truth-recall counts whether the right content was returned, not just whether a request succeeded. A 7.70-point gap across 819 labeled URLs is 63 more documents recovered by fastCRW than by Firecrawl (522 versus 459). Downstream, missed content becomes incomplete RAG answers, null fields in extraction, and extra re-scrapes: costs that land far from the scraper and are expensive to fix.
Can I verify the 63.74% vs 56.04% numbers myself?
Yes. The dataset (scrape-content-dataset-v1) and harness (diagnose_3way.py) are reproducible: run all 1,000 URLs through all three tools and score against the 819 labeled entries. Note that an older harness, run_bench.py, produced a quarantined 43.7% figure that is superseded — do not cite it; diagnose_3way.py is the harness of record.

