By the fastCRW team · Benchmark figures verified 2026-05-18 against the run of record (diagnose_3way.py, 2026-05-08) · Verify independently before quoting.
Disclosure: We build fastCRW. This post teaches a measurement method and then shows our own result on it. The whole point of publishing the method is that you can reproduce the number rather than take our word for it — and we name where a competitor genuinely beats us.
How to measure web scraper accuracy: start with truth-recall
The most common mistake in evaluating a web scraper is treating "it returned a page" as "it returned the right content." Those are different measurements, and conflating them is how vendors quietly inflate accuracy claims. To measure web scraper accuracy honestly, you need a metric that compares what the scraper returned against what the page was actually supposed to contain. That metric is truth-recall: of the known, labeled ground-truth content for a set of URLs, how much did the scraper actually recover?
Truth-recall is the lead accuracy number we publish for fastCRW — 63.74% of 819 labeled URLs, the highest of the three tools tested on Firecrawl's own public dataset (diagnose_3way.py, 2026-05-08). Below is exactly what that means, why the denominator is 819 and not 1,000, and how to run the same measurement yourself.
Recall vs precision in extraction
Borrow the two words from information retrieval. Recall asks: of everything that should have been returned, how much was? Precision asks: of everything that was returned, how much was correct? A scraper that returns an empty string has perfect precision on the zero things it returned and catastrophic recall. For scraping-for-RAG and extraction pipelines, recall is usually the metric that hurts you when it is low: missing content silently degrades every downstream answer. Truth-recall is a recall metric scoped to a labeled ground-truth set, which is why it is the right headline number for accuracy.
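A minimal sketch of the two definitions, treating a page as a set of content items. The item names and the convention that precision over an empty return is 1.0 are illustrative assumptions, not part of any harness described here.

```python
def precision_recall(returned: set[str], expected: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for returned items scored against expected items."""
    hits = len(returned & expected)
    # Assumed convention: nothing returned means nothing wrong, so precision is 1.0.
    precision = hits / len(returned) if returned else 1.0
    recall = hits / len(expected) if expected else 1.0
    return precision, recall


# Hypothetical ground truth for one page.
expected = {"headline", "byline", "paragraph-1", "paragraph-2"}

# An empty return: perfect precision, zero recall.
empty = precision_recall(set(), expected)

# One correct item plus one piece of boilerplate.
partial = precision_recall({"headline", "cookie-banner"}, expected)

print(empty, partial)
```

The empty return scores (1.0, 0.0), which is why precision alone cannot be the headline number for extraction.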
Why "it returned something" is not accuracy
A request can succeed at the HTTP level, return a 200, hand back well-formed Markdown, and still miss the article body — capturing only a cookie banner, a navigation menu, or a paywall stub. None of those failures show up as errors. They show up as thin content. If your only metric is "did the request complete," you will rate a scraper that returns boilerplate identically to one that returns the actual article. Accuracy has to be measured against what the page should have contained, not against whether bytes came back.
The role of labeled ground truth
To score recall you need an answer key: for each URL, a labeled record of the content that page genuinely holds. That label is the ground truth. With it, every scraper's output for that URL can be scored against the same target. Without it, "accuracy" is just self-assessment — the scraper grading its own homework. The hard part of an honest accuracy benchmark is not the scraping; it is having a credible, shared, labeled dataset to score against.
How to build a ground-truth accuracy benchmark
You can build a defensible truth-recall benchmark in three steps: choose a labeled dataset, fix the denominator, and define how a returned page is scored against its label.
Choosing a labeled dataset (Firecrawl's public 819-URL set)
Use a dataset you did not author. We score against Firecrawl's own public scrape-content-dataset-v1 — 1,000 URLs, of which 819 carry labeled ground truth (diagnose_3way.py, 2026-05-08). Scoring against a competitor's published set removes the obvious objection that the benchmark author hand-picked URLs that flatter their own engine. If you build your own labeled set, publish it; a private dataset is an unverifiable claim no matter how good the methodology.
Defining the denominator: labeled URLs, not requests
This is where most accuracy numbers go wrong. Recall is a fraction, and the fraction is only meaningful if the denominator is the number of items that have a known correct answer. For this dataset that is 819 labeled URLs — not the 1,000 total URLs, and definitely not the 3,000 requests issued across three tools. Quoting recall "out of 3,000" or "out of 1,000" silently shrinks the percentage or changes what it measures. fastCRW's 63.74% is always stated as "of 819 labeled URLs." If a vendor will not tell you their denominator, you cannot trust their recall.
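The effect of the denominator is plain arithmetic. The sketch below takes the 522 recovered URLs reported for fastCRW and divides by each of the three candidate denominators; the helper name `pct` is just for illustration.

```python
RECOVERED = 522        # labeled URLs where the ground truth was recovered
LABELED_URLS = 819     # URLs that have a known correct answer
TOTAL_URLS = 1000      # all URLs in the dataset
TOTAL_REQUESTS = 3000  # 1,000 URLs x 3 tools


def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to two decimals."""
    return round(100 * numerator / denominator, 2)


truth_recall = pct(RECOVERED, LABELED_URLS)     # the only meaningful recall
over_all_urls = pct(RECOVERED, TOTAL_URLS)      # mixes in URLs with no answer key
over_requests = pct(RECOVERED, TOTAL_REQUESTS)  # mixes in the other tools' requests

print(truth_recall, over_all_urls, over_requests)
```

The same 522 URLs read as 63.74%, 52.2%, or 17.4% depending on the denominator, and only the first is a recall figure.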
Scoring a returned page against its label
For each labeled URL, the harness fetches the page through the scraper, then compares the returned content against the label to decide whether the ground-truth content was recovered. Truth-recall is the count of URLs where the labeled content was successfully recovered, divided by 819. The scoring rule must be identical for every tool in the comparison — same harness, same threshold, same run — or you are comparing measurement noise, not scrapers. See our full benchmark write-up for the per-tool breakdown and the /benchmarks page for the headline numbers.
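The post does not spell out the comparison rule that diagnose_3way.py applies, so the following is only a sketch of one plausible rule: a label counts as recovered when at least a threshold share of its tokens appears in the returned content. The function names, the token-overlap rule, and the 0.8 threshold are all assumptions made for illustration.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_recovered(returned: str, label: str, threshold: float = 0.8) -> bool:
    """Assumed rule: enough of the label's tokens appear in the returned content."""
    label_tokens = tokens(label)
    if not label_tokens:
        return False
    overlap = len(label_tokens & tokens(returned)) / len(label_tokens)
    return overlap >= threshold


def truth_recall(outputs: dict[str, str], labels: dict[str, str]) -> float:
    """Recovered labeled URLs divided by all labeled URLs."""
    hits = sum(is_recovered(outputs.get(url, ""), label) for url, label in labels.items())
    return hits / len(labels)


# Hypothetical labels and scraper outputs.
labels = {"a": "The quick brown fox jumps", "b": "Rates rose two percent today"}
outputs = {
    "a": "Menu Home The quick brown fox jumps Footer",  # body present amid boilerplate
    "b": "Accept cookies to continue",                  # cookie wall only
}

print(truth_recall(outputs, labels))
```

Whatever rule is chosen, the point from the paragraph above holds: the same function and the same threshold must score every tool.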
Truth-recall vs scrape-success rate: two different numbers
Scrape-success rate and truth-recall are easy to confuse, but they measure opposite ends of the same request. Success rate asks whether a usable page came back at all; truth-recall asks whether it was the right content. A tool can win one and lose the other, which is exactly what happened on this dataset.
Why a page can succeed but miss the truth
Scrape-success counts requests that returned content without throwing. A page that returns a cookie wall, a soft 404, or just the header and footer counts as a success — the request completed and bytes came back — while its truth-recall for that URL is zero because none of the labeled body was recovered. This is why a high success rate can sit on top of mediocre recall: success measures liveness, recall measures correctness. For more on how thin returns slip past success checks, see our note on LLM-ready Markdown extraction.
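A small illustration of how the two metrics diverge on the same results. The record type, field names, and the four example URLs are hypothetical; they only mirror the cases named above.

```python
from dataclasses import dataclass


@dataclass
class UrlResult:
    url: str
    returned_content: bool  # request completed and bytes came back
    labeled: bool           # URL has ground truth to score against
    label_recovered: bool   # labeled body was found in the output


results = [
    UrlResult("https://example.com/article", True, True, True),
    UrlResult("https://example.com/cookie-wall", True, True, False),  # success, zero recall
    UrlResult("https://example.com/unlabeled", True, False, False),   # success, not scored
    UrlResult("https://example.com/timeout", False, True, False),     # no content at all
]

# Success is measured over every URL requested.
scrape_success = sum(r.returned_content for r in results) / len(results)

# Recall is measured only over URLs that carry a label.
labeled = [r for r in results if r.labeled]
truth_recall = sum(r.label_recovered for r in labeled) / len(labeled)

print(f"success={scrape_success:.2f} recall={truth_recall:.2f}")
```

Three of four requests "succeed" here, yet only one of the three labeled URLs is actually recovered.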
Pairing 87.7% success with 0 errors honestly
On the same run, fastCRW posted 87.7% scrape-success (877 of 1,000) with 0 thrown errors across 3,000 requests (diagnose_3way.py, 2026-05-08). We always state those two numbers together. "0 errors" alone overstates the result — it sounds like a flawless run when it only means nothing crashed; the 12.3% of URLs that did not yield usable content are part of the honest picture. And on success rate specifically, fastCRW does not lead: Firecrawl posted the highest scrape-success at 89.7% (897 of 1,000) on this run. We lead truth-recall and concede scrape-success — that is what an honest scoreboard looks like.
A real 3-way result you can reproduce
Here is the full accuracy scoreboard from the run of record, so the method above is grounded in numbers you can check rather than asserted in the abstract.
| Tool | Truth-recall (of 819 labeled) | Scrape-success (of 1,000) | Thrown errors (of 3,000) |
|---|---|---|---|
| fastCRW | 63.74% (522) | 87.7% (877) | 0 |
| Crawl4AI | 59.95% (491) | 83.5% (835) | 0 |
| Firecrawl | 56.04% (459) | 89.7% (897) | 0 |
Source: Firecrawl public scrape-content-dataset-v1 (819 labeled URLs), diagnose_3way.py, single run, 3,000 requests, 2026-05-08.
fastCRW 63.74%, Crawl4AI 59.95%, Firecrawl 56.04%
fastCRW recovered the labeled ground truth on 522 of 819 URLs (63.74%) — the highest of the three — versus Crawl4AI's 491 (59.95%) and Firecrawl's 459 (56.04%). That is +3.79 percentage points over Crawl4AI and +7.70 over Firecrawl on an identical set. Note the inversion against the success column: Firecrawl returned the most live pages but recovered the least labeled content, which is the clearest possible illustration that success and recall are different metrics. Accuracy is bought in the recall column, not the success column.
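The percentages and gaps can be rechecked from the raw counts in the table. This sketch uses only those counts; the variable names are illustrative.

```python
LABELED_URLS = 819
TOTAL_URLS = 1000

# Raw counts from the scoreboard above.
recovered = {"fastCRW": 522, "Crawl4AI": 491, "Firecrawl": 459}
succeeded = {"fastCRW": 877, "Crawl4AI": 835, "Firecrawl": 897}

recall = {tool: round(100 * n / LABELED_URLS, 2) for tool, n in recovered.items()}
success = {tool: round(100 * n / TOTAL_URLS, 1) for tool, n in succeeded.items()}

# Gaps in percentage points, not relative percent.
gap_vs_crawl4ai = round(recall["fastCRW"] - recall["Crawl4AI"], 2)
gap_vs_firecrawl = round(recall["fastCRW"] - recall["Firecrawl"], 2)

recall_leader = max(recall, key=recall.get)
success_leader = max(success, key=success.get)

print(recall, gap_vs_crawl4ai, gap_vs_firecrawl, recall_leader, success_leader)
```

The two leaders differ: the recall column and the success column rank the tools in different orders.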
One honest caveat that belongs next to the recall win: the same mechanism that recovers content others miss, a chrome-stealth fallback, also gives fastCRW the worst tail latency of the three (p90 of 14,157 ms vs Crawl4AI's 4,754 ms). Accuracy has a cost, and we publish it. If your workload is latency-sensitive rather than recall-sensitive, read scraping latency explained before deciding.
How to run diagnose_3way.py yourself
The benchmark is reproducible by design. Point diagnose_3way.py at the public scrape-content-dataset-v1, run all three scrapers against the same URLs in one pass, and score each returned page against its label with the 819-URL denominator. Use a single run with a fixed dataset snapshot so every tool sees identical inputs, and report the date — the live web changes, so an accuracy number with no date is meaningless. Our result is a single point-in-time run on 2026-05-08; reproduce it on your own date and your numbers may differ as the underlying pages change.
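For readers who want the shape of such a harness before opening the repo, here is a rough sketch of a single-pass, multi-tool run. It is not the code of diagnose_3way.py: `run_benchmark`, the scraper callables, the `is_recovered` scorer, and the rule that any non-empty return counts as a success are all assumptions for illustration.

```python
from datetime import date
from typing import Callable


def run_benchmark(
    urls: list[str],
    labels: dict[str, str],
    scrapers: dict[str, Callable[[str], str]],
    is_recovered: Callable[[str, str], bool],
    run_date: date,
) -> dict[str, dict]:
    """Run every scraper over the same URLs once and score with one shared rule."""
    report = {}
    for name, scrape in scrapers.items():
        succeeded = recovered = errors = 0
        for url in urls:
            try:
                content = scrape(url)
            except Exception:
                errors += 1  # thrown error: counted, never silently dropped
                continue
            if content.strip():
                succeeded += 1  # assumed success rule: non-empty content came back
            if url in labels and is_recovered(content, labels[url]):
                recovered += 1
        report[name] = {
            "truth_recall": recovered / len(labels),   # labeled URLs only
            "scrape_success": succeeded / len(urls),   # all URLs
            "thrown_errors": errors,
            "run_date": run_date.isoformat(),          # a number without a date is meaningless
        }
    return report


# Stub scrapers standing in for real tools.
pages = {"u1": "alpha body", "u2": "cookie wall", "u3": "gamma"}


def steady(url: str) -> str:
    return pages[url]


def flaky(url: str) -> str:
    if url == "u3":
        raise TimeoutError(url)
    return pages[url] if url == "u1" else ""


report = run_benchmark(
    urls=["u1", "u2", "u3"],
    labels={"u1": "alpha", "u2": "beta"},
    scrapers={"steady": steady, "flaky": flaky},
    is_recovered=lambda content, label: label in content,
    run_date=date(2026, 5, 8),
)
print(report)
```

Passing the scorer and the URL list in once, rather than per tool, is what keeps the comparison on identical inputs and an identical rule.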
(One housekeeping note for anyone digging through the repo: an older harness, run_bench.py, produced a different recall figure under a different scoring rule. It is quarantined and superseded — diagnose_3way.py is the harness of record. Do not mix figures from the two.)
How to read any vendor's accuracy claim
Whether the vendor is us or anyone else, three questions separate a measurable accuracy claim from marketing.
Demand the dataset, method, and date
Every accuracy number should arrive with the dataset it was measured on, the harness or method used to score it, and the date it was run. "63.74% truth-recall on Firecrawl's public 819-labeled-URL dataset, diagnose_3way.py, 2026-05-08" is checkable. "Industry-leading accuracy" is not. If a vendor cannot name all three, treat the number as unverified — including ours.
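One way to make that habit mechanical is to refuse to record a claim that lacks any of the three fields. The `AccuracyClaim` type below is a hypothetical sketch, not something from the fastCRW repo.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccuracyClaim:
    value: str
    dataset: Optional[str] = None
    method: Optional[str] = None
    run_date: Optional[str] = None

    def missing(self) -> list[str]:
        """Names of the provenance fields the claim fails to state."""
        return [f for f in ("dataset", "method", "run_date") if not getattr(self, f)]


checkable = AccuracyClaim(
    value="63.74% truth-recall",
    dataset="Firecrawl public scrape-content-dataset-v1 (819 labeled URLs)",
    method="diagnose_3way.py",
    run_date="2026-05-08",
)
vague = AccuracyClaim(value="Industry-leading accuracy")

print(checkable.missing(), vague.missing())
```

An empty `missing()` list does not make a claim true; it only means the claim can be checked.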
Watch for averaged or denominator-shifted numbers
Two tricks distort accuracy quietly. The first is shifting the denominator: quoting the figure against total URLs, total requests, or a trimmed subset instead of the full set of labeled URLs. Any of these changes what the fraction measures, and excluding hard URLs raises the percentage outright. The second is averaging across runs or URL classes to smooth out a weak category. Ask what the denominator is, ask whether the figure is a single run or an average, and ask which URLs were excluded. An honest accuracy benchmark answers all three without flinching, names where it loses (we lose scrape-success to Firecrawl and tail latency to Crawl4AI), and hands you the script to check the rest.
Sources
- fastCRW canonical fact sheet — scrape benchmark of record (truth-recall, scrape-success, latency).
- Benchmark run of record: bench/server-runs/RESULT_3WAY_1000_FULL.md and diagnose_3way.py (single run, 3,000 requests, 2026-05-08).
- Dataset: Firecrawl's public scrape-content-dataset-v1 (1,000 URLs, 819 labeled), firecrawl.dev (verified 2026-05-18).
- fastCRW repo and benchmarks: github.com/us/crw · /benchmarks
Related: The fastCRW benchmark, in full · Scraping latency explained · LLM-ready Markdown extraction
