What is truth-recall in web scraping?

Truth-recall measures how much of a page's known, labeled ground-truth content a scraper actually recovers. For a labeled dataset it is the count of URLs where the correct content was returned, divided by the number of labeled URLs. It is a recall metric: it answers 'of everything that should have come back, how much did?' On Firecrawl's public 819-labeled-URL dataset, fastCRW scored 63.74% truth-recall — the highest of three tools tested (diagnose_3way.py, 2026-05-08).

How is scraper accuracy different from scrape-success rate?

Scrape-success rate counts requests that returned a usable page without throwing, so it measures liveness. Truth-recall measures whether the returned page contained the right content, so it measures correctness. A page can succeed (return a 200 with only a cookie banner or header and footer) yet have zero truth-recall because none of the labeled body was recovered. On the benchmark run, fastCRW led truth-recall (63.74% of 819 labeled URLs) and separately posted 91.8% scrape-success of reachable URLs with 0 errors. The two figures have different denominators and answer different questions.

What dataset is fastCRW's accuracy benchmark run on?

It runs on Firecrawl's own public scrape-content-dataset-v1: 1,000 URLs, of which 819 carry labeled ground truth used as the accuracy denominator. Scoring is done by diagnose_3way.py in a single run of 3,000 requests across all three tools on 2026-05-08. Using a competitor's published dataset removes the objection that the benchmark author cherry-picked flattering URLs.

Why does fastCRW report accuracy out of 819 URLs and not 1,000?

Recall is only meaningful against items that have a known correct answer. Of the dataset's 1,000 URLs, only 819 carry labeled ground truth, so 819 is the correct denominator. Reporting recall out of 1,000 (or out of the 3,000 total requests) would shift the denominator and silently change what the percentage means. fastCRW's 63.74% is always stated 'of 819 labeled URLs' for that reason.

How to Measure Web Scraper Accuracy (Truth-Recall)

By the fastCRW team · Benchmark figures verified 2026-05-18 against the run of record (diagnose_3way.py, 2026-05-08) · Verify independently before quoting.

Disclosure: We build fastCRW. This post teaches a measurement method and then shows our own result on it. The whole point of publishing the method is that you can reproduce the number rather than take our word for it — and we name where a competitor genuinely beats us.

How to measure web scraper accuracy: start with truth-recall

The most common mistake in evaluating a web scraper is treating "it returned a page" as "it returned the right content." Those are different measurements, and conflating them is how vendors quietly inflate accuracy claims. To measure web scraper accuracy honestly you need a metric that compares what the scraper returned against what the page was actually supposed to contain. That metric is truth-recall: of the known, labeled ground-truth content for a set of URLs, how much did the scraper actually recover?

Truth-recall is the lead accuracy number we publish for fastCRW — 63.74% of 819 labeled URLs, the highest of the three tools tested on Firecrawl's own public dataset (diagnose_3way.py, 2026-05-08). Below is exactly what that means, why the denominator is 819 and not 1,000, and how to run the same measurement yourself.

Recall vs precision in extraction

Borrow the two words from information retrieval. Recall asks: of everything that should have been returned, how much was? Precision asks: of everything that was returned, how much was correct? A scraper that returns an empty string has vacuously perfect precision (nothing it returned was wrong) and catastrophic recall. For scraping-for-RAG and extraction pipelines, recall is usually the metric that hurts you when it is low: missing content silently degrades every downstream answer. Truth-recall is a recall metric scoped to a labeled ground-truth set, which is why it is the right headline number for accuracy.
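To make the two definitions concrete, here is a minimal sketch that scores one page by comparing sets of content snippets. The snippet-set representation and the function name precision_recall are illustrative assumptions, not how any particular harness represents a page. Precision for an empty output is 0/0, so the sketch reports it as None rather than as a perfect score.

```python
from __future__ import annotations


def precision_recall(returned: set[str], expected: set[str]) -> tuple[float | None, float | None]:
    """Return (precision, recall) for one page; None where the ratio is undefined."""
    hits = len(returned & expected)
    precision = hits / len(returned) if returned else None
    recall = hits / len(expected) if expected else None
    return precision, recall


# Hypothetical ground truth for one page: four labeled snippets.
expected = {"headline", "paragraph 1", "paragraph 2", "paragraph 3"}

# Boilerplate-heavy output: one real paragraph plus a cookie banner.
print(precision_recall({"paragraph 1", "cookie banner"}, expected))  # (0.5, 0.25)

# Empty output: nothing returned was wrong, but nothing was recovered either.
print(precision_recall(set(), expected))  # (None, 0.0)
```

The second call is the case that matters for extraction pipelines: no error is raised anywhere, and recall is still zero.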

Why "it returned something" is not accuracy

A request can succeed at the HTTP level, return a 200, hand back well-formed Markdown, and still miss the article body — capturing only a cookie banner, a navigation menu, or a paywall stub. None of those failures show up as errors. They show up as thin content. If your only metric is "did the request complete," you will rate a scraper that returns boilerplate identically to one that returns the actual article. Accuracy has to be measured against what the page should have contained, not against whether bytes came back.

The role of labeled ground truth

To score recall you need an answer key: for each URL, a labeled record of the content that page genuinely holds. That label is the ground truth. With it, every scraper's output for that URL can be scored against the same target. Without it, "accuracy" is just self-assessment — the scraper grading its own homework. The hard part of an honest accuracy benchmark is not the scraping; it is having a credible, shared, labeled dataset to score against.

How to build a ground-truth accuracy benchmark

You can build a defensible truth-recall benchmark in three steps: choose a labeled dataset, fix the denominator, and define how a returned page is scored against its label.

Choosing a labeled dataset (Firecrawl's public 819-URL set)

Use a dataset you did not author. We score against Firecrawl's own public scrape-content-dataset-v1 — 1,000 URLs, of which 819 carry labeled ground truth (diagnose_3way.py, 2026-05-08). Scoring against a competitor's published set removes the obvious objection that the benchmark author hand-picked URLs that flatter their own engine. If you build your own labeled set, publish it; a private dataset is an unverifiable claim no matter how good the methodology.

Defining the denominator: labeled URLs, not requests

This is where most accuracy numbers go wrong. Recall is a fraction, and the fraction is only meaningful if the denominator is the number of items that have a known correct answer. For this dataset that is 819 labeled URLs — not the 1,000 total URLs, and definitely not the 3,000 requests issued across three tools. Quoting recall "out of 3,000" or "out of 1,000" silently shrinks the percentage or changes what it measures. fastCRW's 63.74% is always stated as "of 819 labeled URLs." If a vendor will not tell you their denominator, you cannot trust their recall.
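The effect of the denominator is easy to check by hand. The sketch below divides fastCRW's 522 recovered URLs (the count reported in the scoreboard of this post) by each of the three candidate denominators. Only the first line of output is a recall figure; the variable names are illustrative.

```python
recovered = 522  # labeled URLs where the ground truth was recovered

denominators = {
    "labeled URLs": 819,      # URLs that carry a ground-truth label
    "total URLs": 1_000,      # every URL in the dataset
    "total requests": 3_000,  # requests issued across all three tools
}

rates = {name: recovered / denom for name, denom in denominators.items()}
for name, rate in rates.items():
    print(f"{recovered} / {denominators[name]} {name}: {rate:.2%}")

# 522 / 819 labeled URLs: 63.74%
# 522 / 1000 total URLs: 52.20%
# 522 / 3000 total requests: 17.40%
```

The same numerator yields three very different percentages, which is why a recall figure without its denominator cannot be interpreted.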

Scoring a returned page against its label

For each labeled URL, the harness fetches the page through the scraper, then compares the returned content against the label to decide whether the ground-truth content was recovered. Truth-recall is the count of URLs where the labeled content was successfully recovered, divided by 819. The scoring rule must be identical for every tool in the comparison — same harness, same threshold, same run — or you are comparing measurement noise, not scrapers. See our full benchmark write-up for the per-tool breakdown and the /benchmarks page for the headline numbers.
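The scoring step can be sketched as follows. This post does not spell out the exact rule diagnose_3way.py applies, so the token-overlap rule, the 0.8 threshold, and the function names below are all assumptions chosen for illustration. The part that carries over to any rule is the structure: one scoring function applied to every tool, and a denominator equal to the number of labeled URLs.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens, as a set."""
    return set(re.findall(r"\w+", text.lower()))


def label_recovered(returned: str, label: str, threshold: float = 0.8) -> bool:
    """Hypothetical rule: the label counts as recovered when at least
    `threshold` of its distinct tokens appear in the returned content."""
    label_tokens = tokens(label)
    if not label_tokens:
        return False
    found = label_tokens & tokens(returned)
    return len(found) / len(label_tokens) >= threshold


def truth_recall(outputs: dict[str, str], labels: dict[str, str]) -> float:
    """Recovered labeled URLs divided by all labeled URLs.
    A URL with no output still counts in the denominator."""
    recovered = sum(label_recovered(outputs.get(url, ""), label) for url, label in labels.items())
    return recovered / len(labels)


# Toy data: two labeled URLs, one recovered, one returning only a cookie prompt.
labels = {"a": "The quick brown fox", "b": "Quarterly revenue rose nine percent"}
outputs = {"a": "Menu | The quick brown fox jumps", "b": "Accept cookies to continue"}
print(truth_recall(outputs, labels))  # 0.5
```

Changing the threshold changes every tool's score, which is why the threshold has to be fixed before the run and shared across tools.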

Truth-recall vs scrape-success rate: two different numbers

Scrape-success rate and truth-recall are easy to confuse, but they measure different things about the same request. Success rate asks whether a usable page came back at all; truth-recall asks whether it was the right content. A tool can win one and lose the other, which is exactly what happened on this dataset.

Why a page can succeed but miss the truth

Scrape-success counts requests that returned content without throwing. A page that returns a cookie wall, a soft 404, or just the header and footer counts as a success — the request completed and bytes came back — while its truth-recall for that URL is zero because none of the labeled body was recovered. This is why a high success rate can sit on top of mediocre recall: success measures liveness, recall measures correctness. For more on how thin returns slip past success checks, see our note on LLM-ready Markdown extraction.
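A small tally makes the gap visible. The four records below are invented for illustration (the URLs, the Result type, and its field names are hypothetical), and both rates use the same four-URL denominator to keep the example short; in the benchmark the two metrics have different denominators.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    returned_content: bool  # request completed and bytes came back
    label_recovered: bool   # the labeled body was found in those bytes


results = [
    Result("https://example.com/article", True, True),
    Result("https://example.com/cookie-wall", True, False),
    Result("https://example.com/soft-404", True, False),
    Result("https://example.com/timeout", False, False),
]

success_rate = sum(r.returned_content for r in results) / len(results)
recall_rate = sum(r.label_recovered for r in results) / len(results)
print(f"success {success_rate:.0%}, truth-recall {recall_rate:.0%}")
# success 75%, truth-recall 25%
```

Three of the four requests count as successes, but only one of them recovered the content the label says the page holds.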

Pairing scrape-success with 0 errors honestly

On the same run, fastCRW posted 91.8% scrape-success (877 of 955 reachable URLs) with 0 thrown errors across 3,000 requests (diagnose_3way.py, 2026-05-08). We always state those two numbers together. "0 errors" alone overstates the result: it sounds like a flawless run when it only means nothing crashed, and the 78 reachable URLs that did not yield usable content are part of the honest picture. Reporting scrape-success and truth-recall as separate metrics, each with its own denominator, is what an honest scoreboard looks like.

A real 3-way result you can reproduce

Here is the full accuracy scoreboard from the run of record, so the method above is grounded in numbers you can check rather than asserted in the abstract.

Tool	Truth-recall (of 819 labeled)	Thrown errors (of 3,000)
fastCRW	63.74% (522)	0
Crawl4AI	59.95% (491)	0
Firecrawl	56.04% (459)	0

Source: Firecrawl public scrape-content-dataset-v1 (819 labeled URLs), diagnose_3way.py, single run, 3,000 requests, 2026-05-08.

fastCRW 63.74%, Crawl4AI 59.95%, Firecrawl 56.04%

fastCRW recovered the labeled ground truth on 522 of 819 URLs (63.74%), the highest of the three, versus Crawl4AI's 491 (59.95%) and Firecrawl's 459 (56.04%). That is +3.79 percentage points over Crawl4AI and +7.70 over Firecrawl on an identical set. Note the inversion: Firecrawl returned more live pages but recovered the least labeled content, which is the clearest possible illustration that success and recall are different metrics. Accuracy is earned in the recall column, not the success column. fastCRW also logged 0 thrown errors across all 3,000 requests and 91.8% scrape-success of reachable URLs. In fast mode it posted a p90 latency of 4,348 ms, the lowest of the three tools tested.
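The scoreboard arithmetic can be checked from the raw counts alone. The sketch below recomputes each percentage from the recovered counts and the 819-URL denominator, then the percentage-point gaps and the scrape-success figure. Variable names are illustrative.

```python
LABELED_URLS = 819
recovered = {"fastCRW": 522, "Crawl4AI": 491, "Firecrawl": 459}

recall = {tool: count / LABELED_URLS for tool, count in recovered.items()}
for tool, value in recall.items():
    print(f"{tool}: {value:.2%}")
# fastCRW: 63.74%
# Crawl4AI: 59.95%
# Firecrawl: 56.04%

# Gaps in percentage points, taken from the rounded published percentages.
pct = {tool: round(value * 100, 2) for tool, value in recall.items()}
print(f"vs Crawl4AI: {pct['fastCRW'] - pct['Crawl4AI']:+.2f} pp")    # +3.79 pp
print(f"vs Firecrawl: {pct['fastCRW'] - pct['Firecrawl']:+.2f} pp")  # +7.70 pp

# Scrape-success uses a different denominator: reachable URLs.
print(f"scrape-success: {877 / 955:.1%}")  # 91.8%
```

One rounding detail is worth knowing when reproducing the gaps: computed directly from counts, the Firecrawl gap is 63 / 819, about 7.69 percentage points. The +7.70 figure results from subtracting the two rounded percentages.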

How to run diagnose_3way.py yourself

The benchmark is reproducible by design. Point diagnose_3way.py at the public scrape-content-dataset-v1, run all three scrapers against the same URLs in one pass, and score each returned page against its label with the 819-URL denominator. Use a single run with a fixed dataset snapshot so every tool sees identical inputs, and report the date — the live web changes, so an accuracy number with no date is meaningless. Our result is a single point-in-time run on 2026-05-08; reproduce it on your own date and your numbers may differ as the underlying pages change.
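For readers who want to see the shape of such a run before opening the repo, here is a minimal harness sketch. It is not the code of diagnose_3way.py, and it does not reproduce that script's interface: run_benchmark, the scraper callables, and the is_recovered rule are hypothetical stand-ins. It shows the properties described above: every tool sees the same labeled URLs, thrown errors are counted rather than dropped, the denominator stays at the number of labeled URLs, and the run date is recorded with the result.

```python
from datetime import date
from typing import Callable


def run_benchmark(
    labels: dict[str, str],
    scrapers: dict[str, Callable[[str], str]],
    is_recovered: Callable[[str, str], bool],
) -> dict:
    """Score every scraper against the same labeled URLs in one run."""
    report = {"run_date": date.today().isoformat(), "labeled_urls": len(labels), "tools": {}}
    for name, scrape in scrapers.items():
        recovered = errors = 0
        for url, label in labels.items():
            try:
                content = scrape(url)
            except Exception:
                errors += 1
                continue  # a thrown error still counts in the denominator
            if is_recovered(content, label):
                recovered += 1
        report["tools"][name] = {
            "recovered": recovered,
            "errors": errors,
            "truth_recall": recovered / len(labels),
        }
    return report


# Stub scrapers standing in for real tools.
def good(url: str) -> str:
    return f"body of {url}"


def thin(url: str) -> str:
    return "cookie banner"


def flaky(url: str) -> str:
    raise TimeoutError(url)


labels = {
    "https://example.com/a": "body of https://example.com/a",
    "https://example.com/b": "body of https://example.com/b",
}
report = run_benchmark(
    labels,
    {"good": good, "thin": thin, "flaky": flaky},
    lambda content, label: label in content,
)
for tool, stats in report["tools"].items():
    print(tool, stats)
```

In this toy run the thin scraper throws no errors and still scores zero recall, while the flaky scraper's failures show up in the errors column instead of disappearing from the denominator.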

(One housekeeping note for anyone digging through the repo: an older harness, run_bench.py, produced a different recall figure under a different scoring rule. It is quarantined and superseded — diagnose_3way.py is the harness of record. Do not mix figures from the two.)

How to read any vendor's accuracy claim

Whether the vendor is us or anyone else, three questions separate a measurable accuracy claim from marketing.

Demand the dataset, method, and date

Every accuracy number should arrive with the dataset it was measured on, the harness or method used to score it, and the date it was run. "63.74% truth-recall on Firecrawl's public 819-labeled-URL dataset, diagnose_3way.py, 2026-05-08" is checkable. "Industry-leading accuracy" is not. If a vendor cannot name all three, treat the number as unverified — including ours.
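The three-part check can be written down as a tiny data structure. This is only an illustrative sketch: AccuracyClaim and missing_provenance are hypothetical names, and the check confirms that the three fields are present, not that they are true.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AccuracyClaim:
    value: str
    dataset: Optional[str] = None
    method: Optional[str] = None
    run_date: Optional[str] = None


def missing_provenance(claim: AccuracyClaim) -> list[str]:
    """Names of the provenance fields a claim fails to state."""
    return [f.name for f in fields(claim) if f.name != "value" and not getattr(claim, f.name)]


checkable = AccuracyClaim(
    value="63.74% truth-recall",
    dataset="Firecrawl public scrape-content-dataset-v1 (819 labeled URLs)",
    method="diagnose_3way.py",
    run_date="2026-05-08",
)
vague = AccuracyClaim(value="Industry-leading accuracy")

print(missing_provenance(checkable))  # []
print(missing_provenance(vague))      # ['dataset', 'method', 'run_date']
```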

Watch for averaged or denominator-shifted numbers

Two tricks distort accuracy quietly. The first is shifting the denominator: quoting recall against something other than labeled URLs. A smaller base, such as only the URLs that returned content, inflates the fraction; a larger one, such as total URLs or total requests, shrinks it; either way the percentage no longer means recall. The second is averaging across runs or URL classes to smooth out a weak category. Ask what the denominator is, ask whether the figure is a single run or an average, and ask which URLs were excluded. An honest accuracy benchmark answers all three without flinching and hands you the script to check the rest.
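A worked example shows why those three questions matter. The per-class counts below are invented for illustration and do not come from the benchmark; the class names are hypothetical. A single pooled figure hides a weak category, and quietly excluding that category raises the headline number.

```python
# Invented per-class counts: (recovered, labeled).
by_class = {
    "news": (90, 100),
    "docs": (85, 100),
    "js_heavy_apps": (20, 100),
}


def pooled_recall(classes: dict[str, tuple[int, int]]) -> float:
    """Total recovered divided by total labeled across the given classes."""
    return sum(r for r, _ in classes.values()) / sum(n for _, n in classes.values())


per_class = {name: r / n for name, (r, n) in by_class.items()}
pooled = pooled_recall(by_class)
without_weak = pooled_recall({k: v for k, v in by_class.items() if k != "js_heavy_apps"})

for name, value in per_class.items():
    print(f"{name}: {value:.1%}")
print(f"pooled over all classes: {pooled:.1%}")         # 65.0%
print(f"weak class excluded:     {without_weak:.1%}")   # 87.5%
```

The pooled 65.0% gives no hint that one class sits at 20.0%, and dropping that class moves the headline to 87.5% without any scraper getting better.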

Sources

fastCRW canonical fact sheet — scrape benchmark of record (truth-recall, scrape-success, latency).
Benchmark run of record: bench/server-runs/RESULT_3WAY_1000_FULL.md and diagnose_3way.py (single run, 3,000 requests, 2026-05-08).
Dataset: Firecrawl's public scrape-content-dataset-v1 — 1,000 URLs, 819 labeled — firecrawl.dev (verified 2026-05-18).
fastCRW repo and benchmarks: github.com/us/crw · /benchmarks