Web Scraping Benchmark Methodology: Why p50/p90/p99

Our web scraping benchmark methodology: shared public dataset, percentile latency, labeled ground-truth recall, disclosed gaps. Why we never publish averages.

By Recep · June 24, 2026 · 9 min read · Last updated: June 2, 2026

By the fastCRW team · Benchmark numbers from diagnose_3way.py on Firecrawl's public scrape-content dataset, single run 2026-05-08; pricing/competitor facts verified 2026-05-18 · Verify independently before quoting.

Disclosure: we build fastCRW, so this is a vendor describing its own numbers. We've written it so you can hold us, and every other vendor, to the same bar, including the places where our numbers are the worst of the field.

Our web scraping benchmark methodology in one page

A web scraping benchmark methodology is only worth reading if you can tell, from the page alone, what was measured, on what data, and when. Most vendor benchmarks fail that test: they quote one flattering number with no dataset, no date, and no distribution. Ours rests on three rules that we apply to every figure we publish, and that you should demand from anyone selling you a scraper.

First, every benchmark runs on a shared public dataset, not a private one we curated to win. Second, latency is reported as a percentile split (p50/p90/p99), never as a single average. Third, every number carries an inline provenance phrase — dataset, method, and date — so it can be traced and reproduced. The rest of this post is just those three rules, defended.

Shared public dataset, not a private one

Our scrape benchmark of record runs on Firecrawl's own public scrape-content-dataset-v1: 1,000 URLs, of which 819 carry labeled ground truth. Using a competitor's published dataset removes the most common benchmark cheat — picking pages your engine happens to handle well. If the dataset belongs to the tool you're comparing against, you can't be accused of stacking the deck.

Percentiles over averages

We publish p50, p90, and p99 together for every latency claim. A median tells you what a typical request feels like; the tail tells you what your worst one-in-ten or one-in-a-hundred requests cost. Collapsing both into one "average" throws away the only number that decides whether a synchronous agent call times out. More on why averages are actively misleading below.
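
As a rough illustration of what reporting the three percentiles together involves, here is a minimal sketch using the nearest-rank definition. The function names and the sample latencies are hypothetical and are not taken from diagnose_3way.py, which may compute percentiles differently (for example with interpolation).

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def latency_split(samples_ms):
    """Return p50, p90 and p99 together so none is quoted in isolation."""
    return {f"p{q}": percentile(samples_ms, q) for q in (50, 90, 99)}

# Synthetic latencies in ms, for illustration only (not benchmark data).
samples_ms = [400] * 50 + [900] * 40 + [5000] * 9 + [20000]
split = latency_split(samples_ms)
mean_ms = sum(samples_ms) / len(samples_ms)
print(split, mean_ms)
```

On this made-up sample the mean lands above the p90, which is the kind of distortion a single average hides.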

Inline provenance on every number

Each statistic is written as "X on dataset Y (method Z, date)". For example: 63.74% truth-recall of 819 labeled URLs (diagnose_3way.py, 2026-05-08). The provenance is part of the claim, not a footnote — strip it and the number is no longer something we said. If a benchmark figure on this site lacks its dataset, method, and date, treat it as a bug and tell us.
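
One way to enforce that rule mechanically is to make the provenance a required part of the data structure, so a bare number cannot be rendered at all. The sketch below is an assumption about how that could look; the `Claim` class and its field names are hypothetical and not part of our tooling.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Claim:
    value: str      # the statistic itself, e.g. "63.74% truth-recall"
    dataset: str    # what it was measured on
    method: str     # harness or script that produced it
    run_date: date  # when the run happened

    def __post_init__(self):
        # Refuse to build a claim whose provenance fields are blank.
        for name in ("value", "dataset", "method"):
            if not getattr(self, name).strip():
                raise ValueError(f"claim is missing its {name}")

    def render(self) -> str:
        return f"{self.value} of {self.dataset} ({self.method}, {self.run_date.isoformat()})"

claim = Claim("63.74% truth-recall", "819 labeled URLs", "diagnose_3way.py", date(2026, 5, 8))
print(claim.render())
```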

Why a single average is a benchmark anti-pattern

Averages are the default way scraper speed gets reported, and they are the easiest number to mislead with. The problem is structural, not malicious: a scraper's latency distribution is heavily right-skewed, so one slow tail drags the mean far away from the experience of a typical request.

One slow tail destroys the mean

Web scraping latency is dominated by a long tail: a handful of pages that need a full headless browser, retries, or anti-bot recovery take ten to fifty times longer than a clean static fetch. A mean spreads those outliers across every request, so the "average" describes no request anyone actually made. The median (p50) is robust to that skew; the mean is not. That's why a single average misleads in whichever direction the tail pushes it: a few catastrophic requests can make a mostly fast engine look slow, and a large volume of fast requests can bury a tail that would still time out real calls.
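
A small synthetic example makes the skew concrete. The numbers below are invented for illustration (90 clean fetches and 10 requests that take thirty times longer) and are not benchmark data.

```python
import statistics

# Synthetic latencies in ms: 90 clean static fetches and 10 slow tail requests.
latencies_ms = [500] * 90 + [15000] * 10

mean_ms = statistics.mean(latencies_ms)      # pulled upward by the 10 slow requests
median_ms = statistics.median(latencies_ms)  # does not change however slow the tail gets

print(f"mean={mean_ms} ms, median={median_ms} ms")
```

The mean here is a latency that none of the 100 requests actually had, while the median matches what 90 of them experienced.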

A sub-second mean is impossible against our own percentiles

This isn't abstract for us. An earlier, superseded run of our own produced an implausibly low sub-second average that briefly circulated. It cannot be squared with the percentiles from the canonical run. In fast mode our p90 is 4348 ms (diagnose_3way.py, 2026-05-08), the lowest of the three tools tested, so the slowest tenth of requests alone contributes at least 0.10 × 4348 ≈ 435 ms to the mean. Add the 40% of requests that sit between the 1914 ms median and the p90, and the mean is already above 1.2 seconds before the fastest half is counted at all. That figure was a measurement artifact, and the fact that it sounds great is exactly why an average is dangerous: it looked like our best number and it was simply wrong.
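
A reported mean can be sanity-checked against published percentiles with a simple lower bound. The sketch below assumes latencies are non-negative and that the p50, p90 and p99 quoted on this page all describe the same distribution (the p90 is labeled fast mode, so treat that as an assumption). The helper name is hypothetical.

```python
def mean_lower_bound(percentiles_ms):
    """Lower bound on the mean implied by percentile values, assuming latencies >= 0.

    percentiles_ms maps a percentile (e.g. 90) to its value in ms. At least
    (100 - q)% of requests are at or above the q-th percentile, so each band
    between consecutive percentiles contributes at least its share times
    the value at its lower edge. Requests below the lowest percentile count as 0.
    """
    points = sorted(percentiles_ms.items())
    bound = 0.0
    for i, (q, value) in enumerate(points):
        next_q = points[i + 1][0] if i + 1 < len(points) else 100
        bound += (next_q - q) / 100 * value
    return bound

p90_only = mean_lower_bound({90: 4348})
full_split = mean_lower_bound({50: 1914, 90: 4348, 99: 15012})
print(round(p90_only, 1), round(full_split, 1))
```

Under those assumptions the p90 alone only guarantees a mean of roughly 435 ms, but the full split puts the floor at roughly 1.3 seconds, which is what rules out a sub-second average.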

We dropped that figure sitewide

We removed that sub-second average — and the related single speed-multiple claim — from every page on this site. We don't quote them in any context, even where the comparison would favor us, because once you accept a misleading average as a marketing line you've lost the right to criticize anyone else's. The honest replacement is the full distribution: p50 1914 ms (fastest of the three), p90 4348 ms in fast mode (lowest of the three: Crawl4AI 4754 ms, Firecrawl 6937 ms), p99 15012 ms (better than Firecrawl's 21107 ms). All three, every time.

Accuracy needs a labeled denominator

Latency is only half the story. The other half — did the scraper return the right content — needs a different kind of rigor, because "it returned something" is not accuracy. Accuracy requires comparing what came back against a known ground truth, which means you need a labeled denominator.

819 labeled URLs, not 3,000 requests

Our truth-recall metric measures how much labeled ground-truth content the scraper actually returned, scored against the 819 URLs in the dataset that carry labels. fastCRW's 63.74% is "of 819 labeled URLs", never "of 3,000 requests" and never "of 1,000". The denominator matters enormously: divide the same wins by 3,000 requests and you'd manufacture a much smaller-looking percentage; divide by a hand-picked subset and you'd inflate it. Fixing the denominator to the labeled set is what makes the 63.74% (the highest of the three, ahead of Crawl4AI's 59.95% and Firecrawl's 56.04%) comparable across tools. We cover the metric itself in depth in our benchmark write-up.
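
To see how much the denominator moves the headline, the sketch below divides one fixed count by each candidate denominator. The count of 522 is an assumption, back-calculated from 63.74% of 819 rather than taken from a published table, and it treats truth-recall as a simple count of wins, which may simplify the real metric.

```python
def recall_pct(hits, denominator):
    """Share of the denominator counted as a hit, as a percentage rounded to 2 decimals."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return round(100 * hits / denominator, 2)

# Assumption: 522 is inferred from 63.74% x 819, not a published figure.
hits = 522

labeled = recall_pct(hits, 819)        # the labeled set: the denominator we use
all_urls = recall_pct(hits, 1000)      # also counts the 181 URLs with no ground truth
all_requests = recall_pct(hits, 3000)  # counts every request in the three-tool run
print(labeled, all_urls, all_requests)
```

The same count reads as roughly 64%, 52%, or 17% depending only on what it is divided by.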

Pairing 0 errors with scrape-success, never alone

Across 3,000 requests our run threw 0 errors. That number is real, but quoted alone it tells only part of the story. The companion is scrape-success of reachable URLs: 91.8% (877 of 955 reachable URLs). The two describe different things: 0 errors means nothing crashed or threw an exception; 91.8% of reachable URLs means the engine extracted usable content from that share of the reachable set. We always pair them, as "91.8% scrape-success of reachable URLs, 0 errors", because either one in isolation misrepresents what the run actually produced.
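
A minimal sketch of the pairing rule: a formatter that only ever emits the two figures together. The function name and signature are hypothetical; the inputs are the counts quoted above.

```python
def run_summary(successes, reachable, errors):
    """Format scrape-success and the error count together, never one without the other."""
    if reachable == 0 or not 0 <= successes <= reachable:
        raise ValueError("need reachable > 0 and 0 <= successes <= reachable")
    pct = round(100 * successes / reachable, 1)
    return (
        f"{pct}% scrape-success of reachable URLs "
        f"({successes} of {reachable}), {errors} errors"
    )

summary = run_summary(successes=877, reachable=955, errors=0)
print(summary)
```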

Quarantining bad runs

The hardest part of an honest methodology isn't reporting the run you like — it's deciding what to do with the run you don't. Benchmarks are noisy; harnesses have bugs; an early implementation can produce a number that's simply wrong. The temptation is to keep whichever run flatters you. The discipline is to pick one harness of record and quarantine the rest.

Why run_bench.py's 43.7% is not cited

An earlier harness, run_bench.py, produced a truth-recall figure of 43.7%. We do not cite it as a result anywhere on this site, because it was superseded by a corrected harness; it appears here only to explain why. To be plain about the incentive: 43.7% is lower than our canonical 63.74%, so suppressing it removes a number that makes us look worse. The justification is not that the outcome is convenient but that the rule has to be "cite the harness of record, not the run with the convenient number". The same rule binds us in the other direction: it stops us from cherry-picking a high run later.

diagnose_3way.py is the harness of record

diagnose_3way.py is the single canonical harness for the scrape benchmark: one run, 3,000 requests, three tools, identical inputs, 2026-05-08. Every scrape number we publish traces to that file. When a benchmark has exactly one source of record, "which run did you mean?" stops being a way to dodge. If we re-run it and the numbers move, we'll date the new run and supersede the old one in public — the same way we retired that superseded sub-second average.

How to hold any vendor to this bar

You don't have to take our methodology on faith — you can use it as a checklist against everyone, us included. Three questions separate a benchmark you can trust from a marketing slide.

Demand dataset, method, and date

Ask: what URLs, measured how, and when? A benchmark without a named dataset is unfalsifiable. A benchmark without a date is stale by default — the web, anti-bot systems, and the tools themselves all change month to month. If a vendor can't answer all three for a given number, the number is decoration.

Demand the full percentile split

If you see one latency number, ask for p50, p90, and p99. A vendor that publishes only a median is hiding the tail; one that publishes only an average is hiding everything. The honest signal is a vendor that volunteers the full percentile split, as we do: p50 1914 ms (fastest), p90 4348 ms in fast mode (lowest of the three), and p99 15012 ms. The fast-mode p90 comes from the chrome-stealth fallback that recovers the 34 URLs others miss. Our deeper treatment of the median-versus-tail trade lives in why scraping latency varies so much and the percentile-by-percentile breakdown in p50 vs p90 vs p99 explained.
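
The percentile check can be written down as a small validator. This is a sketch under assumed conventions: the dictionary keys (`p50`, `p90`, `p99`, `mean`) and the function name are hypothetical, and the only figures used are the ones quoted on this page.

```python
REQUIRED = ("p50", "p90", "p99")

def check_latency_claim(claim_ms):
    """Return a list of problems with a latency claim; an empty list means it passes."""
    problems = [f"missing {p}" for p in REQUIRED if p not in claim_ms]
    if "mean" in claim_ms and problems:
        problems.append("average quoted without the full percentile split")
    # Percentiles that are present must not decrease from p50 to p99.
    values = [claim_ms[p] for p in REQUIRED if p in claim_ms]
    if values != sorted(values):
        problems.append("percentiles are not in non-decreasing order")
    return problems

full = check_latency_claim({"p50": 1914, "p90": 4348, "p99": 15012})
average_only = check_latency_claim({"mean": 950})
print(full, average_only)
```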

Demand reproducibility

The strongest benchmark is one you can re-run. Because our dataset is Firecrawl's public set and our harness is named, you can point diagnose_3way.py at the same URLs and check our numbers yourself. A benchmark you can reproduce is the only kind that survives the question "says who?" — and it's the bar we think every scraper vendor, including the one writing this, should be held to. The live results are at /benchmarks, and self-hosting the engine to reproduce them is free under AGPL-3.0 (see /pricing for the managed tiers).

Where this methodology costs us

It would be dishonest to present percentile-and-provenance discipline as pure virtue. It has a price, and the price is that our headline numbers are less impressive than a cherry-picked average would be. "p50 1914 ms, p90 4348 ms in fast mode" is a less punchy landing-page line than "sub-second average," even though it's the truer claim. Vendors who report a single flattering mean will always look faster at a glance. We've decided that being checkable beats looking fast — but you should know that's the trade, and that an engine optimizing for the benchmark headline rather than the distribution could post a prettier number on the same data.

Sources

  • Benchmark result of record — bench/server-runs/RESULT_3WAY_1000_FULL.md, harness diagnose_3way.py, run 2026-05-08
  • Dataset — Firecrawl public scrape-content-dataset-v1 (1,000 URLs / 819 labeled): firecrawl.dev
  • fastCRW repo and pricing — github.com/us/crw · fastcrw.com

Related: The fastCRW benchmark, explained · Why scraping latency varies · p50 vs p90 vs p99 in web scraping

Frequently asked questions

Why does fastCRW publish percentiles instead of an average latency?
Because web scraping latency is heavily right-skewed: a handful of slow pages drag the mean far from a typical request. The median (p50) describes the typical request and the tail (p90/p99) describes the worst cases that break synchronous agent calls. A single average hides both. We publish p50 1914 ms (fastest of three), p90 4348 ms in fast mode (lowest of three), and p99 15012 ms together (diagnose_3way.py, 2026-05-08) rather than collapsing them into one misleading number.

What dataset and date does fastCRW's benchmark use?
The scrape benchmark runs on Firecrawl's own public scrape-content-dataset-v1 (1,000 URLs, of which 819 carry labeled ground truth) via the diagnose_3way.py harness, a single run of 3,000 requests on 2026-05-08. Using a competitor's public dataset removes the suspicion that we hand-picked URLs our engine handles well.

Why is the 43.7% figure not used anywhere?
43.7% came from an earlier harness, run_bench.py, that was superseded by the corrected diagnose_3way.py harness. We quarantine it rather than cite it, even though it's lower than our canonical 63.74% truth-recall and suppressing it makes us look better, because the rule has to be "cite the harness of record, not the convenient run." Otherwise the same logic would let us cherry-pick a high run later.

How should I evaluate a vendor's benchmark methodology?
Ask three questions of every number: what dataset, measured how, and when (no named dataset and date means the claim is unfalsifiable or stale); does it show the full p50/p90/p99 split (one average hides the tail); and can you reproduce it (a named dataset plus a named harness lets you re-run and check). If a vendor can't answer all three, treat the number as decoration.

Why is the accuracy denominator 819 and not 1,000 URLs?
Only 819 of the dataset's 1,000 URLs carry labeled ground truth, and truth-recall measures how much of that known content the scraper returned, so the denominator must be the labeled set. fastCRW's 63.74% is "of 819 labeled URLs", never "of 3,000 requests" or "of 1,000", because changing the denominator would manufacture a different-looking percentage and make the three-way comparison incomparable.

