Why I Ran This Benchmark
When I started building CRW, I needed to understand where it actually stood relative to established tools. Not to "win" a benchmark — that's a useless goal — but to understand which workloads it handles well and where it falls short. Honest benchmarks shape better product decisions.
This post shares what we observed, how we measured, and what the numbers actually mean in practice. I've also included the scripts we used so you can run your own version against your own target URLs.
What We're Measuring and Why
Before looking at numbers, it's worth being precise about what the metrics mean.
Latency Percentiles: p50, p95, Mean
p50 (median): The latency below which half of all requests completed. This is the "typical" experience. It is more robust than the mean because a handful of extreme outliers barely moves it.
p95: The latency at which 95% of requests completed faster. This captures tail latency — the slow cases that happen regularly enough to matter in production. A high p95 means roughly 1 in 20 requests is meaningfully slower than the median, which is exactly the kind of variance that hurts interactive use.
Mean: The arithmetic average. Useful for cost calculations (total time / total requests) but can be misleading when outliers skew the distribution.
We report all three because they tell different stories. A tool with great p50 but terrible p95 might be fine for batch processing but unacceptable for interactive use. A tool with similar p50 and p95 has more predictable behavior.
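To make the difference between the three metrics concrete, here is a minimal sketch that computes them for a small sample. The sample values are hypothetical (not taken from our run), and `summarize` and `tail_ratio` are names chosen for this illustration only.

```python
import statistics


def percentile(samples, p):
    """Linear-interpolated percentile of samples, with p in the range 0..100."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples")
    k = (len(ordered) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (k - lo)


def summarize(latencies_ms):
    """Return p50, p95, mean, and the p95/p50 ratio for a list of latencies."""
    p50 = percentile(latencies_ms, 50)
    p95 = percentile(latencies_ms, 95)
    return {
        "p50": p50,
        "p95": p95,
        "mean": statistics.mean(latencies_ms),
        "tail_ratio": p95 / p50,  # close to 1.0 means predictable latency
    }


# Hypothetical sample: 18 fast requests and 2 slow ones (milliseconds).
sample = [100.0] * 18 + [2000.0] * 2
stats = summarize(sample)
print(stats)
```

In this made-up sample the median stays at the fast value, the mean is pulled upward by the two slow requests, and p95 lands on the slow value, which is why looking at any one number alone can mislead.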
Wall-Clock Time
We measured wall-clock time: the elapsed real time from sending the HTTP request to receiving the complete response body. This includes:
- DNS resolution
- TCP connection establishment
- TLS handshake
- Server-side processing (fetch, parse, convert)
- Network transfer of the response
We chose wall-clock over CPU time because wall-clock reflects what users actually experience. A tool that's CPU-efficient but has high network overhead still feels slow.
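The gap between the two clocks is easy to see in a small sketch. Here `fake_network_wait` is a stand-in for waiting on a remote server (an assumption for illustration; no real request is made), and `measure` is a hypothetical helper name.

```python
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent running fn()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


def fake_network_wait():
    # Sleeping models time spent waiting on the network: it consumes
    # wall-clock time but almost no CPU time.
    time.sleep(0.2)


wall, cpu = measure(fake_network_wait)
print(f"wall={wall * 1000:.0f} ms  cpu={cpu * 1000:.0f} ms")
```

A CPU-time measurement would report this call as nearly free, even though the caller waited the full duration.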
Coverage: What It Precisely Means
Coverage = (URLs returning non-empty, parseable content) / (total URLs attempted) × 100.
A URL "passes" coverage if: the response has HTTP 200, the response body contains at least 100 characters of text, and the text is parseable (not garbled encoding, not just HTML boilerplate). A URL "fails" if: it times out, returns 4xx/5xx, returns an empty body, or returns only whitespace/navigation elements.
Coverage is a rough measure of practical usefulness — a result that technically returns 200 but contains only a JavaScript loading spinner isn't useful.
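The pass/fail rule can be sketched roughly as follows. This is a simplified approximation, not the exact check from our run: it treats Unicode replacement characters as the only sign of garbled encoding and does not try to detect navigation-only boilerplate. The function names are illustrative.

```python
MIN_TEXT_CHARS = 100  # threshold from the coverage definition


def passes_coverage(status_code, text):
    """Simplified pass/fail check for a single scrape result."""
    if status_code != 200 or text is None:
        return False
    stripped = text.strip()
    if len(stripped) < MIN_TEXT_CHARS:
        return False
    # Crude garbled-encoding check (an assumption for this sketch):
    # U+FFFD appears when bytes could not be decoded.
    return "\ufffd" not in stripped


def coverage_percent(results):
    """results is a list of (status_code, text) pairs."""
    if not results:
        return 0.0
    passed = sum(1 for status, text in results if passes_coverage(status, text))
    return 100 * passed / len(results)


# Hypothetical results: one pass, three different failure modes.
results = [
    (200, "x" * 500),      # enough text: pass
    (200, "   \n  "),      # whitespace only: fail
    (404, "Not found"),    # non-200 status: fail
    (200, "Loading..."),   # too short, e.g. a loading spinner: fail
]
print(coverage_percent(results))
```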
Dataset Composition
We used 500 URLs sampled from Scrapeway's public benchmark dataset with adjustments to match our expected production workload distribution.
Breakdown by Site Type
| Category | Count | % of corpus | JS required |
|---|---|---|---|
| Documentation/technical blogs | 150 | 30% | ~10% |
| News articles | 125 | 25% | ~15% |
| E-commerce product pages | 100 | 20% | ~40% |
| Company/SaaS marketing pages | 75 | 15% | ~50% |
| Wikipedia / encyclopedia pages | 50 | 10% | <5% |
Roughly 25–30% of URLs in the corpus required JavaScript execution for meaningful content retrieval. The rest were static HTML or server-rendered pages. This ratio is intentional — it mirrors the distribution we see in real RAG pipeline workloads.
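If you want to build a corpus with the same category mix from your own URL pool, a stratified sample is one way to do it. The sketch below is an assumption about how such a sample could be drawn, not the procedure we used; the category keys and the `example.com` pool are placeholders.

```python
import random

# Per-category counts from the table above (500 URLs in total).
TABLE_COUNTS = {
    "docs": 150,
    "news": 125,
    "ecommerce": 100,
    "marketing": 75,
    "encyclopedia": 50,
}
TABLE_TOTAL = sum(TABLE_COUNTS.values())


def stratified_sample(pool, total, seed=0):
    """Draw `total` URLs from pool, keeping the table's category proportions."""
    rng = random.Random(seed)  # fixed seed so the corpus is reproducible
    sample = {}
    for category, count in TABLE_COUNTS.items():
        wanted = round(total * count / TABLE_TOTAL)
        sample[category] = rng.sample(pool[category], wanted)
    return sample


# Placeholder pool: 1,000 made-up URLs per category.
pool = {
    category: [f"https://example.com/{category}/{i}" for i in range(1000)]
    for category in TABLE_COUNTS
}
corpus = stratified_sample(pool, 500)
print({category: len(urls) for category, urls in corpus.items()})
```

For totals that do not divide evenly, rounding can leave the sum one or two URLs off the target, so check the final count before running.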
Why Dataset Composition Matters for Interpretation
A benchmark corpus biased toward SPAs would heavily favor Playwright-based tools (Firecrawl, Crawl4AI). A corpus biased toward static HTML would favor lightweight tools like CRW. Our corpus reflects a mixed workload — which is honest for most real-world use cases but means results shouldn't be extrapolated to all-SPA or all-static scenarios.
Benchmark Setup
Environment: All tools ran in Docker containers on the same hardware: 4 vCPU (AMD EPYC), 8 GB RAM, Ubuntu 22.04. Same network, same source IPs, same DNS resolver.
Test mode: Sequential (not parallel) to isolate per-request latency. Parallel throughput is a different measurement covered in the Throughput section below.
Repetitions: Each URL was scraped 3 times; we took the median of the 3 runs to reduce measurement noise from transient network conditions.
Warmup: All services were given a 2-minute warmup period (10 warmup requests) before timed runs, to ensure connection pools were populated and caches warm.
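The warmup and median-of-three logic is not part of the script shown in the next section, so here is a minimal sketch of how it could be layered on top. `benchmark` and `timed_ms` are hypothetical helpers, and `fake_scrape` stands in for a real request.

```python
import statistics
import time


def timed_ms(fn, url):
    """Wall-clock duration of one fn(url) call, in milliseconds."""
    start = time.perf_counter()
    fn(url)
    return (time.perf_counter() - start) * 1000


def benchmark(fn, urls, warmup_urls=(), reps=3):
    """Run untimed warmup calls, then return {url: median latency over reps}."""
    for url in warmup_urls:
        fn(url)  # warmup: results and timings are discarded
    return {
        url: statistics.median(timed_ms(fn, url) for _ in range(reps))
        for url in urls
    }


calls = []


def fake_scrape(url):
    # Stand-in for an HTTP request to the scraper under test.
    calls.append(url)
    time.sleep(0.01)


result = benchmark(
    fake_scrape,
    ["https://example.com/a", "https://example.com/b"],
    warmup_urls=["https://example.com/warm"] * 10,
)
print(result)
```

Taking the median of three runs per URL means a single slow run caused by a transient network hiccup does not end up in the reported figure.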
Benchmark Setup Scripts
Here's the core benchmarking script we used. You can run a similar test against your own URL list:
#!/usr/bin/env python3
# benchmark.py: run against any Firecrawl-compatible API
import statistics
import sys
import time

import httpx

# Base URLs of the services under test
TOOLS = {
    "crw": "http://localhost:3000",
    "firecrawl": "http://localhost:3001",
}


def scrape_url(base_url: str, url: str, api_key: str = "test") -> tuple[float, bool]:
    """Scrape one URL; return (elapsed seconds, whether it counts toward coverage)."""
    start = time.perf_counter()
    try:
        r = httpx.post(
            f"{base_url}/v1/scrape",
            json={"url": url, "formats": ["markdown"]},
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30.0,
        )
        elapsed = time.perf_counter() - start
        # Simplified coverage check: HTTP 200 and at least 100 characters of markdown
        ok = r.status_code == 200 and len(r.json().get("data", {}).get("markdown", "")) >= 100
        return elapsed, ok
    except Exception:
        # Timeouts, connection errors, and malformed JSON all count as failures
        return time.perf_counter() - start, False


def percentile(data: list[float], p: float) -> float:
    """Linear-interpolated percentile; sorts a copy so the caller's list is untouched."""
    ordered = sorted(data)
    k = (len(ordered) - 1) * p / 100
    f = int(k)
    c = f + 1
    if c >= len(ordered):
        return ordered[f]
    return ordered[f] + (ordered[c] - ordered[f]) * (k - f)


with open(sys.argv[1]) as fh:
    urls = [line.strip() for line in fh if line.strip()]

for name, base in TOOLS.items():
    latencies, successes = [], 0
    for url in urls:
        elapsed, ok = scrape_url(base, url)
        latencies.append(elapsed * 1000)  # ms
        if ok:
            successes += 1
        time.sleep(0.1)  # polite delay
    print(f"\n{name}:")
    print(f"  p50: {percentile(latencies, 50):.0f} ms")
    print(f"  p95: {percentile(latencies, 95):.0f} ms")
    print(f"  mean: {statistics.mean(latencies):.0f} ms")
    print(f"  coverage: {successes}/{len(urls)} ({100*successes/len(urls):.1f}%)")
Run it with a text file of URLs (one per line):
python3 benchmark.py urls.txt
Latency Results
Rather than freeze a single point-in-time latency table here — numbers that drift with every release of every tool — we publish the full latency distribution (p50/p95/mean, per tool, per run) alongside the exact dataset on our public /benchmarks page, with a one-command repro so you can regenerate it yourself.
The durable, defensible finding from that run: 63.74% truth-recall (522 of 819 labeled URLs), 87.7% scrape success, 0 errors. CRW's Rust implementation is lower-latency than the Node.js and Python-based alternatives on standard HTML content because there's no headless-browser process in the hot path. The gap narrows on JavaScript-heavy pages — when a browser render is required, rendering time dominates regardless of the wrapper language.
The tail behavior is what matters most for interactive use: CRW's p95 stays close to its median, so occasional slowness is rare. Browser-render-first tools show a much wider p50→p95 spread, which is visible to users in latency-sensitive applications.
Crawl Coverage Results
On the labeled public dataset, CRW reached 87.7% scrape success with 0 errors, and a truth-recall of 63.74% (522 of 819 labeled URLs). The per-tool, per-category coverage breakdown — including timeout vs. empty-body failure modes — is published with the dataset on /benchmarks so it stays current as every tool evolves.
Coverage surprised us. We expected a browser-render-first stack to perform better here. In our dataset, lol-html's aggressive streaming parser handled malformed HTML more gracefully than a full rendering pipeline — which occasionally timed out or returned empty responses for slow-loading pages.
Browser-render-first tools tend to have a higher timeout rate, which is largely a function of headless Chromium taking longer per page under a stricter timeout budget. When pages don't load within the timeout window, the request fails completely.
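When you run your own comparison, it helps to separate the failure modes instead of lumping them into one "failed" bucket. The sketch below shows one way to do that; the `Outcome` categories and the `classify` helper are illustrative assumptions, not the classification code behind our published breakdown.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    TIMEOUT = "timeout"
    HTTP_ERROR = "http_error"
    EMPTY_BODY = "empty_body"


def classify(status_code, body, timed_out=False):
    """Map one scrape attempt to a single outcome category."""
    if timed_out:
        return Outcome.TIMEOUT
    if status_code != 200:
        return Outcome.HTTP_ERROR
    if not body or not body.strip():
        return Outcome.EMPTY_BODY
    return Outcome.OK


# Hypothetical attempts: (status_code, body, timed_out)
attempts = [
    (200, "Some extracted article text", False),
    (None, None, True),    # request exceeded the timeout budget
    (200, "   ", False),   # rendered, but nothing extractable
    (503, "Service unavailable", False),
    (None, None, True),
]
breakdown = Counter(classify(*attempt) for attempt in attempts)
print({outcome.value: count for outcome, count in breakdown.items()})
```

A tool whose failures are mostly timeouts may improve with a larger timeout budget, while one whose failures are mostly empty bodies will not.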
Memory Usage
The structural memory difference is the durable point, not a single benchmark figure. CRW is a single static binary with no headless browser in its default path, so its resident footprint is a small fraction of a browser-render-first stack — and, critically, it has no large unreclaimable baseline. Browser-render-first tools carry a heavy idle baseline (the headless engine's private heap) that cannot be reclaimed regardless of traffic, and they grow further under load as renderer processes spawn.
Memory Profiling Details
We measure memory using two tools: docker stats for RSS (Resident Set Size) and pmap -x for heap breakdown. "Idle" is measured after a 60-second warmup with zero active requests. "Under load" is measured at peak during a 50-concurrent-request burst sustained for 30 seconds. The full per-tool memory table is published with the rest of the run on /benchmarks.
CRW's memory profile is dominated by connection buffers, parse state, and response buffers, plus the static binary's own code/data and shared libs — there is no browser heap. A browser-render-first tool's profile has a fundamentally different shape: a large share of its idle footprint is the headless engine's private heap, which can't be reclaimed regardless of traffic, and under load it spawns additional renderer processes that each add a substantial increment.
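For the docker stats side of the measurement, a small helper can turn the reported usage into bytes so idle and under-load readings are easy to compare. This is a sketch under assumptions: it relies on the `{{.MemUsage}}` format placeholder printing a string like `21.5MiB / 7.764GiB`, the container name is whatever you passed to `--name`, and the sample string below is made up.

```python
import re
import subprocess

# Binary units as printed in the MemUsage column (assumed for this sketch).
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_mem_usage(field):
    """Convert a 'used / limit' string into the used amount in bytes."""
    used = field.split("/")[0].strip()
    match = re.fullmatch(r"([\d.]+)\s*([A-Za-z]+)", used)
    if not match or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognized memory value: {field!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def container_memory_bytes(name):
    """Take one docker stats sample for a running container (needs Docker)."""
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", name],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_mem_usage(out.strip())


# Parsing a made-up sample string; no Docker daemon is needed for this line.
print(parse_mem_usage("21.5MiB / 7.764GiB"))
```

To approximate a peak reading, call `container_memory_bytes` in a loop during the load burst and keep the maximum.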
JavaScript-Heavy Pages: Separate Analysis
We isolate the subset of corpus URLs that require JavaScript execution for meaningful content (SPAs, lazy-loaded articles, client-rendered product pages) and report it separately on /benchmarks, because mixing it into the headline number would misrepresent both workloads.
For JavaScript-heavy pages, CRW's latency advantage largely disappears — rendering time dominates — and its coverage is lower on this subset than its overall figure. LightPanda is still maturing and doesn't yet implement the full browser API surface that Playwright (Chromium) covers.
The honest takeaway: if your workload is predominantly SPAs, Crawl4AI or Firecrawl's Playwright-based rendering gives better coverage today. CRW is a better fit for HTML-primary content.
Throughput vs. Latency: Different Workloads
The latency results above come from sequential requests: one at a time, timing each request individually. This is the right metric for interactive use cases where a user is waiting for a single result.
For batch pipelines, parallel throughput is what matters: how many pages can you process per second when running many requests concurrently?
Because CRW has no per-request browser process, parallel throughput scales with available CPU and connection limits rather than with renderer memory. Browser-render-first tools become memory-constrained at high concurrency — renderer processes are the bottleneck — so their pages/sec plateaus much earlier on the same hardware. The full pages/sec-by-worker-count table is published with the run on /benchmarks.
Note that throughput measurements are system-dependent. On a machine with more RAM, a browser-render-first tool's numbers improve. On a memory-constrained server, CRW maintains its throughput while browser-based stacks degrade faster.
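A rough way to measure pages per second at different worker counts is to push the same URL list through a thread pool. The sketch below uses a `fake_scrape` stand-in that just sleeps, so the numbers it prints say nothing about any real tool; swap in a function that calls your scraper's API to get meaningful figures.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def pages_per_second(fn, urls, workers):
    """Run fn over urls with a fixed-size thread pool; return throughput."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fn, urls))  # consume results so all calls finish
    elapsed = time.perf_counter() - start
    return len(urls) / elapsed


def fake_scrape(url):
    # Stand-in for a real request; each "page" takes about 10 ms.
    time.sleep(0.01)


urls = [f"https://example.com/{i}" for i in range(40)]
for workers in (1, 4, 8):
    rate = pages_per_second(fake_scrape, urls, workers)
    print(f"{workers} workers: {rate:.0f} pages/sec")
```

With a real backend, the point where adding workers stops increasing the rate shows where the tool becomes bound by CPU, memory, or connection limits on your hardware.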
How to Run Your Own Benchmark
The most meaningful benchmark is one run against your own target URLs. Here's a complete self-contained script:
#!/bin/bash
# run_benchmark.sh: requires Docker, Python 3, httpx
# Usage: ./run_benchmark.sh your_urls.txt
set -e
export URLS_FILE="${1:-urls.txt}"

echo "Starting CRW..."
# Publish on host port 3000 so it matches the TOOLS table in the Python section
docker run -d --name bench-crw -p 3000:3000 -e CRW_API_KEY=test ghcr.io/us/crw:latest

echo "Starting Firecrawl (requires docker compose)..."
echo "See https://github.com/mendableai/firecrawl for self-host setup"
echo "Firecrawl needs Redis + workers; a single docker run won't work."
echo "Assuming Firecrawl is already running on port 3001."

sleep 5  # wait for CRW to be ready

echo "Running benchmark..."
python3 - <<'PYEOF'
import os
import time

import httpx

TOOLS = {
    "crw": ("http://localhost:3000", "test"),
    "firecrawl": ("http://localhost:3001", "test"),
}


def scrape(base, key, url):
    """Return (latency in ms, whether the result counts toward coverage)."""
    start = time.perf_counter()
    try:
        r = httpx.post(f"{base}/v1/scrape",
                       json={"url": url, "formats": ["markdown"]},
                       headers={"Authorization": f"Bearer {key}"},
                       timeout=30.0)
        ms = (time.perf_counter() - start) * 1000
        ok = r.status_code == 200 and len(r.json().get("data", {}).get("markdown", "")) >= 100
        return ms, ok
    except Exception:
        return (time.perf_counter() - start) * 1000, False


urls_file = os.environ.get("URLS_FILE", "urls.txt")
with open(urls_file) as f:
    urls = [l.strip() for l in f if l.strip()][:100]  # cap the run at 100 URLs

for name, (base, key) in TOOLS.items():
    lats, hits = [], 0
    for u in urls:
        ms, ok = scrape(base, key, u)
        lats.append(ms)
        hits += ok
        time.sleep(0.05)
    lats.sort()

    def pct(q):
        # Index-based percentile, clamped so q=100 cannot run past the end
        return lats[min(int(len(lats) * q / 100), len(lats) - 1)]

    print(f"\n{name}: p50={pct(50):.0f}ms p95={pct(95):.0f}ms "
          f"mean={sum(lats)/len(lats):.0f}ms coverage={hits}/{len(urls)}")
PYEOF

echo "Stopping CRW container..."
docker rm -f bench-crw
What Changed Since We First Ran This
Benchmarks are point-in-time snapshots. Our first run was in late 2025; the results above reflect early 2026.
Changes since the first run:
- CRW p50 improved — primarily from reqwest connection pool tuning and lol-html selector optimization
- Firecrawl coverage improved — Firecrawl v1.5 added better timeout handling; its coverage was lower in our original test
- Crawl4AI added async mode — their batch throughput improved significantly with async browser pooling
These results will continue to change as all tools evolve. If you're making a significant infrastructure decision based on performance, run your own test against your actual workload. We try to re-run our benchmark with each major release.
Where the Results Surprised Us
Coverage was higher than expected. We expected CRW's simpler HTML parser to miss content that a full browser would catch. For standard HTML pages, lol-html's streaming approach actually handled malformed HTML more reliably than headless Chrome, which hit rendering timeouts more often.
Firecrawl's latency was higher than we remembered from hosted API tests. Self-hosted Firecrawl performs differently from the hosted API, which uses proxy routing and optimized infrastructure. Don't conflate hosted-API benchmarks with self-hosted ones.
What These Numbers Mean in Practice
The practical implication of a lower-latency, no-browser-in-the-hot-path design is simple: a large sequential scrape job finishes in a fraction of the wall-clock time of a browser-render-first stack, and at high concurrency the gap widens further because CRW isn't memory-bound by renderer processes. Run the one-command repro on /benchmarks against your own URL list to see the exact wall-clock numbers for your workload.
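To translate a measured mean latency into a job duration for your own workload, a back-of-the-envelope estimate is usually enough. The sketch below assumes per-request latency stays constant as workers are added, which is optimistic for any tool, and the input values are placeholders rather than measurements from our run.

```python
import math


def estimated_job_seconds(url_count, mean_latency_ms, workers=1):
    """Rough wall-clock estimate: requests are processed in batches of `workers`."""
    batches = math.ceil(url_count / workers)
    return batches * mean_latency_ms / 1000


# Placeholder inputs: replace with the mean latency you measured yourself.
sequential = estimated_job_seconds(10_000, 200)
parallel = estimated_job_seconds(10_000, 200, workers=8)
print(f"sequential: {sequential:.0f} s, 8 workers: {parallel:.0f} s")
```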
For memory budgets, the difference is structural: you can pack many CRW instances onto a small server because each is a lightweight static binary, whereas the same number of browser-render-first instances needs a far larger machine just for the headless-engine baseline.
Limitations of This Benchmark
- Anti-bot performance: We only tested publicly accessible pages. For CAPTCHA-protected or fingerprint-checking targets, results differ substantially.
- SPA coverage: Our corpus was biased toward HTML-heavy content. An all-SPA corpus would show different rankings.
- Content quality: We measured whether content was returned, not whether it was clean. Qualitative comparison is harder.
- Hosted vs. self-hosted: We tested self-hosted versions. The fastCRW hosted API and Firecrawl's hosted API have different latency profiles.
Try It Yourself
Self-host CRW and run your own benchmark:
docker run -p 3000:3000 -e CRW_API_KEY=your-key ghcr.io/us/crw:latest
Or use fastCRW, the managed version, which comes with 500 one-time lifetime credits (not a monthly meter) and requires no credit card.
