By the fastCRW team · Benchmark and credit figures verified 2026-05-18 · fastCRW launch pricing expires 2026-06-01 · Verify independently before relying on numbers.
Building a salary benchmarking web scraping tool
A salary benchmarking web scraping tool turns scattered, pay-disclosing job postings and public compensation pages into one structured dataset you can query: median and percentile pay by role, level, and location. The build is a pipeline — crawl the sources, extract pay-range fields into a schema, normalize currencies and pay periods, then aggregate into percentiles. This guide walks each stage and is honest about the two things that erode a benchmark's credibility: per-page cost (it caps how wide a sample you can afford) and missed postings (silently dropped pages shrink the sample without telling you).
Both of those pressures point at the same two engine properties: a flat per-page price and high extraction recall. fastCRW crawls at 1 credit per page with a free, self-hostable AGPL-3.0 engine, and on Firecrawl's public 819-URL labeled dataset it posted the highest truth-recall of the three tools tested — 63.74% (diagnose_3way.py, 2026-05-08) — so fewer pay-disclosing postings fall out of your sample before they reach the aggregation step.
What a salary benchmarking dataset needs
Role, location, and pay-range fields
Every usable record needs a small, consistent core: normalized role title, seniority/level, location (and remote flag), employer, currency, pay period (hourly/monthly/annual), and a min/max pay range. The raw postings will give you none of this cleanly — titles are free text, ranges are written a dozen ways ("$120k–$150k", "120,000-150,000 USD/yr", "£60k+"), and many postings bury pay in prose. The extraction step exists to coerce all of that into the same shape so the aggregation step can trust it.
Sample size and freshness
A benchmark's credibility scales with two things: how many comparable postings sit behind each percentile, and how recent they are. A "Senior Backend Engineer, Berlin" benchmark built on eight stale postings is noise; the same cell with several hundred postings refreshed weekly is a signal. That is why both cost-per-page and recall matter so directly here — they govern how big and how current the sample can be without the bill or the gaps getting out of hand.
Collecting compensation signals
Crawl postings that disclose pay ranges
Start with /v1/crawl against listing index pages so the engine discovers and fetches posting URLs for you (async BFS, returns a job ID; poll /v1/crawl/:id for results). crawl accepts maxDepth (cap 10) and maxPages (cap 1000), so a single job tops out at 1,000 pages — for a large multi-board pass, run several scoped jobs rather than one unbounded crawl. If you already have a feed of posting URLs, iterate /v1/scrape across them concurrently instead; there is no batch endpoint, so concurrency is how you go wide (more on that below). For the upstream board-scraping mechanics, see the job board scraper guide and list crawling for structured data.
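As a rough sketch of the scoped-jobs approach, the helpers below build one capped crawl payload per listing index and poll a job until it finishes. The endpoint paths and the maxDepth/maxPages caps come from the text above. The "url" payload key, the "status" field and its values, and the injected http_get transport are assumptions for illustration, so check them against the API reference before use.

```python
import time
from typing import Callable

MAX_DEPTH_CAP = 10    # per-job cap stated in the guide
MAX_PAGES_CAP = 1000  # per-job cap stated in the guide


def build_crawl_jobs(index_urls: list[str], max_depth: int = 2,
                     max_pages: int = 1000) -> list[dict]:
    """Build one scoped crawl payload per listing index, clamped to the caps.

    ASSUMPTION: the request body names the start page with a "url" key.
    """
    depth = max(1, min(max_depth, MAX_DEPTH_CAP))
    pages = max(1, min(max_pages, MAX_PAGES_CAP))
    return [{"url": u, "maxDepth": depth, "maxPages": pages} for u in index_urls]


def poll_crawl(job_id: str, http_get: Callable[[str], dict],
               interval_s: float = 2.0, max_polls: int = 150) -> dict:
    """Poll /v1/crawl/:id until the job reports a terminal status.

    ASSUMPTION: the response has a "status" field whose terminal values
    are "completed" or "failed". http_get is any function that takes a
    path and returns the decoded JSON body.
    """
    for _ in range(max_polls):
        body = http_get(f"/v1/crawl/{job_id}")
        if body.get("status") in ("completed", "failed"):
            return body
        time.sleep(interval_s)
    raise TimeoutError(f"crawl {job_id} did not finish after {max_polls} polls")
```

Injecting the transport keeps the polling logic testable without network access, and the hard poll limit means a stuck job surfaces as an error instead of hanging the pipeline.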
Extract structured pay fields with a JSON schema
On each posting, ask for structured output with formats: ["json"] plus a jsonSchema — this is the LLM-extraction path (5 credits per request). A schema for a comp record looks like:
- role_title, seniority: strings
- location, remote: string + boolean
- currency, pay_period: enums ("USD"/"EUR"…, "hour"/"month"/"year")
- pay_min, pay_max: numbers
- pay_disclosed: boolean, so you can keep "no pay listed" as a first-class outcome instead of guessing
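The field list above could be expressed as a JSON Schema along these lines. This is a minimal sketch: the formats and jsonSchema keys come from the text, while the "url" key, the third currency in the enum, and the choice of required fields are illustrative assumptions.

```python
import json

# Pay fields are left optional so a posting with no listed pay can return
# pay_disclosed = false without the model inventing numbers.
COMP_SCHEMA = {
    "type": "object",
    "properties": {
        "role_title": {"type": "string"},
        "seniority": {"type": "string"},
        "location": {"type": "string"},
        "remote": {"type": "boolean"},
        # ASSUMPTION: "GBP" is an illustrative addition; extend to your markets.
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        "pay_period": {"type": "string", "enum": ["hour", "month", "year"]},
        "pay_min": {"type": "number"},
        "pay_max": {"type": "number"},
        "pay_disclosed": {"type": "boolean"},
    },
    "required": ["role_title", "pay_disclosed"],
}


def build_scrape_payload(url: str) -> dict:
    """Request body for a structured-extraction scrape of one posting.

    ASSUMPTION: the target page is passed under a "url" key.
    """
    return {"url": url, "formats": ["json"], "jsonSchema": COMP_SCHEMA}


if __name__ == "__main__":
    print(json.dumps(build_scrape_payload("https://example.com/jobs/123"), indent=2))
```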
LLM-based JSON extraction is a managed feature available on paid plans — fastCRW runs the model for you, so there is nothing to operate yourself (the FREE plan has no LLM features). The deeper schema-design patterns (enums, nested objects, required fields, retry on schema-validation failure) live in the JSON-schema extraction guide.
Normalizing currencies and periods
Extraction gets you typed fields; it does not get you comparable ones. Do the normalization in your own code after extraction: convert hourly and monthly figures to a single annualized base (e.g. multiply hourly by your assumed annual hours), convert currencies to one reporting currency with a dated FX rate you store alongside the record, and snap free-text titles to a small controlled taxonomy. Keep the raw extracted values too — when an FX rate or annualization assumption changes, you want to recompute from source rather than re-crawl.
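One way to sketch that normalization step is below. The 2,080-hour year, the FxRate type, and the output field names are assumptions chosen for illustration. The raw extracted values are carried through untouched so a later change in assumptions only requires a recompute.

```python
from dataclasses import dataclass
from datetime import date

HOURS_PER_YEAR = 2080  # ASSUMPTION: 40 hours x 52 weeks; pick your own basis
PERIOD_FACTOR = {"hour": HOURS_PER_YEAR, "month": 12, "year": 1}


@dataclass(frozen=True)
class FxRate:
    """Hypothetical FX record: multiplier into the reporting currency."""
    rate_to_reporting: float
    as_of: date


def annualize(amount: float, pay_period: str) -> float:
    """Convert an hourly, monthly, or annual figure to an annual one."""
    return amount * PERIOD_FACTOR[pay_period]


def normalize(record: dict, fx: dict[str, FxRate]) -> dict:
    """Return the raw record plus annualized, single-currency pay fields."""
    rate = fx[record["currency"]]
    return {
        **record,  # raw values kept for later recomputation
        "annual_min": annualize(record["pay_min"], record["pay_period"]) * rate.rate_to_reporting,
        "annual_max": annualize(record["pay_max"], record["pay_period"]) * rate.rate_to_reporting,
        "fx_rate": rate.rate_to_reporting,
        "fx_as_of": rate.as_of.isoformat(),  # dated rate stored with the record
    }
```

Title normalization is left out here because the taxonomy is specific to each benchmark; a lookup table from raw title patterns to controlled titles is a common starting point.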
Aggregating into benchmarks
Percentiles by role and location
Group normalized records by (role, level, location) and compute p25/p50/p75/p90 on the annualized midpoint (or on min/max separately if you want a range view). Attach a sample count to every cell and suppress cells below a floor: a percentile over three postings is not a benchmark, and showing it is worse than showing nothing. This is also where the recall number pays off. At 63.74% truth-recall on the labeled set (diagnose_3way.py, 2026-05-08), more of the pay-disclosing postings you crawled make it into the cells than with a lower-recall engine, which can be the difference between, say, a p50 backed by 200 postings and one backed by 120 (illustrative counts, not benchmark output).
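A minimal sketch of the grouping and suppression logic follows. The record keys (role, level, location, annual_min, annual_max), the floor of 30, and the linear-interpolation percentile are assumptions; any consistent percentile definition works as long as it is documented.

```python
from collections import defaultdict


def percentile(sorted_vals: list[float], p: float) -> float:
    """Percentile by linear interpolation between closest ranks (p in 0..100)."""
    if not sorted_vals:
        raise ValueError("no values")
    k = (len(sorted_vals) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def benchmark(records: list[dict], min_n: int = 30) -> dict:
    """Percentiles of the annualized midpoint per (role, level, location).

    Cells with fewer than min_n postings are reported with their count
    but no percentiles. ASSUMPTION: 30 is an arbitrary example floor.
    """
    cells = defaultdict(list)
    for r in records:
        key = (r["role"], r["level"], r["location"])
        cells[key].append((r["annual_min"] + r["annual_max"]) / 2)

    out = {}
    for key, mids in cells.items():
        mids.sort()
        cell = {"n": len(mids), "suppressed": len(mids) < min_n}
        if not cell["suppressed"]:
            for p in (25, 50, 75, 90):
                cell[f"p{p}"] = percentile(mids, p)
        out[key] = cell
    return out
```

Suppressed cells still appear in the output with their sample count, so a reader can see that data exists but is too thin to publish.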
Storing rolling snapshots yourself (stateless)
fastCRW is stateless per request — it scrapes and returns, it does not remember prior runs. That is a feature for a benchmarking tool: you own the history. Write each run's normalized records into your own store with a captured-at timestamp so you can compute rolling windows (trailing-90-day percentiles), show how a cell moved quarter over quarter, and audit exactly which postings backed any published number.
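Owning the history can be as simple as an append-only table keyed by posting and capture time. The sketch below uses SQLite from the Python standard library; the table layout, column names, and function names are assumptions, and any store with timestamped rows would serve.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the snapshot store."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS snapshots (
            posting_id  TEXT NOT NULL,
            captured_at TEXT NOT NULL,   -- ISO 8601, UTC
            role TEXT, level TEXT, location TEXT,
            annual_min REAL, annual_max REAL,
            PRIMARY KEY (posting_id, captured_at)
        )""")
    return conn


def _iso(ts: datetime) -> str:
    # Timestamps must be timezone-aware; UTC strings sort chronologically.
    return ts.astimezone(timezone.utc).isoformat(timespec="seconds")


def write_run(conn: sqlite3.Connection, records: list[dict], captured_at: datetime) -> None:
    """Append one run's normalized records under a single capture timestamp."""
    ts = _iso(captured_at)
    conn.executemany(
        "INSERT OR REPLACE INTO snapshots VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(r["posting_id"], ts, r["role"], r["level"], r["location"],
          r["annual_min"], r["annual_max"]) for r in records])
    conn.commit()


def trailing_window(conn: sqlite3.Connection, now: datetime, days: int = 90) -> list[tuple]:
    """Latest snapshot of each posting captured within the trailing window."""
    cutoff = _iso(now - timedelta(days=days))
    return conn.execute("""
        SELECT s.posting_id, s.captured_at, s.role, s.level, s.location,
               s.annual_min, s.annual_max
        FROM snapshots s
        JOIN (SELECT posting_id, MAX(captured_at) AS latest
              FROM snapshots WHERE captured_at >= ?
              GROUP BY posting_id) m
          ON s.posting_id = m.posting_id AND s.captured_at = m.latest
        ORDER BY s.posting_id""", (cutoff,)).fetchall()
```

Taking only the latest snapshot per posting inside the window keeps a long-lived posting from being counted once per weekly run.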
Keeping the dataset fresh
Scheduled incremental crawls
Re-run the crawl on a schedule — weekly is a sensible default for comp data, which moves slowly relative to, say, prices. Drive it from cron or your orchestrator; see scheduled crawls with cron for the scheduling pattern. Because you store snapshots yourself, each run is just "fetch current, extract, upsert into the rolling window."
Detecting changed and expired postings
Key records by a stable posting identifier (canonical URL or board-specific ID). On each run, diff against your last snapshot: new IDs are inserts, missing IDs are likely expired (age them out of the active window rather than deleting — expiry is itself a signal that a role filled), and changed pay ranges are updates worth logging. Statelessness means this diff logic is yours to define, which also means you decide what "expired" means for your benchmark.
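The diff described above fits in a few set operations. In this sketch each snapshot is a dict keyed by posting ID, and the pay_min/pay_max keys and the returned labels are illustrative assumptions.

```python
def diff_snapshots(previous: dict[str, dict], current: dict[str, dict]) -> dict[str, list[str]]:
    """Classify posting IDs as inserted, expired, or changed between two runs.

    Both arguments map a stable posting ID to its record. A posting counts
    as changed only when its pay range differs.
    """
    prev_ids, cur_ids = previous.keys(), current.keys()

    def pay(rec: dict) -> tuple:
        return (rec.get("pay_min"), rec.get("pay_max"))

    return {
        "inserted": sorted(cur_ids - prev_ids),
        # Missing IDs are likely expired: age them out, do not delete.
        "expired": sorted(prev_ids - cur_ids),
        "changed": sorted(pid for pid in cur_ids & prev_ids
                          if pay(current[pid]) != pay(previous[pid])),
    }
```

A single missed fetch looks identical to an expiry in this diff, so it may be worth requiring a posting to be absent for two consecutive runs before treating it as expired.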
Cost and scale
Credit math for recurring multi-source crawls
The per-record cost is forecastable. Crawling a posting is 1 credit per page; extracting its pay fields with formats: ["json"] is 5 credits. So a posting you both crawl and extract costs ~6 credits. A weekly pass over, say, 5,000 pay-disclosing postings is roughly 5,000 × 6 = 30,000 credits per run on managed cloud. Size that against the live tiers on the pricing page (don't hard-code tier numbers; they revert from launch pricing on 2026-06-01). The Free tier's 500 one-time lifetime credits are enough to prototype the schema and one small crawl, not to run the production sample.
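The arithmetic above can be kept as a small helper so forecasts stay in step with the per-request costs. The two credit constants come from the text; the function name and the separate index_pages term (listing pages you crawl but do not extract) are assumptions.

```python
CRAWL_CREDITS_PER_PAGE = 1   # from the guide
JSON_EXTRACT_CREDITS = 5     # from the guide, per formats: ["json"] request


def credits_per_run(postings: int, index_pages: int = 0, extract: bool = True) -> int:
    """Estimated credits for one pass over a set of postings.

    index_pages covers listing pages that are crawled for discovery but
    never sent through JSON extraction.
    """
    per_posting = CRAWL_CREDITS_PER_PAGE + (JSON_EXTRACT_CREDITS if extract else 0)
    return postings * per_posting + index_pages * CRAWL_CREDITS_PER_PAGE


if __name__ == "__main__":
    weekly = credits_per_run(5000)
    print(weekly, weekly * 52)  # per run, and per year of weekly runs
```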
Self-host for unlimited passes
If recurring volume is the binding constraint, self-host the AGPL-3.0 engine: the per-page cloud credit goes away and you pay only your own server. For a benchmarking tool that re-crawls the same wide sample every week, that turns an ongoing metered cost into a fixed VPS line item — which is often the difference between sampling a few thousand postings and sampling tens of thousands.
Honest limits
Not all postings disclose pay
This is a data problem, not a tooling problem, but it shapes the whole pipeline. In many markets a large share of postings list no pay at all, so your effective sample is "postings that disclose" — usually a fraction of postings crawled. Keep pay_disclosed as an explicit field, report disclosure rate alongside every benchmark cell, and never silently treat "no pay listed" as a missing value to impute.
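Reporting the disclosure rate per cell takes only a pair of counters. This sketch assumes each crawled record carries role, location, and the pay_disclosed flag from the extraction schema; the cell key and output shape are illustrative.

```python
from collections import Counter


def disclosure_rates(records: list[dict]) -> dict[tuple, dict]:
    """Share of crawled postings per (role, location) that disclose pay."""
    crawled: Counter = Counter()
    disclosed: Counter = Counter()
    for r in records:
        key = (r["role"], r["location"])
        crawled[key] += 1
        if r["pay_disclosed"]:
            disclosed[key] += 1
    return {
        key: {"crawled": n, "disclosed": disclosed[key], "rate": disclosed[key] / n}
        for key, n in crawled.items()
    }
```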
Anti-bot and single-URL extraction
Two engine limits to plan around. First, fastCRW has no Fire-engine-style built-in anti-bot — hardened boards that aggressively block crawlers will need your own handling, and that's exactly the kind of cloud-only specialty where Firecrawl genuinely wins; if your target sources are heavily bot-protected, weigh that honestly. Second, extraction is single-URL: there is no multi-URL batch /v1/extract, so you either iterate /v1/scrape concurrently across your posting list or lean on /v1/crawl to enumerate pages. For most benchmarking pipelines, concurrent iteration over a discovered URL set is the right shape — just size your concurrency to your plan's rate limits.
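Concurrent iteration with a fixed ceiling could look like the sketch below. The scrape_one callable is a placeholder for whatever async function issues the /v1/scrape request for one URL, and the default concurrency of 5 is an arbitrary assumption to be replaced with a value that fits your plan's rate limits.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def scrape_all(
    urls: list[str],
    scrape_one: Callable[[str], Awaitable[dict]],
    concurrency: int = 5,
) -> list[tuple[str, Optional[dict], Optional[Exception]]]:
    """Run scrape_one over urls with at most `concurrency` requests in flight.

    Returns (url, result, error) triples in input order. Failures are
    recorded, not raised, so dropped postings stay visible in the output.
    """
    sem = asyncio.Semaphore(concurrency)

    async def guarded(url: str):
        async with sem:
            try:
                return url, await scrape_one(url), None
            except Exception as exc:  # keep the failure alongside its URL
                return url, None, exc

    return await asyncio.gather(*(guarded(u) for u in urls))
```

Counting the triples with a non-empty error after each run gives a direct measure of how many postings fell out of the sample, which is the number the benchmark's sample counts should be read against.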
Sources
- fastCRW canonical facts: scrape benchmark (truth-recall 63.74% of 819 labeled URLs, diagnose_3way.py, 2026-05-08), credit costs, API surface, honest gaps (github.com/us/crw)
- fastCRW plans and Free-tier credits: fastcrw.com/pricing (launch pricing reverts 2026-06-01)
- Scrape benchmark of record: bench/server-runs/RESULT_3WAY_1000_FULL.md (single run, 3,000 requests, 2026-05-08)
Related: Job board scraper · List crawling for structured data · JSON-schema extraction · Scheduled crawls with cron
