How do I scrape salary data for compensation benchmarking?

Crawl pay-disclosing job postings with /v1/crawl (or iterate /v1/scrape concurrently over a known URL list), then extract pay fields on each posting with formats: ['json'] plus a jsonSchema (role, location, currency, pay period, pay min/max). Normalize currencies and periods in your own code, then aggregate into percentiles by role and location. With fastCRW the crawl is 1 credit per page and the JSON extraction is 5 credits per request.

How do I extract pay ranges from job postings?

Send each posting to /v1/scrape with formats: ['json'] and a jsonSchema that captures pay_min, pay_max, currency, pay_period, and a pay_disclosed boolean. The managed LLM extraction (a paid-plan feature; the FREE plan has no LLM features) coerces the many ways ranges are written into typed fields. Keep pay_disclosed explicit so postings with no listed pay stay a first-class outcome rather than a guessed value.

How big a sample do I need for a salary benchmark?

There is no universal number, but attach a sample count to every (role, level, location) cell and suppress cells below a floor you set — a percentile over a handful of postings is noise. Sample size is gated by cost-per-page and by extraction recall; fastCRW's flat 1-credit-per-page crawl and 63.74% truth-recall on Firecrawl's 819-URL labeled dataset (diagnose_3way.py, 2026-05-08, highest of the three tools tested) help you keep cells well-populated without runaway cost.

How do I keep a compensation dataset fresh?

Re-run the crawl on a schedule (weekly is a sensible default for comp data) and key records by a stable posting ID. Diff each run against the last snapshot: insert new IDs, age out missing ones as likely expired, and log changed pay ranges. fastCRW is stateless per request, so you store the rolling snapshots yourself — which means you own the history and define exactly what 'expired' means.

Is scraping salary data from job postings allowed?

It depends on the source's terms and your jurisdiction, and this is not legal advice. fastCRW respects robots.txt by default; override that only when you have the legal right to do so. Many job boards restrict automated access, so check each source's terms, prefer official feeds or APIs where offered, and keep your crawl rate polite. Self-hosting the AGPL-3.0 engine also keeps the scraped postings on your own infrastructure.

Salary Benchmarking Web Scraping Tool: Build Guide

By the fastCRW team · Benchmark and credit figures verified 2026-05-18 · fastCRW launch pricing expires 2026-06-01 · Verify independently before relying on numbers.

Building a salary benchmarking web scraping tool

A salary benchmarking web scraping tool turns scattered, pay-disclosing job postings and public compensation pages into one structured dataset you can query: median and percentile pay by role, level, and location. The build is a pipeline — crawl the sources, extract pay-range fields into a schema, normalize currencies and pay periods, then aggregate into percentiles. This guide walks each stage and is honest about the two things that erode a benchmark's credibility: per-page cost (it caps how wide a sample you can afford) and missed postings (silently dropped pages shrink the sample without telling you).

Both of those pressures point at the same two engine properties: a flat per-page price and high extraction recall. fastCRW crawls at 1 credit per page with a free, self-hostable AGPL-3.0 engine, and on Firecrawl's public 819-URL labeled dataset it posted the highest truth-recall of the three tools tested — 63.74% (diagnose_3way.py, 2026-05-08) — so fewer pay-disclosing postings fall out of your sample before they reach the aggregation step.

What a salary benchmarking dataset needs

Role, location, and pay-range fields

Every usable record needs a small, consistent core: normalized role title, seniority/level, location (and remote flag), employer, currency, pay period (hourly/monthly/annual), and a min/max pay range. The raw postings will give you none of this cleanly — titles are free text, ranges are written a dozen ways ("$120k–$150k", "120,000-150,000 USD/yr", "£60k+"), and many postings bury pay in prose. The extraction step exists to coerce all of that into the same shape so the aggregation step can trust it.

Sample size and freshness

A benchmark's credibility scales with two things: how many comparable postings sit behind each percentile, and how recent they are. A "Senior Backend Engineer, Berlin" benchmark built on eight stale postings is noise; the same cell with several hundred postings refreshed weekly is a signal. That is why both cost-per-page and recall matter so directly here — they govern how big and how current the sample can be without the bill or the gaps getting out of hand.

Collecting compensation signals

Crawl postings that disclose pay ranges

Start with /v1/crawl against listing index pages so the engine discovers and fetches posting URLs for you (async BFS, returns a job ID; poll /v1/crawl/:id for results). The crawl endpoint accepts maxDepth (cap 10) and maxPages (cap 1000), so a single job tops out at 1,000 pages. For a large multi-board pass, run several scoped jobs rather than one unbounded crawl. If you already have a feed of posting URLs, iterate /v1/scrape across them concurrently instead; there is no batch endpoint, so concurrency is how you go wide (more on that below). For the upstream board-scraping mechanics, see the job board scraper guide and list crawling for structured data.
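As a rough sketch of the crawl-then-poll flow described above: the snippet below builds a request body clamped to the stated caps and polls a job until it finishes. The "url" key, the "status" field, and its "completed"/"failed" values are assumptions for illustration, not confirmed API details. The HTTP call itself is left to a caller-supplied function (hypothetical `fetch_status`) so the logic stays independent of any client library.

```python
import time
from typing import Callable

MAX_DEPTH_CAP = 10    # cap stated in this guide
MAX_PAGES_CAP = 1000  # cap stated in this guide


def build_crawl_payload(index_url: str, max_depth: int = 2, max_pages: int = 1000) -> dict:
    """Build a /v1/crawl request body, clamping values to the stated caps.

    maxDepth and maxPages are named in the guide; the "url" key is an assumption.
    """
    return {
        "url": index_url,
        "maxDepth": max(0, min(max_depth, MAX_DEPTH_CAP)),
        "maxPages": max(1, min(max_pages, MAX_PAGES_CAP)),
    }


def poll_crawl(job_id: str, fetch_status: Callable[[str], dict],
               interval_s: float = 5.0, max_polls: int = 120) -> dict:
    """Poll a crawl job until it reports a terminal status or we give up.

    fetch_status is supplied by the caller: it should GET /v1/crawl/:id and
    return the parsed JSON body. The "status" values checked here are assumed.
    """
    for _ in range(max_polls):
        body = fetch_status(job_id)
        if body.get("status") in ("completed", "failed"):
            return body
        time.sleep(interval_s)
    raise TimeoutError(f"crawl job {job_id} did not finish after {max_polls} polls")
```

Bounding the number of polls keeps a stuck job from blocking a scheduled run indefinitely; adjust the interval and limit to the size of the crawl.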

Extract structured pay fields with a JSON schema

On each posting, ask for structured output with formats: ["json"] plus a jsonSchema — this is the LLM-extraction path (5 credits per request). A schema for a comp record looks like:

role_title, seniority — strings
location, remote — string + boolean
currency, pay_period — enums ("USD"/"EUR"…, "hour"/"month"/"year")
pay_min, pay_max — numbers
pay_disclosed — boolean, so you can keep "no pay listed" as a first-class outcome instead of guessing

LLM-based JSON extraction is a managed feature available on paid plans — fastCRW runs the model for you, so there is nothing to operate yourself (the FREE plan has no LLM features). The deeper schema-design patterns (enums, nested objects, required fields, retry on schema-validation failure) live in the JSON-schema extraction guide.
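The field list above could be written as a JSON Schema along these lines. This is a minimal sketch, not the canonical schema: the currency enum is an illustrative subset, the choice of required fields is an assumption, and the "url" key in the request body is assumed rather than taken from the API reference. Only formats and jsonSchema are named in this guide.

```python
# Illustrative schema for one compensation record.
COMP_SCHEMA = {
    "type": "object",
    "properties": {
        "role_title": {"type": "string"},
        "seniority": {"type": "string"},
        "location": {"type": "string"},
        "remote": {"type": "boolean"},
        # Enum values are an illustrative subset; extend for your markets.
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        "pay_period": {"type": "string", "enum": ["hour", "month", "year"]},
        "pay_min": {"type": "number"},
        "pay_max": {"type": "number"},
        "pay_disclosed": {"type": "boolean"},
    },
    # Pay fields are left optional so a posting with no listed pay can be
    # returned as pay_disclosed = false without invented numbers.
    "required": ["role_title", "pay_disclosed"],
}


def build_scrape_payload(posting_url: str) -> dict:
    """Build a /v1/scrape request body asking for schema-shaped JSON.

    The "url" key is an assumption; formats and jsonSchema come from the guide.
    """
    return {"url": posting_url, "formats": ["json"], "jsonSchema": COMP_SCHEMA}
```

Leaving the pay fields out of "required" is a design choice: it lets an undisclosed posting come back as a valid record instead of failing validation or prompting a guess.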

Normalizing currencies and periods

Extraction gets you typed fields; it does not get you comparable ones. Do the normalization in your own code after extraction: convert hourly and monthly figures to a single annualized base (e.g. multiply hourly by your assumed annual hours), convert currencies to one reporting currency with a dated FX rate you store alongside the record, and snap free-text titles to a small controlled taxonomy. Keep the raw extracted values too — when an FX rate or annualization assumption changes, you want to recompute from source rather than re-crawl.
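A minimal sketch of that normalization step, under stated assumptions: 2,080 working hours per year (40 hours × 52 weeks), 12 months per year, and a caller-supplied FX table whose rates mean "units of reporting currency per one unit of source currency". The function and field names are hypothetical, and the FX rate and date are stored with the result so a record can be recomputed later.

```python
from dataclasses import dataclass
from typing import Optional

HOURS_PER_YEAR = 2080   # assumption: 40 hours x 52 weeks
MONTHS_PER_YEAR = 12
PERIOD_MULTIPLIER = {"hour": HOURS_PER_YEAR, "month": MONTHS_PER_YEAR, "year": 1}


@dataclass(frozen=True)
class NormalizedPay:
    annual_min: float
    annual_max: float
    reporting_currency: str
    fx_rate: float   # rate applied, kept for later recomputation
    fx_date: str     # date of the rate, e.g. "2026-05-18"


def normalize_pay(record: dict, fx_rates: dict, fx_date: str,
                  reporting_currency: str = "USD") -> Optional[NormalizedPay]:
    """Annualize and convert one extracted record; None if pay is undisclosed."""
    if not record.get("pay_disclosed"):
        return None
    multiplier = PERIOD_MULTIPLIER[record["pay_period"]]
    currency = record["currency"]
    # A missing currency raises KeyError on purpose: do not guess a rate.
    rate = 1.0 if currency == reporting_currency else fx_rates[currency]
    pay_min = record["pay_min"]
    pay_max = record.get("pay_max")
    if pay_max is None:
        # Open-ended ranges such as "60k+" carry only a lower bound.
        pay_max = pay_min
    return NormalizedPay(
        annual_min=pay_min * multiplier * rate,
        annual_max=pay_max * multiplier * rate,
        reporting_currency=reporting_currency,
        fx_rate=rate,
        fx_date=fx_date,
    )
```

For example, an hourly USD range of 50 to 60 annualizes to 104,000 to 124,800 under the 2,080-hour assumption.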

Aggregating into benchmarks

Percentiles by role and location

Group normalized records by (role, level, location) and compute p25/p50/p75/p90 on the annualized midpoint (or on min/max separately if you want a range view). Attach a sample count to every cell and suppress cells below a floor: a percentile over three postings is not a benchmark, and showing it is worse than showing nothing. This is also where the recall number pays off. At 63.74% truth-recall on the labeled set (diagnose_3way.py, 2026-05-08), more of the pay-disclosing postings you crawled actually make it into the cells. As a purely illustrative example, that can be the difference between a p50 backed by 200 postings and one backed by 120.
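The grouping and suppression logic might look like the sketch below. It uses linear interpolation between closest ranks, which is one common percentile definition among several. The record keys and the default floor of 30 are assumptions; set the floor to whatever your benchmark can defend.

```python
import math
from collections import defaultdict


def percentile(sorted_values: list, p: float) -> float:
    """Linear-interpolation percentile over a pre-sorted, non-empty list."""
    rank = (p / 100) * (len(sorted_values) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    frac = rank - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac


def benchmark_cells(records: list, min_sample: int = 30) -> dict:
    """Compute p25/p50/p75/p90 of the annualized midpoint per cell.

    Cells with fewer than min_sample postings are reported as suppressed,
    with their sample count but no percentiles.
    """
    cells = defaultdict(list)
    for r in records:
        key = (r["role"], r["level"], r["location"])
        cells[key].append((r["annual_min"] + r["annual_max"]) / 2)

    out = {}
    for key, midpoints in cells.items():
        midpoints.sort()
        n = len(midpoints)
        if n < min_sample:
            out[key] = {"n": n, "suppressed": True}
            continue
        out[key] = {"n": n, "suppressed": False}
        for p in (25, 50, 75, 90):
            out[key][f"p{p}"] = percentile(midpoints, p)
    return out
```

Returning suppressed cells with their count, rather than dropping them, keeps thin coverage visible to whoever reads the benchmark.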

Storing rolling snapshots yourself (stateless)

fastCRW is stateless per request — it scrapes and returns, it does not remember prior runs. That is a feature for a benchmarking tool: you own the history. Write each run's normalized records into your own store with a captured-at timestamp so you can compute rolling windows (trailing-90-day percentiles), show how a cell moved quarter over quarter, and audit exactly which postings backed any published number.
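One way to keep that history is a small append-only table keyed by posting ID and capture time. The sketch below uses SQLite from the Python standard library; the table name, column names, and the 90-day default are assumptions, and any store with timestamps would serve equally well. Timestamps must be timezone-aware and are stored as UTC ISO-8601 strings so they sort correctly as text.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the snapshot store."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS comp_snapshots (
            posting_id  TEXT NOT NULL,
            captured_at TEXT NOT NULL,
            role TEXT, level TEXT, location TEXT,
            annual_min REAL, annual_max REAL,
            PRIMARY KEY (posting_id, captured_at)
        )""")
    return conn


def _utc_iso(ts: datetime) -> str:
    return ts.astimezone(timezone.utc).isoformat(timespec="seconds")


def save_run(conn: sqlite3.Connection, records: list, captured_at: datetime) -> None:
    """Append one run's normalized records under a single capture timestamp."""
    ts = _utc_iso(captured_at)
    conn.executemany(
        "INSERT OR REPLACE INTO comp_snapshots VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(r["posting_id"], ts, r["role"], r["level"], r["location"],
          r["annual_min"], r["annual_max"]) for r in records],
    )
    conn.commit()


def trailing_window(conn: sqlite3.Connection, as_of: datetime, days: int = 90) -> list:
    """Return the latest snapshot per posting captured in the trailing window.

    Relies on SQLite's documented behavior that bare columns in a MAX()
    aggregate query are taken from the row holding the maximum.
    """
    start, end = _utc_iso(as_of - timedelta(days=days)), _utc_iso(as_of)
    return conn.execute("""
        SELECT posting_id, role, level, location, annual_min, annual_max,
               MAX(captured_at)
        FROM comp_snapshots
        WHERE captured_at > ? AND captured_at <= ?
        GROUP BY posting_id""", (start, end)).fetchall()
```

Because rows are never overwritten across runs, the same table answers both "what is the current window" and "which postings backed last quarter's number".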

Keeping the dataset fresh

Scheduled incremental crawls

Re-run the crawl on a schedule — weekly is a sensible default for comp data, which moves slowly relative to, say, prices. Drive it from cron or your orchestrator; see scheduled crawls with cron for the scheduling pattern. Because you store snapshots yourself, each run is just "fetch current, extract, upsert into the rolling window."

Detecting changed and expired postings

Key records by a stable posting identifier (canonical URL or board-specific ID). On each run, diff against your last snapshot: new IDs are inserts, missing IDs are likely expired (age them out of the active window rather than deleting — expiry is itself a signal that a role filled), and changed pay ranges are updates worth logging. Statelessness means this diff logic is yours to define, which also means you decide what "expired" means for your benchmark.
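The diff described above can be sketched as a pure function over two snapshots, each a dict keyed by stable posting ID. The record keys are assumptions, and "expired" here simply means "absent from the current run"; a real pipeline may want to wait for several consecutive misses before aging a posting out.

```python
def _pay_range(record: dict) -> tuple:
    """The fields whose change counts as a pay update."""
    return (record.get("pay_min"), record.get("pay_max"),
            record.get("currency"), record.get("pay_period"))


def diff_snapshots(previous: dict, current: dict) -> dict:
    """Classify posting IDs as inserted, expired, or changed between runs.

    previous and current map posting_id -> extracted record.
    """
    prev_ids, curr_ids = set(previous), set(current)
    changed = sorted(
        pid for pid in prev_ids & curr_ids
        if _pay_range(previous[pid]) != _pay_range(current[pid])
    )
    return {
        "inserted": sorted(curr_ids - prev_ids),
        "expired": sorted(prev_ids - curr_ids),  # age out, do not delete
        "changed": changed,
    }
```

Keeping the function free of I/O makes the definition of "changed" and "expired" easy to test and to adjust as the benchmark's rules evolve.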

Cost and scale

Credit math for recurring multi-source crawls

The per-record cost is forecastable. Crawling a posting is 1 credit per page; extracting its pay fields with formats: ["json"] is 5 credits. So a posting you both crawl and extract costs ~6 credits. A weekly pass over, say, 5,000 pay-disclosing postings is roughly 5,000 × 6 = 30,000 credits per run on managed cloud. Size that against the live tiers on the pricing page (don't hard-code tier numbers; they revert from launch pricing on 2026-06-01). The Free tier's 500 one-time lifetime credits are enough to prototype the schema and one small crawl, not to run the production sample.
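The arithmetic above is simple enough to keep as a small helper, so the forecast updates when the sample grows. The two credit constants are the figures quoted in this guide; the function name and the split between crawled pages and extracted postings are illustrative. Crawled pages can exceed extracted postings, because index pages also cost crawl credits.

```python
CRAWL_CREDITS_PER_PAGE = 1       # figure quoted in this guide
JSON_EXTRACT_CREDITS = 5         # figure quoted in this guide, per request


def credits_per_run(crawled_pages: int, extracted_postings: int) -> int:
    """Estimate credits for one pass: crawl cost plus JSON-extraction cost."""
    return (crawled_pages * CRAWL_CREDITS_PER_PAGE
            + extracted_postings * JSON_EXTRACT_CREDITS)


# The guide's example: 5,000 postings, each crawled and extracted.
weekly = credits_per_run(crawled_pages=5000, extracted_postings=5000)
print(weekly)  # 30000
```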

Self-host for unlimited passes

If recurring volume is the binding constraint, self-host the AGPL-3.0 engine: the per-page cloud credit goes away and you pay only your own server. For a benchmarking tool that re-crawls the same wide sample every week, that turns an ongoing metered cost into a fixed VPS line item — which is often the difference between sampling a few thousand postings and sampling tens of thousands.

Honest limits

Not all postings disclose pay

This is a data problem, not a tooling problem, but it shapes the whole pipeline. In many markets a large share of postings list no pay at all, so your effective sample is "postings that disclose" — usually a fraction of postings crawled. Keep pay_disclosed as an explicit field, report disclosure rate alongside every benchmark cell, and never silently treat "no pay listed" as a missing value to impute.
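Reporting the disclosure rate next to each cell takes only a few lines. The sketch below assumes each record carries the pay_disclosed boolean from the extraction schema; the function name is hypothetical.

```python
from typing import Optional


def disclosure_rate(records: list) -> Optional[float]:
    """Share of crawled postings that disclose pay; None for an empty cell.

    Records where pay_disclosed is missing are counted as undisclosed
    rather than imputed.
    """
    total = len(records)
    if total == 0:
        return None
    disclosed = sum(1 for r in records if r.get("pay_disclosed") is True)
    return disclosed / total
```

Returning None for an empty cell, instead of 0.0, avoids confusing "nothing crawled" with "nothing disclosed".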

Anti-bot and single-URL extraction

Two engine limits to plan around. First, fastCRW has no Fire-engine-style built-in anti-bot — hardened boards that aggressively block crawlers will need your own handling, and that's exactly the kind of cloud-only specialty where Firecrawl genuinely wins; if your target sources are heavily bot-protected, weigh that honestly. Second, extraction is single-URL: there is no multi-URL batch /v1/extract, so you either iterate /v1/scrape concurrently across your posting list or lean on /v1/crawl to enumerate pages. For most benchmarking pipelines, concurrent iteration over a discovered URL set is the right shape — just size your concurrency to your plan's rate limits.
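Concurrent iteration over a URL list can be sketched with a bounded thread pool from the standard library. The per-URL call is passed in as a function (hypothetical `scrape_one`, which would POST to /v1/scrape and return the parsed result), and the default of 4 workers is an arbitrary placeholder to be sized against your plan's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def scrape_all(urls: Iterable[str], scrape_one: Callable[[str], dict],
               max_workers: int = 4) -> tuple:
    """Run scrape_one over urls with bounded concurrency.

    Returns (results, errors): results maps url -> extracted record,
    errors maps url -> error description. Failures are recorded, not
    dropped, so missed postings stay visible.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going; report per-URL failures
                errors[url] = repr(exc)
    return results, errors
```

Collecting errors per URL matters for a benchmark: a silently skipped posting shrinks the sample, while a logged one can be retried on the next run.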

Sources

fastCRW canonical facts: scrape benchmark (truth-recall 63.74% of 819 labeled URLs, diagnose_3way.py, 2026-05-08), credit costs, API surface, honest gaps — github.com/us/crw
fastCRW plans and Free-tier credits: fastcrw.com/pricing (launch pricing reverts 2026-06-01)
Scrape benchmark of record: bench/server-runs/RESULT_3WAY_1000_FULL.md (single run, 3,000 requests, 2026-05-08)