By the fastCRW team · Pricing/footprint claims verified 2026-05-18 · fastCRW launch pricing expires 2026-06-01 · Verify independently before relying on any number.
What local-first web scraping means
Local-first web scraping means the scraping engine runs inside your own network boundary, so the target URLs you fetch and the content you extract never transit a third-party vendor. The "first" matters: the default execution path is your own infrastructure, and the cloud is an opt-in, not a requirement. It is the data-collection equivalent of local-first software generally — your machine is the source of truth, and any remote service is an accessory you can switch off.
This is distinct from how most modern AI web-data tools work. A cloud scraping API takes your URL, fetches it from the vendor's servers, optionally runs an extraction step on the vendor's machines, and returns the result. That is convenient, but it means three things leave your boundary: the URL (what you are interested in), the content (what you got), and any LLM extraction prompt (how you interrogate it).
Local-first vs cloud-API scraping
The cleanest way to see the difference is to ask where the fetch originates. With a cloud API, the outbound request to the target site comes from the vendor's IP space, and the result lands in the vendor's process before it reaches you. With a local-first engine, the outbound request comes from your own egress, and the result never exists anywhere but your network. Same job, different trust topology.
Where the data and the target URLs actually live
In a local-first setup, both the queue of URLs and the resulting markdown/JSON live on disk in your environment. There is no vendor-side log of "this customer scraped these pages," because there is no vendor in the loop. For teams whose URL list is itself sensitive — a competitive-intelligence target set, an internal intranet crawl, a list of customers' sites — the URLs can be as revealing as the content.
Why "local-first" is not the same as "self-hosted only"
A common confusion: local-first is not a vow of never touching the cloud. It is an architecture where local is the primary, fully capable path, and the cloud is optional. fastCRW is local-first and offers a managed cloud at fastcrw.com — the point is that the same engine runs in both places, so you choose per workload rather than being forced into one model. A "self-hosted only" tool, by contrast, gives you no managed escape valve when you do need scale.
How a local-first scraper works
Mechanically, a local-first scraper is just a normal scraping engine that happens to be a self-contained artifact you can run yourself. The design decision that makes this practical is footprint: if "self-host" means standing up a five-service stack, most teams will never actually do it.
The engine runs inside your own network boundary
fastCRW's open-core engine is a single static Rust binary — no Redis, no Node.js, no container orchestration required. As a structural fact (not a benchmark claim), the README lists it as a single ~8 MB Docker image needing 1 container, versus a Firecrawl self-host that ships ~2–3 GB across 5 containers. A single binary is the difference between "self-host is a platform-team project" and "self-host is one docker run on a $5 VPS." Local-first only works if running locally is genuinely easy.
No third-party sees your queries or extracted content
Because the binary does the fetching, the only network traffic that leaves your boundary is the request to the target site itself. No vendor sees your URL list, your crawl schedule, or the content you pulled back. When you add LLM extraction (formats: ["json"] with a JSON schema), you supply your own OpenAI or Anthropic key (BYOK), so even the extraction prompt and the page text go straight from your engine to your provider — not through a scraping vendor first.
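A rough sketch of what such an extraction request body might look like. The formats value comes from the text above; the jsonOptions key is an assumption borrowed from Firecrawl's v1 API shape and should be checked against the fastCRW README, and the schema and helper name are purely illustrative. How the BYOK key is supplied to the engine is not shown here.

```python
import json


def build_extract_payload(url: str, schema: dict) -> dict:
    """Build a /v1/scrape body that requests schema-guided JSON extraction."""
    return {
        "url": url,
        "formats": ["json"],
        # Assumption: the schema travels under "jsonOptions", as in
        # Firecrawl's v1 API. Verify the exact key for your engine version.
        "jsonOptions": {"schema": schema},
    }


# Hypothetical schema for a product page.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["name"],
}

payload = build_extract_payload("https://example.com/product/1", product_schema)
print(json.dumps(payload, indent=2))
```

The payload is only ever sent to your own engine, which then calls your LLM provider with your key.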
Renderer selection without a remote browser fleet
JavaScript-heavy pages are where cloud APIs usually justify themselves, because they run a managed browser fleet. fastCRW handles rendering locally: the renderer defaults to auto, with a chrome → lightpanda → http fallback order. The default Docker Compose ships lightpanda (a lightweight headless renderer); chrome is opt-in and heavier (~500 MB image, ~1 GB resident when enabled). You get JS rendering inside your own boundary instead of renting someone else's browsers.
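The fallback idea can be illustrated with a generic sketch. This is not fastCRW's actual implementation; it only shows the general pattern of trying renderers in the chrome, lightpanda, http order and skipping any that are not enabled. The renderer callables here are hypothetical stubs.

```python
from typing import Callable

# Order taken from the text: chrome first, then lightpanda, then plain http.
FALLBACK_ORDER = ["chrome", "lightpanda", "http"]


def render_with_fallback(
    url: str, renderers: dict[str, Callable[[str], str]]
) -> tuple[str, str]:
    """Try each enabled renderer in order; return (renderer_name, html)."""
    errors = []
    for name in FALLBACK_ORDER:
        render = renderers.get(name)
        if render is None:
            continue  # renderer not enabled in this deployment
        try:
            return name, render(url)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all renderers failed: " + "; ".join(errors))


def failing_lightpanda(url: str) -> str:
    # Stub that simulates a renderer crash.
    raise TimeoutError("render timed out")


# chrome is opt-in, so it is simply absent from this hypothetical setup.
enabled = {
    "lightpanda": failing_lightpanda,
    "http": lambda url: f"<html>fetched {url}</html>",
}

used, html = render_with_fallback("https://example.com", enabled)
print(used)
```

In this stubbed setup lightpanda fails, so the plain http fetch ends up serving the request.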
Local-first vs cloud scraping APIs: the trade-offs
Local-first is not free of cost. It is a real trade, and an honest explainer states both sides.
What you gain: privacy, no per-page metering, no vendor ceiling
Self-hosting the AGPL-3.0 engine is free — you pay only for your own server, which works out to roughly $0 per 1,000 scrapes plus infrastructure. Compare that to a metered cloud, where Firecrawl's hosted tiers run about $0.83–5.33 per 1,000 scrapes (competitor-prices.lock.md, verified 2026-05-18). At high volume the metering, not the privacy, is often what tips teams local. You also gain data residency by construction and the absence of a vendor roadmap that can change your costs.
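To make the metering arithmetic concrete, here is a small sketch using the per-1,000 price range quoted above. The $20 monthly server cost is a hypothetical figure, not a number from the fact sheet; substitute your own infrastructure cost.

```python
FIRECRAWL_LOW = 0.83   # USD per 1,000 scrapes, low end of the quoted range
FIRECRAWL_HIGH = 5.33  # USD per 1,000 scrapes, high end of the quoted range


def metered_cost(scrapes: int, price_per_1000: float) -> float:
    """Monthly bill on a metered cloud at a given per-1,000 price."""
    return scrapes / 1000 * price_per_1000


def break_even_scrapes(monthly_server_cost: float, price_per_1000: float) -> float:
    """Monthly volume at which a flat server cost equals the metered bill."""
    return monthly_server_cost / price_per_1000 * 1000


monthly_scrapes = 1_000_000
low = metered_cost(monthly_scrapes, FIRECRAWL_LOW)
high = metered_cost(monthly_scrapes, FIRECRAWL_HIGH)
print(f"Metered cloud at 1M scrapes: ${low:,.2f} to ${high:,.2f}")

hypothetical_server = 20.0  # USD per month, an assumption for illustration
print(f"Break-even at low price:  {break_even_scrapes(hypothetical_server, FIRECRAWL_LOW):,.0f} scrapes")
print(f"Break-even at high price: {break_even_scrapes(hypothetical_server, FIRECRAWL_HIGH):,.0f} scrapes")
```

At one million scrapes a month the quoted range works out to about $830 to $5,330, while the self-hosted cost stays at whatever the server costs.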
What you give up: managed proxies and anti-bot depth
The honest gap: fastCRW has no Fire-engine-style anti-bot and no managed residential proxy pool. If your targets are hostile, hardened sites that aggressively block datacenter IPs, a local-first engine running from your own egress will struggle, and a proxy-first vendor like Bright Data or Oxylabs is the right tool. Local-first wins on privacy and cost; it does not magically win on adversarial fetch reliability. The engine is also stateless per request — no built-in sessions — so you own scheduling and audit logging. And there is no screenshot output (a request for formats: ["screenshot"] returns HTTP 422).
Hybrid: same engine local for sensitive jobs, cloud for scale
The most useful pattern is not all-or-nothing. Because fastCRW speaks a Firecrawl-compatible REST API in both self-host and managed cloud, you can route sensitive jobs to the local binary and burst high-volume or anti-bot-heavy jobs to the cloud — switching backends is a base-URL swap, not a rewrite. See self-host vs managed scraping for how teams split that traffic.
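A minimal sketch of that routing decision, assuming each job carries flags for sensitivity and anti-bot needs. Both base URLs are placeholders (the local port and the cloud host are not taken from the fastCRW docs), and the Job type is hypothetical.

```python
from dataclasses import dataclass

LOCAL_BASE_URL = "http://localhost:3000"        # assumption: local engine address
CLOUD_BASE_URL = "https://cloud.example.invalid"  # placeholder for the managed cloud


@dataclass(frozen=True)
class Job:
    url: str
    sensitive: bool = False
    needs_anti_bot: bool = False


def pick_base_url(job: Job) -> str:
    """Sensitive jobs always stay local; only non-sensitive jobs may burst to cloud."""
    if job.sensitive:
        return LOCAL_BASE_URL
    if job.needs_anti_bot:
        return CLOUD_BASE_URL
    return LOCAL_BASE_URL


def scrape_endpoint(job: Job) -> str:
    # Same path on both backends, so only the base URL differs.
    return pick_base_url(job).rstrip("/") + "/v1/scrape"


jobs = [
    Job("https://intranet.example/wiki", sensitive=True, needs_anti_bot=True),
    Job("https://hardened.example/catalog", needs_anti_bot=True),
    Job("https://example.com/blog"),
]
for job in jobs:
    print(job.url, "->", scrape_endpoint(job))
```

Checking the sensitivity flag first is a deliberate choice: a job that is both sensitive and anti-bot-heavy stays local rather than leaking its URL to gain fetch reliability.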
When local-first is the right choice
Regulated data and internal/intranet sources
If you scrape internal wikis, intranet pages, or anything behind your own auth, routing those URLs through a cloud vendor is often a non-starter — both for policy and because the vendor's servers cannot reach your private network anyway. A local-first engine sits inside the boundary where those sources already live. The same logic applies to regulated workloads (health, finance, legal) where data residency is a hard requirement, not a preference — see local-first scraping and data privacy.
Air-gapped or zero-egress environments
In a strict zero-egress or air-gapped setup, a cloud scraping API simply cannot run — the call to the vendor would violate the egress policy. A self-contained binary on a collection node, with egress allow-listed to the target sites only, is the only architecture that fits. The single-binary form factor is what makes this deployable at all.
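An application-level pre-check for such an allow-list might look like the sketch below. It assumes a hypothetical set of permitted hosts and does not replace enforcement at the firewall or proxy layer; it only stops disallowed URLs from entering the queue in the first place.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list of target hosts for this collection node.
ALLOWED_HOSTS = {"example.com", "docs.example.org"}


def is_allowed(url: str, allowed: set[str] = ALLOWED_HOSTS) -> bool:
    """Return True only for http(s) URLs whose host is an allowed host or its subdomain."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    # Match the exact host or a dot-separated subdomain, never a bare suffix,
    # so "evil-example.com" does not pass for "example.com".
    return any(host == h or host.endswith("." + h) for h in allowed)


queue = [
    "https://example.com/page",
    "https://sub.example.com/page",
    "https://evil-example.com/page",
    "ftp://example.com/file",
]
permitted = [u for u in queue if is_allowed(u)]
print(permitted)
```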
Cost-sensitive high-volume pipelines
At sustained high volume, per-page metering compounds. A pipeline doing millions of scrapes a month hits a different cost class on a metered cloud than on a self-hosted binary whose marginal cost is just CPU and bandwidth. For these workloads the privacy benefit is a bonus; the economics are the driver.
Getting started with a local-first engine
Run the open-core binary on your own box
The fastCRW engine is open source under AGPL-3.0 at github.com/us/crw. You can run the single binary directly or via Docker Compose, point your client at /v1/scrape, /v1/crawl, /v1/map, and /v1/search, and you have a fully local scraping API with zero per-page fees. The Python SDK (crw) can even run a self-contained local engine for scripts. For a deeper menu of options, see best self-hosted scrapers.
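As a sketch of what calling the local endpoint could look like, the snippet below builds a POST to /v1/scrape using only the Python standard library rather than the crw SDK. The localhost:3000 address is an assumption, and the body follows the Firecrawl-style shape (url plus formats); check the repo's endpoint table for the exact fields.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # assumption: where your local engine listens


def build_scrape_request(
    url: str, base_url: str = BASE_URL, formats: tuple[str, ...] = ("markdown",)
) -> urllib.request.Request:
    """Build a POST request for the Firecrawl-compatible /v1/scrape endpoint."""
    body = json.dumps({"url": url, "formats": list(formats)}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(url: str, base_url: str = BASE_URL) -> dict:
    """Send the request to a running engine and return the parsed JSON response."""
    with urllib.request.urlopen(build_scrape_request(url, base_url), timeout=30) as resp:
        return json.load(resp)


request = build_scrape_request("https://example.com")
print(request.get_method(), request.full_url)
```

With an engine running, scrape("https://example.com") would send the request. Passing a different base_url is the only change needed to target another backend.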
Keep the same API when you do need the cloud
Because the API shape is Firecrawl-compatible, code you write against the local engine works against the managed cloud unchanged — you swap a base URL. That means "go local-first" is not a one-way door: you start local for privacy or cost, and the day you need managed scale or a feature only the cloud offers, you flip a config value rather than re-architect. Local-first, in other words, is the safe default precisely because leaving it is cheap.
Sources
- fastCRW canonical facts — single static Rust binary, AGPL-3.0, renderer selection, structural footprint, honest gaps (internal fact sheet, verified 2026-05-29)
- fastCRW repo: github.com/us/crw (structural footprint, endpoint table)
- Self-hosted vs hosted cost: $0 per 1,000 self-hosted vs Firecrawl hosted $0.83–5.33 per 1,000 (competitor-prices.lock.md, verified 2026-05-18) · firecrawl.dev/pricing
Related: Local-first scraping and data privacy · Best self-hosted scrapers · Self-host vs managed scraping · Pricing
