By the fastCRW team · Footprint and benchmark facts verified 2026-05-29 · fastCRW launch pricing reverts 2026-06-01 · Verify independently.
Headless browser scraping, end to end
Headless browser scraping means driving a real browser engine — Chrome, Firefox, or WebKit — without its graphical window, so that JavaScript runs and the page renders before you read the DOM. It is the standard fix for the modern web: React, Vue, and Angular ship a near-empty HTML shell, then build the page client-side. A plain HTTP fetch with requests + BeautifulSoup sees the shell and nothing else, so you need a headless browser to execute the scripts first. This guide teaches that properly in Python with Playwright and Selenium, then is honest about what a browser fleet actually costs to run — and when a managed renderer is the saner call.
Disclosure: we build fastCRW, a Firecrawl-compatible open-core scraping engine. We will teach DIY headless scraping straight, because most of the time it is the right starting point, and only then explain where our managed chrome renderer fits.
Is the site actually dynamic?
Before reaching for a browser, confirm you need one — headless Chrome is the most expensive way to fetch a page. Three quick checks:
- View Source vs Inspect. If "View Page Source" is mostly an empty <div id="root"> but the live DOM in DevTools is full of content, JavaScript is rendering it.
- Disable JavaScript in DevTools and reload. If the content vanishes, you need rendering.
- Watch the Network tab. Often the data arrives as a clean JSON XHR you can call directly with requests, with no browser required and far faster.
If a hidden API exists, use it. Reserve headless browsers for genuinely client-rendered pages, infinite scroll, and interaction-gated content.
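The first check can be roughly automated by measuring how much visible text the raw HTML carries before any script runs. The sketch below uses only the standard library; the helper names and the 200-character threshold are illustrative assumptions, not a calibrated rule.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that sits outside <script>, <style>, and <noscript>."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside the skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html: str) -> int:
    """Return the number of characters of human-readable text in raw HTML."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks))


def looks_client_rendered(html: str, min_chars: int = 200) -> bool:
    """Heuristic: very little visible text suggests an empty JavaScript shell."""
    return visible_text_length(html) < min_chars
```

Feed it the body of a plain HTTP fetch (for example, the text of a requests response). A True result is only a hint that rendering may be needed; the Network-tab check is still worth doing, because a hidden JSON endpoint beats both approaches.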
Setting up Playwright or Selenium headless
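Playwright is the modern default: it downloads and manages its own browser binaries, auto-waits for elements, and talks to Chrome over the DevTools Protocol. Selenium is older and has the larger community, but it goes through the WebDriver protocol, and historically you managed the driver yourself (Selenium 4.6 and later ship Selenium Manager, which resolves a matching driver automatically; the example below uses webdriver-manager, which also works). We will scrape the JS-rendered practice site quotes.toscrape.com/js with both.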
Playwright (recommended for new projects)
pip install playwright
playwright install chromium
from playwright.sync_api import sync_playwright
import json

all_quotes = []

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(viewport={"width": 1920, "height": 1080})
    for page_num in range(1, 11):
        page.goto(f"https://quotes.toscrape.com/js/page/{page_num}/")
        page.locator(".quote").first.wait_for()  # wait for render
        for quote in page.locator(".quote").all():
            all_quotes.append({
                "text": quote.locator(".text").text_content().strip("\u201c\u201d"),
                "author": quote.locator(".author").text_content().strip(),
                "tags": [t.text_content() for t in quote.locator(".tag").all()],
            })
    browser.close()

# Write the results with a context manager so the file is flushed and closed.
with open("quotes.json", "w", encoding="utf-8") as f:
    json.dump(all_quotes, f, indent=2)
print(f"Saved {len(all_quotes)} quotes")
Note headless=True and the explicit viewport — some sites render different layouts at small dimensions. The context manager closes the browser for you.
Selenium headless
pip install selenium webdriver-manager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("--headless=new")
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options,
)
try:
    driver.get("https://quotes.toscrape.com/js/page/1/")
    # Block for up to 10 seconds until at least one quote is in the DOM.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".quote"))
    )
    quotes = driver.find_elements(By.CSS_SELECTOR, ".quote")
    print(f"Found {len(quotes)} quotes")
finally:
    driver.quit()  # always release the browser, even if the wait times out
The two big server-side gotchas live here: --no-sandbox is usually required inside containers, and a headless Linux box needs Chrome's full dependency tree (fonts, libx11-xcb1, and friends) or the browser silently fails to launch. This "works on my laptop, dies on the server" gap is the first tax of running browsers yourself. For a deeper side-by-side, see Selenium vs CRW and Playwright as a scraper vs fastCRW.
Waiting for content and handling infinite scroll
The single most common headless-scraping bug is reading the DOM before it is ready. Never use a fixed time.sleep() — it is either too short (flaky) or too long (slow). Wait on a condition instead.
- Playwright: locators auto-wait; call .wait_for() to block until an element exists, or page.wait_for_load_state("networkidle") for XHR-heavy pages.
- Selenium: use WebDriverWait(...).until(EC.presence_of_element_located(...)) as above.
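The idea behind a condition wait, which WebDriverWait implements by polling, fits in a few lines of plain Python. This is a minimal sketch for intuition rather than either library's actual code; wait_until and its defaults are illustrative names and values.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll condition() until it returns a truthy value or the timeout elapses.

    Returns the truthy value, so the caller gets the result and the wait in
    one step. Raises TimeoutError if the condition never holds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)
```

Unlike a fixed sleep, this returns as soon as the content is there and fails loudly when it never arrives. With Selenium, a condition could be a lambda that calls driver.find_elements(By.CSS_SELECTOR, ".quote"), which is truthy once the list is non-empty.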
Infinite scroll needs a loop that scrolls, waits for new content, and stops when the page height stops growing:
# Playwright
previous_height = 0
while True:
    current_height = page.evaluate("document.body.scrollHeight")
    if current_height == previous_height:
        break  # height stopped growing: no more content to load
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    page.wait_for_timeout(2000)  # crude fixed pause for brevity; a condition-based wait is more robust
    previous_height = current_height
This pattern recovers lazy-loaded items, but every extra scroll keeps the browser process alive longer and grows its memory footprint — which is exactly where the cost story starts.
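A variant of the scroll loop can replace the single fixed pause with short polls and cap the total number of scrolls, so an endless feed cannot hold the browser open forever. This is a sketch under assumptions: page is expected to be a Playwright sync Page (only evaluate() and wait_for_timeout() are used), and the function name and default limits are illustrative.

```python
def scroll_to_bottom(page, max_rounds=50, grow_timeout_ms=5000, poll_ms=250):
    """Scroll until the page height stops growing; return the number of scrolls."""
    rounds = 0
    height = page.evaluate("document.body.scrollHeight")
    while rounds < max_rounds:
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        rounds += 1
        waited = 0
        new_height = height
        # Poll in short steps and move on as soon as new content appears.
        while waited < grow_timeout_ms:
            page.wait_for_timeout(poll_ms)
            waited += poll_ms
            new_height = page.evaluate("document.body.scrollHeight")
            if new_height > height:
                break
        if new_height <= height:
            break  # no growth within the timeout: assume the end was reached
        height = new_height
    return rounds
```

The max_rounds cap is the cost control: it bounds how long one job can keep a browser process alive, at the price of possibly truncating very long feeds.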
The hidden cost of running browser fleets
A script on your laptop is free. Running it thousands of times a day is not. Headless Chrome is heavy and the costs are structural, not incidental.
- Memory. Each headless Chrome instance typically uses on the order of a few hundred MB of RAM, and JavaScript execution spikes CPU. A modest cloud box realistically runs only a handful of concurrent browsers before it starts thrashing.
- Flakiness. Browser processes crash, leak memory, and leave zombies. Chrome auto-updates break drivers every few months. Selectors rot when sites change markup.
- Scaling work. Concurrency means job queues, coordination, restart-survivable sessions, and proxy rotation — all of which you build and then maintain forever.
- Ops salary. The dominant line item is usually not compute; it is the engineer-hours spent babysitting the fleet.
At small scale, DIY wins — you control everything and the marginal cost per page is low. As volume climbs, the equation flips: the infrastructure plus maintenance time can cost more than a managed API that bills per page. The break-even is worth calculating before you commit to infrastructure you maintain indefinitely. For the memory side specifically, we wrote up low-memory scraping with concrete numbers.
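The break-even can be estimated with a few lines of arithmetic. The sketch below is a deliberately simple model: every input in the usage lines (RAM per browser, server price, ops hours, hourly rate, per-page price) is a placeholder assumption to replace with your own measurements and live prices, and the function names are illustrative.

```python
import math


def servers_needed(concurrent_browsers: int, ram_gb_per_server: float,
                   mb_per_browser: float, reserved_gb: float = 1.0) -> int:
    """Number of boxes required to hold the target count of concurrent browsers."""
    usable_mb = (ram_gb_per_server - reserved_gb) * 1024
    per_server = max(1, int(usable_mb // mb_per_browser))
    return math.ceil(concurrent_browsers / per_server)


def diy_monthly_usd(servers: int, usd_per_server: float,
                    ops_hours: float, usd_per_hour: float) -> float:
    """Compute plus engineer time for a self-run fleet."""
    return servers * usd_per_server + ops_hours * usd_per_hour


def managed_monthly_usd(pages: int, usd_per_1000_pages: float) -> float:
    """Bill for a managed API that charges per page."""
    return pages / 1000 * usd_per_1000_pages


# Placeholder inputs only; none of these figures come from a price list.
servers = servers_needed(concurrent_browsers=40, ram_gb_per_server=8,
                         mb_per_browser=400)
diy = diy_monthly_usd(servers, usd_per_server=40, ops_hours=20, usd_per_hour=75)
managed = managed_monthly_usd(pages=2_000_000, usd_per_1000_pages=1.0)
print(f"servers={servers} diy=${diy:,.0f} managed=${managed:,.0f}")
```

The ops_hours term usually dominates, so it deserves the most honest estimate. Rerun the comparison at a few volumes, since both the server count and the maintenance hours tend to grow with concurrency.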
Offloading to a managed renderer
If the math says "managed," the goal is to delete the browser fleet from your stack without rewriting your pipeline. fastCRW exposes a Firecrawl-compatible REST API where rendering is a server-side concern: you send a URL, the engine picks a renderer, and you get clean Markdown or structured JSON back. No driver downloads, no xvfb, no zombie processes.
fastCRW's renderer selection (auto by default) tries chrome → lightpanda → http and falls back automatically. You can force the chrome renderer for JS-heavy pages:
from crw import CrwClient

client = CrwClient()  # self-contained local engine, or point at the cloud
result = client.scrape(
    "https://quotes.toscrape.com/js/page/1/",
    formats=["markdown"],
    renderer="chrome",  # full headless Chrome, managed for you
)
print(result["markdown"])
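For intuition about what an automatic fallback order such as chrome, then lightpanda, then http means, here is a generic fallback-chain pattern. It is an illustration of the concept only, not fastCRW's implementation; render_with_fallback and the stub renderers in the usage note are hypothetical.

```python
def render_with_fallback(url, renderers):
    """Try each (name, render_fn) pair in order.

    Returns (name, content) from the first renderer that produces non-empty
    content. Raises RuntimeError if every renderer fails.
    """
    errors = {}
    for name, render in renderers:
        try:
            content = render(url)
        except Exception as exc:  # a real engine would catch narrower error types
            errors[name] = exc
            continue
        if content:
            return name, content
        errors[name] = ValueError("empty result")
    raise RuntimeError(f"all renderers failed for {url}: {errors}")
```

Called with a list of (name, function) pairs, it returns which renderer actually succeeded alongside the content, which is useful when you want to log how often the expensive path was needed.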
Already on Firecrawl's SDK? Because the API shape matches, migrating is a base-URL swap — keep your code, point api_url at fastCRW:
from firecrawl import FirecrawlApp

app = FirecrawlApp(
    api_key="fc-...",
    api_url="https://api.fastcrw.com",  # the only change
)
app.scrape_url("https://quotes.toscrape.com/js/page/1/", formats=["markdown"])
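At the HTTP level, a base-URL swap means the request itself stays the same and only the host changes. The sketch below builds such a request with the standard library without sending it. The /v1/scrape path, the Bearer header, and the JSON body are assumptions modelled on Firecrawl's v1 API; check the fastCRW docs for the exact contract before relying on them.

```python
import json
import urllib.request


def build_scrape_request(base_url: str, api_key: str, target_url: str,
                         formats=("markdown",)) -> urllib.request.Request:
    """Build (but do not send) a Firecrawl-style scrape request."""
    body = json.dumps({"url": target_url, "formats": list(formats)}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/scrape",  # assumed path, not confirmed by the article
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Same payload, different host: only base_url differs between providers.
req = build_scrape_request(
    "https://api.fastcrw.com", "fc-...", "https://quotes.toscrape.com/js/page/1/"
)
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it. Keeping construction separate from sending makes it easy to diff the two providers' requests in a test.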
The structural difference from a self-hosted fleet is the footprint. fastCRW's engine is a single static Rust binary — roughly an 8 MB image, one container (plus an optional sidecar) — versus a typical Firecrawl-style self-host stack of about 2–3 GB across five containers (structural facts from the README, not a benchmark). That is the difference between "self-host is a platform-team project" and "self-host is one docker run." If you want to keep rendering in-house but stop maintaining a browser farm, that single binary is the point — see single-binary infra.
On accuracy, the managed engine is not just convenience: on Firecrawl's own public 1,000-URL dataset, fastCRW reached 63.74% truth-recall on the 819 labeled URLs — the highest of the three tools tested — with 91.8% scrape-success of reachable URLs and 0 thrown errors (diagnose_3way.py, 2026-05-08). Latency on that run: p50 1914 ms; in fast mode, p90 4348 ms — the lowest of the three (Crawl4AI 4754 ms, Firecrawl 6937 ms). fastCRW also recovers 34 URLs that neither Crawl4AI nor Firecrawl reach — 70% more exclusive recoveries than the other two combined. We publish the full split rather than a single average. See /benchmarks for the complete table.
Headless browser vs scraping API: a decision guide
Neither choice is universally right. Use this to decide per project, not by dogma.
| Pick a self-hosted headless browser when… | Pick a managed scraping API when… |
|---|---|
| You're learning scraping fundamentals | You're running production at scale |
| You need complex auth / multi-step login flows | You want to delete browser-fleet maintenance |
| You're scraping internal tools behind a firewall | You want extraction that survives HTML changes |
| You need full, low-level control of the browser | Engineering time costs more than per-page fees |
| Volume is low and the box never thrashes | You need concurrency without building a queue |
A few honest caveats so you choose with eyes open. fastCRW is stateless per request, so it is not built for persistent interactive sessions — for true multi-step login-and-click flows, a headless browser you drive directly is still the right tool. fastCRW also does not return screenshots (a request for formats: ["screenshot"] returns HTTP 422), has no built-in anti-bot/fingerprint-evasion engine, and its LLM extraction supports OpenAI and Anthropic providers only. If those are hard requirements, keep the browser. If you mainly need rendered content turned into clean Markdown or JSON at volume, the managed path removes the fleet.
On cost: self-hosting the AGPL-3.0 engine is $0 per 1,000 scrapes beyond your own server bill, which gives you a hard worst-case ceiling that a hosted-only model structurally cannot offer. On the managed cloud, pricing is credit-based: 500 one-time credits, with paid tiers starting low (Hobby). Current launch pricing reverts to regular price on 2026-06-01, so check live /pricing rather than trusting a number in a blog post.
Recommendation
Start with Playwright headless to learn the fundamentals — waiting on conditions, infinite scroll, and the server-deployment gotchas. Reach for Selenium only when you need its specific browser coverage. Then, the day your fleet's memory thrashing and maintenance hours start outweighing a per-page bill, offload rendering to a managed renderer — and pick one whose API is compatible and whose engine you can also self-host, so the decision stays reversible. Because fastCRW is Firecrawl-compatible on both self-host and cloud, you never have to take this on faith: change one line, measure on your own URL mix, and let the numbers decide.
Sources
- fastCRW benchmark of record: 3-way scrape run, 819 labeled URLs, diagnose_3way.py, 2026-05-08 (see /benchmarks)
- fastCRW footprint and renderer selection: github.com/us/crw (README, structural facts)
- Pricing and credits: live /pricing (launch pricing reverts 2026-06-01)
- Playwright Python docs: playwright.dev/python · Selenium docs: selenium.dev
Related: Playwright as a scraper vs fastCRW · Selenium vs CRW · Low-memory scraping · Single-binary infra
