Headless Browser Scraping: A Practical Guide

Headless browser scraping in Python with Playwright and Selenium: waiting, infinite scroll, the true cost of a browser fleet, and when to offload to an API.

By Recep · June 21, 2026 · 12 min read · Last updated: June 1, 2026

By the fastCRW team · Footprint and benchmark facts verified 2026-05-29 · fastCRW launch pricing reverts 2026-06-01 · Verify independently.

Headless browser scraping, end to end

Headless browser scraping means driving a real browser engine (Chrome, Firefox, or WebKit) without its graphical window, so that JavaScript runs and the page renders before you read the DOM. It is the standard fix for client-rendered sites: apps built with React, Vue, or Angular often ship a near-empty HTML shell, then build the page in the browser. A plain HTTP fetch with requests + BeautifulSoup sees the shell and nothing else, so you need a headless browser to execute the scripts first. This guide covers the technique in Python with Playwright and Selenium, then lays out what a browser fleet actually costs to run and when a managed renderer is the saner call.

Disclosure: we build fastCRW, a Firecrawl-compatible open-core scraping engine. We will teach DIY headless scraping straight, because most of the time it is the right starting point, and only then explain where our managed chrome renderer fits.

Is the site actually dynamic?

Before reaching for a browser, confirm you need one — headless Chrome is the most expensive way to fetch a page. Three quick checks:

  • View Source vs Inspect. If "View Page Source" is mostly an empty <div id="root"> but the live DOM in DevTools is full of content, JavaScript is rendering it.
  • Disable JavaScript in DevTools and reload. If the content vanishes, you need rendering.
  • Watch the Network tab. Often the data arrives as a clean JSON XHR you can call directly with requests — no browser required, and far faster.

If a hidden API exists, use it. Reserve headless browsers for genuinely client-rendered pages, infinite scroll, and interaction-gated content.
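The "View Source vs Inspect" check can be roughly automated. The sketch below is illustrative only: `visible_text` and `needs_rendering` are hypothetical helper names, not part of any library, and the two HTML strings are made-up inputs. It parses raw, unrendered HTML with the standard library, ignores script and style content, and reports whether a phrase you can see in the browser is missing from the markup. It assumes you have already fetched the HTML yourself, for example with requests.

```python
from html.parser import HTMLParser

# Tags whose text never appears as page content.
_HIDDEN_TAGS = {"script", "style", "noscript", "title"}


class _VisibleText(HTMLParser):
    """Collects text nodes that sit outside script/style-like tags."""

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in _HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in _HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._hidden_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(raw_html: str) -> str:
    """Return the text a reader would see if no JavaScript ran."""
    parser = _VisibleText()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.chunks)


def needs_rendering(raw_html: str, expected_phrase: str) -> bool:
    """True when a phrase seen in the live page is absent from the raw HTML text."""
    return expected_phrase.lower() not in visible_text(raw_html).lower()


# Hypothetical inputs: a client-rendered shell and a server-rendered page.
SHELL = (
    '<html><head><title>Quotes</title>'
    '<script>var data = "Hello quotes";</script></head>'
    '<body><div id="root"></div></body></html>'
)
STATIC = (
    '<html><body><div class="quote">'
    '<span class="text">Hello quotes</span></div></body></html>'
)

print("shell needs a browser:", needs_rendering(SHELL, "Hello quotes"))
print("static needs a browser:", needs_rendering(STATIC, "Hello quotes"))
```

A phrase that only appears inside a script tag still counts as "missing" here, which is usually a hint that the data is embedded as JSON or fetched by XHR and may be reachable without a browser.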

Setting up Playwright or Selenium headless

Playwright is the modern default: it bundles browser binaries, auto-waits for elements, and talks to Chrome over the DevTools Protocol. Selenium is older and has the larger community, but it goes through the WebDriver protocol and needs a driver binary that matches your browser. Selenium 4.6+ can resolve that driver automatically through Selenium Manager; the example below uses webdriver-manager instead. We will scrape the JS-rendered practice site quotes.toscrape.com/js with both.

Playwright (recommended for new projects)

pip install playwright
playwright install chromium

import json

from playwright.sync_api import sync_playwright

all_quotes = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(viewport={"width": 1920, "height": 1080})

    for page_num in range(1, 11):
        page.goto(f"https://quotes.toscrape.com/js/page/{page_num}/")
        page.locator(".quote").first.wait_for()   # wait for render

        for quote in page.locator(".quote").all():
            all_quotes.append({
                "text": quote.locator(".text").text_content().strip("\u201c\u201d"),
                "author": quote.locator(".author").text_content().strip(),
                "tags": [t.text_content() for t in quote.locator(".tag").all()],
            })

    browser.close()

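# Use a context manager so the file handle is flushed and closed.
with open("quotes.json", "w", encoding="utf-8") as f:
    json.dump(all_quotes, f, indent=2, ensure_ascii=False)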
print(f"Saved {len(all_quotes)} quotes")

Note headless=True and the explicit viewport: some sites render different layouts at small dimensions. browser.close() shuts the browser down explicitly, and the sync_playwright() context manager then stops the Playwright driver process.

Selenium headless

pip install selenium webdriver-manager

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("--headless=new")
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options,
)

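try:
    driver.get("https://quotes.toscrape.com/js/page/1/")
    # Block until at least one quote is in the DOM (up to 10 s).
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".quote"))
    )
    quotes = driver.find_elements(By.CSS_SELECTOR, ".quote")
    print(f"Found {len(quotes)} quotes")
finally:
    driver.quit()  # always release the browser, even if the wait times out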

The two big server-side gotchas live here: --no-sandbox is usually required inside containers, where Chrome typically runs as root and refuses to start with its sandbox enabled, and a headless Linux box needs Chrome's full dependency tree (fonts, libx11-xcb1, and friends) or the browser fails to launch. This "works on my laptop, dies on the server" gap is the first tax of running browsers yourself. For a deeper side-by-side, see Selenium vs CRW and Playwright as a scraper vs fastCRW.

Waiting for content and handling infinite scroll

The single most common headless-scraping bug is reading the DOM before it is ready. Never use a fixed time.sleep() — it is either too short (flaky) or too long (slow). Wait on a condition instead.

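  • Playwright: locator actions auto-wait; call .wait_for() to block until an element is visible, or page.wait_for_load_state("networkidle") for XHR-heavy pages. Playwright's docs discourage leaning on networkidle, so prefer waiting for a specific element when one exists.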
  • Selenium: use WebDriverWait(...).until(EC.presence_of_element_located(...)) as above.
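Both tools implement the same idea: poll a condition until it holds or a deadline passes. The sketch below shows that idea in plain Python. `wait_until` and `fake_element_ready` are hypothetical names for illustration, not Playwright or Selenium APIs, and the real libraries differ in details such as default timeouts and ignored exceptions.

```python
import time


def wait_until(predicate, timeout=10.0, poll=0.25):
    """Poll predicate() until it returns something truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result  # hand back whatever the condition produced
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)  # short pause between checks, not one fixed wait


# Stand-in for "is the element in the DOM yet?": ready on the third check.
_checks = {"count": 0}


def fake_element_ready():
    _checks["count"] += 1
    return "element" if _checks["count"] >= 3 else None


print(wait_until(fake_element_ready, timeout=2.0, poll=0.01))
```

The short sleep inside the loop is only the polling interval. The wait ends as soon as the condition is true, which is what separates it from a fixed time.sleep().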

Infinite scroll needs a loop that scrolls, waits for new content, and stops when the page height stops growing:

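# Playwright
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

previous_height = page.evaluate("document.body.scrollHeight")
while True:
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    try:
        # Wait on a condition: the page is taller than before the scroll.
        page.wait_for_function(
            "prev => document.body.scrollHeight > prev",
            arg=previous_height,
            timeout=5000,
        )
    except PlaywrightTimeoutError:
        break   # no growth within 5 s: assume the feed is exhausted
    previous_height = page.evaluate("document.body.scrollHeight")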

This pattern recovers lazy-loaded items, but every extra scroll keeps the browser process alive longer and grows its memory footprint — which is exactly where the cost story starts.
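A loop that stops only when the height stops growing never ends on a truly endless feed, so a hard cap on scrolls is a sensible guard. The sketch below separates the loop logic from the browser so it can run without one. `scroll_until_stable` and `FakeFeed` are hypothetical names, and the callables stand in for the Playwright calls you would pass in practice.

```python
def scroll_until_stable(get_height, scroll_to_bottom, grew_since, max_scrolls=50):
    """Scroll until the page stops growing or max_scrolls is reached.

    Returns the number of scrolls that loaded new content.
    """
    height = get_height()
    for loaded in range(max_scrolls):
        scroll_to_bottom()
        if not grew_since(height):
            return loaded  # nothing new appeared: the feed is exhausted
        height = get_height()
    return max_scrolls  # cap reached: stop even though the feed may continue


class FakeFeed:
    """In-memory stand-in for a page that lazy-loads a fixed number of batches."""

    def __init__(self, batches):
        self.height = 1000
        self.remaining = batches

    def get_height(self):
        return self.height

    def scroll(self):
        if self.remaining > 0:
            self.remaining -= 1
            self.height += 1000  # each batch makes the page taller

    def grew_since(self, previous_height):
        return self.height > previous_height


feed = FakeFeed(batches=3)
print("batches loaded:", scroll_until_stable(feed.get_height, feed.scroll, feed.grew_since))

# With a Playwright page, get_height and scroll_to_bottom could wrap
# page.evaluate(...) calls, and grew_since could wrap a condition wait.
```

Returning the count makes it easy to log how deep each crawl went and to spot pages that always hit the cap.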

The hidden cost of running browser fleets

A script on your laptop is free. Running it thousands of times a day is not. Headless Chrome is heavy and the costs are structural, not incidental.

  • Memory. Each headless Chrome instance typically uses on the order of a few hundred MB of RAM, and JavaScript execution spikes CPU. A modest cloud box realistically runs only a handful of concurrent browsers before it starts thrashing.
  • Flakiness. Browser processes crash, leak memory, and leave zombies. Chrome auto-updates break drivers every few months. Selectors rot when sites change markup.
  • Scaling work. Concurrency means job queues, coordination, restart-survivable sessions, and proxy rotation — all of which you build and then maintain forever.
  • Ops salary. The dominant line item is usually not compute; it is the engineer-hours spent babysitting the fleet.
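The memory point reduces to simple arithmetic. The sketch below is a back-of-the-envelope estimate only: `max_concurrent_browsers` is a hypothetical helper, and the 400 MB per browser, 1 GB reserved for the OS, and 20% headroom are assumed figures for illustration, not measurements. Measure your own pages before sizing a box.

```python
def max_concurrent_browsers(total_ram_mb, per_browser_mb, reserved_mb=1024, headroom=0.2):
    """Estimate how many browsers fit in RAM after OS reserve and safety headroom.

    All inputs are assumptions supplied by the caller.
    """
    if per_browser_mb <= 0:
        raise ValueError("per_browser_mb must be positive")
    usable_mb = (total_ram_mb - reserved_mb) * (1 - headroom)
    if usable_mb <= 0:
        return 0
    return int(usable_mb // per_browser_mb)


# Assumed example: a 4 GB box with roughly 400 MB per headless Chrome instance.
print("4 GB box:", max_concurrent_browsers(4096, 400))
print("8 GB box:", max_concurrent_browsers(8192, 300))
```

Under these assumed inputs the result lands in single digits for a small box, which matches the "handful of concurrent browsers" described above.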

At small scale, DIY wins — you control everything and the marginal cost per page is low. As volume climbs, the equation flips: the infrastructure plus maintenance time can cost more than a managed API that bills per page. The break-even is worth calculating before you commit to infrastructure you maintain indefinitely. For the memory side specifically, we wrote up low-memory scraping with concrete numbers.
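A break-even check can be as small as the sketch below. Every number in it is a placeholder chosen for illustration (server bill, engineer hours, hourly rate, and per-1,000-page API price are all hypothetical), and `compare_monthly_cost` is not from any library. The point is the shape of the comparison: DIY cost is dominated by fixed monthly spend, while API cost scales with page volume.

```python
def compare_monthly_cost(pages, server_usd, engineer_hours, hourly_rate_usd, api_usd_per_1000):
    """Compare a self-run fleet against a per-page API for one month.

    All inputs are the caller's own estimates for that volume.
    """
    diy_usd = server_usd + engineer_hours * hourly_rate_usd
    api_usd = pages / 1000 * api_usd_per_1000
    return {
        "diy_usd": diy_usd,
        "api_usd": api_usd,
        "cheaper": "diy" if diy_usd < api_usd else "api",
    }


# Hypothetical small project: one cheap box, no ongoing maintenance time.
small = compare_monthly_cost(
    pages=5_000, server_usd=10, engineer_hours=0,
    hourly_rate_usd=80, api_usd_per_1000=3,
)

# Hypothetical large project: more servers and many maintenance hours.
large = compare_monthly_cost(
    pages=2_000_000, server_usd=240, engineer_hours=80,
    hourly_rate_usd=80, api_usd_per_1000=3,
)

print("small:", small["cheaper"], small)
print("large:", large["cheaper"], large)
```

The engineer-hours input is the one that usually decides the outcome, so it is worth tracking real maintenance time for a month before trusting any estimate.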

Offloading to a managed renderer

If the math says "managed," the goal is to delete the browser fleet from your stack without rewriting your pipeline. fastCRW exposes a Firecrawl-compatible REST API where rendering is a server-side concern: you send a URL, the engine picks a renderer, and you get clean Markdown or structured JSON back. No driver downloads, no xvfb, no zombie processes.

fastCRW's renderer selection (auto by default) tries chrome → lightpanda → http and falls back automatically. You can force the chrome renderer for JS-heavy pages:

from crw import CrwClient

client = CrwClient()  # self-contained local engine, or point at the cloud

result = client.scrape(
    "https://quotes.toscrape.com/js/page/1/",
    formats=["markdown"],
    renderer="chrome",   # full headless Chrome, managed for you
)
print(result["markdown"])
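The chrome → lightpanda → http order described above follows a common try-in-order pattern. The sketch below illustrates that general pattern only and is not fastCRW's implementation: what triggers a fallback inside the engine is not specified here, so this version simply moves on after any exception. The `fake_*` functions are hypothetical stand-ins for real renderers.

```python
from typing import Callable, List, Sequence, Tuple

Renderer = Callable[[str], str]


def render_with_fallback(url: str, renderers: Sequence[Tuple[str, Renderer]]) -> Tuple[str, str]:
    """Try each (name, renderer) in order; return the first success as (name, content)."""
    errors: List[str] = []
    for name, render in renderers:
        try:
            return name, render(url)
        except Exception as exc:  # assumption: any failure triggers the next renderer
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all renderers failed: " + "; ".join(errors))


# Hypothetical stand-ins for the three renderer tiers.
def fake_chrome(url: str) -> str:
    raise RuntimeError("browser crashed")


def fake_lightpanda(url: str) -> str:
    return f"<rendered {url}>"


def fake_http(url: str) -> str:
    return f"<raw {url}>"


CHAIN = [("chrome", fake_chrome), ("lightpanda", fake_lightpanda), ("http", fake_http)]
used, content = render_with_fallback("https://example.com", CHAIN)
print(used, content)
```

Returning the name of the renderer that succeeded makes it possible to log how often the expensive tier is actually needed.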

Already on Firecrawl's SDK? Because the API shape matches, migrating is a base-URL swap — keep your code, point api_url at fastCRW:

from firecrawl import FirecrawlApp

app = FirecrawlApp(
    api_key="fc-...",
    api_url="https://api.fastcrw.com",  # the only change
)
app.scrape_url("https://quotes.toscrape.com/js/page/1/", formats=["markdown"])

The structural difference from a self-hosted fleet is the footprint. fastCRW's engine is a single static Rust binary — roughly an 8 MB image, one container (plus an optional sidecar) — versus a typical Firecrawl-style self-host stack of about 2–3 GB across five containers (structural facts from the README, not a benchmark). That is the difference between "self-host is a platform-team project" and "self-host is one docker run." If you want to keep rendering in-house but stop maintaining a browser farm, that single binary is the point — see single-binary infra.

On accuracy, the managed engine is not just convenience: on Firecrawl's own public 1,000-URL dataset, fastCRW reached 63.74% truth-recall on the 819 labeled URLs — the highest of the three tools tested — with 91.8% scrape-success of reachable URLs and 0 thrown errors (diagnose_3way.py, 2026-05-08). Latency on that run: p50 1914 ms; in fast mode, p90 4348 ms — the lowest of the three (Crawl4AI 4754 ms, Firecrawl 6937 ms). fastCRW also recovers 34 URLs that neither Crawl4AI nor Firecrawl reach — 70% more exclusive recoveries than the other two combined. We publish the full split rather than a single average. See /benchmarks for the complete table.

Headless browser vs scraping API: a decision guide

Neither choice is universally right. Use this to decide per project, not by dogma.

| Pick a self-hosted headless browser when… | Pick a managed scraping API when… |
| --- | --- |
| You're learning scraping fundamentals | You're running production at scale |
| You need complex auth / multi-step login flows | You want to delete browser-fleet maintenance |
| You're scraping internal tools behind a firewall | You want extraction that survives HTML changes |
| You need full, low-level control of the browser | Engineering time costs more than per-page fees |
| Volume is low and the box never thrashes | You need concurrency without building a queue |

A few honest caveats so you choose with eyes open. fastCRW is stateless per request, so it is not built for persistent interactive sessions — for true multi-step login-and-click flows, a headless browser you drive directly is still the right tool. fastCRW also does not return screenshots (a request for formats: ["screenshot"] returns HTTP 422), has no built-in anti-bot/fingerprint-evasion engine, and its LLM extraction supports OpenAI and Anthropic providers only. If those are hard requirements, keep the browser. If you mainly need rendered content turned into clean Markdown or JSON at volume, the managed path removes the fleet.
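The caveats above can be written down as a simple pre-flight check for a project. This is a sketch of the decision logic only: `Requirements`, `managed_blockers`, and `recommend` are hypothetical names, and the rules encode nothing beyond the four limitations listed in the paragraph above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Providers named in the caveats above.
SUPPORTED_LLM_PROVIDERS = {"openai", "anthropic"}


@dataclass
class Requirements:
    persistent_session: bool = False   # multi-step login-and-click flows
    screenshots: bool = False
    anti_bot_evasion: bool = False
    llm_provider: Optional[str] = None  # None means no LLM extraction needed


def managed_blockers(req: Requirements) -> List[str]:
    """List the hard requirements that the stated caveats rule out."""
    blockers = []
    if req.persistent_session:
        blockers.append("stateless per request: no persistent interactive sessions")
    if req.screenshots:
        blockers.append("screenshots are not returned")
    if req.anti_bot_evasion:
        blockers.append("no built-in anti-bot or fingerprint evasion")
    if req.llm_provider is not None and req.llm_provider.lower() not in SUPPORTED_LLM_PROVIDERS:
        blockers.append(f"LLM provider '{req.llm_provider}' is not supported")
    return blockers


def recommend(req: Requirements) -> str:
    return "headless browser" if managed_blockers(req) else "managed renderer"


print(recommend(Requirements()))
print(recommend(Requirements(screenshots=True)), managed_blockers(Requirements(screenshots=True)))
```

Listing the blockers rather than returning a bare yes or no keeps the reasoning visible when requirements change later.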

On cost: self-hosting the AGPL-3.0 engine is $0 per 1,000 scrapes beyond your own server bill, which gives you a hard worst-case ceiling that a hosted-only model structurally cannot offer. On the managed cloud, pricing is credit-based: you get 500 one-time credits, and paid tiers start low (Hobby). Current launch pricing reverts to regular price on 2026-06-01, so check live /pricing rather than trusting a number in a blog post.

Recommendation

Start with Playwright headless to learn the fundamentals — waiting on conditions, infinite scroll, and the server-deployment gotchas. Reach for Selenium only when you need its specific browser coverage. Then, the day your fleet's memory thrashing and maintenance hours start outweighing a per-page bill, offload rendering to a managed renderer — and pick one whose API is compatible and whose engine you can also self-host, so the decision stays reversible. Because fastCRW is Firecrawl-compatible on both self-host and cloud, you never have to take this on faith: change one line, measure on your own URL mix, and let the numbers decide.

Sources

  • fastCRW benchmark of record — 3-way scrape run, 819 labeled URLs, diagnose_3way.py, 2026-05-08 — see /benchmarks
  • fastCRW footprint and renderer selection — github.com/us/crw (README, structural facts)
  • Pricing and credits — live /pricing (launch pricing reverts 2026-06-01)
  • Playwright Python docs — playwright.dev/python · Selenium docs — selenium.dev

Related: Playwright as a scraper vs fastCRW · Selenium vs CRW · Low-memory scraping · Single-binary infra

Frequently asked questions

What is a headless browser?
A headless browser is a real browser engine (Chrome, Firefox, or WebKit) run without its graphical window. It loads a page, executes JavaScript, and renders the DOM exactly as a normal browser would, so you can scrape content that only appears after client-side scripts run — which a plain HTTP request would never see.
Is headless scraping expensive to run at scale?
At small scale it's cheap. At scale the costs are structural: each headless Chrome instance uses a few hundred MB of RAM and spikes CPU, so a single box runs only a handful of concurrent browsers, and you also pay for proxies, monitoring, and the engineer-hours spent fixing crashes, memory leaks, and broken selectors. Past a break-even point, a per-page managed API can cost the same or less.
When should I use a managed API instead of a headless browser?
Use a managed API when you're running production at scale, want to stop maintaining a browser fleet, need extraction that survives HTML changes, or when engineering time costs more than per-page fees. Keep a headless browser for learning, complex authentication or multi-step interactive flows, internal tools behind a firewall, or when you need full low-level control.
How much memory does headless Chrome use?
Each headless Chrome instance typically consumes on the order of a few hundred MB of RAM, with CPU spikes during JavaScript execution. That's why a modest cloud server can only run a handful of concurrent instances before it starts thrashing memory, and why scaling DIY headless scraping forces you into job queues and multiple workers.
Does fastCRW run a headless browser for me?
Yes. fastCRW's engine selects a renderer automatically (chrome → lightpanda → http) and you can force the chrome renderer for JS-heavy pages, so rendering is handled server-side — no driver downloads, no xvfb, no zombie processes. The engine ships as a single ~8 MB Rust binary in one container, versus a typical Firecrawl-style self-host stack of ~2-3 GB across five containers (structural facts, not a benchmark). Note it's stateless per request and returns no screenshots, so true persistent interactive sessions still need a browser you drive directly.

