The Core Question
Before you install Playwright and spin up Chromium, ask one question: does this page render its content server-side, or does it require JavaScript to show anything meaningful?
If the server sends back HTML with the content already in it (view the page source and the text is there), a plain HTTP request is enough. If the server sends back an empty <div id="root"></div> and all the content arrives via JavaScript after the browser loads and executes bundles, you need a JavaScript runtime — which means a headless browser.
The internet is still mostly server-rendered. Articles, documentation, product pages, news, blog posts, e-commerce listings — most of these are HTML-primary. Browser automation is a useful tool, but it's the heavyweight option. Reach for it when you need it, not by default.
When Raw HTTP (curl / requests / fetch) Is Enough
1. Server-rendered pages
Most content sites, documentation, marketing pages, and e-commerce product listings render their HTML on the server. The full content is in the HTML response. A curl call gets it all:
curl -s "https://docs.example.com/getting-started" -H "User-Agent: Mozilla/5.0 (compatible; MyBot/1.0)"
Python equivalent:
import requests
response = requests.get(
"https://docs.example.com/getting-started",
headers={"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"},
)
html = response.text
2. REST and JSON APIs
If a site has a public API or its frontend fetches data from an API endpoint, calling the API directly is always cleaner than scraping the rendered page. Open DevTools, watch the Network tab, find the JSON fetch — then curl that directly.
curl "https://api.example.com/v1/products?page=1" -H "Accept: application/json"
3. RSS and sitemaps
Many sites expose RSS feeds and sitemaps as structured XML. These are trivially parseable with any HTTP client and never require a browser.
4. High-volume crawling
At scale (hundreds of pages per minute), every millisecond and every megabyte matters. Launching a Chromium instance per page is expensive — even with browser context reuse, each session carries a browser's memory overhead. Plain HTTP at volume is orders of magnitude cheaper on infrastructure.
When You Actually Need Playwright
1. Single-page applications
SPAs built with React, Vue, Angular, or similar frameworks often return an almost empty HTML document and load all content via JavaScript. The raw HTML from a curl call contains no useful content. You need a browser to execute the JavaScript and get the rendered DOM:
import { chromium } from "playwright";
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://app.example.com/dashboard");
await page.waitForSelector(".data-table");
const content = await page.textContent(".data-table");
await browser.close();
2. Interaction-required flows
Login flows, form submissions, infinite scroll pages, click-to-reveal content — anything that requires simulating user actions needs a browser. curl cannot click a button or type into a form.
await page.fill("#email", "user@example.com");
await page.fill("#password", "secret");
await page.click('button[type="submit"]');
await page.waitForNavigation();
3. JavaScript-based anti-bot challenges
Some sites use JavaScript challenges (Cloudflare Turnstile, DataDome, PerimeterX) that require the browser to solve a proof-of-work or render a CAPTCHA before serving content. These are designed to be invisible to HTTP clients without a JavaScript runtime. A headless browser can attempt to pass them (sometimes with stealth plugins); curl cannot.
4. Screenshots and visual capture
If your workflow requires a visual snapshot of the rendered page, a headless browser is the only option. Playwright can capture full-page screenshots or specific element screenshots directly.
await page.screenshot({ path: "page.png", fullPage: true });
The Real Cost of Playwright at Scale
The browser automation mental model is: one Playwright instance = one browser process = one full Chromium session. In practice:
| Approach | Memory per concurrent scrape | Latency shape | Infrastructure for 50 concurrent |
|---|---|---|---|
| curl / requests / fetch | Negligible (network buffer only) | Network round-trip only | Any small server |
| Playwright (browser reuse) | One Chromium process shared across contexts | Network + render cycle per page | Need enough RAM for Chromium + contexts |
| Playwright (one browser per URL) | Full Chromium per scrape | Browser cold start + render | Large-memory server or cluster |
Browser context reuse (one Chromium process, many browser contexts) reduces the memory penalty significantly — this is how Playwright handles concurrency well. But even with context reuse, the idle Chromium process holds memory, and each page's render cycle adds latency that a pure HTTP request avoids entirely.
A Third Path: curl-Simple Calls, Browser-Grade Results
There is a middle ground between "raw curl" and "full Playwright" that is worth knowing about, especially for AI-pipeline workloads where you want clean markdown output without managing a browser yourself.
fastCRW is a Rust-based scraping API that accepts a single POST (as simple as a curl call) and returns clean markdown, HTML, links, or structured JSON. Under the hood, it uses lol-html — Cloudflare's streaming HTML parser — for HTML-primary pages, and falls back to LightPanda (a lightweight headless browser) for pages that need JavaScript rendering. You make one API call; the engine decides which renderer to use.
Single curl call, browser-grade output
curl -X POST https://api.fastcrw.com/v1/scrape -H "Authorization: Bearer fc-YOUR_API_KEY" -H "Content-Type: application/json" -d '{"url": "https://docs.example.com/getting-started", "formats": ["markdown"]}'
Response:
{
"success": true,
"data": {
"markdown": "# Getting Started
This guide walks you through...",
"metadata": { "title": "Getting Started", "statusCode": 200 }
}
}
Python equivalent (using the Firecrawl SDK — fastCRW is API-compatible):
from firecrawl import FirecrawlApp
# Point the SDK at fastCRW (cloud or self-hosted)
app = FirecrawlApp(
api_key="fc-YOUR_API_KEY",
api_url="https://api.fastcrw.com", # or http://localhost:3000 self-hosted
)
result = app.scrape_url(
"https://docs.example.com/getting-started",
formats=["markdown"],
)
print(result.markdown)
The key difference from raw curl: fastCRW handles noise removal, markdown conversion, and falls back to browser rendering when needed — all without you managing Playwright, Chromium, or browser contexts. The key difference from Playwright: you're not managing a browser process; the rendering decision is server-side and transparent to your code.
On benchmark data
On Firecrawl's own 1,000-URL public benchmark dataset (819 labeled), fastCRW reached 63.74% truth-recall with 87.7% scrape-success and 0 errors (diagnose_3way.py, 2026-05-08). p50 latency was 1,914 ms. p90 was 14,157 ms — the wide tail is caused by the chrome-stealth fallback that recovers the harder pages. Full distribution and a one-command repro are on /benchmarks.
Decision Framework
Use this to pick the right tool for a given scraping job:
| Situation | Best tool |
|---|---|
| Static HTML, server-rendered content | curl / requests — simplest and fastest |
| Public JSON API | curl / fetch — call the API directly |
| HTML-primary but want clean markdown for LLMs | fastCRW — one call, noise removed, markdown out |
| Unknown content type, want browser fallback handled for you | fastCRW — auto-selects renderer |
| True SPA with complex client-side routing | Playwright |
| Login / form interaction required | Playwright |
| JavaScript-based anti-bot challenge | Playwright + stealth plugins |
| Screenshot or visual capture | Playwright |
| AI agent needing live web context via MCP | fastCRW — MCP built-in, zero config |
| High-volume HTML crawling on constrained infra | fastCRW or plain HTTP — no browser baseline |
Combining the Approaches
For production pipelines that handle a mix of URL types, the practical architecture is a router:
import requests
def scrape(url: str) -> str:
"""
Try fastCRW first (handles HTML-primary + JS fallback automatically).
Fall back to Playwright only for known SPA domains or after a failed scrape.
"""
response = requests.post(
"https://api.fastcrw.com/v1/scrape",
headers={"Authorization": "Bearer fc-YOUR_API_KEY"},
json={"url": url, "formats": ["markdown"]},
)
data = response.json()
# If the content looks empty, escalate to Playwright for this URL
if not data.get("data", {}).get("markdown", "").strip():
return scrape_with_playwright(url) # your Playwright fallback
return data["data"]["markdown"]
This gives you the performance and simplicity of HTTP-based scraping for the majority of URLs, with Playwright available as a targeted fallback — rather than paying the Playwright overhead universally.
Self-Hosting fastCRW
fastCRW is AGPL-3.0 open-source. Run it on your own server with one Docker command — no Redis, no multi-container setup, no Playwright install:
docker run -p 3000:3000 ghcr.io/us/crw:latest
Then your curl calls go to http://localhost:3000/v1/scrape instead. Same API as the cloud version. Source on GitHub · Compare to Firecrawl and Crawl4AI
