Is curl good enough for web scraping?

curl (or any raw HTTP client like Python requests or fetch) is good enough for a large portion of the web: static HTML pages, server-rendered content, REST APIs that return JSON, documentation sites, news articles, and product pages rendered server-side. It fails when the page requires JavaScript execution to show meaningful content — SPAs, lazy-loaded feeds, or content behind a JavaScript challenge.

When do I actually need Playwright instead of curl?

Use Playwright when: (1) the page is a single-page application that renders content entirely via JavaScript; (2) you need to interact with the page — click, type, scroll, wait for elements; (3) the site uses a JavaScript-based anti-bot challenge like Cloudflare Turnstile; or (4) you need a screenshot of the rendered page. For everything else — articles, docs, product pages, APIs — a plain HTTP request is usually faster and uses far less memory.

What is the memory cost of running Playwright vs curl?

A plain curl or requests call uses no additional resident memory beyond the process making it. A Playwright instance runs a full Chromium browser process, which typically holds 150–400 MB of RAM per concurrent session. At high concurrency (50+ parallel scrapes), that difference compounds: curl-based scrapers scale on a single small server; Playwright-based scrapers need dedicated large-memory infrastructure.

What is fastCRW and why is it mentioned here?

fastCRW is a Rust-based web scraping API that gives you curl-simple API calls (one POST, clean markdown back) with the quality of a browser-grade scraper for HTML-primary pages. It uses lol-html (Cloudflare's streaming parser) for most pages and falls back to LightPanda for JavaScript rendering — without you managing a browser. On Firecrawl's 1,000-URL public benchmark (819 labeled), it reached 63.74% truth-recall — the highest of the three tools — with 91.8% scrape success of reachable URLs and 0 errors (diagnose_3way.py, 2026-05-08). It ships as a single ~8 MB binary with a built-in MCP server.

Can I mix curl-style requests and Playwright in the same pipeline?

Yes, and that is often the best architecture. Route most URLs through a fast HTTP-based scraper and use Playwright only for URLs that genuinely need JavaScript rendering. A simple classifier — domain allowlist, content-type check, or a failed-request retry — can gate which path each URL takes. This keeps infrastructure costs low for the common case while preserving browser capability for the exceptions.

curl vs Playwright for Web Scraping: When Raw HTTP Is Enough (2026)

The Core Question

Before you install Playwright and spin up Chromium, ask one question: does this page render its content server-side, or does it require JavaScript to show anything meaningful?

If the server sends back HTML with the content already in it (view the page source and the text is there), a plain HTTP request is enough. If the server sends back an empty <div id="root"></div> and all the content arrives via JavaScript after the browser loads and executes bundles, you need a JavaScript runtime — which means a headless browser.
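The "view source and see if the text is there" check can be roughed out in code. The sketch below is a heuristic, not a definitive test: it assumes that a server-rendered page carries a reasonable amount of visible text outside `<script>` and `<style>` tags. The names `visible_text` and `looks_server_rendered` and the 200-character threshold are illustrative choices, not part of any library.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that sits outside <script>, <style>, and <noscript>."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the human-readable text found in the raw HTML response."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_server_rendered(html: str, min_chars: int = 200) -> bool:
    """Heuristic: enough visible text suggests the content is already in the HTML."""
    return len(visible_text(html)) >= min_chars
```

A shell page that contains only a title and an empty root element falls below the threshold, while an article body passes it. Tune `min_chars` for your own targets; short pages that are legitimately server-rendered will otherwise be misclassified.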

The internet is still mostly server-rendered. Articles, documentation, product pages, news, blog posts, e-commerce listings — most of these are HTML-primary. Browser automation is a useful tool, but it's the heavyweight option. Reach for it when you need it, not by default.

When Raw HTTP (curl / requests / fetch) Is Enough

1. Server-rendered pages

Most content sites, documentation, marketing pages, and e-commerce product listings render their HTML on the server. The full content is in the HTML response. A curl call gets it all:

curl -s "https://docs.example.com/getting-started" \
  -H "User-Agent: Mozilla/5.0 (compatible; MyBot/1.0)"

Python equivalent:

import requests

response = requests.get(
    "https://docs.example.com/getting-started",
    headers={"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"},
)
html = response.text

2. REST and JSON APIs

If a site has a public API or its frontend fetches data from an API endpoint, calling the API directly is usually cleaner than scraping the rendered page. Open DevTools, watch the Network tab, find the JSON request, then curl that endpoint directly.

curl "https://api.example.com/v1/products?page=1" \
  -H "Accept: application/json"

3. RSS and sitemaps

Many sites expose RSS feeds and sitemaps as structured XML. These are trivially parseable with any HTTP client and never require a browser.
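As a minimal sketch of how little is needed here, the function below pulls every `<loc>` entry out of a sitemap using only the standard library. It assumes the document uses the standard sitemaps.org namespace; `sitemap_urls` is an illustrative name, and the XML text would come from any plain HTTP fetch.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> list:
    """Return all <loc> values from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc/> elements
    ]
```

Because a sitemap index also lists its child sitemaps in `<loc>` elements, the same function works for both levels; you would call it again on each child sitemap you fetch.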

4. High-volume crawling

At scale (hundreds of pages per minute), every millisecond and every megabyte matters. Launching a Chromium instance per page is expensive — even with browser context reuse, each session carries a browser's memory overhead. Plain HTTP at volume is orders of magnitude cheaper on infrastructure.
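To make the volume argument concrete, here is a sketch of a bounded-concurrency crawl over plain HTTP using only the standard library. The function names (`fetch_html`, `crawl`), the worker count, and the timeout are assumptions chosen for illustration; a production crawler would add rate limiting, retries, and robots.txt handling.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Plain HTTP GET; no browser process involved."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(urls, fetch: Callable[[str], str] = fetch_html, max_workers: int = 16) -> dict:
    """Fetch URLs concurrently; map each URL to its HTML or to the exception raised."""
    urls = list(urls)

    def safe_fetch(url: str):
        try:
            return fetch(url)
        except Exception as exc:  # keep one bad URL from aborting the batch
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

Each worker here is a thread waiting on a socket, which is why this pattern fits on a small server. The `fetch` parameter is injectable so the same loop can later call a different backend for URLs that need one.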

When You Actually Need Playwright

1. Single-page applications

SPAs built with React, Vue, Angular, or similar frameworks often return an almost empty HTML document and load all content via JavaScript. The raw HTML from a curl call contains no useful content. You need a browser to execute the JavaScript and get the rendered DOM:

import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://app.example.com/dashboard");
await page.waitForSelector(".data-table");
const content = await page.textContent(".data-table");
await browser.close();

2. Interaction-required flows

Login flows, form submissions, infinite scroll pages, click-to-reveal content — anything that requires simulating user actions needs a browser. curl cannot click a button or type into a form.

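await page.fill("#email", "user@example.com");
await page.fill("#password", "secret");
await page.click('button[type="submit"]');
// waitForNavigation() is deprecated and can miss a navigation that starts
// during the click, so wait for the post-login URL instead.
// "**/dashboard" is a placeholder pattern; use your site's real landing URL.
await page.waitForURL("**/dashboard");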

3. JavaScript-based anti-bot challenges

Some sites use JavaScript challenges (Cloudflare Turnstile, DataDome, PerimeterX) that require the client to solve a proof-of-work or render a CAPTCHA before serving content. These are designed to stop HTTP clients that lack a JavaScript runtime. A headless browser can attempt to pass them (sometimes with stealth plugins); curl cannot.
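A plain HTTP client can at least notice that it has hit such a challenge, which is a useful signal for escalating that URL to a browser. The sketch below is a heuristic under stated assumptions: it checks for the `cf-mitigated: challenge` response header, which Cloudflare documents as a marker for its challenge pages (worth verifying against the current docs), and otherwise looks for a few body phrases on blocked status codes. The phrase list and the function name `looks_like_js_challenge` are illustrative guesses, not an exhaustive or vendor-confirmed signature set.

```python
from typing import Mapping


def looks_like_js_challenge(status_code: int, headers: Mapping[str, str], body: str) -> bool:
    """Heuristic: does this HTTP response look like an anti-bot challenge page?"""
    lowered = {key.lower(): value.lower() for key, value in headers.items()}

    # Header-based signal for Cloudflare challenge pages.
    if lowered.get("cf-mitigated") == "challenge":
        return True

    # Fallback: typical "blocked" status codes plus challenge-like wording.
    if status_code in (403, 429, 503):
        markers = ("just a moment", "enable javascript", "captcha")  # illustrative
        snippet = body[:5000].lower()
        return any(marker in snippet for marker in markers)

    return False
```

A check like this keeps the browser path reserved for responses that actually need it, rather than for every 403.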

4. Screenshots and visual capture

If your workflow requires a visual snapshot of the rendered page, a headless browser is the only option. Playwright can capture full-page screenshots or specific element screenshots directly.

await page.screenshot({ path: "page.png", fullPage: true });

The Real Cost of Playwright at Scale

The browser automation mental model is: one Playwright instance = one browser process = one full Chromium session. In practice:

Approach	Memory per concurrent scrape	Latency shape	Infrastructure for 50 concurrent
curl / requests / fetch	Negligible (network buffer only)	Network round-trip only	Any small server
Playwright (browser reuse)	One Chromium process shared across contexts	Network + render cycle per page	Need enough RAM for Chromium + contexts
Playwright (one browser per URL)	Full Chromium per scrape	Browser cold start + render	Large-memory server or cluster

Browser context reuse (one Chromium process, many browser contexts) reduces the memory penalty significantly — this is how Playwright handles concurrency well. But even with context reuse, the idle Chromium process holds memory, and each page's render cycle adds latency that a pure HTTP request avoids entirely.
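For a sense of scale, the arithmetic below applies the 150 to 400 MB per-session figure quoted in the FAQ at the top to 50 concurrent scrapes. It treats that figure as a rough assumption for the worst case of one browser per scrape; with context reuse the real number should be lower. The helper name `chromium_memory_gb` is illustrative.

```python
def chromium_memory_gb(concurrent_sessions: int, mb_per_session: float) -> float:
    """Estimated browser memory in GB (using 1 GB = 1000 MB) for a given concurrency."""
    return concurrent_sessions * mb_per_session / 1000


low = chromium_memory_gb(50, 150)   # lower bound of the quoted range
high = chromium_memory_gb(50, 400)  # upper bound of the quoted range
print(f"50 concurrent sessions: roughly {low} to {high} GB of RAM")
```

That range, 7.5 to 20 GB, is the budget for browser processes alone, before the scraper's own memory or the operating system is counted.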

A Third Path: curl-Simple Calls, Browser-Grade Results

There is a middle ground between "raw curl" and "full Playwright" that is worth knowing about, especially for AI-pipeline workloads where you want clean markdown output without managing a browser yourself.

fastCRW is a Rust-based scraping API that accepts a single POST (as simple as a curl call) and returns clean markdown, HTML, links, or structured JSON. Under the hood, it uses lol-html — Cloudflare's streaming HTML parser — for HTML-primary pages, and falls back to LightPanda (a lightweight headless browser) for pages that need JavaScript rendering. You make one API call; the engine decides which renderer to use.

Single curl call, browser-grade output

curl -X POST https://api.fastcrw.com/v1/scrape \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com/getting-started", "formats": ["markdown"]}'

Response:

{
  "success": true,
  "data": {
    "markdown": "# Getting Started\n\nThis guide walks you through...",
    "metadata": { "title": "Getting Started", "statusCode": 200 }
  }
}

Python equivalent (using the Firecrawl SDK — fastCRW is API-compatible):

from firecrawl import FirecrawlApp

# Point the SDK at fastCRW (cloud or self-hosted)
app = FirecrawlApp(
    api_key="fc-YOUR_API_KEY",
    api_url="https://api.fastcrw.com",  # or http://localhost:3000 self-hosted
)

result = app.scrape_url(
    "https://docs.example.com/getting-started",
    formats=["markdown"],
)
print(result.markdown)

The key difference from raw curl: fastCRW handles noise removal and markdown conversion, and falls back to browser rendering when needed, all without you managing Playwright, Chromium, or browser contexts. The key difference from Playwright: you're not managing a browser process; the rendering decision is server-side and transparent to your code.

On benchmark data

On Firecrawl's own 1,000-URL public benchmark dataset (819 labeled), fastCRW reached 63.74% truth-recall — the highest of the three tools — with 91.8% scrape success of reachable URLs and 0 errors (diagnose_3way.py, 2026-05-08). p50 latency was 1,914 ms. In fast mode, p90 is 4,348 ms — the lowest of the three. Full distribution and a one-command repro are on /benchmarks.

Decision Framework

Use this to pick the right tool for a given scraping job:

Situation	Best tool
Static HTML, server-rendered content	curl / requests — simplest and fastest
Public JSON API	curl / fetch — call the API directly
HTML-primary but want clean markdown for LLMs	fastCRW — one call, noise removed, markdown out
Unknown content type, want browser fallback handled for you	fastCRW — auto-selects renderer
True SPA with complex client-side routing	Playwright
Login / form interaction required	Playwright
JavaScript-based anti-bot challenge	Playwright + stealth plugins
Screenshot or visual capture	Playwright
AI agent needing live web context via MCP	fastCRW — MCP built-in, zero config
High-volume HTML crawling on constrained infra	fastCRW or plain HTTP — no browser baseline

Combining the Approaches

For production pipelines that handle a mix of URL types, the practical architecture is a router:

import requests

def scrape(url: str) -> str:
    """
    Try fastCRW first (handles HTML-primary + JS fallback automatically).
    Fall back to Playwright when the scrape fails or returns empty content.
    """
    try:
        response = requests.post(
            "https://api.fastcrw.com/v1/scrape",
            headers={"Authorization": "Bearer fc-YOUR_API_KEY"},
            json={"url": url, "formats": ["markdown"]},
            timeout=30,  # don't let one slow URL stall the pipeline
        )
        response.raise_for_status()
        data = response.json()
    except (requests.RequestException, ValueError):
        data = {}  # treat network, HTTP, and JSON errors as a failed scrape

    markdown = (data.get("data") or {}).get("markdown") or ""

    # If the scrape failed or the content looks empty, escalate to Playwright
    if not markdown.strip():
        return scrape_with_playwright(url)  # your Playwright fallback

    return markdown

This gives you the performance and simplicity of HTTP-based scraping for the majority of URLs, with Playwright available as a targeted fallback — rather than paying the Playwright overhead universally.
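The router above escalates only after an empty result. If you already know which hosts are SPAs, a domain allowlist can send those URLs straight to the browser path and skip the wasted first request. This is a minimal sketch: `SPA_DOMAINS`, `needs_browser`, and `choose_path` are hypothetical names, and the allowlist contents are placeholders you would maintain yourself.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of hosts known to need JavaScript rendering.
SPA_DOMAINS = {"app.example.com"}


def needs_browser(url: str, spa_domains=SPA_DOMAINS) -> bool:
    """True when the URL's host is an allowlisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in spa_domains)


def choose_path(url: str, spa_domains=SPA_DOMAINS) -> str:
    """Return which scraping path to use for this URL: 'playwright' or 'http'."""
    return "playwright" if needs_browser(url, spa_domains) else "http"
```

Matching on the full host or a dot-prefixed suffix avoids false positives from lookalike hosts that merely contain the allowlisted string.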

Self-Hosting fastCRW

fastCRW is AGPL-3.0 open-source. Run it on your own server with one Docker command — no Redis, no multi-container setup, no Playwright install:

docker run -p 3000:3000 ghcr.io/us/crw:latest

Then your curl calls go to http://localhost:3000/v1/scrape instead. Same API as the cloud version.