Short Answer
If you need full browser interaction — clicking buttons, filling forms, navigating SPAs — Playwright is the most capable option. If you're already in the Google ecosystem and want a lighter browser tool — Puppeteer works. If you need fast, structured data extraction from web pages for AI pipelines without the overhead of running browsers — CRW is the better fit.
- Better fit for complex browser interactions: Playwright (multi-browser, auto-wait, codegen)
- Better fit for simple Chromium-only automation: Puppeteer (lighter API, Chrome-native)
- Better fit for AI/RAG data pipelines: CRW (API-first, markdown output, MCP built-in)
- Better fit for self-hosted scraping on constrained infra: CRW (6.6 MB idle RAM, 8 MB Docker image)
| | Playwright | Puppeteer | CRW |
|---|---|---|---|
| Approach | Browser automation | Browser automation | API-first scraping |
| Language support | JS, Python, Java, C# | JS only | Any (REST API) |
| Browser engines | Chromium, Firefox, WebKit | Chromium only | No browser needed* |
| Idle RAM | ~200–400 MB | ~150–300 MB | 6.6 MB |
| Avg latency per page | 2–5 seconds | 2–5 seconds | 833 ms |
| Docker image size | ~1.5 GB | ~1 GB | ~8 MB |
| Markdown output | Manual (you parse) | Manual (you parse) | ✅ Built-in |
| MCP server | Community packages | No | ✅ Built-in |
| Structured extraction | Manual (you code it) | Manual (you code it) | ✅ JSON schema via API |
| JavaScript rendering | ✅ Full browser | ✅ Full browser | Via LightPanda |
| Anti-bot bypass | Stealth plugins | Stealth plugins | Partial |
| License | Apache 2.0 | Apache 2.0 | AGPL-3.0 |
* CRW uses lol-html (Cloudflare's streaming parser) for most pages. LightPanda is used for JS-heavy pages when needed.
What Is Playwright?
Playwright is Microsoft's browser automation framework. It controls Chromium, Firefox, and WebKit through a single API, supports multiple languages (JavaScript, Python, Java, C#), and includes features like auto-waiting, network interception, and test code generation. Originally designed for end-to-end testing, it's widely adopted for web scraping because it can handle any page a real browser can render.
For scraping, Playwright's strength is universality: if a human can see it in a browser, Playwright can extract it. The tradeoff is resource cost — every Playwright instance runs a full browser process with hundreds of megabytes of RAM, and each page load takes seconds rather than milliseconds.
What Is Puppeteer?
Puppeteer is Google's Node.js library for controlling Chromium (and experimentally Firefox) via the Chrome DevTools Protocol. It predates Playwright — in fact, Playwright's original authors came from the Puppeteer team at Google before moving to Microsoft.
Puppeteer is simpler than Playwright: it targets Chromium only, has a JavaScript-only API, and lacks some of Playwright's advanced features like multi-browser support and built-in auto-waiting. For Chromium-specific scraping tasks, it's a lighter alternative. But the core tradeoff is the same: you're running a full browser, which means high memory usage and slow page loads.
What Is CRW?
CRW is an open-source web scraping API written in Rust. Instead of running a browser, it uses lol-html — Cloudflare's streaming HTML rewriter — to parse pages directly at the HTTP level. This means no Chromium, no browser process, no GPU memory. The result is dramatically lower resource usage (6.6 MB idle RAM) and faster response times (833 ms average across 500 URLs in benchmarks).
CRW exposes a Firecrawl-compatible REST API, so it works with existing Firecrawl SDKs and integrations. It outputs clean markdown, supports structured JSON extraction via LLM schemas, and includes a built-in MCP server for AI agent integration. For pages that genuinely require JavaScript execution, CRW falls back to LightPanda — a lightweight headless browser that avoids Chromium's overhead.
The Core Architecture Difference
The fundamental distinction isn't between Playwright and Puppeteer — they're both browser automation tools with similar tradeoffs. The real split is between browser-based scraping and API-first scraping.
Browser-based (Playwright/Puppeteer)
- Launches a real browser process for every scraping session
- Executes all JavaScript, renders CSS, loads images
- Can interact with the page: click, type, scroll, wait for elements
- Consumes 150–400 MB RAM per browser instance
- Each page load takes 2–5 seconds including render time
- You write code to extract data from the rendered DOM
API-first (CRW)
- No browser process — parses HTML at the HTTP response level
- Streaming parser processes HTML as it arrives (no full DOM construction)
- Returns clean markdown, HTML, links, or structured JSON via REST API
- 6.6 MB idle RAM, handles many concurrent requests without memory pressure
- 833 ms average response time (5.5× faster than browser-based approaches)
- Data extraction is declarative — pass a JSON schema, get structured output
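To make the "streaming parser, no full DOM construction" point concrete, here is a toy sketch of the idea in JavaScript: text is extracted chunk by chunk as bytes arrive, with parser state carried across chunk boundaries. This illustrates the principle behind lol-html, not CRW's actual implementation (which does this in Rust with a real HTML tokenizer).

```javascript
// Toy streaming text extractor: processes HTML chunk by chunk without
// ever building a DOM. State persists across chunk boundaries, so a tag
// split between two chunks is still handled correctly.
function createTagStripper() {
  let insideTag = false; // carried across calls
  return function stripChunk(chunk) {
    let out = "";
    for (const ch of chunk) {
      if (ch === "<") insideTag = true;
      else if (ch === ">") insideTag = false;
      else if (!insideTag) out += ch;
    }
    return out;
  };
}

// The <div> tag is split across chunks two and three
const strip = createTagStripper();
const text =
  strip("<h1>Hello</h1><p>Str") + strip("eaming</p> <di") + strip("v>world</div>");
console.log(text); // "HelloStreaming world"
```

A real streaming parser also handles attributes, comments, script bodies, and entity decoding, but the memory profile is the same: work proportional to one chunk, not one page.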
When Browser Automation Wins
Browser automation is the right choice when you genuinely need what a browser provides. Here are the concrete scenarios:
1. Single-page applications with complex client-side routing
If the page you're scraping is a React/Vue/Angular SPA where content loads entirely via JavaScript and the HTML response is just an empty `<div id="root">`, a browser is the most reliable way to get the rendered content. CRW handles many SPAs via LightPanda, but for very complex routing and state management, Playwright is more mature.
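One cheap way to decide whether a page needs browser rendering at all is to inspect the raw HTML response for meaningful text. A rough heuristic sketch — `looksClientRendered` is a hypothetical helper, not part of any of these tools:

```javascript
// Rough heuristic: a page is probably client-rendered if its <body>
// contains almost no text once scripts and tags are stripped.
// Hypothetical helper for routing decisions, not a tool API.
function looksClientRendered(html) {
  const body = (html.match(/<body[^>]*>([\s\S]*)<\/body>/i)?.[1] ?? html)
    .replace(/<script[\s\S]*?<\/script>/gi, "") // drop inline scripts
    .replace(/<[^>]+>/g, "")                    // drop remaining tags
    .trim();
  return body.length < 50; // almost no server-rendered text → likely an SPA
}

console.log(
  looksClientRendered('<body><div id="root"></div><script src="app.js"></script></body>')
); // true
console.log(
  looksClientRendered(
    "<body><article>This article was rendered on the server and contains plenty of readable text already.</article></body>"
  )
); // false
```

The 50-character threshold is arbitrary; tune it against your own URL set before relying on it.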
2. Authenticated flows requiring login interaction
If you need to log in — typing a username, clicking a button, handling MFA redirects — Playwright gives you programmatic control over the full interaction. CRW doesn't simulate user interactions; it scrapes content available at a URL.
3. Pages behind anti-bot systems
Some sites use advanced bot detection (Cloudflare Turnstile, DataDome, PerimeterX) that requires a real browser fingerprint to pass. Playwright with stealth plugins can sometimes bypass these. CRW's anti-bot handling is functional but less sophisticated for heavily protected sites.
4. Visual testing or screenshot capture
If your workflow requires taking screenshots of rendered pages, browser automation is the only option. CRW does not currently support screenshot capture.
When API-First Scraping Wins
For the majority of web scraping use cases — especially in AI and data pipeline contexts — browser automation is overkill. Here's where CRW's approach is a better fit:
1. AI agent pipelines and RAG
AI agents need web content as clean text, not as a rendered DOM. CRW outputs markdown directly, which is what LLMs consume. With Playwright, you'd scrape the page, then write custom logic to extract text, strip navigation, remove ads, and convert to a format the LLM can use. CRW handles all of that automatically.
```bash
# With CRW — one API call, clean markdown output
curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com/api-reference", "formats": ["markdown"]}'
```
2. High-volume scraping
If you're scraping hundreds or thousands of pages, browser automation becomes a resource bottleneck. Each Playwright instance uses 200–400 MB of RAM. At 50 concurrent scrapes, that's 10–20 GB of RAM just for browser processes. CRW handles the same load with ~120 MB total.
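If you do run browsers at volume, the standard mitigation is to cap concurrency so RAM stays bounded. A minimal limiter sketch — the numeric tasks below are stand-ins for whatever scrape function you actually call:

```javascript
// Minimal concurrency limiter: run at most `limit` tasks at once.
// `tasks` is an array of () => Promise functions; results are returned
// in input order regardless of completion order.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++; // claim the next task index (single-threaded, no race)
      results[i] = await tasks[i]();
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
  return results;
}

// Stand-in tasks; in practice each would launch one scrape
const tasks = Array.from({ length: 10 }, (_, i) => async () => i * 2);
const out = await runWithLimit(tasks, 3);
console.log(out.join(",")); // 0,2,4,6,8,10,12,14,16,18
```

With Playwright you would pick `limit` from available RAM (roughly RAM ÷ 300 MB); with an API-first scraper the limit is set by politeness to target sites, not by memory.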
3. Scraping on constrained infrastructure
Running Playwright on a $5 VPS is painful — the browser alone may consume all available RAM. CRW's 8 MB Docker image and 6.6 MB idle footprint mean it runs comfortably on the smallest VPS tiers. See our post on running CRW on a $5 VPS for a walkthrough.
4. Content sites, docs, articles, and product pages
The vast majority of content on the web — news articles, documentation, blog posts, product listings — is server-rendered HTML. These pages don't need JavaScript execution to extract content. Using a browser for these pages is like driving a truck to the corner store. CRW parses them in milliseconds.
5. Structured data extraction
CRW supports JSON schema-based extraction via its API. Pass a schema describing the data you want, and CRW returns structured JSON. With Playwright, you'd write custom selectors and parsing logic for every page structure.
```javascript
// Structured extraction with CRW
const result = await app.scrapeUrl("https://example.com/product", {
  formats: ["extract"],
  extract: {
    schema: {
      type: "object",
      properties: {
        name: { type: "string" },
        price: { type: "number" },
        in_stock: { type: "boolean" },
      },
      required: ["name", "price"],
    },
  },
});

console.log(result.extract?.name); // "Widget Pro"
console.log(result.extract?.price); // 29.99
```
Playwright vs Puppeteer: Head-to-Head
If you've decided that browser automation is the right approach for your use case, here's how Playwright and Puppeteer compare directly:
| | Playwright | Puppeteer |
|---|---|---|
| Multi-browser | ✅ Chromium, Firefox, WebKit | Chromium only (Firefox experimental) |
| Auto-waiting | ✅ Built-in | Manual (waitForSelector) |
| Parallel contexts | ✅ Browser contexts (lightweight) | Incognito contexts |
| Code generation | ✅ codegen CLI tool | No |
| Network interception | ✅ Route API | ✅ Page.setRequestInterception |
| Language support | JS, Python, Java, C# | JS only |
| Stealth/anti-detection | playwright-extra + stealth | puppeteer-extra + stealth |
| Maintenance | Active (Microsoft) | Active (Google) |
Recommendation: If you need browser automation, Playwright is the better choice for new projects. It has broader browser support, better auto-waiting, a more modern API, and stronger multi-language support. Puppeteer is fine if you're already using it and don't need multi-browser testing, but there's little reason to choose it for new work.
Code Comparison: Scraping a Product Page
Let's compare what it takes to extract product data from the same page using all three tools.
Playwright (Node.js)
```javascript
import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com/product");

const product = {
  name: await page.textContent("h1.product-title"),
  price: parseFloat(
    (await page.textContent(".price"))?.replace("$", "") ?? "0"
  ),
  description: await page.textContent(".product-description"),
  inStock: (await page.textContent(".stock-status"))?.includes("In Stock"),
};

await browser.close();
console.log(product);
// ~3 seconds, ~300 MB RAM for the browser process
```
Puppeteer
```javascript
import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com/product", { waitUntil: "networkidle2" });

const product = await page.evaluate(() => ({
  name: document.querySelector("h1.product-title")?.textContent,
  price: parseFloat(
    document.querySelector(".price")?.textContent?.replace("$", "") ?? "0"
  ),
  description: document.querySelector(".product-description")?.textContent,
  inStock: document
    .querySelector(".stock-status")
    ?.textContent?.includes("In Stock"),
}));

await browser.close();
console.log(product);
// ~3 seconds, ~250 MB RAM for the browser process
```
CRW
```javascript
import FirecrawlApp from "@mendable/firecrawl-js";

const app = new FirecrawlApp({
  apiKey: "your-key",
  apiUrl: "http://localhost:3000", // self-hosted CRW
});

const result = await app.scrapeUrl("https://example.com/product", {
  formats: ["extract"],
  extract: {
    schema: {
      type: "object",
      properties: {
        name: { type: "string" },
        price: { type: "number" },
        description: { type: "string" },
        inStock: { type: "boolean" },
      },
      required: ["name", "price"],
    },
  },
});

console.log(result.extract);
// ~833 ms, CRW server uses ~6.6 MB idle RAM
```
The CRW approach is declarative — you describe what you want, not how to find it. This is a significant advantage for AI pipelines where the extraction logic shouldn't be tightly coupled to CSS selectors that break when the page redesigns.
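Because the schema travels with the request, downstream code can also validate responses against it instead of trusting selectors. A tiny checker for the schema subset used above — in a real pipeline you would use a library such as Ajv; this is just a sketch:

```javascript
// Tiny validator for the JSON Schema subset used in the examples:
// object type, typed properties, and a required list. Sketch only;
// use Ajv or similar for real validation.
function matchesSchema(value, schema) {
  if (schema.type === "object") {
    if (typeof value !== "object" || value === null) return false;
    for (const key of schema.required ?? []) {
      if (!(key in value)) return false; // required field missing
    }
    return Object.entries(schema.properties ?? {}).every(
      ([key, sub]) => !(key in value) || matchesSchema(value[key], sub)
    );
  }
  return typeof value === schema.type; // "string" | "number" | "boolean"
}

const schema = {
  type: "object",
  properties: { name: { type: "string" }, price: { type: "number" } },
  required: ["name", "price"],
};

console.log(matchesSchema({ name: "Widget Pro", price: 29.99 }, schema)); // true
console.log(matchesSchema({ name: "Widget Pro" }, schema)); // false (price missing)
```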
MCP Integration: AI Agents and Web Context
For teams building AI agents that need web access, the MCP (Model Context Protocol) integration is a key differentiator. CRW ships with a built-in MCP server — configure it in Claude Desktop or Cursor and your agent can scrape web pages directly:
```json
{
  "mcpServers": {
    "crw": {
      "command": "docker",
      "args": ["run", "--rm", "-i", "ghcr.io/us/crw:latest", "mcp"]
    }
  }
}
```
Playwright and Puppeteer don't have native MCP support. You'd need to wrap them in a custom MCP server, handle browser lifecycle management, and deal with the memory overhead of running browsers alongside your AI agent. CRW's built-in MCP makes web access a zero-configuration addition to any AI workflow.
Performance Under Load
The performance gap between browser-based and API-first scraping widens dramatically under concurrent load:
| Concurrent requests | Playwright RAM | CRW RAM | Playwright time (50 URLs) | CRW time (50 URLs) |
|---|---|---|---|---|
| 1 | ~300 MB | ~10 MB | ~150 seconds | ~42 seconds |
| 10 | ~3 GB | ~30 MB | ~15 seconds | ~5 seconds |
| 50 | ~15 GB | ~120 MB | ~10 seconds | ~2 seconds |
At 50 concurrent requests, Playwright needs a machine with at least 16 GB of RAM. CRW handles the same load on a $5 VPS. This isn't theoretical — it's the practical difference between running scraping as a sidecar service and needing dedicated scraping infrastructure. See our benchmark post for full methodology.
When to Use Each: Decision Framework
Here's a practical decision tree:
Use Playwright when:
- You need to interact with the page (click, type, scroll, navigate)
- The page is a complex SPA with client-side routing that CRW can't handle
- You need to bypass advanced anti-bot systems that require a real browser fingerprint
- You need screenshots or visual testing alongside scraping
- You're already using Playwright for E2E testing and want to reuse that infrastructure
Use Puppeteer when:
- You need browser automation but only target Chromium
- You have an existing Puppeteer codebase and migration to Playwright isn't worth the effort
- You want a simpler API for straightforward Chromium-only tasks
Use CRW when:
- You're building AI agent pipelines that need web content as markdown
- You're building RAG systems that ingest web pages
- You need to scrape at scale without massive infrastructure
- You want structured data extraction via JSON schema rather than CSS selectors
- You're running on constrained infrastructure (small VPS, edge deployments)
- You want an MCP-compatible scraping tool for AI agents
- The pages you're scraping are content-heavy (articles, docs, product pages) rather than interaction-heavy
The Hybrid Approach
In practice, many teams use both approaches. CRW handles the 90% of pages that are content-oriented (articles, docs, listings), while Playwright handles the 10% that genuinely require browser interaction (login flows, complex SPAs, anti-bot bypasses).
Because CRW exposes a Firecrawl-compatible REST API, it's easy to build a routing layer that sends requests to CRW by default and falls back to a Playwright-based scraper for specific domains or patterns. This gives you the performance and efficiency of API-first scraping for the common case, with browser automation available when you need it.
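The routing layer itself can be very small. A sketch under stated assumptions — the domain list and both scraper functions here are hypothetical placeholders; in practice `crw` would call CRW's REST API and `browser` a Playwright worker:

```javascript
// Routing sketch: CRW by default, browser fallback for listed domains.
const BROWSER_ONLY_DOMAINS = new Set(["app.example.com", "login.example.com"]);

// Placeholder scrapers for illustration only
const scrapers = {
  crw: async (url) => `crw:${url}`,
  browser: async (url) => `browser:${url}`,
};

async function scrape(url, scrapers) {
  const { hostname } = new URL(url);
  const scraper = BROWSER_ONLY_DOMAINS.has(hostname)
    ? scrapers.browser
    : scrapers.crw;
  return scraper(url);
}

console.log(await scrape("https://docs.example.com/guide", scrapers));
// crw:https://docs.example.com/guide
console.log(await scrape("https://app.example.com/dashboard", scrapers));
// browser:https://app.example.com/dashboard
```

A production version would typically add a second fallback: if CRW returns a near-empty result, retry the same URL through the browser path and remember the domain.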
Try CRW
Open-Source Path — Self-Host for Free
CRW is AGPL-3.0 licensed. Run it on your own infrastructure at zero cost:
```bash
docker run -p 3000:3000 ghcr.io/us/crw:latest
```
View the source on GitHub · Read the docs
Hosted Path — Use fastCRW
Don't want to manage servers? fastCRW is the managed cloud version — same Firecrawl-compatible API, same low-latency engine, with infrastructure and scaling handled for you. Start with 500 free credits, no credit card required.