Port a TypeScript scraper to Python without rewriting the browser glue
The usual reason to port a TypeScript scraper to Python is consolidation: your scraper started life as a standalone Node service, and now it needs to live inside a Python ML or RAG codebase where the rest of the data pipeline already runs. Rewriting a Playwright or Puppeteer script line-for-line in Python works, but it carries every fragile part of the original forward (the waits, the selector drift, the headless-detection workarounds) into a second language, and you maintain both versions for as long as the transition lasts.
There are two honest paths. Path one is the manual port: rewrite the browser automation with Playwright's Python bindings, whose API closely mirrors the Node one. Path two is to stop maintaining browser glue at all and call the same HTTP scraping contract from both languages, so the "migration" becomes a base-URL change plus one JSON schema. This guide covers both, and is explicit about where each one breaks down.
Why port a TypeScript scraper to Python at all
The decision is rarely about the scraper itself. It's about where the scraped data goes next. If your retrieval, embedding, and evaluation code is Python, a Node scraper means a process boundary, a serialization step, and two CI pipelines for one logical job. Folding the scrape into the Python codebase removes that seam.
Before you commit to a rewrite, separate the two things a browser-automation scraper actually does:
- Navigation and rendering — loading the page, waiting for JavaScript, getting final HTML. This is the part that's painful to port and painful to maintain.
- Extraction — turning that HTML into the fields or Markdown you need. This is portable logic, not browser logic.
If your script only needs the rendered content (most read-only scrapers do), you can move the navigation/rendering burden behind an API and only port the extraction intent. If your script genuinely interacts — clicks, logins, multi-step forms — the manual port is the right call, and we'll say so plainly below.
Manual port: Playwright TypeScript to Playwright Python
Playwright is the one case where porting is genuinely low-friction, because the Python bindings mirror the Node API closely. The shapes line up almost one-to-one:
| TypeScript (Node) | Python |
|---|---|
| const browser = await chromium.launch() | browser = await playwright.chromium.launch() |
| await page.goto(url) | await page.goto(url) |
| await page.waitForSelector(sel) | await page.wait_for_selector(sel) |
| await page.$eval(sel, fn) | await page.eval_on_selector(sel, fn) |
| await page.content() | await page.content() |
The naming convention flips from camelCase to snake_case, and you choose between async_playwright and the sync API. The async bindings map most directly onto an existing async/await TypeScript script, so prefer them if your original used promises throughout.
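As a rough illustration of that mapping, a minimal async sketch might look like the following. It assumes the `playwright` package and a Chromium build are installed (`pip install playwright`, then `playwright install chromium`). The function name `fetch_rendered` and the default `h1` selector are placeholders for illustration, not part of any original script.

```python
async def fetch_rendered(url: str, selector: str = "h1") -> dict:
    """Load a page, wait for one element, return its text and the final HTML."""
    # Imported lazily so this module still imports where Playwright is absent.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()           # TS: chromium.launch()
        try:
            page = await browser.new_page()
            await page.goto(url)                      # TS: page.goto(url)
            await page.wait_for_selector(selector)    # TS: page.waitForSelector(sel)
            # TS: page.$eval(sel, fn). In Python the callback is a JavaScript string.
            text = await page.eval_on_selector(selector, "el => el.textContent")
            html = await page.content()               # TS: page.content()
        finally:
            await browser.close()
    return {"text": text, "html": html}
```

From a script, this would be driven with `asyncio.run(fetch_rendered("https://example.com"))`. Note that the one place the port is not purely mechanical is the evaluate callback: a TypeScript arrow function becomes a string of JavaScript on the Python side.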
Where the manual port still hurts
The API parity is real, but it does not save you from the parts that made the original fragile in the first place:
- Environment: you re-pin a browser binary, re-solve headless flags, and re-install system libraries in your Python image. The browser fleet does not get lighter by changing languages.
- Timeouts and flakiness: every waitForSelector and network-idle heuristic ports across as-is, including the ones that flake. A rewrite is a chance to fix them, but it is not a fix by itself.
- Selector drift: CSS/XPath selectors are the same brittle strings in either language; the site changes and both versions break together.
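If you do use the rewrite to address flakiness, one option is to wrap each brittle step in a hard timeout with a bounded number of retries, rather than porting bare waits. The helper below is a hypothetical sketch using only the standard library; the name `with_retries` and its defaults are assumptions, not something from the original scraper.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def with_retries(
    action: Callable[[], Awaitable[T]],
    attempts: int = 3,
    timeout_s: float = 10.0,
    backoff_s: float = 0.5,
) -> T:
    """Run an async step under a hard timeout, retrying with exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            # A fresh awaitable is created per attempt; wait_for cancels it on timeout.
            return await asyncio.wait_for(action(), timeout=timeout_s)
        except Exception as exc:  # in real code, narrow this to the errors you expect
            last_error = exc
            if attempt < attempts - 1:
                await asyncio.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError(f"step failed after {attempts} attempts") from last_error
```

A step such as `page.wait_for_selector(sel)` would be passed as `lambda: page.wait_for_selector(sel)`, so that each retry issues a new call. This bounds how long a flaky wait can stall the job, but it does not make a wrong selector right.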
If you are porting Puppeteer rather than Playwright, there is no first-party Python Puppeteer; you are effectively rewriting onto Playwright Python anyway. At that point the rewrite is already most of the work — which is exactly when path two starts to look better.
Skip the rewrite: one API, both languages
If your scraper is read-only — navigate, render, extract content — you can avoid porting browser code entirely by calling a scraping API that returns clean content directly. fastCRW exposes a POST /v1/scrape endpoint that takes a URL and returns Markdown by default (1 credit), handling the rendering decision server-side. The same HTTP contract is called identically from TypeScript and Python, so "porting" becomes pointing both languages at one endpoint.
Because fastCRW implements a Firecrawl-compatible REST API, this is a drop-in after a base-URL swap. If your TypeScript code already uses the Firecrawl SDK, you change the API base URL and keep the rest. The Node side stays exactly as it was; the Python side calls the same endpoint with the same request body. There is no second browser stack to stand up in either runtime.
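To make the shared contract concrete, here is a minimal Python sketch of the call using only the standard library. Several details are assumptions rather than documented facts: the base URL is a placeholder for your own deployment, the bearer-token header and the `{"data": {"markdown": ...}}` response envelope follow the Firecrawl convention the text describes as compatible, and the function names are illustrative. Check them against the API reference before relying on them.

```python
import json
import urllib.request

# Assumption: placeholder base URL; point this at your own fastCRW deployment.
BASE_URL = "http://localhost:3000"


def build_scrape_request(
    url: str, api_key: str = "", base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build the POST /v1/scrape request without sending it."""
    body = json.dumps({"url": url, "formats": ["markdown"]}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: Firecrawl-style bearer authentication.
        headers["Authorization"] = f"Bearer {api_key}"
    endpoint = f"{base_url.rstrip('/')}/v1/scrape"
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


def scrape_markdown(
    url: str, api_key: str = "", base_url: str = BASE_URL, timeout_s: float = 30.0
) -> str:
    """Send the request and return the page as Markdown."""
    request = build_scrape_request(url, api_key, base_url)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        payload = json.load(response)
    # Assumption: Firecrawl-style response envelope.
    return payload["data"]["markdown"]
```

The TypeScript side would send the same JSON body to the same path, which is the point: the request is data, so nothing browser-specific needs porting.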
The Python path has one extra convenience worth knowing: the crw Python SDK on PyPI ships a CrwClient() that runs a self-contained local engine. You do not need to deploy a separate server first to start scraping from Python — the SDK runs the engine itself. That removes the "stand up infrastructure before I can test" friction that usually shows up mid-migration.
For background on each language's entry point, see the Python scraping quickstart and the Node.js scraping quickstart. If you're moving off a heavier browser-automation stack rather than a hand-rolled script, the dedicated Puppeteer/Playwright-to-API migration guide walks the function-by-function mapping.
Choosing the renderer
A browser-automation script implies you needed JavaScript execution. fastCRW's renderers are auto (default), http, lightpanda, and chrome, with auto falling back chrome → lightpanda → http. For pages that needed a full browser in your original script, request the chrome renderer (2 credits instead of 1). Pages that were rendering JavaScript "just in case" often work on the lighter renderers — worth testing, because it halves the per-page cost.
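One way to run that test is to try each renderer from lightest to heaviest and keep the first one whose output contains a string you know should be on the page. The sketch below is hypothetical: `fetch` is a callable you supply that performs a scrape with a given renderer name and returns the content. How the renderer is named in the request body is not specified here, so that detail is left to your `fetch` implementation.

```python
from typing import Callable, Optional, Sequence, Tuple

# Renderer names from the text, ordered lightest first.
RENDERERS_LIGHT_TO_HEAVY: Sequence[str] = ("http", "lightpanda", "chrome")


def find_lightest_renderer(
    fetch: Callable[[str, str], str],
    url: str,
    must_contain: str,
) -> Optional[Tuple[str, str]]:
    """Return (renderer, content) for the lightest renderer that yields the marker."""
    for renderer in RENDERERS_LIGHT_TO_HEAVY:
        try:
            content = fetch(url, renderer)
        except Exception:
            # Treat a failed scrape like missing content and escalate.
            continue
        if must_contain in content:
            return renderer, content
    return None
```

Run it once per site template rather than per page, then pin the renderer it returns. A `None` result means even the chrome renderer did not produce the expected content, which suggests the page needs interaction or is blocked.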
Field extraction that survives the move
The most portable part of a scraper is its extraction intent — "I want the title, price, and SKU from this page." Rather than re-port CSS selectors into Python, you can define that intent once as a JSON schema and reuse it across both languages. Call /v1/scrape with formats: ["json"] and a jsonSchema, and the engine fills the schema from the page content. The same schema string is sent from TypeScript and from Python — it is data, not code, so it ports for free.
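As a sketch of what "define the intent once" can look like, the schema below describes the title, price, and SKU example as plain JSON Schema, and a small helper wraps it in a request body. The `formats` and `jsonSchema` field names come from the text above; the schema contents, the `PRODUCT_SCHEMA` name, and the helper are illustrative assumptions.

```python
import json

# Illustrative schema for the "title, price, and SKU" example.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "sku": {"type": "string"},
    },
    "required": ["title"],
}


def build_extract_payload(url: str, schema: dict = PRODUCT_SCHEMA) -> str:
    """Serialize a /v1/scrape body that asks for schema-based JSON extraction."""
    body = {"url": url, "formats": ["json"], "jsonSchema": schema}
    # sort_keys keeps the output stable, which helps when diffing across languages.
    return json.dumps(body, sort_keys=True)
```

Storing the schema in a shared `.json` file and loading it from both the TypeScript and the Python side keeps the two runtimes from drifting apart.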
Two honest specifics to plan around:
- Cost: a JSON-extraction request is 5 credits, versus 1 credit for a plain Markdown scrape. If you only need clean text, skip JSON and take Markdown.
- Providers: LLM-backed extraction supports OpenAI and Anthropic only. If your stack standardizes on a different extraction model, that's a constraint to weigh before you lean on schema extraction.
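The credit figures quoted in this guide are enough for a back-of-the-envelope budget. The sketch below assumes each page falls into exactly one category (plain Markdown at 1 credit, chrome-rendered at 2, JSON extraction at 5) and that the costs do not stack; how a chrome-rendered JSON extraction is billed is not stated here, so confirm it against current pricing.

```python
# Per-page credit costs as quoted in this guide.
CREDITS_PER_PAGE = {"markdown": 1, "chrome": 2, "json": 5}


def estimate_credits(pages_by_kind: dict) -> int:
    """Sum credits for a mapping such as {"markdown": 1000, "json": 50}."""
    unknown = set(pages_by_kind) - set(CREDITS_PER_PAGE)
    if unknown:
        raise ValueError(f"unknown page kinds: {sorted(unknown)}")
    return sum(CREDITS_PER_PAGE[kind] * count for kind, count in pages_by_kind.items())


# 1000 * 1 + 200 * 2 + 50 * 5 = 1650 credits
print(estimate_credits({"markdown": 1000, "chrome": 200, "json": 50}))
```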
For the full schema design pattern — required vs optional fields, nesting, handling missing data — see structured extraction with JSON schema. Defining the schema once and consuming it from either language is the single biggest reason the TS-to-Python move stops being a rewrite.
What you give up vs a hand-rolled browser script
An API is not a full browser, and pretending otherwise would set you up for a failed migration. State these limits plainly before you cut over:
- Stateless per request. fastCRW holds no session between calls. If your scraper logs in, clicks through a wizard, or carries state across pages, that interaction logic cannot move to a stateless scrape endpoint — it stays in a real browser.
- No screenshot output. A request for formats: ["screenshot"] returns HTTP 422. If your TypeScript script captured screenshots, that responsibility stays in Playwright/Puppeteer.
- No built-in anti-bot. There is no Fire-engine-style anti-bot layer. Heavily protected sites your stealth Playwright setup defeated may still need a real browser plus a proxy.
When the port should keep a real browser
Keep the browser automation — port it manually to Playwright Python or leave it in Node behind a thin interface — when the scraper does any of the following: authenticates and maintains a session, fills and submits forms, drives multi-step interactive flows, or fights aggressive bot protection. For everything that is fundamentally "load this page and give me the content," the API path removes the browser-glue maintenance from both languages at once. A pragmatic migration often does both: read-only scrapes move to /v1/scrape, the handful of genuinely interactive flows keep a browser.
The migration in three honest steps
- Triage your scrapers. Split them into read-only (navigate + extract) and interactive (click/login/form). Read-only is the candidate for the API path; interactive stays in a browser.
- Move read-only to the API. Swap the base URL to point the Firecrawl-compatible SDK at fastCRW, or use the crw Python SDK's local engine. Take Markdown for content, or define a jsonSchema for fields and pay the 5-credit JSON cost.
- Manually port what's left. For genuinely interactive scrapers, port TypeScript Playwright to Playwright Python using the near-1:1 API, accepting that environment and flakiness work carries over.
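The triage step can be as simple as tagging each scraper with what it does and bucketing on that. The sketch below is illustrative only: the `Scraper` type, the action labels, and the choice to treat screenshots as browser-only (because the API returns HTTP 422 for them) are assumptions layered on the criteria described in this guide.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, List

# Actions that a stateless scrape endpoint cannot take over.
BROWSER_ONLY_ACTIONS = frozenset({"click", "login", "form", "screenshot"})


@dataclass(frozen=True)
class Scraper:
    name: str
    actions: FrozenSet[str] = field(default_factory=frozenset)


def triage(scrapers: Iterable[Scraper]) -> Dict[str, List[str]]:
    """Split scrapers into API candidates and ones that keep a real browser."""
    plan: Dict[str, List[str]] = {"api": [], "browser": []}
    for scraper in scrapers:
        needs_browser = bool(scraper.actions & BROWSER_ONLY_ACTIONS)
        plan["browser" if needs_browser else "api"].append(scraper.name)
    return plan


inventory = [
    Scraper("blog-index", frozenset({"navigate", "extract"})),
    Scraper("checkout-flow", frozenset({"navigate", "login", "form"})),
]
print(triage(inventory))  # {'api': ['blog-index'], 'browser': ['checkout-flow']}
```

Sites behind aggressive bot protection do not show up as an action, so they would need a manual flag on top of this.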
You can self-host the engine for $0 under AGPL-3.0 during the transition, or use the managed cloud and pay per credit — see /pricing for the current tiers.
Sources
- fastCRW canonical fact sheet — endpoints, renderers, credits, honest gaps.
- fastCRW open-core repo and SDKs: github.com/us/crw · Python SDK crw (PyPI) · managed cloud fastcrw.com.
- Playwright Python vs Node API reference: playwright.dev/python (verified independently).
Related: Python scraping quickstart · Node.js scraping quickstart · Migrate Puppeteer/Playwright to an API · Structured extraction with JSON schema
