Python Web Scraping API — fastCRW [Firecrawl-Compatible]

Scrape, crawl, and search the web from Python with fastCRW — a Firecrawl-compatible REST API backed by a single Rust binary. Async httpx, asyncio.TaskGroup, and the crw Python SDK. AGPL-3.0, self-host free.

Published
June 13, 2026
Updated
June 13, 2026
Category
integrations
Verdict

Call fastCRW from Python with plain httpx or the crw SDK — the same Firecrawl-compatible REST surface, a static Rust binary under the hood, and clean Markdown out the other side. asyncio.TaskGroup + a Semaphore keeps batch jobs bounded and fast.

  • Firecrawl-compatible REST API — one base-URL swap from existing Firecrawl code
  • crw Python SDK (PyPI) — CrwClient() runs a self-contained local engine, no server needed
  • asyncio.TaskGroup + Semaphore for structured, bounded concurrency
  • 63.74% truth-recall on Firecrawl's public 1,000-URL dataset (diagnose_3way.py, 2026-05-08) — highest of three tools tested
  • Single ~8 MB Rust binary — no Redis, no containers beyond one Docker image

Verdict

fastCRW is a Firecrawl-compatible web scraping API — POST /v1/scrape, get back clean Markdown. For Python teams that means two paths: the crw PyPI SDK, which runs a self-contained local engine with no separate server, or an httpx client pointed at https://api.fastcrw.com (or your self-hosted instance). Either path gives you the same Firecrawl-shaped REST surface, a static Rust binary under the hood, and results you can feed directly into text splitters, embeddings, or a React/Vue front-end without an HTML-to-text pass.

Who This Is For

  • Python developers building scrapers — you want a clean Markdown API instead of parsing raw HTML.
  • RAG / AI pipeline engineers — you need live web content turned into embeddable text with high fidelity.
  • Teams migrating off Firecrawl — your existing scrape() / crawl() calls work unchanged with an api_url override.
  • Self-hosting shops — you want the whole ingestion path on your own infrastructure under AGPL-3.0 at $0 per 1,000 scrapes.

Setup

1. Install

pip install crw          # PyPI — includes the local engine
# or, with uv:
uv add crw

For the REST-only path (managed cloud or self-hosted Docker):

pip install httpx        # or: uv add httpx

2. Get an API key

Sign up at fastcrw.com, copy the API key from the dashboard, and export it:

export FASTCRW_API_KEY="fcrw_..."

Our pricing includes 500 one-time lifetime credits, enough to validate a pipeline. Plain scrape is 1 credit; crawl is 1 credit per page; search is 1 credit per query.
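To check whether a planned run fits inside the 500 credits, the per-operation prices quoted above can be turned into a small estimator. This is a minimal sketch: the function name and the sample workload are hypothetical, and only the three prices and the 500-credit allowance come from the text.

```python
# Credit prices quoted in the pricing note (credits per unit of work).
CREDITS_PER_SCRAPE = 1
CREDITS_PER_CRAWLED_PAGE = 1
CREDITS_PER_SEARCH = 1
FREE_CREDITS = 500


def estimate_credits(scrapes: int = 0, crawled_pages: int = 0, searches: int = 0) -> int:
    """Return the total credits a planned run would consume."""
    return (
        scrapes * CREDITS_PER_SCRAPE
        + crawled_pages * CREDITS_PER_CRAWLED_PAGE
        + searches * CREDITS_PER_SEARCH
    )


# Hypothetical workload: 120 single scrapes, one 250-page crawl, 40 searches.
planned = estimate_credits(scrapes=120, crawled_pages=250, searches=40)
print(f"planned: {planned} credits, remaining: {FREE_CREDITS - planned}")
```

With this sample workload the run consumes 410 credits and leaves 90.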

Quickstart: Scrape a Page

Using the crw SDK (local engine)

from crw import CrwClient

client = CrwClient()  # starts the Rust engine in-process

result = client.scrape("https://example.com", formats=["markdown"])
print(result["data"]["markdown"])

Using httpx against the managed cloud

import os
import httpx

API_KEY = os.environ["FASTCRW_API_KEY"]
BASE = "https://api.fastcrw.com"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def scrape(url: str) -> str:
    r = httpx.post(
        f"{BASE}/v1/scrape",
        headers=HEADERS,
        json={"url": url, "formats": ["markdown"], "onlyMainContent": True},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["data"]["markdown"]

markdown = scrape("https://docs.fastcrw.com")
print(markdown[:500])
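Because the managed cloud and a self-hosted instance expose the same REST surface, the base URL can be read from the environment instead of being hard-coded. The sketch below assumes a FASTCRW_API_URL variable for that purpose (the name is an assumption for this script) and only sends the Authorization header when a key is set; whether your own instance requires a key depends on how it is configured.

```python
import os


def build_config(env: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Return (base_url, headers) for the REST calls.

    FASTCRW_API_URL is an assumed variable name; when it is unset the
    managed cloud endpoint is used.
    """
    base = env.get("FASTCRW_API_URL", "https://api.fastcrw.com").rstrip("/")
    headers: dict[str, str] = {}
    api_key = env.get("FASTCRW_API_KEY")
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return base, headers


BASE, HEADERS = build_config(dict(os.environ))
print(BASE)
```

The returned BASE and HEADERS drop into the scrape() function above unchanged, so switching between cloud and self-hosted becomes an environment change rather than a code change.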

Crawl a Whole Site

/v1/crawl starts an async breadth-first crawl and returns a job ID. Poll until complete:

import os
import time
import httpx

API_KEY = os.environ["FASTCRW_API_KEY"]
BASE = "https://api.fastcrw.com"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}


def crawl(seed_url: str, limit: int = 50, max_depth: int = 3) -> list[dict]:
    """Crawl a site and return a list of page dicts with markdown and metadata."""
    # Start the async crawl job
    r = httpx.post(
        f"{BASE}/v1/crawl",
        headers=HEADERS,
        json={
            "url": seed_url,
            "limit": limit,          # cap: 1000
            "maxDepth": max_depth,   # cap: 10
            "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
        },
        timeout=30,
    )
    r.raise_for_status()
    job_id = r.json()["id"]

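    # Poll until the job finishes
    while True:
        poll = httpx.get(f"{BASE}/v1/crawl/{job_id}", headers=HEADERS, timeout=30)
        poll.raise_for_status()
        data = poll.json()
        if data["status"] == "completed":
            return data["data"]
        if data["status"] == "failed":  # assumed Firecrawl-style failure status
            raise RuntimeError(f"crawl {job_id} failed")
        time.sleep(2)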


pages = crawl("https://docs.fastcrw.com", limit=25)
for page in pages:
    url = page.get("metadata", {}).get("sourceURL")
    words = len((page.get("markdown") or "").split())
    print(f"{words:>6} words  {url}")
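A polling loop is easier to trust with an overall deadline, so a stuck job cannot block the caller forever. The helper below is a sketch rather than part of any SDK: wait_for_crawl and its parameters are hypothetical, and the "failed" and "scraping" status names are assumed to follow Firecrawl's conventions. The status fetch is passed in as a callable so the loop can be exercised without network access.

```python
import time
from typing import Callable


def wait_for_crawl(
    fetch_status: Callable[[], dict],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[dict]:
    """Poll fetch_status() until the job completes, fails, or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        status = job.get("status")
        if status == "completed":
            return job.get("data", [])
        if status == "failed":  # assumed status name
            raise RuntimeError(f"crawl failed: {job}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawl still {status!r} after {timeout_s}s")
        sleep(interval_s)


# Canned responses standing in for GET /v1/crawl/{job_id}.
responses = iter([
    {"status": "scraping"},
    {"status": "completed", "data": [{"markdown": "# Home"}]},
])
pages = wait_for_crawl(lambda: next(responses), sleep=lambda _: None)
print(len(pages))
```

In real use, fetch_status would wrap the httpx.get call from the crawl() function and return poll.json().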
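
Search the Web

/v1/search takes a query and an optional result limit, and returns a list of results, each with a title and a URL:

import os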
import httpx

API_KEY = os.environ["FASTCRW_API_KEY"]
BASE = "https://api.fastcrw.com"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}


def search(query: str, limit: int = 5) -> list[dict]:
    r = httpx.post(
        f"{BASE}/v1/search",
        headers=HEADERS,
        json={"query": query, "limit": limit},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["data"]


for result in search("python web scraping api 2026"):
    print(result["title"], "→", result["url"])
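Search results are often fed straight into a scrape step, where duplicate or missing URLs waste credits. A minimal sketch of that hand-off follows; unique_urls is a hypothetical helper, and the sample records only mirror the title and url keys used in the loop above.

```python
def unique_urls(results: list[dict]) -> list[str]:
    """Return result URLs in rank order, dropping duplicates and blanks."""
    seen: set[str] = set()
    urls: list[str] = []
    for item in results:
        url = (item.get("url") or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Sample data shaped like the search results printed above.
sample = [
    {"title": "Docs", "url": "https://docs.fastcrw.com"},
    {"title": "Docs (again)", "url": "https://docs.fastcrw.com"},
    {"title": "No link"},
]
print(unique_urls(sample))
```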

Async Batch Scraping with TaskGroup and Semaphore

asyncio.TaskGroup (stable since 3.11, the recommended pattern in Python 3.13) combined with an asyncio.Semaphore gives you structured, bounded concurrency. Without the semaphore, unbounded fan-out exhausts file descriptors and trips rate limits:

import asyncio
import os
import httpx

API_KEY = os.environ["FASTCRW_API_KEY"]
BASE = "https://api.fastcrw.com"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# p90 latency on the 3-way benchmark was 14157 ms — set a generous timeout
# so chrome-stealth recoveries are not killed prematurely.
REQUEST_TIMEOUT = 25.0
MAX_CONCURRENCY = 8  # tune from Little's Law: concurrency ≈ target requests/s × p90 latency in seconds


async def scrape_one(
    client: httpx.AsyncClient,
    sem: asyncio.Semaphore,
    url: str,
) -> dict:
    async with sem:
        r = await client.post(
            f"{BASE}/v1/scrape",
            headers=HEADERS,
            json={"url": url, "formats": ["markdown"], "onlyMainContent": True},
            timeout=REQUEST_TIMEOUT,
        )
        r.raise_for_status()
        data = r.json()["data"]
        markdown = data.get("markdown") or ""  # normalise a null field to ""
        return {
            "url": url,
            "chars": len(markdown),
            "markdown": markdown,
        }


async def batch_scrape(urls: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async with httpx.AsyncClient() as client:
        async with asyncio.TaskGroup() as tg:
            tasks = [tg.create_task(scrape_one(client, sem, url)) for url in urls]

    # TaskGroup awaits all tasks; exceptions are raised as ExceptionGroup
    return [t.result() for t in tasks]


urls = [
    "https://docs.fastcrw.com",
    "https://fastcrw.com/pricing",
    "https://fastcrw.com/integrations/langchain",
]

results = asyncio.run(batch_scrape(urls))
for r in results:
    print(f"{r['chars']:>8} chars  {r['url']}")

Latency note: On Firecrawl's public 1,000-URL scrape-content-dataset-v1 (diagnose_3way.py, 2026-05-08), fastCRW's p50 was 1914 ms and p90 was 14157 ms — the highest truth-recall of three (63.74% of 819 labeled URLs), but also the widest tail. Set REQUEST_TIMEOUT above the p90; the slow tail is the chrome-stealth fallback recovering pages the others miss. The full p50/p90/p99 breakdown is on /benchmarks/firecrawl-dataset.
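The concurrency setting can be derived from these latency figures with Little's Law: requests in flight equal arrival rate times time in system. The sketch below plugs in the p50 and p90 values quoted above; the 0.5 requests per second target is a hypothetical figure chosen for illustration.

```python
import math


def concurrency_for(target_rps: float, latency_s: float) -> int:
    """Little's Law: in-flight requests = arrival rate x time in system."""
    return math.ceil(target_rps * latency_s)


P50_S = 1.914    # p50 from the benchmark note, in seconds
P90_S = 14.157   # p90 from the benchmark note, in seconds

TARGET_RPS = 0.5  # hypothetical sustained request rate

print(concurrency_for(TARGET_RPS, P90_S))  # sized for the slow tail
print(concurrency_for(TARGET_RPS, P50_S))  # sized for the typical request
```

At 0.5 requests per second, sizing for the p90 gives 8 concurrent slots, while sizing for the p50 gives 1. Sizing for the tail is the safer default when the slow requests are the ones recovering hard pages.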

Structured JSON Extraction

Pass formats: ["json"] with a JSON Schema to extract typed records instead of prose:

import os
import httpx

API_KEY = os.environ["FASTCRW_API_KEY"]
BASE = "https://api.fastcrw.com"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

schema = {
    "type": "object",
    "properties": {
        "productName": {"type": "string"},
        "priceUsd": {"type": "number", "description": "Current price in USD"},
        "inStock": {"type": "boolean"},
    },
    "required": ["productName", "priceUsd"],
}

r = httpx.post(
    f"{BASE}/v1/scrape",
    headers=HEADERS,
    json={
        "url": "https://example.com/products/widget",
        "formats": ["json"],
        "jsonSchema": schema,
    },
    timeout=60,
)
r.raise_for_status()
product = r.json()["data"]["json"]
print(product)

Cost: formats: ["json"] is a 5-credit operation vs 1 credit for markdown. LLM extraction supports OpenAI and Anthropic providers only. There is no batch /v1/extract endpoint — iterate /v1/scrape concurrently or use /v1/crawl.
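Since each JSON extraction costs 5 credits and comes from an LLM, it is worth checking the returned record against the schema before storing it. The sketch below is a deliberately small check using only the standard library, not a full JSON Schema validator; check_record and the sample records are hypothetical.

```python
# Maps JSON Schema type names to the Python types json decoding produces.
JSON_TYPES = {"string": str, "number": (int, float), "boolean": bool,
              "object": dict, "array": list}

schema = {
    "type": "object",
    "properties": {
        "productName": {"type": "string"},
        "priceUsd": {"type": "number"},
        "inStock": {"type": "boolean"},
    },
    "required": ["productName", "priceUsd"],
}


def check_record(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for key in schema.get("required", []):
        if record.get(key) is None:
            problems.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        value = record.get(key)
        if value is None:
            continue
        expected = JSON_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it for "number".
        wrong_bool = spec.get("type") == "number" and isinstance(value, bool)
        if expected and (wrong_bool or not isinstance(value, expected)):
            problems.append(f"{key}: expected {spec['type']}, got {type(value).__name__}")
    return problems


print(check_record({"productName": "Widget", "priceUsd": 19.99}, schema))
print(check_record({"productName": "Widget", "priceUsd": "19.99"}, schema))
```

The first record passes with an empty list; the second reports that priceUsd arrived as a string. For production use, a dedicated validator library would cover the rest of JSON Schema.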

MCP Setup

fastCRW ships a Model Context Protocol server (crw-mcp on npm) for AI agents that need live web data. It exposes scrape, crawl, map, and search as MCP tools — no separate HTTP client code needed:

{
  "mcpServers": {
    "fastcrw": {
      "command": "npx",
      "args": ["-y", "crw-mcp@latest"],
      "env": {
        "FASTCRW_API_KEY": "fcrw_...",
        "FASTCRW_API_URL": "https://api.fastcrw.com"
      }
    }
  }
}

See /integrations/mcp for full configuration options.

Limits and Honest Gaps

  • No screenshot output: requesting formats: ["screenshot"] returns HTTP 422.
  • Stateless per request — no session is carried across calls; multi-step authenticated flows must be reconstructed in your Python code.
  • LLM extraction — supports OpenAI and Anthropic only.
  • No /v1/batch/scrape — iterate /v1/scrape concurrently or use /v1/crawl.
