Skip to main content
Tutorial

CRW Python Quickstart (2026): Scrape, Crawl, Map, Search in 15 Minutes

A from-zero Python quickstart for CRW: install, scrape a page to markdown, crawl a site, map URLs, search the web, and extract JSON. Async batch included. Self-host free under AGPL-3.0.

fastCRW logofastcrw
May 27, 202613 min read

What You'll Have in 15 Minutes

By the end of this quickstart you'll scrape a page to markdown, crawl a whole site, map its URLs, run a web search, extract structured JSON, and batch many URLs concurrently — all from Python against CRW. CRW is Firecrawl-API compatible, so you use the official SDK and just point it at your instance.

Step 1: Run CRW

docker run -p 3000:3000 ghcr.io/us/crw:latest

Verify it is up:

curl -s http://localhost:3000/health

Step 2: Install the SDK

pip install firecrawl-py

Step 3: Connect

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR-KEY", api_url="http://localhost:3000")
# fastCRW managed cloud instead:
# app = FirecrawlApp(api_key="fc-YOUR-KEY", api_url="https://api.fastcrw.com")

Step 4: Scrape a Page to Markdown

doc = app.scrape_url(
    "https://example.com",
    params={"formats": ["markdown"], "onlyMainContent": True},
)
print(doc["markdown"])
print("---")
print(doc["metadata"]["title"])

onlyMainContent strips nav, footers, and cookie banners — exactly what you want for LLM input or storage.

Step 5: Crawl a Whole Site

job = app.crawl_url(
    "https://example.com",
    params={
        "limit": 25,
        "maxDepth": 2,
        "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
    },
)
for page in job["data"]:
    url = page["metadata"]["sourceURL"]
    words = len(page.get("markdown", "").split())
    print(f"{words:6d} words  {url}")

CRW handles link discovery, deduplication, and depth limits server-side — one call, no orchestration code.

Step 6: Map a Site's URLs (Fast, No Content)

When you only need the URL inventory — for a sitemap audit or to pick what to scrape — map is far cheaper than crawl:

urls = app.map_url("https://example.com")
print(f"Discovered {len(urls)} URLs")
for u in list(urls)[:10]:
    print(" ", u)

Step 7: Search the Web

results = app.search(
    "best open source web scraper 2026",
    params={"limit": 5},
)
for r in results["data"]:
    print(r["title"], "->", r["url"])

Add scrapeOptions to get full page content with each result in the same call — ideal for RAG and answer engines.

Step 8: Extract Structured JSON

schema = {
    "type": "object",
    "properties": {
        "headline": {"type": "string"},
        "author": {"type": "string"},
        "published_date": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["headline"],
}

data = app.extract(
    urls=["https://example.com/blog/some-post"],
    params={"prompt": "Extract the article metadata.", "schema": schema},
)
print(data["data"])

No CSS selectors — the schema describes the data and CRW's LLM extraction reads the page semantically.

Step 9: Batch Many URLs Concurrently

For real workloads you scrape lists of URLs. A bounded thread pool keeps throughput high without overwhelming CRW or the targets:

from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_one(url: str) -> tuple[str, str]:
    try:
        d = app.scrape_url(url, params={"formats": ["markdown"],
                                        "onlyMainContent": True})
        return url, d.get("markdown", "")
    except Exception as e:
        return url, f"ERROR: {e}"


def batch(urls: list[str], workers: int = 8) -> dict[str, str]:
    out: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(scrape_one, u): u for u in urls}
        for fut in as_completed(futures):
            url, md = fut.result()
            out[url] = md
            print(f"done: {url} ({len(md)} chars)")
    return out


if __name__ == "__main__":
    targets = [
        "https://example.com",
        "https://example.org",
        "https://example.net",
    ]
    results = batch(targets)
    print(f"Scraped {len(results)} pages")

Step 10: Handle Errors Like Production Code

import time


def robust_scrape(url: str, attempts: int = 3) -> dict | None:
    delay = 2.0
    for i in range(1, attempts + 1):
        try:
            d = app.scrape_url(url, params={"formats": ["markdown"]})
            if d and d.get("markdown"):
                return d
        except Exception as e:
            print(f"attempt {i} failed: {e}")
        time.sleep(delay)
        delay *= 2
    return None

Understanding the Five Endpoints

CRW is not one scraper — it is five operations that compose. Knowing which to reach for is most of the skill. Scrape fetches a single known URL and is the workhorse for enrichment and one-off extraction. Crawl takes a seed URL, discovers links itself, and returns many pages — use it when you want a whole site or section and do not have the URL list. Map returns just the URL inventory of a site with no content, which is the cheapest way to scope work or build a frontier you will scrape selectively. Search takes a query instead of a URL and returns ranked web results, optionally with full page content in the same call — this is what answer engines and research agents use. Extract layers an LLM with a JSON schema over a scrape, returning typed structured data instead of text.

A useful mental model: scrape and search are synchronous request/response; crawl is asynchronous (you start a job and poll, though the SDK hides this behind a blocking call by default); map is a fast read; extract is scrape plus a typed transform. Most production systems use two or three together — for example, map to discover, crawl to ingest, and extract to structure the high-value pages.

Choosing Formats: markdown vs html vs links

CRW's formats parameter decides what you get back, and the right choice depends on what consumes the data. markdown is the default for anything that feeds an LLM — it preserves headings, lists, and tables as clean text with minimal token overhead, which is why every RAG and agent example uses it. Request html when you need to run your own DOM parsing or preserve exact structure (for example, extracting a specific table by its position). Request links when you want the page's outbound URLs without its prose, which is the cheapest way to seed a custom crawl frontier. You can ask for several at once — {"formats": ["markdown", "links"]} — and CRW returns each under its own key in the response, so a single fetch can serve both your text pipeline and your link graph.

A common mistake is requesting html "just in case" and then only ever reading markdown. That inflates response size and your storage bill for no benefit. Decide what the downstream step actually parses and request exactly that. If you are unsure, start with markdown plus onlyMainContent: True; it is the right answer for the large majority of AI workloads.

Reading Metadata Reliably

Every scrape and crawl result carries a metadata object with title, sourceURL, statusCode, and often description and language. Treat these as the canonical identity of a document — store sourceURL (not the URL you requested, which may have redirected) as your primary key, and use statusCode to drop soft-404 pages that return HTTP 200 with an error body:

def is_useful(doc: dict) -> bool:
    meta = doc.get("metadata", {})
    if meta.get("statusCode", 200) >= 400:
        return False
    md = doc.get("markdown", "")
    # soft-404 heuristic: real pages have substance
    if len(md.split()) < 50:
        return False
    title = (meta.get("title") or "").lower()
    if any(bad in title for bad in ("not found", "404", "error")):
        return False
    return True


good = [d for d in (robust_scrape(u) for u in some_urls) if d and is_useful(d)]
print(f"{len(good)} usable documents")

This single filter prevents a surprising amount of garbage from reaching your vector store or database. Sites frequently serve a styled "page not found" with a 200 status; without the check, that boilerplate gets embedded and pollutes retrieval.

Why CRW for Python Scraping

  • Drop-in SDK — the official Firecrawl Python client works with one URL change.
  • Fast — open-core Rust, small single binary, lower-latency than browser-based scrapers, fast cold start.
  • No lock-in — AGPL-3.0 self-host free, or managed cloud with the same API.

A Small CLI to Tie It Together

Wrap the pieces in a one-file CLI so the quickstart is something you actually keep and run, not just paste once. Standard-library argparse is enough:

import argparse, json, sys


def cli():
    p = argparse.ArgumentParser(prog="crw")
    sub = p.add_subparsers(dest="cmd", required=True)

    s = sub.add_parser("scrape"); s.add_argument("url")
    c = sub.add_parser("crawl"); c.add_argument("url")
    c.add_argument("--limit", type=int, default=25)
    m = sub.add_parser("search"); m.add_argument("query")

    args = p.parse_args()

    if args.cmd == "scrape":
        d = robust_scrape(args.url)
        print(d["markdown"] if d else "FAILED", file=sys.stdout)
    elif args.cmd == "crawl":
        job = app.crawl_url(args.url, params={
            "limit": args.limit,
            "scrapeOptions": {"formats": ["markdown"],
                              "onlyMainContent": True}})
        for pg in job["data"]:
            print(pg["metadata"]["sourceURL"])
    elif args.cmd == "search":
        res = app.search(args.query, params={"limit": 5})
        print(json.dumps([r["url"] for r in res["data"]], indent=2))


if __name__ == "__main__":
    cli()
# python crw.py scrape https://example.com
# python crw.py crawl https://example.com --limit 10
# python crw.py search "rust scraping api"

Now the quickstart is a usable tool: a single file that scrapes, crawls, or searches from the shell against your CRW instance, with the production error handling already wired in. It is the smallest thing that demonstrates all the moving parts working together and is genuinely worth keeping in your toolbox.

Next Steps

Self-host CRW from GitHub for free, or use fastCRW for managed cloud scraping.

FAQ

Frequently asked questions

Which SDK do I use with CRW from Python?
The official firecrawl-py SDK. Install it with pip and pass api_url pointing at your CRW instance (http://localhost:3000 self-host or https://api.fastcrw.com cloud). All scrape, crawl, map, search, and extract methods work unchanged because CRW is Firecrawl-API compatible.
When should I use map instead of crawl?
Use map when you only need the list of URLs on a site — for a sitemap audit or to decide what to scrape. It is much cheaper and faster than crawl because it does not fetch and clean each page's content.

Get Started

Try CRW Free

Self-host for free (AGPL) or use fastCRW cloud with 500 free credits — no credit card required.

Continue exploring

More tutorial posts

View category archive