What You'll Have in 15 Minutes
By the end of this quickstart you'll scrape a page to markdown, crawl a whole site, map its URLs, run a web search, extract structured JSON, and batch many URLs concurrently — all from Python against CRW. CRW is Firecrawl-API compatible, so you use the official SDK and just point it at your instance.
Step 1: Run CRW
docker run -p 3000:3000 ghcr.io/us/crw:latest
Verify it is up:
curl -s http://localhost:3000/health
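If you start CRW from a script rather than by hand, a small readiness loop saves guessing when the container is up. The sketch below polls the health URL from Step 1 using only the standard library. It assumes the endpoint answers HTTP 200 once CRW is ready; the names http_ok and wait_until_healthy are illustrative and not part of any SDK.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all mean "not ready".
        return False


def wait_until_healthy(
    url: str,
    retries: int = 10,
    pause: float = 1.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `url` up to `retries` times, sleeping `pause` seconds between tries."""
    for _ in range(retries):
        if probe(url):
            return True
        time.sleep(pause)
    return False


if __name__ == "__main__":
    # Assumed local address from Step 1.
    ready = wait_until_healthy("http://localhost:3000/health")
    print("CRW is up" if ready else "CRW did not become healthy in time")
```

The probe is injectable so the loop can be exercised without a running container.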
Step 2: Install the SDK
pip install firecrawl-py
Step 3: Connect
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="fc-YOUR-KEY", api_url="http://localhost:3000")
# fastCRW managed cloud instead:
# app = FirecrawlApp(api_key="fc-YOUR-KEY", api_url="https://api.fastcrw.com")
Step 4: Scrape a Page to Markdown
doc = app.scrape_url(
    "https://example.com",
    params={"formats": ["markdown"], "onlyMainContent": True},
)
print(doc["markdown"])
print("---")
print(doc["metadata"]["title"])
onlyMainContent strips nav, footers, and cookie banners — exactly what you want for LLM input or storage.
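For the storage case, one simple option is to write each document's markdown to a file named after its URL. This is a minimal sketch, assuming the scrape result is a dict with a "markdown" key as in the step above; filename_for, save_markdown, and the "scraped" directory are hypothetical names chosen for illustration.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def filename_for(url: str) -> str:
    """Turn a URL into a filesystem-safe markdown filename."""
    parts = urlparse(url)
    raw = f"{parts.netloc}{parts.path}".strip("/")
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    return f"{slug or 'index'}.md"


def save_markdown(doc: dict, url: str, out_dir: str = "scraped") -> Path:
    """Write the document's markdown under `out_dir` and return the path."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / filename_for(url)
    path.write_text(doc.get("markdown", ""), encoding="utf-8")
    return path
```

With the doc from the step above, save_markdown(doc, "https://example.com") would write scraped/example-com.md.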
Step 5: Crawl a Whole Site
job = app.crawl_url(
    "https://example.com",
    params={
        "limit": 25,
        "maxDepth": 2,
        "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
    },
)
for page in job["data"]:
    url = page["metadata"]["sourceURL"]
    words = len(page.get("markdown", "").split())
    print(f"{words:6d} words {url}")
CRW handles link discovery, deduplication, and depth limits server-side — one call, no orchestration code.
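Once the crawl returns, a quick ranking by size helps spot thin or empty pages before you store anything. The helper below is a sketch that assumes each page is a dict shaped like the ones printed above (a "markdown" string and a "metadata" dict containing "sourceURL"); summarize_crawl is a name introduced here for illustration.

```python
def summarize_crawl(pages: list[dict]) -> list[tuple[str, int]]:
    """Return (url, word_count) pairs, largest page first."""
    rows = []
    for page in pages:
        url = page.get("metadata", {}).get("sourceURL", "<unknown>")
        # Tolerate a missing or null markdown field.
        words = len((page.get("markdown") or "").split())
        rows.append((url, words))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Calling summarize_crawl(job["data"]) on the crawl result gives a list you can scan from the bottom to find pages worth dropping.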
Step 6: Map a Site's URLs (Fast, No Content)
When you only need the URL inventory — for a sitemap audit or to pick what to scrape — map is far cheaper than crawl:
urls = app.map_url("https://example.com")
print(f"Discovered {len(urls)} URLs")
for u in list(urls)[:10]:
    print(" ", u)
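To "pick what to scrape" from the inventory, a path filter is often enough. This sketch assumes the map result can be treated as a list of URL strings, as the loop above does; pick_urls and the "/blog/" prefix are illustrative choices, not part of the SDK.

```python
from urllib.parse import urldefrag, urlparse


def pick_urls(urls: list[str], prefix: str = "/blog/", limit: int = 20) -> list[str]:
    """Keep unique URLs whose path starts with `prefix`, up to `limit`."""
    picked: list[str] = []
    seen: set[str] = set()
    for raw in urls:
        if len(picked) >= limit:
            break
        # Drop "#fragment" so the same page is not selected twice.
        url = urldefrag(raw).url
        if url in seen or not urlparse(url).path.startswith(prefix):
            continue
        seen.add(url)
        picked.append(url)
    return picked
```

The selected list can then be handed to the scrape calls from Step 4, one URL at a time.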
Step 7: Search the Web
results = app.search(
    "best open source web scraper 2026",
    params={"limit": 5},
)
for r in results["data"]:
    print(r["title"], "->", r["url"])
Add scrapeOptions to get full page content with each result in the same call — ideal for RAG and answer engines.
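A rough sketch of the RAG use: build the search params with scrapeOptions, then fold the returned pages into a single context string. The scrapeOptions keys mirror the crawl call in Step 5. That each result carries its content under a "markdown" key, with "description" as a fallback, is an assumption to verify against your instance; search_params and build_context are hypothetical helpers.

```python
def search_params(limit: int = 5, with_content: bool = True) -> dict:
    """Build the params dict for a search, optionally asking for page content."""
    params: dict = {"limit": limit}
    if with_content:
        params["scrapeOptions"] = {"formats": ["markdown"], "onlyMainContent": True}
    return params


def build_context(results: list[dict], max_chars: int = 2000) -> str:
    """Join result bodies into one prompt-ready string, skipping empty ones."""
    blocks = []
    for r in results:
        # Assumed keys: "markdown" for scraped content, "description" as fallback.
        body = (r.get("markdown") or r.get("description") or "").strip()
        if not body:
            continue
        blocks.append(f"Source: {r.get('url', '')}\n{body[:max_chars]}")
    return "\n\n".join(blocks)
```

In use, this would look like results = app.search(query, params=search_params()) followed by build_context(results["data"]).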
Step 8: Extract Structured JSON
schema = {
    "type": "object",
    "properties": {
        "headline": {"type": "string"},
        "author": {"type": "string"},
        "published_date": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["headline"],
}
data = app.extract(
    urls=["https://example.com/blog/some-post"],
    params={"prompt": "Extract the article metadata.", "schema": schema},
)
print(data["data"])
No CSS selectors — the schema describes the data and CRW's LLM extraction reads the page semantically.
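LLM extraction can still return a record with a missing or mistyped field, so a light check before storing is worthwhile. The following is a deliberately small sketch, not a full JSON Schema validator: it only looks at "required" and the top-level "type" of each property. check_record is a name introduced here, and it assumes the extracted record is a plain dict.

```python
# Map the JSON Schema type names used in the schema above to Python types.
_JSON_TYPES = {
    "string": str,
    "array": list,
    "object": dict,
    "number": (int, float),
    "boolean": bool,
}


def check_record(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks fine."""
    problems = []
    for key in schema.get("required", []):
        if record.get(key) in (None, "", []):
            problems.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        value = record.get(key)
        expected = _JSON_TYPES.get(spec.get("type"))
        if value is not None and expected and not isinstance(value, expected):
            problems.append(
                f"{key}: expected {spec['type']}, got {type(value).__name__}"
            )
    return problems
```

For stricter validation, a dedicated JSON Schema library is the better tool; this is just enough to catch the common failures early.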
Step 9: Batch Many URLs Concurrently
For real workloads you scrape lists of URLs. A bounded thread pool keeps throughput high without overwhelming CRW or the targets:
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_one(url: str) -> tuple[str, str]:
    try:
        d = app.scrape_url(url, params={"formats": ["markdown"],
                                        "onlyMainContent": True})
        return url, d.get("markdown", "")
    except Exception as e:
        return url, f"ERROR: {e}"

def batch(urls: list[str], workers: int = 8) -> dict[str, str]:
    out: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(scrape_one, u): u for u in urls}
        for fut in as_completed(futures):
            url, md = fut.result()
            out[url] = md
            print(f"done: {url} ({len(md)} chars)")
    return out

if __name__ == "__main__":
    targets = [
        "https://example.com",
        "https://example.org",
        "https://example.net",
    ]
    results = batch(targets)
    print(f"Scraped {len(results)} pages")
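Because scrape_one reports failures as strings prefixed with "ERROR: ", the batch output mixes pages and error messages in one dict. A small helper can separate them before anything is stored; split_results is a hypothetical name, and it relies only on that prefix convention from the code above.

```python
ERROR_PREFIX = "ERROR: "


def split_results(results: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Split batch output into (successful pages, error messages) by URL."""
    ok: dict[str, str] = {}
    failed: dict[str, str] = {}
    for url, text in results.items():
        if text.startswith(ERROR_PREFIX):
            failed[url] = text[len(ERROR_PREFIX):]
        else:
            ok[url] = text
    return ok, failed
```

The failed dict doubles as a retry list: its keys are exactly the URLs worth sending through the batch again.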
Step 10: Handle Errors Like Production Code
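import time

def robust_scrape(url: str, attempts: int = 3) -> dict | None:
    """Scrape with retries and exponential backoff; None if every attempt fails."""
    delay = 2.0
    for i in range(1, attempts + 1):
        try:
            d = app.scrape_url(url, params={"formats": ["markdown"]})
            if d and d.get("markdown"):
                return d
            print(f"attempt {i} returned no markdown")
        except Exception as e:
            print(f"attempt {i} failed: {e}")
        if i < attempts:  # no point sleeping after the final attempt
            time.sleep(delay)
            delay *= 2
    return None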
Understanding the Five Endpoints
CRW is not one scraper; it is five operations that compose. Knowing which to reach for is most of the skill.
- Scrape fetches a single known URL and is the workhorse for enrichment and one-off extraction.
- Crawl takes a seed URL, discovers links itself, and returns many pages. Use it when you want a whole site or section and do not have the URL list.
- Map returns just the URL inventory of a site with no content, which is the cheapest way to scope work or build a frontier you will scrape selectively.
- Search takes a query instead of a URL and returns ranked web results, optionally with full page content in the same call. This is what answer engines and research agents use.
- Extract layers an LLM with a JSON schema over a scrape, returning typed structured data instead of text.
A useful mental model: scrape and search are synchronous request/response; crawl is asynchronous (you start a job and poll, though the SDK hides this behind a blocking call by default); map is a fast read; extract is scrape plus a typed transform. Most production systems use two or three together — for example, map to discover, crawl to ingest, and extract to structure the high-value pages.
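One way to picture the composition is a discover-then-fetch pipeline: map the site, keep the URLs that matter, and scrape only those. The sketch below takes the two operations as plain callables so it makes no claim about SDK return shapes. discover_then_scrape and its parameters are hypothetical, and adapting app.map_url to return a plain list of URL strings is left as an assumption to check against your SDK version.

```python
from typing import Callable


def discover_then_scrape(
    map_site: Callable[[str], list[str]],
    scrape: Callable[[str], dict],
    seed: str,
    keep: Callable[[str], bool],
    limit: int = 10,
) -> list[dict]:
    """Map a site, filter the inventory, then scrape only the selected URLs."""
    selected = [u for u in map_site(seed) if keep(u)][:limit]
    return [scrape(u) for u in selected]


# Possible wiring against the client from Step 3 (not executed here):
# docs = discover_then_scrape(
#     map_site=lambda seed: list(app.map_url(seed)),
#     scrape=lambda u: app.scrape_url(u, params={"formats": ["markdown"]}),
#     seed="https://example.com",
#     keep=lambda u: "/docs/" in u,
# )
```

Passing the operations in as functions also makes the pipeline easy to test with fakes before pointing it at a live instance.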
Choosing Formats: markdown vs html vs links
CRW's formats parameter decides what you get back, and the right choice depends on what consumes the data. markdown is the default for anything that feeds an LLM — it preserves headings, lists, and tables as clean text with minimal token overhead, which is why every RAG and agent example uses it. Request html when you need to run your own DOM parsing or preserve exact structure (for example, extracting a specific table by its position). Request links when you want the page's outbound URLs without its prose, which is the cheapest way to seed a custom crawl frontier. You can ask for several at once — {"formats": ["markdown", "links"]} — and CRW returns each under its own key in the response, so a single fetch can serve both your text pipeline and your link graph.
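As a sketch of the "one fetch, two consumers" idea, the helper below splits a multi-format response into text for the content pipeline and edges for a link graph. It assumes the response is a dict with "markdown" and "links" keys as described above, and that "links" holds a list of URL strings, which is worth verifying; text_and_edges is a name made up for this example.

```python
def text_and_edges(doc: dict) -> tuple[str, list[tuple[str, str]]]:
    """Return the page text and (source, target) edges from one response."""
    source = doc.get("metadata", {}).get("sourceURL", "")
    text = doc.get("markdown") or ""
    # Assumed shape: "links" is a list of URL strings found on the page.
    edges = [(source, target) for target in doc.get("links") or []]
    return text, edges
```

A response requested with {"formats": ["markdown", "links"]} can then feed both consumers from a single call to text_and_edges.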
A common mistake is requesting html "just in case" and then only ever reading markdown. That inflates response size and your storage bill for no benefit. Decide what the downstream step actually parses and request exactly that. If you are unsure, start with markdown plus onlyMainContent: True; it is the right answer for the large majority of AI workloads.
Reading Metadata Reliably
Every scrape and crawl result carries a metadata object with title, sourceURL, statusCode, and often description and language. Treat these as the canonical identity of a document: store sourceURL (not the URL you requested, which may have redirected) as your primary key, and use statusCode to drop hard failures. Soft-404 pages return HTTP 200 with an error body, so statusCode alone will not catch them; a content heuristic covers that case:
def is_useful(doc: dict) -> bool:
    meta = doc.get("metadata", {})
    # hard failures; a missing statusCode is treated as 200
    if (meta.get("statusCode") or 200) >= 400:
        return False
    md = doc.get("markdown", "")
    # soft-404 heuristic: real pages have substance
    if len(md.split()) < 50:
        return False
    title = (meta.get("title") or "").lower()
    if any(bad in title for bad in ("not found", "404", "error")):
        return False
    return True

# some_urls is a placeholder for your own list of URLs
good = [d for d in (robust_scrape(u) for u in some_urls) if d and is_useful(d)]
print(f"{len(good)} usable documents")
This single filter prevents a surprising amount of garbage from reaching your vector store or database. Sites frequently serve a styled "page not found" with a 200 status; without the check, that boilerplate gets embedded and pollutes retrieval.
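The advice to key documents on sourceURL can be sketched as a small deduplication step: two requested URLs that redirect to the same page collapse into one entry. dedupe_by_source is a hypothetical helper, and it assumes each document carries metadata.sourceURL as described in this section.

```python
def dedupe_by_source(docs: list[dict]) -> dict[str, dict]:
    """Index documents by metadata.sourceURL, keeping the first copy of each."""
    store: dict[str, dict] = {}
    for doc in docs:
        key = doc.get("metadata", {}).get("sourceURL")
        # Skip documents with no identity, and later duplicates of a seen page.
        if key and key not in store:
            store[key] = doc
    return store
```

Running this after the usefulness filter gives a dict that maps directly onto primary-keyed rows or vector-store IDs.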
Why CRW for Python Scraping
- Drop-in SDK — the official Firecrawl Python client works with one URL change.
- Fast — open-core Rust, small single binary, lower-latency than browser-based scrapers, fast cold start.
- No lock-in — AGPL-3.0 self-host free, or managed cloud with the same API.
A Small CLI to Tie It Together
Wrap the pieces in a one-file CLI so the quickstart is something you actually keep and run, not just paste once. Standard-library argparse is enough:
import argparse, json, sys

# Assumes `app` (Step 3) and `robust_scrape` (Step 10) are defined in this file.
def cli():
    p = argparse.ArgumentParser(prog="crw")
    sub = p.add_subparsers(dest="cmd", required=True)
    s = sub.add_parser("scrape"); s.add_argument("url")
    c = sub.add_parser("crawl"); c.add_argument("url")
    c.add_argument("--limit", type=int, default=25)
    m = sub.add_parser("search"); m.add_argument("query")
    args = p.parse_args()
    if args.cmd == "scrape":
        d = robust_scrape(args.url)
        if d is None:
            sys.exit("FAILED")  # message goes to stderr, exit status 1
        print(d["markdown"])
    elif args.cmd == "crawl":
        job = app.crawl_url(args.url, params={
            "limit": args.limit,
            "scrapeOptions": {"formats": ["markdown"],
                              "onlyMainContent": True}})
        for pg in job["data"]:
            print(pg["metadata"]["sourceURL"])
    elif args.cmd == "search":
        res = app.search(args.query, params={"limit": 5})
        print(json.dumps([r["url"] for r in res["data"]], indent=2))

if __name__ == "__main__":
    cli()

# python crw.py scrape https://example.com
# python crw.py crawl https://example.com --limit 10
# python crw.py search "rust scraping api"
Now the quickstart is a usable tool: a single file that scrapes, crawls, or searches from the shell against your CRW instance, with the production error handling already wired in. It is the smallest thing that demonstrates all the moving parts working together and is genuinely worth keeping in your toolbox.
Next Steps
Self-host CRW from GitHub for free, or use fastCRW for managed cloud scraping.