SERP Scraping in 2026: Search the Web With CRW's Search API (Python)

Stop scraping Google's HTML. Use CRW's /v1/search to get ranked results plus full page content in one call. Build a SERP monitor and keyword rank tracker — runnable Python, self-host free under AGPL-3.0.

fastcrw
By Recep | June 27, 2026 | 14 min read

Why Not Scrape Google Directly

Scraping Google's results HTML is a losing game: rotating layouts, aggressive bot defenses, CAPTCHA walls, and legal grey area. CRW exposes a /v1/search endpoint that returns ranked results and can fetch the full content of each result in a single call — exactly what an AI agent or a rank tracker needs, without you touching a SERP page.

What We're Building

  • A reusable search helper over CRW's search API
  • A keyword rank tracker that records your domain's position over time
  • A "search + read" function that returns results with full markdown for RAG / answer engines

Prerequisites

  • CRW running: docker run -p 3000:3000 ghcr.io/us/crw:latest
  • Python 3.10+
  • The requests library: pip install requests

Step 1: The Search Endpoint

CRW's search API is Firecrawl-compatible. You can call it with the SDK or plain HTTP. We'll use requests so the contract is explicit:

import requests

CRW_BASE = "http://localhost:3000"          # self-host
# CRW_BASE = "https://api.fastcrw.com"       # fastCRW cloud
API_KEY = "fc-YOUR-KEY"


def search(query: str, limit: int = 10, scrape: bool = False) -> list[dict]:
    """Return ranked results. If scrape=True, include full page markdown."""
    body = {"query": query, "limit": limit}
    if scrape:
        body["scrapeOptions"] = {"formats": ["markdown"], "onlyMainContent": True}

    resp = requests.post(
        f"{CRW_BASE}/v1/search",
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
        json=body,
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])
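
The helper above raises on the first failed request. For a scheduled job, a small retry wrapper with exponential backoff can be useful. The following is a minimal sketch, not part of CRW: with_retries and its parameters are hypothetical names, and it catches every exception for brevity, so in practice you would probably narrow that to requests.RequestException.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(call: Callable[[], T], attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(); on failure wait base_delay * 2**n seconds, then retry."""
    for n in range(attempts):
        try:
            return call()
        except Exception:
            if n == attempts - 1:
                raise              # out of attempts: surface the last error
            sleep(base_delay * 2 ** n)
    raise RuntimeError("attempts must be >= 1")
```

Usage would look like with_retries(lambda: search("firecrawl alternative")). The sleep parameter is injectable only so the backoff schedule can be tested without waiting.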

Step 2: Inspect Results

for r in search("best rust web scraper 2026", limit=5):
    print(r["title"])
    print("  ", r["url"])
    print("  ", r.get("description", "")[:120])

Each result has at least title, url, and description. With scrape=True, results also carry a markdown field with the cleaned page body.
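
If you pass results around a larger program, it may help to normalize them into a typed record first. The sketch below assumes only the fields named above (title, url, description, markdown). The Hit class and to_hits function are illustrative names, not part of CRW's API, and treating missing or null fields as empty strings is a design choice of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    position: int          # 1-based rank in the response
    title: str
    url: str
    description: str = ""
    markdown: str = ""


def to_hits(results: list[dict]) -> list[Hit]:
    """Convert raw result dicts to Hit records, skipping entries without a URL."""
    hits = []
    for i, r in enumerate(results, start=1):
        if not r.get("url"):
            continue       # keep original positions for the remaining hits
        hits.append(Hit(
            position=i,
            title=r.get("title") or "",
            url=r["url"],
            description=r.get("description") or "",
            markdown=r.get("markdown") or "",
        ))
    return hits
```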

Step 3: Keyword Rank Tracker

Track where a target domain ranks for a set of keywords over time. This is the core of a lightweight SEO rank monitor:

import sqlite3
from datetime import datetime
from urllib.parse import urlparse

DB = "ranks.db"


def init_db():
    with sqlite3.connect(DB) as c:
        c.execute("""CREATE TABLE IF NOT EXISTS ranks (
            keyword TEXT, domain TEXT, position INTEGER,
            url TEXT, checked_at TEXT)""")


def find_rank(keyword: str, domain: str, depth: int = 50) -> dict:
    results = search(keyword, limit=depth)
    for i, r in enumerate(results, start=1):
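        # hostname is lowercased and excludes any port
        host = (urlparse(r["url"]).hostname or "").removeprefix("www.")
        # Match the domain itself or a true subdomain; a bare endswith()
        # would also count "notfastcrw.com" as a hit for "fastcrw.com"
        if host == domain or host.endswith("." + domain):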
            return {"position": i, "url": r["url"]}
    return {"position": None, "url": None}


def track(keywords: list[str], domain: str):
    init_db()
    rows = []
    for kw in keywords:
        res = find_rank(kw, domain)
        pos = res["position"]
        print(f"{kw!r}: {'#' + str(pos) if pos else 'not in top 50'}")
        rows.append((kw, domain, pos, res["url"],
                     datetime.now().isoformat()))
    with sqlite3.connect(DB) as c:
        c.executemany("INSERT INTO ranks VALUES (?,?,?,?,?)", rows)


def rank_history(keyword: str, domain: str) -> list[tuple]:
    with sqlite3.connect(DB) as c:
        return c.execute(
            """SELECT checked_at, position FROM ranks
               WHERE keyword=? AND domain=? ORDER BY checked_at""",
            (keyword, domain),
        ).fetchall()

Step 4: Run the Tracker

if __name__ == "__main__":
    track(
        keywords=[
            "open source web scraper",
            "firecrawl alternative",
            "rust scraping api",
        ],
        domain="fastcrw.com",
    )
    for ts, pos in rank_history("firecrawl alternative", "fastcrw.com"):
        print(ts, "->", pos)
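
After a few runs you will probably want a one-glance summary rather than a per-keyword history. A possible sketch against the ranks table from Step 3 follows. latest_ranks is a hypothetical helper, and it assumes checked_at values are ISO-8601 strings in one consistent format, so that string comparison orders them chronologically.

```python
import sqlite3


def latest_ranks(conn: sqlite3.Connection, domain: str) -> list[tuple]:
    """Return (keyword, position, url, checked_at) for each keyword's newest row."""
    return conn.execute(
        """SELECT r.keyword, r.position, r.url, r.checked_at
           FROM ranks r
           JOIN (SELECT keyword, MAX(checked_at) AS latest
                 FROM ranks WHERE domain = ? GROUP BY keyword) m
             ON m.keyword = r.keyword AND m.latest = r.checked_at
           WHERE r.domain = ?
           ORDER BY r.keyword""",
        (domain, domain),
    ).fetchall()
```

Call it as latest_ranks(sqlite3.connect("ranks.db"), "fastcrw.com"). A position of None means the domain was not found within the requested depth on the most recent check.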

Step 5: Search + Read for Answer Engines

For a RAG or answer-engine use case, you want results and their content in one round trip. scrape=True does exactly that — no second fetch loop:

def research(question: str, k: int = 5) -> str:
    """Build an LLM context block from the top k results."""
    hits = search(question, limit=k, scrape=True)
    blocks = []
    for h in hits:
        # "or" also covers a result whose markdown is present but null
        md = (h.get("markdown") or "")[:4000]
        blocks.append(f"### {h['title']}\nSOURCE: {h['url']}\n\n{md}")
    context = "\n\n---\n\n".join(blocks)
    # Feed `context` + `question` to your LLM with a cite-your-sources prompt
    return context


if __name__ == "__main__":
    ctx = research("what makes a web scraper memory-efficient")
    print(ctx[:1500])
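
Truncating each page to 4000 characters still lets the combined context grow with k. If your model has a tight context window, a total budget on top of the per-document cap is one option. This is a sketch under assumptions: build_context and both limits are illustrative, and it counts characters rather than tokens.

```python
SEP = "\n\n---\n\n"


def build_context(hits: list[dict], per_doc: int = 4000,
                  total: int = 12000) -> str:
    """Join result blocks in rank order until the total character budget is hit."""
    blocks, used = [], 0
    for h in hits:
        md = (h.get("markdown") or "")[:per_doc]
        block = f"### {h.get('title', '')}\nSOURCE: {h.get('url', '')}\n\n{md}"
        cost = len(block) + (len(SEP) if blocks else 0)
        if used + cost > total:
            break          # stop at the first block that does not fit
        blocks.append(block)
        used += cost
    return SEP.join(blocks)
```

Stopping at the first block that does not fit keeps the highest-ranked sources intact instead of squeezing in a lower-ranked one.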

Scheduling the Rank Tracker

# crontab -e — run daily at 06:00
0 6 * * *  cd /opt/rank && /usr/bin/python3 tracker.py >> rank.log 2>&1

The Cost of Rolling Your Own SERP Scraper

It is worth being concrete about what "just scrape Google" actually costs, because the alternative looks free until you try it. Search results pages are among the most aggressively defended surfaces on the web: rotating DOM structures (so your selectors break weekly), IP-based and behavioral bot detection, interstitial CAPTCHAs, and consent walls that differ by region. To make a homegrown scraper survive you end up maintaining a residential proxy pool, a CAPTCHA-solving integration, headless-browser fingerprint randomization, and parsers for several result layouts — and you re-do that work every time the layout shifts. That is a dedicated project, not a helper function, and the legal posture is shaky on top of it. A search API exists so you can spend your engineering time on the product instead of an arms race you do not win.

Designing a Rank Tracker That Doesn't Lie to You

Rank data is noisy, and a naive tracker produces graphs that look dramatic but mean nothing. Three disciplines keep it honest. First, fix the query set and the depth: comparing today's "top 50" against last week's "top 10" invents movement that is not real, so always request the same depth. Second, record the matched URL, not just the position — a domain "ranking #3" with a different page than last week is a content cannibalization signal you would otherwise miss, which is why find_rank stores url alongside position. Third, sample on a stable cadence and treat single-day swings as noise; trend over a rolling window is the signal. The schema in this tutorial already captures what you need; the analysis layer just has to resist over-reading a single data point:

def trend(keyword: str, domain: str, window: int = 7) -> str:
    hist = rank_history(keyword, domain)[-window:]
    points = [p for _, p in hist if p is not None]
    if len(points) < 2:
        return "insufficient data"
    delta = points[0] - points[-1]   # positive = improved (lower number)
    if abs(delta) < 2:
        return f"stable around #{points[-1]}"
    return f"{'up' if delta > 0 else 'down'} {abs(delta)} positions over {len(points)} samples"

This turns a jittery position series into a statement a human can act on, and it is honest about uncertainty when there is not enough data yet.
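
The second discipline, watching the matched URL, can be checked the same way. The sketch below assumes you query checked_at, position, and url together (rank_history as written returns only the first two). url_changes is a hypothetical helper that reports each sample where a different page took over the ranking.

```python
from typing import Optional


def url_changes(
    rows: list[tuple[str, Optional[int], Optional[str]]]
) -> list[tuple[str, str, str]]:
    """Return (checked_at, old_url, new_url) wherever the ranking page changed."""
    changes = []
    prev = None
    for checked_at, _position, url in rows:
        if url is None:
            continue       # not ranked in this sample: nothing to compare
        if prev is not None and url != prev:
            changes.append((checked_at, prev, url))
        prev = url
    return changes
```

Unranked samples are skipped rather than treated as a change, so a one-day drop-out between two checks of the same page does not raise a false alarm.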

Search + Read: One Call Instead of an N+1

The pattern that makes the search API genuinely powerful for AI is the combined search-and-scrape. The naive answer-engine implementation searches, gets ten URLs, then loops issuing ten more scrape requests: an N+1 pattern that turns one request into eleven and stacks the scrape latency on top of the search. Passing scrapeOptions in the search body collapses this into one round trip: CRW fetches and cleans each result server-side and returns the markdown inline. For a latency-sensitive agent this is the difference between a snappy answer and a multi-second stall, and it is why the research() helper builds its context block directly from one response with no second fetch loop. When you only need titles and snippets (a rank tracker), omit scrapeOptions and the call stays minimal; request exactly the work you need.
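
To make the request arithmetic concrete, here is a toy comparison. FakeClient is a stand-in that only counts calls; it is not the CRW client and makes no network requests, so it shows request counts, not real latency.

```python
class FakeClient:
    """Stand-in that only counts calls; it is not the CRW client."""

    def __init__(self) -> None:
        self.calls = 0

    def search(self, query: str, limit: int, scrape: bool = False) -> list[dict]:
        self.calls += 1
        return [
            {"url": f"https://example.com/{i}",
             "markdown": f"body {i}" if scrape else None}
            for i in range(limit)
        ]

    def scrape(self, url: str) -> str:
        self.calls += 1
        return "body " + url.rsplit("/", 1)[-1]


def naive(client: FakeClient, query: str, k: int) -> list[str]:
    """Search first, then fetch every result separately: 1 + k requests."""
    return [client.scrape(h["url"]) for h in client.search(query, k)]


def combined(client: FakeClient, query: str, k: int) -> list[str]:
    """Ask the search call to return page bodies inline: 1 request."""
    return [h["markdown"] for h in client.search(query, k, scrape=True)]
```

With k = 10, naive() issues eleven calls and combined() issues one, while both return the same ten page bodies.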

Why CRW's Search API

  • No SERP HTML parsing: you get structured results without touching Google's anti-bot defenses.
  • Search + scrape in one call: ideal for answer engines; CRW averages ~880 ms search latency.
  • Open-core, no lock-in: small single Rust binary, lower-latency, local-first, AGPL-3.0 + Managed Cloud.
  • Predictable cost: self-host unlimited; fastCRW cloud gives a one-time, lifetime allotment of 500 credits rather than a monthly meter you can blow through silently.

Caching Search Results to Cut Cost and Latency

Many search workloads repeat queries — a rank tracker checks the same keywords daily, an agent re-asks similar questions. A short-lived cache keyed on the query removes redundant calls and makes repeated lookups instant:

import hashlib
import json
import sqlite3
import time

CACHE_DB = "search_cache.db"
TTL = 3600  # seconds


def _key(query: str, limit: int, scrape: bool) -> str:
    return hashlib.sha256(f"{query}|{limit}|{scrape}".encode()).hexdigest()


def cached_search(query: str, limit: int = 10, scrape: bool = False):
    k = _key(query, limit, scrape)
    with sqlite3.connect(CACHE_DB) as c:
        c.execute("""CREATE TABLE IF NOT EXISTS cache
                     (k TEXT PRIMARY KEY, ts REAL, payload TEXT)""")
        row = c.execute("SELECT ts, payload FROM cache WHERE k=?",
                        (k,)).fetchone()
        if row and time.time() - row[0] < TTL:
            return json.loads(row[1])

    data = search(query, limit, scrape)   # the real call from Step 1
    with sqlite3.connect(CACHE_DB) as c:
        c.execute("INSERT OR REPLACE INTO cache VALUES (?,?,?)",
                  (k, time.time(), json.dumps(data)))
    return data

For a rank tracker this is mostly a latency and politeness win; for an answer engine fielding overlapping user questions it is a direct cost reduction. Tune TTL to how fast the underlying results actually move — an hour is reasonable for general queries, shorter for news-sensitive ones. The point is to never pay twice for an answer that has not changed.
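
One caveat with the _key function above: it hashes the query exactly as typed, so "Firecrawl alternative" and "firecrawl  alternative" miss each other in the cache. A possible refinement is to normalize case and whitespace before hashing. This sketch assumes the search backend would return equivalent results for such variants, which you should verify for your own queries; normalized_key is an illustrative name.

```python
import hashlib
import re


def normalized_key(query: str, limit: int, scrape: bool) -> str:
    """Cache key that ignores case and repeated or surrounding whitespace."""
    q = re.sub(r"\s+", " ", query).strip().casefold()
    return hashlib.sha256(f"{q}|{limit}|{scrape}".encode()).hexdigest()
```

The limit and scrape flags stay in the key because they change what the response contains.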

Next Steps

Self-host CRW from GitHub for free, or use fastCRW for managed cloud scraping.

FAQ

Is CRW's search API a legal way to get Google-style results?
CRW's /v1/search returns ranked web results through its own pipeline, so you never scrape Google's results HTML or bypass its bot defenses yourself. It is the same approach Firecrawl-compatible clients use, and you can self-host the engine for full control.

How do I get result content without a second request?
Pass scrapeOptions with formats: ['markdown'] in the search body (scrape=True in this tutorial). CRW fetches and cleans each result page server-side and returns the markdown alongside each hit in the same response.
