Why Not Scrape Google Directly
Scraping Google's results HTML is a losing game: rotating layouts, aggressive bot defenses, CAPTCHA walls, and legal grey area. CRW exposes a /v1/search endpoint that returns ranked results and can fetch the full content of each result in a single call — exactly what an AI agent or a rank tracker needs, without you touching a SERP page.
What We're Building
- A reusable search helper over CRW's search API
- A keyword rank tracker that records your domain's position over time
- A "search + read" function that returns results with full markdown for RAG / answer engines
Prerequisites
- CRW running:
docker run -p 3000:3000 ghcr.io/us/crw:latest - Python 3.10+
pip install requests
Step 1: The Search Endpoint
CRW's search API is Firecrawl-compatible. You can call it with the SDK or plain HTTP. We'll use requests so the contract is explicit:
import requests
CRW_BASE = "http://localhost:3000" # self-host
# CRW_BASE = "https://api.fastcrw.com" # fastCRW cloud
API_KEY = "fc-YOUR-KEY"
def search(query: str, limit: int = 10, scrape: bool = False) -> list[dict]:
"""Return ranked results. If scrape=True, include full page markdown."""
body = {"query": query, "limit": limit}
if scrape:
body["scrapeOptions"] = {"formats": ["markdown"], "onlyMainContent": True}
resp = requests.post(
f"{CRW_BASE}/v1/search",
headers={"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"},
json=body,
timeout=60,
)
resp.raise_for_status()
return resp.json().get("data", [])
Step 2: Inspect Results
for r in search("best rust web scraper 2026", limit=5):
print(r["title"])
print(" ", r["url"])
print(" ", r.get("description", "")[:120])
Each result has at least title, url, and description. With scrape=True, results also carry a markdown field with the cleaned page body.
Step 3: Keyword Rank Tracker
Track where a target domain ranks for a set of keywords over time. This is the core of a lightweight SEO rank monitor:
import sqlite3
from datetime import datetime
from urllib.parse import urlparse
DB = "ranks.db"
def init_db():
with sqlite3.connect(DB) as c:
c.execute("""CREATE TABLE IF NOT EXISTS ranks (
keyword TEXT, domain TEXT, position INTEGER,
url TEXT, checked_at TEXT)""")
def find_rank(keyword: str, domain: str, depth: int = 50) -> dict:
results = search(keyword, limit=depth)
for i, r in enumerate(results, start=1):
host = urlparse(r["url"]).netloc.replace("www.", "")
if host.endswith(domain):
return {"position": i, "url": r["url"]}
return {"position": None, "url": None}
def track(keywords: list[str], domain: str):
init_db()
rows = []
for kw in keywords:
res = find_rank(kw, domain)
pos = res["position"]
print(f"{kw!r}: {'#' + str(pos) if pos else 'not in top 50'}")
rows.append((kw, domain, pos, res["url"],
datetime.now().isoformat()))
with sqlite3.connect(DB) as c:
c.executemany("INSERT INTO ranks VALUES (?,?,?,?,?)", rows)
def rank_history(keyword: str, domain: str) -> list[tuple]:
with sqlite3.connect(DB) as c:
return c.execute(
"""SELECT checked_at, position FROM ranks
WHERE keyword=? AND domain=? ORDER BY checked_at""",
(keyword, domain),
).fetchall()
Step 4: Run the Tracker
if __name__ == "__main__":
track(
keywords=[
"open source web scraper",
"firecrawl alternative",
"rust scraping api",
],
domain="fastcrw.com",
)
for ts, pos in rank_history("firecrawl alternative", "fastcrw.com"):
print(ts, "->", pos)
Step 5: Search + Read for Answer Engines
For a RAG or answer-engine use case, you want results and their content in one round trip. scrape=True does exactly that — no second fetch loop:
def research(question: str, k: int = 5) -> str:
"""Build an LLM context block from the top k results."""
hits = search(question, limit=k, scrape=True)
blocks = []
for h in hits:
md = h.get("markdown", "")[:4000]
blocks.append(f"### {h['title']}\nSOURCE: {h['url']}\n\n{md}")
context = "\n\n---\n\n".join(blocks)
# Feed `context` + `question` to your LLM with a cite-your-sources prompt
return context
if __name__ == "__main__":
ctx = research("what makes a web scraper memory-efficient")
print(ctx[:1500])
Scheduling the Rank Tracker
# crontab -e — run daily at 06:00
0 6 * * * cd /opt/rank && /usr/bin/python3 tracker.py >> rank.log 2>&1
The Cost of Rolling Your Own SERP Scraper
It is worth being concrete about what "just scrape Google" actually costs, because the alternative looks free until you try it. Search results pages are among the most aggressively defended surfaces on the web: rotating DOM structures (so your selectors break weekly), IP-based and behavioral bot detection, interstitial CAPTCHAs, and consent walls that differ by region. To make a homegrown scraper survive you end up maintaining a residential proxy pool, a CAPTCHA-solving integration, headless-browser fingerprint randomization, and parsers for several result layouts — and you re-do that work every time the layout shifts. That is a dedicated project, not a helper function, and the legal posture is shaky on top of it. A search API exists so you can spend your engineering time on the product instead of an arms race you do not win.
Designing a Rank Tracker That Doesn't Lie to You
Rank data is noisy, and a naive tracker produces graphs that look dramatic but mean nothing. Three disciplines keep it honest. First, fix the query set and the depth: comparing today's "top 50" against last week's "top 10" invents movement that is not real, so always request the same depth. Second, record the matched URL, not just the position — a domain "ranking #3" with a different page than last week is a content cannibalization signal you would otherwise miss, which is why find_rank stores url alongside position. Third, sample on a stable cadence and treat single-day swings as noise; trend over a rolling window is the signal. The schema in this tutorial already captures what you need; the analysis layer just has to resist over-reading a single data point:
def trend(keyword: str, domain: str, window: int = 7) -> str:
hist = rank_history(keyword, domain)[-window:]
points = [p for _, p in hist if p is not None]
if len(points) < 2:
return "insufficient data"
delta = points[0] - points[-1] # positive = improved (lower number)
if abs(delta) < 2:
return f"stable around #{points[-1]}"
return f"{'up' if delta > 0 else 'down'} {abs(delta)} positions over {len(points)} samples"
This turns a jittery position series into a statement a human can act on, and it is honest about uncertainty when there is not enough data yet.
Search + Read: One Call Instead of an N+1
The pattern that makes the search API genuinely powerful for AI is the combined search-and-scrape. The naive answer-engine implementation searches, gets ten URLs, then loops issuing ten more scrape requests — an N+1 that doubles latency and request count. Passing scrapeOptions in the search body collapses this into one round trip: CRW fetches and cleans each result server-side and returns the markdown inline. For a latency-sensitive agent this is the difference between a snappy answer and a multi-second stall, and it is why the research() helper builds its context block directly from one response with no second fetch loop. When you only need titles and snippets (a rank tracker), omit scrapeOptions and the call stays minimal — request exactly the work you need.
Why CRW's Search API
- No SERP HTML parsing — you get structured results without touching Google's anti-bot defenses.
- Search + scrape in one call — ideal for answer engines; CRW averages ~880 ms search latency.
- Open-core, no lock-in — small single Rust binary, lower-latency, local-first, AGPL-3.0 + Managed Cloud.
- Predictable cost — self-host unlimited; fastCRW cloud fastCRW pricing is a one-time lifetime 500 credits, never a monthly meter you can blow through silently.
Caching Search Results to Cut Cost and Latency
Many search workloads repeat queries — a rank tracker checks the same keywords daily, an agent re-asks similar questions. A short-lived cache keyed on the query removes redundant calls and makes repeated lookups instant:
import sqlite3, json, time, hashlib
CACHE_DB = "search_cache.db"
TTL = 3600 # seconds
def _key(query: str, limit: int, scrape: bool) -> str:
return hashlib.sha256(f"{query}|{limit}|{scrape}".encode()).hexdigest()
def cached_search(query: str, limit: int = 10, scrape: bool = False):
k = _key(query, limit, scrape)
with sqlite3.connect(CACHE_DB) as c:
c.execute("""CREATE TABLE IF NOT EXISTS cache
(k TEXT PRIMARY KEY, ts REAL, payload TEXT)""")
row = c.execute("SELECT ts, payload FROM cache WHERE k=?",
(k,)).fetchone()
if row and time.time() - row[0] < TTL:
return json.loads(row[1])
data = search(query, limit, scrape) # the real call from Step 1
with sqlite3.connect(CACHE_DB) as c:
c.execute("INSERT OR REPLACE INTO cache VALUES (?,?,?)",
(k, time.time(), json.dumps(data)))
return data
For a rank tracker this is mostly a latency and politeness win; for an answer engine fielding overlapping user questions it is a direct cost reduction. Tune TTL to how fast the underlying results actually move — an hour is reasonable for general queries, shorter for news-sensitive ones. The point is to never pay twice for an answer that has not changed.
Next Steps
Self-host CRW from GitHub for free, or use fastCRW for managed cloud scraping.
