Use Cases/Use Case / LLM Agents

Web Scraping for LLM Agents

Give your LLM agent a reliable browse-and-extract tool — fastCRW's /v1/search and /v1/scrape over REST or MCP, with the same shape ChatGPT, Claude, and OpenAI agents already understand.

Published

May 27, 2026

Updated

May 27, 2026

Who this is for

Teams building LLM agents — research copilots, customer-support bots, internal "ask my data" assistants — that need a reliable way to put live web content in the model's context window. An agent without a browse tool is a chatbot with a 2024 cut-off; one with a flaky browse tool is worse, because it confidently cites pages that timed out.

fastCRW gives the agent two clean tool surfaces — search and scrape — that return predictable markdown the model can reason over.

Why fastCRW for agents

Agents care about three things in a browse tool: the call returns, it returns fast enough that the planner does not give up, and the payload is small enough that the next turn's tokens stay cheap. fastCRW is built around those three constraints.

POST /v1/search (docs.fastcrw.com/api-reference/search/) returns web results with optional inline content scraping, so the agent can go from query to grounded text in one tool call. POST /v1/scrape (docs.fastcrw.com/api-reference/scrape/) takes a single URL and hands back clean markdown. Both endpoints share the Firecrawl response envelope, so an agent already wired to Firecrawl's /v1/scrape only needs a base-URL change.

For MCP-aware hosts (Claude Desktop, Cursor, Cline, Windsurf), the crw-mcp package (dist-tag latest on npm) exposes the same endpoints as MCP tools — no custom transport code on the agent side.

The 5-step recipe

Wire fastCRW as a tool the agent can call. Expose two tools to the agent — search(query) and scrape(url) — backed by POST /v1/search and POST /v1/scrape. If your agent host speaks MCP, point it at crw-mcp@0.6.0 and skip the REST wiring entirely.
Let the planner pick search vs scrape. In the system prompt, tell the model to call search when it needs to discover URLs and scrape when it already has one. Cap iterations so a confused plan cannot loop the tool indefinitely.
Run the tool call and return markdown. Execute the tool against fastCRW. Return only the markdown body and the source URL to the model — not raw HTML, not full response envelopes. Smaller tool outputs mean cheaper next-turn tokens.
Let the model reason over the retrieved content. Feed the tool result back as a tool message. The agent now has grounded text it can quote, summarise, or extract structured fields from.
Cite the source in the final answer. Force the agent to surface the source URL beside any claim it makes. Auditable answers beat confident hallucinations every time.

# agent_loop.py — run with: python3 agent_loop.py
import os
import json
import requests
from anthropic import Anthropic

CRW = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRW_API_KEY']}"}
claude = Anthropic()

TOOLS = [
    {
        "name": "search",
        "description": "Search the web. Returns a list of {title, url, snippet}.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "scrape",
        "description": "Fetch a URL as clean markdown. Use only after search.",
        "input_schema": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
]

def run_tool(name: str, args: dict) -> str:
    if name == "search":
        r = requests.post(f"{CRW}/search", json={"query": args["query"], "limit": 5},
                          headers=HEADERS, timeout=30)
        return json.dumps(r.json()["data"]["results"])
    if name == "scrape":
        r = requests.post(f"{CRW}/scrape", json={"url": args["url"], "formats": ["markdown"]},
                          headers=HEADERS, timeout=60)
        return r.json()["data"]["markdown"][:8000]
    raise ValueError(name)

def chat(user_prompt: str, max_turns: int = 4) -> str:
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):
        resp = claude.messages.create(
            model="claude-sonnet-4-6", max_tokens=1024, tools=TOOLS, messages=messages,
        )
        if resp.stop_reason != "tool_use":
            return resp.content[0].text
        tool_use = next(b for b in resp.content if b.type == "tool_use")
        tool_result = run_tool(tool_use.name, tool_use.input)
        messages += [
            {"role": "assistant", "content": resp.content},
            {"role": "user", "content": [{
                "type": "tool_result", "tool_use_id": tool_use.id, "content": tool_result,
            }]},
        ]
    return "Tool budget exhausted."

if __name__ == "__main__":
    print(chat("What is the p50 scrape latency reported in the fastCRW 3-way benchmark?"))

Next steps

The MCP install path and tool reference live at docs.fastcrw.com; pricing for managed-API agent traffic is on fastcrw.com/pricing. Self-host the same binary for $0 per 1,000 scrapes under AGPL-3.0 when your agent runs inside a private network.

fastCRWlive

Scrape any URL, live

Get 500 free credits →

Sources

fastCRW /v1/search reference

https://docs.fastcrw.com/api-reference/search/

fastCRW /v1/scrape reference

https://docs.fastcrw.com/api-reference/scrape/

crw-mcp on npm (MCP server)

https://www.npmjs.com/package/crw-mcp

Model Context Protocol specification

https://modelcontextprotocol.io/

FAQ

REST or MCP — which should my agent use?

If your host is Claude Desktop, Cursor, Windsurf, Cline, or any MCP-aware runtime, install `crw-mcp` and let the host handle the transport. If you are building a custom agent loop in Python or TypeScript, the REST endpoints are usually faster to wire and easier to debug.

How does fastCRW compare to bundling Firecrawl as the tool?

The tool surface is intentionally Firecrawl-compatible — same endpoint shapes, same response envelope — so an agent built against Firecrawl's `/v1/scrape` keeps working after a base-URL swap. fastCRW adds a self-host option and the `crw-mcp` package, both of which Firecrawl Cloud does not.

What stops the agent from scraping the wrong page?

Two guardrails. Cap tool iterations in the agent loop, and require the model to emit a one-line rationale before each scrape call so you can inspect the trace. fastCRW respects robots.txt by default; the override is opt-in for legitimate cases.

Recommended next step

Claim an API key and start shipping.

Move from evaluation to implementation with credits, docs, and a compatibility-first API.

Create Account

Continue exploring

More from Use Cases

View all use cases

Previous in Use Cases

Web Scraping for Deep Research Agents

Next in Use Cases

Web Scraping for RAG Pipelines

Use Cases

Web Scraping for Real Estate Data

Use fastCRW to build property listing pipelines from public real estate sites with structured extraction of price, location, beds/baths, and features.

real estate data scrapingExtract price, address, beds/baths, square footage, and property type

Use Cases

Web Scraping for Content Aggregation

Build a comprehensive content aggregation pipeline with fastCRW: discover URLs across any source, scrape full-text pages into clean markdown, deduplicate, extract structured metadata, and feed a data pipeline dashboard — all via a single Firecrawl-compatible API.

web scraping for content aggregationDiscover all content URLs on any domain with a single `/v1/map` call

Use Cases

Web Scraping for RAG and AI Agent Training Data

Collect, clean, and normalize web corpora for RAG knowledge bases and AI agent training datasets with fastCRW — high-fidelity markdown, 63.74% truth-recall, Firecrawl-compatible API, single Rust binary.

web scraping for rag training data63.74% truth-recall on Firecrawl's public 1,000-URL benchmark (`diagnose_3way.py`, 2026-05-08) — highest of three tools tested

Related hubs

Keep the crawl path moving

Alternatives

Compare fastCRW against adjacent tools for the same workload.

Benchmarks

Check where internal performance claims start and stop.

Docs

Move into route-level implementation guidance for this workflow.