Tutorial

How to Build a Web Scraping Agent with LangGraph and CRW

Build a web scraping agent with LangGraph and CRW — graph-based orchestration, state management, and conditional routing.

March 27, 2026 · 18 min read

What We're Building

A LangGraph agent that autonomously scrapes websites, extracts structured data, and makes decisions about what to scrape next. The agent uses CRW as its scraping backend — getting clean markdown from any URL in under a second — while LangGraph handles the orchestration: state management, tool routing, and conditional branching.

By the end of this tutorial, you'll have a working agent that can: (1) discover pages on a target site using CRW's /v1/map endpoint, (2) scrape and extract content from selected pages, (3) analyze the content and decide whether to scrape more pages, and (4) compile a structured report from all gathered data.

Prerequisites

  • CRW running locally (docker run -p 3000:3000 ghcr.io/us/crw:latest) or a fastCRW API key
  • Python 3.11+
  • An OpenAI API key (for the LLM powering the agent)
  • pip install langgraph langchain-openai firecrawl-py

Why LangGraph for Web Scraping Agents?

LangGraph models agent logic as a directed graph. Each node is a function, and edges define the flow between them. This is a natural fit for scraping workflows where you need to:

  • Branch conditionally — scrape more pages or stop based on what you've found
  • Maintain state — accumulate scraped data across multiple tool calls
  • Retry on failure — route back to a scrape node if a page fails
  • Human-in-the-loop — pause for approval before scraping sensitive sites

Compared to a simple ReAct loop, LangGraph gives you explicit control over the agent's execution path, making it easier to debug and reason about.

Step 1: Define the Agent State

LangGraph agents operate on a shared state object. Define what the agent needs to track:

from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages
from langchain_core.messages import BaseMessage

class AgentState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    target_url: str
    discovered_urls: list[str]
    scraped_pages: list[dict]
    report: str

The messages field uses LangGraph's built-in message reducer so chat history accumulates automatically. The other fields track our scraping progress.
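To make the reducer behavior concrete, here is a simplified pure-Python stand-in for what add_messages does. The real reducer also coerces LangChain message objects and assigns ids; this sketch only models its core append-or-replace-by-id behavior on plain dicts:

```python
def add_messages_like(left: list[dict], right: list[dict]) -> list[dict]:
    """Simplified stand-in for LangGraph's add_messages reducer.

    New messages are appended; a message whose id matches an existing
    one replaces it in place (this is how message edits are supported).
    """
    positions = {m["id"]: i for i, m in enumerate(left)}
    merged = list(left)
    for msg in right:
        if msg["id"] in positions:
            merged[positions[msg["id"]]] = msg  # replace by id
        else:
            merged.append(msg)  # append new message
    return merged

history = [{"id": "1", "content": "hi"}]
history = add_messages_like(history, [{"id": "2", "content": "scrape the docs"}])
```

By contrast, fields without a reducer (like scraped_pages) are simply overwritten by whatever a node returns for them.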

Step 2: Create CRW Scraping Tools

Define tools that the agent can call. We'll use the Firecrawl Python SDK pointed at CRW:

from firecrawl import FirecrawlApp
from langchain_core.tools import tool

# Point at your CRW instance
crw = FirecrawlApp(
    api_key="fc-YOUR-KEY",
    api_url="http://localhost:3000"  # or "https://fastcrw.com/api"
)

@tool
def discover_urls(url: str) -> list[str]:
    """Discover all URLs on a website using CRW's map endpoint."""
    result = crw.map_url(url)
    return result.get("links", [])[:50]  # limit to 50 URLs

@tool
def scrape_page(url: str) -> dict:
    """Scrape a single page and return clean markdown content."""
    result = crw.scrape_url(url, params={"formats": ["markdown"]})
    return {
        "url": url,
        "title": result.get("metadata", {}).get("title", ""),
        "markdown": result.get("markdown", ""),
    }

@tool
def extract_data(url: str, schema: dict) -> dict:
    """Extract structured data from a page using CRW's extract endpoint."""
    result = crw.scrape_url(url, params={
        "formats": ["extract"],
        "extract": {"schema": schema}
    })
    return result.get("extract", {})

Step 3: Build the Agent Graph

Now wire the tools into a LangGraph graph with conditional routing:

from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
from langchain_openai import ChatOpenAI

# Initialize the LLM with tools
llm = ChatOpenAI(model="gpt-4o", temperature=0)
tools = [discover_urls, scrape_page, extract_data]
llm_with_tools = llm.bind_tools(tools)

# Define graph nodes
def agent_node(state: AgentState) -> dict:
    """The LLM decides what to do next."""
    system_prompt = f"""You are a web scraping agent. Your goal is to gather
    information from {state['target_url']}.

    Strategy:
    1. First discover URLs on the target site
    2. Select the most relevant pages to scrape
    3. Scrape each page for content
    4. When you have enough data, compile a report

    Pages scraped so far: {len(state['scraped_pages'])}
    """
    messages = [{"role": "system", "content": system_prompt}] + state["messages"]
    response = llm_with_tools.invoke(messages)
    return {"messages": [response]}

def compile_report(state: AgentState) -> dict:
    """Compile all scraped data into a final report."""
    pages = state["scraped_pages"]
    report_parts = []
    for page in pages:
        report_parts.append(
            f"## {page['title']}\nSource: {page['url']}\n\n{page['markdown'][:500]}"
        )
    report = "\n\n---\n\n".join(report_parts)
    return {"report": report}

# Routing function
def should_continue(state: AgentState) -> str:
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    return "compile"

# Build the graph
tool_node = ToolNode(tools)

graph = StateGraph(AgentState)
graph.add_node("agent", agent_node)
graph.add_node("tools", tool_node)
graph.add_node("compile", compile_report)

graph.set_entry_point("agent")
graph.add_conditional_edges("agent", should_continue, {
    "tools": "tools",
    "compile": "compile",
})
graph.add_edge("tools", "agent")
graph.add_edge("compile", END)

app = graph.compile()
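One gap worth noting: the prebuilt ToolNode only appends ToolMessages to messages, so nothing above ever writes into scraped_pages and the count in agent_node's prompt stays at zero. A hypothetical state-merging helper like the one below, called from a custom tools node in place of ToolNode, is one way to close that gap (a sketch; the real node would execute the tool and pass its output here):

```python
def merge_scraped_pages(state: dict, tool_name: str, tool_output) -> dict:
    """Return a state update that records scrape_page results.

    Only scrape_page outputs (dicts carrying markdown) accumulate into
    scraped_pages; other tools leave that field unchanged.
    """
    if tool_name == "scrape_page" and isinstance(tool_output, dict) \
            and "markdown" in tool_output:
        return {"scraped_pages": state.get("scraped_pages", []) + [tool_output]}
    return {}

update = merge_scraped_pages(
    {"scraped_pages": []},
    "scrape_page",
    {"url": "https://example.com", "title": "Example", "markdown": "# Hi"},
)
```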

Step 4: Visualize the Graph

LangGraph can render the agent's execution graph, which is helpful for debugging:

# Print ASCII representation
app.get_graph().print_ascii()

# Output:
#   +---------+
#   | agent   |
#   +---------+
#     /     \
#    v       v
# +-------+ +---------+
# | tools | | compile |
# +-------+ +---------+
#    |            |
#    v            v
# +-------+   +-----+
# | agent |   | END |
# +-------+   +-----+

Step 5: Run the Agent

from langchain_core.messages import HumanMessage

result = app.invoke({
    "messages": [HumanMessage(content="Research this website and gather key information")],
    "target_url": "https://docs.example.com",
    "discovered_urls": [],
    "scraped_pages": [],
    "report": "",
})

print(result["report"])

Step 6: Add State Persistence

For long-running scraping jobs, persist the agent's state so it can resume after interruptions. MemorySaver below keeps checkpoints in process memory, which is enough for development; for state that survives a process restart, swap in a database-backed checkpointer such as SqliteSaver from the langgraph-checkpoint-sqlite package:

from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()
app = graph.compile(checkpointer=checkpointer)

# Run with a thread ID for persistence
config = {"configurable": {"thread_id": "scrape-job-1"}}

result = app.invoke({
    "messages": [HumanMessage(content="Scrape the documentation site")],
    "target_url": "https://docs.example.com",
    "discovered_urls": [],
    "scraped_pages": [],
    "report": "",
}, config=config)

# Resume later — state is preserved
result = app.invoke({
    "messages": [HumanMessage(content="Now scrape the API reference section too")],
}, config=config)

Step 7: Stream Agent Progress

For real-time visibility into what the agent is doing, use LangGraph's event streaming. Note that astream_events is an async API, so this loop must run inside an async function (driven by asyncio.run, for example):

async for event in app.astream_events(
    {
        "messages": [HumanMessage(content="Research this website")],
        "target_url": "https://docs.example.com",
        "discovered_urls": [],
        "scraped_pages": [],
        "report": "",
    },
    version="v2",
):
    if event["event"] == "on_tool_start":
        print(f"🔧 Calling: {event['name']}({event['data']['input']})")
    elif event["event"] == "on_tool_end":
        print(f"✅ Result: {str(event['data']['output'])[:200]}")

Advanced: Multi-Site Comparison Agent

Extend the agent to compare data across multiple sites — useful for competitive analysis or price monitoring:

class ComparisonState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    sites: list[str]
    site_data: dict[str, list[dict]]
    comparison_report: str

@tool
def scrape_multiple(urls: list[str]) -> list[dict]:
    """Scrape multiple pages in sequence using CRW."""
    results = []
    for url in urls:
        try:
            result = crw.scrape_url(url, params={"formats": ["markdown"]})
            results.append({
                "url": url,
                "title": result.get("metadata", {}).get("title", ""),
                "markdown": result.get("markdown", ""),
            })
        except Exception as e:
            results.append({"url": url, "error": str(e)})
    return results
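scrape_multiple runs its requests one after another. Since each call is network-bound, running them through a thread pool cuts wall-clock time roughly in proportion to the pool size. Below is a sketch of the same error-capturing pattern run concurrently; fetch is a hypothetical stand-in for a wrapper around crw.scrape_url:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_concurrently(urls: list[str], fetch, max_workers: int = 5) -> list[dict]:
    """Scrape URLs in parallel, preserving input order and capturing errors.

    `fetch` is any callable taking a URL and returning a result dict;
    exceptions become {"url": ..., "error": ...} entries instead of crashing.
    """
    def safe_fetch(url: str) -> dict:
        try:
            return {"url": url, **fetch(url)}
        except Exception as e:
            return {"url": url, "error": str(e)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order regardless of completion order
        return list(pool.map(safe_fetch, urls))
```

Keep max_workers modest so the agent doesn't hammer the target site or your CRW instance.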

Using fastCRW Instead of Self-Hosted

To use the managed fastCRW cloud service instead of self-hosting, just change the API URL:

crw = FirecrawlApp(
    api_key="fc-YOUR-FASTCRW-KEY",
    api_url="https://fastcrw.com/api"
)

Everything else stays exactly the same. fastCRW handles infrastructure and scaling — so your agent can focus on the logic.

Why CRW for LangGraph Agents?

Speed matters for agents. When an LLM agent makes a tool call, the user is waiting. CRW's 833ms average response time means your agent stays responsive — compare that to 4.6 seconds with other scraping APIs. Over a 10-page research task, that's 8 seconds vs 46 seconds of scraping time alone.

Clean markdown improves agent reasoning. CRW strips navigation, ads, and boilerplate automatically. The LLM sees only the content that matters, which reduces token usage and improves the quality of the agent's decisions.

Firecrawl SDK compatibility. CRW works with the existing Firecrawl Python SDK — just change the api_url. If you have existing LangChain/LangGraph code using Firecrawl, switching to CRW is a one-line change.

Get Started

Run CRW locally in one command:

docker run -p 3000:3000 ghcr.io/us/crw:latest

Or sign up for fastCRW to skip the infrastructure and start building your LangGraph agent immediately.

Try CRW Free

Self-host for free (AGPL) or use fastCRW cloud with 500 free credits — no credit card required.