What We're Building
A LangGraph agent that autonomously scrapes websites, extracts structured data, and makes decisions about what to scrape next. The agent uses CRW as its scraping backend — getting clean markdown from any URL in under a second — while LangGraph handles the orchestration: state management, tool routing, and conditional branching.
By the end of this tutorial, you'll have a working agent that can: (1) discover pages on a target site using CRW's /v1/map endpoint, (2) scrape and extract content from selected pages, (3) analyze the content and decide whether to scrape more pages, and (4) compile a structured report from all gathered data.
Prerequisites
- CRW running locally (`docker run -p 3000:3000 ghcr.io/us/crw:latest`) or a fastCRW API key
- Python 3.11+
- An OpenAI API key (for the LLM powering the agent)

Install the Python dependencies:

```bash
pip install langgraph langchain-openai firecrawl-py
```
Why LangGraph for Web Scraping Agents?
LangGraph models agent logic as a directed graph. Each node is a function, and edges define the flow between them. This is a natural fit for scraping workflows where you need to:
- Branch conditionally — scrape more pages or stop based on what you've found
- Maintain state — accumulate scraped data across multiple tool calls
- Retry on failure — route back to a scrape node if a page fails
- Human-in-the-loop — pause for approval before scraping sensitive sites
Compared to a simple ReAct loop, LangGraph gives you explicit control over the agent's execution path, making it easier to debug and reason about.
Step 1: Define the Agent State
LangGraph agents operate on a shared state object. Define what the agent needs to track:
```python
from typing import TypedDict, Annotated

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    target_url: str
    discovered_urls: list[str]
    scraped_pages: list[dict]
    report: str
```
The messages field uses LangGraph's built-in message reducer so chat history accumulates automatically. The other fields track our scraping progress.
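Conceptually, a reducer like add_messages merges each node's returned messages into the existing list instead of overwriting it. Here is a toy stdlib sketch of that behavior (LangGraph's real reducer is more involved — it also matches messages by ID so a resent message updates in place rather than duplicating):

```python
# Toy sketch of an "append" reducer, the core idea behind add_messages.
# The real LangGraph reducer additionally deduplicates by message ID.
def append_reducer(existing: list, new: list) -> list:
    """Return accumulated state: existing messages plus new ones."""
    return existing + new

history = append_reducer([], ["user: research this site"])
history = append_reducer(history, ["assistant: calling discover_urls"])
print(history)  # → ['user: research this site', 'assistant: calling discover_urls']
```

This is why `agent_node` later returns only `{"messages": [response]}` — the reducer takes care of appending the new message to the running history.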
Step 2: Create CRW Scraping Tools
Define tools that the agent can call. We'll use the Firecrawl Python SDK pointed at CRW:
```python
from firecrawl import FirecrawlApp
from langchain_core.tools import tool

# Point at your CRW instance
crw = FirecrawlApp(
    api_key="fc-YOUR-KEY",
    api_url="http://localhost:3000",  # or "https://fastcrw.com/api"
)

@tool
def discover_urls(url: str) -> list[str]:
    """Discover all URLs on a website using CRW's map endpoint."""
    result = crw.map_url(url)
    return result.get("links", [])[:50]  # limit to 50 URLs

@tool
def scrape_page(url: str) -> dict:
    """Scrape a single page and return clean markdown content."""
    result = crw.scrape_url(url, params={"formats": ["markdown"]})
    return {
        "url": url,
        "title": result.get("metadata", {}).get("title", ""),
        "markdown": result.get("markdown", ""),
    }

@tool
def extract_data(url: str, schema: dict) -> dict:
    """Extract structured data from a page using CRW's extract endpoint."""
    result = crw.scrape_url(url, params={
        "formats": ["extract"],
        "extract": {"schema": schema},
    })
    return result.get("extract", {})
```
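Scrapes can fail for transient reasons (timeouts, rate limits), and an unhandled tool error can derail an agent run. Below is a small retry wrapper you could put around the SDK call — a sketch, not part of the Firecrawl SDK: `fetch` is a stand-in parameter for a real callable such as `crw.scrape_url`, and `flaky_fetch` is a made-up example used only to demonstrate the backoff:

```python
import time

def scrape_with_retry(fetch, url: str, retries: int = 3, base_delay: float = 0.5):
    """Call fetch(url), retrying with exponential backoff on failure.

    `fetch` stands in for a real call such as crw.scrape_url.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the agent
            time.sleep(base_delay * (2 ** attempt))

# Demo with a made-up flaky fetcher that succeeds on the third call
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return {"markdown": "# Hello"}

print(scrape_with_retry(flaky_fetch, "https://example.com", base_delay=0.01))
# → {'markdown': '# Hello'}
```

You could fold this into `scrape_page` so the agent never sees one-off network hiccups.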
Step 3: Build the Agent Graph
Now wire the tools into a LangGraph graph with conditional routing:
```python
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
from langchain_openai import ChatOpenAI

# Initialize the LLM with tools
llm = ChatOpenAI(model="gpt-4o", temperature=0)
tools = [discover_urls, scrape_page, extract_data]
llm_with_tools = llm.bind_tools(tools)

# Define graph nodes
def agent_node(state: AgentState) -> dict:
    """The LLM decides what to do next."""
    system_prompt = f"""You are a web scraping agent. Your goal is to gather
information from {state['target_url']}.

Strategy:
1. First discover URLs on the target site
2. Select the most relevant pages to scrape
3. Scrape each page for content
4. When you have enough data, compile a report

Pages scraped so far: {len(state['scraped_pages'])}
"""
    messages = [{"role": "system", "content": system_prompt}] + state["messages"]
    response = llm_with_tools.invoke(messages)
    return {"messages": [response]}
```
```python
def compile_report(state: AgentState) -> dict:
    """Compile all scraped data into a final report."""
    report_parts = []
    for page in state["scraped_pages"]:
        report_parts.append(
            f"## {page['title']}\nSource: {page['url']}\n\n{page['markdown'][:500]}"
        )
    return {"report": "\n\n---\n\n".join(report_parts)}
```
```python
# Routing function
def should_continue(state: AgentState) -> str:
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    return "compile"

# Build the graph
tool_node = ToolNode(tools)

graph = StateGraph(AgentState)
graph.add_node("agent", agent_node)
graph.add_node("tools", tool_node)
graph.add_node("compile", compile_report)

graph.set_entry_point("agent")
graph.add_conditional_edges("agent", should_continue, {
    "tools": "tools",
    "compile": "compile",
})
graph.add_edge("tools", "agent")
graph.add_edge("compile", END)

app = graph.compile()
```
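Because should_continue only inspects the last message's tool_calls attribute, you can sanity-check the routing logic before attaching a real LLM. A quick self-contained sketch, using SimpleNamespace objects as stand-ins for LangChain's AIMessage:

```python
from types import SimpleNamespace

def should_continue(state: dict) -> str:
    # Same routing rule as in the graph above
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    return "compile"

# Stand-in for an AIMessage that requests a tool call
wants_tool = SimpleNamespace(
    tool_calls=[{"name": "scrape_page", "args": {"url": "https://example.com"}}]
)
# Stand-in for a final answer with no tool calls
final_answer = SimpleNamespace(tool_calls=[])

print(should_continue({"messages": [wants_tool]}))    # → tools
print(should_continue({"messages": [final_answer]}))  # → compile
```

The returned string is matched against the keys of the dict passed to `add_conditional_edges`, which is what routes execution to the "tools" or "compile" node.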
Step 4: Visualize the Graph
LangGraph can render the agent's execution graph, which is helpful for debugging:
```python
# Print an ASCII representation
app.get_graph().print_ascii()

# Output (simplified):
#
#          +-------+
#          | agent |
#          +-------+
#           /     \
#          v       v
#    +-------+   +---------+
#    | tools |   | compile |
#    +-------+   +---------+
#        |            |
#        v            v
#    (back to      +-----+
#     agent)       | END |
#                  +-----+
```
Step 5: Run the Agent
```python
from langchain_core.messages import HumanMessage

result = app.invoke({
    "messages": [HumanMessage(content="Research this website and gather key information")],
    "target_url": "https://docs.example.com",
    "discovered_urls": [],
    "scraped_pages": [],
    "report": "",
})

print(result["report"])
```
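That initial state dict gets repeated in every invocation example in this tutorial. A small hypothetical helper (not part of LangGraph — just a convenience you might write) keeps it in one place; plain role dicts are used here for brevity, though you can build HumanMessage objects the same way:

```python
def initial_state(task: str, target_url: str) -> dict:
    """Build a fresh AgentState dict for app.invoke (hypothetical helper)."""
    return {
        "messages": [{"role": "user", "content": task}],
        "target_url": target_url,
        "discovered_urls": [],
        "scraped_pages": [],
        "report": "",
    }

state = initial_state("Research this website", "https://docs.example.com")
print(state["target_url"])  # → https://docs.example.com
```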
Step 6: Add State Persistence
For long-running scraping jobs, persist the agent's state so it can resume after interruptions:
```python
from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()
app = graph.compile(checkpointer=checkpointer)

# Run with a thread ID for persistence
config = {"configurable": {"thread_id": "scrape-job-1"}}

result = app.invoke({
    "messages": [HumanMessage(content="Scrape the documentation site")],
    "target_url": "https://docs.example.com",
    "discovered_urls": [],
    "scraped_pages": [],
    "report": "",
}, config=config)

# Resume later — state is preserved
result = app.invoke({
    "messages": [HumanMessage(content="Now scrape the API reference section too")],
}, config=config)
```
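Under the hood, a checkpointer is essentially a store keyed by thread ID: each invocation loads any saved state for that thread, applies updates, and writes the result back. A toy stdlib sketch of the idea — deliberately much simpler than MemorySaver's real interface, and not API-compatible with it:

```python
class ToyCheckpointer:
    """Toy illustration of thread-keyed state persistence."""

    def __init__(self):
        self._store = {}

    def load(self, thread_id: str) -> dict:
        # Unknown threads start from a fresh, empty state
        return self._store.get(thread_id, {"messages": [], "scraped_pages": []})

    def save(self, thread_id: str, state: dict) -> None:
        self._store[thread_id] = state

cp = ToyCheckpointer()
state = cp.load("scrape-job-1")
state["messages"].append("Scrape the documentation site")
cp.save("scrape-job-1", state)

# A later invocation with the same thread ID resumes the saved state
resumed = cp.load("scrape-job-1")
print(resumed["messages"])  # → ['Scrape the documentation site']
```

This is why the second `app.invoke` above only needs the new message: everything else is reloaded from the checkpoint for `thread_id="scrape-job-1"`.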
Step 7: Stream Agent Progress
For real-time visibility into what the agent is doing, use LangGraph's streaming:
```python
import asyncio

async def main():
    # astream_events must run inside a coroutine
    async for event in app.astream_events(
        {
            "messages": [HumanMessage(content="Research this website")],
            "target_url": "https://docs.example.com",
            "discovered_urls": [],
            "scraped_pages": [],
            "report": "",
        },
        version="v2",
    ):
        if event["event"] == "on_tool_start":
            print(f"🔧 Calling: {event['name']}({event['data']['input']})")
        elif event["event"] == "on_tool_end":
            print(f"✅ Result: {str(event['data']['output'])[:200]}")

asyncio.run(main())
```
Advanced: Multi-Site Comparison Agent
Extend the agent to compare data across multiple sites — useful for competitive analysis or price monitoring:
```python
class ComparisonState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    sites: list[str]
    site_data: dict[str, list[dict]]
    comparison_report: str

@tool
def scrape_multiple(urls: list[str]) -> list[dict]:
    """Scrape multiple pages in sequence using CRW."""
    results = []
    for url in urls:
        try:
            result = crw.scrape_url(url, params={"formats": ["markdown"]})
            results.append({
                "url": url,
                "title": result.get("metadata", {}).get("title", ""),
                "markdown": result.get("markdown", ""),
            })
        except Exception as e:
            results.append({"url": url, "error": str(e)})
    return results
```
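Once scrape_multiple returns per-site results, building the comparison is ordinary data wrangling. Here is a sketch that tabulates successful pages versus errors per site, assuming the site_data shape from ComparisonState above (a dict mapping each site to its list of page results, where failed pages carry an "error" key):

```python
def summarize_sites(site_data: dict[str, list[dict]]) -> str:
    """Build a markdown table of pages scraped vs. errors per site."""
    lines = ["| Site | Pages | Errors |", "| --- | --- | --- |"]
    for site, pages in site_data.items():
        errors = sum(1 for p in pages if "error" in p)
        lines.append(f"| {site} | {len(pages) - errors} | {errors} |")
    return "\n".join(lines)

# Example data in the shape scrape_multiple produces
data = {
    "https://a.example.com": [{"url": "a/1", "markdown": "..."}],
    "https://b.example.com": [{"url": "b/1", "error": "timeout"}],
}
print(summarize_sites(data))
```

A compile-style node could run this over `state["site_data"]` and write the result into `comparison_report`, mirroring how `compile_report` works in the single-site agent.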
Using fastCRW Instead of Self-Hosted
To use the managed fastCRW cloud service instead of self-hosting, just change the API URL:
```python
crw = FirecrawlApp(
    api_key="fc-YOUR-FASTCRW-KEY",
    api_url="https://fastcrw.com/api"
)
```
Everything else stays exactly the same. fastCRW handles infrastructure and scaling — so your agent can focus on the logic.
Why CRW for LangGraph Agents?
Speed matters for agents. When an LLM agent makes a tool call, the user is waiting. CRW's 833ms average response time means your agent stays responsive — compare that to 4.6 seconds with other scraping APIs. Over a 10-page research task, that's 8 seconds vs 46 seconds of scraping time alone.
Clean markdown improves agent reasoning. CRW strips navigation, ads, and boilerplate automatically. The LLM sees only the content that matters, which reduces token usage and improves the quality of the agent's decisions.
Firecrawl SDK compatibility. CRW works with the existing Firecrawl Python SDK — just change the api_url. If you have existing LangChain/LangGraph code using Firecrawl, switching to CRW is a one-line change.
Next Steps
- Build a RAG pipeline with the data your agent collects
- Use CRW's MCP server for direct AI agent integration
- Compare CRW vs Firecrawl in detail
Get Started
Run CRW locally in one command:
```bash
docker run -p 3000:3000 ghcr.io/us/crw:latest
```
Or sign up for fastCRW to skip the infrastructure and start building your LangGraph agent immediately.