
Pydantic AI Web Scraping Integration — fastCRW [Firecrawl-Compatible]

Register fastCRW scrape, crawl, and search as Pydantic AI agent tools via the @agent.tool decorator. Typed Python agents fetch live web pages and reason over them in a single loop. 6.6 MB RAM binary, 833 ms median latency on a 1,000-URL benchmark.

Published
May 12, 2026
Updated
May 12, 2026
Category
integrations
Verdict

Decorate fastCRW calls with @agent.tool in Pydantic AI. Your agent can scrape, search, and crawl live pages with full type safety. Responses are Markdown-formatted and ready for agent reasoning.

  • Register fastCRW as Pydantic AI tools using the @agent.tool decorator
  • Full type safety on tool inputs/outputs via Pydantic models
  • Works with Pydantic AI's message loop, streaming, and dependency injection
  • 6.6 MB RAM fastCRW binary, 833 ms average latency, zero infrastructure

Why Pydantic AI + fastCRW

Pydantic AI is a Python agent framework built on type safety. Where many agent frameworks pass untyped dictionaries between tools, Pydantic AI validates tool inputs and outputs with Pydantic models, giving you type guarantees and tool schemas generated automatically from signatures and docstrings. fastCRW integrates as a decorator-based tool that scrapes web pages and returns validated responses.

The pattern: define a tool that calls fastCRW, attach it to your agent, and let the agent decide when to scrape. The agent sees the tool schema, invokes fastCRW for live page content, and reasons about the results in the same message loop. All inputs and outputs are type-checked via Pydantic validation.

Setup

  1. Install Pydantic AI and dependencies.
  2. Sign up at fastcrw.com for an API key.
  3. Export FASTCRW_API_KEY in your shell.
  4. Define tool functions with @agent.tool that call fastCRW.
  5. Attach tools to your agent and run.
pip install pydantic-ai requests
export FASTCRW_API_KEY="fcrw_..."
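
To confirm the key works before wiring up the agent, you can hit the scrape endpoint directly. A minimal sanity check: the endpoint and payload mirror the tool code below, and example.com is just a placeholder target.

import os
import requests

# Quick sanity check that FASTCRW_API_KEY is valid
resp = requests.post(
    "https://fastcrw.com/api/v1/scrape",
    headers={"Authorization": f"Bearer {os.environ['FASTCRW_API_KEY']}"},
    json={"url": "https://example.com", "formats": ["markdown"]},
    timeout=30,
)
resp.raise_for_status()  # raises if the key is rejected
print(resp.json()["data"].get("markdown", "")[:200])  # first 200 chars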

Code Example: Basic fastCRW Tool

Create a file scraper_agent.py:

import os
import requests
from pydantic import BaseModel, Field
from pydantic_ai import Agent, RunContext


# Define Pydantic models for type safety
class ScrapedPage(BaseModel):
    """A web page scraped via fastCRW."""
    url: str = Field(..., description="The URL that was scraped")
    markdown: str = Field(..., description="The page content as Markdown")
    status_code: int = Field(default=200, description="HTTP status code")
    load_time_ms: int | None = Field(None, description="Load time in milliseconds")


class SearchResult(BaseModel):
    """A single search result from fastCRW."""
    title: str = Field(..., description="Result title")
    url: str = Field(..., description="Result URL")
    snippet: str = Field(..., description="Brief description")


class SearchResults(BaseModel):
    """Multiple search results."""
    results: list[SearchResult] = Field(..., description="List of search results")
    count: int = Field(..., description="Total results returned")


# Initialize agent with a system prompt
agent = Agent(
    model="openai:gpt-4o-mini",
    system_prompt="You are a research agent. Use fastCRW tools to scrape and search the web, then synthesize findings into coherent answers.",
)


# Define fastCRW scrape tool
@agent.tool_plain  # no RunContext parameter, so register as a plain tool
def scrape_url(url: str) -> ScrapedPage:
    """
    Scrape a single URL via fastCRW and return Markdown content.

    Args:
        url: The URL to scrape.

    Returns:
        ScrapedPage with markdown content and metadata.
    """
    api_key = os.environ["FASTCRW_API_KEY"]
    response = requests.post(
        "https://fastcrw.com/api/v1/scrape",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json={"url": url, "formats": ["markdown"]},
        timeout=60,
    )
    response.raise_for_status()
    data = response.json()

    return ScrapedPage(
        url=url,
        markdown=data["data"].get("markdown", ""),
        status_code=data["data"].get("status_code", 200),
        load_time_ms=data["data"].get("load_time_ms"),
    )


# Define fastCRW search tool
@agent.tool_plain  # no RunContext parameter, so register as a plain tool
def search_web(query: str, limit: int = 5) -> SearchResults:
    """
    Search the web via fastCRW.

    Args:
        query: Search query.
        limit: Max results to return (1-10).

    Returns:
        SearchResults with matching pages.
    """
    api_key = os.environ["FASTCRW_API_KEY"]
    response = requests.post(
        "https://fastcrw.com/api/v1/search",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json={"query": query, "limit": limit},
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()

    results = [
        SearchResult(
            title=r.get("title", ""),
            url=r.get("url", ""),
            snippet=r.get("snippet", ""),
        )
        for r in data["data"].get("results", [])[:limit]
    ]

    return SearchResults(results=results, count=len(results))


# Run agent
if __name__ == "__main__":
    # Example usage
    result = agent.run_sync("What are the latest Python 3.14 features?")
    print(result.output)
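
After a run you can also inspect what the agent actually did. A minimal sketch, assuming a recent pydantic-ai release where run results expose all_messages() and usage():

# Continuing from the run above: inspect the full message history,
# including tool calls and tool returns
for message in result.all_messages():
    print(type(message).__name__)  # e.g. ModelRequest, ModelResponse

# Token accounting for the run
print(result.usage())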

Streaming Example with Dependency Injection

For streaming responses and context-aware tools:

import asyncio
import os

import requests
from pydantic import BaseModel
from pydantic_ai import Agent, RunContext


class UserContext(BaseModel):
    """Dependency context for tools."""
    user_id: str
    session_id: str
    cache: dict = {}


class ScrapeResult(BaseModel):
    url: str
    content: str


# Agent with dependencies
agent = Agent(
    model="openai:gpt-4o-mini",
    deps_type=UserContext,
    system_prompt="Research tool. Scrape URLs and synthesize findings.",
)


@agent.tool
def scrape_with_cache(ctx: RunContext[UserContext], url: str) -> ScrapeResult:
    """
    Scrape a URL with caching via context.

    Args:
        ctx: Dependency context with cache.
        url: URL to scrape.

    Returns:
        ScrapeResult with cached or fresh content.
    """
    # Check cache
    if url in ctx.deps.cache:
        print(f"Cache hit for {url}")
        return ScrapeResult(url=url, content=ctx.deps.cache[url])

    # Call fastCRW
    api_key = os.environ["FASTCRW_API_KEY"]
    response = requests.post(
        "https://fastcrw.com/api/v1/scrape",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"url": url, "formats": ["markdown"]},
        timeout=60,
    )
    response.raise_for_status()
    markdown = response.json()["data"]["markdown"]

    # Store in cache
    ctx.deps.cache[url] = markdown

    return ScrapeResult(url=url, content=markdown)


# Run with streaming: run_stream is an async context manager
async def main() -> None:
    context = UserContext(user_id="user_123", session_id="session_abc")

    async with agent.run_stream(
        "Summarize the fastCRW homepage.",
        deps=context,
    ) as result:
        # Stream incremental text deltas as the model produces them
        async for text in result.stream_text(delta=True):
            print(text, end="", flush=True)


if __name__ == "__main__":
    asyncio.run(main())

Advanced Example: Multi-Step Research Agent

import os
import requests
from pydantic import BaseModel, Field
from pydantic_ai import Agent, RunContext


class ResearchStep(BaseModel):
    """A single step in research."""
    step: int
    action: str  # "search", "scrape", "synthesize"
    result: str


class ResearchReport(BaseModel):
    """Final research output."""
    topic: str
    findings: str
    sources: list[str]


# Create specialized agents for each step
search_agent = Agent(
    model="openai:gpt-4o-mini",
    system_prompt="You are a search specialist. Use fastCRW search to find relevant pages.",
)

scrape_agent = Agent(
    model="openai:gpt-4o-mini",
    system_prompt="You are a content analyst. Extract key information from scraped pages.",
)

synthesis_agent = Agent(
    model="openai:gpt-4o-mini",
    system_prompt="You are a researcher. Synthesize findings from multiple sources.",
)


@search_agent.tool_plain
def search_web(query: str) -> str:
    """Search for relevant pages."""
    api_key = os.environ["FASTCRW_API_KEY"]
    response = requests.post(
        "https://fastcrw.com/api/v1/search",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"query": query, "limit": 5},
        timeout=30,
    )
    response.raise_for_status()
    results = response.json()["data"]["results"]
    return "\n".join([f"- {r['title']}: {r['url']}" for r in results])


@scrape_agent.tool_plain
def scrape_url(url: str) -> str:
    """Scrape a page."""
    api_key = os.environ["FASTCRW_API_KEY"]
    response = requests.post(
        "https://fastcrw.com/api/v1/scrape",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"url": url, "formats": ["markdown"]},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["data"]["markdown"]


def run_research(topic: str) -> ResearchReport:
    """Run a multi-step research workflow."""
    # Step 1: Search
    search_results = search_agent.run_sync(
        f"Find the top 5 sources on: {topic}"
    )

    # Step 2: Scrape top results
    urls_to_scrape = search_results.output.split("\n")[:3]
    scraped_content = []
    for url_line in urls_to_scrape:
        # Extract URL from "- Title: URL"
        if ": " in url_line:
            url = url_line.split(": ")[-1].strip()
            content = scrape_agent.run_sync(f"Analyze: {url}")
            scraped_content.append(content.output)

    # Step 3: Synthesize
    synthesis_prompt = f"""
    Topic: {topic}
    Findings from {len(scraped_content)} sources:
    {chr(10).join(scraped_content)}
    Synthesize a coherent research report.
    """
    report = synthesis_agent.run_sync(synthesis_prompt)

    return ResearchReport(
        topic=topic,
        findings=report.output,
        sources=[line.strip() for line in urls_to_scrape],
    )


if __name__ == "__main__":
    report = run_research("Rust programming language")
    print(f"Topic: {report.topic}")
    print(f"Findings: {report.findings}")
    print(f"Sources: {report.sources}")

When to Use This

  • Typed Python agents — fastCRW tools with Pydantic validation for production-grade reliability.
  • Multi-step research workflows — chain scrapes and synthesis with type safety.
  • Content analysis pipelines — scrape, extract, and validate structured data from web pages.
  • LLM-powered dashboards — agents that fetch live data (pricing, news, status pages) for display.
  • Knowledge base builders — agents that crawl documentation and ingest into a typed backend.
  • Competitive monitoring — scrape competitor pages regularly and validate results.

Limits + Gotchas

  • Sequential tool calls — tool executions typically serialize with the model loop. For parallel scrapes, batch URLs into a single async tool call (see the sketch after this list).
  • Timeout handling — set request timeouts in your fastCRW calls to avoid hanging the agent.
  • Error propagation — if a scrape fails, catch the error and return a meaningful message so the agent can retry or skip.
  • Model constraints — some models don't support all Pydantic features. Test with your target model (GPT-4o, Claude, etc.).
  • Large responses — big Markdown pages can exceed token budgets. Summarize or truncate before returning.
  • Rate limiting — implement throttling if your agent scrapes many URLs in a loop.
  • API key security — use environment variables or a secret manager, never hardcode keys.
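
Several of these gotchas can be handled in one place. Below is a minimal sketch of an async batch-scrape tool with a concurrency cap, per-URL error handling, and truncation. It assumes httpx for async HTTP (pip install httpx); the 8,000-character cap and concurrency limit of 5 are illustrative defaults, not fastCRW recommendations.

import asyncio
import os

import httpx
from pydantic_ai import Agent

agent = Agent(
    model="openai:gpt-4o-mini",
    system_prompt="Research tool. Scrape URLs and synthesize findings.",
)

MAX_CHARS = 8_000       # illustrative cap to stay inside token budgets
MAX_CONCURRENCY = 5     # illustrative throttle; tune to your rate limits


@agent.tool_plain
async def scrape_many(urls: list[str]) -> dict[str, str]:
    """Scrape several URLs concurrently; failed URLs map to an error note."""
    api_key = os.environ["FASTCRW_API_KEY"]
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def scrape_one(client: httpx.AsyncClient, url: str) -> tuple[str, str]:
        async with semaphore:
            try:
                resp = await client.post(
                    "https://fastcrw.com/api/v1/scrape",
                    headers={"Authorization": f"Bearer {api_key}"},
                    json={"url": url, "formats": ["markdown"]},
                    timeout=60,
                )
                resp.raise_for_status()
                markdown = resp.json()["data"].get("markdown", "")
                # Truncate rather than blow the model's context window
                return url, markdown[:MAX_CHARS]
            except httpx.HTTPError as exc:
                # Return a readable error so the agent can retry or skip
                return url, f"ERROR: scrape failed ({exc})"

    async with httpx.AsyncClient() as client:
        pairs = await asyncio.gather(*(scrape_one(client, u) for u in urls))
    return dict(pairs)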

Performance Notes

  • Tool invocation: ~50–100 ms overhead per tool call (Pydantic validation, serialization).
  • Scrape latency: 833 ms median for HTTP scraping on the Firecrawl benchmark.
  • End-to-end agent loop: Depends on model inference + tool calls. Budget 2–5 seconds for a single research step.
  • Streaming: Tool calls pause the text stream while they execute. For responsive streaming, keep per-call scrape latency low or run long scrapes before entering the streaming loop.
