
Pydantic AI Web Scraping Integration — fastCRW [Firecrawl-Compatible]

Register fastCRW scrape, crawl, and search as Pydantic AI agent tools via the @agent.tool decorator. Typed Python agents fetch live web pages and reason over them in a single loop. 6.6 MB RAM binary, 833 ms median latency on a 1,000-URL benchmark.

Published
May 12, 2026
Updated
May 12, 2026
Category
integrations
Verdict

Decorate fastCRW calls with @agent.tool in Pydantic AI. Your agent can scrape, search, and crawl live pages with full type safety. Responses are Markdown-formatted and ready for agent reasoning.

  • Register fastCRW as Pydantic AI tools using the @agent.tool decorator
  • Full type safety on tool inputs/outputs via Pydantic models
  • Works with Pydantic AI's message loop, streaming, and dependency injection
  • 6.6 MB RAM fastCRW binary, 833 ms average latency, zero infrastructure

Why Pydantic AI + fastCRW

Pydantic AI is a Python agent framework built on type safety. Where many agent frameworks pass untyped dictionaries between tools, Pydantic AI validates tool inputs and outputs with Pydantic models, giving you type guarantees and tool schemas generated automatically from signatures and docstrings. fastCRW integrates as a decorator-based tool that scrapes web pages and returns validated responses.

The pattern: define a tool that calls fastCRW, attach it to your agent, and let the agent decide when to scrape. The agent sees the tool schema, invokes fastCRW for live page content, and reasons about the results in the same message loop. All inputs and outputs are type-checked via Pydantic validation.

Setup

  1. Install Pydantic AI and dependencies.
  2. Sign up at fastcrw.com for an API key.
  3. Export FASTCRW_API_KEY in your shell.
  4. Define tool functions with @agent.tool that call fastCRW.
  5. Attach tools to your agent and run.
pip install pydantic-ai requests
export FASTCRW_API_KEY="fcrw_..."
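
To confirm the key works before wiring up the agent, you can hit the scrape endpoint directly. A minimal sanity check: the endpoint and payload mirror the tool code below, and example.com is just a placeholder target.

import os
import requests

# Quick sanity check that FASTCRW_API_KEY is valid
resp = requests.post(
    "https://fastcrw.com/api/v1/scrape",
    headers={"Authorization": f"Bearer {os.environ['FASTCRW_API_KEY']}"},
    json={"url": "https://example.com", "formats": ["markdown"]},
    timeout=30,
)
resp.raise_for_status()  # raises if the key is rejected
print(resp.json()["data"].get("markdown", "")[:200])  # first 200 chars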

Code Example: Basic fastCRW Tool

Create a file scraper_agent.py:

import os
import requests
from pydantic import BaseModel, Field
from pydantic_ai import Agent, RunContext


# Define Pydantic models for type safety
class ScrapedPage(BaseModel):
    """A web page scraped via fastCRW."""
    url: str = Field(..., description="The URL that was scraped")
    markdown: str = Field(..., description="The page content as Markdown")
    status_code: int = Field(default=200, description="HTTP status code")
    load_time_ms: int | None = Field(None, description="Load time in milliseconds")


class SearchResult(BaseModel):
    """A single search result from fastCRW."""
    title: str = Field(..., description="Result title")
    url: str = Field(..., description="Result URL")
    snippet: str = Field(..., description="Brief description")


class SearchResults(BaseModel):
    """Multiple search results."""
    results: list[SearchResult] = Field(..., description="List of search results")
    count: int = Field(..., description="Total results returned")


# Initialize agent with a system prompt
agent = Agent(
    model="openai:gpt-4o-mini",
    system_prompt="You are a research agent. Use fastCRW tools to scrape and search the web, then synthesize findings into coherent answers.",
)


# Define fastCRW scrape tool
@agent.tool_plain  # no RunContext parameter, so register as a plain tool
def scrape_url(url: str) -> ScrapedPage:
    """
    Scrape a single URL via fastCRW and return Markdown content.

    Args:
        url: The URL to scrape.

    Returns:
        ScrapedPage with markdown content and metadata.
    """
    api_key = os.environ["FASTCRW_API_KEY"]
    response = requests.post(
        "https://fastcrw.com/api/v1/scrape",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json={"url": url, "formats": ["markdown"]},
        timeout=60,
    )
    response.raise_for_status()
    data = response.json()

    return ScrapedPage(
        url=url,
        markdown=data["data"].get("markdown", ""),
        status_code=data["data"].get("status_code", 200),
        load_time_ms=data["data"].get("load_time_ms"),
    )


# Define fastCRW search tool
@agent.tool_plain  # no RunContext parameter, so register as a plain tool
def search_web(query: str, limit: int = 5) -> SearchResults:
    """
    Search the web via fastCRW.

    Args:
        query: Search query.
        limit: Max results to return (1-10).

    Returns:
        SearchResults with matching pages.
    """
    api_key = os.environ["FASTCRW_API_KEY"]
    response = requests.post(
        "https://fastcrw.com/api/v1/search",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json={"query": query, "limit": limit},
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()

    results = [
        SearchResult(
            title=r.get("title", ""),
            url=r.get("url", ""),
            snippet=r.get("snippet", ""),
        )
        for r in data["data"].get("results", [])[:limit]
    ]

    return SearchResults(results=results, count=len(results))


# Run agent
if __name__ == "__main__":
    # Example usage
    result = agent.run_sync("What are the latest Python 3.14 features?")
    print(result.output)
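
After a run you can also inspect what the agent actually did. A minimal sketch, assuming a recent pydantic-ai release where run results expose all_messages() and usage():

# Continuing from the run above: inspect the full message history,
# including tool calls and tool returns
for message in result.all_messages():
    print(type(message).__name__)  # e.g. ModelRequest, ModelResponse

# Token accounting for the run
print(result.usage())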

Streaming Example with Dependency Injection

For streaming responses and context-aware tools:

import asyncio
import os

import requests
from pydantic import BaseModel
from pydantic_ai import Agent, RunContext


class UserContext(BaseModel):
    """Dependency context for tools."""
    user_id: str
    session_id: str
    cache: dict = {}


class ScrapeResult(BaseModel):
    url: str
    content: str


# Agent with dependencies
agent = Agent(
    model="openai:gpt-4o-mini",
    deps_type=UserContext,
    system_prompt="Research tool. Scrape URLs and synthesize findings.",
)


@agent.tool
def scrape_with_cache(ctx: RunContext[UserContext], url: str) -> ScrapeResult:
    """
    Scrape a URL with caching via context.

    Args:
        ctx: Dependency context with cache.
        url: URL to scrape.

    Returns:
        ScrapeResult with cached or fresh content.
    """
    # Check cache
    if url in ctx.deps.cache:
        print(f"Cache hit for {url}")
        return ScrapeResult(url=url, content=ctx.deps.cache[url])

    # Call fastCRW
    api_key = os.environ["FASTCRW_API_KEY"]
    response = requests.post(
        "https://fastcrw.com/api/v1/scrape",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"url": url, "formats": ["markdown"]},
        timeout=60,
    )
    response.raise_for_status()
    markdown = response.json()["data"]["markdown"]

    # Store in cache
    ctx.deps.cache[url] = markdown

    return ScrapeResult(url=url, content=markdown)


# Run with streaming: run_stream is an async context manager
async def main() -> None:
    context = UserContext(user_id="user_123", session_id="session_abc")

    async with agent.run_stream(
        "Summarize the fastCRW homepage.",
        deps=context,
    ) as result:
        # Stream incremental text deltas as the model produces them
        async for text in result.stream_text(delta=True):
            print(text, end="", flush=True)


if __name__ == "__main__":
    asyncio.run(main())

Advanced Example: Multi-Step Research Agent

import os
import requests
from pydantic import BaseModel, Field
from pydantic_ai import Agent, RunContext


class ResearchStep(BaseModel):
    """A single step in research."""
    step: int
    action: str  # "search", "scrape", "synthesize"
    result: str


class ResearchReport(BaseModel):
    """Final research output."""
    topic: str
    findings: str
    sources: list[str]


# Create specialized agents for each step
search_agent = Agent(
    model="openai:gpt-4o-mini",
    system_prompt="You are a search specialist. Use fastCRW search to find relevant pages.",
)

scrape_agent = Agent(
    model="openai:gpt-4o-mini",
    system_prompt="You are a content analyst. Extract key information from scraped pages.",
)

synthesis_agent = Agent(
    model="openai:gpt-4o-mini",
    system_prompt="You are a researcher. Synthesize findings from multiple sources.",
)


@search_agent.tool_plain
def search_web(query: str) -> str:
    """Search for relevant pages."""
    api_key = os.environ["FASTCRW_API_KEY"]
    response = requests.post(
        "https://fastcrw.com/api/v1/search",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"query": query, "limit": 5},
        timeout=30,
    )
    response.raise_for_status()
    results = response.json()["data"]["results"]
    return "\n".join([f"- {r['title']}: {r['url']}" for r in results])


@scrape_agent.tool_plain
def scrape_url(url: str) -> str:
    """Scrape a page."""
    api_key = os.environ["FASTCRW_API_KEY"]
    response = requests.post(
        "https://fastcrw.com/api/v1/scrape",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"url": url, "formats": ["markdown"]},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["data"]["markdown"]


def run_research(topic: str) -> ResearchReport:
    """Run a multi-step research workflow."""
    # Step 1: Search
    search_results = search_agent.run_sync(
        f"Find the top 5 sources on: {topic}"
    )

    # Step 2: Scrape top results
    urls_to_scrape = search_results.output.split("\n")[:3]
    scraped_content = []
    for url_line in urls_to_scrape:
        # Extract URL from "- Title: URL"
        if ": " in url_line:
            url = url_line.split(": ")[-1].strip()
            content = scrape_agent.run_sync(f"Analyze: {url}")
            scraped_content.append(content.output)

    # Step 3: Synthesize
    synthesis_prompt = f"""
    Topic: {topic}
    Findings from {len(scraped_content)} sources:
    {chr(10).join(scraped_content)}
    Synthesize a coherent research report.
    """
    report = synthesis_agent.run_sync(synthesis_prompt)

    return ResearchReport(
        topic=topic,
        findings=report.output,
        sources=[line.strip() for line in urls_to_scrape],
    )


if __name__ == "__main__":
    report = run_research("Rust programming language")
    print(f"Topic: {report.topic}")
    print(f"Findings: {report.findings}")
    print(f"Sources: {report.sources}")

When to Use This

  • Typed Python agents — fastCRW tools with Pydantic validation for production-grade reliability.
  • Multi-step research workflows — chain scrapes and synthesis with type safety.
  • Content analysis pipelines — scrape, extract, and validate structured data from web pages.
  • LLM-powered dashboards — agents that fetch live data (pricing, news, status pages) for display.
  • Knowledge base builders — agents that crawl documentation and ingest into a typed backend.
  • Competitive monitoring — scrape competitor pages regularly and validate results.

Limits + Gotchas

  • Sequential tool calls — tool executions typically serialize with the model loop. For parallel scrapes, batch URLs into a single async tool call (see the sketch after this list).
  • Timeout handling — set request timeouts in your fastCRW calls to avoid hanging the agent.
  • Error propagation — if a scrape fails, catch the error and return a meaningful message so the agent can retry or skip.
  • Model constraints — some models don't support all Pydantic features. Test with your target model (GPT-4o, Claude, etc.).
  • Large responses — big Markdown pages can exceed token budgets. Summarize or truncate before returning.
  • Rate limiting — implement throttling if your agent scrapes many URLs in a loop.
  • API key security — use environment variables or a secret manager, never hardcode keys.
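
Several of these gotchas can be handled in one place. Below is a minimal sketch of an async batch-scrape tool with a concurrency cap, per-URL error handling, and truncation. It assumes httpx for async HTTP (pip install httpx); the 8,000-character cap and concurrency limit of 5 are illustrative defaults, not fastCRW recommendations.

import asyncio
import os

import httpx
from pydantic_ai import Agent

agent = Agent(
    model="openai:gpt-4o-mini",
    system_prompt="Research tool. Scrape URLs and synthesize findings.",
)

MAX_CHARS = 8_000       # illustrative cap to stay inside token budgets
MAX_CONCURRENCY = 5     # illustrative throttle; tune to your rate limits


@agent.tool_plain
async def scrape_many(urls: list[str]) -> dict[str, str]:
    """Scrape several URLs concurrently; failed URLs map to an error note."""
    api_key = os.environ["FASTCRW_API_KEY"]
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def scrape_one(client: httpx.AsyncClient, url: str) -> tuple[str, str]:
        async with semaphore:
            try:
                resp = await client.post(
                    "https://fastcrw.com/api/v1/scrape",
                    headers={"Authorization": f"Bearer {api_key}"},
                    json={"url": url, "formats": ["markdown"]},
                    timeout=60,
                )
                resp.raise_for_status()
                markdown = resp.json()["data"].get("markdown", "")
                # Truncate rather than blow the model's context window
                return url, markdown[:MAX_CHARS]
            except httpx.HTTPError as exc:
                # Return a readable error so the agent can retry or skip
                return url, f"ERROR: scrape failed ({exc})"

    async with httpx.AsyncClient() as client:
        pairs = await asyncio.gather(*(scrape_one(client, u) for u in urls))
    return dict(pairs)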

Performance Notes

  • Tool invocation: ~50–100 ms overhead per tool call (Pydantic validation, serialization).
  • Scrape latency: 833 ms median for HTTP scraping on the Firecrawl benchmark.
  • End-to-end agent loop: Depends on model inference + tool calls. Budget 2–5 seconds for a single research step.
  • Streaming: Tool calls pause the text stream while they execute. For responsive streaming, keep per-call scrape latency low or run long scrapes before entering the streaming loop.
