Tutorial

How to Build a Deep Research Agent with CRW

Build a deep research agent that searches, scrapes, and synthesizes findings into structured reports using CRW's scraping API.

March 27, 2026 · 20 min read

What We're Building

A deep research agent that autonomously researches any topic by: (1) searching the web for relevant sources, (2) discovering pages on found sites using CRW's /v1/map, (3) scraping and extracting content with /v1/scrape, (4) extracting structured data with /v1/extract, and (5) synthesizing everything into a comprehensive research report with citations.

Unlike simple RAG pipelines that work with a fixed corpus, this agent actively explores the web — following leads, drilling into promising sources, and iterating until it has enough information to produce a thorough answer.

Prerequisites

  • CRW running locally (docker run -p 3000:3000 ghcr.io/us/crw:latest) or a fastCRW API key
  • Python 3.11+
  • An OpenAI API key
  • pip install openai firecrawl-py

Architecture Overview

The research agent follows a loop: Plan → Search → Scrape → Analyze → Decide → Repeat or Report. Each iteration deepens the agent's understanding until it decides it has enough information.

# The research loop:
#
#  ┌─────────┐
#  │  Plan   │ ← Break research question into sub-questions
#  └────┬────┘
#       ▼
#  ┌─────────┐
#  │ Search  │ ← Find relevant URLs (map endpoint)
#  └────┬────┘
#       ▼
#  ┌─────────┐
#  │ Scrape  │ ← Get clean content (scrape endpoint)
#  └────┬────┘
#       ▼
#  ┌─────────┐
#  │ Analyze │ ← Extract key findings, identify gaps
#  └────┬────┘
#       ▼
#  ┌─────────┐
#  │ Decide  │ ← Enough info? → Report. Gaps? → Loop back.
#  └─────────┘

Step 1: Set Up the CRW Client

from firecrawl import FirecrawlApp
import openai
import json

# CRW client — self-hosted or fastCRW
crw = FirecrawlApp(
    api_key="fc-YOUR-KEY",
    api_url="http://localhost:3000"  # or "https://fastcrw.com/api"
)

client = openai.OpenAI()

Step 2: Build the Research Planner

The planner breaks a high-level research question into specific sub-questions:

def plan_research(question: str, existing_findings: str = "") -> list[str]:
    """Break a research question into sub-questions."""
    prompt = f"""You are a research planner. Break this research question into
    3-5 specific sub-questions that can be answered by scraping web pages.

    Research question: {question}

    {"Existing findings (avoid duplicating these):" + existing_findings if existing_findings else ""}

    Return a JSON object with a "questions" array. Example:
    {{"questions": ["What is X's pricing model?", "How does X compare to Y?", "What are the technical requirements?"]}}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    return result.get("questions", result.get("sub_questions", []))

Step 3: URL Discovery with Map

Use CRW's /v1/map endpoint to find relevant pages without downloading their full content:

def discover_sources(seed_urls: list[str]) -> list[str]:
    """Discover relevant URLs from seed sites using CRW's map endpoint."""
    all_urls = []
    for url in seed_urls:
        try:
            result = crw.map_url(url)
            links = result.get("links", [])
            all_urls.extend(links)
        except Exception as e:
            print(f"Map failed for {url}: {e}")
    # Deduplicate while preserving order
    seen = set()
    unique = []
    for url in all_urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique

Step 4: Intelligent URL Selection

Not all discovered URLs are worth scraping. Use the LLM to select the most relevant ones:

def select_urls(urls: list[str], question: str, max_urls: int = 10) -> list[str]:
    """Use LLM to select the most relevant URLs for the research question."""
    prompt = f"""Given this research question: "{question}"

    Select the {max_urls} most relevant URLs from this list:
    {json.dumps(urls[:100])}

    Return a JSON object with a "urls" array containing only the selected URLs.
    Prioritize pages that likely contain substantive information (docs, blog posts,
    about pages) over generic pages (login, terms of service, etc.)."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    return result.get("urls", [])[:max_urls]

Step 5: Scrape and Extract Content

def scrape_sources(urls: list[str]) -> list[dict]:
    """Scrape multiple URLs and return structured content."""
    results = []
    for url in urls:
        try:
            data = crw.scrape_url(url, params={"formats": ["markdown"]})
            results.append({
                "url": url,
                "title": data.get("metadata", {}).get("title", ""),
                "content": data.get("markdown", ""),
            })
        except Exception as e:
            print(f"Scrape failed for {url}: {e}")
    return results

def extract_structured(url: str, schema: dict) -> dict:
    """Extract structured data from a page using CRW's extract endpoint."""
    try:
        data = crw.scrape_url(url, params={
            "formats": ["extract"],
            "extract": {"schema": schema}
        })
        return data.get("extract", {})
    except Exception as e:
        print(f"Extract failed for {url}: {e}")
        return {}

Step 6: Analyze and Synthesize

def analyze_findings(question: str, sources: list[dict]) -> dict:
    """Analyze scraped content and identify key findings and gaps."""
    source_text = ""
    for s in sources:
        source_text += f"\n\n--- Source: {s['url']} ---\n{s['content'][:2000]}"

    prompt = f"""Analyze these sources to answer: "{question}"

    Sources:
    {source_text}

    Return a JSON object with:
    - "findings": array of key findings (each with "fact" and "source_url")
    - "gaps": array of information gaps that need more research
    - "confidence": 0-100 score of how well the question is answered
    - "summary": 2-3 sentence summary of findings so far"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

Step 7: The Research Loop

Put it all together in a loop that iterates until the agent has enough information:

def deep_research(question: str, seed_urls: list[str], max_iterations: int = 3) -> dict:
    """Run the full deep research pipeline."""
    all_findings = []
    all_sources = []
    iteration = 0

    while iteration < max_iterations:
        iteration += 1
        print(f"\n{'='*60}")
        print(f"Research iteration {iteration}/{max_iterations}")
        print(f"{'='*60}")

        # Plan: what sub-questions do we need to answer?
        existing = json.dumps([f["fact"] for f in all_findings])
        sub_questions = plan_research(question, existing)
        print(f"Sub-questions: {sub_questions}")

        # Discover URLs
        discovered = discover_sources(seed_urls)
        print(f"Discovered {len(discovered)} URLs")

        # Select the most relevant URLs
        selected = select_urls(discovered, question)
        print(f"Selected {len(selected)} URLs to scrape")

        # Scrape
        new_sources = scrape_sources(selected)
        all_sources.extend(new_sources)
        print(f"Scraped {len(new_sources)} pages")

        # Analyze
        analysis = analyze_findings(question, all_sources)
        all_findings.extend(analysis.get("findings", []))
        confidence = analysis.get("confidence", 0)
        gaps = analysis.get("gaps", [])

        print(f"Confidence: {confidence}/100")
        print(f"Gaps remaining: {gaps}")

        # Decide: enough information?
        if confidence >= 80 or not gaps:
            print("Sufficient information gathered. Generating report.")
            break

        # Gaps remain; a production agent would turn them into new search
        # queries or seed URLs here. This sketch reuses the same seeds and
        # relies on the planner's existing-findings context to avoid
        # duplicating sub-questions on the next pass.

    # Generate final report
    report = generate_report(question, all_findings, all_sources)
    return {
        "report": report,
        "sources": [{"url": s["url"], "title": s["title"]} for s in all_sources],
        "iterations": iteration,
        "total_findings": len(all_findings),
    }

Step 8: Generate the Final Report

def generate_report(question: str, findings: list[dict], sources: list[dict]) -> str:
    """Generate a comprehensive research report with citations."""
    findings_text = json.dumps(findings, indent=2)
    source_list = "\n".join([f"- [{s['title']}]({s['url']})" for s in sources])

    prompt = f"""Write a comprehensive research report answering: "{question}"

    Key findings:
    {findings_text}

    Requirements:
    - Start with an executive summary
    - Organize findings into logical sections with headers
    - Cite sources inline using [Source Title](URL) format
    - Include a "Sources" section at the end
    - Be factual — only include information from the findings
    - 500-1000 words

    Available sources:
    {source_list}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

Running the Agent

result = deep_research(
    question="What are the top open-source web scraping frameworks in 2026 and how do they compare?",
    seed_urls=[
        "https://github.com/topics/web-scraping",
        "https://docs.example.com/scraping-tools",
    ],
    max_iterations=3,
)

print(result["report"])
print(f"\nResearch completed in {result['iterations']} iterations")
print(f"Used {len(result['sources'])} sources")
print(f"Extracted {result['total_findings']} findings")

Adding Structured Extraction

For specific data points, use CRW's extract endpoint with a JSON schema:

# Extract pricing information from a competitor page
pricing_schema = {
    "type": "object",
    "properties": {
        "plans": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "string"},
                    "features": {"type": "array", "items": {"type": "string"}},
                },
            },
        },
        "has_free_tier": {"type": "boolean"},
        "enterprise_available": {"type": "boolean"},
    },
}

pricing = extract_structured("https://competitor.com/pricing", pricing_schema)
print(json.dumps(pricing, indent=2))

Using fastCRW Instead of Self-Hosted

For production research agents that scrape many different sites, fastCRW handles proxy rotation and scaling:

crw = FirecrawlApp(
    api_key="fc-YOUR-FASTCRW-KEY",
    api_url="https://fastcrw.com/api"
)

The rest of the code stays the same. fastCRW is particularly valuable for deep research agents because they scrape diverse sites — the managed infrastructure handles scaling and reliability across different domains.

Why CRW for Deep Research?

Speed enables deeper research. Each research iteration involves multiple scrape calls. At 833ms per page, a 10-page iteration takes roughly 8 seconds; at 4.6 seconds per page (typical for other APIs), the same iteration takes 46 seconds. Over 3 iterations, that's about 25 seconds versus more than 2 minutes — the difference between an interactive tool and a batch job.
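As a quick sanity check on those numbers (the per-page timings above are illustrative, and this assumes pages are scraped sequentially):

```python
# Back-of-the-envelope latency budget for the research loop.
PAGES_PER_ITERATION = 10
ITERATIONS = 3

def run_seconds(per_page_s: float) -> float:
    """Total scrape time for a full research run, one page at a time."""
    return per_page_s * PAGES_PER_ITERATION * ITERATIONS

print(f"fast scraper: {run_seconds(0.833):.0f}s")  # ~25 s total
print(f"slow scraper: {run_seconds(4.6):.0f}s")    # ~138 s total
```

Scraping the selected URLs concurrently (e.g. with a thread pool) would shrink both figures, but the per-page gap still compounds across iterations.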

Map endpoint enables intelligent exploration. CRW's /v1/map returns all URLs on a site without downloading content. This lets the agent discover the site structure first, then selectively scrape only the relevant pages — saving time and tokens.

Extract endpoint provides structured data. Instead of scraping raw content and parsing it with an LLM, CRW's /v1/extract returns structured JSON matching your schema. This is faster, cheaper, and more reliable for specific data extraction tasks.

Next Steps

Get Started

Run CRW locally in one command:

docker run -p 3000:3000 ghcr.io/us/crw:latest

Or sign up for fastCRW to start building your deep research agent without managing infrastructure.


Try CRW Free

Self-host for free (AGPL) or use fastCRW cloud with 500 free credits — no credit card required.