Web Scraping for Deep Research Agents

Build Perplexity-style deep research pipelines with fastCRW — search to discover sources, scrape to extract full content, synthesize with an LLM. Firecrawl-compatible API, single Rust binary, AGPL-3.0.

Published

April 4, 2026

Updated

June 13, 2026

What Deep Research Actually Requires

Search engines solve discovery. Deep research requires the full content behind discovered links — extracted cleanly, attributed to sources, and ready for an LLM to reason over.

The gap between a search result and usable research material is significant:

A search snippet is 150–300 characters. A useful research source is 2,000–20,000 words.
Search ranking surfaces popular pages, not necessarily authoritative or complete ones.
Contradictions across sources require reading both in full to resolve.
A single research question typically branches into 5–15 sub-questions as you read.

Manually closing this gap — clicking through results, reading each page, taking notes, following links — is what makes research time-consuming. A deep research agent automates this loop: search → scrape full content → analyze gaps → search again → synthesize.

fastCRW provides the search and scrape primitives for this loop:

/v1/search — discovers relevant sources by query; returns ranked URLs with titles and snippets
/v1/scrape — extracts full page content as clean markdown; strips navigation, ads, and boilerplate
/v1/map — discovers all pages within an authoritative domain; useful when a single source proves rich
/v1/search with answer: true — adds managed LLM synthesis (paid plans) directly on top of search results

The Core Research Loop

Every deep research pipeline — from Perplexity's product to a custom LangGraph workflow — implements some version of this loop:

question
  → search (discover sources)
  → scrape (extract full content)
  → analyze (identify gaps, contradictions, sub-questions)
  → search again (refined queries per gap)
  → scrape more
  → [repeat until convergence or max iterations]
  → synthesize (cite sources)

The loop terminates when either the coverage check finds no remaining gaps, or a max-iterations cap is hit (3–5 passes is practical for most questions). fastCRW covers the search and scrape steps. You provide the reasoning LLM, the gap analysis prompt, and the synthesis step.

Quick Start: curl

# Step 1: Search — discover initial sources
curl -X POST https://api.fastcrw.com/v1/search \
  -H "Authorization: Bearer $FASTCRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "mechanistic interpretability transformer circuits 2025",
    "limit": 8
  }'

# Step 2: Scrape — extract full content from a discovered URL
curl -X POST https://api.fastcrw.com/v1/scrape \
  -H "Authorization: Bearer $FASTCRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://transformer-circuits.pub/2022/mech-interp-essay/index.html",
    "formats": ["markdown"]
  }'

# Step 3 (optional): Search with managed LLM synthesis
curl -X POST https://api.fastcrw.com/v1/search \
  -H "Authorization: Bearer $FASTCRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "what is the current state of mechanistic interpretability research",
    "limit": 5,
    "answer": true
  }'

# Step 4: Map an authoritative domain to find related documents
curl -X POST https://api.fastcrw.com/v1/map \
  -H "Authorization: Bearer $FASTCRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://transformer-circuits.pub",
    "includePatterns": ["*/2024/*", "*/2025/*"]
  }'

Full Python Implementation: Deep Research Agent

import asyncio
import os
import time
from dataclasses import dataclass, field
from typing import Any

import httpx

FASTCRW_API_KEY = os.environ["FASTCRW_API_KEY"]
BASE_URL = "https://api.fastcrw.com/v1"


@dataclass
class ResearchSource:
    url: str
    title: str
    content: str
    scraped_at: float = field(default_factory=time.time)


async def search_async(
    client: httpx.AsyncClient,
    query: str,
    limit: int = 8,
) -> list[dict[str, Any]]:
    """Search the web and return ranked source metadata."""
    response = await client.post(
        f"{BASE_URL}/search",
        json={"query": query, "limit": limit},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("data", [])


async def scrape_async(
    client: httpx.AsyncClient,
    url: str,
) -> str | None:
    """Fetch full markdown content for a URL. Returns None on failure."""
    try:
        response = await client.post(
            f"{BASE_URL}/scrape",
            json={"url": url, "formats": ["markdown"]},
            timeout=45,
        )
        response.raise_for_status()
        data = response.json()
        if warning := data.get("warning"):
            print(f"  [warn] {url[:60]}: {warning}")
        return data.get("data", {}).get("markdown") or None
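    except httpx.HTTPStatusError as e:
        print(f"  [error] scrape {url[:60]}: HTTP {e.response.status_code}")
        return None
    except httpx.RequestError as e:
        # Timeouts and connection errors: skip this source rather than abort the whole batch
        print(f"  [error] scrape {url[:60]}: {type(e).__name__}")
        return None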


async def map_domain_async(
    client: httpx.AsyncClient,
    domain: str,
    include_patterns: list[str] | None = None,
) -> list[str]:
    """Discover all reachable URLs on a domain for deep-dive passes."""
    payload: dict[str, Any] = {"url": domain}
    if include_patterns:
        payload["includePatterns"] = include_patterns

    response = await client.post(
        f"{BASE_URL}/map",
        json=payload,
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("urls", [])


async def scrape_batch(
    client: httpx.AsyncClient,
    results: list[dict[str, Any]],
    max_concurrent: int = 5,
) -> list[ResearchSource]:
    """Scrape the URLs of search results concurrently, respecting a concurrency cap."""
    semaphore = asyncio.Semaphore(max_concurrent)
    sources: list[ResearchSource] = []

    async def scrape_one(url: str, title: str) -> None:
        async with semaphore:
            content = await scrape_async(client, url)
            # Skip empty or near-empty pages
            if content and len(content.strip()) > 100:
                sources.append(ResearchSource(url=url, title=title, content=content))

    # Each result is a search hit dict with a "url" and an optional "title"
    tasks = [
        scrape_one(result["url"], result.get("title", result["url"]))
        for result in results
    ]
    await asyncio.gather(*tasks)
    return sources


def analyze_coverage(question: str, sources: list[ResearchSource]) -> list[str]:
    """
    Placeholder: in production, call an LLM here.
    Prompt: given the question and the scraped content, what sub-questions remain?
    Returns a list of follow-up queries.
    """
    # Example stub — replace with your LLM call
    # prompt = f"Question: {question}\n\nSources:\n{combined_content}\n\nWhat aspects are not yet covered?"
    # sub_questions = llm.generate(prompt)
    return []  # Return follow-up queries from the LLM


async def deep_research(
    question: str,
    max_iterations: int = 4,
    sources_per_iter: int = 5,
) -> dict[str, Any]:
    """
    Multi-step research loop: search → scrape → analyze gaps → repeat.
    
    In production, replace analyze_coverage() with a real LLM call
    and synthesize_report() with a final LLM synthesis step.
    """
    all_sources: list[ResearchSource] = []
    queries: list[str] = [question]
    seen_urls: set[str] = set()

    headers = {"Authorization": f"Bearer {FASTCRW_API_KEY}"}

    async with httpx.AsyncClient(headers=headers) as client:
        for iteration in range(max_iterations):
            print(f"\n[Iteration {iteration + 1}/{max_iterations}]")

            if not queries:
                print("  No remaining queries. Research complete.")
                break

            # Phase 1: Search the next pending query (one query per iteration)
            current_query = queries.pop(0)
            print(f"  Query: {current_query}")
            search_results = await search_async(client, current_query, limit=sources_per_iter)

            # Filter to unseen URLs
            new_results = [r for r in search_results if r["url"] not in seen_urls]
            for r in new_results:
                seen_urls.add(r["url"])

            print(f"  Found {len(new_results)} new sources to scrape")

            # Phase 2: Scrape sources in parallel
            new_sources = await scrape_batch(client, new_results)
            all_sources.extend(new_sources)
            print(f"  Successfully scraped {len(new_sources)}/{len(new_results)} sources")

            # Phase 3: Analyze coverage gaps
            sub_questions = analyze_coverage(question, all_sources)
            if sub_questions:
                print(f"  Gap analysis identified {len(sub_questions)} sub-questions")
                queries.extend(sub_questions)
            else:
                print("  Coverage check passed (or no LLM configured). Stopping.")
                if not queries:
                    break

    return {
        "question": question,
        "max_iterations": max_iterations,  # the cap, not the number of passes actually run
        "sources_found": len(all_sources),
        "sources": [
            {
                "url": s.url,
                "title": s.title,
                "content_length": len(s.content),
                "content_preview": s.content[:500],
            }
            for s in all_sources
        ],
    }


if __name__ == "__main__":
    result = asyncio.run(
        deep_research(
            question="what are the key findings from mechanistic interpretability research in 2024 and 2025",
            max_iterations=3,
            sources_per_iter=5,
        )
    )
    print(f"\n=== RESEARCH COMPLETE ===")
    print(f"Sources collected: {result['sources_found']}")
    for source in result["sources"][:3]:
        print(f"\n  [{source['url'][:70]}]")
        print(f"  {source['content_preview'][:200]}...")
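
The analyze_coverage stub above leaves the prompt construction open. As a minimal sketch of one way to fill it in, the helper below assembles a gap-analysis prompt while keeping the quoted source text inside a fixed character budget. The names build_gap_prompt, per_source_chars, and total_chars are illustrative assumptions, not part of fastCRW, and the prompt wording is only a starting point.

```python
from dataclasses import dataclass


@dataclass
class Source:
    # Mirrors the url/title/content fields of ResearchSource above
    url: str
    title: str
    content: str


def build_gap_prompt(
    question: str,
    sources: list[Source],
    per_source_chars: int = 1000,
    total_chars: int = 6000,
) -> str:
    """Build a gap-analysis prompt; quoted source text stays within total_chars."""
    blocks: list[str] = []
    used = 0
    for i, src in enumerate(sources, start=1):
        # Truncate each source so one long page cannot consume the whole budget
        block = f"[{i}] {src.title} ({src.url})\n{src.content[:per_source_chars]}"
        if used + len(block) > total_chars:
            break  # budget exhausted: later sources are left out
        blocks.append(block)
        used += len(block)
    joined = "\n\n".join(blocks)
    return (
        f"Question: {question}\n\n"
        f"Sources:\n{joined}\n\n"
        "List up to 3 sub-questions the sources do not answer, "
        "as a JSON array of strings. Return [] if nothing is missing."
    )
```

The returned string would be passed to whichever LLM client you use inside analyze_coverage; sources dropped by the budget can be covered in a later pass or summarized first.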

TypeScript / Node.js Implementation

import { setTimeout } from "timers/promises";

const FASTCRW_API_KEY = process.env.FASTCRW_API_KEY!;
const BASE_URL = "https://api.fastcrw.com/v1";

interface SearchResult {
  url: string;
  title: string;
  description?: string;
}

interface ResearchSource {
  url: string;
  title: string;
  content: string;
}

async function searchWeb(query: string, limit = 8): Promise<SearchResult[]> {
  const res = await fetch(`${BASE_URL}/search`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${FASTCRW_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ query, limit }),
  });
  if (!res.ok) throw new Error(`Search failed: ${res.status}`);
  const data = await res.json();
  return data.data ?? [];
}

async function scrapeUrl(url: string): Promise<string | null> {
  const res = await fetch(`${BASE_URL}/scrape`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${FASTCRW_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, formats: ["markdown"] }),
  });
  if (!res.ok) {
    console.warn(`[fastCRW] scrape failed ${res.status}: ${url}`);
    return null;
  }
  const data = await res.json();
  if (data.warning) console.warn(`[fastCRW warning] ${url}: ${data.warning}`);
  return data.data?.markdown ?? null;
}

async function scrapeBatch(
  results: SearchResult[],
  maxConcurrent = 5
): Promise<ResearchSource[]> {
  const sources: ResearchSource[] = [];
  const chunks: SearchResult[][] = [];

  // Chunk into batches to respect concurrency
  for (let i = 0; i < results.length; i += maxConcurrent) {
    chunks.push(results.slice(i, i + maxConcurrent));
  }

  for (const chunk of chunks) {
    await Promise.allSettled(
      chunk.map(async (result) => {
        const content = await scrapeUrl(result.url);
        if (content && content.trim().length > 100) {
          sources.push({ url: result.url, title: result.title, content });
        }
      })
    );
    // Small delay between batches to avoid rate-limit spikes
    await setTimeout(200);
  }
  return sources;
}

export async function deepResearch(
  question: string,
  maxIterations = 4,
  sourcesPerIter = 5
): Promise<{ question: string; sources: ResearchSource[] }> {
  const allSources: ResearchSource[] = [];
  const seenUrls = new Set<string>();
  let queries = [question];

  for (let iter = 0; iter < maxIterations; iter++) {
    console.log(`\n[Iteration ${iter + 1}/${maxIterations}]`);
    if (queries.length === 0) break;

    const currentQuery = queries.shift()!;
    console.log(`  Query: ${currentQuery}`);

    const searchResults = await searchWeb(currentQuery, sourcesPerIter);
    const newResults = searchResults.filter((r) => !seenUrls.has(r.url));
    newResults.forEach((r) => seenUrls.add(r.url));

    console.log(`  Scraping ${newResults.length} new sources...`);
    const newSources = await scrapeBatch(newResults);
    allSources.push(...newSources);

    console.log(`  Collected ${newSources.length}/${newResults.length} sources`);

    // In production: call your LLM here to identify gaps and generate sub-queries
    // const subQueries = await analyzeGaps(question, allSources);
    // queries.push(...subQueries);
    if (queries.length === 0) break; // No sub-queries generated (LLM not configured)
  }

  return { question, sources: allSources };
}

LangGraph Integration: State-Machine Research Loop

LangGraph is well-suited to deep research because the search→scrape→analyze loop maps naturally to graph nodes with conditional routing:

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage
import requests, os, json
from typing import TypedDict, Annotated
import operator

FASTCRW_API_KEY = os.environ["FASTCRW_API_KEY"]
BASE_URL = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {FASTCRW_API_KEY}"}


class ResearchState(TypedDict):
    question: str
    queries: list[str]
    sources: Annotated[list[dict], operator.add]  # accumulate across iterations
    search_results: list[dict]  # handoff from search_node to scrape_node
    iteration: int
    max_iterations: int
    done: bool
    report: str  # written by synthesize_node


# Each node returns only the keys it changes. Echoing the full state back
# ({**state, ...}) would re-append "sources" to itself through operator.add.

def search_node(state: ResearchState) -> dict:
    """Search for sources using the first pending query."""
    query = state["queries"][0] if state["queries"] else state["question"]
    resp = requests.post(
        f"{BASE_URL}/search",
        headers=HEADERS,
        json={"query": query, "limit": 5},
        timeout=30,
    )
    resp.raise_for_status()
    results = resp.json().get("data", [])
    return {"queries": state["queries"][1:], "search_results": results}


def scrape_node(state: ResearchState) -> dict:
    """Scrape all search results from the previous node."""
    search_results = state.get("search_results", [])
    new_sources = []

    for result in search_results:
        resp = requests.post(
            f"{BASE_URL}/scrape",
            headers=HEADERS,
            json={"url": result["url"], "formats": ["markdown"]},
            timeout=45,
        )
        if resp.status_code == 200:
            content = resp.json().get("data", {}).get("markdown", "")
            if content.strip():
                new_sources.append({
                    "url": result["url"],
                    "title": result.get("title", ""),
                    "content": content[:3000],  # cap per-source context
                })

    # The operator.add reducer appends new_sources to the accumulated list
    return {"sources": new_sources, "iteration": state["iteration"] + 1}


def analyze_node(state: ResearchState) -> dict:
    """Use an LLM to identify coverage gaps and generate sub-queries."""
    if state["iteration"] >= state["max_iterations"]:
        return {"done": True}

    llm = ChatOpenAI(model="gpt-4o-mini")
    combined = "\n\n".join(
        f"Source: {s['url']}\n{s['content'][:1000]}" for s in state["sources"][-10:]
    )
    prompt = f"""Question: {state['question']}

Existing sources summary:
{combined[:4000]}

Identify up to 2 specific sub-questions that are NOT yet answered by the sources above.
Return a JSON array of strings, e.g. ["sub-question 1", "sub-question 2"].
If the question is fully answered, return []."""

    response = llm.invoke([HumanMessage(content=prompt)])
    try:
        sub_questions = json.loads(response.content)
    except json.JSONDecodeError:
        # Unparseable reply: stop iterating and synthesize what we have
        return {"done": True}
    if not sub_questions:
        return {"done": True}
    return {"queries": sub_questions, "done": False}


def should_continue(state: ResearchState) -> str:
    if state.get("done") or state["iteration"] >= state["max_iterations"]:
        return "synthesize"
    if state.get("queries"):
        return "search"
    return "synthesize"


def synthesize_node(state: ResearchState) -> dict:
    """Final synthesis: summarize all sources into a cited research report."""
    llm = ChatOpenAI(model="gpt-4o")
    source_text = "\n\n---\n\n".join(
        f"[{i+1}] {s['url']}\n{s['content'][:2000]}"
        for i, s in enumerate(state["sources"])
    )
    prompt = f"""Research question: {state['question']}

Sources:
{source_text[:12000]}

Write a comprehensive research report that:
1. Answers the question directly in the opening paragraph
2. Covers the main findings with inline citations [1], [2], etc.
3. Notes any contradictions between sources
4. Ends with a bibliography linking to source URLs

Use only information from the provided sources."""

    response = llm.invoke([HumanMessage(content=prompt)])
    return {"report": response.content}


# Build the graph
workflow = StateGraph(ResearchState)
workflow.add_node("search", search_node)
workflow.add_node("scrape", scrape_node)
workflow.add_node("analyze", analyze_node)
workflow.add_node("synthesize", synthesize_node)

workflow.set_entry_point("search")
workflow.add_edge("search", "scrape")
workflow.add_edge("scrape", "analyze")
workflow.add_conditional_edges("analyze", should_continue)
workflow.add_edge("synthesize", END)

research_graph = workflow.compile()

# Run a research task
result = research_graph.invoke({
    "question": "What are the main approaches to AI alignment in 2025?",
    "queries": [],
    "sources": [],
    "iteration": 0,
    "max_iterations": 3,
    "done": False,
})
print(result.get("report", "No report generated"))
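
The analyze step above hands the model reply straight to json.loads, which fails when a chat model wraps the array in a markdown fence or adds a sentence around it. A more tolerant parser could look like the sketch below. The helper name parse_sub_questions and its max_items parameter are assumptions for illustration, not part of LangGraph or fastCRW.

```python
import json
import re


def parse_sub_questions(reply: str, max_items: int = 2) -> list[str]:
    """Extract a JSON array of strings from an LLM reply; return [] if none is found."""
    # Take the span from the first "[" to the last "]" so fences or prose around it are ignored
    match = re.search(r"\[.*\]", reply, flags=re.DOTALL)
    if not match:
        return []
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only non-empty strings and cap the count
    cleaned = [item.strip() for item in parsed if isinstance(item, str) and item.strip()]
    return cleaned[:max_items]
```

Returning an empty list on any parse problem matches the behavior of the node above, where an unusable reply ends the loop and moves on to synthesis.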

The /v1/search Answer Synthesis Mode

For lighter research tasks — or as a fast first-pass before a full iterative loop — use fastCRW's built-in LLM synthesis:

# Managed synthesis (paid plans, no API key to manage)
curl -X POST https://api.fastcrw.com/v1/search \
  -H "Authorization: Bearer $FASTCRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "what is the current consensus on scaling laws for large language models",
    "limit": 5,
    "answer": true
  }'

Synthesis runs on fastCRW's managed LLM, with a 3× credit markup and an 8,000-credit per-request cap (CANONICAL-FACTS.md §6, verified 2026-05-30). The managed LLM is available on every paid plan (Hobby and above); the Free plan has no LLM features.

The answer mode is the equivalent of Tavily's include_answer — a fast single-call synthesis. For production research pipelines that need multi-iteration depth and source-attributed output, use the iterative loop above and call your own LLM for synthesis.

Benchmark: What Extraction Accuracy Means for Research Quality

A research pipeline is only as good as the content its scraping layer delivers. Garbled content forces the synthesis LLM to hallucinate to fill gaps. Accurate content grounds the synthesis in real sources.

On Firecrawl's public scrape-content-dataset-v1 (1,000 URLs, 819 carry labeled ground truth), the three-way benchmark produced (diagnose_3way.py, 2026-05-08):

Tool	Truth-recall (of 819 labeled)	p50 latency	p90 latency (fast mode)
fastCRW	63.74% (522 URLs)	1,914 ms	4,348 ms
Crawl4AI	59.95% (491 URLs)	1,916 ms	4,754 ms
Firecrawl	56.04% (459 URLs)	2,305 ms	6,937 ms

The roughly 7.7 pp truth-recall advantage over Firecrawl (63 more of the 819 labeled URLs) matters for research quality: if the benchmark rate carries over to your sources, then for every 100 pages a research agent scrapes, fastCRW delivers accurate content on roughly 8 more. In a 3-iteration loop that touches 45 sources (5 per search, 3 searches per iteration, 3 iterations), that is approximately 3 additional sources that contribute real content rather than hallucination fodder.
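
The arithmetic behind that estimate can be reproduced from the table. The sketch below uses only the counts reported above and assumes, as the paragraph does, that the benchmark recall gap carries over to the pages your agent scrapes; the function name extra_sources is illustrative.

```python
LABELED = 819  # URLs with ground truth in the benchmark dataset
RECOVERED = {"fastCRW": 522, "Crawl4AI": 491, "Firecrawl": 459}

# Truth-recall per tool as a fraction of the labeled URLs
recall = {tool: count / LABELED for tool, count in RECOVERED.items()}

# Recall gap between fastCRW and Firecrawl, in percentage points
delta_pp = (recall["fastCRW"] - recall["Firecrawl"]) * 100


def extra_sources(n_scraped: int) -> float:
    """Expected additional accurately extracted pages versus Firecrawl."""
    return n_scraped * delta_pp / 100


for tool, value in recall.items():
    print(f"{tool}: {value:.2%}")
print(f"gap: {delta_pp:.1f} pp")
print(f"extra sources in a 45-page loop: {extra_sources(45):.1f}")
```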

In fast mode, fastCRW's p90 is 4,348 ms, the lowest of the three (Crawl4AI 4,754 ms, Firecrawl 6,937 ms). The Chrome-stealth fallback that recovers the URLs the others miss powers both the recall lead and the p90 win. fastCRW also recovers 34 URLs that neither Crawl4AI nor Firecrawl reached, 70% more unique recoveries than those two combined. For iterative research loops, where scraping happens in parallel, the relevant figures are the p50 latency (1,914 ms vs Firecrawl's 2,305 ms; Crawl4AI is effectively tied at 1,916 ms) and the 91.8% scrape-success rate (of reachable URLs, 0 thrown errors across 3,000 requests). See the full benchmark methodology for the complete breakdown.

For the search benchmark: fastCRW averaged 880 ms over 100 queries with 73 of 100 latency wins (benchmarks/triple-bench.ts, single point-in-time measurement). Fast search discovery reduces the latency of each research iteration's first step.

Architecture: Production Research Pipeline

A production deep research system needs more than just a scraping loop:

1. Query planning layer. The initial research question is broad. A planning LLM breaks it into 3–5 focused sub-queries before the first search call. This produces more targeted results than a single broad search.

2. Source deduplication. Multiple queries will surface overlapping results. Track seen URLs in a set; skip already-scraped sources. Hash article content to catch near-duplicates (same content, different URLs).

3. Source quality filtering. Not all scraped sources are authoritative. Score sources by domain authority, publication date, content length, and citation density. Deprioritize thin pages (< 200 words after stripping); they usually add noise, not signal.

4. Context management. A research agent touching 30+ sources cannot pass all content to the synthesis LLM in a single context window. Options:

Extract key facts from each source using formats: ["json"] + jsonSchema before synthesis
Use a vector store for retrieval-augmented synthesis (see RAG pipelines)
Summarize each source progressively and pass only summaries to the final synthesis step

5. Citation enforcement. Force source attribution in the synthesis prompt. A research output without source citations is not reproducible and cannot be fact-checked. Every claim should link to a scraped URL.

6. Domain deep-dives with /v1/map. When a domain emerges as particularly authoritative, call /v1/map to discover all related pages. Use includePatterns to scope the discovery:

# Discover 2024-2025 papers on arXiv. arXiv IDs use the form YYMM.NNNNN,
# so papers from 2024 start with "24" and papers from 2025 with "25".
urls = await map_domain_async(
    client,
    "https://arxiv.org",
    include_patterns=["*/abs/24*", "*/abs/25*"],
)

Map costs 1 credit and typically returns hundreds of URLs in under 5 seconds. Filter the URL list for relevance before scraping — not every page the map returns is worth reading.
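
Deduplication and thin-page filtering (items 2 and 3 above) can be sketched in a few lines. This is an illustration rather than fastCRW functionality: normalize_url, content_fingerprint, and filter_sources are hypothetical helpers, the 200-word threshold comes from the text, and the hash only catches content that is identical after whitespace and case normalization. True near-duplicate detection would need something like shingling or MinHash.

```python
import hashlib
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def content_fingerprint(markdown: str) -> str:
    """Hash whitespace-normalized, lowercased text to spot the same content at different URLs."""
    canonical = re.sub(r"\s+", " ", markdown).strip().lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def filter_sources(
    pages: list[tuple[str, str]],
    min_words: int = 200,
) -> list[tuple[str, str]]:
    """Keep the first copy of each (url, markdown) page; drop repeats and thin pages."""
    seen_urls: set[str] = set()
    seen_hashes: set[str] = set()
    kept: list[tuple[str, str]] = []
    for url, markdown in pages:
        key = normalize_url(url)
        digest = content_fingerprint(markdown)
        if key in seen_urls or digest in seen_hashes:
            continue  # same URL or same content already collected
        seen_urls.add(key)
        seen_hashes.add(digest)
        if len(markdown.split()) < min_words:
            continue  # thin page: likely noise rather than signal
        kept.append((url, markdown))
    return kept
```

Running the filter on scraped pages before the gap analysis keeps repeated or near-empty pages out of the LLM context.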

Comparison: fastCRW vs Purpose-Built Research Tools

Feature	fastCRW	Tavily	Perplexity API
Search endpoint	`/v1/search`	`/search`	`/chat/completions` (deep-research model)
Full page extraction	`/v1/scrape` — clean markdown	Optional content in response	Not available separately
LLM synthesis	Built-in (`answer: true`, managed, paid plans)	Built-in (`include_answer`)	Always included (billed per token)
Iteration control	Full — you own the loop	None — single call	Partial — internal iterations
Self-hosting	AGPL-3.0 free	Not available	Not available
Source citation	Returned URLs; you enforce in prompt	Returned URLs	Returned with answer
Search latency	880 ms avg, 73/100 latency wins (benchmarks/triple-bench.ts)	—	Higher (includes model generation)
Custom source targeting	`/v1/map` + `/v1/crawl`	Not available	Not available

The key distinction: fastCRW gives you the retrieval primitives and you own the reasoning loop. Perplexity and similar tools bundle the loop with the retrieval, which is simpler but less controllable. For production research agents where you need to tune the iteration strategy, source scoring, and synthesis prompt, owning the loop is the right tradeoff.

Pricing for Research Workloads

Research pipelines are moderately expensive because they combine search, scrape, and optionally LLM extraction per iteration. A typical 3-iteration research task touching 15 sources:

Step	Credits
3 search queries	3 credits (1 per query, CANONICAL-FACTS.md §3)
15 scrapes (http/lightpanda renderer)	15 credits (1 per scrape)
Optional: structured extraction on 15 sources	75 credits (5 per extraction)
Total (markdown only)	~18 credits per research task
Total (with structured extraction)	~93 credits per research task

At these rates, the Hobby plan (3,000 credits/mo) covers about 166 markdown-only research tasks per month (3,000 ÷ 18), and the Standard plan (100,000 credits/mo) covers about 5,555. See pricing for current plan rates; do not rely on the hard-coded numbers here, as rates are subject to change.
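
To re-run this estimate for a different task shape, the table reduces to a small calculation. The per-operation credit costs below are copied from the table above and may be out of date; the function names are illustrative.

```python
# Credit costs per operation, as listed in the table above (subject to change)
SEARCH_CREDITS = 1
SCRAPE_CREDITS = 1
EXTRACT_CREDITS = 5


def task_credits(searches: int, scrapes: int, extractions: int = 0) -> int:
    """Total credits for one research task."""
    return (
        searches * SEARCH_CREDITS
        + scrapes * SCRAPE_CREDITS
        + extractions * EXTRACT_CREDITS
    )


def tasks_per_month(plan_credits: int, credits_per_task: int) -> int:
    """Whole research tasks that fit in a monthly credit allowance."""
    return plan_credits // credits_per_task


markdown_only = task_credits(searches=3, scrapes=15)                      # 18
with_extraction = task_credits(searches=3, scrapes=15, extractions=15)    # 93
print(tasks_per_month(3_000, markdown_only))    # 166
print(tasks_per_month(100_000, markdown_only))  # 5555
```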

Self-hosting under AGPL-3.0 eliminates per-credit costs; you pay only your server. A single 2 vCPU / 4 GB VPS handles research pipelines at hundreds of tasks per day.

Good Fits vs Poor Fits

Strong fits for fastCRW in research pipelines:

Academic literature surveys that pull from preprint servers, university sites, and open-access journals
Competitive intelligence: tracking competitors' public communications, product announcements, and hiring signals
Policy and regulatory research: monitoring government publications, legislative records, and agency guidance
Market research: extracting analyst reports, earnings transcripts, and industry publications
Due diligence: researching companies, founders, and investment targets via public sources
AI-assisted journalism: building evidence bases for stories from public documents

Poor fits:

Research requiring access to paywalled academic databases — use their official APIs
Research that requires authenticated sessions (corporate intranets, subscriber-only sites)
Real-time monitoring that needs sub-second latency — fastCRW is optimized for accuracy, not streaming
Domains where robots.txt disallows scraping — fastCRW respects it by default

Related:

RAG pipelines — feed research output into vector stores for retrieval-augmented chat
AI agents — broader agent integration guide including MCP setup
LangChain integration — langchain-crw package for document loading
LangGraph integration — state-machine research loop patterns
MCP integration — expose fastCRW to Claude Code and Cursor
Firecrawl alternatives — full comparison including the 3-way benchmark
Search benchmark — fastCRW vs Tavily search latency and success rates
Scrape benchmark — full p50/p90/p99 breakdown and truth-recall methodology

FAQ

Q: How does a deep research pipeline differ from a single search query?

A: A single search returns ranked snippets. A deep research pipeline iterates: search → scrape full content → analyze gaps → search again with refined queries → scrape more. Each iteration adds depth. The loop continues until the coverage check passes or a max-iterations cap is hit. The scraping layer is what bridges the gap between a search snippet and the full document content needed for accurate synthesis.

Q: How does fastCRW's built-in search compare to Tavily for research pipelines?

A: fastCRW search averaged 880 ms over a 100-query benchmark, with 73 of 100 latency wins against both Firecrawl and Tavily (benchmarks/triple-bench.ts, single point-in-time measurement). The more important advantage for research pipelines is that search and scrape run in the same binary: no second API key, no second service to operate. The /v1/search answer synthesis mode (managed LLM, paid plans) adds an optional LLM layer directly on top of search results.

Q: Can fastCRW replace Perplexity's deep research product?

A: fastCRW is a building block, not a finished product. Perplexity's deep research is a complete UX with a custom reasoning loop, citation UI, and model fine-tuning. fastCRW gives you the scraping and search infrastructure to build the same retrieval loop — you bring the reasoning model and the synthesis logic. The advantage is control: you choose the LLM, the iteration strategy, the source selection policy, and you can self-host the entire pipeline for free under AGPL-3.0.

Q: What is the /v1/search answer synthesis mode and when should I use it?

A: Pass answer: true in a POST /v1/search request to get an LLM-synthesized answer alongside the source URLs. Synthesis runs on fastCRW's managed LLM, available on paid plans with no API key to manage; the Free plan has no LLM features. The managed LLM carries a 3× credit markup and an 8,000-credit per-request cap (CANONICAL-FACTS.md §6, verified 2026-05-30). Use built-in synthesis for lightweight research tasks; for production pipelines where you need full model control, run your own LLM over the returned source content.

Q: How many iterations should a deep research loop run?

A: Three to five iterations is a practical ceiling for most research tasks. The first pass covers obvious sources. The second fills gaps identified in the first. The third handles edge cases and contradictory claims. Beyond five passes, diminishing returns set in — you're mostly finding sources that rephrase what earlier sources said. Add a convergence check: if the gap analysis produces fewer than two new sub-questions, stop iterating.
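
That stopping rule can be written as a small predicate. The sketch below is one possible reading of it: should_stop and normalize are hypothetical helpers, iteration counts completed passes, and a sub-question counts as new only if it has not already been asked, ignoring case and spacing.

```python
def normalize(query: str) -> str:
    """Case- and whitespace-insensitive key for comparing queries."""
    return " ".join(query.lower().split())


def should_stop(
    iteration: int,
    proposed: list[str],
    already_asked: set[str],
    max_iterations: int = 5,
    min_new: int = 2,
) -> bool:
    """Stop at the iteration cap or when gap analysis yields too few new sub-questions."""
    asked_keys = {normalize(q) for q in already_asked}
    fresh = {normalize(q) for q in proposed} - asked_keys
    return iteration >= max_iterations or len(fresh) < min_new
```

Comparing against already-asked queries keeps the loop from cycling when the gap analysis keeps proposing rephrasings of earlier questions.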

Q: How do I handle paywalled or restricted research sources?

A: fastCRW respects robots.txt by default. For paywalled academic sources, use open-access alternatives (arXiv, PubMed Central, Semantic Scholar, SSRN for preprints) as search targets. For news paywalls, target sources with permissive terms. Do not attempt to bypass paywalls — fastCRW's terms prohibit scraping content you don't have the right to access. For licensed academic databases, use their official APIs alongside fastCRW for public-web research.
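
If you want to drop disallowed URLs before spending credits on them, a client-side pre-check is possible with the Python standard library. This is a sketch under assumptions: allowed_urls is a hypothetical helper, the robots.txt text is fetched by you beforehand, and the "*" user agent is a placeholder because the user-agent string fastCRW sends is not stated here.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, urls: list[str], user_agent: str = "*") -> list[str]:
    """Return the URLs that the given robots.txt text permits for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


robots = "User-agent: *\nDisallow: /private/\n"
candidates = [
    "https://example.com/private/report",
    "https://example.com/public/paper",
]
print(allowed_urls(robots, candidates))
```

This check complements the server-side behavior described above rather than replacing it.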

Sources

fastCRW /v1/search API reference

https://docs.fastcrw.com/api-reference/search/

fastCRW /v1/scrape API reference

https://docs.fastcrw.com/api-reference/scrape/

fastCRW 3-way scrape benchmark

/benchmarks/firecrawl-dataset

fastCRW search benchmark

/benchmarks/tavily-search

Perplexity AI deep research product overview

https://www.perplexity.ai/hub/blog/deep-research

FAQ

How does a deep research pipeline differ from a single search query?

A single search returns ranked snippets. A deep research pipeline iterates: search → scrape full content → analyze gaps → search again with refined queries → scrape more. Each iteration adds depth. The loop continues until the coverage check passes or a max-iterations cap is hit. The scraping layer is what bridges the gap between a search snippet and the full document content needed for accurate synthesis.

How does fastCRW's built-in search compare to Tavily for research pipelines?

fastCRW search averaged 880 ms over a 100-query benchmark, with 73 of 100 latency wins against both Firecrawl and Tavily (benchmarks/triple-bench.ts, single point-in-time measurement). The more important advantage for research pipelines is that search and scrape live on the same binary — no second API key, no second service to operate. The /v1/search answer synthesis mode (managed LLM, paid plans) adds an optional LLM layer directly on top of search results.

Can fastCRW replace Perplexity's deep research product?

fastCRW is a building block, not a finished product. Perplexity's deep research is a complete UX with a custom reasoning loop, citation UI, and model fine-tuning. fastCRW gives you the scraping and search infrastructure to build the same retrieval loop — you bring the reasoning model and the synthesis logic. The advantage is control: you choose the LLM, the iteration strategy, the source selection policy, and you can self-host the entire pipeline for free under AGPL-3.0.

What is the /v1/search answer synthesis mode and when should I use it?

Pass `answer: true` in a POST /v1/search request to get an LLM-synthesized answer alongside the source URLs. Synthesis runs on fastCRW's managed LLM, which is available on paid plans and needs no separate LLM API key; the Free plan has no LLM features. The managed LLM carries a 3× credit markup and an 8,000-credit per-request cap (CANONICAL-FACTS.md §6, verified 2026-05-30). Use built-in synthesis for lightweight research tasks. For production pipelines that need full model control, run your own LLM over the returned source content.
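
As a rough illustration, the request can be assembled with the standard library alone. This is a sketch under stated assumptions: only `answer: true` and the POST /v1/search route come from the text above. The `query` field name, the Bearer authorization header, and the base URL you pass in are assumptions, so check them against the /v1/search API reference before relying on them.

```python
import json
import urllib.request


def build_answer_request(base_url: str, api_key: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/search request with answer synthesis enabled."""
    body = {
        "query": query,   # assumed field name; confirm in the /v1/search reference
        "answer": True,   # from the text: request an LLM-synthesized answer
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/search",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )
```

Sending the request is then a matter of `urllib.request.urlopen(request)` and decoding the JSON response. Building and sending are kept separate here so the payload can be inspected or logged without spending credits.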

How many iterations should a deep research loop run?

Three to five iterations is a practical ceiling for most research tasks. The first pass covers obvious sources. The second fills gaps identified in the first. The third handles edge cases and contradictory claims. Beyond five passes, diminishing returns set in — you're mostly finding sources that rephrase what earlier sources said. Add a convergence check: if the gap analysis produces fewer than two new sub-questions, stop iterating.
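
The stopping rule above can be sketched as two small helpers. The function names and the case-insensitive duplicate check are illustrative assumptions, not part of fastCRW; the defaults of five iterations and two new sub-questions follow the guidance in the answer.

```python
from typing import Iterable, List


def new_sub_questions(candidates: Iterable[str], asked: Iterable[str]) -> List[str]:
    """Keep only sub-questions that have not been asked before (case-insensitive)."""
    seen = {q.strip().lower() for q in asked}
    fresh: List[str] = []
    for q in candidates:
        key = q.strip().lower()
        if key and key not in seen:
            seen.add(key)
            fresh.append(q.strip())
    return fresh


def should_stop(completed_passes: int, fresh: List[str],
                max_iterations: int = 5, min_new: int = 2) -> bool:
    """Stop at the iteration cap, or when gap analysis has converged."""
    if completed_passes >= max_iterations:
        return True
    return len(fresh) < min_new   # fewer than two new sub-questions: converged
```

Filtering out previously asked questions before counting matters, because a gap analysis that keeps rephrasing the same question would otherwise never trip the convergence check.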

How do I handle paywalled or restricted research sources?

fastCRW respects robots.txt by default. For paywalled academic sources, use open-access alternatives (arXiv, PubMed Central, Semantic Scholar, SSRN for preprints) as search targets. For news paywalls, target sources with permissive terms. Do not attempt to bypass paywalls — fastCRW's terms prohibit scraping content you don't have the right to access. For licensed academic databases, use their official APIs alongside fastCRW for public-web research.
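
One way to apply the open-access guidance is to filter search results against a host allowlist before scraping. The sketch below is illustrative only: the hostnames are assumptions chosen to match the sources named above, and you would adjust the set for your own field and access rights.

```python
from typing import Iterable, List, Set
from urllib.parse import urlsplit

# Illustrative allowlist (assumed hostnames); extend or replace for your domain
OPEN_ACCESS_HOSTS: Set[str] = {
    "arxiv.org",
    "pmc.ncbi.nlm.nih.gov",
    "semanticscholar.org",
    "ssrn.com",
}


def is_open_access(url: str, allowed: Set[str] = OPEN_ACCESS_HOSTS) -> bool:
    """True if the URL's host is an allowlisted domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in allowed)


def filter_open_access(urls: Iterable[str]) -> List[str]:
    """Drop every URL that is not on the open-access allowlist, preserving order."""
    return [u for u in urls if is_open_access(u)]
```

Matching on the exact host or a dot-prefixed suffix avoids false positives from lookalike domains, which a plain substring check would let through.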
