
LlamaIndex Web Scraping Integration — fastCRW [Firecrawl-Compatible]

Build a custom LlamaIndex web reader that calls fastCRW for live ingestion, or wrap fastCRW scrape/crawl in a LlamaIndex tool for agent workflows. Markdown output feeds directly into embedding pipelines. 6.6 MB RAM, 92% coverage on a 1,000-URL benchmark.

Published
May 12, 2026
Updated
May 12, 2026
Category
integrations
Verdict

Build a custom LlamaIndex reader that calls fastCRW for HTTP scraping, JS rendering, and crawling. Use the same reader in loading pipelines, RAG queries, and agent tools. fastCRW outputs clean Markdown that chunks and embeds well.

  • Custom SimpleWebPageReader-style LlamaIndex reader wrapping the fastCRW REST API
  • Works with LlamaIndex data loading, ingestion, and retrieval pipelines
  • Expose fastCRW as a LlamaIndex agent tool for in-loop scraping
  • 6.6 MB RAM fastCRW binary, 833 ms average latency, zero dependencies

Why LlamaIndex + fastCRW

LlamaIndex is the data framework for LLM applications — it connects your models to documents, APIs, and databases. fastCRW plugs in as the web scraping primitive. Instead of building your own scraper or using a heavyweight Firecrawl container, you write a lightweight LlamaIndex reader that calls fastCRW's REST API. The output is clean Markdown that feeds directly into chunking, embedding, and vector storage.

The pattern: ingest live web pages via fastCRW, let LlamaIndex chunk and embed them, query the vector store, and pass retrieved context to your LLM. Unlike batch scraping tools, fastCRW's on-demand scraping keeps your RAG index fresh with zero infrastructure overhead.

Setup

  1. Install LlamaIndex and dependencies.
  2. Sign up at fastcrw.com for an API key.
  3. Export FASTCRW_API_KEY in your shell.
  4. Implement a custom LlamaIndex reader class that calls fastCRW.
  5. Use the reader in your data loading and agent pipelines.
pip install llama-index requests
export FASTCRW_API_KEY="fcrw_..."

Code Example: Custom LlamaIndex Reader for fastCRW

Create a file fastcrw_reader.py:

import os
import requests
from typing import Any, List

from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document


class FastCRWReader(BaseReader):
    """
    A LlamaIndex reader that scrapes URLs via fastCRW REST API.
    Supports single-page scrape and recursive crawls.
    """

    def __init__(self, api_key: str | None = None, base_url: str = "https://fastcrw.com"):
        self.api_key = api_key or os.environ.get("FASTCRW_API_KEY")
        if not self.api_key:
            raise ValueError("FASTCRW_API_KEY not set")
        self.base_url = base_url

    def load_data(
        self,
        url: str,
        mode: str = "scrape",
        max_depth: int | None = None,
        formats: List[str] | None = None,
        **kwargs: Any,
    ) -> List[Document]:
        """
        Load and scrape URL(s) via fastCRW.

        Args:
            url: The target URL to scrape or crawl.
            mode: "scrape" (single page) or "crawl" (recursive).
            max_depth: For crawl mode, max crawl depth.
            formats: Output formats (default: ["markdown"]).
            **kwargs: Additional fastCRW parameters.

        Returns:
            List of LlamaIndex Document objects with scraped content.
        """
        if formats is None:
            formats = ["markdown"]

        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }

        if mode == "scrape":
            # Single-page scrape
            payload = {
                "url": url,
                "formats": formats,
                **kwargs,
            }
            response = requests.post(
                f"{self.base_url}/api/v1/scrape",
                json=payload,
                headers=headers,
                timeout=60,
            )
            response.raise_for_status()
            data = response.json()

            doc_content = data["data"].get("markdown") or data["data"].get("html", "")
            return [
                Document(
                    text=doc_content,
                    metadata={
                        "url": url,
                        "status_code": data["data"].get("status_code"),
                        "load_time_ms": data["data"].get("load_time_ms"),
                    },
                )
            ]

        elif mode == "crawl":
            # Recursive crawl
            payload = {
                "url": url,
                "formats": formats,
                **kwargs,
            }
            if max_depth is not None:
                payload["max_depth"] = max_depth

            response = requests.post(
                f"{self.base_url}/api/v1/crawl",
                json=payload,
                headers=headers,
                timeout=300,  # Crawls can take longer
            )
            response.raise_for_status()
            data = response.json()

            # Map crawl results to LlamaIndex documents
            documents = []
            for page in data["data"]:
                doc_content = page.get("markdown") or page.get("html", "")
                documents.append(
                    Document(
                        text=doc_content,
                        metadata={
                            "url": page.get("url"),
                            "status_code": page.get("status_code"),
                            "load_time_ms": page.get("load_time_ms"),
                        },
                    )
                )
            return documents

        else:
            raise ValueError(f"Unknown mode: {mode}. Use 'scrape' or 'crawl'.")
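
Once the reader is defined, a quick smoke test in single-page mode confirms the wiring (the target URL below is just a placeholder):

from fastcrw_reader import FastCRWReader

reader = FastCRWReader()
docs = reader.load_data(url="https://example.com", mode="scrape")

# One Document per scraped page, with URL and timing metadata attached
print(docs[0].metadata["url"], len(docs[0].text), "chars")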

Usage Example: RAG Pipeline with fastCRW

from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from fastcrw_reader import FastCRWReader

# Configure embedding and LLM
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o-mini")

# Instantiate the fastCRW reader
reader = FastCRWReader()

# Load documents from a website
documents = reader.load_data(
    url="https://docs.python.org/3/",
    mode="crawl",
    max_depth=2,  # Limit crawl depth for demo
)

# Create a vector index from scraped content
index = VectorStoreIndex.from_documents(documents)

# Query the index
query_engine = index.as_query_engine()
response = query_engine.query("What are Python's built-in functions?")
print(response)
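
By default, VectorStoreIndex.from_documents chunks with LlamaIndex's global node parser. To control chunk size explicitly, pass a splitter as a transformation; the 512/64 sizes below are illustrative, not a recommendation:

from llama_index.core.node_parser import SentenceSplitter

# Split scraped Markdown into ~512-token nodes with 64-token overlap
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
index = VectorStoreIndex.from_documents(documents, transformations=[splitter])

Smaller chunks sharpen retrieval on dense reference pages; larger chunks preserve context in long-form articles.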

Agent Tool Example: In-Loop Scraping

Expose fastCRW scraping as a LlamaIndex agent tool:

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI
import requests
import os


def scrape_url(url: str) -> str:
    """
    Scrape a single URL via fastCRW and return Markdown content.
    """
    api_key = os.environ["FASTCRW_API_KEY"]
    response = requests.post(
        "https://fastcrw.com/api/v1/scrape",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"url": url, "formats": ["markdown"]},
        timeout=60,
    )
    response.raise_for_status()
    data = response.json()
    return data["data"]["markdown"]


def search_web(query: str, limit: int = 5) -> str:
    """
    Search the web via fastCRW and return results.
    """
    api_key = os.environ["FASTCRW_API_KEY"]
    response = requests.post(
        "https://fastcrw.com/api/v1/search",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"query": query, "limit": limit},
        timeout=30,
    )
    response.raise_for_status()
    results = response.json()["data"]["results"]
    return "\n".join(
        [f"- {r['title']}: {r['url']}" for r in results[:limit]]
    )


# Create tools
scrape_tool = FunctionTool.from_defaults(fn=scrape_url)
search_tool = FunctionTool.from_defaults(fn=search_web)

# Create agent
llm = OpenAI(model="gpt-4o-mini")
agent = ReActAgent.from_tools(
    tools=[scrape_tool, search_tool],
    llm=llm,
    # context is injected into the ReAct system prompt
    context="You are a research assistant. Use fastCRW tools to fetch live web content and answer questions based on current data.",
)

# Run agent reasoning loop
response = agent.chat("What are the latest Python 3.14 features?")
print(response)

Streaming Crawl Example

For large crawls, hand each page to a callback as the loop visits it, instead of accumulating every Document in memory before indexing:

import os
import requests
from typing import Callable, Optional

from llama_index.core import Document

def crawl_streaming(
    url: str,
    max_depth: int = 2,
    process_doc: Optional[Callable[[Document], None]] = None,
) -> int:
    """
    Crawl a site via fastCRW and stream documents to a callback.

    Args:
        url: Starting URL.
        max_depth: Crawl depth limit.
        process_doc: Callback function (Document) -> None.

    Returns:
        Total pages crawled.
    """
    api_key = os.environ["FASTCRW_API_KEY"]
    payload = {
        "url": url,
        "formats": ["markdown"],
        "max_depth": max_depth,
    }

    response = requests.post(
        "https://fastcrw.com/api/v1/crawl",
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=300,
    )
    response.raise_for_status()
    data = response.json()

    count = 0
    for page in data["data"]:
        doc = Document(
            text=page.get("markdown") or page.get("html", ""),
            metadata={"url": page.get("url")},
        )
        if process_doc:
            process_doc(doc)
        count += 1

    return count
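
A usage sketch: because VectorStoreIndex.insert accepts a single Document, the bound method works directly as the callback, so each page is embedded and stored as soon as it is processed:

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex([])  # start empty and grow one page at a time

total = crawl_streaming(
    "https://docs.python.org/3/",
    max_depth=2,
    process_doc=index.insert,  # embed and store each Document immediately
)
print(f"Indexed {total} pages")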

When to Use This

  • RAG systems with live web content — scrape documentation, blogs, and news into a vector index.
  • Multi-document QA — crawl a website and build a searchable knowledge base.
  • Data pipelines — ingest web content at intervals (daily, hourly) without manual crawler maintenance.
  • Agent research loops — let agents scrape and summarize in real time during reasoning.
  • Competitive intelligence — scrape multiple competitor sites and aggregate with LlamaIndex tools.
  • Content curation — scrape RSS feeds and webpages, chunk, embed, and surface via semantic search.

Limits + Gotchas

  • Crawl timeouts — large crawls can take 30+ seconds. Set appropriate request timeouts in your reader.
  • Rate limiting — fastCRW enforces API rate limits. Implement exponential backoff if you hit 429 responses (see the retry sketch after this list).
  • Memory usage — storing all pages from a large crawl in memory can be expensive. Use streaming or pagination.
  • Metadata mapping — fastCRW and LlamaIndex use different metadata schemas. Map custom fields explicitly in the Document constructor.
  • Duplicate URLs — the crawl endpoint may return duplicates. Deduplicate by URL before indexing.
  • JS rendering — static HTTP scraping is the default. For JS-heavy sites, request LightPanda or Chrome rendering (incurs cost).
  • Authentication — fastCRW cannot authenticate automatically. Scrape public content or use session persistence for authenticated flows.
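
A minimal backoff sketch against the scrape endpoint used throughout this guide; the retry count and base delay are arbitrary defaults, not fastCRW recommendations:

import os
import time
import requests


def scrape_with_backoff(url: str, max_retries: int = 5) -> str:
    """Scrape via fastCRW, retrying with exponential backoff on HTTP 429."""
    api_key = os.environ["FASTCRW_API_KEY"]
    for attempt in range(max_retries):
        response = requests.post(
            "https://fastcrw.com/api/v1/scrape",
            headers={"Authorization": f"Bearer {api_key}"},
            json={"url": url, "formats": ["markdown"]},
            timeout=60,
        )
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()["data"]["markdown"]
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, ...
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts: {url}")

For the duplicate-URL gotcha, deduplicating by metadata before indexing is a one-liner: documents = list({d.metadata["url"]: d for d in documents}.values()).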

Performance Notes

  • Single-page scrape: 833 ms median on HTTP content.
  • Crawl speed: Depends on site complexity and depth. Budget 1–5 seconds per page.
  • Embedding latency: Chunking and embedding 100 pages adds ~10–30 seconds with OpenAI embeddings.
  • Vector store: LlamaIndex supports in-memory (demo), Chroma, Pinecone, and others. Choose based on scale (a Chroma sketch follows this list).
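
Swapping the in-memory store for a persistent one is a storage-context change, not a reader change. A sketch using Chroma (assumes pip install llama-index-vector-stores-chroma chromadb; the path and collection name are arbitrary):

import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# Persistent local Chroma collection for scraped pages
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("fastcrw_pages")

vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# documents: output of FastCRWReader.load_data(...)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)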
