
LlamaIndex Web Scraping Integration — fastCRW [Firecrawl-Compatible]

Build a custom LlamaIndex web reader that calls fastCRW for live ingestion, or wrap fastCRW scrape/crawl in a LlamaIndex tool for agent workflows. Markdown output feeds directly into embedding pipelines. fastCRW reports a 6.6 MB RAM runtime and 92% coverage on a 1,000-URL benchmark.

Published: May 12, 2026
Updated: May 12, 2026

Why LlamaIndex + fastCRW

LlamaIndex is a data framework for LLM applications: it connects your models to documents, APIs, and databases. fastCRW plugs in as the web scraping primitive. Instead of building your own scraper or running a heavyweight Firecrawl container, you write a lightweight LlamaIndex reader that calls fastCRW's REST API. The output is clean Markdown that feeds directly into chunking, embedding, and vector storage.

The pattern: ingest live web pages via fastCRW, let LlamaIndex chunk and embed them, query the vector store, and pass the retrieved context to your LLM. Unlike batch scraping tools, fastCRW's on-demand scraping lets you refresh your RAG index whenever you need to, without running crawler infrastructure of your own.

Setup

Install LlamaIndex and dependencies.
Sign up at fastcrw.com for an API key.
Export FASTCRW_API_KEY in your shell.
Implement a custom LlamaIndex reader class that calls fastCRW.
Use the reader in your data loading and agent pipelines.

pip install llama-index requests
export FASTCRW_API_KEY="fcrw_..."

Code Example: Custom LlamaIndex Reader for fastCRW

Create a file fastcrw_reader.py:

import os
from typing import Any, List

import requests
from llama_index.core import Document
from llama_index.core.readers.base import BaseReader


class FastCRWReader(BaseReader):
    """
    A LlamaIndex reader that scrapes URLs via fastCRW REST API.
    Supports single-page scrape and recursive crawls.
    """

    def __init__(self, api_key: str | None = None, base_url: str = "https://fastcrw.com"):
        self.api_key = api_key or os.environ.get("FASTCRW_API_KEY")
        if not self.api_key:
            raise ValueError("FASTCRW_API_KEY not set")
        self.base_url = base_url

    def load_data(
        self,
        url: str,
        mode: str = "scrape",
        max_depth: int | None = None,
        formats: List[str] | None = None,
        **kwargs: Any,
    ) -> List[Document]:
        """
        Load and scrape URL(s) via fastCRW.

        Args:
            url: The target URL to scrape or crawl.
            mode: "scrape" (single page) or "crawl" (recursive).
            max_depth: For crawl mode, max crawl depth.
            formats: Output formats (default: ["markdown"]).
            **kwargs: Additional fastCRW parameters.

        Returns:
            List of LlamaIndex Document objects with scraped content.
        """
        if formats is None:
            formats = ["markdown"]

        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }

        if mode == "scrape":
            # Single-page scrape
            payload = {
                "url": url,
                "formats": formats,
                **kwargs,
            }
            response = requests.post(
                f"{self.base_url}/api/v1/scrape",
                json=payload,
                headers=headers,
                timeout=60,
            )
            response.raise_for_status()
            data = response.json()

            doc_content = data["data"].get("markdown") or data["data"].get("html", "")
            return [
                Document(
                    text=doc_content,
                    metadata={
                        "url": url,
                        "status_code": data["data"].get("status_code"),
                        "load_time_ms": data["data"].get("load_time_ms"),
                    },
                )
            ]

        elif mode == "crawl":
            # Recursive crawl
            payload = {
                "url": url,
                "formats": formats,
                **kwargs,
            }
            if max_depth is not None:
                payload["max_depth"] = max_depth

            response = requests.post(
                f"{self.base_url}/api/v1/crawl",
                json=payload,
                headers=headers,
                timeout=300,  # Crawls can take longer
            )
            response.raise_for_status()
            data = response.json()

            # Map crawl results to LlamaIndex documents
            documents = []
            for page in data["data"]:
                doc_content = page.get("markdown") or page.get("html", "")
                documents.append(
                    Document(
                        text=doc_content,
                        metadata={
                            "url": page.get("url"),
                            "status_code": page.get("status_code"),
                            "load_time_ms": page.get("load_time_ms"),
                        },
                    )
                )
            return documents

        else:
            raise ValueError(f"Unknown mode: {mode}. Use 'scrape' or 'crawl'.")

Usage Example: RAG Pipeline with fastCRW

from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from fastcrw_reader import FastCRWReader

# Configure embedding and LLM
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o-mini")

# Instantiate the fastCRW reader
reader = FastCRWReader()

# Load documents from a website
documents = reader.load_data(
    url="https://docs.python.org/3/",
    mode="crawl",
    max_depth=2,  # Limit crawl depth for demo
)

# Create a vector index from scraped content
index = VectorStoreIndex.from_documents(documents)

# Query the index
query_engine = index.as_query_engine()
response = query_engine.query("What are Python's built-in functions?")
print(response)

Agent Tool Example: In-Loop Scraping

Expose fastCRW scraping as a LlamaIndex agent tool:

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI
import requests
import os


def scrape_url(url: str) -> str:
    """
    Scrape a single URL via fastCRW and return Markdown content.
    """
    api_key = os.environ["FASTCRW_API_KEY"]
    response = requests.post(
        "https://fastcrw.com/api/v1/scrape",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"url": url, "formats": ["markdown"]},
        timeout=60,
    )
    response.raise_for_status()
    data = response.json()
    return data["data"]["markdown"]


def search_web(query: str, limit: int = 5) -> str:
    """
    Search the web via fastCRW and return results.
    """
    api_key = os.environ["FASTCRW_API_KEY"]
    response = requests.post(
        "https://fastcrw.com/api/v1/search",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"query": query, "limit": limit},
        timeout=30,
    )
    response.raise_for_status()
    results = response.json()["data"]["results"]
    return "\n".join(
        [f"- {r['title']}: {r['url']}" for r in results[:limit]]
    )


# Create tools
scrape_tool = FunctionTool.from_defaults(fn=scrape_url)
search_tool = FunctionTool.from_defaults(fn=search_web)

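# Create agent.
# Note: ReActAgent.from_tools and agent.chat belong to the legacy agent API.
# Newer LlamaIndex releases expose a workflow-based ReActAgent
# (llama_index.core.agent.workflow) that is run with `await agent.run(...)`,
# so check which version you have installed.
llm = OpenAI(model="gpt-4o-mini")
agent = ReActAgent.from_tools(
    tools=[scrape_tool, search_tool],
    llm=llm,
    context="You are a research assistant. Use fastCRW tools to fetch live web content and answer questions based on current data.",
)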

# Run agent reasoning loop
response = agent.chat("What are the latest Python 3.14 features?")
print(response)

Streaming Crawl Example

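For large crawls, hand each page to a callback as it is converted to a Document instead of accumulating a full list. The request below still returns the whole crawl in one response, so this limits the memory held in Documents rather than streaming over the network: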

import os
from typing import Callable, Optional

import requests
from llama_index.core import Document


def crawl_streaming(
    url: str,
    max_depth: int = 2,
    process_doc: Optional[Callable[[Document], None]] = None,
) -> int:
    """
    Crawl a site via fastCRW and stream documents to a callback.

    Args:
        url: Starting URL.
        max_depth: Crawl depth limit.
        process_doc: Callback function (Document) -> None.

    Returns:
        Total pages crawled.
    """
    api_key = os.environ["FASTCRW_API_KEY"]
    payload = {
        "url": url,
        "formats": ["markdown"],
        "max_depth": max_depth,
    }

    response = requests.post(
        "https://fastcrw.com/api/v1/crawl",
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=300,
    )
    response.raise_for_status()
    data = response.json()

    count = 0
    for page in data["data"]:
        doc = Document(
            text=page.get("markdown") or page.get("html", ""),
            metadata={"url": page.get("url")},
        )
        if process_doc:
            process_doc(doc)
        count += 1

    return count

When to Use This

RAG systems with live web content — scrape documentation, blogs, and news into a vector index.
Multi-document QA — crawl a website and build a searchable knowledge base.
Data pipelines — ingest web content at intervals (daily, hourly) without manual crawler maintenance.
Agent research loops — let agents scrape and summarize in real time during reasoning.
Competitive intelligence — scrape multiple competitor sites and aggregate with LlamaIndex tools.
Content curation — scrape RSS feeds and webpages, chunk, embed, and surface via semantic search.

Limits + Gotchas

Crawl timeouts — large crawls can take 30+ seconds. Set appropriate request timeouts in your reader.
Rate limiting — fastCRW enforces API rate limits. Implement exponential backoff if you hit 429 responses.
Memory usage — storing all pages from a large crawl in memory can be expensive. Use streaming or pagination.
Metadata mapping — fastCRW and LlamaIndex use different metadata schemas. Map custom fields explicitly in the Document constructor.
Duplicate URLs — the crawl endpoint may return duplicates. Deduplicate by URL before indexing.
JS rendering — static HTTP scraping is the default. For JS-heavy sites, request LightPanda or Chrome rendering (incurs cost).
Authentication — fastCRW cannot authenticate automatically. Scrape public content or use session persistence for authenticated flows.
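
The rate-limiting advice above can be sketched as a small retry helper. This is a minimal illustration, not fastCRW's recommended client: the helper name call_with_backoff, the retry count, and the delay values are all assumptions, and it only relies on the response object exposing a status_code attribute (as requests responses do).

```python
import random
import time
from typing import Any, Callable


def call_with_backoff(
    send: Callable[[], Any],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send() and retry with exponential backoff while it returns HTTP 429.

    send is any zero-argument callable that returns an object with a
    status_code attribute, for example a lambda wrapping requests.post.
    """
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429:
            return response
        if attempt == max_retries:
            break
        # The delay doubles on each attempt (1s, 2s, 4s, ...), capped at
        # max_delay, plus up to 10% random jitter so that parallel workers
        # do not all retry at the same moment.
        delay = min(base_delay * (2 ** attempt), max_delay)
        sleep(delay + random.uniform(0, delay * 0.1))
    # Retries exhausted: return the last 429 response so the caller can
    # decide what to do (for example, call raise_for_status()).
    return response
```

In the reader, the call would look like call_with_backoff(lambda: requests.post(endpoint, json=payload, headers=headers, timeout=60)), followed by response.raise_for_status() as before.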

Performance Notes

Single-page scrape: 833 ms median on HTTP content.
Crawl speed: Depends on site complexity and depth. Budget 1–5 seconds per page.
Embedding latency: Chunking and embedding 100 pages adds ~10–30 seconds with OpenAI embeddings.
Vector store: LlamaIndex supports in-memory (demo), Chroma, Pinecone, and others. Choose based on scale.
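
As a rough worked example of the figures above, the following sketch combines the per-page crawl budget with the embedding estimate. It assumes pages are fetched one after another and that embedding time scales linearly with page count; both are simplifications, and the function name is hypothetical.

```python
def estimate_ingest_seconds(pages: int) -> tuple[float, float]:
    """Return a (low, high) wall-clock estimate in seconds for crawl plus embedding."""
    # Crawl budget from the notes above: 1 to 5 seconds per page.
    crawl_low, crawl_high = 1.0 * pages, 5.0 * pages
    # Embedding budget from the notes above: 10 to 30 seconds per 100 pages,
    # scaled linearly (an assumption).
    embed_low, embed_high = 10.0 * pages / 100, 30.0 * pages / 100
    return crawl_low + embed_low, crawl_high + embed_high


low, high = estimate_ingest_seconds(100)
print(f"100 pages: roughly {low:.0f} to {high:.0f} seconds")
# prints: 100 pages: roughly 110 to 530 seconds
```

Under these assumptions the crawl, not the embedding step, dominates total ingestion time.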

Sources

LlamaIndex documentation: https://docs.llamaindex.ai
LlamaIndex readers/loaders guide: https://docs.llamaindex.ai/en/stable/understanding_llamaindex/document_management/
fastCRW REST API docs: /docs/rest-api
LlamaIndex agent tools: https://docs.llamaindex.ai/en/stable/module_guides/agent_tools/

FAQ

How do I create a custom LlamaIndex reader for fastCRW?

Subclass BaseReader and implement load_data() to call the fastCRW scrape or crawl endpoint (/api/v1/scrape or /api/v1/crawl in the examples above). Return a list of Document objects with Markdown content. See the reader code example above.

Can I use fastCRW for incremental RAG ingestion?

Yes. Call your fastCRW reader on each URL, store Document metadata with the URL and scraped timestamp, and implement document deduplication. LlamaIndex will handle chunking, embedding, and storage.
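
One way to sketch the deduplication bookkeeping is to fingerprint each page's Markdown and skip pages whose content has not changed since the last run. The function names and the metadata keys scraped_at and content_sha256 are hypothetical, and the seen dictionary stands in for whatever persistent store you use.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, Optional


def content_fingerprint(text: str) -> str:
    """Return a stable SHA-256 hex digest of the page content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_ingest(url: str, markdown: str, seen: Dict[str, str]) -> Optional[dict]:
    """Return metadata for a new or changed page, or None when it is unchanged.

    seen maps each URL to the fingerprint of the version indexed last time
    and is updated in place.
    """
    fingerprint = content_fingerprint(markdown)
    if seen.get(url) == fingerprint:
        return None  # Same content as last time: skip re-embedding.
    seen[url] = fingerprint
    return {
        "url": url,
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        "content_sha256": fingerprint,
    }
```

When plan_ingest returns a dictionary, pass it as the metadata argument of the Document you build for that page; when it returns None, skip the page.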

How do I handle large crawls in LlamaIndex?

For deep crawls, paginate results in your reader or call fastCRW with a bounded max_depth (the parameter name used in the reader above). Feed documents into LlamaIndex in small groups as they become available rather than building every Document before indexing.
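
A minimal sketch of the grouping step, using only the standard library: the batched helper below yields fixed-size groups from any iterable of pages or documents. The helper name and the batch size are assumptions; how you insert each group depends on your index and vector store.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls the next `size` items without materializing the rest.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


print(list(batched(range(5), 2)))
# prints: [[0, 1], [2, 3], [4]]
```

Each yielded group can then be converted to Documents and added to the index before the next group is processed, which keeps peak memory bounded by the batch size.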

Can fastCRW scrape password-protected pages?

Not directly. For authenticated content, use fastCRW's session persistence feature (if available) or proxy the authentication outside the reader. Store credentials securely, never in code.

What metadata does fastCRW return for embeddings?

fastCRW returns metadata including URL, status code, and load time. Custom headers and structured extraction are available via LLM extraction. Map these fields to the metadata argument of the LlamaIndex Document.
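
The mapping can be kept in one small function so that scrape and crawl results are handled the same way. This sketch assumes the response field names used in the reader example (url, status_code, load_time_ms); the helper name and the fallback_url parameter are hypothetical.

```python
from typing import Any, Dict, Mapping, Optional, Sequence

# Field names taken from the reader example; treat them as assumptions
# about the fastCRW response shape.
DEFAULT_FIELDS = ("url", "status_code", "load_time_ms")


def page_to_metadata(
    page: Mapping[str, Any],
    fallback_url: Optional[str] = None,
    fields: Sequence[str] = DEFAULT_FIELDS,
) -> Dict[str, Any]:
    """Copy the selected fields from a page result, skipping missing values."""
    metadata = {key: page[key] for key in fields if page.get(key) is not None}
    # A single-page scrape result may not echo the URL back, so fall back
    # to the URL that was requested.
    if "url" not in metadata and fallback_url is not None:
        metadata["url"] = fallback_url
    return metadata
```

The returned dictionary is what you would pass as metadata when constructing each Document.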

How do I use fastCRW in a LlamaIndex agent?

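Write a plain Python function that calls fastCRW's scrape endpoint and returns the Markdown, then wrap it with FunctionTool.from_defaults and pass it to the agent, as in the agent tool example above. The agent can invoke the tool mid-reasoning to fetch pages for answer grounding.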

Can I cache fastCRW responses in LlamaIndex?

Yes. Implement a caching layer (file, Redis, or LlamaIndex's DocStore) keyed by URL. Before calling fastCRW, check if the URL was scraped recently. Cache hits save API calls.
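
A minimal in-memory version of such a cache might look like the sketch below. The class name ScrapeCache and the one-hour default are assumptions; the fetch callable is whatever function returns Markdown for a URL, and a file or Redis backend would replace the dictionary.

```python
import time
from typing import Callable, Dict, Tuple


class ScrapeCache:
    """In-memory cache of scraped Markdown, keyed by URL, with a time-to-live."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # Maps URL -> (time stored, content).
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        """Return cached content if it is still fresh, otherwise fetch and store it."""
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # Cache hit: no API call is made.
        content = self._fetch(url)
        self._entries[url] = (now, content)
        return content
```

For example, ScrapeCache(scrape_url).get("https://example.com") would call the scrape_url function from the agent tool example once and serve repeat requests for the same URL from memory until the entry expires.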

Does fastCRW work with LlamaIndex's VectorStoreIndex?

Yes. Load pages with the fastCRW reader, then pass the Documents to VectorStoreIndex.from_documents; LlamaIndex chunks and embeds them and stores the vectors. Queries retrieve the chunks most similar to the question and pass them to the LLM.
