LlamaIndex Web Scraping Integration — fastCRW [Firecrawl-Compatible]
Build a custom LlamaIndex reader that calls fastCRW for HTTP scraping, JS rendering, and recursive crawling, or wrap fastCRW scrape/crawl in a LlamaIndex tool for agent workflows. The same reader works in loading pipelines, RAG queries, and agent tools, and fastCRW's clean Markdown output chunks and embeds directly. 6.6 MB RAM, 92% coverage on a 1,000-URL benchmark.
Why LlamaIndex + fastCRW
LlamaIndex is the data framework for LLM applications — it connects your models to documents, APIs, and databases. fastCRW plugs in as the web scraping primitive. Instead of building your own scraper or using a heavyweight Firecrawl container, you write a lightweight LlamaIndex reader that calls fastCRW's REST API. The output is clean Markdown that feeds directly into chunking, embedding, and vector storage.
The pattern: ingest live web pages via fastCRW, let LlamaIndex chunk and embed them, query the vector store, and pass retrieved context to your LLM. Unlike batch scraping tools, fastCRW's on-demand scraping keeps your RAG index fresh with zero infrastructure overhead.
Setup
- Install LlamaIndex and dependencies.
- Sign up at fastcrw.com for an API key.
- Export FASTCRW_API_KEY in your shell.
- Implement a custom LlamaIndex reader class that calls fastCRW.
- Use the reader in your data loading and agent pipelines.
pip install llama-index requests
export FASTCRW_API_KEY="fcrw_..."
Code Example: Custom LlamaIndex Reader for fastCRW
Create a file fastcrw_reader.py:
import os
import requests
from typing import List, Any
from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document
class FastCRWReader(BaseReader):
"""
A LlamaIndex reader that scrapes URLs via fastCRW REST API.
Supports single-page scrape and recursive crawls.
"""
def __init__(self, api_key: str | None = None, base_url: str = "https://fastcrw.com"):
self.api_key = api_key or os.environ.get("FASTCRW_API_KEY")
if not self.api_key:
raise ValueError("FASTCRW_API_KEY not set")
self.base_url = base_url
def load_data(
self,
url: str,
mode: str = "scrape",
max_depth: int | None = None,
formats: List[str] | None = None,
**kwargs: Any,
) -> List[Document]:
"""
Load and scrape URL(s) via fastCRW.
Args:
url: The target URL to scrape or crawl.
mode: "scrape" (single page) or "crawl" (recursive).
max_depth: For crawl mode, max crawl depth.
formats: Output formats (default: ["markdown"]).
**kwargs: Additional fastCRW parameters.
Returns:
List of LlamaIndex Document objects with scraped content.
"""
if formats is None:
formats = ["markdown"]
headers = {
"Authorization": f"Bearer {self.api_key}",
"Content-Type": "application/json",
}
if mode == "scrape":
# Single-page scrape
payload = {
"url": url,
"formats": formats,
**kwargs,
}
response = requests.post(
f"{self.base_url}/api/v1/scrape",
json=payload,
headers=headers,
timeout=60,
)
response.raise_for_status()
data = response.json()
doc_content = data["data"].get("markdown") or data["data"].get("html", "")
return [
Document(
text=doc_content,
metadata={
"url": url,
"status_code": data["data"].get("status_code"),
"load_time_ms": data["data"].get("load_time_ms"),
},
)
]
elif mode == "crawl":
# Recursive crawl
payload = {
"url": url,
"formats": formats,
**kwargs,
}
if max_depth is not None:
payload["max_depth"] = max_depth
response = requests.post(
f"{self.base_url}/api/v1/crawl",
json=payload,
headers=headers,
timeout=300, # Crawls can take longer
)
response.raise_for_status()
data = response.json()
# Map crawl results to LlamaIndex documents
documents = []
for page in data["data"]:
doc_content = page.get("markdown") or page.get("html", "")
documents.append(
Document(
text=doc_content,
metadata={
"url": page.get("url"),
"status_code": page.get("status_code"),
"load_time_ms": page.get("load_time_ms"),
},
)
)
return documents
else:
raise ValueError(f"Unknown mode: {mode}. Use 'scrape' or 'crawl'.")
Usage Example: RAG Pipeline with fastCRW
from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from fastcrw_reader import FastCRWReader
# Configure embedding and LLM
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o-mini")
# Instantiate the fastCRW reader
reader = FastCRWReader()
# Load documents from a website
documents = reader.load_data(
url="https://docs.python.org/3/",
mode="crawl",
max_depth=2, # Limit crawl depth for demo
)
# Create a vector index from scraped content
index = VectorStoreIndex.from_documents(documents)
# Query the index
query_engine = index.as_query_engine()
response = query_engine.query("What are Python's built-in functions?")
print(response)
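Re-running the pipeline above re-scrapes every page on each run. A small on-disk cache keyed by URL avoids paying for unchanged pages between runs. This is a stdlib-only sketch; the cache directory name and `fetch_fn` hook are illustrative, with `fetch_fn` standing in for whatever actually calls fastCRW (e.g. the reader or a raw API call):

```python
import hashlib
import json
from pathlib import Path

# Hypothetical cache location -- pick whatever fits your project layout.
CACHE_DIR = Path(".fastcrw_cache")

def cached_scrape(url: str, fetch_fn) -> str:
    """Return cached Markdown for url, calling fetch_fn(url) only on a miss."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["markdown"]
    markdown = fetch_fn(url)
    path.write_text(json.dumps({"url": url, "markdown": markdown}))
    return markdown
```

For freshness-sensitive content you would add an age check on the cache file before trusting it; the sketch keeps entries forever.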
Agent Tool Example: In-Loop Scraping
Expose fastCRW scraping as a LlamaIndex agent tool:
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI
import requests
import os
def scrape_url(url: str) -> str:
"""
Scrape a single URL via fastCRW and return Markdown content.
"""
api_key = os.environ["FASTCRW_API_KEY"]
response = requests.post(
"https://fastcrw.com/api/v1/scrape",
headers={"Authorization": f"Bearer {api_key}"},
json={"url": url, "formats": ["markdown"]},
timeout=60,
)
response.raise_for_status()
data = response.json()
return data["data"]["markdown"]
def search_web(query: str, limit: int = 5) -> str:
"""
Search the web via fastCRW and return results.
"""
api_key = os.environ["FASTCRW_API_KEY"]
response = requests.post(
"https://fastcrw.com/api/v1/search",
headers={"Authorization": f"Bearer {api_key}"},
json={"query": query, "limit": limit},
timeout=30,
)
response.raise_for_status()
results = response.json()["data"]["results"]
return "\n".join(
[f"- {r['title']}: {r['url']}" for r in results[:limit]]
)
# Create tools
scrape_tool = FunctionTool.from_defaults(fn=scrape_url)
search_tool = FunctionTool.from_defaults(fn=search_web)
# Create agent
llm = OpenAI(model="gpt-4o-mini")
agent = ReActAgent.from_tools(
tools=[scrape_tool, search_tool],
llm=llm,
context="You are a research assistant. Use fastCRW tools to fetch live web content and answer questions based on current data.",
)
# Run agent reasoning loop
response = agent.chat("What are the latest Python 3.14 features?")
print(response)
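A raised exception inside a tool (e.g. an HTTP 500 from `response.raise_for_status()`) typically aborts the agent's whole turn, whereas a ReAct-style agent can reason about an error returned as text. One way to get that behavior, as a generic sketch (the wrapper name and error prefix are mine, not a LlamaIndex API):

```python
from typing import Callable

def safe_tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Wrap a tool function so failures come back as readable text."""
    def wrapper(*args, **kwargs) -> str:
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # surface the failure to the agent instead of raising
            return f"TOOL_ERROR: {type(exc).__name__}: {exc}"
    # Preserve name/docstring so the tool schema stays descriptive.
    wrapper.__name__ = fn.__name__
    wrapper.__doc__ = fn.__doc__
    return wrapper

# usage with the tools above, e.g.:
#   scrape_tool = FunctionTool.from_defaults(fn=safe_tool(scrape_url))
```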
Streaming Crawl Example
For large crawls, hand each page to a callback as you walk the response instead of accumulating a full Document list in your own code (the crawl response itself still arrives in one batch):
import requests
import os
from typing import Callable, Optional
from llama_index.core import Document
def crawl_streaming(
url: str,
max_depth: int = 2,
process_doc: Optional[Callable[[Document], None]] = None,
) -> int:
"""
Crawl a site via fastCRW and stream documents to a callback.
Args:
url: Starting URL.
max_depth: Crawl depth limit.
process_doc: Callback function (Document) -> None.
Returns:
Total pages crawled.
"""
api_key = os.environ["FASTCRW_API_KEY"]
payload = {
"url": url,
"formats": ["markdown"],
"max_depth": max_depth,
}
response = requests.post(
"https://fastcrw.com/api/v1/crawl",
headers={"Authorization": f"Bearer {api_key}"},
json=payload,
timeout=300,
)
response.raise_for_status()
data = response.json()
count = 0
for page in data["data"]:
doc = Document(
text=page.get("markdown") or page.get("html", ""),
metadata={"url": page.get("url")},
)
if process_doc:
process_doc(doc)
count += 1
return count
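A useful `process_doc` callback buffers pages and flushes them in fixed-size batches, so embedding calls amortize over many documents without holding the whole crawl in memory. A minimal sketch; `flush_fn` is a placeholder for whatever consumes a batch (an embedding call, `index.insert`, a DB write):

```python
class BatchingCallback:
    """Buffer documents and hand them to flush_fn in fixed-size batches."""

    def __init__(self, flush_fn, batch_size: int = 32):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.buffer = []

    def __call__(self, doc) -> None:
        self.buffer.append(doc)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Drain the buffer; call once more after the crawl for the final partial batch."""
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

# usage: cb = BatchingCallback(my_flush, batch_size=16)
#        crawl_streaming(url, process_doc=cb); cb.flush()
```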
When to Use This
- RAG systems with live web content — scrape documentation, blogs, and news into a vector index.
- Multi-document QA — crawl a website and build a searchable knowledge base.
- Data pipelines — ingest web content at intervals (daily, hourly) without manual crawler maintenance.
- Agent research loops — let agents scrape and summarize in real time during reasoning.
- Competitive intelligence — scrape multiple competitor sites and aggregate with LlamaIndex tools.
- Content curation — scrape RSS feeds and webpages, chunk, embed, and surface via semantic search.
Limits + Gotchas
- Crawl timeouts — large crawls can take 30+ seconds. Set appropriate request timeouts in your reader.
- Rate limiting — fastCRW enforces API rate limits. Implement exponential backoff if you hit 429 responses.
- Memory usage — storing all pages from a large crawl in memory can be expensive. Use streaming or pagination.
- Metadata mapping — fastCRW and LlamaIndex use different metadata schemas. Map custom fields explicitly in the Document constructor.
- Duplicate URLs — the crawl endpoint may return duplicates. Deduplicate by URL before indexing.
- JS rendering — static HTTP scraping is the default. For JS-heavy sites, request LightPanda or Chrome rendering (incurs cost).
- Authentication — fastCRW cannot authenticate automatically. Scrape public content or use session persistence for authenticated flows.
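Two of the gotchas above, rate limiting and duplicate URLs, have small mechanical fixes. These helpers are sketches (names and signatures are mine): an exponential backoff schedule with a retry wrapper, and a first-hit-wins dedup over crawl results.

```python
import time
from typing import Callable

def backoff_delays(retries: int, base: float = 1.0) -> list:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(retries)]

def with_backoff(call: Callable, retries: int = 3, base: float = 1.0):
    """Retry a callable that raises on failure (e.g. an HTTP 429),
    sleeping between attempts; the final attempt propagates its exception."""
    for delay in backoff_delays(retries, base):
        try:
            return call()
        except Exception:
            time.sleep(delay)
    return call()

def dedupe_by_url(pages: list) -> list:
    """Drop crawl results whose 'url' was already seen, keeping the first hit."""
    seen, unique = set(), []
    for page in pages:
        url = page.get("url")
        if url not in seen:
            seen.add(url)
            unique.append(page)
    return unique
```

In practice you would inspect `response.status_code` and only retry on 429/5xx; the sketch retries on any exception for brevity.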
Performance Notes
- Single-page scrape: 833 ms median on HTTP content.
- Crawl speed: Depends on site complexity and depth. Budget 1–5 seconds per page.
- Embedding latency: Chunking and embedding 100 pages adds ~10–30 seconds with OpenAI embeddings.
- Vector store: LlamaIndex supports in-memory (demo), Chroma, Pinecone, and others. Choose based on scale.