LlamaIndex Web Scraping Integration — fastCRW [Firecrawl-Compatible]
Build a custom LlamaIndex web reader that calls fastCRW for live ingestion, or wrap fastCRW scrape/crawl in a LlamaIndex tool for agent workflows. Markdown output feeds directly into embeddings pipelines. Small single static binary, local-first, self-host free under AGPL-3.0.
Build a custom LlamaIndex reader that calls fastCRW for HTTP scraping, JS rendering, and crawling. Use the same reader in loading pipelines, RAG queries, and agent tools. fastCRW outputs clean Markdown that embeds and chunks cleanly.
Why LlamaIndex + fastCRW
LlamaIndex is the data framework for LLM applications — it connects your models to documents, APIs, and databases. fastCRW plugs in as the web scraping primitive. Instead of building your own scraper or using a heavyweight Firecrawl container, you write a lightweight LlamaIndex reader that calls fastCRW's REST API. The output is clean Markdown that feeds directly into chunking, embedding, and vector storage.
The pattern: ingest live web pages via fastCRW, let LlamaIndex chunk and embed them, query the vector store, and pass the retrieved context to your LLM. Unlike batch scraping tools, fastCRW's on-demand scraping lets you refresh the RAG index whenever you need to, without maintaining your own crawler infrastructure.
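To illustrate why Markdown output is convenient for chunking, here is a minimal, stdlib-only sketch that splits a Markdown string at its headings. The function name `split_markdown_by_heading` and the sample text are hypothetical. In a real pipeline LlamaIndex's own node parsers would handle this step.

```python
import re
from typing import List, Tuple

# Matches ATX headings such as "# Title" or "### Sub-section".
_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) chunks, one per heading.

    Text before the first heading is returned under an empty heading.
    Illustrative only: headings inside fenced code blocks are not skipped.
    """
    chunks: List[Tuple[str, str]] = []
    heading = ""
    body: List[str] = []
    for line in markdown.splitlines():
        match = _HEADING.match(line)
        if match:
            # Close the previous chunk before starting a new one.
            if heading or any(part.strip() for part in body):
                chunks.append((heading, "\n".join(body).strip()))
            heading = match.group(2).strip()
            body = []
        else:
            body.append(line)
    # Emit the final chunk, if there is one.
    if heading or any(part.strip() for part in body):
        chunks.append((heading, "\n".join(body).strip()))
    return chunks


sample = "Intro text.\n\n# Install\nRun pip.\n\n## Verify\nCheck the version.\n"
for title, text in split_markdown_by_heading(sample):
    print(repr(title), "->", repr(text))
```

Because each chunk carries its heading, the heading can be stored as metadata next to the embedded text, which tends to make retrieved passages easier to attribute.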
Setup
- Install LlamaIndex and dependencies.
- Sign up at fastcrw.com for an API key.
- Export FASTCRW_API_KEY in your shell.
- Implement a custom LlamaIndex reader class that calls fastCRW.
- Use the reader in your data loading and agent pipelines.
pip install llama-index requests
export FASTCRW_API_KEY="fcrw_..."
Code Example: Custom LlamaIndex Reader for fastCRW
Create a file fastcrw_reader.py:
import os
import time
import requests
from typing import List, Any
from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document


class FastCRWReader(BaseReader):
    """
    A LlamaIndex reader that scrapes URLs via fastCRW REST API.
    Supports single-page scrape and recursive crawls.
    """

    def __init__(self, api_key: str | None = None, base_url: str = "https://fastcrw.com"):
        self.api_key = api_key or os.environ.get("FASTCRW_API_KEY")
        if not self.api_key:
            raise ValueError("FASTCRW_API_KEY not set")
        self.base_url = base_url

    def load_data(
        self,
        url: str,
        mode: str = "scrape",
        limit: int | None = None,
        formats: List[str] | None = None,
        **kwargs: Any,
    ) -> List[Document]:
        """
        Load and scrape URL(s) via fastCRW.

        Args:
            url: The target URL to scrape or crawl.
            mode: "scrape" (single page) or "crawl" (recursive).
            limit: For crawl mode, max number of pages to crawl.
            formats: Output formats (default: ["markdown"]).
            **kwargs: Additional fastCRW parameters.

        Returns:
            List of LlamaIndex Document objects with scraped content.
        """
        if formats is None:
            formats = ["markdown"]
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }

        if mode == "scrape":
            # Single-page scrape
            payload = {
                "url": url,
                "formats": formats,
                **kwargs,
            }
            response = requests.post(
                f"{self.base_url}/api/v1/scrape",
                json=payload,
                headers=headers,
                timeout=60,
            )
            response.raise_for_status()
            data = response.json()
            # Prefer Markdown; fall back to HTML if Markdown was not returned.
            doc_content = data["data"].get("markdown") or data["data"].get("html", "")
            meta = data["data"].get("metadata", {})
            return [
                Document(
                    text=doc_content,
                    metadata={
                        "url": url,
                        "status_code": meta.get("statusCode"),
                        "elapsed_ms": meta.get("elapsedMs"),
                    },
                )
            ]

        elif mode == "crawl":
            # Crawl is always async: POST returns a job id, then poll.
            payload = {
                "url": url,
                "scrapeOptions": {"formats": formats},
                **kwargs,
            }
            if limit is not None:
                payload["limit"] = limit
            start = requests.post(
                f"{self.base_url}/api/v1/crawl",
                json=payload,
                headers=headers,
                timeout=60,
            )
            start.raise_for_status()
            job_id = start.json()["id"]

            # Poll until the crawl completes. Give up after 5 minutes so a
            # stalled job cannot block the caller forever.
            deadline = time.monotonic() + 300
            while True:
                poll = requests.get(
                    f"{self.base_url}/api/v1/crawl/{job_id}",
                    headers=headers,
                    timeout=60,
                )
                poll.raise_for_status()
                data = poll.json()
                status = data.get("status")
                if status == "completed":
                    break
                # Assumption: Firecrawl-style terminal status values.
                if status in ("failed", "cancelled"):
                    raise RuntimeError(f"Crawl {job_id} ended with status: {status}")
                if time.monotonic() > deadline:
                    raise TimeoutError(f"Crawl {job_id} did not complete in time")
                time.sleep(2)

            # Map crawl results to LlamaIndex documents
            documents = []
            for page in data["data"]:
                doc_content = page.get("markdown") or page.get("html", "")
                page_meta = page.get("metadata", {})
                documents.append(
                    Document(
                        text=doc_content,
                        metadata={
                            "url": page_meta.get("sourceURL") or page_meta.get("url"),
                            "status_code": page_meta.get("statusCode"),
                            "elapsed_ms": page_meta.get("elapsedMs"),
                        },
                    )
                )
            return documents

        else:
            raise ValueError(f"Unknown mode: {mode}. Use 'scrape' or 'crawl'.")
Usage Example: RAG Pipeline with fastCRW
from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from fastcrw_reader import FastCRWReader
# Configure embedding and LLM
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o-mini")
# Instantiate the fastCRW reader
reader = FastCRWReader()
# Load documents from a website
documents = reader.load_data(
url="https://docs.python.org/3/",
mode="crawl",
limit=25, # Cap pages for demo
)
# Create a vector index from scraped content
index = VectorStoreIndex.from_documents(documents)
# Query the index
query_engine = index.as_query_engine()
response = query_engine.query("What are Python's built-in functions?")
print(response)
Agent Tool Example: In-Loop Scraping
Expose fastCRW scraping as a LlamaIndex agent tool:
import os
import requests
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI


def scrape_url(url: str) -> str:
    """
    Scrape a single URL via fastCRW and return Markdown content.
    """
    api_key = os.environ["FASTCRW_API_KEY"]
    response = requests.post(
        "https://api.fastcrw.com/v1/scrape",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"url": url, "formats": ["markdown"]},
        timeout=60,
    )
    response.raise_for_status()
    data = response.json()
    # Return an empty string rather than raising if no Markdown came back.
    return data["data"].get("markdown") or ""


def search_web(query: str, limit: int = 5) -> str:
    """
    Search the web via fastCRW and return results.
    """
    api_key = os.environ["FASTCRW_API_KEY"]
    response = requests.post(
        "https://api.fastcrw.com/v1/search",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"query": query, "limit": limit},
        timeout=30,
    )
    response.raise_for_status()
    results = response.json()["data"]
    return "\n".join(
        [f"- {r['title']}: {r['url']}\n {r['description']}" for r in results[:limit]]
    )


# Create tools
scrape_tool = FunctionTool.from_defaults(fn=scrape_url)
search_tool = FunctionTool.from_defaults(fn=search_web)

# Create agent. ReActAgent.from_tools is the constructor for the chat-style
# ReActAgent; the system instructions are passed via `context`. Newer
# llama-index releases move agents to llama_index.core.agent.workflow with an
# async run() API, so adjust this to your installed version.
llm = OpenAI(model="gpt-4o-mini")
agent = ReActAgent.from_tools(
    tools=[scrape_tool, search_tool],
    llm=llm,
    context="You are a research assistant. Use fastCRW tools to fetch live web content and answer questions based on current data.",
)

# Run agent reasoning loop
response = agent.chat("What are the latest Python 3.14 features?")
print(response)
Streaming Crawl Example
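For large crawls, hand each page to a callback instead of accumulating every Document in a list. Note that this example still polls until the crawl job completes before it iterates over the pages: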
import os
import time
import requests
from typing import Callable, Optional
from llama_index.core.schema import Document


def crawl_streaming(
    url: str,
    limit: int = 25,
    process_doc: Optional[Callable[[Document], None]] = None,
) -> int:
    """
    Crawl a site via fastCRW and stream documents to a callback.

    Args:
        url: Starting URL.
        limit: Max number of pages to crawl.
        process_doc: Callback function (Document) -> None.

    Returns:
        Total pages crawled.
    """
    api_key = os.environ["FASTCRW_API_KEY"]
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {
        "url": url,
        "limit": limit,
        "scrapeOptions": {"formats": ["markdown"]},
    }

    # Crawl is always async: start the job, then poll for completion.
    start = requests.post(
        "https://api.fastcrw.com/v1/crawl",
        headers=headers,
        json=payload,
        timeout=60,
    )
    start.raise_for_status()
    job_id = start.json()["id"]

    # Give up after 5 minutes so a stalled job cannot block forever.
    deadline = time.monotonic() + 300
    while True:
        poll = requests.get(
            f"https://api.fastcrw.com/v1/crawl/{job_id}",
            headers=headers,
            timeout=60,
        )
        poll.raise_for_status()
        data = poll.json()
        if data.get("status") == "completed":
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"Crawl {job_id} did not complete in time")
        time.sleep(2)

    count = 0
    for page in data["data"]:
        page_meta = page.get("metadata", {})
        doc = Document(
            # Prefer Markdown; fall back to HTML if Markdown is missing or empty.
            text=page.get("markdown") or page.get("html", ""),
            metadata={"url": page_meta.get("sourceURL") or page_meta.get("url")},
        )
        if process_doc:
            process_doc(doc)
        count += 1
    return count
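A callback passed as `process_doc` can keep memory bounded by flushing documents in small batches. The sketch below is stdlib-only. `BatchBuffer` and its `sink` callable are hypothetical names, and plain strings stand in for Document objects so the example runs without network access.

```python
from typing import Any, Callable, List


class BatchBuffer:
    """Collect items and hand them to `sink` in fixed-size batches."""

    def __init__(self, sink: Callable[[List[Any]], None], batch_size: int = 10):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.sink = sink
        self.batch_size = batch_size
        self._items: List[Any] = []

    def add(self, item: Any) -> None:
        """Buffer one item and flush when the batch is full."""
        self._items.append(item)
        if len(self._items) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send any buffered items to the sink and start a new buffer."""
        if self._items:
            self.sink(self._items)
            self._items = []


# Demo with plain strings standing in for Document objects.
batches: List[List[str]] = []
buffer = BatchBuffer(sink=batches.append, batch_size=2)
for page in ["page-1", "page-2", "page-3"]:
    buffer.add(page)
buffer.flush()  # Flush the final partial batch.
print(batches)
```

With the function above, this could be wired up as `crawl_streaming(url, process_doc=buffer.add)` followed by a final `buffer.flush()`, with the sink writing each batch to your index or store of choice.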
When to Use This
- RAG systems with live web content — scrape documentation, blogs, and news into a vector index.
- Multi-document QA — crawl a website and build a searchable knowledge base.
- Data pipelines — ingest web content at intervals (daily, hourly) without manual crawler maintenance.
- Agent research loops — let agents scrape and summarize in real time during reasoning.
- Competitive intelligence — scrape multiple competitor sites and aggregate with LlamaIndex tools.
- Content curation — scrape RSS feeds and webpages, chunk, embed, and surface via semantic search.
Limits + Gotchas
- Crawl timeouts — large crawls can take 30+ seconds. Set appropriate request timeouts in your reader.
- Rate limiting — fastCRW enforces API rate limits. Implement exponential backoff if you hit 429 responses.
- Memory usage — storing all pages from a large crawl in memory can be expensive. Use streaming or pagination.
- Metadata mapping — fastCRW and LlamaIndex use different metadata schemas. Map custom fields explicitly in the Document constructor.
- Duplicate URLs — the crawl endpoint may return duplicates. Deduplicate by URL before indexing.
- JS rendering — static HTTP scraping is the default. For JS-heavy sites, request LightPanda or Chrome rendering (incurs cost).
- Authentication — fastCRW cannot authenticate automatically. Scrape public content or use session persistence for authenticated flows.
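Two of the gotchas above, rate limiting and duplicate URLs, can be handled with small helpers. This is a minimal sketch rather than a definitive implementation: `call_with_backoff` and `dedupe_by_url` are hypothetical names, `send` is an assumed zero-argument callable returning `(status_code, body)`, and the only URL normalization applied is dropping the `#fragment`.

```python
import random
import time
from typing import Callable, Iterable, List, Tuple, TypeVar
from urllib.parse import urldefrag

T = TypeVar("T")


def call_with_backoff(
    send: Callable[[], Tuple[int, object]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Call `send` until it stops returning HTTP 429.

    The delay doubles on each retry, with up to 50% random jitter added.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return body
        if attempt < max_attempts - 1:
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError(f"Still rate limited after {max_attempts} attempts")


def dedupe_by_url(items: Iterable[T], get_url: Callable[[T], str]) -> List[T]:
    """Keep the first item seen for each URL, ignoring any #fragment."""
    seen = set()
    unique: List[T] = []
    for item in items:
        key, _fragment = urldefrag(get_url(item) or "")
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Demo: a fake sender that is rate limited twice, then succeeds.
responses = iter([(429, None), (429, None), (200, {"ok": True})])
delays: List[float] = []
result = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, "after", len(delays), "retries")
```

For documents produced by the reader, the URL lives in the metadata, so a call such as `dedupe_by_url(documents, lambda d: d.metadata["url"])` would apply before indexing.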
Performance Notes
- Single-page scrape: low-latency on HTTP content; see /benchmarks for the full distribution.
- Crawl speed: Depends on site complexity and depth. Budget 1–5 seconds per page.
- Embedding latency: Chunking and embedding 100 pages adds ~10–30 seconds with OpenAI embeddings.
- Vector store: LlamaIndex supports in-memory (demo), Chroma, Pinecone, and others. Choose based on scale.
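The ranges above translate into a rough time budget. This is a back-of-the-envelope sketch that assumes pages are crawled one after another (actual concurrency is not stated here) and that embedding time scales linearly from the 100-page figure. `estimate_ingest_seconds` is a hypothetical helper.

```python
from typing import Tuple

# Figures taken from the performance notes above.
CRAWL_SECONDS_PER_PAGE = (1.0, 5.0)          # budget of 1-5 s per page
EMBED_SECONDS_PER_100_PAGES = (10.0, 30.0)   # roughly 10-30 s per 100 pages


def estimate_ingest_seconds(pages: int) -> Tuple[float, float]:
    """Return a (low, high) estimate in seconds for crawling plus embedding."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    low = pages * CRAWL_SECONDS_PER_PAGE[0] + pages * EMBED_SECONDS_PER_100_PAGES[0] / 100
    high = pages * CRAWL_SECONDS_PER_PAGE[1] + pages * EMBED_SECONDS_PER_100_PAGES[1] / 100
    return low, high


low, high = estimate_ingest_seconds(25)
print(f"25 pages: roughly {low:.1f}-{high:.1f} s")
```

Under these assumptions the crawl dominates the total, so the page limit is the main lever for keeping a refresh job short.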
