LangChain Web Scraping Integration — fastCRW [Firecrawl-Compatible]
Wire LangChain document loaders, retrievers, and agent tools into fastCRW with the official langchain-crw package — or a Firecrawl-compatible api_url override. A single static Rust binary, local-first, self-host free under AGPL-3.0.
Use fastCRW as the scraping primitive under LangChain document loaders, retrievers, and agent tools: install the official langchain-crw package, or point the community FireCrawlLoader at fastCRW with a one-line api_url override.
Verdict
LangChain is the dominant orchestration layer for retrieval pipelines and agent tools, and fastCRW is the scraping primitive that sits underneath it, turning live URLs into clean Markdown that LangChain can chunk, embed, and reason over. Install the official langchain-crw package for fastCRW-native loaders, or point the community FireCrawlLoader at fastCRW with a one-line api_url override, since the two APIs are wire-compatible on the scrape, crawl, map, and search surface. Either path keeps every chain, retriever, and agent loop you already wrote and swaps a multi-hundred-MB scraper container for a single ~8 MB static Rust binary that is local-first and self-hostable for free under AGPL-3.0.
Who This Is For
- Teams building RAG over web content: you need live docs, knowledge bases, or news turned into embeddable text without standing up a separate scraping service.
- Developers writing LangChain agents that browse: your agent needs a scrape or search tool it can call mid-reasoning to ground its answers.
- Firecrawl users on LangChain: you already use FireCrawlLoader and want to cut runtime cost and image size without rewriting pipeline code.
- Self-hosting / local-first shops: you want the whole ingestion path on your own infrastructure, including offline and air-gapped environments.
Setup
1. Install LangChain and langchain-crw
pip install -U langchain langchain-crw langchain-text-splitters
The recommended path is the official langchain-crw package, which ships fastCRW-native document loaders and tools. If you are migrating an existing Firecrawl project and want the smallest possible diff, you can instead keep langchain-community (pip install langchain-community) and override the base URL. The examples below use that override path.
2. Provision a fastCRW API key
Sign up at fastcrw.com, copy the API key from the dashboard (it starts with fcrw_), and export it:
export FASTCRW_API_KEY="fcrw_..."
The free tier ships 500 one-time lifetime credits — enough to validate an ingestion pipeline end to end. A plain scrape is 1 credit; crawl is 1 credit per page; search is 1 credit per query.
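To make the credit arithmetic concrete, here is a minimal sketch of a budget estimator built on the per-operation costs quoted above. The function name estimate_credits and its parameters are hypothetical helpers for illustration, not part of any fastCRW or LangChain API.

```python
# Per-operation credit costs as quoted in this guide.
CREDITS_PER_SCRAPE = 1
CREDITS_PER_CRAWLED_PAGE = 1
CREDITS_PER_SEARCH = 1
FREE_TIER_CREDITS = 500  # one-time lifetime allowance on the free tier


def estimate_credits(scrapes: int = 0, crawled_pages: int = 0, searches: int = 0) -> int:
    """Return the total credits a planned workload would consume (hypothetical helper)."""
    if min(scrapes, crawled_pages, searches) < 0:
        raise ValueError("counts must be non-negative")
    return (
        scrapes * CREDITS_PER_SCRAPE
        + crawled_pages * CREDITS_PER_CRAWLED_PAGE
        + searches * CREDITS_PER_SEARCH
    )


# Example workload: 20 single-page scrapes, one 50-page crawl, and 10 searches.
planned = estimate_credits(scrapes=20, crawled_pages=50, searches=10)
remaining = FREE_TIER_CREDITS - planned
print(f"Planned spend: {planned} credits, {remaining} left on the free tier")
```

Running the estimate before a large crawl is a cheap way to confirm that a validation run fits inside the free allowance.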
Document Loading for RAG
The most common LangChain use is ingestion: pull a page (or a whole docs section), split it, embed it, and store the vectors. fastCRW's Markdown output is built for exactly this — onlyMainContent is on by default, so nav and boilerplate never reach your splitter.
import os

from langchain_community.document_loaders import FireCrawlLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# fastCRW is Firecrawl-compatible: override api_url and the rest is identical.
loader = FireCrawlLoader(
    api_key=os.environ["FASTCRW_API_KEY"],
    api_url="https://api.fastcrw.com",
    url="https://example.com/blog",
    mode="scrape",  # "scrape" for one page, "crawl" for a whole section
)
docs = loader.load()

# Split the Markdown into overlapping chunks ready for embedding.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

print(f"Loaded {len(docs)} document(s) from fastCRW")
print(f"Split into {len(chunks)} chunks for the vector store")
To ingest an entire documentation site instead of one page, switch to crawl mode and bound the page count explicitly — fastCRW's crawl is async and capped, so an unbounded job cannot run away with your credits:
loader = FireCrawlLoader(
    api_key=os.environ["FASTCRW_API_KEY"],
    api_url="https://api.fastcrw.com",
    url="https://example.com/docs",
    mode="crawl",
    params={"limit": 50, "maxDepth": 3},  # maxPages cap is 1000, maxDepth cap is 10
)
docs = loader.load()
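The caps noted in the comment above can also be enforced client-side before a job is submitted. The following is a minimal sketch: bounded_crawl_params is a hypothetical helper, and the cap values are copied from that comment (1000 pages, depth 10) rather than read from the API.

```python
MAX_PAGES_CAP = 1000  # page cap quoted in this guide
MAX_DEPTH_CAP = 10    # depth cap quoted in this guide


def bounded_crawl_params(limit: int, max_depth: int) -> dict:
    """Build a crawl params dict, clamping both values to the quoted caps."""
    if limit < 1 or max_depth < 1:
        raise ValueError("limit and max_depth must be at least 1")
    return {
        "limit": min(limit, MAX_PAGES_CAP),
        "maxDepth": min(max_depth, MAX_DEPTH_CAP),
    }


print(bounded_crawl_params(50, 3))      # within the caps, passed through
print(bounded_crawl_params(5000, 25))   # clamped down to the caps
```

The returned dict has the same shape as the params argument in the crawl example, so it can be passed to the loader directly.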
From here the documents flow into any LangChain vector store unchanged:
# Requires the langchain-openai and faiss-cpu packages, plus OPENAI_API_KEY in the environment.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
fastCRW as a LangChain Agent Tool
When an agent needs to fetch a page it discovers mid-reasoning, expose fastCRW as a tool. The @tool decorator turns a plain function into something a LangChain agent can call:
import os

import requests
from langchain_core.tools import tool

FASTCRW_BASE = "https://api.fastcrw.com"
HEADERS = {"Authorization": f"Bearer {os.environ['FASTCRW_API_KEY']}"}


@tool
def fastcrw_scrape(url: str) -> str:
    """Scrape a single URL via fastCRW and return clean Markdown."""
    r = requests.post(
        f"{FASTCRW_BASE}/v1/scrape",
        headers=HEADERS,
        json={"url": url, "formats": ["markdown"], "onlyMainContent": True},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["data"]["markdown"]


@tool
def fastcrw_search(query: str) -> list[dict]:
    """Search the live web via fastCRW. Returns ranked results with URLs."""
    r = requests.post(
        f"{FASTCRW_BASE}/v1/search",
        headers=HEADERS,
        json={"query": query, "limit": 5},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["data"]
Bind both tools to an agent and it can search for sources, then scrape the most promising ones — a complete research loop without leaving LangChain:
# Written against the classic LangChain agent API; newer releases may relocate these imports.
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import ChatOpenAI

tools = [fastcrw_search, fastcrw_scrape]
hub_prompt = hub.pull("hwchase17/react")  # any ReAct-style prompt works here

agent = create_react_agent(
    ChatOpenAI(model="gpt-4o-mini"),
    tools=tools,
    prompt=hub_prompt,
)
executor = AgentExecutor(agent=agent, tools=tools)
executor.invoke({"input": "Summarize the latest changes to the MCP spec."})
Why fastCRW Under LangChain
For RAG, extraction quality is not cosmetic — every junk chunk (a cookie banner, a nav menu, a duplicated footer) is a retrieval false positive that crowds out a real answer. On Firecrawl's own public 1,000-URL scrape-content-dataset-v1, scored by the open diagnose_3way.py harness on 2026-05-08, fastCRW recovered the labeled content on 63.74% of 819 labeled URLs, ahead of Crawl4AI (59.95%) and Firecrawl (56.04%). Median latency was 1914 ms (p50), effectively tied with Crawl4AI and ahead of Firecrawl's 2305 ms. fastCRW publishes the full p50/p90/p99 split — its p90 (14157 ms) is the worst of the three, the disclosed cost of the chrome-stealth fallback that recovers hard pages instead of dropping them. See the 1,000-URL benchmark for the full table and a one-command repro.
LangChain JS
LangChain's JavaScript SDK uses the FireCrawlLoader from @langchain/community. The same override applies: set apiUrl to https://api.fastcrw.com, with field names in camelCase.
import { FireCrawlLoader } from "@langchain/community/document_loaders/web/firecrawl";

const loader = new FireCrawlLoader({
  url: "https://example.com/blog",
  apiKey: process.env.FASTCRW_API_KEY,
  apiUrl: "https://api.fastcrw.com",
  mode: "scrape",
});
const docs = await loader.load();
Limits + Gotchas
- The FireCrawlLoader mode argument supports "scrape" and "crawl". fastCRW crawl is always async and bounded by limit/maxPages, so set it explicitly to keep credit spend predictable.
- Long-running crawls inside an agent loop can blow the agent's iteration budget. Run crawls outside the agent and pass results back through context, or use a background job.
- LangChain document metadata is derived from the fastCRW response. Field names diverge slightly from Firecrawl, so if a pipeline depends on one specific metadata key, check it after migrating.
- Structured extraction (formats: ["json"] with a jsonSchema) costs 5 credits, not 1. Use plain ["markdown"] for retrieval ingestion and reserve JSON extraction for genuinely schema-shaped data.
- search answer mode is managed on paid plans (no key, default DeepSeek deepseek-v4-flash), or bring-your-own-key on any plan. Supply your own llmApiKey and llmProvider if you want to bring your own model instead of ranked results.
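The metadata caveat in the list above can be checked mechanically after a migration. Below is a minimal sketch, assuming only that each loaded document exposes a metadata dict, as LangChain Document objects do. The helper missing_metadata_keys is hypothetical, and the key names in the example are illustrative placeholders, not a statement of what fastCRW returns.

```python
from types import SimpleNamespace


def missing_metadata_keys(docs, required_keys):
    """Map each document index to the required metadata keys it lacks."""
    report = {}
    for index, doc in enumerate(docs):
        missing = [key for key in required_keys if key not in doc.metadata]
        if missing:
            report[index] = missing
    return report


# Stand-in documents; in a real pipeline these would come from loader.load().
docs = [
    SimpleNamespace(page_content="...", metadata={"sourceURL": "https://example.com/a", "title": "A"}),
    SimpleNamespace(page_content="...", metadata={"title": "B"}),
]

# Keys this (hypothetical) pipeline depends on downstream.
print(missing_metadata_keys(docs, ["sourceURL", "title"]))
```

An empty report means every document carries the keys the pipeline relies on. Anything else points at the documents and keys to inspect before re-indexing.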