LangChain Web Scraping Integration — fastCRW [Firecrawl-Compatible]

Wire LangChain document loaders, retrievers, and agent tools into fastCRW with the official langchain-crw package — or a Firecrawl-compatible api_url override. A single static Rust binary, local-first, self-host free under AGPL-3.0.

Published: April 29, 2026
Updated: May 22, 2026
Category: integrations
Verdict

Use fastCRW as the scraping primitive under LangChain document loaders, retrievers, and agent tools — install the official langchain-crw package, or point the community FirecrawlLoader at fastCRW with a one-line api_url override.

  • Official langchain-crw package on PyPI, with fastCRW-native loaders and tools
  • FirecrawlLoader works unchanged via api_url override (overlap surface: /scrape, /crawl, /map, /search)
  • Clean Markdown output drops straight into RecursiveCharacterTextSplitter and embeddings
  • 63.74% truth-recall on Firecrawl's public 1,000-URL dataset, the highest of three tools tested
  • A single ~8 MB static Rust binary instead of a multi-hundred-MB scraper container

Verdict

LangChain is the dominant orchestration layer for retrieval pipelines and agent tools, and fastCRW is the scraping primitive that sits underneath it — turning live URLs into clean Markdown that LangChain can chunk, embed, and reason over. Install the official langchain-crw package for fastCRW-native loaders, or point the community FirecrawlLoader at fastCRW with a one-line api_url override, since the two APIs are wire-compatible on the scrape, crawl, map, and search surface. Either path keeps every chain, retriever, and agent loop you already wrote and swaps a multi-hundred-MB scraper container for a single ~8 MB static Rust binary that is local-first and self-hostable for free under AGPL-3.0.

Who This Is For

  • Teams building RAG over web content — you need live docs, knowledge bases, or news turned into embeddable text without standing up a separate scraping service.
  • Developers writing LangChain agents that browse — your agent needs a scrape or search tool it can call mid-reasoning to ground its answers.
  • Firecrawl users on LangChain — you already use FirecrawlLoader and want to cut runtime cost and image size without rewriting pipeline code.
  • Self-hosting / local-first shops — you want the whole ingestion path on your own infrastructure, including offline and air-gapped environments.

Setup

1. Install LangChain and langchain-crw

pip install -U langchain langchain-crw langchain-text-splitters

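The recommended path is the official langchain-crw package, which ships fastCRW-native document loaders and tools. If you are migrating an existing Firecrawl project and want the smallest possible diff, you can instead keep langchain-community and override the base URL. The loader examples on this page use that override path, which also needs langchain-community and the firecrawl-py client installed (pip install -U langchain-community firecrawl-py).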

2. Provision a fastCRW API key

Sign up at fastcrw.com, copy the API key from the dashboard (it starts with fcrw_), and export it:

export FASTCRW_API_KEY="fcrw_..."

The free tier ships 500 one-time lifetime credits — enough to validate an ingestion pipeline end to end. A plain scrape is 1 credit; crawl is 1 credit per page; search is 1 credit per query.
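
As a rough budgeting aid, the per-operation costs above can be turned into a quick estimate. This is a minimal sketch: the function name and constants are illustrative and not part of any fastCRW SDK, and it covers only the three plain operations priced above.

```python
# Credit costs as stated above; names are illustrative, not an SDK API.
SCRAPE_COST = 1        # per plain scrape
CRAWL_PAGE_COST = 1    # per crawled page
SEARCH_COST = 1        # per search query
FREE_TIER_CREDITS = 500


def estimate_credits(scrapes: int = 0, crawl_pages: int = 0, searches: int = 0) -> int:
    """Return the estimated credit spend for a planned ingestion run."""
    return (
        scrapes * SCRAPE_COST
        + crawl_pages * CRAWL_PAGE_COST
        + searches * SEARCH_COST
    )


planned = estimate_credits(scrapes=20, crawl_pages=150, searches=10)
print(f"Planned spend: {planned} credits")
print(f"Fits in free tier: {planned <= FREE_TIER_CREDITS}")
```

Running the numbers before a first crawl makes it obvious whether a validation run fits inside the free allowance.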

Document Loading for RAG

The most common LangChain use is ingestion: pull a page (or a whole docs section), split it, embed it, and store the vectors. fastCRW's Markdown output is built for exactly this — onlyMainContent is on by default, so nav and boilerplate never reach your splitter.

import os
from langchain_community.document_loaders import FirecrawlLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# fastCRW is Firecrawl-compatible — override api_url and the rest is identical.
loader = FirecrawlLoader(
    api_key=os.environ["FASTCRW_API_KEY"],
    api_url="https://api.fastcrw.com",
    url="https://example.com/blog",
    mode="scrape",  # "scrape" for one page, "crawl" for a whole section
)

docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

print(f"Loaded {len(docs)} document(s) from fastCRW")
print(f"Split into {len(chunks)} chunks for the vector store")

To ingest an entire documentation site instead of one page, switch to crawl mode and bound the page count explicitly — fastCRW's crawl is async and capped, so an unbounded job cannot run away with your credits:

loader = FirecrawlLoader(
    api_key=os.environ["FASTCRW_API_KEY"],
    api_url="https://api.fastcrw.com",
    url="https://example.com/docs",
    mode="crawl",
    params={"limit": 50, "maxDepth": 3},  # maxPages cap is 1000, maxDepth cap is 10
)
docs = loader.load()
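
One way to make the bound hard to forget is to build the params dict through a small helper that clamps values to the caps noted in the comment above. This is a sketch under assumptions: the helper name is hypothetical, and it assumes the 1000-page cap applies to the limit key passed through FirecrawlLoader.

```python
MAX_PAGES_CAP = 1000  # page cap quoted above
MAX_DEPTH_CAP = 10    # depth cap quoted above


def bounded_crawl_params(limit: int, max_depth: int) -> dict:
    """Clamp crawl bounds to the caps quoted above (hypothetical helper)."""
    if limit < 1 or max_depth < 1:
        raise ValueError("limit and max_depth must be positive")
    return {
        "limit": min(limit, MAX_PAGES_CAP),
        "maxDepth": min(max_depth, MAX_DEPTH_CAP),
    }


params = bounded_crawl_params(limit=50, max_depth=3)
print(params)
```

The resulting dict can be passed as params= to the loader, so every crawl in a codebase goes through the same explicit bound.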

From here the documents flow into any LangChain vector store unchanged:

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

fastCRW as a LangChain Agent Tool

When an agent needs to fetch a page it discovers mid-reasoning, expose fastCRW as a tool. The @tool decorator turns a plain function into something a LangChain agent can call:

import os
import requests
from langchain_core.tools import tool

FASTCRW_BASE = "https://api.fastcrw.com"
HEADERS = {"Authorization": f"Bearer {os.environ['FASTCRW_API_KEY']}"}


@tool
def fastcrw_scrape(url: str) -> str:
    """Scrape a single URL via fastCRW and return clean Markdown."""
    r = requests.post(
        f"{FASTCRW_BASE}/v1/scrape",
        headers=HEADERS,
        json={"url": url, "formats": ["markdown"], "onlyMainContent": True},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["data"]["markdown"]


@tool
def fastcrw_search(query: str) -> list[dict]:
    """Search the live web via fastCRW. Returns ranked results with URLs."""
    r = requests.post(
        f"{FASTCRW_BASE}/v1/search",
        headers=HEADERS,
        json={"query": query, "limit": 5},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["data"]
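
Full-page Markdown can be long, and a tool result goes straight into the agent's context window. Below is a minimal sketch of a guard you could apply before returning from fastcrw_scrape. The helper name and the 8,000-character default are assumptions chosen for illustration, and the only response shape it relies on is the data.markdown path used above.

```python
def markdown_for_agent(payload: dict, max_chars: int = 8000) -> str:
    """Pull Markdown out of a scrape response and cap its length.

    Hypothetical helper: it only assumes the {"data": {"markdown": ...}}
    shape used by the tool above.
    """
    data = payload.get("data") or {}
    markdown = data.get("markdown")
    if not markdown:
        # Return a message the agent can reason about instead of raising.
        return "No Markdown content was returned for this URL."
    if len(markdown) <= max_chars:
        return markdown
    # Mark the cut so the model knows the page continues.
    return markdown[:max_chars] + "\n\n[truncated]"


sample = {"data": {"markdown": "# Title\n\n" + "word " * 5000}}
result = markdown_for_agent(sample, max_chars=200)
print(result.endswith("[truncated]"))
```

Returning a readable message on empty content, rather than raising a KeyError, lets the agent move on to the next candidate URL.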

Bind both tools to an agent and it can search for sources, then scrape the most promising ones — a complete research loop without leaving LangChain:

from langchain import hub
from langchain.agents import create_react_agent, AgentExecutor
from langchain_openai import ChatOpenAI

tools = [fastcrw_search, fastcrw_scrape]

# Any ReAct-style prompt works; this assumes the public hwchase17/react
# prompt pulled from LangChain Hub.
hub_prompt = hub.pull("hwchase17/react")

agent = create_react_agent(ChatOpenAI(model="gpt-4o-mini"), tools=tools, prompt=hub_prompt)
executor = AgentExecutor(agent=agent, tools=tools)
result = executor.invoke({"input": "Summarize the latest changes to the MCP spec."})
print(result["output"])

Why fastCRW Under LangChain

For RAG, extraction quality is not cosmetic — every junk chunk (a cookie banner, a nav menu, a duplicated footer) is a retrieval false positive that crowds out a real answer. On Firecrawl's own public 1,000-URL scrape-content-dataset-v1, scored by the open diagnose_3way.py harness on 2026-05-08, fastCRW recovered the labeled content on 63.74% of 819 labeled URLs, ahead of Crawl4AI (59.95%) and Firecrawl (56.04%). Median latency was 1914 ms (p50), effectively tied with Crawl4AI and ahead of Firecrawl's 2305 ms. fastCRW publishes the full p50/p90/p99 split — its p90 (14157 ms) is the worst of the three, the disclosed cost of the chrome-stealth fallback that recovers hard pages instead of dropping them. See the 1,000-URL benchmark for the full table and a one-command repro.
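
To put the recall percentages in absolute terms, they can be converted back to URL counts over the 819 labeled URLs. The counts below are back-computed from the percentages quoted above, not taken from the benchmark table, so treat them as approximate.

```python
LABELED_URLS = 819

# Truth-recall percentages quoted above.
recall_pct = {"fastCRW": 63.74, "Crawl4AI": 59.95, "Firecrawl": 56.04}

# Approximate number of URLs on which each tool recovered the labeled content.
recovered = {
    tool: round(LABELED_URLS * pct / 100) for tool, pct in recall_pct.items()
}

for tool, count in recovered.items():
    print(f"{tool}: ~{count} of {LABELED_URLS}")

gap = recovered["fastCRW"] - recovered["Firecrawl"]
print(f"fastCRW vs Firecrawl: ~{gap} more URLs")
```

On these figures, the gap between fastCRW and Firecrawl works out to roughly 63 of the 819 labeled URLs.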

LangChain JS

LangChain's JavaScript SDK exposes the loader as FireCrawlLoader (note the capital C) in @langchain/community. The same override applies: set apiUrl to https://api.fastcrw.com, with field names in camelCase:

import { FireCrawlLoader } from "@langchain/community/document_loaders/web/firecrawl";

const loader = new FireCrawlLoader({
  url: "https://example.com/blog",
  apiKey: process.env.FASTCRW_API_KEY,
  apiUrl: "https://api.fastcrw.com",
  mode: "scrape",
});

const docs = await loader.load();

Limits + Gotchas

  • The FirecrawlLoader mode argument supports "scrape" and "crawl". fastCRW crawl is always async and bounded by limit / maxPages — set it explicitly to keep credit spend predictable.
  • Long-running crawls inside an agent loop can blow the agent's iteration budget. Run crawls outside the agent and pass results back through context, or use a background job.
  • LangChain document metadata is derived from the fastCRW response. Field names diverge slightly from Firecrawl — if a pipeline depends on one specific metadata key, check it after migrating.
  • Structured extraction (formats: ["json"] with a jsonSchema) costs 5 credits, not 1. Use plain ["markdown"] for retrieval ingestion and reserve JSON extraction for genuinely schema-shaped data.
  • search answer mode, which returns a model-written answer rather than only ranked results, is available two ways: managed on paid plans (no key needed, default model DeepSeek deepseek-v4-flash), or bring-your-own-key on any plan. For the latter, supply your own llmApiKey and llmProvider.
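
The metadata caveat in the list above is easy to check mechanically after a migration. Here is a minimal sketch, assuming your pipeline reads a known set of metadata keys. The helper name and the example keys (sourceURL, title) are placeholders for whatever your pipeline actually depends on, and the function only needs objects that carry a metadata dict, as LangChain Document objects do.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document so the sketch runs without dependencies."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def missing_metadata_keys(docs, required: set[str]) -> dict[int, set[str]]:
    """Map each document index to the required keys its metadata lacks."""
    report = {}
    for i, doc in enumerate(docs):
        missing = required - set(doc.metadata)
        if missing:
            report[i] = missing
    return report


docs = [
    FakeDoc("page one", {"sourceURL": "https://example.com/a", "title": "A"}),
    FakeDoc("page two", {"url": "https://example.com/b"}),
]
print(missing_metadata_keys(docs, {"sourceURL", "title"}))
```

An empty report means every document carries the keys the downstream code reads; anything else points at the exact documents and keys to remap.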
