Self-Hosted Web Scraping API

Run fastCRW on your own infrastructure — a single ~8 MB Docker image, no Redis or Node.js required, full Firecrawl-compatible API. Deploy on a $5 VPS or inside your own VPC for complete data control, privacy, and zero per-scrape fees.

Published: March 11, 2026
Updated: June 13, 2026
Category: use cases
Verdict

fastCRW's self-hosted deployment is the most operationally lightweight scraping API stack available: one container for most workloads, an optional browser sidecar when you actually need JS rendering, and zero platform services to operate. Under AGPL-3.0 you pay only for your server: no per-scrape fees, no vendor lock-in, no data leaving your network.

  • Single ~8 MB Docker image: no Redis, no Node.js, no Kafka
  • Full Firecrawl-compatible API: drop-in base-URL swap from existing integrations
  • Run on a $5/month VPS for low-volume workloads
  • AGPL-3.0: free to self-host, data stays in your own network

Why Teams Self-Host a Scraping API

Hosted scraping APIs solve the infrastructure problem — you get an endpoint and start scraping in minutes. But hosted APIs introduce a different set of constraints:

  • Per-scrape billing. Every page you scrape consumes a credit, and every credit costs money. For high-volume workloads (millions of pages/month), managed pricing adds up quickly: Firecrawl charges $0.83–5.33 per 1,000 scrapes across its tiers (source: marketing/competitor-prices.lock.md, verified 2026-05-18). Self-hosted scrapes carry no per-scrape fee (CANONICAL-FACTS §8), so at scale the cost comes down to what you pay for the server.
  • Data egress. Every URL you scrape and every page you receive passes through a third-party API. For regulated industries (healthcare, fintech, legal) or when scraping proprietary internal data sources, that egress is often a compliance or security problem.
  • Network topology. If your scraping targets are inside a private network (internal documentation, intranet pages, staging environments), a public cloud API can't reach them. A self-hosted instance inside your VPN can.
  • Operational predictability. Managed APIs can throttle, rate-limit, or reprice. Self-hosting gives you a fixed infrastructure cost and full control over throughput.
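
The billing point above comes down to simple arithmetic. The sketch below plugs in the per-1,000 price range quoted above and the $5/month VPS figure from the introduction. The function names and the 2,000,000-page volume are illustrative assumptions, and a real server bill will grow with volume, so treat the result as a rough comparison rather than a quote.

```python
# Rough monthly cost comparison: per-scrape billing vs. a flat server bill.
# Prices are the figures quoted in this article; the volume is hypothetical.

def managed_cost(pages_per_month: int, price_per_1k: float) -> float:
    """Monthly bill when every 1,000 scrapes costs price_per_1k dollars."""
    return pages_per_month / 1000 * price_per_1k

def break_even_pages(server_cost: float, price_per_1k: float) -> float:
    """Monthly volume at which a flat server bill equals the managed bill."""
    return server_cost / price_per_1k * 1000

pages = 2_000_000        # hypothetical monthly volume
low, high = 0.83, 5.33   # $ per 1,000 scrapes (range quoted above)
vps = 5.0                # $ per month (the small VPS from the introduction)

print(f"managed: ${managed_cost(pages, low):,.2f} to ${managed_cost(pages, high):,.2f}")
print(f"break-even at the low price: {break_even_pages(vps, low):,.0f} pages/month")
```

The break-even figure ignores ops time, which the later sections argue is the dominant cost at low volume.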

fastCRW is designed to make self-hosting as simple as possible. The goal is not "run your own large crawler platform" — it is "expose a Firecrawl-compatible scraping API with as few moving parts as possible."

The fastCRW Self-Hosting Architecture

Default: one container

The minimal fastCRW deployment is a single Docker container running a static Rust binary. No Redis. No Node.js. No message queue. No separate worker process. The image size is approximately 8 MB (CANONICAL-FACTS §7: "Docker image — single ~8 MB binary").

┌─────────────────────────────────┐
│  fastCRW container (~8 MB)      │
│  POST /v1/scrape                │
│  POST /v1/crawl                 │
│  POST /v1/map                   │
│  POST /v1/search                │
│  GET  /health                   │
└─────────────────────────────────┘

This handles HTTP scraping (the http renderer) out of the box. Most static and server-rendered sites respond correctly to HTTP scraping without JavaScript execution.

Add LightPanda for JavaScript rendering

LightPanda is a lightweight browser sidecar that handles most JavaScript-rendered pages without the full overhead of Chrome. Add it to your Docker Compose file when your targets include React SPAs, Next.js apps, and other client-rendered sites.

┌──────────────────┐     ┌──────────────────────┐
│  fastCRW         │────▶│  LightPanda sidecar   │
│  (~8 MB)         │     │  (lightweight browser) │
└──────────────────┘     └──────────────────────┘

LightPanda scraping costs 1 credit — same as HTTP. In managed mode, the default renderer is auto, which tries HTTP first and falls back to LightPanda for dynamic content.

Add Chrome for heavy anti-bot targets

For sites with sophisticated bot detection (Cloudflare challenges, fingerprinting, browser environment checks), Chrome is the most effective renderer. It is also the heaviest: roughly 500 MB image size plus approximately 1 GB resident RAM when active (CANONICAL-FACTS §7: "The opt-in chrome Compose variant is ~500 MB image + ~1 GB resident").

┌──────────────────┐     ┌──────────────────────┐
│  fastCRW         │────▶│  Chrome sidecar       │
│  (~8 MB)         │     │  (~500 MB image,      │
└──────────────────┘     │   ~1 GB resident RAM) │
                         └──────────────────────┘

Chrome is opt-in. Start without it, test your real targets, and only add Chrome if you find your success rate on critical targets is unsatisfactory. Most scraping workloads — including most market research and enrichment pipelines — don't need Chrome.

Comparison: fastCRW Self-Hosted vs. Alternatives

All structural facts from CANONICAL-FACTS §7 (marketing/CANONICAL-FACTS.md, verified 2026-05-22).

| | fastCRW (self-hosted) | Firecrawl (self-hosted) | Managed cloud API |
|---|---|---|---|
| Container count (minimal) | 1 | 5 | 0 (managed for you) |
| Docker image size | ~8 MB | ~2–3 GB total | N/A |
| External service deps | None | Redis required | N/A |
| Renderer options | HTTP, LightPanda, Chrome | Playwright/Chrome | Varies |
| Per-scrape fee | $0 (pay your server) | $0 (pay your server) | Per-credit billing |
| License | AGPL-3.0 | AGPL-3.0 | Proprietary |
| API compatibility | Firecrawl-compatible | Native | Depends on provider |
| Data egress | None (stays in your network) | None | Passes through provider |

Deployment Walkthrough

Minimal: single container on a VPS

# On your VPS (Ubuntu/Debian)
# Install Docker
curl -fsSL https://get.docker.com | sh

# Pull and run fastCRW
docker run -d \
  --name fastcrw \
  --restart unless-stopped \
  -p 3002:3002 \
  -e CRW_API_KEY=your-secret-key \
  ghcr.io/us/crw:latest

# Test
curl -X POST http://localhost:3002/v1/scrape \
  -H "Authorization: Bearer your-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'
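
Before sending the first scrape from a deploy script, it can help to wait until the container answers on /health. The sketch below is one way to do that with the standard library. It assumes /health returns HTTP 200 when the service is ready (the source only shows that the endpoint exists), and the function names are hypothetical.

```python
import time
import urllib.request
from typing import Callable

def http_health_check(base_url: str = "http://localhost:3002") -> bool:
    """Return True when GET /health answers with HTTP 200 (assumed healthy)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:  # connection refused, DNS failure, timeout, HTTP error
        return False

def wait_until_healthy(check: Callable[[], bool], attempts: int = 10,
                       delay_s: float = 1.0) -> bool:
    """Call check() up to `attempts` times, sleeping between failed attempts."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False

# Usage (not executed here):
#   if not wait_until_healthy(http_health_check):
#       raise SystemExit("fastCRW did not become healthy")
```

Passing the check as a callable keeps the retry loop testable without a running container.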

Standard: Docker Compose with LightPanda

# docker-compose.yml
services:
  fastcrw:
    image: ghcr.io/us/crw:latest
    restart: unless-stopped
    ports:
      - "3002:3002"
    environment:
      CRW_API_KEY: "your-secret-key"
      LIGHTPANDA_URL: "http://lightpanda:9222"
    depends_on:
      - lightpanda

  lightpanda:
    image: ghcr.io/us/lightpanda:latest
    restart: unless-stopped
    # No port exposure needed: internal network only

# Start the stack
docker compose up -d

Full: Docker Compose with Chrome

# docker-compose.chrome.yml
services:
  fastcrw:
    image: ghcr.io/us/crw:latest
    restart: unless-stopped
    ports:
      - "3002:3002"
    environment:
      CRW_API_KEY: "your-secret-key"
      LIGHTPANDA_URL: "http://lightpanda:9222"
      CHROME_URL: "http://chrome:9223"
    depends_on:
      - lightpanda
      - chrome

  lightpanda:
    image: ghcr.io/us/lightpanda:latest
    restart: unless-stopped

  chrome:
    image: browserless/chrome:latest
    restart: unless-stopped
    environment:
      MAX_CONCURRENT_SESSIONS: "5"
      # Budget ~1 GB RAM for Chrome

Reverse proxy with TLS (Caddy)

# Caddyfile — TLS is automatic via Let's Encrypt
scrape.yourcompany.com {
  reverse_proxy localhost:3002
}

# Install Caddy and run
caddy run --config Caddyfile

Your self-hosted fastCRW API is now available at https://scrape.yourcompany.com/v1/scrape with automatic HTTPS.

Migrating from Managed Firecrawl to Self-Hosted fastCRW

fastCRW is Firecrawl-compatible. The API shapes — request bodies, response envelopes, endpoint paths — match Firecrawl's /v1 surface. The migration is a base-URL swap:

import requests

API_KEY = "your-secret-key"  # the key you set via CRW_API_KEY

# Before (Firecrawl managed)
# CRW_BASE_URL = "https://api.firecrawl.dev/v1"

# After (fastCRW self-hosted)
CRW_BASE_URL = "http://your-server:3002/v1"

# Everything else stays the same
response = requests.post(
    f"{CRW_BASE_URL}/scrape",
    json={"url": "https://example.com", "formats": ["markdown"]},
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)

Minor divergences from Firecrawl exist in response field names and error envelopes (CANONICAL-FACTS §9 — "Response field names and error envelopes have minor divergence from Firecrawl"). Test your integration against the migration guide to catch any field name differences.
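
One way to surface field-name differences during a migration is to capture the same request against both backends and compare the key paths of the two JSON responses. The sketch below does that with plain dictionaries. The sample payloads are hypothetical and are not real Firecrawl or fastCRW output; the helper names are illustrative.

```python
def key_paths(value, prefix=""):
    """Return the set of dotted key paths found in a nested JSON value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(child, path)
    elif isinstance(value, list):
        for child in value:
            paths |= key_paths(child, f"{prefix}[]")
    return paths

def diff_fields(before: dict, after: dict) -> dict:
    """Report key paths that exist in only one of the two responses."""
    a, b = key_paths(before), key_paths(after)
    return {"only_before": sorted(a - b), "only_after": sorted(b - a)}

# Hypothetical payloads used only to exercise the helper.
old = {"success": True, "data": {"markdown": "# Hi", "metadata": {"sourceURL": "x"}}}
new = {"success": True, "data": {"markdown": "# Hi", "metadata": {"sourceUrl": "x"}}}
print(diff_fields(old, new))
```

Running this over a handful of representative URLs gives a concrete list of fields to review before switching traffic.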

Using the Same API from Your Applications

Once self-hosted, your existing application code needs only two changes: the base URL and the API key.

Python

import requests

CRW_BASE_URL = "http://your-server:3002/v1"
CRW_API_KEY = "your-secret-key"

# Scrape a page
def scrape(url: str) -> dict:
    resp = requests.post(
        f"{CRW_BASE_URL}/scrape",
        json={"url": url, "formats": ["markdown"]},
        headers={"Authorization": f"Bearer {CRW_API_KEY}"},
        timeout=30
    )
    resp.raise_for_status()
    return resp.json()

# Map a domain
def map_domain(domain: str) -> list[str]:
    resp = requests.post(
        f"{CRW_BASE_URL}/map",
        json={"url": f"https://{domain}"},
        headers={"Authorization": f"Bearer {CRW_API_KEY}"},
        timeout=30
    )
    resp.raise_for_status()
    return resp.json().get("urls", [])

# Crawl a site
def start_crawl(url: str, max_pages: int = 100) -> str:
    resp = requests.post(
        f"{CRW_BASE_URL}/crawl",
        json={"url": url, "maxPages": max_pages, "maxDepth": 3},
        headers={"Authorization": f"Bearer {CRW_API_KEY}"},
        timeout=30
    )
    resp.raise_for_status()
    return resp.json().get("id")  # crawl job ID

result = scrape("https://news.ycombinator.com")
print(result["data"]["markdown"][:500])

JavaScript/TypeScript

const BASE_URL = "http://your-server:3002/v1";
const API_KEY = process.env.CRW_API_KEY!;

const headers = {
  Authorization: `Bearer ${API_KEY}`,
  "Content-Type": "application/json",
};

// Scrape
async function scrape(url: string) {
  const res = await fetch(`${BASE_URL}/scrape`, {
    method: "POST",
    headers,
    body: JSON.stringify({ url, formats: ["markdown"] }),
  });
  return res.json();
}

// Map
async function mapDomain(url: string): Promise<string[]> {
  const res = await fetch(`${BASE_URL}/map`, {
    method: "POST",
    headers,
    body: JSON.stringify({ url }),
  });
  const data = await res.json();
  return data.urls ?? [];
}

// Extract with schema
async function extract(url: string, schema: object) {
  const res = await fetch(`${BASE_URL}/scrape`, {
    method: "POST",
    headers,
    body: JSON.stringify({ url, formats: ["json"], jsonSchema: schema }),
  });
  const data = await res.json();
  return data?.data?.json;
}

// Example
const result = await scrape("https://example.com");
console.log(result.data.markdown);

curl — all core endpoints

# Health check (no auth)
curl http://localhost:3002/health

# Scrape (markdown)
curl -X POST http://localhost:3002/v1/scrape \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'

# Scrape with structured extraction
curl -X POST http://localhost:3002/v1/scrape \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["json"],
    "jsonSchema": {
      "type": "object",
      "properties": {
        "title":       { "type": "string" },
        "description": { "type": "string" }
      }
    }
  }'

# Map a domain
curl -X POST http://localhost:3002/v1/map \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

# Start a crawl
curl -X POST http://localhost:3002/v1/crawl \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "maxPages": 50, "maxDepth": 2}'

# Check crawl status (use the ID from the crawl response)
curl http://localhost:3002/v1/crawl/your-crawl-id \
  -H "Authorization: Bearer $CRW_API_KEY"
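
Because a crawl runs as a job, the status endpoint usually has to be polled until the job finishes. The sketch below shows a polling loop in Python. It assumes the status payload carries a "status" field with values such as "completed" and "failed", as in Firecrawl's /v1 API; that is an assumption to verify against your instance, and the function name is hypothetical.

```python
import time
from typing import Callable

# Assumed status values; check them against your instance's real responses.
DONE_STATES = {"completed"}
FAILED_STATES = {"failed", "cancelled"}

def poll_crawl(fetch_status: Callable[[], dict], interval_s: float = 2.0,
               max_polls: int = 150) -> dict:
    """Poll until the job reports a terminal status, then return that payload."""
    for _ in range(max_polls):
        payload = fetch_status()
        status = payload.get("status")
        if status in DONE_STATES:
            return payload
        if status in FAILED_STATES:
            raise RuntimeError(f"crawl ended with status {status!r}")
        time.sleep(interval_s)
    raise TimeoutError(f"crawl still running after {max_polls} polls")

# Usage with requests (not executed here):
#   fetch = lambda: requests.get(
#       f"http://localhost:3002/v1/crawl/{job_id}",
#       headers={"Authorization": f"Bearer {api_key}"}, timeout=30).json()
#   result = poll_crawl(fetch)
```

Bounding the number of polls keeps a stuck job from hanging the caller indefinitely.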

Choosing Renderer: HTTP vs. LightPanda vs. Chrome

The renderer choice affects both scrape quality and cost. fastCRW's auto mode selects intelligently, but you can force a specific renderer with the renderer field.

| Renderer | What it handles | Cost | RAM per request | When to use |
|---|---|---|---|---|
| http | Static HTML, server-rendered pages | 1 credit | Minimal | Default for most sites |
| lightpanda | JavaScript SPAs, lazy-loaded content | 1 credit | ~50–100 MB | Most dynamic pages |
| chrome | Heavy anti-bot, fingerprinting, CAPTCHA-guarded | 2 credits | ~1 GB | Only when lightpanda fails |
| auto (default) | Tries http → lightpanda → chrome | 1–2 credits | Varies | Production default |

Credit costs from CANONICAL-FACTS §3 (marketing/CANONICAL-FACTS.md, verified 2026-05-29).

A practical approach: start with auto. After your first batch of scrapes, check which URLs consistently returned thin content. For those URLs, test lightpanda explicitly. Only add Chrome when lightpanda still fails on a critical target.
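
The triage described above can be sketched as a small helper: scrape a problem URL with each renderer, then keep the lightest one whose output is not thin. The 200-character threshold and the function names below are assumptions for illustration, not fastCRW behavior.

```python
from typing import Optional

# Escalation order described above: cheapest renderer first.
ESCALATION = ["http", "lightpanda", "chrome"]

def is_thin(markdown: str, min_chars: int = 200) -> bool:
    """Heuristic: treat very short output as a failed render (threshold is a guess)."""
    return len(markdown.strip()) < min_chars

def next_renderer(current: str) -> Optional[str]:
    """Return the next heavier renderer, or None when already at the heaviest."""
    idx = ESCALATION.index(current)
    return ESCALATION[idx + 1] if idx + 1 < len(ESCALATION) else None

def choose_renderer(results: dict) -> Optional[str]:
    """Given {renderer: markdown} from test scrapes, pick the lightest non-thin one."""
    for renderer in ESCALATION:
        if renderer in results and not is_thin(results[renderer]):
            return renderer
    return None

# Hypothetical outputs for one URL: an empty SPA shell vs. rendered content.
samples = {"http": "<div id=root></div>", "lightpanda": "# Pricing\n" + "text " * 80}
print(choose_renderer(samples))
```

Recording the chosen renderer per domain lets later scrapes set the renderer field directly instead of repeating the escalation.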

In self-hosted mode, you're not paying per-credit — you're paying for server resources. The Chrome sidecar's ~1 GB RAM cost is a fixed overhead whether you use it or not, so size your server accordingly before enabling it.
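
A back-of-the-envelope RAM budget can make the sizing decision concrete. The sketch below uses the approximate figures from the renderer table; the 512 MB allowance for the fastCRW container plus OS headroom is a placeholder assumption, not a measured value, and the function name is hypothetical.

```python
# Approximate figures taken from the renderer table, in MB.
CHROME_RESIDENT_MB = 1024
LIGHTPANDA_PER_REQUEST_MB = 100  # upper end of the ~50-100 MB range
BASE_AND_OS_MB = 512             # assumption: fastCRW itself plus OS headroom

def ram_budget_mb(lightpanda_concurrency: int = 0, chrome: bool = False) -> int:
    """Estimate RAM to provision for a given sidecar configuration."""
    total = BASE_AND_OS_MB
    total += lightpanda_concurrency * LIGHTPANDA_PER_REQUEST_MB
    if chrome:
        total += CHROME_RESIDENT_MB
    return total

for label, kwargs in [
    ("http only", {}),
    ("lightpanda x4", {"lightpanda_concurrency": 4}),
    ("lightpanda x4 + chrome", {"lightpanda_concurrency": 4, "chrome": True}),
]:
    print(f"{label}: ~{ram_budget_mb(**kwargs)} MB")
```

Measure real usage under load before committing to an instance size; these numbers only bound the starting point.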

MCP Integration: fastCRW as a Tool for AI Agents

fastCRW ships an MCP (Model Context Protocol) transport at /mcp, which means your self-hosted instance works directly as a tool server for Claude, Claude Code, and any MCP-compatible AI agent framework.

Install the MCP package:

# npm
npm install -g crw-mcp@0.6.0

# or via bunx
bunx crw-mcp@0.6.0

Point it at your self-hosted instance:

// Claude Desktop or Claude Code config
{
  "mcpServers": {
    "fastcrw": {
      "command": "npx",
      "args": ["crw-mcp"],
      "env": {
        "CRW_API_URL": "http://your-server:3002",
        "CRW_API_KEY": "your-secret-key"
      }
    }
  }
}

Your AI agent now calls scrape, crawl, map, and search as native tools against your self-hosted instance. No data transits any cloud service — the agent calls your server, your server scrapes the target, results return to the agent. See MCP integration for the full configuration guide.

Production Hardening

Authentication

fastCRW requires a bearer token (Authorization: Bearer yourkey) for API calls; the /health check shown earlier is the exception and needs no auth. Set a strong, randomly generated key (32+ characters) via the CRW_API_KEY environment variable. If you expose the API externally, rotate the key periodically and use a secrets manager (Vault, AWS SSM, Doppler) rather than hardcoding it in Compose files.
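
A key of the recommended strength can be generated with Python's standard secrets module. This is a minimal sketch; the function name is illustrative, and where you store the result (secrets manager, environment file) is up to your setup.

```python
import secrets

def generate_api_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random token; 32 random bytes encode to 43 characters."""
    return secrets.token_urlsafe(num_bytes)

key = generate_api_key()
print(f"CRW_API_KEY={key}")
```

The secrets module draws from the operating system's cryptographic random source, which makes it a better fit for credentials than the random module.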

TLS

Never expose the raw HTTP port (3002) to the public internet. Put fastCRW behind a reverse proxy (Caddy, Nginx, Traefik) that terminates TLS. Caddy's automatic HTTPS via Let's Encrypt is the easiest path.

Rate limiting

Implement rate limiting at the reverse proxy level to prevent runaway scrape loops from exhausting your server's bandwidth or triggering IP bans on target sites. Nginx's limit_req_zone or Traefik's RateLimit middleware are straightforward options.
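
Proxy-level limits protect the server; a client-side limiter can additionally stop a runaway loop at its source. The token bucket below is a generic sketch of that idea, not a fastCRW feature. The class name and parameters are hypothetical.

```python
import time
from typing import Callable

class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self.clock()
        # Refill for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Usage (not executed here): allow 5 scrapes/second with bursts of 10.
#   bucket = TokenBucket(rate=5, capacity=10)
#   if bucket.try_acquire():
#       scrape(url)
```

The injectable clock exists only so the refill logic can be tested without sleeping.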

Network isolation

If fastCRW should only be accessible inside your private network, bind the container to the internal network interface rather than 0.0.0.0. For Kubernetes deployments, use a ClusterIP service and expose only through your ingress.

Resource limits

Set container CPU and memory limits in your Compose file or Kubernetes manifests to prevent a burst of heavy Chrome-rendered scrapes from OOM-killing other services on the same host.

Logging and monitoring

fastCRW logs each request with URL, renderer, status, and latency. Ship these logs to your existing log aggregation stack (Loki, Datadog, CloudWatch). Set an alert if the scrape success rate drops below 80% — that signals target-site changes or network issues.
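
The 80% alert can be prototyped on parsed log records before wiring it into a monitoring stack. The sketch below assumes each record has been parsed into a dictionary with a numeric "status" field holding an HTTP-style code; the actual log schema is not specified here, so the field names are assumptions.

```python
ALERT_THRESHOLD = 0.80  # the 80% success-rate floor suggested above

def success_rate(records: list[dict]) -> float:
    """Share of records whose status is a 2xx code; 1.0 for an empty window."""
    if not records:
        return 1.0
    ok = sum(1 for r in records if 200 <= r.get("status", 0) < 300)
    return ok / len(records)

def should_alert(records: list[dict], threshold: float = ALERT_THRESHOLD) -> bool:
    """True when the window's success rate falls below the threshold."""
    return success_rate(records) < threshold

# Hypothetical parsed log records; the real log schema may differ.
window = [
    {"url": "https://example.com/a", "renderer": "http", "status": 200, "latency_ms": 180},
    {"url": "https://example.com/b", "renderer": "http", "status": 200, "latency_ms": 210},
    {"url": "https://example.com/c", "renderer": "lightpanda", "status": 502, "latency_ms": 4000},
    {"url": "https://example.com/d", "renderer": "http", "status": 429, "latency_ms": 90},
]
print(f"success rate: {success_rate(window):.0%}, alert: {should_alert(window)}")
```

In production the same calculation would typically live in a log-based metric or alert rule in your aggregation stack rather than in application code.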

Read the full hardening guide at /docs/self-hosting-hardening.

AGPL-3.0: What It Means for Self-Hosters

fastCRW is licensed AGPL-3.0. For most teams self-hosting, the practical implications are:

  • If you run fastCRW without modifying the source: no obligation. You can use it commercially, internally, or as an API for your own products.
  • If you modify the fastCRW source and deploy it as a network service: you must offer the source of your modified version, under AGPL-3.0, to the users who interact with it over the network. This remote-network-interaction requirement (section 13) is what distinguishes AGPL from GPL.
  • If you build a product on top of fastCRW (e.g., your own managed scraping service): consult a lawyer. AGPL copyleft may require you to open-source your wrapper, depending on how tightly coupled it is.

For the vast majority of self-hosting use cases — running fastCRW inside your own infrastructure to power internal tools, data pipelines, or AI agent workflows — AGPL-3.0 has no practical impact.

When to Self-Host vs. Use Managed Cloud

Self-hosting is the right choice when:

  • Cost at scale. You're scraping millions of pages per month and per-scrape fees are a significant line item. Self-hosted cost is fixed server overhead; managed cost scales with volume.
  • Data control and privacy. Your scraping targets contain sensitive data, or your organization's policy prohibits data egress to third parties.
  • Private network access. Your targets are inside a VPN or private network that the public cloud can't reach.
  • Compliance requirements. HIPAA, SOC 2, GDPR data-residency requirements, or similar constraints that require data to stay in specific jurisdictions.
  • Custom SLA. You need guaranteed throughput and uptime SLAs that a shared managed API can't provide.

Managed cloud (fastcrw.com) is the right choice when:

  • Speed to first scrape. You want an API key and production-ready endpoint in under 5 minutes with no ops work.
  • Burst capacity. Your scraping volume is unpredictable and you want elastic throughput without provisioning servers.
  • Minimal ops. No team member wants to own the infrastructure. Managed handles scaling, browser sidecars, updates, and uptime.
  • Low volume. Under ~50,000 scrapes per month, the Hobby or Standard plan is often cheaper than a dedicated VPS once ops time is factored in.

The two are not mutually exclusive. Run self-hosted for your steady-state high-volume workloads; use the managed API for burst capacity or experimentation. Because the API is identical, you can route traffic between the two without code changes.
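
A hybrid setup like this can be as simple as choosing the base URL per request. The sketch below prefers the self-hosted instance and spills over to the managed API once a concurrency ceiling is reached. The managed URL is a placeholder (the real endpoint is not given here), and the function name and capacity value are hypothetical.

```python
SELF_HOSTED_URL = "http://your-server:3002/v1"
MANAGED_URL = "https://managed.example.invalid/v1"  # placeholder, not a real endpoint

def pick_base_url(in_flight: int, self_hosted_capacity: int) -> str:
    """Prefer the self-hosted instance; spill over to managed when it is saturated."""
    if in_flight < self_hosted_capacity:
        return SELF_HOSTED_URL
    return MANAGED_URL

# With 8 requests in flight and a ceiling of 8, the next request goes to managed.
print(pick_base_url(in_flight=3, self_hosted_capacity=8))
print(pick_base_url(in_flight=8, self_hosted_capacity=8))
```

Each backend needs its own API key, so a real router would return the key alongside the URL.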

Good Fits for Self-Hosting

  • Privacy-sensitive workloads — healthcare, legal, and fintech teams where scraped data cannot leave the organization's network
  • High-volume pipelines — millions of pages per month where per-scrape managed fees are prohibitive
  • Internal knowledge ingestion — scraping private intranet pages, documentation sites, or staging environments inside a VPN
  • Cost-sensitive startups — teams that want production-grade scraping without a large managed API bill
  • Platform engineering teams — building an internal scraping microservice that other teams call, rather than each team integrating a managed API separately
  • AI agent infrastructure — self-hosted fastCRW as the scraping backend for LLM agents that must keep browsing activity private

When Self-Hosting Is the Wrong Choice

  • Immediate throughput, minimal ops: If no one on your team wants to own infrastructure, managed is faster and simpler. Self-hosting requires initial setup and ongoing maintenance.
  • Tiny volume: Below ~10,000 scrapes per month, the time cost of operating a server likely exceeds the cost of a Hobby plan ($13/mo launch price, down from $19, with 3,000 credits; CANONICAL-FACTS §2).
  • Elastic burst needs: If your scraping volume spikes unpredictably (seasonal campaign, viral traffic), managed cloud handles elasticity automatically. Self-hosted capacity is fixed at what you provisioned.
  • Aggressive anti-bot targets at scale: Running a fleet of Chrome browsers at scale is operationally significant. If your primary use case is bypassing heavy anti-bot protection at high volume, a managed API with a distributed browser fleet may be more effective.

Related Resources

  • Self-hosting guide — step-by-step deployment instructions including Docker Compose files
  • Self-hosting hardening guide — security, TLS, rate limiting, and access control
  • Firecrawl self-hosted Rust alternative — detailed comparison of fastCRW vs. Firecrawl when self-hosting
  • MCP integration — connect your self-hosted instance to Claude and AI agent frameworks
  • Pricing — managed cloud plan rates; useful for comparing managed vs. self-hosted cost at your volume
  • Benchmarks — accuracy and latency data for evaluating whether self-hosted performance meets your requirements
  • Lead enrichment — a common self-hosting use case: enrichment pipelines where CRM data can't leave the VPC
  • Market research — another self-hosting candidate: competitive intelligence pipelines with proprietary internal data
