Engineering

Where CRW Still Falls Short — and What We're Improving

An honest look at CRW's current limitations — screenshots, PDF parsing, anti-bot, SPA coverage, retry logic, caching — and the roadmap for each.

March 14, 2026 · 10 min read

Why We're Writing This

Trust in developer tools is built on honesty. If we only wrote about CRW's strengths, you'd discover the gaps when they matter most — mid-project. This post documents CRW's current limitations clearly, with workarounds and roadmap status for each. Use it to make an informed decision before committing.

Limitation 1: No Screenshot Support

Current state: CRW does not capture page screenshots. The screenshot format from Firecrawl's API is not implemented. Passing "formats": ["screenshot"] returns an error.

Why it's missing: Screenshot capture requires a headless browser to perform a full render and then invoke the Page.captureScreenshot DevTools command. CRW's current LightPanda integration can render HTML for content extraction, but the screenshot capture API isn't yet stable enough for production use in LightPanda.

Who it affects: Teams that need visual page captures for QA dashboards, visual regression monitoring, archival purposes, or any workflow where a pixel-accurate rendering matters — not just the text content.

Workaround: Use Firecrawl or Crawl4AI for screenshot requirements. Because CRW is API-compatible with Firecrawl, you can route screenshot requests to Firecrawl while using CRW for content extraction. In your client code:

const BASE_URL_CRW = "http://localhost:3000"; // or https://fastcrw.com/api for cloud
const BASE_URL_FIRECRAWL = "https://api.firecrawl.dev";

async function scrape(url: string, needsScreenshot: boolean) {
  const base = needsScreenshot ? BASE_URL_FIRECRAWL : BASE_URL_CRW;
  const formats = needsScreenshot ? ["markdown", "screenshot"] : ["markdown"];
  const res = await fetch(`${base}/v1/scrape`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, formats }),
  });
  return res.json();
}

Roadmap: Screenshot support is planned. We'll add it once LightPanda's screenshot API stabilizes for production use. Track progress on GitHub Issues.

Limitation 2: No PDF or Document Extraction

Current state: CRW scrapes HTML pages only. If you pass a PDF URL to /v1/scrape, CRW will fetch the raw bytes but cannot parse or extract text from PDF, DOCX, XLSX, or other document formats. The response will either error or return empty content.

Why it's missing: Document parsing is a separate problem domain from HTML scraping. A production-quality PDF parser (handling encrypted PDFs, multi-column layouts, tables, figures) adds significant binary size and maintenance surface. The primary CRW use case — web content extraction for LLMs — is fully served by HTML. We haven't added document parsing complexity that only a subset of users need.

Who it affects: Teams that need to index mixed content sources — public websites alongside corporate document repositories, uploaded PDFs, or government document archives.

Workaround: For document URLs, route them to a dedicated document parser. Two good options:

# Python: pdfplumber for local PDF files
import pdfplumber

def extract_pdf(path: str) -> str:
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

# Or use Firecrawl for PDF URLs (CRW handles the HTML portion)
# Firecrawl handles PDFs at: POST /v1/scrape with PDF URL
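For mixed URL lists, the routing decision can live in your client. A minimal TypeScript sketch — `isDocumentUrl`, `routeScrape`, and the caller-supplied scraper functions are hypothetical names, not part of either API:

```typescript
// Route document URLs away from CRW based on file extension.
// Extension sniffing is a heuristic; checking the response
// Content-Type header is more reliable when a HEAD request is affordable.
const DOC_EXTENSIONS = [".pdf", ".docx", ".xlsx", ".pptx"];

type Scraper = (url: string) => Promise<unknown>;

function isDocumentUrl(url: string): boolean {
  const path = new URL(url).pathname.toLowerCase();
  return DOC_EXTENSIONS.some((ext) => path.endsWith(ext));
}

// crwScrape and firecrawlScrape are caller-supplied functions that
// POST to the respective /v1/scrape endpoints.
async function routeScrape(
  url: string,
  crwScrape: Scraper,
  firecrawlScrape: Scraper,
): Promise<unknown> {
  return isDocumentUrl(url) ? firecrawlScrape(url) : crwScrape(url);
}
```

The same router extends naturally to the screenshot case from Limitation 1: one predicate per capability gap, one fallback service per predicate.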

Roadmap: PDF text extraction (not full layout parsing) is on the roadmap but not in the current development cycle. DOCX support is further out.

Limitation 3: Anti-Bot Is Not Best-in-Class

Current state: CRW has basic anti-bot measures — realistic user agents, request header mimicry, cookie handling, and configurable delays between requests. For most publicly accessible content this is sufficient. For aggressively protected sites using Cloudflare Enterprise, PerimeterX, DataDome, or Akamai Bot Manager, CRW may receive blocks more often than specialized anti-bot services.

Why it's limited: Effective anti-bot bypass at the highest level requires residential proxy networks, TLS fingerprint spoofing (matching browser JA3/JA4 signatures), browser behavioral mimicry (mouse movements, scroll patterns), and CAPTCHA solving integrations. These are expensive, legally complex, and require ongoing maintenance as detection systems evolve. CRW's focus is the 80% of scraping tasks that don't require sophisticated anti-bot — public docs, news, product pages, open APIs.

Who it affects: Teams scraping high-protection targets: e-commerce with active price-scraping protection, social media platforms, financial data sites, and any site actively investing in scraper detection.

Workaround: For difficult targets, combine CRW with an external residential proxy network. CRW supports HTTP/HTTPS proxy configuration:

curl -X POST http://localhost:3000/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://protected-site.com",
    "formats": ["markdown"],
    "proxy": "http://user:pass@residential-proxy.example.com:8080"
  }'

This separates the IP/fingerprint layer (handled by your proxy provider: Oxylabs, BrightData, Smartproxy) from the content extraction layer (handled by CRW). You get anti-bot bypass without requiring CRW to bundle a full anti-detection stack.
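In client code, that separation can be a per-target decision: attach the proxy parameter only for domains known to block datacenter IPs. A sketch — the domain list and proxy URL are illustrative, not CRW defaults:

```typescript
// Domains that have blocked direct requests in the past (illustrative).
const HARD_TARGETS = new Set(["protected-site.com", "shop.example.com"]);

// Build the /v1/scrape request body, adding the proxy field only
// when the target hostname is in the hard-target set.
function scrapeBody(url: string, proxy?: string): Record<string, unknown> {
  const host = new URL(url).hostname;
  const body: Record<string, unknown> = { url, formats: ["markdown"] };
  if (proxy && HARD_TARGETS.has(host)) body.proxy = proxy;
  return body;
}
```

This keeps proxy bandwidth (typically billed per GB) reserved for the targets that actually need it.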

Roadmap: We're evaluating built-in proxy rotation via configurable proxy pools. Native CAPTCHA solving is not planned — this is better handled at the proxy/browser layer.

Limitation 4: SPA Coverage Is Inconsistent

Current state: CRW's JavaScript rendering via LightPanda works for many SPAs, but reliability on complex React/Vue/Angular applications is lower than Playwright-based alternatives. Some pages with heavy client-side routing, lazy loading, or browser APIs not yet implemented in LightPanda may return incomplete content.

Why it's inconsistent: LightPanda is an experimental browser written in Zig that implements a subset of browser APIs. It's not Chromium — it doesn't have the full W3C API surface area. Pages that depend on APIs not yet implemented (e.g., some Web Animations API methods, certain Canvas operations, WebRTC) may not render correctly.

How to detect it: If a scrape returns thin content (fewer than 200 words) for a page you know has substantial content, or if the markdown is clearly navigation text rather than page content, LightPanda likely failed to render the application state. Compare with a manual browser visit.

Workaround: For URLs where content quality is critical and the page is JavaScript-heavy, route to Firecrawl or use Playwright directly. You can detect this conditionally:

async function scrapeWithFallback(url: string) {
  const result = await crwScrape(url);
  
  // If markdown is thin, fall back to Firecrawl
  if ((result.data?.markdown ?? "").split(/\s+/).length < 200) {
    console.warn(`Thin content for ${url}, falling back to Firecrawl`);
    return firecrawlScrape(url);
  }
  
  return result;
}

Roadmap: Improving LightPanda SPA coverage is the highest-priority gap in CRW's core scraping capability. We're actively contributing to LightPanda and tracking its API surface expansion.

Limitation 5: No WebSocket / SSE Streaming for Crawl Progress

Current state: The crawl progress API uses polling — you call GET /v1/crawl/:id repeatedly until the job completes. There's no WebSocket or Server-Sent Events stream for real-time progress updates.

Why it's missing: Polling is simpler to implement, has no persistent connection management, and works through proxies and load balancers without special configuration. For most backend automation use cases, polling every 2 seconds is perfectly adequate.
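The polling loop described above is a few lines of client code. A sketch, with the status fetch passed in as a function so it's testable — `getStatus` is a caller-supplied helper, not part of the CRW API:

```typescript
// Generic poll-until-terminal loop. getStatus is any function returning
// the current job state, e.g. a fetch against GET /v1/crawl/:id.
type CrawlStatus = { status: string; [key: string]: unknown };

async function waitForCrawl(
  getStatus: () => Promise<CrawlStatus>,
  intervalMs = 2000,
): Promise<CrawlStatus> {
  for (;;) {
    const s = await getStatus();
    if (s.status === "completed" || s.status === "failed") return s;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```

In practice `getStatus` would be something like `() => fetch(`${base}/v1/crawl/${id}`).then(r => r.json())`.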

Who it affects: UI-heavy applications that want to show a real-time progress bar ("Crawled 47/100 pages...") without the overhead of polling. Also affects very large crawls where you want early results rather than waiting for completion.

Workaround: Poll at a short interval and expose progress to your UI through your own backend:

// Server: poll CRW and push to client via your own SSE stream
app.get("/crawl-progress/:jobId", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  
  const poll = async () => {
    const status = await fetch(`http://crw:3000/v1/crawl/${req.params.jobId}`);
    const data = await status.json();
    
    res.write(`data: ${JSON.stringify(data)}\n\n`);
    
    if (data.status !== "completed" && data.status !== "failed") {
      setTimeout(poll, 1500);
    } else {
      res.end();
    }
  };
  
  poll();
});

Roadmap: SSE (Server-Sent Events) for crawl progress is planned as a simpler alternative to WebSockets. This will allow real-time progress without full-duplex WebSocket complexity.

Limitation 6: No Persistent Job Storage

Current state: Crawl job state is stored in memory. If CRW restarts — planned or unplanned — all in-progress crawl jobs are lost. The job ID becomes invalid and the status endpoint returns 404.

Why it's missing: Persistent storage means adding an external dependency (Redis, SQLite, PostgreSQL). This conflicts with the core design goal: a single binary with no external dependencies. Every dependency is an operational burden.

Who it affects: Teams running long crawl jobs (multi-hour site indexing) where an unexpected CRW restart would require restarting the entire crawl from scratch.

Workaround: Break large crawls into smaller batches and manage state externally:

// Instead of one 1000-page crawl, do 10 x 100-page crawls
// and track completion in your own database
function urlBatches(urls: string[], size: number): string[][] {
  const out: string[][] = [];
  for (let i = 0; i < urls.length; i += size) out.push(urls.slice(i, i + size));
  return out;
}

const batches = urlBatches(allUrls, 100);
for (const [i, batch] of batches.entries()) {
  const jobId = await startCrawl(batch[0], { limit: 100 });
  await waitForCompletion(jobId);
  await saveResultsToDb(jobId);
  console.log(`Completed batch ${i + 1}/${batches.length}`);
}

Also use a process manager that restarts CRW quickly: docker run --restart unless-stopped or a systemd unit with Restart=on-failure.
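External state tracking doesn't require a database — a JSON checkpoint file survives CRW restarts just fine for batch-level granularity. A sketch using Node's built-in fs module (the checkpoint path and record shape are illustrative):

```typescript
import { readFileSync, writeFileSync, existsSync } from "node:fs";

// Track which batches have completed so a restarted run can skip them.
type Checkpoint = { completedBatches: number[] };

function loadCheckpoint(path: string): Checkpoint {
  if (!existsSync(path)) return { completedBatches: [] };
  return JSON.parse(readFileSync(path, "utf8"));
}

function markBatchDone(path: string, batchIndex: number): void {
  const cp = loadCheckpoint(path);
  if (!cp.completedBatches.includes(batchIndex)) cp.completedBatches.push(batchIndex);
  writeFileSync(path, JSON.stringify(cp));
}
```

Before starting each batch, check `loadCheckpoint(path).completedBatches.includes(i)` and skip work that already finished before the restart.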

Roadmap: Optional SQLite-backed job persistence is planned as an opt-in feature. The default behavior (in-memory, no dependencies) will be preserved.

Limitation 7: No Built-in Retry with Backoff

Current state: CRW does not automatically retry failed requests. If a page returns a 429 (rate limited), 503 (service unavailable), or a network timeout, the scrape fails immediately and the error is returned to the caller.

Why it's missing: Retry logic with backoff is best implemented at the client layer where you can make informed decisions about which errors to retry and how aggressively. Server-side retry can mask important signals (persistent 403s, quota exhaustion).

Workaround: Implement retry in your client code:

async function scrapeWithRetry(url: string, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const res = await fetch("http://localhost:3000/v1/scrape", { // or https://fastcrw.com/api/v1/scrape for cloud
        method: "POST",
        headers: {
          "Authorization": "Bearer YOUR_API_KEY",
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ url, formats: ["markdown"] }),
      });
      const data = await res.json();
      if (data.success) return data;
      if (res.status === 429 || res.status === 503) {
        const wait = Math.min(1000 * Math.pow(2, attempt), 30000);
        await new Promise(r => setTimeout(r, wait));
        continue;
      }
      throw new Error(data.error);
    } catch (err) {
      if (attempt === maxRetries) throw err;
      await new Promise(r => setTimeout(r, 1000 * attempt));
    }
  }
}

Roadmap: Configurable per-request retry is planned for a future release. When added, it will be opt-in with explicit retry count and backoff strategy parameters.

Limitation 8: No Response Caching

Current state: CRW does not cache scraped responses. Every call to /v1/scrape performs a fresh HTTP fetch, regardless of how recently that URL was scraped. This means scraping the same URL twice costs twice the network round-trips.

Workaround: Add a caching layer in front of CRW. A Redis-backed approach works well:

import { createClient } from "redis";

const redis = createClient();
await redis.connect(); // redis v4+ requires an explicit connect before use
const TTL = 60 * 60; // 1 hour cache

async function cachedScrape(url: string) {
  const cacheKey = `scrape:${url}`;
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);
  
  const result = await fetch("http://localhost:3000/v1/scrape", { // or https://fastcrw.com/api/v1/scrape for cloud
    method: "POST",
    headers: {
      "Authorization": "Bearer YOUR_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, formats: ["markdown"] }),
  }).then(r => r.json());
  
  await redis.setEx(cacheKey, TTL, JSON.stringify(result));
  return result;
}
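If Redis is more infrastructure than you want, a single-process client can use an in-memory TTL map instead. A sketch with lazy eviction (entries are removed on read, not by a background timer):

```typescript
// Minimal in-process TTL cache. Suitable for a single client process;
// use Redis when multiple processes need to share the cache.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(private ttlMs: number) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // lazy eviction on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```

Wrap the scrape call the same way as the Redis version: check `cache.get(url)` first, and `cache.set(url, result)` after a fresh fetch.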

Roadmap: Optional HTTP response caching with configurable TTL is planned. When added, it will be opt-in to avoid surprising behavior for use cases that require fresh content.

Honest Summary: Limitations Matrix

| Limitation | Severity | Workaround Available | Roadmap Priority |
| --- | --- | --- | --- |
| No screenshots | Medium | ✅ Route to Firecrawl | Medium |
| No PDF/document parsing | Medium | ✅ pdfplumber / Firecrawl | Low |
| Anti-bot not best-in-class | Medium | ✅ Residential proxy | Medium |
| SPA coverage inconsistent | High | ⚠️ Fallback to Firecrawl | High |
| No crawl progress streaming | Low | ✅ Client-side polling proxy | Medium |
| No persistent job storage | Medium | ✅ External state tracking | Medium |
| No built-in retry | Low | ✅ Client-side retry | Low |
| No response caching | Low | ✅ Redis cache layer | Low |

What CRW Is Good At Today

Despite these gaps, CRW works well for a specific, well-defined set of use cases:

  • Scraping HTML-primary content to clean markdown for LLMs
  • Building RAG pipelines from websites and documentation
  • Exposing web scraping to AI agents via MCP
  • Self-hosting a Firecrawl-compatible API with minimal infrastructure
  • High-concurrency scraping of publicly accessible content
  • Low-memory self-hosting as a sidecar to existing applications

If your use case falls in this set, CRW's limitations are unlikely to be blocking. If you need screenshots, document parsing, or reliable SPA coverage today, use the tools that do those things well — and check back as the roadmap progresses.

How We Prioritize the Roadmap

GitHub issues drive our priority decisions. If a specific limitation is blocking your use case, open an issue with your use case description. Limitations with multiple real-world users asking for them get prioritized. The roadmap reflects what developers are actually trying to do with CRW, not what sounds impressive in a feature list.

Try CRW

Open-Source Path

docker run -p 3000:3000 ghcr.io/us/crw:latest

GitHub · Docs

Hosted Path — fastCRW

fastCRW — 50 free credits, no credit card required.

Frequently Asked Questions

What can't CRW do?

The main things CRW doesn't do yet: screenshot capture, PDF/DOCX parsing, best-in-class anti-bot bypass, reliable rendering of complex SPAs, WebSocket crawl progress streaming, persistent job storage across restarts, and automatic request retry. Each limitation has a workaround and most are on the roadmap. See the matrix above for the full picture.

Is CRW production-ready?

CRW is production-ready for its target use cases: scraping HTML content to clean markdown, building RAG pipelines, exposing web scraping via MCP, and self-hosting a Firecrawl-compatible API. It's not production-ready for screenshot capture, PDF indexing, or scraping highly protected sites without external proxy support. If your workflow stays within the supported use cases, the limitations listed here are unlikely to matter.

Does CRW support screenshots?

Not currently. Screenshot capture requires LightPanda's screenshot API to stabilize, which is still in development. If you need screenshots today, Firecrawl handles this well — and since CRW is API-compatible, you can route screenshot requests to Firecrawl while using CRW for content extraction.

Can CRW scrape JavaScript-heavy sites?

CRW handles many JavaScript-heavy pages through LightPanda, which provides headless browser rendering. Reliability is good for typical React/Vue SPAs, but lower than Playwright for complex apps with heavy client-side routing or many browser API dependencies. For pages where content quality is critical, use a fallback strategy as shown in the SPA section above.

Does CRW support PDF parsing?

Not currently. CRW scrapes HTML pages only. For PDF URLs, use a dedicated parser like pdfplumber (Python) or route to Firecrawl, which has full PDF extraction support.

When will screenshots be available?

We're targeting screenshot support once LightPanda's Page.captureScreenshot API is stable enough for production use. We don't give hard timelines for features that depend on third-party library maturity. The best way to get an update is to watch the GitHub repository or star the issue tracking screenshot support.

Get Started

Try CRW Free

Self-host for free (AGPL) or use fastCRW cloud with 500 free credits — no credit card required.