Engineering

Where CRW Still Falls Short — and What We're Improving

An honest look at CRW's current limitations — screenshots, PDF parsing, anti-bot, SPA coverage, retry logic, caching — and the roadmap for each.

March 14, 2026 · 10 min read

Why We're Writing This

Trust in developer tools is built on honesty. If we only wrote about CRW's strengths, you'd discover the gaps when they matter most — mid-project. This post documents CRW's current limitations clearly, with workarounds and roadmap status for each. Use it to make an informed decision before committing.

Limitation 1: No Screenshot Support

Current state: CRW does not capture page screenshots. The screenshot format from Firecrawl's API is not implemented. Passing "formats": ["screenshot"] returns an error.

Why it's missing: Screenshot capture requires a headless browser to perform a full render and then invoke the Page.captureScreenshot DevTools command. CRW's current LightPanda integration can render HTML for content extraction, but the screenshot capture API isn't yet stable enough for production use in LightPanda.

Who it affects: Teams that need visual page captures for QA dashboards, visual regression monitoring, archival purposes, or any workflow where a pixel-accurate rendering matters — not just the text content.

Workaround: Use Firecrawl or Crawl4AI for screenshot requirements. Because CRW is API-compatible with Firecrawl, you can route screenshot requests to Firecrawl while using CRW for content extraction. In your client code:

const BASE_URL_CRW = "http://localhost:3000"; // or https://fastcrw.com/api for cloud
const BASE_URL_FIRECRAWL = "https://api.firecrawl.dev";

async function scrape(url: string, needsScreenshot: boolean) {
  const base = needsScreenshot ? BASE_URL_FIRECRAWL : BASE_URL_CRW;
  const formats = needsScreenshot ? ["markdown", "screenshot"] : ["markdown"];
  const res = await fetch(`${base}/v1/scrape`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, formats }),
  });
  return res.json();
}

Roadmap: Screenshot support is planned. We'll add it once LightPanda's screenshot API stabilizes for production use. Track progress on GitHub Issues.

Limitation 2: No PDF or Document Extraction

Current state: CRW scrapes HTML pages only. If you pass a PDF URL to /v1/scrape, CRW will fetch the raw bytes but cannot parse or extract text from PDF, DOCX, XLSX, or other document formats. The response will either error or return empty content.

Why it's missing: Document parsing is a separate problem domain from HTML scraping. A production-quality PDF parser (handling encrypted PDFs, multi-column layouts, tables, figures) adds significant binary size and maintenance surface. The primary CRW use case — web content extraction for LLMs — is fully served by HTML. We haven't added document parsing complexity that only a subset of users need.

Who it affects: Teams that need to index mixed content sources — public websites alongside corporate document repositories, uploaded PDFs, or government document archives.

Workaround: For document URLs, route them to a dedicated document parser. Two good options:

# Python: pdfplumber for local PDF files
import pdfplumber

def extract_pdf(path: str) -> str:
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

# Or use Firecrawl for PDF URLs (CRW handles the HTML portion)
# Firecrawl handles PDFs at: POST /v1/scrape with PDF URL
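For mixed URL lists, the routing decision can live in your client. A minimal TypeScript sketch — `isDocumentUrl`, `routeScrape`, and the caller-supplied scraper functions are hypothetical names, not part of either API:

```typescript
// Route document URLs away from CRW based on file extension.
// Extension sniffing is a heuristic; checking the response
// Content-Type header is more reliable when a HEAD request is affordable.
const DOC_EXTENSIONS = [".pdf", ".docx", ".xlsx", ".pptx"];

type Scraper = (url: string) => Promise<unknown>;

function isDocumentUrl(url: string): boolean {
  const path = new URL(url).pathname.toLowerCase();
  return DOC_EXTENSIONS.some((ext) => path.endsWith(ext));
}

// crwScrape and firecrawlScrape are caller-supplied functions that
// POST to the respective /v1/scrape endpoints.
async function routeScrape(
  url: string,
  crwScrape: Scraper,
  firecrawlScrape: Scraper,
): Promise<unknown> {
  return isDocumentUrl(url) ? firecrawlScrape(url) : crwScrape(url);
}
```

The same router extends naturally to the screenshot case from Limitation 1: one predicate per capability gap, one fallback service per predicate.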

Roadmap: PDF text extraction (not full layout parsing) is on the roadmap but not in the current development cycle. DOCX support is further out.

Limitation 3: Anti-Bot Is Not Best-in-Class

Current state: CRW has basic anti-bot measures — realistic user agents, request header mimicry, cookie handling, and configurable delays between requests. For most publicly accessible content this is sufficient. For aggressively protected sites using Cloudflare Enterprise, PerimeterX, DataDome, or Akamai Bot Manager, CRW may receive blocks more often than specialized anti-bot services.

Why it's limited: Effective anti-bot bypass at the highest level requires residential proxy networks, TLS fingerprint spoofing (matching browser JA3/JA4 signatures), browser behavioral mimicry (mouse movements, scroll patterns), and CAPTCHA solving integrations. These are expensive, legally complex, and require ongoing maintenance as detection systems evolve. CRW's focus is the 80% of scraping tasks that don't require sophisticated anti-bot — public docs, news, product pages, open APIs.

Who it affects: Teams scraping high-protection targets: e-commerce with active price-scraping protection, social media platforms, financial data sites, and any site actively investing in scraper detection.

Workaround: For difficult targets, combine CRW with an external residential proxy network. CRW supports HTTP/HTTPS proxy configuration:

curl -X POST http://localhost:3000/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://protected-site.com",
    "formats": ["markdown"],
    "proxy": "http://user:pass@residential-proxy.example.com:8080"
  }'

This separates the IP/fingerprint layer (handled by your proxy provider: Oxylabs, BrightData, Smartproxy) from the content extraction layer (handled by CRW). You get anti-bot bypass without requiring CRW to bundle a full anti-detection stack.
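In client code, that separation can be a per-target decision: attach the proxy parameter only for domains known to block datacenter IPs. A sketch — the domain list and proxy URL are illustrative, not CRW defaults:

```typescript
// Domains that have blocked direct requests in the past (illustrative).
const HARD_TARGETS = new Set(["protected-site.com", "shop.example.com"]);

// Build the /v1/scrape request body, adding the proxy field only
// when the target hostname is in the hard-target set.
function scrapeBody(url: string, proxy?: string): Record<string, unknown> {
  const host = new URL(url).hostname;
  const body: Record<string, unknown> = { url, formats: ["markdown"] };
  if (proxy && HARD_TARGETS.has(host)) body.proxy = proxy;
  return body;
}
```

This keeps proxy bandwidth (typically billed per GB) reserved for the targets that actually need it.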

Roadmap: We're evaluating built-in proxy rotation via configurable proxy pools. Native CAPTCHA solving is not planned — this is better handled at the proxy/browser layer.

Limitation 4: SPA Coverage Is Inconsistent

Current state: CRW's JavaScript rendering via LightPanda works for many SPAs, but reliability on complex React/Vue/Angular applications is lower than Playwright-based alternatives. Some pages with heavy client-side routing, lazy loading, or browser APIs not yet implemented in LightPanda may return incomplete content.

Why it's inconsistent: LightPanda is an experimental browser written in Zig that implements a subset of browser APIs. It's not Chromium — it doesn't have the full W3C API surface area. Pages that depend on APIs not yet implemented (e.g., some Web Animations API methods, certain Canvas operations, WebRTC) may not render correctly.

How to detect it: If a scrape returns thin content (fewer than 200 words) for a page you know has substantial content, or if the markdown is clearly navigation text rather than page content, LightPanda likely failed to render the application state. Compare with a manual browser visit.

Workaround: For URLs where content quality is critical and the page is JavaScript-heavy, route to Firecrawl or use Playwright directly. You can detect this conditionally:

async function scrapeWithFallback(url: string) {
  const result = await crwScrape(url);
  
  // If markdown is thin, fall back to Firecrawl
  if ((result.data?.markdown ?? "").split(/\s+/).length < 200) {
    console.warn(`Thin content for ${url}, falling back to Firecrawl`);
    return firecrawlScrape(url);
  }
  
  return result;
}

Roadmap: Improving LightPanda SPA coverage is the highest-priority gap in CRW's core scraping capability. We're actively contributing to LightPanda and tracking its API surface expansion.

Limitation 5: No WebSocket / SSE Streaming for Crawl Progress

Current state: The crawl progress API uses polling — you call GET /v1/crawl/:id repeatedly until the job completes. There's no WebSocket or Server-Sent Events stream for real-time progress updates.

Why it's missing: Polling is simpler to implement, has no persistent connection management, and works through proxies and load balancers without special configuration. For most backend automation use cases, polling every 2 seconds is perfectly adequate.
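The polling loop described above is a few lines of client code. A sketch, with the status fetch passed in as a function so it's testable — `getStatus` is a caller-supplied helper, not part of the CRW API:

```typescript
// Generic poll-until-terminal loop. getStatus is any function returning
// the current job state, e.g. a fetch against GET /v1/crawl/:id.
type CrawlStatus = { status: string; [key: string]: unknown };

async function waitForCrawl(
  getStatus: () => Promise<CrawlStatus>,
  intervalMs = 2000,
): Promise<CrawlStatus> {
  for (;;) {
    const s = await getStatus();
    if (s.status === "completed" || s.status === "failed") return s;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```

In practice `getStatus` would be something like `() => fetch(`${base}/v1/crawl/${id}`).then(r => r.json())`.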

Who it affects: UI-heavy applications that want to show a real-time progress bar ("Crawled 47/100 pages...") without the overhead of polling. Also affects very large crawls where you want early results rather than waiting for completion.

Workaround: Poll at a short interval and expose progress to your UI through your own backend:

// Server: poll CRW and push to client via your own SSE stream
app.get("/crawl-progress/:jobId", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  
  const poll = async () => {
    const status = await fetch(`http://crw:3000/v1/crawl/${req.params.jobId}`);
    const data = await status.json();
    
    res.write(`data: ${JSON.stringify(data)}\n\n`);
    
    if (data.status !== "completed" && data.status !== "failed") {
      setTimeout(poll, 1500);
    } else {
      res.end();
    }
  };
  
  poll();
});

Roadmap: SSE (Server-Sent Events) for crawl progress is planned as a simpler alternative to WebSockets. This will allow real-time progress without full-duplex WebSocket complexity.

Limitation 6: No Persistent Job Storage

Current state: Crawl job state is stored in memory. If CRW restarts — planned or unplanned — all in-progress crawl jobs are lost. The job ID becomes invalid and the status endpoint returns 404.

Why it's missing: Persistent storage means adding an external dependency (Redis, SQLite, PostgreSQL). This conflicts with the core design goal: a single binary with no external dependencies. Every dependency is an operational burden.

Who it affects: Teams running long crawl jobs (multi-hour site indexing) where an unexpected CRW restart would require restarting the entire crawl from scratch.

Workaround: Break large crawls into smaller batches and manage state externally:

// Instead of one 1000-page crawl, do 10 x 100-page crawls
// and track completion in your own database
function urlBatches(urls: string[], size: number): string[][] {
  const out: string[][] = [];
  for (let i = 0; i < urls.length; i += size) out.push(urls.slice(i, i + size));
  return out;
}

const batches = urlBatches(allUrls, 100);
for (const [i, batch] of batches.entries()) {
  const jobId = await startCrawl(batch[0], { limit: 100 });
  await waitForCompletion(jobId);
  await saveResultsToDb(jobId);
  console.log(`Completed batch ${i + 1}/${batches.length}`);
}

Also use a process manager that restarts CRW quickly: docker run --restart unless-stopped or a systemd unit with Restart=on-failure.
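External state tracking doesn't require a database — a JSON checkpoint file survives CRW restarts just fine for batch-level granularity. A sketch using Node's built-in fs module (the checkpoint path and record shape are illustrative):

```typescript
import { readFileSync, writeFileSync, existsSync } from "node:fs";

// Track which batches have completed so a restarted run can skip them.
type Checkpoint = { completedBatches: number[] };

function loadCheckpoint(path: string): Checkpoint {
  if (!existsSync(path)) return { completedBatches: [] };
  return JSON.parse(readFileSync(path, "utf8"));
}

function markBatchDone(path: string, batchIndex: number): void {
  const cp = loadCheckpoint(path);
  if (!cp.completedBatches.includes(batchIndex)) cp.completedBatches.push(batchIndex);
  writeFileSync(path, JSON.stringify(cp));
}
```

Before starting each batch, check `loadCheckpoint(path).completedBatches.includes(i)` and skip work that already finished before the restart.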

Roadmap: Optional SQLite-backed job persistence is planned as an opt-in feature. The default behavior (in-memory, no dependencies) will be preserved.

Limitation 7: No Built-in Retry with Backoff

Current state: CRW does not automatically retry failed requests. If a page returns a 429 (rate limited), 503 (service unavailable), or a network timeout, the scrape fails immediately and the error is returned to the caller.

Why it's missing: Retry logic with backoff is best implemented at the client layer where you can make informed decisions about which errors to retry and how aggressively. Server-side retry can mask important signals (persistent 403s, quota exhaustion).

Workaround: Implement retry in your client code:

async function scrapeWithRetry(url: string, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const res = await fetch("http://localhost:3000/v1/scrape", { // or https://fastcrw.com/api/v1/scrape for cloud
        method: "POST",
        headers: {
          "Authorization": "Bearer YOUR_API_KEY",
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ url, formats: ["markdown"] }),
      });
      const data = await res.json();
      if (data.success) return data;
      if (res.status === 429 || res.status === 503) {
        const wait = Math.min(1000 * Math.pow(2, attempt), 30000);
        await new Promise(r => setTimeout(r, wait));
        continue;
      }
      throw new Error(data.error);
    } catch (err) {
      if (attempt === maxRetries) throw err;
      await new Promise(r => setTimeout(r, 1000 * attempt));
    }
  }
}

Roadmap: Configurable per-request retry is planned for a future release. When added, it will be opt-in with explicit retry count and backoff strategy parameters.

Limitation 8: No Response Caching

Current state: CRW does not cache scraped responses. Every call to /v1/scrape performs a fresh HTTP fetch, regardless of how recently that URL was scraped. This means scraping the same URL twice costs twice the network round-trips.

Workaround: Add a caching layer in front of CRW. A Redis-backed approach works well:

import { createClient } from "redis";

const redis = createClient();
await redis.connect(); // redis v4+ requires an explicit connect before use
const TTL = 60 * 60; // 1 hour cache

async function cachedScrape(url: string) {
  const cacheKey = `scrape:${url}`;
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);
  
  const result = await fetch("http://localhost:3000/v1/scrape", { // or https://fastcrw.com/api/v1/scrape for cloud
    method: "POST",
    headers: {
      "Authorization": "Bearer YOUR_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, formats: ["markdown"] }),
  }).then(r => r.json());
  
  await redis.setEx(cacheKey, TTL, JSON.stringify(result));
  return result;
}
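If Redis is more infrastructure than you want, a single-process client can use an in-memory TTL map instead. A sketch with lazy eviction (entries are removed on read, not by a background timer):

```typescript
// Minimal in-process TTL cache. Suitable for a single client process;
// use Redis when multiple processes need to share the cache.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(private ttlMs: number) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // lazy eviction on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```

Wrap the scrape call the same way as the Redis version: check `cache.get(url)` first, and `cache.set(url, result)` after a fresh fetch.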

Roadmap: Optional HTTP response caching with configurable TTL is planned. When added, it will be opt-in to avoid surprising behavior for use cases that require fresh content.

Honest Summary: Limitations Matrix

| Limitation | Severity | Workaround Available | Roadmap Priority |
| --- | --- | --- | --- |
| No screenshots | Medium | ✅ Route to Firecrawl | Medium |
| No PDF/document parsing | Medium | ✅ pdfplumber / Firecrawl | Low |
| Anti-bot not best-in-class | Medium | ✅ Residential proxy | Medium |
| SPA coverage inconsistent | High | ⚠️ Fallback to Firecrawl | High |
| No crawl progress streaming | Low | ✅ Client-side polling proxy | Medium |
| No persistent job storage | Medium | ✅ External state tracking | Medium |
| No built-in retry | Low | ✅ Client-side retry | Low |
| No response caching | Low | ✅ Redis cache layer | Low |

What CRW Is Good At Today

Despite these gaps, CRW works well for a specific, well-defined set of use cases:

  • Scraping HTML-primary content to clean markdown for LLMs
  • Building RAG pipelines from websites and documentation
  • Exposing web scraping to AI agents via MCP
  • Self-hosting a Firecrawl-compatible API with minimal infrastructure
  • High-concurrency scraping of publicly accessible content
  • Low-memory self-hosting as a sidecar to existing applications

If your use case falls in this set, CRW's limitations are unlikely to be blocking. If you need screenshots, document parsing, or reliable SPA coverage today, use the tools that do those things well — and check back as the roadmap progresses.

How We Prioritize the Roadmap

GitHub issues drive our priority decisions. If a specific limitation is blocking your use case, open an issue with your use case description. Limitations with multiple real-world users asking for them get prioritized. The roadmap reflects what developers are actually trying to do with CRW, not what sounds impressive in a feature list.

Try CRW

Open-Source Path

docker run -p 3000:3000 ghcr.io/us/crw:latest

GitHub · Docs

Hosted Path — fastCRW

fastCRW — 50 free credits, no credit card required.

Frequently Asked Questions

What can't CRW do?

The main things CRW doesn't do yet: screenshot capture, PDF/DOCX parsing, best-in-class anti-bot bypass, reliable rendering of complex SPAs, WebSocket crawl progress streaming, persistent job storage across restarts, and automatic request retry. Each limitation has a workaround and most are on the roadmap. See the matrix above for the full picture.

Is CRW production-ready?

CRW is production-ready for its target use cases: scraping HTML content to clean markdown, building RAG pipelines, exposing web scraping via MCP, and self-hosting a Firecrawl-compatible API. It's not production-ready for screenshot capture, PDF indexing, or scraping highly protected sites without external proxy support. If your workflow stays within the supported use cases, the limitations listed here are unlikely to matter.

Does CRW support screenshots?

Not currently. Screenshot capture requires LightPanda's screenshot API to stabilize, which is still in development. If you need screenshots today, Firecrawl handles this well — and since CRW is API-compatible, you can route screenshot requests to Firecrawl while using CRW for content extraction.

Can CRW scrape JavaScript-heavy sites?

CRW handles many JavaScript-heavy pages through LightPanda, which provides headless browser rendering. Reliability is good for typical React/Vue SPAs, but lower than Playwright for complex apps with heavy client-side routing or many browser API dependencies. For pages where content quality is critical, use a fallback strategy as shown in the SPA section above.

Does CRW support PDF parsing?

Not currently. CRW scrapes HTML pages only. For PDF URLs, use a dedicated parser like pdfplumber (Python) or route to Firecrawl, which has full PDF extraction support.

When will screenshots be available?

We're targeting screenshot support once LightPanda's Page.captureScreenshot API is stable enough for production use. We don't give hard timelines for features that depend on third-party library maturity. The best way to get an update is to watch the GitHub repository or star the issue tracking screenshot support.

Get Started

Try CRW Free

Self-host for free (AGPL) or use fastCRW cloud with 500 free credits — no credit card required.