Do I have to rewrite everything to migrate from Scrapy?

No. Migrate incrementally: replace the fetch+parse internals of brittle spiders first while keeping your pipelines, storage, and scheduler. Many teams end up with a hybrid — Scrapy for orchestration, fastCRW for clean extraction.

What does fastCRW NOT replace in a Scrapy project?

It does not replace your scheduling/queueing (keep Scrapyd/cron/your orchestrator), your item pipelines for validation and storage, or highly custom downloader middleware with no API equivalent. It replaces the fetch/render/extract-to-clean-data layer.

Migrating from Scrapy to fastCRW: A Practical Guide (2026)

By the fastCRW team · Migration tutorial · Scrapy is excellent — this is about when an API fits better, not a takedown.

Should you even migrate? (read this first)

Scrapy is a mature, well-engineered crawling framework. If your team is happy maintaining spiders, has deep Scrapy expertise, and your bottleneck is not engineering time, you may not need to migrate at all. This guide is for the common situation where the costs have crept up: per-site selector maintenance, anti-bot whack-a-mole, HTML→clean-text plumbing for RAG, and ops for the Twisted reactor — and you would rather call an API that returns clean, LLM-ready data.

fastCRW is a Firecrawl-compatible, open-core web-data API (scrape/crawl/map/search) you can self-host as a single ~6 MB Rust binary. The migration is rarely all-or-nothing; the best path is usually incremental.

What maps cleanly, what to keep Scrapy for

| Scrapy concept | fastCRW equivalent | Notes |
| --- | --- | --- |
| Spider that fetches + parses fields | `/v1/scrape` with markdown or JSON schema | Deletes most parser code |
| CrawlSpider / link-following | `/v1/crawl` (depth, limit) | Concurrency/dedupe handled |
| Sitemap/URL discovery | `/v1/map` | Fast URL inventory |
| SERP / discovery step | `/v1/search` + scrapeOptions | One call for find+fetch |
| Item pipelines (clean/validate/store) | Keep (your code) | fastCRW returns clean data; you still own storage/validation |
| Scheduling (Scrapyd/cron) | Keep (your scheduler) | fastCRW is stateless; call it from your orchestrator |
| Highly custom downloader middleware | Case-by-case | If you rely on bespoke middleware behavior, evaluate carefully |

Rule of thumb: fastCRW replaces the fetch + render + extract-to-clean-data layer. It does not replace your scheduling or your storage/validation logic — keep those.
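
The table names `/v1/map` and `/v1/search` without showing a request. As a rough sketch, assuming fastCRW accepts the same JSON bodies as the Firecrawl v1 REST API (a `url` field for map; `query`, `limit`, and `scrapeOptions` for search) and bearer-token auth, the two calls could be prepared with the standard library. `BASE_URL`, `API_KEY`, and the helper names are illustrative, and the body shapes should be checked against your own deployment.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # assumption: local self-hosted instance
API_KEY = "key"                     # placeholder credential


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST to a fastCRW endpoint."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",  # assumed auth scheme
        },
        method="POST",
    )


def map_request(site_url: str) -> urllib.request.Request:
    # Assumed body shape: {"url": ...}, as in the Firecrawl v1 map endpoint.
    return build_request("/v1/map", {"url": site_url})


def search_request(query: str, limit: int = 5) -> urllib.request.Request:
    # Assumed body shape: a search query plus scrapeOptions, so each hit
    # comes back with its page content already fetched as markdown.
    return build_request("/v1/search", {
        "query": query,
        "limit": limit,
        "scrapeOptions": {"formats": ["markdown"]},
    })


def send(req: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `send(map_request("https://example.com"))` against a running instance would return the URL inventory; building the request separately from sending it keeps the payload easy to inspect and unit-test.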

Migration strategy: incremental, not big-bang

1. Inventory spiders by type: field-extraction spiders, broad crawlers, sitemap walkers, and search-then-scrape jobs.
2. Start with the most brittle spider. The one that breaks most often on redesigns or anti-bot changes is the highest-ROI first migration.
3. Wrap fastCRW behind your existing interface. Keep the spider's output contract and swap the internals to call fastCRW; pipelines and storage do not change.
4. Run side by side. Compare extracted fields against the old spider on real URLs before cutover.
5. Decide hosted vs self-host. If data residency matters, run the AGPL-3.0 binary yourself; otherwise use fastCRW Cloud.
6. Retire spiders as parity is proven. Some spiders (those with heavy custom middleware) may stay on Scrapy indefinitely, and that is fine.
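
The side-by-side step can be made mechanical. The following is a minimal sketch, assuming both the old spider and the fastCRW-backed replacement emit plain dicts with the same field names; `diff_items` and `parity_rate` are hypothetical helpers, not part of Scrapy or fastCRW.

```python
def diff_items(old: dict, new: dict, fields: list[str]) -> dict:
    """Return {field: (old_value, new_value)} for fields that disagree."""
    def norm(value):
        # Collapse whitespace so cosmetic differences do not count as drift.
        return " ".join(str(value).split()) if value is not None else None

    return {
        f: (old.get(f), new.get(f))
        for f in fields
        if norm(old.get(f)) != norm(new.get(f))
    }


def parity_rate(pairs: list[tuple[dict, dict]], fields: list[str]) -> float:
    """Fraction of (old, new) item pairs whose compared fields all match."""
    if not pairs:
        return 0.0
    matches = sum(1 for old, new in pairs if not diff_items(old, new, fields))
    return matches / len(pairs)
```

Whitespace is normalized because markdown or schema extraction rarely reproduces the exact spacing of joined CSS text nodes. Free-text fields such as `body` may need a looser comparison (for example, a similarity threshold) than exact equality.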

Before / after: a field-extraction spider

Before (Scrapy)

import scrapy

class ArticleSpider(scrapy.Spider):
    name = "article"
    start_urls = ["https://example.com/blog/post"]

    def parse(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "author": response.css(".author::text").get(),
            "date": response.css("time::attr(datetime)").get(),
            "body": " ".join(response.css("article p::text").getall()),
        }

This spider breaks whenever the site changes class names or DOM structure, so you maintain its selectors for as long as it runs.

After (fastCRW, schema-driven — no selectors)

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="key", api_url="http://localhost:3000")

def fetch_article(url: str) -> dict:
    res = app.scrape_url(url, params={
        "formats": ["json"],
        "jsonOptions": {"schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "author": {"type": "string"},
                "date": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["title", "body"],
        }},
    })
    return res["json"]

The schema is semantic — it survives redesigns far better than CSS selectors and lives in version control as plain data.
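
Because the schema is plain data, the same object can also guard your pipeline against incomplete extractions. The sketch below is a deliberately small check, not a full JSON Schema validator (a library such as `jsonschema` covers the complete specification); `schema_problems` is a hypothetical helper and `ARTICLE_SCHEMA` simply repeats the schema from the example above.

```python
ARTICLE_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "date": {"type": "string"},
        "body": {"type": "string"},
    },
    "required": ["title", "body"],
}


def schema_problems(item: dict, schema: dict) -> list[str]:
    """List required fields that are missing or empty, and string-typed
    fields that came back with a non-string value."""
    problems = []
    for name in schema.get("required", []):
        if not item.get(name):
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        value = item.get(name)
        if (spec.get("type") == "string"
                and value is not None
                and not isinstance(value, str)):
            problems.append(f"expected string for field: {name}")
    return problems
```

An empty list means the item can be handed to your existing item pipeline unchanged; anything else is a candidate for a retry or a manual look.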

Before / after: a broad crawler

Before (Scrapy CrawlSpider)

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteSpider(CrawlSpider):
    name = "site"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        yield {"url": response.url,
               "text": " ".join(response.css("body ::text").getall())}

After (fastCRW crawl — one call, clean markdown)

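def crawl_site(start_url: str):
    # `app` is the FirecrawlApp client created in the previous example.
    result = app.crawl_url(start_url, params={
        "maxDepth": 3, "limit": 500, "formats": ["markdown"],
    })
    # Yield one item per crawled page, in the same shape the old spider produced.
    for page in result["data"]:
        yield {"url": page["metadata"]["url"], "text": page["markdown"]}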

Concurrency, deduplication, depth control, and clean text extraction are handled. You delete the link-extractor rules and the body-text scraping glue.
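
Crawl results can include pages with no usable text, and the metadata key that carries the page URL may differ between deployments. As a defensive sketch (the fallback to `sourceURL` is an assumption based on Firecrawl-style responses, and `page_to_item` / `pages_to_items` are hypothetical names), the conversion to pipeline items might look like this:

```python
from typing import Iterable, Iterator, Optional


def page_to_item(page: dict) -> Optional[dict]:
    """Convert one crawl result page into a pipeline item, or None to skip it."""
    metadata = page.get("metadata") or {}
    # Assumption: the URL is under "url" (as in the crawl example above);
    # "sourceURL" is tried as a fallback.
    url = metadata.get("url") or metadata.get("sourceURL")
    text = page.get("markdown")
    if not url or not text:
        return None
    return {"url": url, "text": text}


def pages_to_items(pages: Iterable[dict]) -> Iterator[dict]:
    """Yield items for every page that has both a URL and markdown text."""
    for page in pages:
        item = page_to_item(page)
        if item is not None:
            yield item
```

Skipping empty pages here keeps the output contract identical to the old spider's while avoiding `KeyError`s on partial results.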

Keeping Scrapy as orchestrator (hybrid)

You do not have to abandon Scrapy. A clean hybrid keeps Scrapy for scheduling/queueing and uses fastCRW for content:

import scrapy
from firecrawl import FirecrawlApp

crw = FirecrawlApp(api_key="key", api_url="http://localhost:3000")

class HybridSpider(scrapy.Spider):
    name = "hybrid"
    start_urls = ["https://example.com/index"]

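    def parse(self, response):
        for href in response.css("a.item::attr(href)").getall():
            url = response.urljoin(href)
            # Note: scrape_url is a blocking call and runs on Scrapy's reactor
            # thread here, so it stalls other in-flight requests while it waits.
            data = crw.scrape_url(url, params={"formats": ["markdown"]})
            yield {"url": url, "markdown": data["markdown"]}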

This is often the pragmatic end state: Scrapy where it is strong (orchestration, your pipelines), fastCRW where it is strong (fast, clean, low-maintenance extraction).

Operational differences to plan for

Footprint. The Twisted reactor and per-process memory go away on migrated paths; fastCRW self-host is a single ~6 MB binary with low idle RAM, small enough to run on a $5 VPS.
Anti-bot. fastCRW handles common cases; for the hardest hostile targets you may still keep a specialized proxy/Scrapy path. Be honest about which sites those are.
Error handling. fastCRW returns standard HTTP errors and per-URL crawl status. That is simpler than reading Twisted tracebacks, but you must adapt your retry logic to HTTP status codes.
Data residency. Self-host the AGPL-3.0 binary if scraped data must not leave your infra.
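
Adapting retry logic to HTTP semantics usually means retrying only on throttling and server errors. Here is a minimal sketch, assuming a hypothetical `send` callable that performs one request and returns `(status_code, parsed_body)`; the set of retryable statuses is a common convention, not something fastCRW prescribes.

```python
import random
import time
from typing import Callable, Tuple

RETRYABLE_STATUSES = {429, 500, 502, 503, 504}  # assumed retry policy


def call_with_retry(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call send() until it succeeds, fails permanently, or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if 200 <= status < 300:
            return body
        if status not in RETRYABLE_STATUSES or attempt == max_attempts:
            raise RuntimeError(f"request failed with HTTP {status}")
        # Exponential backoff (base, 2x, 4x, ...) plus up to 50% jitter.
        delay = base_delay * 2 ** (attempt - 1)
        sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("no attempts were made")
```

Client errors such as 400 or 404 fail immediately, since repeating the same request will not change the outcome. The `sleep` parameter is injectable so the policy can be tested without waiting.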

When to keep Scrapy

Heavy reliance on custom downloader/spider middleware with no clean API equivalent.
Extremely hostile targets where your bespoke proxy + Scrapy stack already wins.
A large, stable spider fleet that is not a maintenance burden — do not migrate for its own sake.

The maintenance economics that justify migrating

Scrapy's running cost is rarely the framework itself — it is the recurring human time spent keeping spiders alive. Three categories dominate that time, and they are exactly what fastCRW absorbs:

Selector rot. Every target redesign breaks CSS/XPath selectors, and a spider fleet of any size means this happens somewhere most weeks. A semantic JSON schema describes intent ("the price") rather than DOM position, so it survives most redesigns untouched. Across a fleet, this is the single largest recurring saving and the easiest to underestimate until you total the tickets.
Anti-bot drift. Maintaining custom headers, fingerprints, throttling, and proxy logic is a permanent arms race when you own it in middleware. Moving the fetch layer to a maintained engine moves that arms race off your team for the common cases (you keep it only for genuinely hostile targets you consciously choose to keep on Scrapy).
Reliability plumbing. Retries, backoff, dedupe, depth limits, and concurrency tuning are code you wrote and now maintain. These are built into the crawl endpoint, so migrated paths shed that surface entirely.

The way to make this concrete for a migration decision is to count the last quarter's engineering hours spent on those three categories across your spiders. That number — not the per-page price of any API — is usually what justifies the migration, and it is invisible until you add it up.
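
The tally itself is simple arithmetic once tickets are labeled. A sketch, assuming you can export maintenance tickets as records with `spider`, `category`, and `hours` fields (these field names and category labels are hypothetical, chosen to mirror the three categories above):

```python
from collections import defaultdict

CATEGORIES = ("selector_rot", "anti_bot", "reliability")


def hours_by_category(tickets: list[dict]) -> dict:
    """Sum hours per maintenance category, ignoring unrelated tickets."""
    totals = defaultdict(float)
    for t in tickets:
        if t["category"] in CATEGORIES:
            totals[t["category"]] += t["hours"]
    return dict(totals)


def hours_by_spider(tickets: list[dict]) -> list[tuple[str, float]]:
    """Rank spiders by maintenance hours, most expensive first."""
    totals = defaultdict(float)
    for t in tickets:
        if t["category"] in CATEGORIES:
            totals[t["spider"]] += t["hours"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

The per-category total is the number to weigh against API cost; the per-spider ranking suggests which spider to migrate first.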

A staged rollout plan that de-risks the move

Big-bang migrations of a spider fleet fail for predictable reasons: output contract drift, missed edge cases, and anti-bot differences discovered in production. A staged plan removes that risk.

Stage one: pick the single spider that generates the most maintenance tickets and wrap fastCRW behind its existing output interface, leaving pipelines and storage untouched, then run both in shadow for a week and diff the fields.
Stage two: promote that spider and migrate the next two by ticket volume.
Stage three: migrate the bulk of low-complexity field-extraction spiders, which are mechanical once the pattern is proven.
Stage four: deliberately decide which spiders stay on Scrapy forever (heavy custom middleware, the hardest hostile targets) and document why.

Ending with a clear, intentional hybrid is a successful outcome, not a failure to fully migrate; the goal is lower total maintenance, not framework purity.
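
The stage assignment can be derived from the same ticket data. This is only a sketch of the ordering rule described above, assuming each spider is a record with a `name`, a `tickets` count, and an optional `keep_on_scrapy` flag (all hypothetical field names); it treats every remaining spider as stage three, which you would refine by complexity in practice.

```python
def plan_stages(spiders: list[dict]) -> dict[int, list[str]]:
    """Assign spiders to rollout stages by descending ticket volume."""
    keep = [s["name"] for s in spiders if s.get("keep_on_scrapy")]
    candidates = sorted(
        (s for s in spiders if not s.get("keep_on_scrapy")),
        key=lambda s: s["tickets"],
        reverse=True,
    )
    names = [s["name"] for s in candidates]
    return {
        1: names[:1],   # the single noisiest spider, run in shadow first
        2: names[1:3],  # the next two by ticket volume
        3: names[3:],   # the remaining spiders
        4: keep,        # intentionally staying on Scrapy
    }
```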

Getting started

docker run -p 3000:3000 ghcr.io/us/crw:latest

Self-hosting is free under AGPL-3.0. fastCRW Cloud's free tier is a one-time allotment of 500 lifetime credits (not a monthly quota); the unlimited free path is self-hosting. GitHub · fastCRW Cloud.