By the fastCRW team · Migration tutorial · Scrapy is excellent — this is about when an API fits better, not a takedown.
Should you even migrate? (read this first)
Scrapy is a mature, well-engineered crawling framework. If your team is happy maintaining spiders, has deep Scrapy expertise, and your bottleneck is not engineering time, you may not need to migrate at all. This guide is for the common situation where the costs have crept up: per-site selector maintenance, anti-bot whack-a-mole, HTML→clean-text plumbing for RAG, and ops for the Twisted reactor — and you would rather call an API that returns clean, LLM-ready data.
fastCRW is a Firecrawl-compatible, open-core web-data API (scrape/crawl/map/search) you can self-host as a single ~6 MB Rust binary. The migration is rarely all-or-nothing; the best path is usually incremental.
What maps cleanly, what to keep Scrapy for
| Scrapy concept | fastCRW equivalent | Notes |
|---|---|---|
| Spider that fetches + parses fields | /v1/scrape with markdown or JSON schema | Deletes most parser code |
| CrawlSpider / link-following | /v1/crawl (depth, limit) | Concurrency/dedupe handled |
| Sitemap/URL discovery | /v1/map | Fast URL inventory |
| SERP / discovery step | /v1/search + scrapeOptions | One call for find+fetch |
| Item pipelines (clean/validate/store) | Keep — your code | fastCRW returns clean data; you still own storage/validation |
| Scheduling (Scrapyd/cron) | Keep — your scheduler | fastCRW is stateless; call it from your orchestrator |
| Highly custom downloader middleware | Case-by-case | If you rely on bespoke middleware behavior, evaluate carefully |
Rule of thumb: fastCRW replaces the fetch + render + extract-to-clean-data layer. It does not replace your scheduling or your storage/validation logic — keep those.
Migration strategy: incremental, not big-bang
- Inventory spiders by type. Field-extraction spiders, broad crawlers, sitemap walkers, search-then-scrape jobs.
- Start with the most brittle spider. The one that breaks most on redesigns or anti-bot is the highest-ROI first migration.
- Wrap fastCRW behind your existing interface. Keep the spider's output contract; swap the internals to call fastCRW. Pipelines and storage do not change.
- Run side by side. Compare extracted fields against the old spider on real URLs before cutover.
- Decide hosted vs self-host. If data residency matters, run the AGPL-3.0 binary on your own infrastructure; otherwise use fastCRW Cloud.
- Retire spiders as parity is proven. Some spiders (heavy custom middleware) may stay on Scrapy indefinitely — that is fine.
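The side-by-side step can be as simple as a per-field diff between the item the old spider emitted and the item the fastCRW-backed wrapper returned for the same URL. A minimal sketch follows; `normalize`, `diff_items`, and the sample items are illustrative assumptions, not part of Scrapy or fastCRW.

```python
def normalize(value):
    """Collapse whitespace so cosmetic differences do not count as mismatches."""
    if isinstance(value, str):
        return " ".join(value.split())
    return value


def diff_items(old: dict, new: dict) -> dict:
    """Return {field: (old_value, new_value)} for every field that disagrees."""
    mismatches = {}
    for field in sorted(set(old) | set(new)):
        old_value = normalize(old.get(field))
        new_value = normalize(new.get(field))
        if old_value != new_value:
            mismatches[field] = (old_value, new_value)
    return mismatches


# Hypothetical outputs for one URL: the legacy spider vs. the fastCRW wrapper.
legacy = {"title": "Hello  World", "author": "Ada", "date": "2024-01-01"}
candidate = {"title": "Hello World", "author": "Ada Lovelace", "date": "2024-01-01"}

print(diff_items(legacy, candidate))
```

An empty result for a URL means the two paths agree on every field; anything else is a candidate for review before cutover.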
Before / after: a field-extraction spider
Before (Scrapy)
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "article"
    start_urls = ["https://example.com/blog/post"]

    def parse(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "author": response.css(".author::text").get(),
            "date": response.css("time::attr(datetime)").get(),
            "body": " ".join(response.css("article p::text").getall()),
        }
Breaks when the site changes class names or DOM structure; you maintain selectors forever.
After (fastCRW, schema-driven — no selectors)
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="key", api_url="http://localhost:3000")

def fetch_article(url: str) -> dict:
    res = app.scrape_url(url, params={
        "formats": ["json"],
        "jsonOptions": {"schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "author": {"type": "string"},
                "date": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["title", "body"],
        }},
    })
    return res["json"]
The schema is semantic — it survives redesigns far better than CSS selectors and lives in version control as plain data.
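Because validation stays in your own pipeline, it can help to check each extracted object against the same schema's `required` list before storing it. The sketch below assumes the schema from the example above; `missing_required` is a hypothetical helper, and treating blank strings as missing is a design choice, not fastCRW behavior.

```python
ARTICLE_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "date": {"type": "string"},
        "body": {"type": "string"},
    },
    "required": ["title", "body"],
}


def missing_required(item: dict, schema: dict) -> list:
    """List required fields that are absent, None, or blank in the item."""
    missing = []
    for field in schema.get("required", []):
        value = item.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


# Hypothetical extraction result with a blank body.
item = {"title": "Post", "author": None, "body": "   "}
print(missing_required(item, ARTICLE_SCHEMA))
```

A non-empty list is a signal to route the URL to a retry or review queue rather than writing a partial record.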
Before / after: a broad crawler
Before (Scrapy CrawlSpider)
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteSpider(CrawlSpider):
    name = "site"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        yield {"url": response.url,
               "text": " ".join(response.css("body ::text").getall())}
After (fastCRW crawl — one call, clean markdown)
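def crawl_site(start_url: str):
    # `app` is the FirecrawlApp client created in the previous example.
    result = app.crawl_url(start_url, params={
        "maxDepth": 3, "limit": 500, "formats": ["markdown"],
    })
    # Yield items in the same shape the old spider produced.
    for page in result["data"]:
        yield {"url": page["metadata"]["url"], "text": page["markdown"]}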
Concurrency, deduplication, depth control, and clean text extraction are handled. You delete the link-extractor rules and the body-text scraping glue.
Keeping Scrapy as orchestrator (hybrid)
You do not have to abandon Scrapy. A clean hybrid keeps Scrapy for scheduling/queueing and uses fastCRW for content:
import scrapy
from firecrawl import FirecrawlApp

crw = FirecrawlApp(api_key="key", api_url="http://localhost:3000")

class HybridSpider(scrapy.Spider):
    name = "hybrid"
    start_urls = ["https://example.com/index"]

    def parse(self, response):
        for href in response.css("a.item::attr(href)").getall():
            url = response.urljoin(href)
            # Synchronous call: it blocks Scrapy's reactor while it runs,
            # which is acceptable at low volume.
            data = crw.scrape_url(url, params={"formats": ["markdown"]})
            yield {"url": url, "markdown": data["markdown"]}
This is often the pragmatic end state: Scrapy where it is strong (orchestration, your pipelines), fastCRW where it is strong (fast, clean, low-maintenance extraction).
Operational differences to plan for
- Footprint. The Twisted reactor and per-process memory go away on migrated paths; fastCRW self-host is a single ~6 MB binary with low idle RAM — runs on a $5 VPS.
- Anti-bot. fastCRW handles common cases; for the hardest hostile targets you may still keep a specialized proxy/Scrapy path. Be honest about which sites those are.
- Error handling. fastCRW returns standard HTTP errors and per-URL crawl status. That is simpler than reading Twisted tracebacks, but your retry logic has to be adapted to HTTP semantics.
- Data residency. Self-host the AGPL-3.0 binary if scraped data must not leave your infra.
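Adapting retry logic to HTTP semantics usually means retrying only statuses that are plausibly transient and backing off between attempts. The following is a sketch under stated assumptions: `HttpError`, `call_with_retry`, and the set of retryable statuses are hypothetical, and `fetch` stands in for whatever client call you wrap.

```python
import time

# Assumption: these statuses are treated as transient and worth retrying.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retry(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying transient HTTP failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except HttpError as err:
            if err.status not in RETRYABLE_STATUS or attempt == max_attempts:
                raise  # permanent error, or out of attempts
            # Delay doubles each attempt: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

In practice `fetch` would translate your client's exceptions into `HttpError`; the `sleep` parameter is injectable so the backoff can be tested without waiting.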
When to keep Scrapy
- Heavy reliance on custom downloader/spider middleware with no clean API equivalent.
- Extremely hostile targets where your bespoke proxy + Scrapy stack already wins.
- A large, stable spider fleet that is not a maintenance burden — do not migrate for its own sake.
The maintenance economics that justify migrating
Scrapy's running cost is rarely the framework itself — it is the recurring human time spent keeping spiders alive. Three categories dominate that time, and they are exactly what fastCRW absorbs:
- Selector rot. Every target redesign breaks CSS/XPath selectors, and a spider fleet of any size means this happens somewhere most weeks. A semantic JSON schema describes intent ("the price") rather than DOM position, so it survives most redesigns untouched. Across a fleet, this is the single largest recurring saving and the easiest to underestimate until you total the tickets.
- Anti-bot drift. Maintaining custom headers, fingerprints, throttling, and proxy logic is a permanent arms race when you own it in middleware. Moving the fetch layer to a maintained engine moves that arms race off your team for the common cases (you keep it only for genuinely hostile targets you consciously choose to keep on Scrapy).
- Reliability plumbing. Retries, backoff, dedupe, depth limits, and concurrency tuning are code you wrote and now maintain. These are built into the crawl endpoint, so migrated paths shed that surface entirely.
The way to make this concrete for a migration decision is to count the last quarter's engineering hours spent on those three categories across your spiders. That number — not the per-page price of any API — is usually what justifies the migration, and it is invisible until you add it up.
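The tally itself is simple arithmetic once tickets are tagged by category. A minimal sketch, assuming a hypothetical ticket export of `(spider, category, hours)` rows; the category names and sample figures are made up for illustration.

```python
from collections import defaultdict

CATEGORIES = {"selector_rot", "anti_bot", "reliability"}

# Hypothetical ticket export: (spider, category, hours).
tickets = [
    ("article", "selector_rot", 6.0),
    ("article", "anti_bot", 3.5),
    ("site", "reliability", 4.0),
    ("site", "selector_rot", 2.5),
    ("site", "feature_request", 8.0),  # not maintenance; excluded below
]


def maintenance_hours(rows):
    """Sum hours per category, ignoring work outside the three categories."""
    totals = defaultdict(float)
    for _spider, category, hours in rows:
        if category in CATEGORIES:
            totals[category] += hours
    return dict(totals)


totals = maintenance_hours(tickets)
print(totals, "total:", sum(totals.values()))
```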
A staged rollout plan that de-risks the move
Big-bang migrations of a spider fleet fail for predictable reasons: output contract drift, missed edge cases, and anti-bot differences discovered in production. A staged plan removes that risk.
- Stage one. Pick the single spider that generates the most maintenance tickets and wrap fastCRW behind its existing output interface, leaving pipelines and storage untouched. Then run both in shadow for a week and diff the fields.
- Stage two. Promote that spider and migrate the next two by ticket volume.
- Stage three. Migrate the bulk of low-complexity field-extraction spiders, which are mechanical once the pattern is proven.
- Stage four. Deliberately decide which spiders stay on Scrapy forever (heavy custom middleware, the hardest hostile targets) and document why.
Ending with a clear, intentional hybrid is a successful outcome, not a failure to fully migrate; the goal is lower total maintenance, not framework purity.
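The four stages can be drafted mechanically from a fleet inventory. The sketch below is a simplification: the `name`, `tickets`, and `keep_on_scrapy` keys are hypothetical, and it places every remaining migratable spider in stage three without judging complexity.

```python
def plan_stages(spiders):
    """Assign each spider to a rollout stage (1-4) by ticket volume."""
    plan = {1: [], 2: [], 3: [], 4: []}
    migratable = []
    for spider in spiders:
        if spider["keep_on_scrapy"]:
            plan[4].append(spider["name"])  # intentional, documented hold-outs
        else:
            migratable.append(spider)
    # Highest ticket volume first: that is where migration pays off soonest.
    migratable.sort(key=lambda s: s["tickets"], reverse=True)
    for rank, spider in enumerate(migratable):
        stage = 1 if rank == 0 else 2 if rank <= 2 else 3
        plan[stage].append(spider["name"])
    return plan


# Hypothetical fleet inventory.
fleet = [
    {"name": "article", "tickets": 14, "keep_on_scrapy": False},
    {"name": "site", "tickets": 9, "keep_on_scrapy": False},
    {"name": "prices", "tickets": 5, "keep_on_scrapy": False},
    {"name": "docs", "tickets": 1, "keep_on_scrapy": False},
    {"name": "login_portal", "tickets": 20, "keep_on_scrapy": True},
]

print(plan_stages(fleet))
```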
Getting started
docker run -p 3000:3000 ghcr.io/us/crw:latest
Self-hosting is free under AGPL-3.0. fastCRW Cloud's free tier is a one-time allotment of 500 credits (not a monthly quota); the unlimited free path is self-hosting. GitHub · fastCRW Cloud.
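Once the container from the `docker run` command above is listening on port 3000, a first smoke test could be a plain POST to `/v1/scrape`. This sketch only builds the request; the JSON body shape and the bearer-token header are assumptions based on the Firecrawl-compatible API described earlier, and `build_scrape_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_scrape_request(url, base_url="http://localhost:3000", api_key="key"):
    """Build (but do not send) a POST to the /v1/scrape endpoint."""
    payload = json.dumps({"url": url, "formats": ["markdown"]}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/scrape",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumption: bearer-token auth
        },
        method="POST",
    )


request = build_scrape_request("https://example.com")
print(request.get_method(), request.full_url)

# To actually send it once the container is up:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.load(response))
```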
