By the fastCRW team · Engineering deep-dive · Last reviewed 2026-01-01
Disclosure: fastCRW is an open-core project we maintain. This guide is about the structural mechanics of "open source" in the scraping category, not a feature pitch — the mechanics are what matter when you're betting infrastructure on it.
"Open source" is doing a lot of unverified work in 2026
Almost every scraping tool now claims to be open source. The label has become close to meaningless on its own, because the interesting question is not "is there a public repo?" — it's "is the version I can run myself genuinely complete, or is the part I actually need behind a cloud paywall?" In 2026 a community forked a major open-core scraper specifically because its anti-bot engine was closed-source and cloud-only. That fork is the single most instructive event in the category, and it's why this guide exists.
The four flavors of "open source scraping"
- Genuinely complete OSS. The repo is the whole product. Self-host gets full feature parity, including anti-bot. No capability is gated behind a hosted service. This is rare and valuable.
- Open-core with a crippled OSS tier. A public repo exists, but the parts that make scraping actually work at scale — the anti-bot engine, the proxy logic — are cloud-only or degrade over time as features migrate to the paid hosted product. The repo is real; the usefulness self-hosted is not what the label implies.
- Permissive library, you own the ops. Fully open (Apache/MIT), but it's a library, not a service: you assemble proxies, browsers, queues, and anti-bot yourself. Free in license, expensive in engineer-time.
- Source-available, not open source. You can read the code but the license forbids the use you want (commercial hosting, redistribution). Reading ≠ freedom.
The trap is conflating these. "It's open source" sounds like flavor 1; in practice the most popular hosted players are flavor 2, and the popular self-host libraries are flavor 3. Knowing which flavor you're adopting is the entire decision.
The open-core bait-and-switch, mechanically
Here is how a crippled-OSS tier degrades trust over time, observed across multiple independent sources in 2026:
- The hosted product gets a closed-source anti-bot/stealth engine. Self-host can't bypass the protected sites that matter most, so the "free" path silently fails on exactly the corpus you care about (Cloudflare/WAF-fronted pages).
- New features ship cloud-only first; the self-host path "degrades over time" as the gap widens release over release.
- Self-host needs heavy infra anyway (Redis + a 5-service Compose file + multiple GB of RAM), so the "free" option also has a real operational cost — you pay in ops what you saved in license.
- The net effect is structural pressure toward the paid cloud, which engineers correctly read as bait-and-switch and respond to with distrust — and, in at least one prominent case, a community fork.
This isn't a moral complaint; it's a risk model. If the part you depend on can be moved cloud-only, your "open source" insurance policy is void exactly when you'd need to claim it.
What genuine parity looks like (the test)
To tell flavor 1 from flavor 2, ask these concrete questions of any "open source" scraper:
- Is the anti-bot/rendering engine in the open-source repo, or is it cloud-only? This is the single highest-signal question. If the hard part is closed, the OSS is decorative.
- Does self-host get the same API surface as the cloud? If you can move between them by changing a base URL, parity is real and your escape hatch works.
- Has any capability migrated from OSS to cloud-only over time? Check the changelog and issues, not the marketing.
- What does self-host actually require? One binary vs. a 5-service stack with multi-GB RAM is the difference between "free" and "free if you also run a cluster."
- What's the license, exactly? AGPL-3.0, Apache, MIT, or source-available — each has different obligations and freedoms (more below).
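The five questions can be recorded as a small checklist and mapped back to the four flavors. The sketch below is one possible mapping, not a definitive rubric: the field names, the `is_library` flag, the license allow-list, and the one-service threshold are all assumptions made for illustration, and the answers are meant to be filled in by hand from the repo, changelog, and docs.

```python
from dataclasses import dataclass
from enum import Enum


class Flavor(Enum):
    COMPLETE_OSS = 1        # repo is the whole product
    CRIPPLED_OPEN_CORE = 2  # the hard parts are cloud-only
    LIBRARY_OWN_OPS = 3     # open library, you assemble the rest
    SOURCE_AVAILABLE = 4    # readable, but not open source


# Illustrative allow-list of open-source licenses; extend as needed.
OPEN_LICENSES = {"AGPL-3.0", "Apache-2.0", "MIT"}


@dataclass
class ParityAnswers:
    engine_in_repo: bool      # Q1: anti-bot/rendering engine is in the open repo
    same_api_as_cloud: bool   # Q2: moving to cloud is only a base-URL change
    migrated_to_cloud: bool   # Q3: some capability moved from OSS to cloud-only
    self_host_services: int   # Q4: number of services self-host must run
    license_id: str           # Q5: license identifier
    is_library: bool = False  # extra assumption: a library, not a service


def classify(a: ParityAnswers) -> Flavor:
    """Map hand-collected answers to one of the four flavors."""
    if a.license_id not in OPEN_LICENSES:
        return Flavor.SOURCE_AVAILABLE
    if a.is_library:
        return Flavor.LIBRARY_OWN_OPS
    if a.engine_in_repo and a.same_api_as_cloud and not a.migrated_to_cloud:
        return Flavor.COMPLETE_OSS
    return Flavor.CRIPPLED_OPEN_CORE


def ops_heavy(a: ParityAnswers, max_services: int = 1) -> bool:
    """Q4 as a separate warning: 'free' that still needs a multi-service stack."""
    return a.self_host_services > max_services
```

For example, `classify(ParityAnswers(False, True, True, 5, "AGPL-3.0"))` lands in `Flavor.CRIPPLED_OPEN_CORE`: the license is open, but the engine is not in the repo. Keeping the ops question separate from the flavor is a deliberate choice here, since a tool can be complete and still heavy to run.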
Where fastCRW sits
fastCRW is flavor 1 with a managed option bolted on, not flavor 2 with a token repo. The engine is AGPL-3.0 and fully featured when self-hosted — there is no crippled OSS tier and no cloud-only anti-bot core; the same engine runs in both places. It is a single small Rust binary with a low idle footprint and a small Docker image: no Redis, no Python runtime, no Chromium, no 5-service Compose file. The Managed Cloud exists because some teams would rather not run a global proxy network themselves — it's convenience, not coercion, and the API is identical, so moving between self-host and Cloud is a base-URL change, not a migration.
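To make "a base-URL change, not a migration" concrete, here is a minimal client sketch in which the base URL is the only thing that differs between self-host and cloud. The endpoint path, default port, environment variable name, request body shape, and bearer-token header are hypothetical placeholders, not fastCRW's documented API; check the project's own reference for the real values.

```python
import json
import os
import urllib.request
from typing import Optional

# Every constant here is an assumed placeholder, not a documented value.
DEFAULT_BASE_URL = "http://localhost:3000"  # assumed self-host address
SCRAPE_PATH = "/v1/scrape"                  # hypothetical endpoint path


def build_scrape_request(target_url: str,
                         base_url: Optional[str] = None,
                         api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a scrape request against the chosen deployment."""
    base = base_url or os.environ.get("SCRAPER_BASE_URL", DEFAULT_BASE_URL)
    headers = {"Content-Type": "application/json"}
    if api_key:  # assumed: only the hosted deployment needs a key
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        base.rstrip("/") + SCRAPE_PATH, data=body, headers=headers, method="POST"
    )
```

Under these assumptions, `build_scrape_request(url, base_url="http://localhost:3000")` and the same call with a cloud base URL produce identical request bodies. If a tool you are evaluating needs more than this to switch deployments (different paths, different payloads, missing endpoints), parity is weaker than advertised.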
AGPL-3.0: what it actually obligates (and what it doesn't)
AGPL-3.0 scares legal teams more than it should because the obligation is narrower than the reputation:
- Using the hosted API or Managed Cloud places no copyleft obligation on your code. You are a network client of someone else's AGPL service; your application's source is entirely unaffected.
- Running the unmodified engine internally to scrape for your own product is fine. AGPL's network clause (section 13) is triggered by modifying the software and letting users interact with that modified version over a network, not by using its output.
- The network obligation appears only if you modify the engine and then offer that modified engine to others over a network. In that specific case you must offer those users the corresponding modified source. Separately, as with any GPL-family license, distributing copies of the engine itself means passing along its source under the same license. Most teams never hit either case.
- For teams whose policies forbid AGPL entirely, a commercial license is available — but the common fear ("AGPL means I must open-source my product") is simply not how it works for API/Cloud use or internal unmodified use.
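The network-use cases above reduce to a short decision rule. The sketch below is a simplification for reasoning about those cases only: the type and field names are invented for illustration, it ignores distribution of copies, and it is no substitute for reading the license text.

```python
from dataclasses import dataclass


@dataclass
class EngineUse:
    runs_engine: bool           # False if you only call the hosted API / Managed Cloud
    modified_engine: bool       # you changed the AGPL engine's own source
    offered_over_network: bool  # outside users interact with your instance remotely


def must_offer_engine_source(use: EngineUse) -> bool:
    """True only for the 'modified engine offered over a network' case."""
    if not use.runs_engine:
        # A network client of someone else's AGPL service has no obligation.
        return False
    return use.modified_engine and use.offered_over_network
```

Note that the function only ever talks about the engine's source. Nothing in this rule reaches the source of the application that consumes the scraped output.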
We cover the licensing logic in depth in AGPL-3.0 for SaaS, explained; the short version is that AGPL is a strong parity guarantee, because it is what makes the "self-host the exact same software" escape hatch legally durable.
Governance is the signal the marketing won't give you
The parity test catches the current state, but it doesn't predict the trajectory, and open-core's whole risk is the trajectory (flavor 1 quietly drifting toward flavor 2 over releases). To predict trajectory you read governance, not the README. Concrete signals:
- Is the license a copyleft that structurally prevents a closed competing fork (AGPL), or a permissive one that allows the open core to be strip-mined into a closed product?
- Are commits still landing in the open engine, or has development visibly shifted to a private cloud repo?
- When new capability ships, does it land in the OSS engine first or cloud-only first, and has that pattern changed over the last several releases?
- Is there a stated parity commitment, and has it ever been walked back?
A tool can pass the parity test today and fail it in a year because the incentives pointed that way the whole time. The most reliable single proxy: a copyleft license on the engine aligns the vendor's incentives with parity, because they cannot privately fork their own community's improvements into a closed product. That is exactly why AGPL on a scraping engine is a governance feature, not a licensing inconvenience.
The "fork as referendum" pattern
One underrated diagnostic: when a community forks an open-core tool, that fork is a referendum on the original's open-core honesty, and it's higher-signal than any review. People do not fork casually; maintaining a fork is expensive and thankless. A fork specifically motivated by "the part we need is closed-source / cloud-only" (which is exactly what happened to a major scraper in 2026 over its anti-bot engine) is the clearest possible market signal that the OSS was flavor 2, not flavor 1, regardless of how it was marketed. When evaluating any open-core scraper, search for its forks and read why they exist. A healthy flavor-1 project tends to attract contribution back to the mainline (nothing to escape); a flavor-2 project tends to attract escape forks. The presence, motivation, and momentum of forks tell you what the vendor's positioning won't.
The pragmatic recommendation
- If you want a complete, genuinely free scraper with a tiny footprint and an identical managed fallback → fastCRW (flavor 1).
- If you want maximal Python extraction control and accept owning all the ops → a permissive library like Crawl4AI (flavor 3). It is free in license, but budget the engineer-time.
- If you adopt an open-core hosted product, run the five-question parity test first and assume the gap between self-host and cloud will widen, not narrow, over time.
- Never adopt "source-available" expecting open-source freedoms; read the license for the use you actually need.
Try the complete-OSS version
docker compose up # full parity, AGPL-3.0, no cloud-only engine, no key
Managed Cloud (optional, same API): a one-time, lifetime grant of 500 free credits, no card required. fastcrw.com · GitHub
Related: AGPL-3.0 for SaaS explained · Self-host vs managed scraping · Local-first scraping & data privacy
