By the fastCRW team · Benchmark figures verified 2026-05-18 from a run dated 2026-05-08 · Verify independently before quoting internally.
Disclosure: We build fastCRW. This is a vendor-authored engineering post. The performance numbers below are from a single benchmark run and we publish the slow tail alongside the wins — weight the conclusions accordingly.
When to migrate a Ruby web scraper to Go
If you are reading this, you probably have a legacy Ruby scraper — Nokogiri for parsing, Mechanize or HTTParty for fetching — that worked fine at a few hundred pages a day and now buckles at tens of thousands. The instinct is to migrate the Ruby web scraper to Go for concurrency and throughput, and that instinct is often right. But a full rewrite is not the only path, and it is rarely the cheapest. This post covers three options: rewrite in Go, keep Ruby and offload the heavy work, or have your Go service call a scraping API so the rewrite never touches parsing logic at all.
The honest version of the trade-off matters here, because a Go rewrite costs engineering weeks and the gains are real but bounded. Let's map the rewrite first, then the shortcut, then a build-vs-buy decision you can defend in a planning doc.
Why Ruby hits a wall at scale
Ruby's concurrency story is the usual blocker. MRI's Global VM Lock (GVL) means threads never execute Ruby code in parallel: they interleave, and a thread gives up the lock mainly while it waits on I/O. For an I/O-bound scraper that is less catastrophic than it sounds, because network waits release the GVL, but you still pay for per-thread memory, context switching, and a parser (Nokogiri wraps libxml2) that allocates heavily on large or malformed documents. Mechanize layers a stateful agent on top, which adds further overhead. At thousands of concurrent fetches, the process footprint and tail latency grow faster than the workload does.
Why Go is the common target
Go's appeal for scraping is goroutines: cheap, scheduled in user space, and trivially fanned out across a worker pool with a bounded channel. A Go scraper that holds 5,000 in-flight requests does so with a fraction of the per-unit overhead of 5,000 Ruby threads. You also get a single static binary to deploy — no Ruby runtime, no gem bundle, no version drift between dev and prod. For a deeper comparison of the language ecosystems, see web scraping in Ruby and web scraping in Go.
Mapping the rewrite, component by component
If you do rewrite, the migration is mostly a 1:1 mapping of concerns. The table below pairs the Ruby building blocks with their Go equivalents — or with the place where an API call replaces the block entirely.
| Ruby (legacy) | Go equivalent | Or: offload to an API |
|---|---|---|
| Nokogiri HTML parsing | goquery / net/html | Markdown or JSON-schema output from /v1/scrape |
| Mechanize stateful agent | Explicit http.Client + cookie jar | One stateless request per page |
| Thread pool + GVL | Goroutines + bounded worker channel | Concurrency handled server-side |
| Hand-rolled retries/backoff | context deadlines + retry loop | Renderer fallback inside the engine |
| Selenium/Watir for JS pages | chromedp / a headless browser | chrome renderer on the API |
Concurrency: the part worth getting right
The single biggest win of a Go rewrite is also the easiest place to shoot yourself in the foot: unbounded goroutines will happily open 50,000 sockets and get your IP blocked or your process OOM-killed. Use a bounded worker pool (a buffered channel of N tokens, one acquired per fetch) plus a rate limiter (golang.org/x/time/rate) so you stay polite to the target. This is the discipline that Ruby's thread pool and GVL imposed on you by accident; in Go you have to add it back on purpose.
Parsing and the cleanup you still own
Here is the catch a pure rewrite doesn't solve: Nokogiri gave you raw DOM, and so will goquery. You still own boilerplate stripping, main-content detection, JavaScript-rendered pages that return an empty shell to a plain http.Get, and the per-site selector churn that breaks every time a site reships its markup. The language changed; the maintenance surface did not. That is exactly the work the API path removes.
The simpler shortcut: call an API from Go
The third option keeps your Go service thin. Instead of porting parsers and a browser fleet, your Go code makes an HTTP request to a scraping engine and gets back clean Markdown (or typed JSON) — no DOM walking, no headless Chrome to operate. fastCRW exposes a Firecrawl-compatible REST API, so the call is a plain POST /v1/scrape:
- Send a URL, get Markdown. POST /v1/scrape with a JSON body containing the target URL returns LLM-ready Markdown by default: 1 credit on the http or lightpanda renderer, 2 credits when the chrome renderer is needed for JavaScript pages.
- Want fields instead of prose? Add formats: ["json"] and a jsonSchema and the engine fills your schema from the page (5 credits per request; LLM extraction supports OpenAI and Anthropic providers only).
- Let the engine pick the renderer. The default auto renderer falls back chrome → lightpanda → http, so you don't branch on "does this page need a browser?" in your Go code.
The engine doing the rendering and cleanup is written in Rust and ships as a single static binary (~8 MB image, one container — a structural fact, not a benchmark claim), which matters for the build-vs-buy math below. Your Go program stays a thin orchestrator: a worker pool issuing http.Client requests to the API, collecting structured results.
Self-host it next to your Go service
You are not forced onto a hosted cloud to get this. The engine is AGPL-3.0 and self-hosting is free — you pay only for your own server (see pricing for the managed option). A common pattern is to run the single binary as a sidecar container next to your Go service and point your Go HTTP client at http://localhost:PORT/v1/scrape. Scraped content and target URLs never leave your infrastructure, and you have replaced "rewrite Nokogiri and operate a browser fleet" with "deploy one more small binary."
Performance expectations, disclosed honestly
The reason to consider the API path on speed grounds is the median, and the reason to plan carefully is the tail. On Firecrawl's own public scrape-content dataset — 819 labeled URLs out of 1,000, measured with diagnose_3way.py on 2026-05-08 — fastCRW posted a p50 of 1914 ms, which beats Firecrawl's 2305 ms and is effectively tied with Crawl4AI (1916 ms). On the same run it had the highest accuracy of the three tools tested: 63.74% truth-recall versus 59.95% (Crawl4AI) and 56.04% (Firecrawl), with 0 thrown errors across 3,000 requests and an 87.7% scrape-success rate.
Now the honest part: fastCRW's p90 was 14157 ms — the worst of the three tested (Crawl4AI 4754 ms, Firecrawl 6937 ms). That is not noise. The chrome-stealth fallback that recovers the hard URLs the other tools miss is the same mechanism that produces the slow tail. If your Go service has a tight per-request deadline, set a generous context timeout and treat tail pages as a separate, retried queue rather than blocking a worker on them. The full p50/p90/p99 split lives on /benchmarks — never plan against a single average.
Why a Rust engine helps but isn't magic
People expect "rewrite the slow thing in a fast language" to produce a flat speed multiplier. It does not. A Rust (or Go) engine removes interpreter and GC overhead and parses faster, but a scraper's wall-clock time is dominated by the network round trip and, for JavaScript pages, by the browser render. No language rewrite shortens a 12-second JS-heavy page load. That is exactly why the median is fast and the tail is long — and why we publish both numbers instead of a tidy multiplier.
Build vs buy for the rewrite
Three paths, and a rough rule for each:
- Pure Go rewrite — worth it when scraping logic is your core product, you need fine-grained per-request control, and you have the engineering weeks. You own parsing, rendering, anti-bot, and selector maintenance forever, but you control every byte.
- Keep Ruby, offload heavy pages — the cheapest near-term move. Leave the Ruby orchestration in place and call the API for the JS-heavy or accuracy-critical pages that were the actual bottleneck. No rewrite at all.
- Thin Go service + API — the sweet spot if you wanted Go for deployment and concurrency anyway, but don't want to own parsing and a browser fleet. Your Go code orchestrates; the Rust engine renders and cleans.
The deciding question is rarely "Ruby or Go?" It is "how much of the scraper do we want to keep maintaining?" A pure rewrite changes the language and keeps all the maintenance; the API path removes the parsing-and-browser maintenance regardless of which language calls it.
What the API path does not replace
State the limits plainly so you don't migrate into a wall:
- Stateless per request. There is no persistent session — multi-step logged-in flows that Mechanize handled need their own state layer (or keep them in Go/Ruby with a real cookie jar).
- No built-in anti-bot / Fire-engine. Hardened targets may still need a proxy layer in front of your requests.
- No screenshot output (a request for formats: ["screenshot"] returns HTTP 422), no multi-URL batch extract, and no managed agent; you compose the loop yourself with a worker pool.
- LLM extraction is OpenAI/Anthropic only. If you need a different provider for schema-filling, that path is closed today.
Where a Go rewrite genuinely wins
To be fair to the option this post complicates: a from-scratch Go scraper wins when you need total control. Custom retry semantics per domain, request-level fingerprinting, a proxy-rotation strategy you tune by hand, or integration with an existing Go data pipeline are all cases where owning the code beats calling an API. If scraping is the product rather than plumbing, write the Go. The architectural reasoning behind picking a compiled engine for the hot path is worth reading — see Rust vs Python scrapers: architecture. And if you do go the thin-Go-plus-API route, the Go web scraping quickstart shows the client code end to end.
Sources
- fastCRW 3-way scrape benchmark: bench/server-runs/RESULT_3WAY_1000_FULL.md (diagnose_3way.py, 819 labeled URLs, 2026-05-08): truth-recall 63.74%, p50 1914 ms, p90 14157 ms, 87.7% scrape-success, 0 errors.
- fastCRW repo and endpoint table: github.com/us/crw · managed cloud fastcrw.com
- marketing/CANONICAL-FACTS.md §1 (product identity), §2 (self-host free), §4 (API surface), §5 (scrape benchmark), §7 (structural footprint).
Related: Web scraping in Ruby · Web scraping in Go · Go web scraping quickstart · Rust vs Python scrapers
