By the fastCRW team · Java tutorial & analysis · Last reviewed 2026-01-01
Disclosure: fastCRW is a single-binary scraping engine with an HTTP API. This guide is candid about the JVM-specific cost of doing scraping in-process and where a sidecar is the better architecture.
Java scraping is almost always inside a long-lived service
Java scraping lives in Spring Boot services, batch jobs, and enterprise backends — long-running JVM processes. That context defines the real problem: not "can Java parse HTML" (JSoup has done that superbly for years) but "what does it cost to host the render-heavy, anti-bot-heavy part of scraping inside a JVM service whose memory and warm-up characteristics you're already managing carefully?" The answer is a footprint tax most teams under-account for, and that's this guide's subject.
The Java stack
- JSoup — the standard: fetch + parse + CSS-selector extraction in one clean, fast, dependency-light library. For static HTML this is genuinely excellent and you should just use it.
- Java 11+ HttpClient — the modern built-in HTTP client (async, HTTP/2). Pair with JSoup's parser when you want explicit fetch control.
- HtmlUnit — a headless "GUI-less browser" in pure Java; handles some JS without a real browser, but its JS engine diverges from modern Chromium and breaks on heavy SPAs.
- Selenium / Playwright-for-Java — drive a real browser for true JS rendering. This is the heavyweight path: a browser process plus a driver alongside your JVM.
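As a rough illustration of the "explicit fetch control" option, here is a minimal sketch that uses only the built-in HttpClient. The class and method names (ExplicitFetch, newClient, get, fetchHtml) are made up for this example, and the timeouts are arbitrary. The JSoup hand-off appears only as a comment so the block compiles without extra dependencies.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;

final class ExplicitFetch {
    // One shared client; it prefers HTTP/2 and falls back to HTTP/1.1 when the server lacks it.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(5))
                .build();
    }

    // Plain GET with a per-request timeout and an explicit User-Agent.
    static HttpRequest get(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(8))
                .header("User-Agent", "Mozilla/5.0")
                .GET()
                .build();
    }

    // Non-blocking fetch: completes with the body on a 2xx status, fails otherwise.
    static CompletableFuture<String> fetchHtml(HttpClient client, String url) {
        return client.sendAsync(get(url), HttpResponse.BodyHandlers.ofString())
                .thenApply(resp -> {
                    if (resp.statusCode() / 100 != 2) {
                        throw new IllegalStateException("HTTP " + resp.statusCode() + " for " + url);
                    }
                    return resp.body();
                });
    }

    public static void main(String[] args) {
        String html = fetchHtml(newClient(), "https://example.com/").join();
        // With JSoup on the classpath, the hand-off would look like:
        // Document doc = Jsoup.parse(html, "https://example.com/");
        System.out.println(html.length() + " chars fetched");
    }
}
```

Separating fetch from parse this way lets you tune redirects, timeouts, and HTTP version on the client while JSoup only sees a string.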
Minimal JSoup example
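import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
// Fetch and parse in one call; get() throws IOException on network or HTTP errors.
Document doc = Jsoup.connect("https://example.com/")
    .userAgent("Mozilla/5.0")
    .timeout(8000) // milliseconds
    .get();
// selectFirst returns null when nothing matches, so guard before calling text().
Element heading = doc.selectFirst("article h1");
System.out.println(heading != null ? heading.text() : "(no article h1 found)");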
The Java-specific cost: the JVM footprint tax stacks the wrong way
Every language has a footprint story; Java's is distinctive because the costs compound on the JVM:
- Baseline JVM memory. A scrape worker is a JVM — heap, metaspace, thread stacks — hundreds of MB before it does anything. Fine when amortized across a busy service; expensive when you scale out dedicated scrape workers, because each one carries a full JVM.
- Browser on top of JVM. Add Selenium/Playwright and you now run a Chromium process next to the JVM. You're paying JVM footprint and browser footprint per worker — the worst of both, and the line that blows up scale-out cost.
- Warm-up vs. burst scraping. JIT warm-up means a freshly scaled scrape worker is slow for its first requests. If you scale scrape workers reactively to crawl bursts, you eat cold JVM performance exactly when load spikes.
- GC pauses under parse load. Parsing many large documents concurrently generates garbage; GC pressure on a shared service JVM can perturb latency for your non-scraping endpoints too. Scraping becomes a noisy neighbor inside your own process.
None of this is "Java is bad at scraping." It's that putting render-heavy scraping in-process on the JVM makes scraping a tax on the rest of your service, and dedicated JVM scrape workers are an expensive way to isolate it.
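To put rough numbers on the baseline in your own service, a sketch like the following reads heap, non-heap (metaspace, code cache), and live thread counts from the standard management beans. The class name JvmFootprint and the Snapshot record are illustrative only. These figures understate the resident set size the OS reports, since thread stacks and other native allocations are not included.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

final class JvmFootprint {
    record Snapshot(long heapUsedMb, long heapCommittedMb, long nonHeapUsedMb, int liveThreads) {}

    static long toMb(long bytes) {
        return bytes / (1024 * 1024);
    }

    // Reads the current memory and thread figures from the platform MXBeans.
    static Snapshot take() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        MemoryUsage nonHeap = mem.getNonHeapMemoryUsage();
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        return new Snapshot(
                toMb(heap.getUsed()),
                toMb(heap.getCommitted()),
                toMb(nonHeap.getUsed()),
                threads);
    }

    public static void main(String[] args) {
        Snapshot s = take();
        System.out.println("heap used MB:      " + s.heapUsedMb());
        System.out.println("heap committed MB: " + s.heapCommittedMb());
        System.out.println("non-heap used MB:  " + s.nonHeapUsedMb());
        System.out.println("live threads:      " + s.liveThreads());
    }
}
```

Taking one snapshot at startup and another under scrape load gives a first approximation of what scraping adds to the process.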
The sidecar pattern: isolate scraping in a tiny process, not a fat JVM
The architecture that resolves the footprint tax is to move fetch/render/anti-bot/extraction out of the JVM entirely and into a small dedicated process your Java service calls over HTTP. The contrast is the whole point: instead of isolating scraping in more JVMs (each hundreds of MB plus a browser), you isolate it in a single static binary (~6MB, single-digit-MB idle RAM, no browser resident until needed). Your Spring service stays lean and its GC/latency profile stays uncontaminated:
// Assumes base, apiKey, target, and an HttpClient named client are defined elsewhere.
var req = HttpRequest.newBuilder()
    .uri(URI.create(base + "/v1/scrape"))
    .header("Authorization", "Bearer " + apiKey)
    .header("Content-Type", "application/json")
    // target is concatenated as-is; JSON-escape it if it can contain quotes or backslashes
    .POST(HttpRequest.BodyPublishers.ofString(
        "{\"url\":\"" + target + "\",\"formats\":[\"markdown\"]}"))
    .build();
var body = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
// clean markdown; no Chromium and no extra JVM in your scale-out
What the sidecar deletes from your Java footprint math: no Chromium next to the JVM, no JIT-warm-up penalty on scrape bursts (the engine is a compiled binary with tens-of-ms cold start), no scraping GC pressure on your service heap, and scale-out of the scrape capacity is scaling a 6MB process, not a JVM fleet. fastCRW is a fitting sidecar because it is exactly that single static binary, open-core (AGPL-3.0), with a Firecrawl-compatible HTTP API — run it as a container in your pod/host, and switch the same Java code to the Managed Cloud by changing one base-URL property.
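The request shown earlier splices the target URL directly into a JSON string, which breaks if the URL contains a quote or backslash. One way to harden it, and to make the base URL a single switchable property, is sketched below. The helper names, the property name scraper.base-url, and the localhost:3000 default are assumptions for illustration, not documented fastCRW values. Only the /v1/scrape path and the body shape come from the snippet above.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class ScrapeRequests {
    // Escapes a value for use as a JSON string literal (quotes, backslashes, control chars).
    static String jsonString(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"' -> out.append("\\\"");
                case '\\' -> out.append("\\\\");
                case '\n' -> out.append("\\n");
                case '\r' -> out.append("\\r");
                case '\t' -> out.append("\\t");
                default -> {
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
                }
            }
        }
        return out.append('"').toString();
    }

    // Same body shape as the earlier snippet, with the URL escaped.
    static String scrapeBody(String targetUrl) {
        return "{\"url\":" + jsonString(targetUrl) + ",\"formats\":[\"markdown\"]}";
    }

    static HttpRequest scrapeRequest(String baseUrl, String apiKey, String targetUrl) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/v1/scrape"))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(scrapeBody(targetUrl)))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical property name and placeholder default; point it at the sidecar or the cloud.
        String base = System.getProperty("scraper.base-url", "http://localhost:3000");
        HttpRequest req = scrapeRequest(base, "test-key", "https://example.com/?q=\"x\"");
        System.out.println(req.method() + " " + req.uri());
        System.out.println(scrapeBody("https://example.com/?q=\"x\""));
    }
}
```

In a Spring service the same value would more naturally come from a configuration property. If a JSON library is already on the classpath, use it instead of the hand-rolled escaper.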
The Java-specific subtlety: thread-per-request, virtual threads, and the blocking-call reality
Java's scraping concurrency story changed with virtual threads (Project Loom, final in Java 21), and it's worth being precise because it's easy to over-claim. Virtual threads make blocking I/O cheap again: you can have tens of thousands of virtual threads each doing a blocking JSoup fetch without exhausting platform threads, which genuinely improves in-process scraping ergonomics versus the old thread-pool-and-CompletableFuture gymnastics. But virtual threads solve the concurrency cost, not the footprint cost or the browser cost. You still carry the JVM heap, and you still need a real Chromium for true JS rendering. They also do nothing for CPU-bound work: virtual threads are not time-sliced, so a thread parsing a large document holds its carrier thread until it blocks or finishes, and on JDK 21 through 23 blocking inside a synchronized block pins the carrier as well. Under parse-heavy load that partially defeats the benefit. So Loom narrows but does not close the gap: it makes "many concurrent fetches in one JVM" pleasant, while the reasons to externalize render/anti-bot/extraction (footprint, browser, GC noise, JIT warm-up on burst scale-out) are untouched. Use virtual threads to fan out calls to the sidecar elegantly; don't mistake them for a reason to move rendering back into the JVM.
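A minimal sketch of that fan-out, assuming Java 21 or later: each URL gets its own virtual thread, and a semaphore caps how many calls hit the sidecar at once. The names SidecarFanOut, scrapeAll, and maxInFlight are hypothetical. The scrapeOne function stands in for whatever blocking HTTP call your service makes, and is stubbed here so the example runs without a network.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

final class SidecarFanOut {
    // Runs one blocking call per URL on its own virtual thread; results keep input order.
    static List<String> scrapeAll(List<String> urls, Function<String, String> scrapeOne, int maxInFlight) {
        Semaphore permits = new Semaphore(maxInFlight);
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> {
                    permits.acquire(); // blocking here parks the virtual thread cheaply
                    try {
                        return scrapeOne.apply(url);
                    } finally {
                        permits.release();
                    }
                }));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    results.add(f.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException("scrape task failed", e);
                }
            }
            return results;
        }
    }

    public static void main(String[] args) {
        // Stub standing in for a blocking HTTP call to the sidecar.
        Function<String, String> fakeScrape = url -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "md:" + url;
        };
        List<String> urls = List.of("https://example.com/a", "https://example.com/b", "https://example.com/c");
        System.out.println(scrapeAll(urls, fakeScrape, 2));
    }
}
```

The semaphore matters because virtual threads remove the thread-count limit but not the sidecar's capacity. Without a cap, a large URL list turns into that many simultaneous requests.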
Class loading, warm-up, and why burst-scaled scrape workers underperform
A concrete Java-specific cost that the sidecar pattern removes deserves elaboration. When an autoscaler spins up a new JVM scrape worker in response to a crawl burst, that JVM must class-load, JIT-compile hot paths, and warm its caches before it reaches steady-state throughput — frequently tens of seconds of degraded performance, exactly during the spike that triggered the scale-up. Worse, scrape workloads are often bursty by nature (a scheduled crawl, a user-triggered bulk import), so you repeatedly pay warm-up at the worst moment. Workarounds exist (CDS, AOT/GraalVM native-image, keeping a warm pool) but each adds build and operational complexity to your service. A compiled single-binary sidecar has effectively no warm-up: tens-of-ms cold start, steady-state performance from request one. Offloading scraping to it means the burst-scale path you care about isn't the JVM-warm-up path at all — you scale a thing that's fast immediately, and your JVM service scales only on its own (steadier) workload.
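One rough way to see warm-up on your own hardware is to time the same work in consecutive batches inside a fresh JVM and compare the early batches with the later ones. The sketch below does that with a stand-in workload (counting '<' characters) rather than a real parser; the class and method names are made up. It is not a rigorous benchmark, and a harness such as JMH is the right tool for real measurements.

```java
final class WarmupProbe {
    record Result(long[] batchNanos, long checksum) {}

    // Stand-in for parse work: counts '<' characters. A real probe would call your parser.
    static int countOpenAngles(String html) {
        int n = 0;
        for (int i = 0; i < html.length(); i++) {
            if (html.charAt(i) == '<') n++;
        }
        return n;
    }

    // Times each batch separately; the checksum keeps the work from being optimized away.
    static Result timeBatches(String html, int batches, int callsPerBatch) {
        long[] nanos = new long[batches];
        long checksum = 0;
        for (int b = 0; b < batches; b++) {
            long start = System.nanoTime();
            for (int i = 0; i < callsPerBatch; i++) {
                checksum += countOpenAngles(html);
            }
            nanos[b] = System.nanoTime() - start;
        }
        return new Result(nanos, checksum);
    }

    public static void main(String[] args) {
        String html = "<html><body><article><h1>Title</h1></article></body></html>".repeat(200);
        Result r = timeBatches(html, 10, 2_000);
        for (int b = 0; b < r.batchNanos().length; b++) {
            System.out.printf("batch %d: %.2f ms%n", b, r.batchNanos()[b] / 1e6);
        }
        System.out.println("checksum: " + r.checksum());
    }
}
```

If the first batches are noticeably slower than the last, that gap is what a freshly scaled worker pays during a burst. Exact numbers vary by JVM, flags, and machine.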
When in-process JSoup is still the right call
- Static, friendly HTML, modest volume → JSoup in-process is clean, fast, and adds zero infrastructure. Don't over-engineer.
- Tight coupling between parsed data and your domain model in a batch job → in-process extraction is reasonable.
- No JS rendering and no anti-bot in your target set → the footprint tax is small; JSoup alone is fine.
Adopt the sidecar when you need JS rendering (avoid a browser next to the JVM), when targets are hostile (anti-bot doesn't belong in your service code), or when scrape-worker JVM/browser footprint is becoming a real line on your infrastructure bill.
Bottom line
Web scraping in Java is well served by JSoup for the static-parse majority — use it. The Java-specific cost appears when you add rendering and scale out: the JVM footprint tax compounds, a browser on top of the JVM is the worst of both, and scraping becomes a noisy neighbor in your own process. The clean fix is a sidecar — isolate scraping in a single small static binary rather than a fleet of fat JVMs — called over HTTP. fastCRW's open-core single-binary engine is built to be exactly that sidecar, self-hosted or managed by one config switch.
Try it from Java
docker compose up # 6MB sidecar, not a second JVM, AGPL-3.0
Managed Cloud, same API: 500 free credits (one-time), no card required. fastcrw.com · GitHub
Related: Rust vs Python scrapers · Web scraping in Go · Scraping latency explained
