Web Scraping in Java (2026): JSoup, the JVM Footprint Tax, and the Sidecar Pattern

Web scraping in Java for backend teams — JSoup and HtmlUnit, Selenium's heavyweight reality, the JVM memory tax for scrape workers, and why a single-binary scraping sidecar beats fattening your service.

fastcrw
June 1, 2026 · 14 min read

By the fastCRW team · Java tutorial & analysis · Last reviewed 2026-01-01

Disclosure: fastCRW is a single-binary scraping engine with an HTTP API. This guide is candid about the JVM-specific cost of doing scraping in-process and where a sidecar is the better architecture.

Java scraping is almost always inside a long-lived service

Java scraping lives in Spring Boot services, batch jobs, and enterprise backends — long-running JVM processes. That context defines the real problem: not "can Java parse HTML" (JSoup has done that superbly for years) but "what does it cost to host the render-heavy, anti-bot-heavy part of scraping inside a JVM service whose memory and warm-up characteristics you're already managing carefully?" The answer is a footprint tax most teams under-account for, and that's this guide's subject.

The Java stack

  • JSoup — the standard: fetch + parse + CSS-selector extraction in one clean, fast, dependency-light library. For static HTML this is genuinely excellent and you should just use it.
  • Java 11+ HttpClient — the modern built-in HTTP client (async, HTTP/2). Pair with JSoup's parser when you want explicit fetch control.
  • HtmlUnit — a headless "GUI-less browser" in pure Java; handles some JS without a real browser, but its JS engine diverges from modern Chromium and breaks on heavy SPAs.
  • Selenium / Playwright-for-Java — drive a real browser for true JS rendering. This is the heavyweight path: a browser process plus a driver alongside your JVM.

Minimal JSoup example

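import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
// Fetch and parse in one call; get() throws IOException on network or HTTP errors.
Document doc = Jsoup.connect("https://example.com/")
        .userAgent("Mozilla/5.0")
        .timeout(8000) // milliseconds
        .get();
// selectFirst returns null when nothing matches, so guard before calling text().
Element h1 = doc.selectFirst("article h1");
System.out.println(h1 != null ? h1.text() : "(no article h1 found)");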

The Java-specific cost: the JVM footprint tax stacks the wrong way

Every language has a footprint story; Java's is distinctive because the costs compound on the JVM:

  • Baseline JVM memory. A scrape worker is a JVM: heap, metaspace, code cache, and thread stacks often add up to hundreds of MB once a framework such as Spring Boot is loaded, before the worker scrapes anything. That is fine when amortized across a busy service; it is expensive when you scale out dedicated scrape workers, because each one carries a full JVM.
  • Browser on top of JVM. Add Selenium/Playwright and you now run a Chromium process next to the JVM. You're paying JVM footprint and browser footprint per worker — the worst of both, and the line that blows up scale-out cost.
  • Warm-up vs. burst scraping. JIT warm-up means a freshly scaled scrape worker is slow for its first requests. If you scale scrape workers reactively to crawl bursts, you eat cold JVM performance exactly when load spikes.
  • GC pauses under parse load. Parsing many large documents concurrently generates garbage; GC pressure on a shared service JVM can perturb latency for your non-scraping endpoints too. Scraping becomes a noisy neighbor inside your own process.

None of this is "Java is bad at scraping." It's that putting render-heavy scraping in-process on the JVM makes scraping a tax on the rest of your service, and dedicated JVM scrape workers are an expensive way to isolate it.
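
To put a number on the baseline for your own service, the JVM's management beans report what it has committed before any scraping happens. The following is a minimal sketch using only the standard library; the class name JvmFootprint is made up for illustration. The figures cover heap and non-heap pools only, so actual resident memory (thread stacks, native buffers) will be higher.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper: prints what this JVM has committed before doing any work.
class JvmFootprint {
    /** Bytes committed for heap plus non-heap pools (metaspace, code cache, ...). */
    static long committedBytes() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        MemoryUsage nonHeap = mem.getNonHeapMemoryUsage();
        return heap.getCommitted() + nonHeap.getCommitted();
    }

    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        long mb = 1024L * 1024L;
        System.out.printf("heap committed:     %d MB%n",
                mem.getHeapMemoryUsage().getCommitted() / mb);
        System.out.printf("non-heap committed: %d MB%n",
                mem.getNonHeapMemoryUsage().getCommitted() / mb);
        // Each platform thread also reserves its own stack outside these pools.
        System.out.printf("live threads:       %d%n",
                ManagementFactory.getThreadMXBean().getThreadCount());
    }
}
```

Run inside a real service (for example from a startup hook), the same calls give a per-worker baseline you can multiply by your scale-out count. For a fuller picture, starting the JVM with -XX:NativeMemoryTracking=summary and running jcmd <pid> VM.native_memory summary also breaks down thread stacks and other native allocations.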

The sidecar pattern: isolate scraping in a tiny process, not a fat JVM

The architecture that resolves the footprint tax is to move fetch/render/anti-bot/extraction out of the JVM entirely and into a small dedicated process your Java service calls over HTTP. The contrast is the whole point: instead of isolating scraping in more JVMs (each hundreds of MB plus a browser), you isolate it in a single static binary (~6MB, single-digit-MB idle RAM, no browser resident until needed). Your Spring service stays lean and its GC/latency profile stays uncontaminated:

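// Assumes: client is a shared java.net.http.HttpClient; base, apiKey, target are Strings.
// Escape the URL before embedding it in JSON; a JSON library is the safer choice in production.
var safeTarget = target.replace("\\", "\\\\").replace("\"", "\\\"");
var req = HttpRequest.newBuilder()
    .uri(URI.create(base + "/v1/scrape"))
    .header("Authorization", "Bearer " + apiKey)
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(
        "{\"url\":\"" + safeTarget + "\",\"formats\":[\"markdown\"]}"))
    .build();
// send() blocks and throws IOException / InterruptedException.
var body = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
// body carries the page as clean markdown; no Chromium and no extra JVM in your scale-out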

What the sidecar deletes from your Java footprint math: no Chromium next to the JVM; no JIT warm-up penalty on scrape bursts (the engine is a compiled binary with a cold start in the tens of milliseconds); no scraping GC pressure on your service heap; and scaling out scrape capacity means scaling a ~6MB process, not a JVM fleet. fastCRW is a fitting sidecar because it is exactly that single static binary, open-core (AGPL-3.0), with a Firecrawl-compatible HTTP API. Run it as a container in your pod or on your host, and switch the same Java code to the Managed Cloud by changing one base-URL property.
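
The "one base-URL property" switch can be sketched as a small request factory. This is an illustration under assumptions, not fastCRW's client: the class name ScrapeRequests, the system property scraper.baseUrl, and the default http://localhost:3000 are all made up here; only the /v1/scrape path, the bearer header, and the body shape come from the example above.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical factory: builds the same request for a local sidecar or a hosted endpoint.
class ScrapeRequests {
    /** Minimal JSON string escaping for quotes, backslashes, and control characters. */
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (char c : s.toCharArray()) {
            if (c == '"') out.append("\\\"");
            else if (c == '\\') out.append("\\\\");
            else if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
            else out.append(c);
        }
        return out.toString();
    }

    /** Request body in the shape used by the article's example. */
    static String body(String targetUrl) {
        return "{\"url\":\"" + jsonEscape(targetUrl) + "\",\"formats\":[\"markdown\"]}";
    }

    static HttpRequest build(String baseUrl, String apiKey, String targetUrl) {
        return HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + "/v1/scrape"))
            .timeout(Duration.ofSeconds(30)) // assumed budget; tune per target
            .header("Authorization", "Bearer " + apiKey)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body(targetUrl)))
            .build();
    }

    public static void main(String[] args) {
        // Assumed property name and default port; point it at the managed endpoint to switch.
        String baseUrl = System.getProperty("scraper.baseUrl", "http://localhost:3000");
        HttpRequest req = build(baseUrl, "demo-key", "https://example.com/");
        System.out.println(req.method() + " " + req.uri());
        System.out.println(body("https://example.com/"));
    }
}
```

Keeping the base URL outside the code means the self-hosted and managed deployments differ only in configuration, for example -Dscraper.baseUrl=... on the command line or the equivalent Spring property.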

The Java-specific subtlety: thread-per-request, virtual threads, and the blocking-call reality

Java's scraping concurrency story changed with virtual threads (Project Loom, final in Java 21), and it's worth being precise because it's easy to over-claim. Virtual threads make blocking I/O cheap again: you can have tens of thousands of virtual threads each doing a blocking JSoup fetch without exhausting platform threads, which genuinely improves in-process scraping ergonomics versus the old thread-pool-and-CompletableFuture gymnastics. But virtual threads solve the concurrency cost, not the footprint cost or the browser cost. You still carry the JVM heap, and you still need a real Chromium for true JS rendering. They also do nothing for CPU-bound work: the scheduler does not time-slice virtual threads, so a large parse occupies its carrier thread until it finishes, and on JDKs before 24 a virtual thread that blocks inside a synchronized block pins its carrier. Both effects partially defeat the benefit under parse-heavy load. So Loom narrows but does not close the gap: it makes "many concurrent fetches in one JVM" pleasant, while the reasons to externalize render/anti-bot/extraction (footprint, browser, GC noise, JIT warm-up on burst scale-out) are untouched. Use virtual threads to fan out calls to the sidecar elegantly; don't mistake them for a reason to move rendering back into the JVM.
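
Fanning out blocking calls on virtual threads might look like the sketch below. It assumes Java 21 or later. The names VirtualFanOut, scrapeAll, and fakeScrape are hypothetical, and fakeScrape is a stand-in that sleeps instead of making the real HTTP call to the sidecar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class VirtualFanOut {
    /** Runs one blocking call per URL, each on its own virtual thread; results keep input order. */
    static List<String> scrapeAll(List<String> urls, Function<String, String> scrapeOne) {
        // close() at the end of the try block waits for all submitted tasks.
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> scrapeOne.apply(url);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scraping", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("scrape task failed", e.getCause());
        }
    }

    /** Stand-in for a blocking sidecar call; a real version would use HttpClient.send. */
    static String fakeScrape(String url) {
        try {
            Thread.sleep(50); // blocking here parks the virtual thread, not a platform thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "markdown for " + url;
    }

    public static void main(String[] args) {
        List<String> urls = List.of("https://example.com/a", "https://example.com/b");
        scrapeAll(urls, VirtualFanOut::fakeScrape).forEach(System.out::println);
    }
}
```

The design choice is that the JVM only waits on I/O here. Rendering and parsing happen in the sidecar, so the CPU-bound caveats above do not apply to these threads. In practice you would still cap in-flight requests (a Semaphore is one option) so a large URL list does not overwhelm the sidecar.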

Class loading, warm-up, and why burst-scaled scrape workers underperform

A concrete Java-specific cost that the sidecar pattern removes deserves elaboration. When an autoscaler spins up a new JVM scrape worker in response to a crawl burst, that JVM must load classes, JIT-compile hot paths, and warm its caches before it reaches steady-state throughput. That is frequently tens of seconds of degraded performance, exactly during the spike that triggered the scale-up. Worse, scrape workloads are often bursty by nature (a scheduled crawl, a user-triggered bulk import), so you repeatedly pay warm-up at the worst moment. Workarounds exist (class data sharing (CDS), ahead-of-time compilation with GraalVM native-image, keeping a warm pool), but each adds build and operational complexity to your service. A compiled single-binary sidecar has effectively no warm-up: a cold start in the tens of milliseconds and steady-state performance from request one. Offloading scraping to it means the burst-scale path you care about isn't the JVM warm-up path at all. You scale a thing that's fast immediately, and your JVM service scales only on its own (steadier) workload.
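
The warm-up effect is easy to observe on a fresh JVM with a rough probe like the one below. This is a sketch, not a rigorous benchmark (JMH is the right tool for that). The class WarmUpProbe and its tag-counting workload are made up as a stand-in for parse-like CPU work.

```java
// Hypothetical probe: times identical batches of work on a freshly started JVM.
class WarmUpProbe {
    /** Stand-in for parse-like CPU work: counts '<' characters in a document. */
    static int countTags(String html) {
        int n = 0;
        for (int i = 0; i < html.length(); i++) {
            if (html.charAt(i) == '<') n++;
        }
        return n;
    }

    /** Times `rounds` batches of identical work and returns nanoseconds per batch. */
    static long[] timeRounds(int rounds, int callsPerRound) {
        String doc = "<div><p>hello</p></div>".repeat(2_000);
        long[] nanos = new long[rounds];
        long sink = 0;
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            for (int i = 0; i < callsPerRound; i++) {
                sink += countTags(doc);
            }
            nanos[r] = System.nanoTime() - start;
        }
        // Use the result so the JIT cannot discard the loop as dead code.
        if (sink == 42) System.out.println("unreachable");
        return nanos;
    }

    public static void main(String[] args) {
        long[] nanos = timeRounds(10, 200);
        for (int r = 0; r < nanos.length; r++) {
            System.out.printf("round %d: %.2f ms%n", r, nanos[r] / 1e6);
        }
    }
}
```

On a typical HotSpot JVM the first rounds tend to be noticeably slower than later ones, because the loop starts in the interpreter and is compiled as it gets hot. Exact numbers depend on hardware, JVM version, and flags. A real service adds class loading and framework start-up on top, which is where the tens of seconds come from.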

When in-process JSoup is still the right call

  • Static, friendly HTML, modest volume → JSoup in-process is clean, fast, and adds zero infrastructure. Don't over-engineer.
  • Tight coupling between parsed data and your domain model in a batch job → in-process extraction is reasonable.
  • No JS rendering and no anti-bot in your target set → the footprint tax is small; JSoup alone is fine.

Adopt the sidecar when you need JS rendering (avoid a browser next to the JVM), when targets are hostile (anti-bot doesn't belong in your service code), or when scrape-worker JVM/browser footprint is becoming a real line on your infrastructure bill.

Bottom line

Web scraping in Java is well served by JSoup for the static-parse majority — use it. The Java-specific cost appears when you add rendering and scale out: the JVM footprint tax compounds, a browser on top of the JVM is the worst of both, and scraping becomes a noisy neighbor in your own process. The clean fix is a sidecar — isolate scraping in a single small static binary rather than a fleet of fat JVMs — called over HTTP. fastCRW's open-core single-binary engine is built to be exactly that sidecar, self-hosted or managed by one config switch.

Try it from Java

docker compose up   # 6MB sidecar, not a second JVM, AGPL-3.0

Managed Cloud, same API: 500 free credits (one-time), no card required. fastcrw.com · GitHub

Related: Rust vs Python scrapers · Web scraping in Go · Scraping latency explained

Frequently asked questions

What is the best library for web scraping in Java?
JSoup is the standard for static HTML: fetch, parse, and CSS-selector extraction in one fast, lightweight library. HtmlUnit handles limited JS; Selenium or Playwright-for-Java drive a real browser for full JS rendering at a heavy footprint cost.

Why is in-process scraping expensive on the JVM?
Each scrape worker carries a full JVM (hundreds of MB), a browser for JS rendering runs on top of that, JIT warm-up penalizes burst-scaled workers, and parse-heavy GC pressure can perturb your service's other endpoints. Scraping becomes a tax on the rest of the process.

What is the sidecar pattern for Java scraping?
Move fetch/render/anti-bot/extraction out of the JVM into a small dedicated process your service calls over HTTP, isolating scraping in a single ~6MB static binary instead of a fleet of fat JVMs. fastCRW's open-core single-binary engine is designed for this, self-hosted or via its Cloud by one config switch.
