What is the best library for web scraping in Java?

JSoup is the standard for static HTML — fetch, parse, and CSS-selector extraction in one fast, lightweight library. HtmlUnit handles limited JS; Selenium or Playwright-for-Java drive a real browser for full JS rendering at a heavy footprint cost.

Why is in-process scraping expensive on the JVM?

Each scrape worker carries a full JVM (hundreds of MB), a browser for JS rendering runs on top of that, JIT warm-up penalizes burst-scaled workers, and parse-heavy GC pressure can perturb your service's other endpoints. Scraping becomes a tax on the rest of the process.

What is the sidecar pattern for Java scraping?

Move fetch/render/anti-bot/extraction out of the JVM into a small dedicated process your service calls over HTTP, so scraping is isolated in a single ~6MB static binary instead of a fleet of fat JVMs. fastCRW's open-core single-binary engine is designed for this: run it self-hosted, or point the same code at its Managed Cloud by changing one config value.

Web Scraping in Java (2026): JSoup, the JVM Footprint Tax, and the Sidecar Pattern

By the fastCRW team · Java tutorial & analysis · Last reviewed 2026-01-01

Disclosure: fastCRW is a single-binary scraping engine with an HTTP API. This guide is candid about the JVM-specific cost of doing scraping in-process and where a sidecar is the better architecture.

Java scraping is almost always inside a long-lived service

Java scraping lives in Spring Boot services, batch jobs, and enterprise backends — long-running JVM processes. That context defines the real problem: not "can Java parse HTML" (JSoup has done that superbly for years) but "what does it cost to host the render-heavy, anti-bot-heavy part of scraping inside a JVM service whose memory and warm-up characteristics you're already managing carefully?" The answer is a footprint tax most teams under-account for, and that's this guide's subject.

The Java stack

JSoup — the standard: fetch + parse + CSS-selector extraction in one clean, fast, dependency-light library. For static HTML this is genuinely excellent and you should just use it.
Java 11+ HttpClient — the modern built-in HTTP client (async, HTTP/2). Pair with JSoup's parser when you want explicit fetch control.
HtmlUnit — a headless "GUI-less browser" in pure Java; handles some JS without a real browser, but its JS engine diverges from modern Chromium and breaks on heavy SPAs.
Selenium / Playwright-for-Java — drive a real browser for true JS rendering. This is the heavyweight path: a browser process plus a driver alongside your JVM.
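
As a rough illustration of the "explicit fetch control" option in the list above, the following sketch configures the built-in client and a page request using only the JDK. The class name ExplicitFetch and the specific timeout values are assumptions chosen for the example, not recommendations from this guide.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public final class ExplicitFetch {
    private ExplicitFetch() {}

    /** Builds one reusable client; HTTP/2 is preferred and falls back to HTTP/1.1 if the server lacks it. */
    public static HttpClient newClient() {
        return HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)
                .connectTimeout(Duration.ofSeconds(5))       // illustrative value
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    /** Builds a GET request with a per-request timeout and an explicit User-Agent. */
    public static HttpRequest pageRequest(String url) {
        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                .timeout(Duration.ofSeconds(8))              // illustrative value
                .header("User-Agent", "Mozilla/5.0")
                .GET()
                .build();
    }
}
```

To use it, send the request with client.send(request, HttpResponse.BodyHandlers.ofString()) (or sendAsync for a CompletableFuture) and hand the response body to JSoup's Jsoup.parse(html, baseUri) for selector-based extraction.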

Minimal JSoup example

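import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Fetch and parse in one call; get() throws IOException on network or HTTP errors
Document doc = Jsoup.connect("https://example.com/")
        .userAgent("Mozilla/5.0")
        .timeout(8000) // milliseconds
        .get();
// selectFirst returns null when nothing matches, so guard before calling text()
Element heading = doc.selectFirst("article h1");
System.out.println(heading != null ? heading.text() : "(no article h1 found)");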

The Java-specific cost: the JVM footprint tax stacks the wrong way

Every language has a footprint story; Java's is distinctive because the costs compound on the JVM:

Baseline JVM memory. A scrape worker is a JVM — heap, metaspace, thread stacks — hundreds of MB before it does anything. Fine when amortized across a busy service; expensive when you scale out dedicated scrape workers, because each one carries a full JVM.
Browser on top of JVM. Add Selenium/Playwright and you now run a Chromium process next to the JVM. You're paying JVM footprint and browser footprint per worker — the worst of both, and the line that blows up scale-out cost.
Warm-up vs. burst scraping. JIT warm-up means a freshly scaled scrape worker is slow for its first requests. If you scale scrape workers reactively to crawl bursts, you eat cold JVM performance exactly when load spikes.
GC pauses under parse load. Parsing many large documents concurrently generates garbage; GC pressure on a shared service JVM can perturb latency for your non-scraping endpoints too. Scraping becomes a noisy neighbor inside your own process.

None of this is "Java is bad at scraping." It's that putting render-heavy scraping in-process on the JVM makes scraping a tax on the rest of your service, and dedicated JVM scrape workers are an expensive way to isolate it.
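
One way to check the baseline-memory point against your own service is to ask the JVM what it currently holds. The sketch below uses the standard java.lang.management beans; the class name JvmFootprint is hypothetical. Treat the numbers as a lower bound, since the process's resident memory as seen by the OS also includes native allocations that these beans do not report.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public final class JvmFootprint {
    private JvmFootprint() {}

    /** One consistent snapshot of heap usage (used, committed, max) in bytes. */
    public static MemoryUsage heapSnapshot() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    /** Non-heap usage covers metaspace, the code cache, and similar JVM-managed areas. */
    public static MemoryUsage nonHeapSnapshot() {
        return ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage();
    }

    /** Number of live platform threads, each of which reserves its own stack. */
    public static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    /** Human-readable one-line summary in MiB. */
    public static String summary() {
        MemoryUsage heap = heapSnapshot();
        MemoryUsage nonHeap = nonHeapSnapshot();
        long mib = 1024L * 1024L;
        return "heap used=" + heap.getUsed() / mib + "MiB"
                + " heap committed=" + heap.getCommitted() / mib + "MiB"
                + " non-heap used=" + nonHeap.getUsed() / mib + "MiB"
                + " threads=" + liveThreads();
    }
}
```

Logging JvmFootprint.summary() at startup and again under scrape load gives a rough before-and-after picture for a single worker.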

The sidecar pattern: isolate scraping in a tiny process, not a fat JVM

The architecture that resolves the footprint tax is to move fetch/render/anti-bot/extraction out of the JVM entirely and into a small dedicated process your Java service calls over HTTP. The contrast is the whole point: instead of isolating scraping in more JVMs (each hundreds of MB plus a browser), you isolate it in a single static binary (~6MB, single-digit-MB idle RAM, no browser resident until needed). Your Spring service stays lean and its GC/latency profile stays uncontaminated:

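// Assumes client is a shared java.net.http.HttpClient; base, apiKey, and target are Strings
// Escape backslashes and quotes so the target URL cannot break out of the JSON string
var safeUrl = target.replace("\\", "\\\\").replace("\"", "\\\"");
var req = HttpRequest.newBuilder()
    .uri(URI.create(base + "/v1/scrape"))
    .header("Authorization", "Bearer " + apiKey)
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(
        "{\"url\":\"" + safeUrl + "\",\"formats\":[\"markdown\"]}"))
    .build();
var body = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
// body carries the markdown output; no Chromium and no extra JVM in your scale-out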

What the sidecar deletes from your Java footprint math: no Chromium next to the JVM, no JIT-warm-up penalty on scrape bursts (the engine is a compiled binary with tens-of-ms cold start), no scraping GC pressure on your service heap, and scale-out of the scrape capacity is scaling a 6MB process, not a JVM fleet. fastCRW is a fitting sidecar because it is exactly that single static binary, open-core (AGPL-3.0), with a Firecrawl-compatible HTTP API — run it as a container in your pod/host, and switch the same Java code to the Managed Cloud by changing one base-URL property.
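
The "one base-URL property" switch can be sketched as a small request factory that takes the base URL as a parameter, so self-hosted and managed deployments differ only in configuration. The class name ScrapeRequests, the 30-second timeout, and the hand-rolled JSON escaping are assumptions for illustration (a JSON library would do the escaping in a real service); the /v1/scrape path and the request body shape are taken from the example above. The switch syntax assumes Java 14 or later.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public final class ScrapeRequests {
    private ScrapeRequests() {}

    /** Wraps a string in double quotes and escapes the characters JSON requires. */
    public static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
                }
            }
        }
        return sb.append('"').toString();
    }

    /** Request body in the shape used by the example above. */
    public static String scrapeBody(String targetUrl) {
        return "{\"url\":" + jsonString(targetUrl) + ",\"formats\":[\"markdown\"]}";
    }

    /** baseUrl comes from configuration: the sidecar's address or the managed endpoint. */
    public static HttpRequest scrapeRequest(String baseUrl, String apiKey, String targetUrl) {
        // Tolerate a trailing slash in the configured base URL
        String base = baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
        return HttpRequest.newBuilder()
                .uri(URI.create(base + "/v1/scrape"))
                .timeout(Duration.ofSeconds(30)) // illustrative; rendering can be slow
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(scrapeBody(targetUrl)))
                .build();
    }
}
```

In a Spring service the baseUrl argument would typically be injected from a property, which keeps the calling code identical in both deployment modes.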

The Java-specific subtlety: thread-per-request, virtual threads, and the blocking-call reality

Java's scraping concurrency story changed with virtual threads (Project Loom, final in Java 21), and it's worth being precise because it's easy to over-claim. Virtual threads make blocking I/O cheap again: you can have tens of thousands of virtual threads each doing a blocking JSoup fetch without exhausting platform threads, which genuinely improves in-process scraping ergonomics versus the old thread-pool-and-CompletableFuture gymnastics. But virtual threads solve the concurrency cost, not the footprint cost or the browser cost. You still carry the JVM heap, you still need a real Chromium for true JS rendering, and CPU-heavy parsing gets no help: a virtual thread that is parsing holds its carrier thread until it blocks or finishes, so parse-heavy load is still bounded by your cores. (Separately, on JDKs before 24, blocking inside a synchronized block pins the carrier thread, which can bite libraries that do I/O under locks.) So Loom narrows but does not close the gap: it makes "many concurrent fetches in one JVM" pleasant, while the reasons to externalize render/anti-bot/extraction (footprint, browser, GC noise, JIT warm-up on burst scale-out) are untouched. Use virtual threads to fan out calls to the sidecar elegantly; don't mistake them for a reason to move rendering back into the JVM.
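
A minimal sketch of that fan-out, assuming Java 21 or later. The class name SidecarFanOut is hypothetical, and scrapeOne stands in for whatever blocking call your service makes to the sidecar (for example, an HttpClient.send wrapped in a function).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public final class SidecarFanOut {
    private SidecarFanOut() {}

    /**
     * Runs scrapeOne for every URL, one virtual thread per call,
     * and returns the results in the same order as the input list.
     */
    public static List<String> scrapeAll(List<String> urls, Function<String, String> scrapeOne)
            throws Exception {
        // close() at the end of the try block waits for submitted tasks to finish
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> scrapeOne.apply(url);
                pending.add(executor.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                // get() blocks only this thread; a failed task surfaces as ExecutionException
                results.add(future.get());
            }
            return results;
        }
    }
}
```

This sketch puts no cap on concurrency. If the sidecar or the target sites need protecting, a java.util.concurrent.Semaphore acquired inside each task is one simple way to bound in-flight calls.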

Class loading, warm-up, and why burst-scaled scrape workers underperform

A concrete Java-specific cost that the sidecar pattern removes deserves elaboration. When an autoscaler spins up a new JVM scrape worker in response to a crawl burst, that JVM must class-load, JIT-compile hot paths, and warm its caches before it reaches steady-state throughput — frequently tens of seconds of degraded performance, exactly during the spike that triggered the scale-up. Worse, scrape workloads are often bursty by nature (a scheduled crawl, a user-triggered bulk import), so you repeatedly pay warm-up at the worst moment. Workarounds exist (CDS, AOT/GraalVM native-image, keeping a warm pool) but each adds build and operational complexity to your service. A compiled single-binary sidecar has effectively no warm-up: tens-of-ms cold start, steady-state performance from request one. Offloading scraping to it means the burst-scale path you care about isn't the JVM-warm-up path at all — you scale a thing that's fast immediately, and your JVM service scales only on its own (steadier) workload.
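
To see the warm-up curve on your own workload rather than take it on faith, a crude probe like the one below can be run in a fresh JVM. It is a sketch, not a benchmark harness (a tool such as JMH is the rigorous option), and the class name WarmUpProbe is hypothetical. It times a task in consecutive batches so the early, cold batches can be compared with later ones.

```java
import java.util.function.Supplier;

public final class WarmUpProbe {
    private WarmUpProbe() {}

    // Written on every call so the JIT cannot discard the task's result
    private static volatile int sink;

    /** Runs the task in batches and returns the average nanoseconds per call for each batch. */
    public static long[] batchAveragesNanos(Supplier<?> task, int batches, int callsPerBatch) {
        long[] averages = new long[batches];
        for (int b = 0; b < batches; b++) {
            long start = System.nanoTime();
            for (int i = 0; i < callsPerBatch; i++) {
                Object result = task.get();
                sink = (result == null) ? 0 : result.hashCode();
            }
            averages[b] = (System.nanoTime() - start) / callsPerBatch;
        }
        return averages;
    }
}
```

Passing a supplier that parses a representative document and printing the returned array shows how the per-call cost changes from the first batch to the last. How large the difference is depends on your code, JVM flags, and hardware.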

When in-process JSoup is still the right call

Static, friendly HTML, modest volume → JSoup in-process is clean, fast, and adds zero infrastructure. Don't over-engineer.
Tight coupling between parsed data and your domain model in a batch job → in-process extraction is reasonable.
No JS rendering and no anti-bot in your target set → the footprint tax is small; JSoup alone is fine.

Adopt the sidecar when you need JS rendering (avoid a browser next to the JVM), when targets are hostile (anti-bot doesn't belong in your service code), or when scrape-worker JVM/browser footprint is becoming a real line on your infrastructure bill.

Bottom line

Web scraping in Java is well served by JSoup for the static-parse majority, so use it. The Java-specific cost appears when you add rendering and scale out: the JVM footprint tax compounds, a browser on top of the JVM is the worst of both, and scraping becomes a noisy neighbor in your own process. The clean fix is a sidecar called over HTTP: isolate scraping in a single small static binary rather than a fleet of fat JVMs. fastCRW's open-core single-binary engine is built to be exactly that sidecar, self-hosted or on the Managed Cloud with one config change.

Try it from Java

docker compose up   # 6MB sidecar, not a second JVM, AGPL-3.0

Managed Cloud, same API: 500 free credits as a one-time grant, no card required. fastcrw.com · GitHub
