Web Scraping in Elixir: Concurrency on BEAM

Web scraping in Elixir with Req, Floki, and Task.async_stream. BEAM concurrency for fan-out scraping and where a managed scrape API fits the pipeline.

By Recep · June 24, 2026 · 9 min read · Last updated: June 2, 2026

By the fastCRW team · Benchmarks from Firecrawl's public dataset, verified 2026-05-18 · Verify independently before relying on any figure.

Web scraping in Elixir: the modern toolkit

Web scraping in Elixir is one of the better-kept secrets in the BEAM ecosystem. The same runtime that powers Phoenix and lets you spawn millions of lightweight processes is an excellent substrate for fan-out scraping: cheap concurrency, real backpressure primitives, and supervision trees that keep a long crawl alive when individual requests die. The catch, and the reason this post exists, is that the pure-Elixir stack stops short of JavaScript rendering and anti-bot handling, so the honest pipeline pairs BEAM concurrency with a rendering API for the pages Floki cannot see.

Fetching with Req (and why it replaced HTTPoison)

Req is the current default HTTP client for Elixir. It ships sensible defaults — connection pooling via Finch, automatic JSON decoding, retries, and redirect following — without the ceremony of the older HTTPoison/Tesla setups. A fetch is a single call:

  • Req.get!("https://example.com") returns a %Req.Response{} with status and body.
  • Pass headers: for a custom user-agent, retry: :transient for backoff on 5xx, and receive_timeout: to cap a slow response.
  • Because Req is built on Finch, a connection pool is reused across concurrent requests automatically — which matters once you fan out.
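
The options above can be combined into one reusable request. This is a minimal sketch, not a canonical setup: the `Fetcher` module name, the user-agent string, the retry count, and the 15-second timeout are all illustrative assumptions to tune for your target.

```elixir
Mix.install([{:req, "~> 0.5"}])

defmodule Fetcher do
  # Build a reusable request struct. Every value here is an assumption to tune.
  def new(opts \\ []) do
    [
      headers: [{"user-agent", "my-scraper/0.1 (+https://example.com/bot)"}],
      # Retry transient failures (for example 5xx responses) with backoff.
      retry: :transient,
      max_retries: 3,
      # Milliseconds to wait on a slow response before giving up.
      receive_timeout: 15_000
    ]
    |> Keyword.merge(opts)
    |> Req.new()
  end

  # Return {:ok, body} only for a 200; surface everything else as an error tuple.
  def fetch(req, url) do
    case Req.get(req, url: url) do
      {:ok, %Req.Response{status: 200, body: body}} -> {:ok, body}
      {:ok, %Req.Response{status: status}} -> {:error, {:http_status, status}}
      {:error, exception} -> {:error, exception}
    end
  end
end
```

Usage would look like `Fetcher.new() |> Fetcher.fetch("https://example.com")`. Using the non-bang `Req.get/2` keeps network failures as values, which is easier to handle once the call runs inside many concurrent tasks.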

Parsing HTML with Floki

Floki is the standard HTML parser. You feed it a body, get a parsed document, and query it with CSS selectors that return lists you pattern-match or pipe through Enum:

  • {:ok, doc} = Floki.parse_document(body)
  • Floki.find(doc, "h2.title") returns matching nodes; Floki.text/1 flattens to a string.
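  • For a large parse-speed win on big pages, swap in an optional native parser: fast_html (a binding to the lexbor C library) or html5ever (a Rust NIF), selected through Floki's html_parser configuration.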
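
Put together, the parse-and-query flow looks roughly like this. The HTML snippet and the `h2.title` selector are made up for illustration; real pages need their own selectors.

```elixir
Mix.install([{:floki, "~> 0.36"}])

# Inline sample document so the sketch runs without a network call.
html = """
<html><body>
  <h2 class="title"><a href="/a">First post</a></h2>
  <h2 class="title"><a href="/b">Second post</a></h2>
  <h2>Untitled</h2>
</body></html>
"""

{:ok, doc} = Floki.parse_document(html)

# Text of every h2 that carries the "title" class.
titles =
  doc
  |> Floki.find("h2.title")
  |> Enum.map(&Floki.text/1)

# href attribute of the links inside those headings.
links =
  doc
  |> Floki.find("h2.title a")
  |> Floki.attribute("href")

IO.inspect(titles, label: "titles")
IO.inspect(links, label: "links")
```

The third heading has no `title` class, so it is excluded from both lists.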

This is clean, idiomatic, and fast for static HTML. The limit is structural, not a bug: Floki parses the bytes the server sent, and never executes a line of JavaScript.

The Crawly framework for larger jobs

For multi-page jobs with link-following, request scheduling, and pipelines, Crawly gives you a Scrapy-style framework on the BEAM — spiders, middlewares, and item pipelines. It is the right tool when you outgrow a handful of hand-written Tasks, but it inherits the same JavaScript blind spot as raw Floki, because the fetch layer underneath is still an HTTP client, not a browser.

Concurrency on the BEAM for fan-out scraping

This is where Elixir earns its place. Scraping is overwhelmingly I/O-bound — you spend almost all your wall-clock time waiting on the network — and the BEAM's scheduler was built for exactly that workload.

Task.async_stream with bounded concurrency

Task.async_stream/3 is the workhorse. Give it an enumerable of URLs and a function, and it runs them concurrently with a configurable cap, streaming results back as each finishes:

  • urls |> Task.async_stream(&scrape/1, max_concurrency: 20, timeout: 30_000)
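  • Each result comes back as {:ok, result} or {:exit, reason}. Tasks are linked to the caller, so rescue expected errors inside scrape/1, or use Task.Supervisor.async_stream_nolink, if a crashing task must not take down the batch.
  • Set on_timeout: :kill_task so a page that exceeds the deadline is killed and reported as {:exit, :timeout}. The default, on_timeout: :exit, exits the calling process instead.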
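
A minimal sketch of that pattern, using only the standard library. The `FanOut` module and the injected `scrape_fun` are hypothetical names; passing the scrape function as an argument lets the sketch run offline with a stub in place of a real fetch.

```elixir
defmodule FanOut do
  # Run scrape_fun over urls with bounded concurrency.
  # Returns a list of {url, {:ok, value} | {:error, reason}} in input order.
  def run(urls, scrape_fun, opts \\ []) do
    results =
      urls
      |> Task.async_stream(scrape_fun,
        max_concurrency: Keyword.get(opts, :max_concurrency, 20),
        timeout: Keyword.get(opts, :timeout, 30_000),
        # Kill a task that exceeds the timeout instead of exiting the caller.
        on_timeout: :kill_task
      )
      |> Enum.to_list()

    # Results are emitted in input order by default, so zipping is safe.
    urls
    |> Enum.zip(results)
    |> Enum.map(fn
      {url, {:ok, value}} -> {url, {:ok, value}}
      {url, {:exit, reason}} -> {url, {:error, reason}}
    end)
  end
end

# Stub scraper: one URL hangs past the deadline, the others return quickly.
fake_scrape = fn
  "slow" ->
    Process.sleep(500)
    :late

  url ->
    String.upcase(url)
end

outcome = FanOut.run(["a", "slow", "b"], fake_scrape, timeout: 50)
IO.inspect(outcome)
```

Zipping with the input list matters because an `{:exit, :timeout}` result carries no URL of its own.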

Backpressure and max_concurrency tuning

max_concurrency is the single most important knob, and it is genuinely about backpressure: the stream only pulls the next URL when a worker slot frees up, so memory and open sockets stay bounded no matter how long the input list is. The right value is not "as high as possible" — it is the point where you saturate your downstream (the target site's tolerance, your proxy budget, or the latency tail of whatever does the rendering) without queueing requests that will time out anyway. We come back to picking that number from real latency data below.

Supervision trees for resilient long-running crawls

For a crawl that runs for hours, wrap the work in a supervised GenServer or a Task.Supervisor so a crash is contained to the unit of work instead of killing the run. The "let it crash" philosophy is a real operational advantage here: a parse error on one malformed page becomes an isolated, logged failure (and a restart, if the worker is supervised with a restart strategy), not a pipeline-wide failure.
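
One way to get that isolation is `Task.Supervisor.async_stream_nolink`, sketched below with the standard library only. The `Crawl` module is a hypothetical name, and the supervisor is started inline purely to keep the example self-contained; in an application it would normally be a named child of your supervision tree.

```elixir
defmodule Crawl do
  # Run scrape_fun over urls under a Task.Supervisor without linking tasks
  # to the caller, so a crash in one task becomes a value, not a dead run.
  def run(urls, scrape_fun) do
    {:ok, sup} = Task.Supervisor.start_link()

    results =
      sup
      |> Task.Supervisor.async_stream_nolink(urls, scrape_fun,
        max_concurrency: 10,
        timeout: 30_000,
        on_timeout: :kill_task
      )
      |> Enum.to_list()

    Supervisor.stop(sup)

    urls
    |> Enum.zip(results)
    |> Enum.map(fn
      {url, {:ok, value}} -> {url, {:ok, value}}
      # A raised exception arrives as {:exit, {exception, stacktrace}}.
      {url, {:exit, reason}} -> {url, {:error, reason}}
    end)
  end
end

# Stub scraper: one "malformed page" raises, the other succeeds.
fake_scrape = fn
  "bad" -> raise "malformed page"
  url -> String.length(url)
end

crawl_outcome = Crawl.run(["ok", "bad"], fake_scrape)
IO.inspect(crawl_outcome)
```

The crashed task is still logged by the runtime, but the caller keeps going and can decide per URL whether to retry or skip.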

Where Floki parsing stops being enough

Everything above works beautifully — until the target site renders its content client-side. Then the BEAM's concurrency is fanning out requests that come back empty.

JavaScript-rendered pages Floki can't see

A growing share of the web ships a near-empty HTML shell and hydrates the real content with JavaScript. Floki sees the shell. You will Floki.find/2 the right selector and get an empty list, because the node does not exist until a browser runs the page's scripts. No amount of Req tuning fixes this — the data was never in the response body.

No native headless browser in the Elixir stack

Unlike Python (Playwright) or Node (Puppeteer), there is no first-class, well-maintained headless-browser binding in pure Elixir. Teams reach for Wallaby (built for testing, driving a real ChromeDriver) or shell out to a Node Playwright process, and both put a browser fleet — heavy memory, slow startup, and ongoing maintenance — on the critical path of every JS-rendered URL. That is a lot of operational weight to bolt onto an otherwise lean BEAM service.

Anti-bot and IP rotation gaps

The Elixir ecosystem has no equivalent of a managed anti-bot or proxy-rotation layer. You can set headers and throttle politely with Req, but stealth fingerprinting, challenge solving, and IP rotation are out of scope — and reimplementing them is its own full-time project, not a scraping side-quest.

Calling a Firecrawl-compatible API from Elixir

The pragmatic split is to keep the BEAM doing what it is great at — concurrency, orchestration, supervision — and hand off rendering and extraction to a scrape API. fastCRW exposes a Firecrawl-compatible REST surface, so this is just another Req POST; it is a drop-in target after a base-URL swap. There is no native Elixir SDK (the first-party clients are the crw Python SDK and the crw-mcp@0.6.0 MCP package), and you do not need one — REST is the interface. See Firecrawl API compatibility for the exact field-level contract.

A Req POST to /v1/scrape returning markdown

One POST to /v1/scrape with the target URL returns clean, LLM-ready markdown for that page — including content that only existed after JavaScript ran, because the rendering happens server-side:

  • Req.post!("https://api.fastcrw.com/v1/scrape", json: %{url: url, formats: ["markdown"]}, auth: {:bearer, key})
  • The response carries the parsed markdown; you skip Floki entirely for these pages.
  • To self-host, point the same call at your own engine's base URL — the binary is the same one the managed cloud runs.
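
Wrapped in a function, the call might look like the sketch below. The `ScrapeClient` module name and the 30-second timeout are assumptions. So is the response shape: the code assumes the markdown sits under `data.markdown`, as in Firecrawl's v1 responses, which you should confirm against the compatibility contract before relying on it.

```elixir
Mix.install([{:req, "~> 0.5"}])

defmodule ScrapeClient do
  @default_base "https://api.fastcrw.com"

  # POST one URL to /v1/scrape and return {:ok, markdown} or {:error, reason}.
  # Pass base_url: to target a self-hosted engine instead of the cloud.
  def scrape(url, api_key, opts \\ []) do
    {base_url, req_opts} = Keyword.pop(opts, :base_url, @default_base)

    request =
      [
        base_url: base_url,
        url: "/v1/scrape",
        auth: {:bearer, api_key},
        json: %{url: url, formats: ["markdown"]},
        receive_timeout: 30_000
      ]
      |> Keyword.merge(req_opts)
      |> Req.new()

    case Req.post(request) do
      # Assumed response shape: {"data": {"markdown": "..."}}.
      {:ok, %Req.Response{status: 200, body: %{"data" => %{"markdown" => md}}}} ->
        {:ok, md}

      {:ok, %Req.Response{status: status, body: body}} ->
        {:error, {status, body}}

      {:error, exception} ->
        {:error, exception}
    end
  end
end
```

Returning tagged tuples rather than raising keeps the function safe to call from many concurrent tasks, where one bad response should not crash its siblings.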

Fanning out scrape calls with Task.async_stream

The whole point of doing this from Elixir is that the API call slots straight into the same fan-out primitive — your scrape function becomes "POST and decode" instead of "fetch and Floki," and Task.async_stream orchestrates the rest with the same bounded concurrency you already tuned. Nothing about the concurrency model changes; only the work inside each task does.
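
As a sketch of that swap, the function below fans out any "POST and decode" function and sorts the results into pages and failures. `MarkdownFanOut` is a hypothetical name, and `scrape_fun` is assumed to return `{:ok, markdown}` or `{:error, reason}`; a stub stands in for the real API call so the example runs offline.

```elixir
defmodule MarkdownFanOut do
  # Returns %{pages: %{url => markdown}, failed: [{url, reason}]}.
  def run(urls, scrape_fun, max_concurrency \\ 10) do
    results =
      urls
      |> Task.async_stream(scrape_fun,
        max_concurrency: max_concurrency,
        timeout: 30_000,
        on_timeout: :kill_task
      )
      |> Enum.to_list()

    urls
    |> Enum.zip(results)
    |> Enum.reduce(%{pages: %{}, failed: []}, fn
      # Outer tuple comes from Task.async_stream, inner tuple from scrape_fun.
      {url, {:ok, {:ok, markdown}}}, acc -> put_in(acc.pages[url], markdown)
      {url, {:ok, {:error, reason}}}, acc -> update_in(acc.failed, &[{url, reason} | &1])
      {url, {:exit, reason}}, acc -> update_in(acc.failed, &[{url, reason} | &1])
    end)
  end
end

# Stub standing in for the API call.
fake_api = fn
  "blocked" -> {:error, :blocked}
  url -> {:ok, "# " <> url}
end

summary = MarkdownFanOut.run(["a", "blocked"], fake_api)
IO.inspect(summary)
```

The nested match is the only new wrinkle: the stream wraps every return value in its own `{:ok, _}`, so an API-level error still arrives inside a successful task result.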

Structured extraction with formats: ["json"]

If you want typed records instead of markdown, pass formats: ["json"] with a jsonSchema and the API returns structured data matching your schema in one call — no selector code to maintain. Note this is a 5-credit operation versus 1 for a plain scrape, and LLM extraction runs on OpenAI or Anthropic providers only. Our deep-dive on structured extraction with JSON Schema covers the schema shape and failure modes.
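
A rough sketch of the request body follows. The `ExtractRequest` module, the example schema, and the placement of `jsonSchema` as a top-level key are assumptions, so check the field-level contract before relying on them. The `missing_required/2` helper is likewise illustrative: a cheap client-side check that a returned record carries the fields your schema marks as required.

```elixir
defmodule ExtractRequest do
  # Build the JSON body for a structured-extraction call.
  # Assumption: jsonSchema is sent as a top-level key next to formats.
  def body(url, schema) do
    %{url: url, formats: ["json"], jsonSchema: schema}
  end

  # List required schema fields that are absent from a decoded record.
  def missing_required(record, schema) do
    schema
    |> Map.get("required", [])
    |> Enum.reject(&Map.has_key?(record, &1))
  end
end

# Hypothetical schema for a product page.
schema = %{
  "type" => "object",
  "properties" => %{
    "title" => %{"type" => "string"},
    "price" => %{"type" => "number"}
  },
  "required" => ["title", "price"]
}

body = ExtractRequest.body("https://example.com/product/1", schema)
missing = ExtractRequest.missing_required(%{"title" => "Widget"}, schema)

IO.inspect(body, label: "request body")
IO.inspect(missing, label: "missing fields")
```

The body map can be passed as the `json:` option of a Req POST in place of the markdown request.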

Latency and accuracy trade-offs for agent pipelines

If you are fanning out scrapes to feed an agent or an enrichment pipeline, latency distribution — not a single average — is what shapes your concurrency settings and your timeouts.

Median latency vs the long tail

On Firecrawl's own public 1,000-URL scrape-content-dataset-v1 (of which 819 carry labeled ground truth), measured with the diagnose_3way.py harness on 2026-05-08, fastCRW posted the highest truth-recall of the three tools tested: 63.74% of 819 labeled URLs, ahead of Crawl4AI (59.95%) and Firecrawl (56.04%), at 91.8% scrape-success of reachable URLs with 0 thrown errors. On speed, fastCRW's p50 of 1914 ms beats Firecrawl's 2305 ms. In fast mode, fastCRW's p90 is 4348 ms, the lowest of the three (Crawl4AI 4754 ms, Firecrawl 6937 ms). fastCRW also recovers 34 URLs that neither Crawl4AI nor Firecrawl reach, 70% more exclusive recoveries than the other two combined. You can see the full p50/p90 split on /benchmarks, and scraping latency explained unpacks why the tail dominates wall-clock time on a concurrent run.

The practical consequence for an Elixir pipeline: a long tail is exactly what Task.async_stream handles gracefully if you configure it. Set a per-task timeout above your acceptable p90 (or use a deadline and accept the occasional kill), use on_timeout: :kill_task, and treat {:exit, :timeout} as a retry-or-skip decision rather than a crash. Bounded max_concurrency keeps a burst of slow pages from exhausting sockets. The same patterns you would use in Go's worker pools apply on the BEAM — the language differs, the discipline does not.
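
The retry-or-skip decision can be sketched with the standard library alone. `RetryOrSkip` is a hypothetical module; the 8-second default timeout (chosen to sit above the 4348 ms p90 quoted earlier) and the two-attempt limit are assumptions, not recommendations.

```elixir
defmodule RetryOrSkip do
  # Returns {done, skipped}: a map of url => value, plus the URLs that
  # timed out on every attempt.
  def run(urls, scrape_fun, opts \\ []) do
    attempts = Keyword.get(opts, :attempts, 2)
    loop(urls, scrape_fun, opts, attempts, %{})
  end

  # Out of attempts: whatever is still pending gets skipped.
  defp loop(pending, _fun, _opts, 0, done), do: {done, pending}
  defp loop([], _fun, _opts, _left, done), do: {done, []}

  defp loop(pending, fun, opts, left, done) do
    results =
      pending
      |> Task.async_stream(fun,
        max_concurrency: Keyword.get(opts, :max_concurrency, 20),
        timeout: Keyword.get(opts, :timeout, 8_000),
        on_timeout: :kill_task
      )
      |> Enum.to_list()

    {done, retry} =
      pending
      |> Enum.zip(results)
      |> Enum.reduce({done, []}, fn
        {url, {:ok, value}}, {ok, retry} -> {Map.put(ok, url, value), retry}
        # A timeout is a scheduling decision, not a crash: queue it again.
        {url, {:exit, :timeout}}, {ok, retry} -> {ok, [url | retry]}
      end)

    loop(Enum.reverse(retry), fun, opts, left - 1, done)
  end
end

# Stub scraper: "slow" always exceeds the deadline.
fake_scrape = fn
  "slow" ->
    Process.sleep(300)
    :late

  url ->
    String.upcase(url)
end

{done, skipped} = RetryOrSkip.run(["a", "slow"], fake_scrape, timeout: 50, attempts: 2)
IO.inspect({done, skipped})
```

Only timeouts are retried here. A task that raises would still crash the caller, because `Task.async_stream` links its tasks, so wrap the scrape function or use a `Task.Supervisor` if crashes are expected.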

Self-host the engine next to your BEAM nodes

Because the engine is a single static Rust binary — a ~8 MB image running in one container, no Redis or Node sidecar required — you can run it on the same host or network as your BEAM nodes. That collapses the API round-trip to localhost latency and keeps scraped content on your own infrastructure. Self-hosting the AGPL-3.0 engine costs nothing per request; you pay only for the server. The repo is github.com/us/crw; managed cloud is at fastcrw.com if you would rather not run it.

Honest gaps: stateless, OpenAI/Anthropic-only extraction

State the limits plainly. The engine is stateless per request — there is no persistent browser session you can drive across calls, so login-flow scraping has to be modeled differently. LLM extraction supports OpenAI and Anthropic only (the managed search path uses a managed LLM, a separate feature). There is no screenshot output — a request for formats: ["screenshot"] returns HTTP 422 — and no multi-URL batched extract endpoint, so for many URLs you iterate /v1/scrape concurrently (which, conveniently, is precisely what Task.async_stream is for) or use /v1/crawl. Where you genuinely depend on persistent sessions or screenshots, a browser-driving stack like Wallaby still wins, and you should use it for those pages.

Sources

Related: Web scraping in Go · Firecrawl API compatibility · Structured extraction with JSON Schema · Scraping latency explained

Frequently asked questions

What libraries do I use for web scraping in Elixir?
The modern stack is Req for HTTP fetching (it replaced HTTPoison/Tesla as the default, with pooling and retries built in) and Floki for HTML parsing with CSS selectors. For larger multi-page jobs, Crawly gives you a Scrapy-style spider framework on the BEAM. None of the three executes JavaScript: they work only with the HTML the server returns.

Can Floki parse JavaScript-rendered pages?
No. Floki parses the raw HTML bytes in the response body and never runs JavaScript, so client-rendered content (a page that ships an empty shell and hydrates via JS) returns empty selector results. There is no first-class headless browser in pure Elixir, so the common fix is to hand JS-heavy URLs to a rendering API and keep BEAM concurrency for orchestration.

How do I run concurrent scrapes with Task.async_stream?
Pipe your URL list into Task.async_stream/3 with a scrape function and a max_concurrency cap, e.g. urls |> Task.async_stream(&scrape/1, max_concurrency: 20, timeout: 30_000, on_timeout: :kill_task). It runs requests concurrently with backpressure (only pulling the next URL when a slot frees) and streams results back as {:ok, result} or {:exit, reason}. With on_timeout: :kill_task, a hung URL is reported as {:exit, :timeout} instead of stopping the batch. To isolate crashes as well, rescue inside scrape/1 or use Task.Supervisor.async_stream_nolink.

Is there an Elixir SDK for fastCRW?
No. The first-party clients are the crw Python SDK and the crw-mcp@0.6.0 MCP package; there is no native Elixir SDK. You do not need one: fastCRW exposes a Firecrawl-compatible REST API, so you call /v1/scrape directly with Req (a base-URL swap is the whole integration).

How does scrape latency affect a concurrent Elixir pipeline?
Size your settings against the p90, not just the median. On Firecrawl's public dataset (diagnose_3way.py, 2026-05-08) fastCRW's p50 of 1914 ms beats Firecrawl's 2305 ms. In fast mode, fastCRW's p90 is 4348 ms, the lowest of the three tested. For the minority of URLs requiring chrome-stealth recovery (the same mechanism that gives fastCRW 34 exclusive recoveries no competitor reaches), latency rises on those specific pages. In Elixir, set a per-task timeout above your acceptable p90, use on_timeout: :kill_task, bound max_concurrency so slow pages don't exhaust sockets, and treat timeouts as retry-or-skip rather than crashes.
