Bash & CLI Web Scraping: One-Off Shell Pipelines

One-off bash and CLI web scraping with curl, pup, and jq. Build interactive shell pipelines, see where they break, and know when to hand off to a scrape API.

By Recep · June 23, 2026 · 7 min read · Last updated: June 2, 2026

By the fastCRW team · Structural facts verified 2026-05-18 · Verify independently.

When one-off bash and CLI web scraping is the right tool

Bash, curl, and a couple of parsers are the fastest way to answer a question like "what's the current price on this page?" or "give me every link in this nav as JSON." No project, no virtualenv, no dependencies to install in a Dockerfile — just a one-liner you type, read, and throw away. This guide is scoped strictly to that interactive, one-off shell pattern. If you want to run a scraper on a schedule, that belongs to the bash + cron pattern; if you want a deep reference on curl's flags themselves, see the curl web scraping guide. Keeping those three pages distinct means none of them competes for your attention when you just need the quick ad-hoc version.

The reason the shell wins for ad-hoc work is composability. Every tool reads from stdin and writes to stdout, so you assemble a scraper out of small pieces with a pipe: fetch with curl, select HTML nodes with pup (or htmlq), and shape the result with jq. There is no framework to learn — only the contract that text flows left to right.

The interactive shell scraping toolkit

Fetching with curl (headers, cookies, redirects)

Curl is the fetch half of every shell pipeline. The flags you reach for most often are -s (silent: suppress the progress meter and error messages, which curl writes to stderr, so they don't clutter your terminal; add -S if you still want to see errors), -L (follow redirects), and -A to set a User-Agent that isn't the default curl/8.x string many sites block:

  • curl -sL -A "Mozilla/5.0" https://example.com — fetch a page, following redirects, with a browser-like agent.
  • curl -sL -H "Cookie: session=abc123" https://example.com — send a cookie for a page behind a simple login.
  • curl -sL -b cookies.txt -c cookies.txt https://example.com — read and write a cookie jar to persist a session across calls.

That covers the great majority of one-off needs. When the page returns the HTML you expected, you pipe it into a parser.
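
If you keep retyping the same flags during a session, a small shell function can hold them in one place. The following is a minimal sketch rather than a recommended wrapper: the name fetch is made up for illustration, and it adds -S, a standard curl flag that keeps error messages visible even when -s is set.

```shell
# fetch: hypothetical helper name, not part of curl.
# -s hides the progress meter, -S still prints errors, -L follows redirects,
# -A sends a browser-like User-Agent. Extra arguments pass straight to curl.
fetch() {
  curl -sSL -A "Mozilla/5.0" "$@"
}
```

Typed at the prompt, fetch https://example.com behaves like the first bullet above, and fetch -b cookies.txt -c cookies.txt https://example.com adds the cookie jar, because extra arguments are handed to curl unchanged.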

Extracting HTML nodes with pup or htmlq

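pup is jq for HTML: you hand it a CSS selector and it returns matching nodes. htmlq is a Rust alternative in the same spirit if you prefer it, though it selects output with flags such as --text and --attribute rather than pup's text{} and attr{} display functions. A few pup patterns cover most extraction: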

  • curl -sL https://example.com | pup 'h1 text{}' — the text of every <h1>.
  • curl -sL https://example.com | pup 'a attr{href}' — every link's href, one per line.
  • curl -sL https://example.com | pup 'div.price json{}' — matching nodes as a JSON array, ready for jq.
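
Selectors rarely work on the first try, so it can help to save the page once and iterate against the local copy instead of re-fetching on every attempt. A minimal sketch, assuming pup is installed; the helper names snapshot and try_selector and the file name page.html are illustrative and not part of either tool.

```shell
# snapshot: hypothetical helper. Saves URL ($1) to a local file ($2) so that
# selector experiments do not hit the site again.
snapshot() {
  curl -sL -A "Mozilla/5.0" "$1" -o "$2"
}

# try_selector: hypothetical helper. Runs a pup selector ($2) against the
# saved file ($1) by feeding it on stdin.
try_selector() {
  pup "$2" < "$1"
}
```

A session might then look like snapshot https://example.com page.html followed by try_selector page.html 'h1 text{}', repeated with different selectors until the output looks right.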

Shaping JSON output with jq

Once pup ... json{} hands you JSON, jq does the shaping. To pull the text out of an array of price nodes and clean it up:

  • ... | pup 'div.price json{}' | jq -r '.[].text' — extract the text field from each node as raw strings.
  • ... | jq -r '.[] | {title: .text, href: .href}' — reshape each node into a tidy record.
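
The bullets above extract text exactly as the page wrote it, currency symbols and padding included. One way to do the cleanup step in jq itself is sketched below. The function name prices_to_numbers is hypothetical, and the filter assumes prices use a dot as the decimal separator with commas only as thousands separators.

```shell
# prices_to_numbers: hypothetical helper; reads pup-style JSON nodes on stdin.
# For each node it keeps only digits and dots from the text field, skips nodes
# whose text is missing or has no digits, and converts the rest to numbers.
prices_to_numbers() {
  jq '.[] | (.text // "") | gsub("[^0-9.]"; "") | select(length > 0) | tonumber'
}
```

Appended to the pipeline as ... | pup 'div.price json{}' | prices_to_numbers, it prints one number per line, which is convenient for sorting or summing with further jq or standard shell tools.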

String these together and you have a complete scraper on one line: curl -sL URL | pup 'selector json{}' | jq '...'. For a static, server-rendered page, that is genuinely all you need — and it's hard to beat for speed of iteration.

Where one-off shell scraping breaks down

No JavaScript execution in curl

Curl fetches the HTML the server sends and nothing more. It does not run a browser, so any content rendered client-side — a price injected by React, a list hydrated from a fetch call, an infinite-scroll feed — simply is not in the bytes curl receives. You will see an empty <div id="root"></div> and a pile of script tags. No amount of pup wizardry recovers data that was never in the response. This is the single most common wall, and it has no shell-only workaround.
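
A quick way to find out which side of this wall a page is on is to search the raw response for a string you can see in the browser. The sketch below does only that; the helper name in_raw_html is an assumption for illustration.

```shell
# in_raw_html: hypothetical helper. Succeeds (exit 0) when the literal text in
# $2 appears in the bytes curl receives from the URL in $1, and fails otherwise.
in_raw_html() {
  curl -sL -A "Mozilla/5.0" "$1" | grep -qF -- "$2"
}
```

For example, in_raw_html https://example.com 'Example Domain' && echo "in the HTML" || echo "not in the HTML" tells you in one step whether a selector pipeline has anything to work with. A miss usually points to client-side rendering, though it can also mean the text is encoded differently in the source.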

Fragile selector pipelines on dynamic markup

Even on static pages, CSS selectors are brittle. A class rename, an added wrapper <div>, or an A/B-test variant silently changes div.price to div.price--v2, and your pipeline returns empty without erroring. For a one-off run you notice immediately and fix it; for anything you'd want to trust twice, that fragility is a liability — which is exactly why scheduled work belongs in a different pattern with logging and alerting.
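
Because an empty result is not an error as far as the shell is concerned, one way to make selector drift visible is a small guard at the end of the pipeline. A minimal sketch; the function name nonempty is hypothetical.

```shell
# nonempty: hypothetical helper. Passes stdin through unchanged when there is
# any output, and fails with a message on stderr when the input is empty.
nonempty() {
  local data
  data=$(cat)
  if [ -z "$data" ]; then
    echo "nonempty: upstream produced no output (selector drift?)" >&2
    return 1
  fi
  printf '%s\n' "$data"
}
```

Used as curl -sL URL | pup 'div.price text{}' | nonempty, a renamed class now produces a visible error and a non-zero exit status. If your parser emits an empty JSON array rather than no output, jq -e 'length > 0' plays a similar role, since -e makes jq exit non-zero when its last result is false or null.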

Anti-bot and pagination beyond a single ad-hoc run

A single curl request from your laptop is usually fine. Start looping over hundreds of pages and you hit rate limits, IP blocks, CAPTCHAs, and pagination logic that a one-liner was never meant to carry. The moment you're writing a while loop with sleep-and-retry around curl, you've outgrown the interactive pattern.

Piping a Firecrawl-compatible API into the pipeline

A single curl POST to /v1/scrape returning markdown

The clean hand-off keeps the shell ergonomics but swaps the fetch step for a managed scrape endpoint that does render JavaScript and returns LLM-ready markdown instead of raw HTML you have to parse. fastCRW exposes a Firecrawl-compatible REST API, so it's a drop-in after a base-URL swap — and it's still just curl:

  • curl -s https://api.fastcrw.com/v1/scrape -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" -d '{"url":"https://example.com","formats":["markdown"]}'

You get back JSON with a markdown field — clean prose, no <div> soup, no broken selectors. Because the response is JSON, the rest of your pipeline doesn't change: pipe it into jq. For the broader "HTML page in, clean text out" framing, see website to markdown.
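
Hand-writing the JSON body inside single quotes gets awkward once the URL comes from a variable or contains quotes. One option is to let jq build the body. This is a sketch under stated assumptions: the function names scrape_body and scrape_md are hypothetical, while the endpoint, the headers, $KEY, and the .data.markdown path are the ones used in this article.

```shell
# scrape_body: hypothetical helper. Builds the request body for the URL in $1;
# --arg makes jq escape the value, so quotes in the URL cannot break the JSON.
scrape_body() {
  jq -cn --arg url "$1" '{url: $url, formats: ["markdown"]}'
}

# scrape_md: hypothetical helper. POSTs the body (curl reads it from stdin via
# -d @-) and prints the markdown field of the response.
scrape_md() {
  scrape_body "$1" |
    curl -s https://api.fastcrw.com/v1/scrape \
      -H "Authorization: Bearer $KEY" \
      -H "Content-Type: application/json" \
      -d @- |
    jq -r '.data.markdown'
}
```

With KEY exported in the session, scrape_md https://example.com prints the page as markdown, and scrape_body on its own is a handy way to inspect what would be sent.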

Filtering structured JSON output with jq

Pull the markdown straight out, or go one step further and ask the API for structured fields instead of markdown by sending "formats":["json"] with a jsonSchema — that returns typed records you filter with jq exactly as before:

  • curl -s .../v1/scrape -d '{"url":"...","formats":["markdown"]}' | jq -r '.data.markdown'
  • curl -s .../v1/scrape -d '{"url":"...","formats":["json"],"jsonSchema":{...}}' | jq '.data.json'

Note that a request with formats: ["json"] is a 5-credit operation versus 1 credit for a plain scrape, so reach for it only when you actually want structured extraction rather than markdown.

The hand-off boundary: interactive vs scheduled vs at-scale

The decision tree is simple. One-off and static? Stay in curl | pup | jq. One-off but JavaScript-heavy or anti-bot-blocked? Swap the fetch step for a single /v1/scrape POST and keep piping into jq. Need it on a schedule with locking, retries, and logging? That's the bash + cron pattern, not this one.

Interactive shell scraping vs a managed scrape endpoint

Where the line is for one-off work

For genuinely ad-hoc, static-page extraction, the pure shell stack is the right answer — zero setup beats everything. The API earns its place the moment the page needs a browser to render or the target actively blocks plain curl. You don't have to choose one globally; you choose per page, and both paths are just curl into jq.

Self-host the binary for a fully local CLI pipeline

If you want the rendering muscle without a cloud round-trip, you can self-host the engine and point curl at localhost. fastCRW's engine is a single ~8 MB static Rust binary that runs in one container (per the README's structural facts), so a local instance is one docker run away and your entire pipeline stays on your machine, with no Python and no external dependency. The engine is licensed under AGPL-3.0. See self-host with Docker Compose for the setup.
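
Since the difference between cloud and local is only the base URL, one way to switch is an environment variable. This is a sketch, not documented configuration: the variable name CRW_BASE and the function name scrape_endpoint are hypothetical, and the localhost port in the usage note is a placeholder for whatever your instance listens on.

```shell
# scrape_endpoint: hypothetical helper. Prints the scrape URL for the base in
# CRW_BASE (a made-up variable name), defaulting to the cloud host used in
# this article. A trailing slash on the base is dropped.
scrape_endpoint() {
  local base="${CRW_BASE:-https://api.fastcrw.com}"
  printf '%s/v1/scrape\n' "${base%/}"
}
```

The earlier one-liners then become curl -s "$(scrape_endpoint)" ... with the rest unchanged, and running export CRW_BASE=http://localhost:3000 (port assumed) redirects every call in the session to the local instance.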

Honest gaps: stateless, no screenshot output

Two limits worth stating plainly so they don't surprise you mid-pipeline. fastCRW is stateless per request — there is no persistent session, so multi-step authenticated flows that depend on carried state aren't a fit for a single scrape call. And there is no screenshot output: a request for formats: ["screenshot"] returns HTTP 422. If your one-off job needs a rendered image of the page, the shell-plus-API path won't give it to you, and that's a real gap rather than something to route around.
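
An error status such as the 422 above is easy to miss in a pipeline, because curl -s exits successfully on HTTP errors and jq -r '.data.markdown' prints null when the field is absent. One way to surface it is to capture the status code with curl's -w '%{http_code}' option. A minimal sketch; the function names is_success and scrape_checked are hypothetical.

```shell
# is_success: hypothetical helper. True only for three-digit 2xx status codes.
is_success() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# scrape_checked: hypothetical helper. POSTs the JSON body in $1, prints the
# response body on success, and reports the status on stderr otherwise.
scrape_checked() {
  local body status
  body=$(mktemp)
  status=$(curl -s -o "$body" -w '%{http_code}' \
    https://api.fastcrw.com/v1/scrape \
    -H "Authorization: Bearer $KEY" \
    -H "Content-Type: application/json" \
    -d "$1")
  if is_success "$status"; then
    cat "$body"
    rm -f "$body"
  else
    echo "scrape failed with HTTP $status" >&2
    cat "$body" >&2
    rm -f "$body"
    return 1
  fi
}
```

Piping scrape_checked '{"url":"https://example.com","formats":["markdown"]}' into jq -r '.data.markdown' then gives either markdown or a visible error on stderr, instead of a silent null.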

Related: curl web scraping guide · bash + cron scheduled scraping · website to markdown · Firecrawl API compatibility

Frequently asked questions

When should I use bash/curl for one-off web scraping instead of Python?

Reach for bash, curl, pup, and jq when the job is ad-hoc and the page is static (server-rendered HTML): there's no virtualenv or dependency setup, just a one-liner you type and discard. Use Python when you need persistent logic, complex pagination, or a maintained project; the shell shines specifically for throwaway, interactive extraction where speed of iteration beats structure.

Can curl scrape JavaScript-rendered pages?

No. Curl fetches only the HTML the server returns; it does not run a browser, so any content injected client-side (React prices, hydrated lists, infinite scroll) is absent from the response. There is no shell-only workaround. For JS-heavy pages, swap the curl fetch step for a single POST to a rendering scrape API such as fastCRW's Firecrawl-compatible /v1/scrape, then keep piping the JSON result into jq.

How do I parse HTML in the shell with pup and jq?

Pipe curl's output into pup with a CSS selector, then into jq to shape it. For example: curl -sL URL | pup 'div.price json{}' | jq -r '.[].text' fetches a page, selects matching nodes as JSON, and extracts each node's text as raw strings. pup handles HTML selection (htmlq is a comparable Rust alternative with its own flag syntax); jq handles JSON filtering and reshaping.

How do I call a scraping API from a bash one-liner?

Send a curl POST with your API key and a JSON body, then filter with jq. Against fastCRW's Firecrawl-compatible endpoint: curl -s https://api.fastcrw.com/v1/scrape -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" -d '{"url":"https://example.com","formats":["markdown"]}' | jq -r '.data.markdown'. A plain markdown scrape costs 1 credit; requesting formats:["json"] for structured extraction costs 5.

Should I use this interactive pattern or the bash + cron pattern for scheduled runs?

Use this interactive curl | pup | jq pattern only for one-off, ad-hoc extraction you run by hand. The moment you want a scraper to run on a schedule, you need overlap prevention, retries, logging, and alerting. That's the separate bash + cron pattern, which adds flock locking and error handling around the same curl calls. Don't put production scheduled work in a throwaway one-liner.
