By the fastCRW team · Structural facts verified 2026-05-18 · Verify independently.
When one-off bash and CLI web scraping is the right tool
Bash, curl, and a couple of parsers are the fastest way to answer a question like "what's the current price on this page?" or "give me every link in this nav as JSON." No project, no virtualenv, no dependencies to install in a Dockerfile — just a one-liner you type, read, and throw away. This guide is scoped strictly to that interactive, one-off shell pattern. If you want to run a scraper on a schedule, that belongs to the bash + cron pattern; if you want a deep reference on curl's flags themselves, see the curl web scraping guide. Keeping those three pages distinct means none of them competes for your attention when you just need the quick ad-hoc version.
The reason the shell wins for ad-hoc work is composability. Every tool reads from stdin and writes to stdout, so you assemble a scraper out of small pieces with a pipe: fetch with curl, select HTML nodes with pup (or htmlq), and shape the result with jq. There is no framework to learn — only the contract that text flows left to right.
The interactive shell scraping toolkit
Fetching with curl (headers, cookies, redirects)
Curl is the fetch half of every shell pipeline. The flags you reach for most often are -s (silent: drop the progress meter, which curl writes to stderr and which otherwise clutters the terminal when stdout is piped), -L (follow redirects), and -A to set a User-Agent that isn't the default curl/8.x string many sites block:
- curl -sL -A "Mozilla/5.0" https://example.com
  Fetch a page, following redirects, with a browser-like agent.
- curl -sL -H "Cookie: session=abc123" https://example.com
  Send a cookie for a page behind a simple login.
- curl -sL -b cookies.txt -c cookies.txt https://example.com
  Read and write a cookie jar to persist a session across calls.
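The three invocations above share most of their flags, so during a longer interactive session it can be convenient to wrap them in a small function. The following is a minimal sketch, not a canonical recipe: the name fetch_page, the UA variable, and the 20-second timeout are illustrative assumptions. It also adds --fail, so an HTTP 4xx/5xx response becomes a non-zero exit status instead of an error page flowing silently into your parser.

```shell
# fetch_page URL [extra curl args...]
# Hypothetical helper: the name, the UA variable, and the timeout are assumptions.
fetch_page() {
  local url=$1
  shift
  # -s  silence the progress meter; -S still print real errors to stderr
  # -L  follow redirects; --fail exit non-zero on HTTP status >= 400
  # --max-time caps the whole transfer, in seconds
  curl -sSL --fail --max-time 20 \
    -A "${UA:-Mozilla/5.0}" \
    "$@" "$url"
}
```

For example, fetch_page https://example.com -b cookies.txt -c cookies.txt reuses the cookie jar from the last bullet while keeping the other defaults.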
That covers the great majority of one-off needs. When the page returns the HTML you expected, you pipe it into a parser.
Extracting HTML nodes with pup or htmlq
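pup is jq for HTML: you hand it a CSS selector and it returns matching nodes. htmlq is a Rust alternative built on the same CSS-selector idea, though it uses flags such as --text and --attribute href where pup uses the text{} and attr{href} display functions. A few pup patterns cover most extraction: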
- curl -sL https://example.com | pup 'h1 text{}'
  The text of every <h1>.
- curl -sL https://example.com | pup 'a attr{href}'
  Every link's href, one per line.
- curl -sL https://example.com | pup 'div.price json{}'
  Matching nodes as a JSON array, ready for jq.
Shaping JSON output with jq
Once pup ... json{} hands you JSON, jq does the shaping. To pull the text out of an array of price nodes and clean it up:
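- ... | pup 'div.price json{}' | jq -r '.[].text'
  Extract the text field from each node as raw strings.
- ... | pup 'a json{}' | jq -c '.[] | {title: .text, href: .href}'
  Reshape each link node into a tidy record, one compact object per line (pup exposes attributes such as href as fields on each node).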
String these together and you have a complete scraper on one line: curl -sL URL | pup 'selector json{}' | jq '...'. For a static, server-rendered page, that is genuinely all you need — and it's hard to beat for speed of iteration.
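As a concrete instance of that one-liner, here is a hedged sketch that lists every link on a static page as tab-separated text and href. The function name page_links is hypothetical, and it assumes curl, pup, and jq are on your PATH and that you are in bash. The body runs in a subshell so that pipefail, which lets a curl failure such as a DNS error surface as a non-zero exit, does not leak into your interactive session.

```shell
# page_links URL
# Prints "text<TAB>href" for every <a> that has an href.
# Hypothetical helper; assumes curl, pup, and jq are installed.
page_links() (
  set -o pipefail   # subshell body, so this setting stays local
  curl -sL "$1" |
    pup 'a json{}' |
    jq -r '.[] | select(.href != null) | [(.text // ""), .href] | @tsv'
)
```

Because the output is plain tab-separated text, it composes with the usual tools, for example page_links https://example.com | sort -u.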
Where one-off shell scraping breaks down
No JavaScript execution in curl
Curl fetches the HTML the server sends and nothing more. It does not run a browser, so any content rendered client-side — a price injected by React, a list hydrated from a fetch call, an infinite-scroll feed — simply is not in the bytes curl receives. You will see an empty <div id="root"></div> and a pile of script tags. No amount of pup wizardry recovers data that was never in the response. This is the single most common wall, and it has no shell-only workaround.
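Before spending time on selectors, it can help to check whether the response is just an empty application shell. The sketch below is a rough heuristic only: the function name and the list of mount-node ids (root, app, __next) are assumptions based on common framework defaults, and a page can be client-rendered without matching any of them.

```shell
# looks_client_rendered: reads HTML on stdin.
# Exits 0 if it finds an empty mount node such as <div id="root"></div>.
# Heuristic sketch; the id list is an assumption, not an exhaustive rule.
looks_client_rendered() {
  # Collapse newlines so a mount node split across lines still matches.
  tr '\n' ' ' |
    grep -Eq '<div[^>]*id="(root|app|__next)"[^>]*>[[:space:]]*</div>'
}
```

A typical use is curl -sL URL | looks_client_rendered && echo "probably needs a rendering fetch" >&2.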
Fragile selector pipelines on dynamic markup
Even on static pages, CSS selectors are brittle. A class rename, an added wrapper <div>, or an A/B-test variant silently changes div.price to div.price--v2, and your pipeline returns empty without erroring. For a one-off run you notice immediately and fix it; for anything you'd want to trust twice, that fragility is a liability — which is exactly why scheduled work belongs in a different pattern with logging and alerting.
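One cheap guard against that silent-empty failure is a pass-through stage that fails loudly when nothing matched. This is a minimal sketch: require_output is a hypothetical name, and treating a bare [] as empty rests on the assumption that this is what pup prints for json{} with no matches.

```shell
# require_output [LABEL]
# Passes stdin through unchanged, but returns 1 with a message on stderr
# if nothing (or only an empty JSON array) arrived. Hypothetical helper.
require_output() {
  local label=${1:-selector} data
  data=$(cat)
  if [ -z "$data" ] || [ "$data" = "[]" ]; then
    printf 'no match for %s (markup may have changed)\n' "$label" >&2
    return 1
  fi
  printf '%s\n' "$data"
}
```

Placed after the selector stage, as in curl -sL URL | pup 'div.price text{}' | require_output div.price, it turns an empty result into a visible error rather than blank output.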
Anti-bot and pagination beyond a single ad-hoc run
A single curl request from your laptop is usually fine. Start looping over hundreds of pages and you hit rate limits, IP blocks, CAPTCHAs, and pagination logic that a one-liner was never meant to carry. The moment you're writing a while loop with sleep-and-retry around curl, you've outgrown the interactive pattern.
Piping a Firecrawl-compatible API into the pipeline
A single curl POST to /v1/scrape returning markdown
The clean hand-off keeps the shell ergonomics but swaps the fetch step for a managed scrape endpoint that does render JavaScript and returns LLM-ready markdown instead of raw HTML you have to parse. fastCRW exposes a Firecrawl-compatible REST API, so it's a drop-in after a base-URL swap — and it's still just curl:
curl -s https://api.fastcrw.com/v1/scrape -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" -d '{"url":"https://example.com","formats":["markdown"]}'
You get back JSON with a markdown field — clean prose, no <div> soup, no broken selectors. Because the response is JSON, the rest of your pipeline doesn't change: pipe it into jq. For the broader "HTML page in, clean text out" framing, see website to markdown.
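If you make this call more than once, hand-quoting the JSON body gets error-prone, especially for URLs that contain quotes or backslashes. The sketch below wraps the POST shown above and builds the body with jq instead. The function name crw_markdown and the CRW_BASE_URL override are assumptions introduced for illustration; $KEY is the same variable used in the command above.

```shell
# crw_markdown URL
# Hypothetical wrapper around the /v1/scrape POST; prints the raw JSON response.
# Assumes $KEY holds the API key. CRW_BASE_URL is an invented override variable.
crw_markdown() {
  local body
  # jq -n builds JSON from scratch; --arg escapes the URL safely.
  body=$(jq -cn --arg url "$1" '{url: $url, formats: ["markdown"]}') || return
  curl -s "${CRW_BASE_URL:-https://api.fastcrw.com}/v1/scrape" \
    -H "Authorization: Bearer $KEY" \
    -H "Content-Type: application/json" \
    -d "$body"
}
```

The output is still JSON on stdout, so crw_markdown https://example.com | jq -r '.data.markdown' reads the same as the hand-written version.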
Filtering structured JSON output with jq
Pull the markdown straight out, or go one step further and ask the API for structured fields instead of markdown by sending "formats":["json"] with a jsonSchema — that returns typed records you filter with jq exactly as before:
- curl -s .../v1/scrape -d '{"url":"...","formats":["markdown"]}' | jq -r '.data.markdown'
- curl -s .../v1/scrape -d '{"url":"...","formats":["json"],"jsonSchema":{...}}' | jq '.data.json'
Note that a request with formats: ["json"] is a 5-credit operation versus 1 credit for a plain scrape, so reach for it only when you actually want structured extraction rather than markdown.
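One detail worth knowing when you filter the response: a plain jq -r '.data.markdown' prints the literal string null and exits 0 if the field is missing. A small hedged variant, using a hypothetical function name, makes that case fail instead. It relies on jq's -e flag, which derives the exit status from the filter's output, and on the // empty alternative, which produces no output for a null value.

```shell
# markdown_or_fail: reads a scrape response on stdin, prints .data.markdown.
# Exits non-zero when the field is missing or null instead of printing "null".
# Hypothetical helper name.
markdown_or_fail() {
  jq -er '.data.markdown // empty'
}
```

Used as curl -s ... | markdown_or_fail > page.md || echo "no markdown in response" >&2, it keeps a failed request from passing as a successful one.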
The hand-off boundary: interactive vs scheduled vs at-scale
The decision tree is simple. One-off and static? Stay in curl | pup | jq. One-off but JavaScript-heavy or anti-bot-blocked? Swap the fetch step for a single /v1/scrape POST and keep piping into jq. Need it on a schedule with locking, retries, and logging? That's the bash + cron pattern, not this one.
Interactive shell scraping vs a managed scrape endpoint
Where the line is for one-off work
For genuinely ad-hoc, static-page extraction, the pure shell stack is the right answer — zero setup beats everything. The API earns its place the moment the page needs a browser to render or the target actively blocks plain curl. You don't have to choose one globally; you choose per page, and both paths are just curl into jq.
Self-host the binary for a fully local CLI pipeline
If you want the rendering muscle without a cloud round-trip, you can self-host the engine and point curl at localhost. fastCRW's engine is a single ~8 MB static Rust binary that runs in one container (per the README's structural facts), so a local instance is one docker run away and your entire pipeline stays on your machine, with no Python and no external dependency. The engine is licensed under AGPL-3.0. See self-host with Docker Compose for the setup.
Honest gaps: stateless, no screenshot output
Two limits worth stating plainly so they don't surprise you mid-pipeline. fastCRW is stateless per request — there is no persistent session, so multi-step authenticated flows that depend on carried state aren't a fit for a single scrape call. And there is no screenshot output: a request for formats: ["screenshot"] returns HTTP 422. If your one-off job needs a rendered image of the page, the shell-plus-API path won't give it to you, and that's a real gap rather than something to route around.
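Because curl -s exits 0 even when the server answers with an error status such as the 422 described above, it can be worth checking the status code before piping the body onward. The sketch below shows one way, using curl's -w '%{http_code}' write-out variable. The function name and the message wording are hypothetical, and it assumes $KEY is set.

```shell
# scrape_checked JSON_BODY
# Prints the response body on HTTP 2xx; otherwise reports the status on
# stderr and returns 1. Hypothetical helper; assumes $KEY is set.
scrape_checked() {
  local tmp status
  tmp=$(mktemp) || return
  # -o sends the body to a temp file so stdout carries only the status code.
  status=$(curl -s -o "$tmp" -w '%{http_code}' \
    https://api.fastcrw.com/v1/scrape \
    -H "Authorization: Bearer $KEY" \
    -H "Content-Type: application/json" \
    -d "$1")
  case $status in
    2??) cat "$tmp"; rm -f "$tmp" ;;
    *)   printf 'scrape failed: HTTP %s\n' "$status" >&2
         rm -f "$tmp"; return 1 ;;
  esac
}
```

With this in place, a request for an unsupported format stops the pipeline with a visible status line rather than handing an error payload to jq.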
Sources
- fastCRW canonical fact sheet (Firecrawl-compatible REST, endpoint surface, structural footprint, honest gaps): github.com/us/crw
- pup HTML parser: github.com/ericchiang/pup · htmlq: github.com/mgdm/htmlq
- jq manual: jqlang.github.io/jq/manual · curl docs: curl.se/docs
Related: curl web scraping guide · bash + cron scheduled scraping · website to markdown · Firecrawl API compatibility
