Convert Website to LLM Data: The Pipeline Pattern

The pipeline pattern to convert a webpage into LLM-ready data: map, crawl, scrape to Markdown, extract JSON. One Firecrawl-compatible API, self-hostable for $0.

By Recep · July 2, 2026 · 9 min read · Last updated: June 2, 2026

By the fastCRW team · Benchmark figures from diagnose_3way.py on Firecrawl's public labeled dataset, verified 2026-05-18 · Verify independently before quoting internally.

Disclosure: We build fastCRW. This is a vendor-authored engineering guide, so weight it accordingly — the pipeline pattern below works against any Firecrawl-compatible engine, and we name the honest limits up front rather than burying them.

Convert a website to LLM data: what the job actually requires

Converting a website to LLM data is not one operation; it is a short pipeline. The phrase "LLM-ready data" gets thrown around as if it means a single magic format, but in practice it means two distinct shapes feeding two distinct needs. Get the shapes wrong and your retrieval quality collapses no matter how good the model is. So before any code, be precise about what you are producing and why.

Clean Markdown for context, JSON for fields

There are two output formats worth caring about. Markdown is what you want when the unit of work is "give the model the readable content of this page" — it preserves headings, lists, tables, and code blocks while dropping nav chrome, scripts, and styling that waste context-window tokens. JSON is what you want when you need specific typed fields — a price, a title, a list of specs — and a brittle CSS selector won't survive the next redesign. Most real ingestion layers use both: Markdown to chunk and embed for retrieval, JSON to populate structured records your application queries directly.

Why boilerplate and stale data poison retrieval

Retrieval-augmented generation is "garbage in, garbage out" with extra steps. If your extraction leaves a cookie banner, a mega-menu, and three "related articles" widgets in the page text, those tokens get chunked, embedded, and retrieved right alongside the content that matters — diluting every chunk and surfacing noise as if it were signal. Stale data is the second poison: a knowledge base that was accurate at crawl time drifts as the source site changes, and a confident model grounded on a stale chunk hallucinates with authority. Clean input and a freshness story are not polish; they are the load-bearing parts of the pipeline.

The four primitives

Everything below composes from four API primitives. You discover URLs with /v1/map, collect pages with /v1/crawl, convert a page to Markdown with /v1/scrape, and pull typed fields with the same scrape call plus formats: ["json"]. That is the whole surface. There is no fifth magic endpoint, and — stated plainly — no managed agent that does the orchestration for you. You compose the loop. That is a feature for an ingestion layer you have to debug at 2am, not a bug.

The pipeline pattern, end to end

Here is the pattern most LLM data pipelines converge on, with the endpoint that does each step. Treat it as a template, not a framework — the point is that each stage is an independent, idempotent API call you can retry, cache, and reason about on its own.

| Stage | Endpoint | Output | Credits |
| --- | --- | --- | --- |
| 1. Discover | POST /v1/map | Every URL on the site | 1 |
| 2. Collect | POST /v1/crawl | Async BFS over pages (job ID) | 1 / page |
| 3. Convert | POST /v1/scrape | Clean Markdown | 1 |
| 4. Extract | /v1/scrape + formats:["json"] | Typed JSON records | 5 |

Discover with /v1/map

Start by mapping the URL space. POST /v1/map returns the discoverable URLs on a site for 1 credit — far cheaper than crawling blind, because you can filter the list down to the paths you actually want before you spend a per-page credit on anything. For a docs ingestion job, this is where you drop changelog and search pages; for a catalog, this is where you keep only product URLs.
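The filtering step can be sketched in a few lines. This is a minimal illustration, assuming the map call has already given you a plain list of URL strings; the excluded prefixes and the example URLs are hypothetical, not part of the API.

```python
from urllib.parse import urlparse

# Hypothetical path prefixes to drop for a docs ingestion job.
EXCLUDED_PREFIXES = ("/changelog", "/search")


def filter_mapped_urls(urls, excluded_prefixes=EXCLUDED_PREFIXES):
    """Drop URLs under an excluded path prefix, remove duplicates, keep order."""
    seen = set()
    kept = []
    for url in urls:
        path = urlparse(url).path or "/"
        # Match the prefix itself or anything nested below it,
        # so "/search" is dropped but "/searching-tips" is not.
        if any(path == p or path.startswith(p + "/") for p in excluded_prefixes):
            continue
        if url in seen:
            continue
        seen.add(url)
        kept.append(url)
    return kept


mapped = [
    "https://example.com/docs/install",
    "https://example.com/changelog/2026-05",
    "https://example.com/search",
    "https://example.com/docs/install",
    "https://example.com/docs/api",
]
print(filter_mapped_urls(mapped))
```

Filtering before the crawl means every per-page credit is spent on a URL you have already decided to keep.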

Collect with /v1/crawl

For multi-page jobs, POST /v1/crawl starts an asynchronous breadth-first crawl and immediately returns a job ID; you poll GET /v1/crawl/:id for status and results. Crawl is 1 credit per page (any renderer, including Chrome), and it accepts maxDepth (capped at 10) and maxPages (capped at 1000), with limit and max_pages as accepted aliases. Set both caps explicitly — an uncapped crawl is how you wake up to a surprise credit bill.
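A sketch of the start-then-poll flow, under stated assumptions: the maxDepth and maxPages names and their caps come from the text above, while the "status" field and its "completed" and "failed" values are assumed from the Firecrawl-style response shape and should be checked against your engine. The HTTP call is injected as a callable (fetch_status, a hypothetical name) so the loop can be exercised without a network.

```python
import time

MAX_DEPTH_CAP = 10    # cap stated in the text
MAX_PAGES_CAP = 1000  # cap stated in the text


def build_crawl_body(url, max_depth, max_pages):
    """Build a /v1/crawl body with both caps set explicitly and clamped."""
    return {
        "url": url,
        "maxDepth": min(max_depth, MAX_DEPTH_CAP),
        "maxPages": min(max_pages, MAX_PAGES_CAP),
    }


def poll_crawl(fetch_status, job_id, interval_s=2.0, max_polls=150, sleep=time.sleep):
    """Poll until the job reaches a terminal status or max_polls is exhausted.

    fetch_status is any callable that performs GET /v1/crawl/:id and returns
    the parsed JSON. The "status" values checked here are assumptions.
    """
    for _ in range(max_polls):
        result = fetch_status(job_id)
        if result.get("status") in ("completed", "failed"):
            return result
        sleep(interval_s)
    raise TimeoutError(f"crawl {job_id} did not finish after {max_polls} polls")


# Demo with a stub standing in for the real status endpoint.
_states = iter([{"status": "scraping"}, {"status": "completed", "data": []}])
final = poll_crawl(lambda job_id: next(_states), "job-123", sleep=lambda s: None)
print(build_crawl_body("https://example.com", 3, 200), final["status"])
```

Bounding the poll count as well as the crawl caps keeps a stuck job from hanging the ingestion run indefinitely.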

Convert and extract with /v1/scrape

For a single URL, POST /v1/scrape returns clean Markdown by default for 1 credit. When you need fields rather than prose, pass formats: ["json"] with a jsonSchema describing the shape you want; the engine fills it from the page content. JSON extraction is 5 credits per request because there is an LLM in the loop. The two minimal calls:

  • Markdown: POST /v1/scrape with { "url": "https://example.com/page" } → Markdown string.
  • JSON: the same call with { "url": "...", "formats": ["json"], "jsonSchema": { "title": "string", "price": "number" } } → a typed record.
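The two request bodies above differ only in the optional extraction fields, which a small helper makes explicit. This is a sketch: scrape_body is a hypothetical helper name, and the schema is copied from the example above rather than from a full schema reference.

```python
import json


def scrape_body(url, schema=None):
    """Return the JSON body for POST /v1/scrape.

    Without a schema the body requests the default Markdown output;
    with a schema it requests JSON extraction.
    """
    body = {"url": url}
    if schema is not None:
        body["formats"] = ["json"]
        body["jsonSchema"] = schema
    return body


markdown_request = scrape_body("https://example.com/page")
json_request = scrape_body(
    "https://example.com/page",
    schema={"title": "string", "price": "number"},  # shape taken from the text
)
print(json.dumps(markdown_request))
print(json.dumps(json_request, indent=2))
```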

Because the API is Firecrawl-compatible, the official Firecrawl SDK drives all of this after a single base-URL swap — you do not rewrite client code to adopt the pattern.

Why extraction accuracy is the foundation

Every stage downstream of the scrape inherits the scrape's accuracy. If the converter misses the main content or includes the wrong region of the page, your chunks are wrong, your embeddings are wrong, and your model grounds on the wrong text. So the single most important number for an LLM data pipeline is not latency — it is how often the extractor recovers the true content.

Highest truth-recall of the three tools tested

On Firecrawl's own public labeled dataset, fastCRW recorded the highest truth-recall of the three tools tested: 63.74% of 819 labeled URLs (522 recovered), versus 59.95% for Crawl4AI and 56.04% for Firecrawl (diagnose_3way.py, single 3,000-request run, 2026-05-08). It paired that with ~92% scrape success of reachable URLs and 0 thrown errors over the run — and recovered 34 URLs that neither Crawl4AI nor Firecrawl reached, 70% more than the other two combined. Higher recall means more pages contribute real content to your index instead of silently dropping out — and the same input feeds cleaner chunks and fewer hallucinations downstream.
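For readers checking the headline figure, truth-recall here is simply recovered URLs divided by labeled URLs. The short calculation below uses only the two counts quoted above.

```python
labeled_urls = 819  # labeled URLs in the dataset, from the text
recovered = 522     # URLs fastCRW recovered, from the text

recall_pct = round(100 * recovered / labeled_urls, 2)
missed = labeled_urls - recovered

print(recall_pct)  # 63.74
print(missed)      # 297
```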

Latency: fastest median, lowest fast-mode tail

fastCRW has the lowest median latency of the three tools tested: p50 is 1914 ms, ahead of Firecrawl's 2305 ms. In fast mode, its p90 is 4348 ms, also the lowest of the three (Crawl4AI 4754 ms, Firecrawl 6937 ms). The chrome-stealth fallback that recovers the hard URLs the others miss is the same mechanism behind the accuracy lead. For a batch ingestion pipeline that runs ahead of query time, the fast-mode tail is well within budget; if you need a page in an inline, latency-sensitive call, set per-call timeouts informed by the p90. See scraping latency explained for how to plan around the tail, and /benchmarks for the full numbers.
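One way to turn the p90 into a per-call timeout is to add a fixed safety margin on top of it. The sketch below assumes a 1.5x headroom factor, which is an illustrative choice rather than a recommendation from the benchmark; tune it against your own latency budget.

```python
P90_FAST_MODE_MS = 4348  # fast-mode p90 reported in the text


def timeout_seconds(p90_ms, headroom=1.5):
    """Convert a p90 latency in milliseconds into a timeout in seconds.

    headroom is a hypothetical safety multiplier, not a measured value.
    """
    return round(p90_ms * headroom / 1000, 2)


print(timeout_seconds(P90_FAST_MODE_MS))  # 6.52
```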

Keeping the pipeline portable and private

An ingestion layer is infrastructure you live with for years. Two properties matter more than they look at prototype time: portability (can you change backends without a rewrite?) and privacy (does your scraped data leave your network?).

Firecrawl-compatible: base-URL swap, keep your code

fastCRW implements a Firecrawl-compatible REST surface, so it is a drop-in after a base-URL swap. Write your client against the compatible surface and your backend becomes a runtime config value rather than a fork in your codebase. The honest caveat: a few field names and the error envelope diverge slightly, and cloud-only Firecrawl specialties have no equivalent here — validate the short known list before cutover, as covered in the full Firecrawl vs fastCRW comparison.
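Treating the backend as a runtime config value can look like the sketch below. The environment variable names, the localhost default, and the Bearer authorization header are all assumptions for illustration (the header follows the Firecrawl convention); the function only builds the request and does not send it.

```python
import json
import os
import urllib.request

# Hypothetical variable names and default; adjust to your deployment.
BASE_URL = os.environ.get("SCRAPER_BASE_URL", "http://localhost:3000")
API_KEY = os.environ.get("SCRAPER_API_KEY", "")


def build_request(path, body, base_url=BASE_URL, api_key=API_KEY):
    """Build, but do not send, a POST request against the configured backend."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Bearer auth is assumed here; confirm it for your engine.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_request("/v1/scrape", {"url": "https://example.com/page"})
print(req.get_method(), req.full_url)
```

Switching backends then means changing one environment variable, while the paths and request bodies stay the same.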

Self-host so web data never leaves your infra

The fastCRW engine is a single static Rust binary — roughly a 6 MB binary in an ~8 MB Docker image, one container — released under AGPL-3.0. Self-hosting is free; you pay only for your own server. For regulated or sensitive workloads, that is not a cost preference, it is a gating constraint: in self-host mode the scraped content and the target URLs never leave your infrastructure. There is no platform-team project to stand up a five-service stack; it is one docker run.

Costs and the honest limits

Per-operation credit costs make the pipeline easy to budget: map is 1 credit, crawl is 1 credit per page, a Markdown scrape is 1 credit, and a JSON extraction is 5 credits. For live tier pricing and credit grants, see /pricing rather than a number hard-coded here.
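Those per-operation costs translate directly into a budget estimate. The sketch below uses the four credit values stated above; the job size in the example (200 crawled pages, 50 JSON extractions) is hypothetical.

```python
# Per-operation credit costs as stated in the text.
CREDITS = {"map": 1, "crawl_page": 1, "markdown_scrape": 1, "json_extract": 5}


def estimate_credits(pages_crawled, json_extractions, map_calls=1, markdown_scrapes=0):
    """Sum the credits for one ingestion run."""
    return (
        map_calls * CREDITS["map"]
        + pages_crawled * CREDITS["crawl_page"]
        + markdown_scrapes * CREDITS["markdown_scrape"]
        + json_extractions * CREDITS["json_extract"]
    )


# 1 map call + 200 crawled pages + 50 JSON extractions at 5 credits each.
print(estimate_credits(pages_crawled=200, json_extractions=50))  # 451
```

JSON extraction dominates quickly: in this example 50 extractions cost more than the 200 crawled pages, so it pays to extract only from pages that actually carry the fields you need.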

Now the limits, stated plainly so you design around them instead of discovering them in production:

  • No multi-URL batch extract. There is no /v1/batch/scrape and no multi-URL /v1/extract. For many pages you iterate /v1/scrape concurrently, or crawl first and extract per page. You compose the loop.
  • No managed agent. There is no /v1/agent (Spark) and no /v1/deep-research endpoint to orchestrate the pipeline for you.
  • LLM extraction is OpenAI and Anthropic only. JSON extraction supports those two providers; if you need another for extraction, this is a gap.
  • Stateless per request. No persistent session or state between calls, and no built-in anti-bot / Fire-engine — heavily protected sites may need a proxy layer.
  • No screenshot output. A request for formats: ["screenshot"] returns HTTP 422.
  • robots.txt is respected by default. Override it only where you have the legal right to do so.

None of these block the core convert-website-to-LLM-data pattern — map, crawl, scrape, extract — which is exactly the surface this pipeline needs. They block adjacent ambitions, and it is better you know that before you build than after.
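Since there is no batch endpoint, the "iterate /v1/scrape concurrently" item from the limits list can be sketched with a thread pool. The per-URL function is injected (scrape_one is a hypothetical name for whatever wraps your POST /v1/scrape call), and the worker count of 8 is an arbitrary illustrative default.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_many(urls, scrape_one, max_workers=8):
    """Run scrape_one(url) concurrently; return (results, errors) keyed by URL."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {url: pool.submit(scrape_one, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep going: one failed page should not abort the batch.
                errors[url] = exc
    return results, errors


def fake_scrape(url):
    """Stub standing in for a real scrape call, used only for this demo."""
    if url.endswith("/broken"):
        raise RuntimeError("scrape failed")
    return f"# Markdown for {url}"


results, errors = scrape_many(
    ["https://example.com/a", "https://example.com/broken", "https://example.com/b"],
    fake_scrape,
)
print(sorted(results), sorted(errors))
```

Collecting failures separately keeps each URL independently retryable, which matches the one-call-per-stage design of the pipeline.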

Where a fuller platform genuinely wins

If your pipeline depends on a managed research agent, batched multi-URL extraction in a single call, heavy cloud anti-bot, or extraction across providers beyond OpenAI/Anthropic, a broader hosted platform like Firecrawl's cloud is the better fit today, and we say so plainly. fastCRW's argument is the opposite shape: the four composable primitives, the highest measured truth-recall of the three tools tested, and an engine you can self-host for $0 so the data never leaves your infra. Pick by which of those your ingestion layer actually needs.

Sources

  • Scrape benchmark (truth-recall, latency split): diagnose_3way.py on Firecrawl's public scrape-content-dataset-v1, 819 labeled URLs, 2026-05-08 — see /benchmarks.
  • API surface, credits, and limits: github.com/us/crw (open-core README, AGPL-3.0).
  • Live pricing and credit grants: /pricing.

Related: LLM-ready Markdown extraction · RAG pipeline with fastCRW · Website to JSON extraction · HTML to Markdown for LLMs

Frequently asked questions

How do I convert a website into LLM-ready data?
Use a four-stage pipeline over a Firecrawl-compatible API: discover URLs with /v1/map (1 credit), collect pages with an async /v1/crawl job (1 credit per page), convert each page to clean Markdown with /v1/scrape (1 credit), and pull typed fields with the same scrape call plus formats: ["json"] and a jsonSchema (5 credits). There is no managed agent: you compose the loop, which keeps each stage independently retryable and debuggable.
What output format is best for feeding an LLM — Markdown or JSON?
Both, for different jobs. Markdown is best for retrieval context: it preserves headings, lists, tables, and code while dropping nav and script boilerplate that wastes context-window tokens, so you chunk and embed it. JSON is best when you need specific typed fields (price, title, specs) for structured records your app queries directly. Most ingestion layers produce Markdown for RAG and JSON for the fields they index.
Does extraction accuracy affect RAG quality?
Directly. Every downstream stage inherits the scrape's accuracy — bad extraction produces bad chunks, bad embeddings, and confident hallucinations grounded on the wrong text. On Firecrawl's public labeled dataset, fastCRW recorded the highest truth-recall of the three tools tested, 63.74% of 819 labeled URLs, with ~92% scrape success of reachable URLs and 0 errors (diagnose_3way.py, 2026-05-08). Higher recall means more pages contribute real content instead of silently dropping out.
Can I keep scraped web data on my own infrastructure?
Yes. The fastCRW engine is a single static Rust binary (~8 MB image, one container) released under AGPL-3.0, and self-hosting is free — you pay only for your own server. In self-host mode the scraped content and target URLs never leave your network, which for regulated or sensitive workloads is a gating requirement rather than a preference. The same Firecrawl-compatible API runs locally or in the managed cloud.
What are the limits of fastCRW for an LLM data pipeline?
The honest gaps: there is no multi-URL batch extract (iterate /v1/scrape concurrently or crawl first), no managed /v1/agent or /v1/deep-research, LLM extraction supports OpenAI and Anthropic only, requests are stateless with no built-in anti-bot, screenshot output returns HTTP 422, and robots.txt is respected by default. None of these block the core map/crawl/scrape/extract pattern; they block adjacent ambitions, so design around them up front.
