Build a PDF + Web RAG Pipeline (Honest Guide)

Build a RAG pipeline that ingests both PDFs and web pages: scrape sites to clean markdown, parse PDFs separately, unify, chunk, and embed. Full tutorial.

By Recep · July 3, 2026 · 9 min read · Last updated: June 2, 2026

By the fastCRW team · Benchmarks and capability claims verified 2026-05-18 · Verify independently before relying on them.

The mixed-source ingestion problem in a PDF + web RAG pipeline

If you want to build a PDF RAG pipeline that also answers from live web pages, the first thing to accept is that PDFs and web pages are two different ingestion problems wearing the same "document" label. A web page is HTML you can fetch and convert to clean markdown. An uploaded PDF is a binary blob whose text — and worse, whose tables — lives in a layout model, not in a DOM. The naive instinct is to find one tool that "does both." That instinct is what produces brittle pipelines.

This guide takes the honest path instead: use a web-data engine for the web half, use a dedicated PDF parser for the PDF half, and unify them downstream into one index. We build with fastCRW for the web ingestion leg because it is a Firecrawl-compatible REST engine that returns LLM-ready markdown, and we are upfront about exactly where its job stops.

What fastCRW handles, and what it does not

fastCRW handles the web half well. On Firecrawl's own public scrape-content-dataset-v1 (the 819 labeled URLs that carry ground truth, diagnose_3way.py, 2026-05-08) it had the highest truth-recall of the three tools tested — 63.74% — ahead of Crawl4AI (59.95%) and Firecrawl (56.04%). Higher recall means more of the real page content survives into your markdown, which is exactly what a RAG chunker wants.

What fastCRW does not do, and we will not pretend otherwise: there is no /parse document-upload endpoint — you cannot POST a PDF file to fastCRW and get text back. There is also no screenshot output (a request for formats: ["screenshot"] returns HTTP 422), so scanned/image-only PDFs are not its problem to solve either. That is why this architecture pairs fastCRW with a separate PDF parser rather than overclaiming a single endpoint. (Note: fastCRW can scrape a PDF that is already published at a public URL via its renderer, but that is a different thing from a robust local parser that preserves tables and page numbers — for an upload flow you want the dedicated parser.)

Step 1: Ingest web sources to clean markdown

The web leg uses two endpoints: /v1/scrape for a single known URL and /v1/crawl for a whole site or section. Both return markdown, which chunks far more stably than raw HTML because there is no markup noise to fragment a sentence mid-tag.

Because the API is Firecrawl-compatible, this is a drop-in if you already use the Firecrawl SDK — change the base URL and your existing code points at fastCRW (managed cloud or your own self-hosted binary).

  • Scrape one URL: POST /v1/scrape with { "url": "...", "formats": ["markdown"] } — 1 credit on any renderer (http, lightpanda, or chrome).
  • Crawl a section: POST /v1/crawl with maxDepth and maxPages (caps: depth 10, pages 1000). Crawl bills 1 credit per page. The call returns a job ID; poll GET /v1/crawl/:id for results.
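The crawl flow above (start a job, then poll by ID) could be sketched roughly as follows. This is a minimal stdlib-only sketch, not the official client: `BASE_URL`, the `"id"` key on the job response, and the `"status"` / `"data"` fields on the poll response are assumptions based on the Firecrawl-compatible shape, so check them against your instance. Add whatever auth header your deployment expects.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:3000"  # assumption: wherever your fastCRW instance listens


def http_json(method, url, payload=None):
    """Send a JSON request with the stdlib and decode the JSON response."""
    body = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method=method
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


def build_crawl_payload(url, max_depth=2, max_pages=50):
    """Clamp to the caps quoted above: depth 10, pages 1000."""
    return {"url": url, "maxDepth": min(max_depth, 10), "maxPages": min(max_pages, 1000)}


def crawl(url, max_depth=2, max_pages=50, send=http_json, poll_seconds=2.0, max_polls=150):
    """Start a crawl job and poll until it completes; returns the list of pages."""
    job = send("POST", f"{BASE_URL}/v1/crawl", build_crawl_payload(url, max_depth, max_pages))
    job_id = job["id"]  # assumption: the job ID comes back under "id"
    for _ in range(max_polls):
        status = send("GET", f"{BASE_URL}/v1/crawl/{job_id}")
        state = status.get("status")  # assumption: Firecrawl-style status field
        if state == "completed":
            return status.get("data", [])
        if state == "failed":
            raise RuntimeError(f"crawl {job_id} failed")
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawl {job_id} did not finish after {max_polls} polls")
```

The `send` parameter exists so the polling logic can be exercised with a fake transport before pointing it at a real server.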

A minimal Python ingestion step, written against the Firecrawl-compatible shape:

  • client.scrape_url(url, params={"formats": ["markdown"]}) → take result["markdown"].
  • Tag every web chunk with its source URL and a source_type: "web" field so retrieval can cite it back later.
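If you call the REST endpoint directly instead of going through an SDK, the two bullets above might look like the sketch below. `BASE_URL` and the `"data"` wrapper around the page are assumptions (the Firecrawl v1 REST shape); the function also accepts an SDK-style result with `markdown` at the top level. `scrape_to_record` is a hypothetical helper name, not part of any library.

```python
import json
import urllib.request
from datetime import datetime, timezone

BASE_URL = "http://localhost:3000"  # assumption: your fastCRW base URL


def post_json(url, payload):
    """POST a JSON body and decode the JSON response (stdlib only)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


def scrape_to_record(url, post=post_json):
    """Scrape one URL to markdown and tag it as a web source."""
    response = post(f"{BASE_URL}/v1/scrape", {"url": url, "formats": ["markdown"]})
    # Assumption: raw REST responses wrap the page under "data";
    # fall back to the top level for an SDK-style result.
    page = response.get("data", response)
    markdown = page.get("markdown")
    if not markdown:
        raise ValueError(f"no markdown returned for {url}")
    return {
        "text": markdown,
        "source_type": "web",
        "source_ref": url,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
```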

Clean markdown is the whole point of doing the web leg this way: a chunk that reads as prose embeds more meaningfully than a chunk littered with div scaffolding, and your retriever returns fewer junk neighbors.

Step 2: Parse PDFs with a dedicated parser

fastCRW deliberately does not own this step, so own it yourself with a tool built for layout. The split is clean: web text comes from fastCRW, document text comes from a PDF parser, and both normalize to markdown before they meet.

Why there is no fastCRW /parse endpoint

PDF parsing is a different engineering problem — text extraction, table reconstruction, reading-order recovery, and (for scans) OCR. Folding that into a web-scraping engine would make the single ~8 MB binary something it is not. Keeping it out is the honest design choice, and it means you pick the parser that actually fits your documents.

Choosing a PDF parser and normalizing to markdown

  • Text-native PDFs (most reports, exports): a library-level parser such as PyMuPDF or pdfplumber pulls text and basic table structure fast and offline.
  • Tables and complex layout: reach for a layout-aware parser (Docling, Unstructured, or a hosted document API) that reconstructs tables into markdown tables rather than collapsing them into a wall of numbers.
  • Scanned / image-only PDFs: you need OCR (Tesseract or a hosted OCR API). fastCRW has no screenshot or image path here, so this is firmly the parser's job.

Whatever you choose, emit the same shape you emit for web pages: markdown text plus metadata. Tag each PDF chunk with source_type: "pdf", the file name, and — critically — the page number the text came from, because that is your citation anchor in Step 4.
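For the text-native case, a minimal sketch with PyMuPDF might look like this. It assumes PyMuPDF is installed (imported as `fitz`) and emits plain page text rather than reconstructed markdown tables, so treat it as a starting point for simple documents only. The record fields mirror the web leg; `pages_to_records` is a hypothetical helper.

```python
import os


def extract_page_texts(pdf_path):
    """Yield (page_number, text) pairs with 1-based page numbers."""
    import fitz  # PyMuPDF; imported lazily so the tagging helper runs without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            yield page.number + 1, page.get_text()


def pages_to_records(pages, file_name):
    """Turn (page_number, text) pairs into tagged records, one per page."""
    records = []
    for page_number, text in pages:
        text = text.strip()
        if not text:
            # An empty text layer usually means a scanned page: route it to OCR.
            continue
        records.append(
            {
                "text": text,
                "source_type": "pdf",
                "source_ref": file_name,
                "locator": page_number,  # citation anchor used in Step 4
            }
        )
    return records


def parse_pdf(pdf_path):
    """Convenience wrapper: parse a file and tag every page that has text."""
    return pages_to_records(extract_page_texts(pdf_path), os.path.basename(pdf_path))
```

Keeping one record per page before chunking means the page number survives no matter how the text is split later.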

Step 3: Unify, chunk, and embed

The merge happens here, and it is simpler than people expect once both sources are markdown. You run one chunking strategy across both, not two.

A single chunking strategy across sources

Resist the urge to chunk PDFs and web pages differently. A divergent strategy fragments your retrieval quality because identical concepts end up with different granularity depending on where they came from. Pick one approach — recursive character splitting with a sensible overlap, or structure-aware splitting on markdown headings — and apply it uniformly. Markdown from both legs makes heading-aware splitting genuinely viable, since fastCRW preserves heading structure and a good PDF parser reconstructs it. See best chunking strategies for RAG for sizing trade-offs.
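One way to sketch the heading-aware option: split on markdown headings first, then fall back to fixed-size windows with overlap for any section that is still too long. The `max_chars` and `overlap` defaults are illustrative assumptions, not recommendations, and the same function is meant to be applied to both web and PDF markdown.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")
FENCE = "`" * 3  # markdown code-fence marker


def split_on_headings(markdown):
    """Split markdown into sections, each starting at a heading line."""
    sections, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        if HEADING.match(line) and not in_fence and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def chunk_markdown(markdown, max_chars=1200, overlap=150):
    """Heading-aware chunks; oversized sections get overlapping windows."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    chunks = []
    step = max_chars - overlap
    for section in split_on_headings(markdown):
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        for start in range(0, len(section), step):
            chunks.append(section[start:start + max_chars])
            if start + max_chars >= len(section):
                break
    return chunks
```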

Tagging chunks by source type

Every chunk carries metadata before it is embedded:

  • source_type: "web" or "pdf"
  • source_ref: the URL (web) or file name (pdf)
  • locator: an anchor/section (web) or page number (pdf)
  • ingested_at: timestamp, for freshness logic later

Embed all chunks with the same model and write them to one vector store. The metadata is what lets a single retriever return the most relevant chunk regardless of origin — and lets you filter (e.g. "only PDFs" or "only docs from this domain") when a query calls for it.
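The metadata fields listed above could be carried in a small record type like the one sketched here. The `Chunk` class and `filter_chunks` helper are hypothetical names for illustration; a real vector store would apply the same filters through its own metadata query syntax.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.parse import urlparse


@dataclass(frozen=True)
class Chunk:
    text: str
    source_type: str  # "web" or "pdf"
    source_ref: str   # URL (web) or file name (pdf)
    locator: str      # anchor/section (web) or page number (pdf)
    ingested_at: str  # ISO 8601 timestamp, for freshness logic


def make_chunk(text, source_type, source_ref, locator):
    """Build a chunk with validated source type and a UTC timestamp."""
    if source_type not in ("web", "pdf"):
        raise ValueError(f"unknown source_type: {source_type!r}")
    now = datetime.now(timezone.utc).isoformat()
    return Chunk(text, source_type, source_ref, str(locator), now)


def filter_chunks(chunks, source_type=None, domain=None):
    """Keep chunks matching a source type and/or a web domain."""
    selected = []
    for chunk in chunks:
        if source_type and chunk.source_type != source_type:
            continue
        if domain and (
            chunk.source_type != "web" or urlparse(chunk.source_ref).hostname != domain
        ):
            continue
        selected.append(chunk)
    return selected
```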

Step 4: Retrieve and answer

With a unified index, retrieval is ordinary RAG: embed the query, pull the top-k chunks across both source types, and pass them to your LLM with a grounding prompt. The mixed-source twist is entirely in citations.
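As a rough illustration of retrieval over a unified index, the sketch below ranks chunks by cosine similarity in plain Python and builds a grounding prompt. It assumes vectors come from whichever embedding model you chose and that the index is a simple in-memory list of `(vector, chunk)` pairs; a production system would delegate the ranking to its vector store. The prompt wording is only an example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, index, k=3):
    """Return the k most similar chunks, regardless of source type."""
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(question, chunks):
    """Number each chunk and expose its source_ref and locator to the model."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        header = f"[{i}] source_ref={chunk['source_ref']} locator={chunk.get('locator', '')}"
        blocks.append(f"{header}\n{chunk['text']}")
    context = "\n\n".join(blocks)
    return (
        "Answer using only the context below. "
        "Cite the source_ref and locator of every chunk you use.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```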

Citations back to the PDF page or URL

Because you tagged every chunk, the answer can cite precisely:

  • Web chunk → link to the source_ref URL (optionally with the section anchor).
  • PDF chunk → "Filename, p. {locator}" so a reader can open the document to the exact page.

Prompt the model to attach the source_ref and locator of any chunk it used. This is the single highest-leverage thing you can do for trust: a mixed-source answer that says "from the Q3 report, p. 12" next to "from the product docs page" is verifiable, and verifiable answers are the ones people actually act on. For the broader retrieval-quality picture, see our RAG pipeline with CRW walkthrough.
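Rendering the two citation styles above from chunk metadata is a few lines of code. This sketch assumes chunks are dicts with the `source_type`, `source_ref`, and `locator` fields from Step 3, and that a web locator is a URL fragment; `format_citation` is a hypothetical helper.

```python
def format_citation(chunk):
    """Render a human-readable citation from chunk metadata."""
    source_type = chunk["source_type"]
    if source_type == "web":
        locator = chunk.get("locator")
        # Assumption: a web locator is a fragment that can be appended after '#'.
        return f"{chunk['source_ref']}#{locator}" if locator else chunk["source_ref"]
    if source_type == "pdf":
        return f"{chunk['source_ref']}, p. {chunk['locator']}"
    raise ValueError(f"unknown source_type: {source_type!r}")
```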

Keeping the index fresh

The two source types go stale on different clocks, so refresh them on different clocks.

  • Web: re-crawl on a schedule (cron). fastCRW is stateless per request — it does not remember your last crawl — so you own the diff logic: store a content hash per URL and re-embed only the pages that changed. That keeps re-index cost proportional to actual change, not to site size.
  • PDFs: re-ingest on change. PDFs usually arrive as discrete uploads or file-drops, so trigger re-parsing when a file's hash changes rather than on a timer.

Tag the new chunks with a fresh ingested_at and remove the superseded ones so retrieval never serves two versions of the same page.
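The diff logic described above can be sketched with a content hash per URL. This assumes you persist a `{url: hash}` mapping between runs (the storage itself is out of scope here); `plan_refresh` is a hypothetical helper, and the same idea applies to PDF bytes keyed by file name.

```python
import hashlib


def content_hash(markdown):
    """Stable fingerprint of a page's markdown."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def plan_refresh(stored_hashes, fresh_pages):
    """Compare a fresh crawl against stored hashes.

    stored_hashes: {url: hash} from the previous run
    fresh_pages:   {url: markdown} from this run
    Returns (changed_urls, removed_urls): re-embed the first list,
    delete chunks for the second.
    """
    changed = [
        url
        for url, markdown in fresh_pages.items()
        if stored_hashes.get(url) != content_hash(markdown)
    ]
    removed = [url for url in stored_hashes if url not in fresh_pages]
    return changed, removed
```

New URLs show up in `changed` because they have no stored hash, so first-time ingestion and updates share one code path.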

Cost, footprint, and self-hosting the web leg

Honesty about cost is part of the honest-scoping theme. The web leg's credit math is flat and legible: every renderer (http, lightpanda, or chrome) costs 1 credit per scrape, and crawl is 1 credit per page regardless of renderer — no chrome surcharge. There is no JSON-extraction step in this pipeline unless you add one — plain markdown ingestion stays at the 1-credit rate. Check live numbers on /pricing before you budget.
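A back-of-the-envelope budget under the flat rates quoted above (1 credit per scrape, 1 credit per crawled page) could be computed as below. The workload numbers are made up for illustration, and the rates should be re-checked against the live pricing page.

```python
CREDITS_PER_SCRAPE = 1        # per the flat rate quoted above, any renderer
CREDITS_PER_CRAWLED_PAGE = 1  # per the flat rate quoted above


def monthly_credits(single_scrapes_per_run, crawled_pages_per_run, runs_per_month):
    """Credits consumed per month for a recurring refresh job."""
    per_run = (
        single_scrapes_per_run * CREDITS_PER_SCRAPE
        + crawled_pages_per_run * CREDITS_PER_CRAWLED_PAGE
    )
    return per_run * runs_per_month


# Hypothetical workload: 20 one-off pages plus a 400-page docs crawl, refreshed daily.
print(monthly_credits(20, 400, 30))  # 12600
```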

If you would rather not meter the web leg at all, self-host it. fastCRW is AGPL-3.0 and ships as a single ~8 MB binary in one container, so the web ingestion engine self-hosts for $0 (you pay only your own server) and your scraped content never leaves your infrastructure — useful when the documents in your RAG index are sensitive. Your PDF parser runs locally in the same spirit; the dedicated-parser tools above (PyMuPDF, pdfplumber, Docling, Tesseract) are all self-hostable too. For parser selection beyond this post, see best document parsing APIs.

Honest limits, restated

So nobody is surprised in production:

  • No PDF parser in fastCRW. No /parse document-upload endpoint — pair it with a dedicated parser, as above.
  • No screenshot output (HTTP 422), so image-only/scanned PDFs need OCR in the parser leg.
  • Stateless per request. Freshness diffs and the index itself are yours to persist.
  • No multi-URL batch /v1/extract — for many web pages, iterate /v1/scrape concurrently or use /v1/crawl.
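For the last point, iterating /v1/scrape concurrently might be sketched with a thread pool as below. `scrape_one` is any callable you supply that scrapes a single URL (for example, a thin wrapper around POST /v1/scrape); the worker count is an assumption to tune against your own rate limits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_many(urls, scrape_one, max_workers=4):
    """Scrape many URLs concurrently; collect results and errors separately."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one bad URL should not abort the batch
                errors[url] = str(exc)
    return results, errors
```

Returning errors alongside results lets a caller retry only the failed URLs instead of the whole batch.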

Sources

  • fastCRW canonical fact sheet — scrape benchmark (diagnose_3way.py, 819 labeled URLs, 2026-05-08), API surface, footprint, and honest gaps. github.com/us/crw
  • Plan pricing and credit costs: fastcrw.com/pricing (verify launch vs regular pricing at time of reading).
  • Firecrawl docs (API-compatible reference shape): docs.firecrawl.dev (verified 2026-05-18).

Related: RAG pipeline with CRW · Best chunking strategies for RAG · Scrape-to-RAG with LlamaIndex · Best document parsing APIs

Frequently asked questions

How do I build a RAG pipeline over both PDFs and websites?
Treat them as two ingestion legs that meet downstream. Use a web-data engine (fastCRW's /v1/scrape and /v1/crawl) to turn web pages into clean markdown, and a dedicated PDF parser (PyMuPDF, pdfplumber, Docling, or OCR for scans) to turn documents into markdown. Normalize both to the same shape, tag each chunk with its source type and locator, then run one chunking strategy, embed into a single vector store, and retrieve across both.

Can fastCRW parse uploaded PDF files?
No. fastCRW has no /parse document-upload endpoint (you cannot POST a PDF file and get text back) and no screenshot output (formats: ['screenshot'] returns HTTP 422), so scanned PDFs are out of scope too. fastCRW handles the web half (returning LLM-ready markdown); pair it with a dedicated PDF parser for the document half rather than expecting one endpoint to do both.

Should I use one chunking strategy across PDFs and web pages?
Yes. Once both sources are normalized to markdown, apply a single chunking strategy uniformly. Divergent chunking fragments retrieval quality because identical concepts end up with different granularity depending on origin. Heading-aware splitting works well here because fastCRW preserves heading structure and a good PDF parser reconstructs it.

How do I cite back to the PDF page or source URL?
Tag every chunk before embedding with source_type ('web' or 'pdf'), source_ref (the URL or file name), and locator (a section anchor for web, the page number for PDFs). Prompt the LLM to attach the source_ref and locator of any chunk it used, so a web answer links to the URL and a PDF answer reads 'Filename, p. 12'. Verifiable citations are the highest-leverage trust feature in a mixed-source RAG system.

Which PDF parser pairs well with web scraping?
It depends on your documents. For text-native PDFs, PyMuPDF or pdfplumber are fast and offline. For complex tables and layout, a layout-aware parser like Docling or Unstructured reconstructs markdown tables. For scanned/image-only PDFs you need OCR (Tesseract or a hosted OCR API). All are self-hostable, which keeps the document leg local alongside a self-hosted fastCRW web leg.
