By the fastCRW team · Benchmarks and capability claims verified 2026-05-18 · Verify independently before relying on them.
The mixed-source ingestion problem in a PDF + web RAG pipeline
If you want to build a PDF RAG pipeline that also answers from live web pages, the first thing to accept is that PDFs and web pages are two different ingestion problems wearing the same "document" label. A web page is HTML you can fetch and convert to clean markdown. An uploaded PDF is a binary blob whose text — and worse, whose tables — lives in a layout model, not in a DOM. The naive instinct is to find one tool that "does both." That instinct is what produces brittle pipelines.
This guide takes the honest path instead: use a web-data engine for the web half, use a dedicated PDF parser for the PDF half, and unify them downstream into one index. We build with fastCRW for the web ingestion leg because it is a Firecrawl-compatible REST engine that returns LLM-ready markdown, and we are upfront about exactly where its job stops.
What fastCRW handles, and what it does not
fastCRW handles the web half well. On Firecrawl's own public scrape-content-dataset-v1 (the 819 labeled URLs that carry ground truth, diagnose_3way.py, 2026-05-08) it had the highest truth-recall of the three tools tested — 63.74% — ahead of Crawl4AI (59.95%) and Firecrawl (56.04%). Higher recall means more of the real page content survives into your markdown, which is exactly what a RAG chunker wants.
What fastCRW does not do, and we will not pretend otherwise: there is no /parse document-upload endpoint — you cannot POST a PDF file to fastCRW and get text back. There is also no screenshot output (a request for formats: ["screenshot"] returns HTTP 422), so scanned/image-only PDFs are not its problem to solve either. That is why this architecture pairs fastCRW with a separate PDF parser rather than overclaiming a single endpoint. (Note: fastCRW can scrape a PDF that is already published at a public URL via its renderer, but that is a different thing from a robust local parser that preserves tables and page numbers — for an upload flow you want the dedicated parser.)
Step 1: Ingest web sources to clean markdown
The web leg uses two endpoints: /v1/scrape for a single known URL, and /v1/crawl for a whole site or section. Both return markdown, which chunks far more stably than raw HTML because there is no markup noise to fragment a sentence mid-tag.
Because the API is Firecrawl-compatible, this is a drop-in if you already use the Firecrawl SDK — change the base URL and your existing code points at fastCRW (managed cloud or your own self-hosted binary).
- Scrape one URL: POST /v1/scrape with { "url": "...", "formats": ["markdown"] }. This costs 1 credit on any renderer (http, lightpanda, or chrome).
- Crawl a section: POST /v1/crawl with maxDepth and maxPages (caps: depth 10, pages 1000). Crawl bills 1 credit per page. The call returns a job ID; poll GET /v1/crawl/:id for results.
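The submit-then-poll flow for /v1/crawl can be sketched roughly as follows. This is a minimal sketch, not fastCRW's reference client: BASE_URL, the "id" field on the submit response, and the "completed" / "failed" status strings are assumptions borrowed from the Firecrawl v1 response shape, so check them against your deployment before relying on them.

```python
import json
import time
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:3000"  # assumption: replace with your fastCRW endpoint


def http_json(method: str, path: str, body: Optional[dict] = None) -> dict:
    """Send a JSON request to the engine and decode the JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def crawl_request(url: str, max_depth: int = 2, max_pages: int = 100) -> dict:
    """Build the /v1/crawl body, clamped to the caps stated above (10 / 1000)."""
    return {
        "url": url,
        "maxDepth": min(max_depth, 10),
        "maxPages": min(max_pages, 1000),
    }


def wait_for_crawl(job_id, fetch=http_json, interval=2.0, max_polls=150,
                   sleep=time.sleep) -> dict:
    """Poll GET /v1/crawl/:id until the job reports a terminal status."""
    for _ in range(max_polls):
        status = fetch("GET", f"/v1/crawl/{job_id}")
        # Assumed terminal status values (Firecrawl v1 convention).
        if status.get("status") in ("completed", "failed"):
            return status
        sleep(interval)
    raise TimeoutError(f"crawl {job_id} did not finish after {max_polls} polls")


# Usage (needs a running engine):
#   job = http_json("POST", "/v1/crawl", crawl_request("https://example.com/docs"))
#   result = wait_for_crawl(job["id"])  # "id" is an assumed field name
```

Passing fetch and sleep in as parameters keeps the polling loop testable without a network connection.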
A minimal Python ingestion step, written against the Firecrawl-compatible shape:
- client.scrape_url(url, params={"formats": ["markdown"]}) → take result["markdown"].
- Tag every web chunk with its source URL and a source_type: "web" field so retrieval can cite it back later.
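For readers who prefer to see the scrape-and-tag step without an SDK, here is a sketch that calls /v1/scrape directly with the standard library. BASE_URL, the Bearer authorization header, and the data.markdown nesting of the REST reply are assumptions based on the Firecrawl-compatible shape; web_record is a hypothetical helper name.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # assumption: replace with your fastCRW endpoint


def scrape_payload(url: str) -> dict:
    """Request body for POST /v1/scrape, asking for markdown only."""
    return {"url": url, "formats": ["markdown"]}


def extract_markdown(response: dict) -> str:
    """Read markdown from an SDK-style dict or an assumed REST {"data": {...}} reply."""
    body = response.get("data", response)
    return body.get("markdown", "")


def scrape_markdown(url: str, api_key: str = "") -> str:
    """Scrape one URL and return its markdown (performs a network call)."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"  # assumed auth scheme
    req = urllib.request.Request(
        BASE_URL + "/v1/scrape",
        data=json.dumps(scrape_payload(url)).encode("utf-8"),
        method="POST",
        headers=headers,
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return extract_markdown(json.load(resp))


def web_record(url: str, markdown: str) -> dict:
    """Attach the source tags that make later citation possible."""
    return {"text": markdown, "source_type": "web", "source_ref": url}
```

A call such as web_record(url, scrape_markdown(url)) then yields one tagged document ready for chunking.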
Clean markdown is the whole point of doing the web leg this way: a chunk that reads as prose embeds more meaningfully than a chunk littered with div scaffolding, and your retriever returns fewer junk neighbors.
Step 2: Parse PDFs with a dedicated parser
fastCRW deliberately does not own this step, so own it yourself with a tool built for layout. The split is clean: web text comes from fastCRW, document text comes from a PDF parser, and both normalize to markdown before they meet.
Why there is no fastCRW /parse endpoint
PDF parsing is a different engineering problem — text extraction, table reconstruction, reading-order recovery, and (for scans) OCR. Folding that into a web-scraping engine would make the single ~8 MB binary something it is not. Keeping it out is the honest design choice, and it means you pick the parser that actually fits your documents.
Choosing a PDF parser and normalizing to markdown
- Text-native PDFs (most reports, exports): a library-level parser such as PyMuPDF or pdfplumber pulls text and basic table structure fast and offline.
- Tables and complex layout: reach for a layout-aware parser (Docling, Unstructured, or a hosted document API) that reconstructs tables into markdown tables rather than collapsing them into a wall of numbers.
- Scanned / image-only PDFs: you need OCR (Tesseract or a hosted OCR API). fastCRW has no screenshot or image path here, so this is firmly the parser's job.
Whatever you choose, emit the same shape you emit for web pages: markdown text plus metadata. Tag each PDF chunk with source_type: "pdf", the file name, and — critically — the page number the text came from, because that is your citation anchor in Step 4.
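For the text-native case, the per-page tagging could look something like the sketch below. It assumes PyMuPDF is installed (imported under its long-standing fitz name) and emits plain page text, which is acceptable input for a markdown chunker but carries no table structure; pdf_records is a hypothetical helper name.

```python
from pathlib import Path


def extract_pdf_pages(path: str) -> list[str]:
    """Return the plain text of each page, in order, using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the tagging helper works without it

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


def pdf_records(path: str, page_texts: list[str]) -> list[dict]:
    """Turn per-page text into tagged records, keeping 1-based page numbers."""
    records = []
    for page_number, text in enumerate(page_texts, start=1):
        if not text.strip():
            # Blank or image-only page: nothing to index here, send it to OCR.
            continue
        records.append(
            {
                "text": text.strip(),
                "source_type": "pdf",
                "source_ref": Path(path).name,
                "locator": page_number,
            }
        )
    return records


# Usage: records = pdf_records("q3.pdf", extract_pdf_pages("q3.pdf"))
```

Skipped pages keep their original numbering, so a citation to page 3 still points at page 3 even when page 2 was empty.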
Step 3: Unify, chunk, and embed
The merge happens here, and it is simpler than people expect once both sources are markdown. You run one chunking strategy across both, not two.
A single chunking strategy across sources
Resist the urge to chunk PDFs and web pages differently. A divergent strategy fragments your retrieval quality because identical concepts end up with different granularity depending on where they came from. Pick one approach — recursive character splitting with a sensible overlap, or structure-aware splitting on markdown headings — and apply it uniformly. Markdown from both legs makes heading-aware splitting genuinely viable, since fastCRW preserves heading structure and a good PDF parser reconstructs it. See best chunking strategies for RAG for sizing trade-offs.
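One way to apply a single heading-aware strategy to both legs is sketched below. The size and overlap defaults are illustrative assumptions, not recommendations, and the splitter is deliberately naive: it treats any line starting with # as a heading, so a # comment inside a fenced code block would also start a new section.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")


def split_on_headings(markdown: str) -> list[str]:
    """Split markdown into sections, each starting at a heading line."""
    sections, current = [], []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def chunk_markdown(markdown: str, max_chars: int = 1200, overlap: int = 150) -> list[str]:
    """Heading-aware chunks; oversized sections fall back to overlapping windows."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    step = max_chars - overlap
    chunks = []
    for section in split_on_headings(markdown):
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        for start in range(0, len(section), step):
            chunks.append(section[start:start + max_chars])
            if start + max_chars >= len(section):
                break  # the window already reached the end of the section
    return chunks
```

Because the same function runs on web markdown and PDF-derived markdown, a concept gets the same granularity wherever it came from.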
Tagging chunks by source type
Every chunk carries metadata before it is embedded:
- source_type: "web" or "pdf"
- source_ref: the URL (web) or file name (pdf)
- locator: an anchor/section (web) or page number (pdf)
- ingested_at: timestamp, for freshness logic later
Embed all chunks with the same model and write them to one vector store. The metadata is what lets a single retriever return the most relevant chunk regardless of origin — and lets you filter (e.g. "only PDFs" or "only docs from this domain") when a query calls for it.
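The metadata contract and the filtering it enables might be modeled like this. The field names come from the list above; the Chunk class and filter_chunks helper are hypothetical, and a real vector store would usually apply such filters server-side rather than in Python.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional, Union
from urllib.parse import urlparse


@dataclass
class Chunk:
    text: str
    source_type: str                          # "web" or "pdf"
    source_ref: str                           # URL (web) or file name (pdf)
    locator: Union[str, int, None] = None     # anchor/section (web) or page number (pdf)
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self):
        if self.source_type not in ("web", "pdf"):
            raise ValueError(f"unknown source_type: {self.source_type!r}")


def filter_chunks(chunks, source_type: Optional[str] = None,
                  domain: Optional[str] = None) -> list:
    """Keep chunks matching a source type and/or a web domain."""
    kept = []
    for chunk in chunks:
        if source_type and chunk.source_type != source_type:
            continue
        # File names have no hostname, so a domain filter excludes PDF chunks.
        if domain and urlparse(chunk.source_ref).hostname != domain:
            continue
        kept.append(chunk)
    return kept
```

Validating source_type at construction time catches mistagged chunks before they reach the index.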
Step 4: Retrieve and answer
With a unified index, retrieval is ordinary RAG: embed the query, pull the top-k chunks across both source types, and pass them to your LLM with a grounding prompt. The mixed-source twist is entirely in citations.
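To make "top-k across both source types" concrete, here is a toy in-memory retriever using cosine similarity. It is only an illustration under the assumption that each indexed item is a dict with a "vector" key; a production pipeline would delegate this to its vector store.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: list[dict], k: int = 4) -> list[dict]:
    """Return the k most similar items, ignoring where each chunk came from."""
    ranked = sorted(
        index,
        key=lambda item: cosine(query_vec, item["vector"]),
        reverse=True,
    )
    return ranked[:k]


index = [
    {"text": "pdf fact", "source_type": "pdf", "vector": [1.0, 0.0]},
    {"text": "web fact", "source_type": "web", "vector": [0.0, 1.0]},
    {"text": "mixed", "source_type": "web", "vector": [0.7, 0.7]},
]
hits = top_k([1.0, 0.1], index, k=2)
```

The ranking never inspects source_type, which is the point: relevance decides, and the metadata only travels along for citation.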
Citations back to the PDF page or URL
Because you tagged every chunk, the answer can cite precisely:
- Web chunk → link to the source_ref URL (optionally with the section anchor).
- PDF chunk → "Filename, p. {locator}" so a reader can open the document to the exact page.
Prompt the model to attach the source_ref and locator of any chunk it used. This is the single highest-leverage thing you can do for trust: a mixed-source answer that says "from the Q3 report, p. 12" next to "from the product docs page" is verifiable, and verifiable answers are the ones people actually act on. For the broader retrieval-quality picture, see our RAG pipeline with CRW walkthrough.
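The citation rules above translate into a small formatter plus a grounding prompt. Both functions are hypothetical helpers, and the prompt wording is just one plausible phrasing rather than a tested template.

```python
def format_citation(chunk: dict) -> str:
    """Render a chunk's source as a reader-facing citation string."""
    if chunk["source_type"] == "pdf":
        return f'{chunk["source_ref"]}, p. {chunk["locator"]}'
    anchor = chunk.get("locator")
    # Web chunks link to the URL, with the section anchor when one is known.
    return f'{chunk["source_ref"]}#{anchor}' if anchor else chunk["source_ref"]


def grounding_prompt(question: str, chunks: list[dict]) -> str:
    """Build a prompt that labels every retrieved chunk with its citation."""
    blocks = [
        f"[{i}] ({format_citation(c)})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    ]
    return (
        "Answer using only the sources below. After each claim, cite the "
        "source in parentheses exactly as it is labeled.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nQuestion: {question}"
    )
```

Putting the citation string next to each chunk in the prompt means the model copies a label it can see instead of inventing one.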
Keeping the index fresh
The two source types go stale on different clocks, so refresh them on different clocks.
- Web: re-crawl on a schedule (cron). fastCRW is stateless per request — it does not remember your last crawl — so you own the diff logic: store a content hash per URL and re-embed only the pages that changed. That keeps re-index cost proportional to actual change, not to site size.
- PDFs: re-ingest on change. PDFs usually arrive as discrete uploads or file-drops, so trigger re-parsing when a file's hash changes rather than on a timer.
Tag the new chunks with a fresh ingested_at and remove the superseded ones so retrieval never serves two versions of the same page.
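The hash-per-source diff logic could be sketched as below. SHA-256 is one reasonable choice of hash, not something the engine prescribes, and plan_refresh is a hypothetical helper that works the same for URLs and file names.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a page's markdown or a parsed PDF's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(stored_hashes: dict, fresh_docs: dict):
    """Compare fresh content against stored hashes.

    Returns (changed, removed): `changed` maps each new or modified source to
    its new hash; `removed` lists sources that no longer appear at all.
    """
    changed = {}
    for ref, text in fresh_docs.items():
        digest = content_hash(text)
        if stored_hashes.get(ref) != digest:
            changed[ref] = digest
    # Only trust `removed` when fresh_docs covers the full source set; a crawl
    # truncated by a page cap can make live pages look deleted.
    removed = [ref for ref in stored_hashes if ref not in fresh_docs]
    return changed, removed
```

Only the sources in changed need re-chunking and re-embedding; delete their old chunks first so two versions never coexist in the index.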
Cost, footprint, and self-hosting the web leg
Honesty about cost is part of the honest-scoping theme. The web leg's credit math is flat and legible: every renderer (http, lightpanda, or chrome) costs 1 credit per scrape, and crawl is 1 credit per page regardless of renderer — no chrome surcharge. There is no JSON-extraction step in this pipeline unless you add one — plain markdown ingestion stays at the 1-credit rate. Check live numbers on /pricing before you budget.
If you would rather not meter the web leg at all, self-host it. fastCRW is AGPL-3.0 and ships as a single ~8 MB binary in one container, so the web ingestion engine self-hosts for $0 (you pay only your own server) and your scraped content never leaves your infrastructure — useful when the documents in your RAG index are sensitive. Your PDF parser runs locally in the same spirit; the dedicated-parser tools above (PyMuPDF, pdfplumber, Docling, Tesseract) are all self-hostable too. For parser selection beyond this post, see best document parsing APIs.
Honest limits, restated
So nobody is surprised in production:
- No PDF parser in fastCRW. No /parse document-upload endpoint; pair it with a dedicated parser, as above.
- No screenshot output (HTTP 422), so image-only/scanned PDFs need OCR in the parser leg.
- Stateless per request. Freshness diffs and the index itself are yours to persist.
- No multi-URL batch /v1/extract. For many web pages, iterate /v1/scrape concurrently or use /v1/crawl.
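Iterating /v1/scrape concurrently, as the last limit suggests, might look like the following sketch. The scrape_one callable is an assumption: pass in whatever function you use to scrape a single URL. The default of 8 workers is arbitrary, so tune it to your plan's rate limits or your own server's capacity.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def scrape_many(urls: list[str], scrape_one: Callable[[str], str],
                max_workers: int = 8):
    """Scrape URLs in parallel threads; collect failures instead of aborting."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going; report the failure per URL
                errors[url] = str(exc)
    return results, errors
```

Threads suit this workload because each scrape spends its time waiting on the network, and returning errors separately lets one bad URL fail without losing the rest of the batch.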
Sources
- fastCRW canonical fact sheet: scrape benchmark (diagnose_3way.py, 819 labeled URLs, 2026-05-08), API surface, footprint, and honest gaps. github.com/us/crw
- See plan pricing and credit costs: fastcrw.com/pricing (verify launch vs regular pricing at time of reading).
- Firecrawl docs (API-compatible reference shape): docs.firecrawl.dev (verified 2026-05-18).
Related: RAG pipeline with CRW · Best chunking strategies for RAG · Scrape-to-RAG with LlamaIndex · Best document parsing APIs
