Blog category
Engineering
Engineering notes on architecture, performance, benchmarks, releases, and infrastructure tradeoffs behind fastCRW.
Stateless vs Stateful Scraping: Session Tradeoffs
Stateless vs stateful scraping: when persistent sessions and cookies help, what they cost in complexity, and how to handle logins without server-side state.
LLM Fine-Tuning Data Pipelines From the Web
Build LLM fine-tuning instruction datasets from the web: schema-driven Q/A extraction, prompt/response pair structuring, and a managed formatter on paid plans.
Local-First Web Scraping and Data Privacy: Why the URL Leak Matters
Every hosted scraping API sees every URL you scrape. A deep-dive on local-first web scraping, data residency, and the privacy and compliance case for keeping the scrape engine on your own infra.
Open Source Web Scraping in 2026: The Open-Core Trap and How to Avoid It
Not all 'open source' scraping tools are equally free. A guide to open source web scraping in 2026 — open-core bait-and-switch, crippled OSS tiers, cloud-only anti-bot engines, and what genuine parity looks like.
Web Scraping Benchmark Methodology: Why p50/p90/p99
Our web scraping benchmark methodology: shared public dataset, percentile latency, labeled ground-truth recall, disclosed gaps. Why we never publish averages.
Firecrawl Extract Deep Dive: Schemas, Cost, and the Dual-Billing You Need to Plan For
A technical and economic deep dive on Firecrawl's structured extraction — JSON schemas, natural-language extraction, accuracy patterns — and the separate token subscription that makes extract the most underestimated line item.
Vector Embeddings vs Keyword Search Explained
Vector embeddings power semantic search; keyword search matches exact terms. Learn the difference, when each one wins, and where live web retrieval fits.
Scraping Latency Explained: Where the Milliseconds Actually Go
A from-first-principles breakdown of web scraping latency in 2026 — DNS, TLS, fetch, render decision, extraction, and inter-service hops — and the architectural choices that make a scrape return in under a second.
Training Data Collection via Web Scraping
Collect an LLM training corpus by scraping the web responsibly: coverage, provenance, robots.txt, and the legal questions you must settle before you start.
LLM-Ready Markdown Extraction: Why Clean Beats Complete
Turning a web page into LLM-ready markdown is not 'dump the HTML.' A deep-dive on boilerplate stripping, structure preservation, token economics, and why extraction quality silently decides RAG answer quality.
Firecrawl for RAG Pipelines: What It's Great At, and Where the Bill Bites
An engineering look at using Firecrawl in a RAG ingestion pipeline — markdown quality, crawl-to-chunk patterns, freshness, and the cost dynamics that decide whether a Firecrawl-compatible self-host wins.
We Built a Drop-In Firecrawl Research API — and Beat It on ArXivQA (61% vs 53.3%)
fastCRW's Research API mirrors Firecrawl's research endpoints and reaches 61.0% recall on the ArXivQA paper-retrieval benchmark vs Firecrawl's 53.3% — live, with no self-hosted index. Here's exactly how.
What Is Local-First Web Scraping?
Local-first web scraping keeps target URLs and scraped data on your own infra. Learn what it means, how it works, and when it beats a cloud scraping API.
What Is Agentic Search and Why It Beats Stale Caches
Agentic search queries the live web at reasoning time. Learn how it differs from RAG and traditional search, and when agents need real-time retrieval.
Agentic Search vs RAG Retrieval for Agents
Agentic search vs RAG retrieval: which to use for AI agents. Compare freshness, latency, cost, and accuracy, and learn when to combine both in one stack.
Best Chunking Strategies for RAG in 2026
Compare 7 chunking strategies for RAG: fixed, recursive, semantic, page-level, late chunking. When to use each, with code, benchmarks, and honest trade-offs.
How to Measure Web Scraper Accuracy (Truth-Recall)
Truth-recall measures how much labeled ground-truth content a scraper actually returns. Learn how to measure web scraper accuracy with a real 819-URL method.
What Is a Web Index? How It Powers Search & AI Agents
A web index is a pre-built snapshot of the web. Learn the four-stage indexing pipeline, hybrid retrieval, and why index quality caps what your agent answers.
LangGraph Web-Aware RAG at Lower Latency
Add a web-aware retrieval node to LangGraph RAG with fastCRW. Cut median scrape latency versus Firecrawl while keeping the highest truth-recall of the three tools tested.
Managed LLM Search API Costs: The Capped Credit Model
How managed LLM search adds model usage to your bill: metered in credits with an 8,000-credit per-request cap that keeps answer-mode cost predictable.
Why a Stateless Request Model Beats Sessions
A stateless web scraping architecture is simpler to scale, retry, and self-host. How fastCRW's per-request model avoids session affinity and sticky routing.
Scheduled Web Scraping in GitHub Actions With CRW (2026)
Run scrapes on a schedule for free with GitHub Actions: spin up CRW as a service container, scrape with Python, commit results, and open a PR on change. Full workflow YAML — no servers, AGPL-3.0.
Search Index vs Live Web: Agents Need Both
A search index is fast but can be stale; the live web is fresh but slower. Learn why AI agents need both layers and how to combine them for speed and freshness.
Credit Multiplier Traps in Scraping APIs
Scraping APIs hide cost in multipliers: render multipliers, premium-proxy multipliers, separate extract plans. Learn to spot the traps and price a flat alternative.
URL Mapping vs Sitemap Parsing for Discovery
URL mapping vs sitemap.xml parsing for site discovery: coverage, freshness, and cost. When /v1/map beats a stale sitemap and feeds a crawl for 1 credit.
Ruby to Go: Rewriting Legacy Scrapers for Speed
Rewrite a legacy Ruby web scraper in Go for concurrency — or skip the rewrite and call a Firecrawl-compatible API from Go. Migration patterns, costs, and limits.
Firecrawl /scrape Deep Dive: Formats, JS Rendering, and the Compatible Way to Call It
A deep technical walkthrough of Firecrawl's scrape endpoint — formats, markdown vs HTML vs JSON, JavaScript rendering, metadata, error handling — and how the same calls work against a Firecrawl-compatible engine.
How We Built fastCRW: Rust, 50MB RAM, and the Path to Real-Time Web Scraping for AI Agents (2026)
A build-in-public engineering write-up of fastCRW — why we wrote it in Rust, how the binary stays around 50 MB RAM idle on a $5 VPS, when LightPanda beats Chromium, the Firecrawl-compatible REST surface, the built-in MCP server, the 63.74% truth-recall benchmark (diagnose_3way.py, 2026-05-08), and the things we got wrong along the way.
Rust vs Python Scrapers: An Architecture and Footprint Deep-Dive
Not 'which language is faster' — a systems-level look at why Rust and Python scraper architectures diverge on memory footprint, concurrency model, cold start, and operational surface, and when each wins.
Build an LLM Training-Data Pipeline With CRW (2026): Crawl, Clean, Dedupe to JSONL
Turn the web into clean fine-tuning data: crawl with CRW, strip boilerplate, quality-filter, near-dedupe with MinHash, and emit JSONL. Full runnable Python — self-host free under AGPL-3.0.
Firecrawl /crawl Deep Dive: Jobs, Limits, Credit Cost, and Safe Patterns (2026)
Everything about Firecrawl's crawl endpoint — the async job model, depth and page limits, why crawl is the biggest credit sink, polling patterns, and how the same crawl works against a Firecrawl-compatible engine.
CRW v0.7.0: LLM Summary and Search Answer (Managed LLM)
v0.7.0 adds AI summaries to /scrape, Perplexity-style answers with citations to /search, and per-result LLM summaries — powered by fastCRW's managed LLM on paid plans.
CRW v0.0.10: Rate Limiting, Crawl Cancel, and Machine-Readable Error Codes
CRW v0.0.10 adds configurable rate limiting, a crawl cancel endpoint, machine-readable error codes on every error response, fenced code blocks, and cleaner markdown output for RAG pipelines.
The Real Cost of Self-Hosting vs Cloud Scraping APIs
Self-hosted vs cloud scraping API costs — TCO breakdown with real calculations for VPS, engineering time, and CRW's lightweight edge.
CRW v0.0.2: CSS Selectors, Chunking, BM25 Scoring, and Stealth Mode
CRW v0.0.2 adds CSS/XPath extraction, RAG-ready chunking with BM25 and cosine scoring, stealth mode for bot detection bypass, per-request proxy, and a setup command for JS rendering.
CRW v0.0.11: Stealth Anti-Bot Bypass, Chrome Failover, and Cloudflare Challenge Retry
CRW v0.0.11 adds automatic stealth JavaScript injection to bypass bot detection, Chrome as a fallback renderer for complex SPAs, Cloudflare challenge auto-retry, and HTTP-to-CDP auto-escalation.
Single-Binary Infrastructure: Why It Matters for Developer Tools
The case for single-binary deployment in developer infrastructure — operational simplicity, CI speed, and why CRW ships as one 8 MB file.
Rust vs Python Web Scraping (2026): Lower Latency, Tiny Footprint
Rust web scrapers run with lower latency and a far smaller memory footprint than Python. We compare fastCRW (Rust) against Scrapy, BeautifulSoup, and Playwright — latency, memory, throughput, and which to pick for your stack.
Why Every AI Agent Needs a Web Context Layer
Why AI agents need a web context layer — live scraping as infrastructure to reduce hallucinations. Build one with MCP, RAG, and CRW.
Why Low Memory Usage Matters in Self-Hosted Scraping
How idle RAM affects your hosting costs and concurrent throughput — and why CRW's small single-binary footprint changes the economics.
Inside CRW: Architecture of a Lightweight Rust Scraping API
A technical deep-dive into CRW's Axum-based API, lol-html parser, LightPanda integration, and how it stays a small single static binary with a tiny idle footprint.
Where CRW Still Falls Short — and What We're Improving
An honest look at CRW's current limitations — PDF parsing, anti-bot, SPA coverage, retry logic, caching — and the roadmap for each.
Introducing Search: Find, Scrape, and Extract in One API Call
CRW now includes a search endpoint. Search the web, get structured results, and optionally scrape every result page — all in a single API call.
CRW v0.0.8: Wikipedia Fix, LLM Extraction, and Smarter Noise Detection
CRW v0.0.8 fixes Wikipedia extraction with onlyMainContent, adds per-request LLM extraction config, introduces 3-tier noise matching, and hardens the content cleaning pipeline.
What I Learned Benchmarking CRW Against Firecrawl and Crawl4AI
How we benchmark CRW against Firecrawl and Crawl4AI — methodology, dataset breakdown, what the metrics mean, and a one-command reproducible script you can run against your own URLs.
Why I Built CRW: A Lightweight Firecrawl-Compatible Scraper in Rust
The story behind CRW — why Rust, why single-binary, and why Firecrawl-compatible for AI agent and RAG use cases.
This category contains 46 of 156 total posts in the fastCRW blog archive.