Engineering

LLM Fine-Tuning Data Pipelines From the Web

Build LLM fine-tuning instruction datasets from the web: schema-driven Q/A extraction, prompt/response pair structuring, and a managed formatter on paid plans.

Jun 26, 2026

Engineering·15 min read

Local-First Web Scraping and Data Privacy: Why the URL Leak Matters

Every hosted scraping API sees every URL you scrape. A deep-dive on local-first web scraping, data residency, and the privacy and compliance case for keeping the scrape engine on your own infra.

Jun 26, 2026

Open Source Web Scraping in 2026: The Open-Core Trap and How to Avoid It

Not all 'open source' scraping tools are equally free. A guide to open source web scraping in 2026 — open-core bait-and-switch, crippled OSS tiers, cloud-only anti-bot engines, and what genuine parity looks like.

Jun 25, 2026

Web Scraping Benchmark Methodology: Why p50/p90/p99

Our web scraping benchmark methodology: shared public dataset, percentile latency, labeled ground-truth recall, disclosed gaps. Why we never publish averages.

Jun 24, 2026

Firecrawl Extract Deep Dive: Schemas, Cost, and the Dual-Billing You Need to Plan For

A technical and economic deep dive on Firecrawl's structured extraction — JSON schemas, natural-language extraction, accuracy patterns — and the separate token subscription that makes extract the most underestimated line item.

Jun 24, 2026

Vector Embeddings vs Keyword Search Explained

Vector embeddings power semantic search; keyword search matches exact terms. Learn the difference, when each one wins, and where live web retrieval fits.

Jun 24, 2026

Engineering·15 min read

Scraping Latency Explained: Where the Milliseconds Actually Go

A from-first-principles breakdown of web scraping latency in 2026 — DNS, TLS, fetch, render decision, extraction, and inter-service hops — and the architectural choices that make a scrape return in under a second.

Jun 23, 2026

Training Data Collection via Web Scraping

Collect an LLM training corpus by scraping the web responsibly: coverage, provenance, robots.txt, and the legal questions you must settle before you start.

Jun 23, 2026

Engineering·15 min read

LLM-Ready Markdown Extraction: Why Clean Beats Complete

Turning a web page into LLM-ready markdown is not 'dump the HTML.' A deep-dive on boilerplate stripping, structure preservation, token economics, and why extraction quality silently decides RAG answer quality.

Jun 22, 2026

Firecrawl for RAG Pipelines: What It's Great At, and Where the Bill Bites

An engineering look at using Firecrawl in a RAG ingestion pipeline — markdown quality, crawl-to-chunk patterns, freshness, and the cost dynamics that decide whether a Firecrawl-compatible self-host wins.

Jun 20, 2026

We Built a Drop-In Firecrawl Research API — and Beat It on ArXivQA (61% vs 53.3%)

fastCRW's Research API mirrors Firecrawl's research endpoints and reaches 61.0% recall on the ArXivQA paper-retrieval benchmark vs Firecrawl's 53.3% — live, with no self-hosted index. Here's exactly how.

Jun 20, 2026

What Is Local-First Web Scraping?

Local-first web scraping keeps target URLs and scraped data on your own infra. Learn what it means, how it works, and when it beats a cloud scraping API.

Jun 15, 2026

What Is Agentic Search and Why It Beats Stale Caches

Agentic search queries the live web at reasoning time. Learn how it differs from RAG and traditional search, and when agents need real-time retrieval.

Agentic Search vs RAG Retrieval for Agents

Agentic search vs RAG retrieval: which to use for AI agents. Compare freshness, latency, cost, and accuracy, and learn when to combine both in one stack.

Best Chunking Strategies for RAG in 2026

Compare 7 chunking strategies for RAG, including fixed, recursive, semantic, page-level, and late chunking. When to use each, with code, benchmarks, and honest trade-offs.

How to Measure Web Scraper Accuracy (Truth-Recall)

Truth-recall measures how much labeled ground-truth content a scraper actually returns. Learn how to measure web scraper accuracy with a method run on 819 real URLs.

What Is a Web Index? How It Powers Search & AI Agents

A web index is a pre-built snapshot of the web. Learn the four-stage indexing pipeline, hybrid retrieval, and why index quality caps what your agent answers.

Jun 13, 2026

LangGraph Web-Aware RAG at Lower Latency

Add a web-aware retrieval node to LangGraph RAG with fastCRW. Cut median scrape latency versus Firecrawl while keeping the highest truth-recall of the three tools tested.

Jun 10, 2026

Managed LLM Search API Costs: The Capped Credit Model

How managed LLM search adds model usage to your bill: metered in credits with an 8,000-credit per-request cap that keeps answer-mode cost predictable.

Jun 8, 2026

Why a Stateless Request Model Beats Sessions

A stateless web scraping architecture is simpler to scale, retry, and self-host. How fastCRW's per-request model avoids session affinity and sticky routing.

Jun 8, 2026

Engineering·13 min read

Scheduled Web Scraping in GitHub Actions With CRW (2026)

Run scrapes on a schedule for free with GitHub Actions: spin up CRW as a service container, scrape with Python, commit results, and open a PR on change. Full workflow YAML — no servers, AGPL-3.0.

Jun 6, 2026

Search Index vs Live Web: Agents Need Both

A search index is fast but can be stale; the live web is fresh but slower. Learn why AI agents need both layers and how to combine them for speed and freshness.

Jun 5, 2026

Credit Multiplier Traps in Scraping APIs

Scraping APIs hide cost in multipliers: render multipliers, premium-proxy multipliers, separate extract plans. Learn to spot the traps and price a flat alternative.

Jun 3, 2026

URL Mapping vs Sitemap Parsing for Discovery

URL mapping vs sitemap.xml parsing for site discovery: coverage, freshness, and cost. When /v1/map beats a stale sitemap and feeds a crawl for 1 credit.

Jun 3, 2026

Ruby to Go: Rewriting Legacy Scrapers for Speed

Rewrite a legacy Ruby web scraper in Go for concurrency — or skip the rewrite and call a Firecrawl-compatible API from Go. Migration patterns, costs, and limits.

Jun 2, 2026

Firecrawl /scrape Deep Dive: Formats, JS Rendering, and the Compatible Way to Call It

A deep technical walkthrough of Firecrawl's scrape endpoint — formats, markdown vs HTML vs JSON, JavaScript rendering, metadata, error handling — and how the same calls work against a Firecrawl-compatible engine.

May 31, 2026

Engineering·17 min read

How We Built fastCRW: Rust, 50MB RAM, and the Path to Real-Time Web Scraping for AI Agents (2026)

A build-in-public engineering write-up of fastCRW — why we wrote it in Rust, how the binary stays around 50 MB RAM idle on a $5 VPS, when LightPanda beats Chromium, the Firecrawl-compatible REST surface, the built-in MCP server, the 63.74% truth-recall benchmark (diagnose_3way.py, 2026-05-08), and the things we got wrong along the way.

May 27, 2026

Rust vs Python Scrapers: An Architecture and Footprint Deep-Dive

Not 'which language is faster' — a systems-level look at why Rust and Python scraper architectures diverge on memory footprint, concurrency model, cold start, and operational surface, and when each wins.

May 25, 2026

Build an LLM Training-Data Pipeline With CRW (2026): Crawl, Clean, Dedupe to JSONL

Turn the web into clean fine-tuning data: crawl with CRW, strip boilerplate, quality-filter, near-dedupe with MinHash, and emit JSONL. Full runnable Python — self-host free under AGPL-3.0.

May 24, 2026

Firecrawl /crawl Deep Dive: Jobs, Limits, Credit Cost, and Safe Patterns (2026)

Everything about Firecrawl's crawl endpoint — the async job model, depth and page limits, why crawl is the biggest credit sink, polling patterns, and how the same crawl works against a Firecrawl-compatible engine.

May 20, 2026

CRW v0.7.0: LLM Summary and Search Answer (Managed LLM)

v0.7.0 adds AI summaries to /scrape, Perplexity-style answers with citations to /search, and per-result LLM summaries — powered by fastCRW's managed LLM on paid plans.

May 12, 2026

Engineering·7 min read

CRW v0.0.10: Rate Limiting, Crawl Cancel, and Machine-Readable Error Codes

CRW v0.0.10 adds configurable rate limiting, a crawl cancel endpoint, machine-readable error codes on every error response, fenced code blocks, and cleaner markdown output for RAG pipelines.

Apr 26, 2026

The Real Cost of Self-Hosting vs Cloud Scraping APIs

Self-hosted vs cloud scraping API costs: a TCO breakdown with real calculations for VPS and engineering time, and where CRW's lightweight footprint gives it an edge.

Apr 25, 2026

CRW v0.0.2: CSS Selectors, Chunking, BM25 Scoring, and Stealth Mode

CRW v0.0.2 adds CSS/XPath extraction, RAG-ready chunking with BM25 and cosine scoring, stealth mode for bot detection bypass, per-request proxy, and a setup command for JS rendering.

Apr 23, 2026

CRW v0.0.11: Stealth Anti-Bot Bypass, Chrome Failover, and Cloudflare Challenge Retry

CRW v0.0.11 adds automatic stealth JavaScript injection to bypass bot detection, Chrome as a fallback renderer for complex SPAs, Cloudflare challenge auto-retry, and HTTP-to-CDP auto-escalation.

Apr 22, 2026

Engineering·7 min read

Single-Binary Infrastructure: Why It Matters for Developer Tools

The case for single-binary deployment in developer infrastructure — operational simplicity, CI speed, and why CRW ships as one 8 MB file.

Apr 22, 2026

Rust vs Python Web Scraping (2026): Lower Latency, Tiny Footprint

Rust web scrapers run with lower latency and a far smaller memory footprint than Python. We compare fastCRW (Rust) against Scrapy, BeautifulSoup, and Playwright — latency, memory, throughput, and which to pick for your stack.

Apr 16, 2026

Why Every AI Agent Needs a Web Context Layer

Why AI agents need a web context layer — live scraping as infrastructure to reduce hallucinations. Build one with MCP, RAG, and CRW.

Apr 16, 2026

Engineering·7 min read

Why Low Memory Usage Matters in Self-Hosted Scraping

How idle RAM affects your hosting costs and concurrent throughput — and why CRW's small single-binary footprint changes the economics.

Apr 13, 2026

Inside CRW: Architecture of a Lightweight Rust Scraping API

A technical deep-dive into CRW's Axum-based API, lol-html parser, LightPanda integration, and how it stays a small single static binary with a tiny idle footprint.

Apr 12, 2026

Where CRW Still Falls Short — and What We're Improving

An honest look at CRW's current limitations — PDF parsing, anti-bot, SPA coverage, retry logic, caching — and the roadmap for each.

Apr 3, 2026

Introducing Search: Find, Scrape, and Extract in One API Call

CRW now includes a search endpoint. Search the web, get structured results, and optionally scrape every result page — all in a single API call.

Apr 3, 2026

CRW v0.0.8: Wikipedia Fix, LLM Extraction, and Smarter Noise Detection

CRW v0.0.8 fixes Wikipedia extraction with onlyMainContent, adds per-request LLM extraction config, introduces 3-tier noise matching, and hardens the content cleaning pipeline.

Apr 2, 2026