2026-03-11 — Engine v0.0.8
This release focused on two themes:
- making extraction behavior more reliable on real-world content,
- and making the product surface easier to understand through clearer docs and validation.
Engine (CRW)
- Wikipedia / MediaWiki onlyMainContent fix — onlyMainContent: true now correctly extracts article text from Wikipedia pages (~49% size reduction). Previously the noise handler matched "toc" as a substring inside "vector-toc-available" on the <html> element, removing the entire page.
- 3-tier noise pattern matching — noise class/id matching now uses substring (long patterns), exact-token (short/ambiguous: toc, share, social, comment, related), and prefix (ad-, ads-) matching to avoid false positives on real content.
- Structural element guard — noise handler never removes <html>, <head>, <body>, or <main> elements.
- Re-clean after readability — readability output is re-cleaned to strip residual noise (infobox, navbox, catlinks) inside broad containers.
- Wikipedia-aware readability — added .mw-parser-output, #mw-content-text, #bodyContent to scored selectors; selectors wrapping >90% of body are skipped.
- BYOK LLM extraction — per-request llmApiKey, llmProvider, llmModel fields for bring-your-own-key structured extraction without server config.
- JSON format validation — formats: ["json"] without jsonSchema now returns a 400 error instead of a warning.
- Block detection skip — pages >50 KB skip interstitial/block detection (no more false "blocked by anti-bot" on Wikipedia).
- Null byte protection — URLs containing %00 or null bytes are rejected at the validation layer.
- Request timeout — default bumped from 60s to 120s.
- Dockerfile fix — corrected cargo build flags, added config.docker.toml.
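The three matching tiers and the structural element guard described above can be sketched roughly as follows. This is an illustration in Python, not the engine's actual code: the function name `is_noise`, the choice of infobox/navbox/catlinks as "long" substring patterns, and the assumption that tokens are whitespace-separated class/id values are all hypothetical.

```python
# Sketch only: pattern lists and names are illustrative, not the engine's code.
PROTECTED_TAGS = {"html", "head", "body", "main"}       # structural element guard
SUBSTRING_PATTERNS = ("infobox", "navbox", "catlinks")  # assumed "long" patterns
EXACT_TOKENS = {"toc", "share", "social", "comment", "related"}
PREFIXES = ("ad-", "ads-")


def is_noise(tag: str, class_or_id: str) -> bool:
    """Return True if an element looks like noise under the three tiers."""
    # Guard: structural elements are never treated as noise.
    if tag.lower() in PROTECTED_TAGS:
        return False
    value = class_or_id.lower()
    tokens = value.split()  # assumes whitespace-separated class/id tokens
    # Tier 1: long patterns may match anywhere in the attribute value.
    if any(pattern in value for pattern in SUBSTRING_PATTERNS):
        return True
    # Tier 2: short, ambiguous patterns must equal a whole token.
    if any(token in EXACT_TOKENS for token in tokens):
        return True
    # Tier 3: ad-style patterns must appear at the start of a token.
    return any(token.startswith(PREFIXES) for token in tokens)
```

Under these assumptions, `is_noise("div", "vector-toc-available")` is False because "toc" is not a whole token, while `is_noise("div", "toc")` and `is_noise("div", "ad-banner")` are True.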
Platform
- BYOK docs — added BYOK (llmApiKey/llmProvider/llmModel) documentation to scrape and extract docs.
- Free tier 500 credits — free tier increased from 50 to 500 credits.
- Pricing clarity — "Regular $XX/mo" strikethrough compare-at-price, launch end date (June 1, 2026), rate lock guarantee.
- About page — new /about page with mission, open-source philosophy, and contact info.
- Trust section — stats section on landing page (AGPL-3.0, 6.6 MB, 99.9%, 500 credits).
- Validation errors — upstream 422 errors now include specific guidance about valid format names and jsonSchema requirements.
- Header cleanup — removed Via header from responses.
- Documentation — added docs for error codes, rate limits, credit costs, JS rendering, formats, SDK examples, MCP integration, compatibility matrix, and self-hosting hardening.
Upgrade notes
- Re-test any extraction workflow that depends on Wikipedia or MediaWiki-style content because onlyMainContent behavior is now more aggressive and more accurate.
- If you were relying on permissive json requests without a schema, update the client now; those requests return a 400 error in this release.
- If you self-host, pull the latest container image so the Dockerfile and config changes land together.
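When updating a client, a small pre-flight check can catch the two request shapes this release rejects before they reach the API. The sketch below is a client-side illustration in Python, not the server's validation logic: the helper name `preflight_errors` and the `url` field name are assumptions, and the BYOK values are placeholders.

```python
def preflight_errors(payload: dict) -> list[str]:
    """Collect problems that this release is described as rejecting."""
    errors = []
    url = payload.get("url", "")  # "url" as the field name is an assumption
    # Null byte protection: raw or percent-encoded null bytes are rejected.
    if "\x00" in url or "%00" in url:
        errors.append("url contains a null byte")
    # JSON format validation: "json" without jsonSchema is now an error.
    if "json" in payload.get("formats", []) and "jsonSchema" not in payload:
        errors.append('formats includes "json" but jsonSchema is missing')
    return errors


# Example payload; the schema and BYOK values are placeholders.
request = {
    "url": "https://en.wikipedia.org/wiki/Web_crawler",
    "onlyMainContent": True,
    "formats": ["json"],
    "jsonSchema": {"type": "object", "properties": {"title": {"type": "string"}}},
    "llmApiKey": "YOUR_KEY",
    "llmProvider": "example-provider",
    "llmModel": "example-model",
}
```

Calling `preflight_errors(request)` on the payload above returns an empty list; removing `jsonSchema` produces one error.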
2026-03-10
Initial Release
- Scrape, crawl, and map endpoints — Firecrawl-compatible API shape.
- Markdown-first extraction with readability scoring.
- CSS/XPath selectors, tag include/exclude filtering.
- BM25 and cosine similarity chunk filtering.
- LLM-based structured extraction with JSON Schema validation.
- JS rendering via LightPanda CDP.
- Stealth mode with browser-realistic UA rotation.
- Credit-based billing with Stripe integration.
- Self-hosting support with single-binary deployment.
Release framing
The first release established the core product surface: a Firecrawl-compatible scrape, crawl, and map API with markdown-first extraction, optional browser rendering, and a path to self-hosting. Later releases should be read as refinements on top of that baseline, not a new product direction.