2026-03-11 — Engine v0.0.8

This release focused on two themes:

making extraction behavior more reliable on real-world content,
and making the product surface easier to understand through clearer docs and validation.

Engine (CRW)

Wikipedia / MediaWiki onlyMainContent fix — onlyMainContent: true now correctly extracts article text from Wikipedia pages (~49% size reduction). Previously the noise handler matched "toc" as a substring inside "vector-toc-available" on the <html> element, removing the entire page.
3-tier noise pattern matching — noise class/id matching now uses substring (long patterns), exact-token (short/ambiguous: toc, share, social, comment, related), and prefix (ad-, ads-) matching to avoid false positives on real content.
Structural element guard — noise handler never removes <html>, <head>, <body>, or <main> elements.
Re-clean after readability — readability output is re-cleaned to strip residual noise (infobox, navbox, catlinks) inside broad containers.
Wikipedia-aware readability — added .mw-parser-output, #mw-content-text, #bodyContent to scored selectors; selectors wrapping >90% of body are skipped.
BYOK LLM extraction — per-request llmApiKey, llmProvider, llmModel fields for bring-your-own-key structured extraction without server config.
JSON format validation — formats: ["json"] without jsonSchema now returns a 400 error instead of a warning.
Block detection skip — pages >50 KB skip interstitial/block detection (no more false "blocked by anti-bot" on Wikipedia).
Null byte protection — URLs containing %00 or null bytes are rejected at the validation layer.
Request timeout — default bumped from 60s to 120s.
Dockerfile fix — corrected cargo build flags, added config.docker.toml.

Platform

BYOK docs — added BYOK (llmApiKey/llmProvider/llmModel) documentation to scrape and extract docs.
Free tier 500 credits — free tier increased from 50 to 500 credits.
Pricing clarity — "Regular $XX/mo" strikethrough compare-at-price, launch end date (June 1, 2026), rate lock guarantee.
About page — new /about page with mission, open-source philosophy, and contact info.
Trust section — stats section on landing page (AGPL-3.0, 6.6 MB, 99.9%, 500 credits).
Validation errors — upstream 422 errors now include specific guidance about valid format names and jsonSchema requirements.
Header cleanup — removed Via header from responses.
Documentation — added docs for error codes, rate limits, credit costs, JS rendering, formats, SDK examples, MCP integration, compatibility matrix, and self-hosting hardening.

Upgrade notes

Re-test any extraction workflow that depends on Wikipedia or MediaWiki-style content because onlyMainContent behavior is now more aggressive and more accurate.
If you were relying on permissive json requests without a schema, update the client now; those requests return a 400 error in this release.
If you self-host, pull the latest container image so the Dockerfile and config changes land together.

2026-03-10

Initial Release

Scrape, crawl, and map endpoints — Firecrawl-compatible API shape.
Markdown-first extraction with readability scoring.
CSS/XPath selectors, tag include/exclude filtering.
BM25 and cosine similarity chunk filtering.
LLM-based structured extraction with JSON Schema validation.
JS rendering via LightPanda CDP.
Stealth mode with browser-realistic UA rotation.
Credit-based billing with Stripe integration.
Self-hosting support with single-binary deployment.

Release framing

The first release established the core product surface: a Firecrawl-compatible scrape, crawl, and map API with markdown-first extraction, optional browser rendering, and a path to self-hosting. Later releases should be read as refinements on top of that baseline, not a new product direction.

Changelog