CRW v0.0.2 is the first major feature release since launch. It adds four capabilities that make CRW significantly more useful for RAG pipelines and AI agent workflows: targeted extraction via CSS/XPath selectors, content chunking with relevance scoring, stealth mode for bot detection bypass, and a one-command setup for JavaScript rendering.
CSS Selector and XPath Extraction
Not every scraping task needs the full page. Product prices live in .product-price. API docs live in #content. Job listings live in .job-card. Before v0.0.2, you'd scrape the entire page and filter client-side. Now you can target exactly what you need at the API level.
CSS Selectors
curl -X POST https://fastcrw.com/api/v1/scrape \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com/api-reference",
    "formats": ["markdown"],
    "cssSelector": "#main-content"
  }'
CRW applies the selector before markdown conversion, so the output is clean and focused — no navigation, no sidebar, just the content you asked for. This is faster than excludeTags for pages where you know exactly which element contains the content you want.
XPath Expressions
curl -X POST https://fastcrw.com/api/v1/scrape \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://store.example.com/product/widget",
    "formats": ["markdown"],
    "xpath": "//div[@class=\"product-details\"]"
  }'
XPath gives you more expressive power than CSS selectors — ancestor/sibling traversal, text content matching, attribute predicates. For complex pages with inconsistent markup, XPath is often the only way to reliably target the right element.
Both cssSelector and xpath return all matching elements concatenated into a single markdown output. If a selector matches nothing, you get an empty markdown string with a warning in the response metadata.
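For comparison, here is roughly what the client-side filtering looked like before v0.0.2 — a minimal sketch using Python's standard-library HTML parser to pull out one element by id. The HTML and the `IdExtractor` name are made up for illustration; this is not CRW's implementation.

```python
from html.parser import HTMLParser

class IdExtractor(HTMLParser):
    """Collect the inner text of the first element with a given id --
    a rough stand-in for what "cssSelector": "#main-content" does server-side."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # nesting depth inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

html = '<nav>menu</nav><div id="main-content"><p>Docs body</p></div>'
p = IdExtractor("main-content")
p.feed(html)
print("".join(p.chunks).strip())  # -> Docs body
```

Pushing this filtering to the API means the navigation text never enters your pipeline in the first place.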
Chunking for RAG Pipelines
RAG pipelines need content split into chunks that fit embedding model context windows. Before v0.0.2, you'd scrape a page and chunk it yourself. Now CRW handles chunking at the API level with three strategies.
Topic Chunking
Splits content at heading boundaries (h1–h6). Each chunk is a self-contained section with its heading preserved. Best for documentation and blog content where headings mark natural topic boundaries.
{
  "url": "https://docs.example.com/getting-started",
  "formats": ["markdown"],
  "chunkStrategy": "topic"
}
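A heading-boundary split can be sketched in a few lines — this illustrates the strategy, not CRW's actual implementation:

```python
import re

def topic_chunks(markdown):
    # Split at the start of any line beginning with 1-6 '#' characters,
    # keeping each heading attached to the section body that follows it.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [p.strip() for p in parts if p.strip()]

doc = "# Intro\nWelcome.\n\n## Install\nRun the installer.\n\n## Usage\nCall the API."
chunks = topic_chunks(doc)
print(len(chunks))  # -> 3
```

Each chunk starts with its heading, so downstream embedding keeps the topic label with the text.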
Sentence Chunking
Splits at sentence boundaries with configurable overlap. Best for long-form prose content where heading structure is sparse or inconsistent.
{
  "url": "https://blog.example.com/long-article",
  "formats": ["markdown"],
  "chunkStrategy": "sentence",
  "chunkSize": 500,
  "chunkOverlap": 50
}
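The idea behind sentence chunking with overlap can be sketched as greedy sentence packing. This assumes chunkSize and chunkOverlap are measured in characters — the actual unit CRW uses isn't stated here — and a real implementation would likely overlap whole sentences rather than raw characters:

```python
import re

def sentence_chunks(text, chunk_size=500, overlap=50):
    # Split at sentence-ending punctuation, then greedily pack sentences
    # into chunks of at most chunk_size characters, carrying the last
    # `overlap` characters forward into the next chunk for context.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > chunk_size:
            chunks.append(current)
            current = current[-overlap:].lstrip()
        current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

text = " ".join(f"Sentence {i} ends here." for i in range(8))
chunks = sentence_chunks(text, chunk_size=60, overlap=15)
print(len(chunks))  # -> 4
```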
Regex Chunking
Splits at custom delimiter patterns. Useful for pages with non-standard separators (horizontal rules, custom markers).
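As an illustration (the delimiter pattern here is hypothetical, not a CRW default), splitting on horizontal rules looks like this:

```python
import re

def regex_chunks(text, pattern=r"(?m)^---$"):
    # Split wherever a line consists solely of "---" (a markdown horizontal rule).
    return [c.strip() for c in re.split(pattern, text) if c.strip()]

page = "Section A\n---\nSection B\n---\nSection C"
print(regex_chunks(page))  # -> ['Section A', 'Section B', 'Section C']
```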
BM25 and Cosine Relevance Scoring
Once content is chunked, you often don't want all chunks — you want the most relevant ones for your query. CRW v0.0.2 adds two scoring modes:
{
  "url": "https://docs.example.com/authentication",
  "formats": ["markdown"],
  "chunkStrategy": "topic",
  "filterMode": "bm25",
  "query": "how to authenticate with API keys",
  "topK": 5
}
BM25 is a term-frequency scoring model that's fast and works well for keyword-style queries. Cosine computes cosine similarity between TF-IDF vectors of the query and each chunk, which rewards shared weighted terms rather than exact keyword hits. Both return chunks sorted by relevance score, so you can take the top-K and pass them directly to your embedding pipeline or LLM context.
This means CRW can replace the "scrape → chunk → score → select" pipeline with a single API call. For RAG workflows that process hundreds of pages, eliminating the client-side chunking step is a meaningful simplification.
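The BM25 scoring described above can be sketched with plain Okapi BM25 over whitespace tokens. CRW's tokenizer and k1/b parameters aren't documented here, so treat those details as assumptions:

```python
import math
from collections import Counter

def bm25_scores(query, chunks, k1=1.5, b=0.75):
    # Tokenize by lowercased whitespace split; a real system uses a proper tokenizer.
    docs = [c.lower().split() for c in chunks]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for doc in docs if term in doc)  # document frequency
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            f = tf[term]
            # Saturating term-frequency component with length normalization.
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

chunks = [
    "Authenticate with an API key in the Authorization header.",
    "Rate limits apply per account.",
    "API keys can be rotated from the dashboard.",
]
scores = bm25_scores("authenticate with api keys", chunks)
best = max(range(len(chunks)), key=scores.__getitem__)
print(best)  # -> 0
```

The chunk about authentication wins because it matches the rarest query terms; the rate-limit chunk shares no terms and scores zero.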
Stealth Mode
Some sites detect automated scrapers by inspecting HTTP headers. Default reqwest headers are a dead giveaway — missing Accept-Language, wrong Accept-Encoding order, no Sec-Fetch-* headers.
CRW v0.0.2 adds a "stealth": true option that rotates User-Agent strings from a pool of real Chrome, Firefox, and Safari signatures and injects 12 browser-like headers matching what a real browser sends:
{
  "url": "https://protected-site.com/data",
  "formats": ["markdown"],
  "stealth": true
}
Stealth mode also adds random jitter between requests during crawls — uniform request timing is another common detection signal.
This isn't a full anti-bot solution — it won't beat Cloudflare's JavaScript challenges or DataDome's fingerprinting. But it handles the 80% case: sites that do basic header inspection and block requests that look like scripts.
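The two stealth behaviors — UA rotation with browser-like headers, and jittered timing — can be sketched like this. The specific User-Agent pool, header values, and jitter range are illustrative assumptions, not values taken from CRW's source:

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:126.0) Gecko/20100101 Firefox/126.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

def stealth_headers():
    # Pair a randomly chosen UA with browser-like companion headers.
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
    }

def jittered_delay(base=1.0, jitter=0.5):
    # Randomize inter-request timing so crawl traffic isn't uniformly spaced.
    return max(base + random.uniform(-jitter, jitter), 0.0)

headers = stealth_headers()
delay = jittered_delay()
```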
Per-Request Proxy
You can now override the global proxy on a per-request basis:
{
  "url": "https://geo-restricted.com/content",
  "formats": ["markdown"],
  "proxy": "http://user:pass@us-residential.proxy.com:8080"
}
This is useful when you need different exit IPs for different target sites — residential proxies for protected sites, datacenter proxies for public documentation, direct connection for internal sites.
Markdown Quality Improvements
v0.0.2 switches the markdown converter from a custom implementation to htmd, a Rust port of Turndown.js. The result:
- Tables render correctly with proper pipe alignment
- Code blocks include language hints when the source HTML specifies them
- Nested lists (3+ levels) maintain correct indentation
- Headings inside code blocks are no longer misinterpreted as markdown headings
For LLM consumption, the most impactful fix is code block language hints. A fenced block with ```python gives the LLM explicit context about the code's language, improving code explanation and generation tasks.
One-Command JS Rendering Setup
crw-server setup
This downloads LightPanda (the lightweight headless browser) and creates config.local.toml with the right paths. Before v0.0.2, setting up JavaScript rendering required manually downloading LightPanda and configuring the CDP endpoint. Now it's one command.
Upgrade
# Docker
docker pull ghcr.io/user/crw:0.0.2
# Cargo
cargo install crw-server
# Binary
curl -L https://github.com/user/crw/releases/download/v0.0.2/crw-linux-x86_64 -o crw
v0.0.2 is backward-compatible with v0.0.1. All new fields are optional — existing API calls work without changes. See the documentation for full API reference.
What's Next
v0.0.3+ focuses on rendering reliability and coverage: better success/failure semantics for 4xx pages, improved CDP lifecycle handling, and crawl parameter normalization. Follow the GitHub repository for updates.