Overview

Use scrape when you want one page turned into usable content without starting a wider crawl job. It is the right default for:

first-pass evaluation,
RAG ingestion from known URLs,
extraction pipelines,
and agent workflows that already know which page to fetch.

curl -X POST https://fastcrw.com/api/v1/scrape \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"url":"https://example.com","formats":["markdown"]}'
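
For callers working from Python rather than a shell, the same call can be sketched with the standard library alone. The endpoint, headers, and body mirror the curl example above; the helper name `build_scrape_request` is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request

SCRAPE_URL = "https://fastcrw.com/api/v1/scrape"  # endpoint from the curl example


def build_scrape_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl call."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        SCRAPE_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


request = build_scrape_request(
    "YOUR_API_KEY",
    {"url": "https://example.com", "formats": ["markdown"]},
)
# To actually send it: urllib.request.urlopen(request, timeout=30)
```

Keeping construction separate from sending makes it easy to inspect the exact headers and body before any network traffic happens.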

A Good Default Request

If you are not sure where to start, use this shape first:

{
  "url": "https://example.com",
  "formats": ["markdown"],
  "onlyMainContent": true,
  "renderJs": null
}

That gives you a clean markdown output, keeps extraction focused on the main body, and leaves JavaScript rendering to the engine's default behavior.

Parameters

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `url` | `string` | required | The target page URL |
| `formats` | `string[]` | `["markdown"]` | Output formats: `markdown`, `html`, `rawHtml`, `plainText`, `links`, `json`, `extract` |
| `onlyMainContent` | `boolean` | `true` | Extract primary content area only (removes nav, footer, sidebar) |
| `renderJs` | `boolean \| null` | `null` | `true` = force JS rendering, `false` = skip, `null` = auto-detect |
| `waitFor` | `number` | - | Milliseconds to wait after JS rendering |
| `cssSelector` | `string` | - | CSS selector to narrow content |
| `xpath` | `string` | - | XPath expression to narrow content |
| `includeTags` | `string[]` | `[]` | Only include these HTML tags |
| `excludeTags` | `string[]` | `[]` | Remove these HTML tags |
| `jsonSchema` | `object` | - | JSON Schema for structured extraction (requires `formats` to include `json`) |
| `headers` | `object` | `{}` | Custom HTTP headers to send with the request |
| `stealth` | `boolean` | - | Override stealth mode for this request. When `true`, rotates user-agent from a realistic browser pool and injects standard browser headers |
| `proxy` | `string` | - | Per-request HTTP proxy URL |
| `chunkStrategy` | `object` | - | Chunking config: `{ "type": "sentence" \| "regex" \| "topic", "maxChars": 1000 }` |
| `query` | `string` | - | Query for BM25/cosine chunk filtering |
| `filterMode` | `string` | - | `"bm25"` (keyword density with saturation) or `"cosine"` (TF-IDF vector similarity). BM25 recommended for most use cases |
| `topK` | `number` | `5` | Number of top chunks to return when filtering |
| `llmApiKey` | `string` | - | Per-request LLM API key for structured extraction (BYOK). Overrides server config |
| `llmProvider` | `string` | `"anthropic"` | LLM provider: `"anthropic"` or `"openai"` |
| `llmModel` | `string` | `"claude-sonnet-4-20250514"` | Model to use for structured extraction |

Choosing the Right Formats

Most integrations only need one of these patterns:

["markdown"] for retrieval, search, summarization, and LLM inputs.
["markdown", "links"] when you want the content plus outbound link discovery.
["html"] when you need cleaned markup instead of markdown.
["rawHtml"] when downstream logic expects the original HTML source.
["json"] when you are doing schema-driven extraction.

Requesting more formats is convenient for debugging, but in production it is better to ask only for what you will actually store or process.

Targeting the Right Part of a Page

The default extraction path works well for many pages, but it is not magic. If you know the site structure, tighten the request:

use cssSelector when there is a stable content container,
use xpath when selectors are easier to express that way,
use includeTags and excludeTags to keep or remove specific markup families,
and leave onlyMainContent on unless you explicitly want navigation, footer, or sidebar content.

The common mistake is combining too many narrowing options at once. Start broad, inspect the result, then add one targeting primitive at a time.
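
One way to keep the "one primitive at a time" discipline is to derive each narrower payload from the previous one. This is a minimal sketch: the helper `narrow` and the selector `article.post-body` are made up for illustration, and only the field names come from the parameter table.

```python
import copy

BASE_PAYLOAD = {
    "url": "https://example.com",
    "formats": ["markdown"],
    "onlyMainContent": True,
}

TARGETING_FIELDS = {"cssSelector", "xpath", "includeTags", "excludeTags"}


def narrow(payload: dict, field: str, value) -> dict:
    """Return a copy of payload with exactly one extra targeting field set."""
    if field not in TARGETING_FIELDS:
        raise ValueError(f"not a targeting field: {field}")
    narrowed = copy.deepcopy(payload)
    narrowed[field] = value
    return narrowed


# Step 1: send BASE_PAYLOAD as-is and inspect the output.
# Step 2: add a single primitive (hypothetical selector).
step_two = narrow(BASE_PAYLOAD, "cssSelector", "article.post-body")
# Step 3: only if the result is still noisy, add one more on top.
step_three = narrow(step_two, "excludeTags", ["aside"])
```

Because each step is a copy, the broader payloads stay available for comparison when a narrower one returns less than expected.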

JS Rendering Guidance

Use renderJs: true only when the page clearly needs a browser. Browser rendering increases latency and operational cost, so treat it as a deliberate choice rather than the universal default.

When you do need it:

set renderJs: true,
start with waitFor: 1000 or 2000,
and raise waitFor only when the page still hydrates too slowly.

If the response metadata shows an HTTP-only fallback or the output is suspiciously empty, read the JS rendering guide.
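
The escalation described above can be sketched as a small retry loop: try the engine default first, then force rendering with increasing waits. Several things here are assumptions rather than documented behavior: the `fetch` callable (any function that posts a payload and returns the parsed JSON response), the `data.markdown` response path, and the `min_chars` threshold used as a stand-in for "suspiciously empty".

```python
from typing import Callable, Optional


def scrape_with_js_escalation(
    fetch: Callable[[dict], dict],
    url: str,
    wait_steps: tuple = (1000, 2000),
    min_chars: int = 200,  # assumed heuristic for "suspiciously empty"
) -> Optional[dict]:
    """Try the engine default, then force JS rendering with growing waits."""
    attempts = [{"url": url, "formats": ["markdown"], "renderJs": None}]
    attempts += [
        {"url": url, "formats": ["markdown"], "renderJs": True, "waitFor": ms}
        for ms in wait_steps
    ]
    for payload in attempts:
        response = fetch(payload)
        # Assumed response shape: markdown text lives under data.markdown.
        markdown = (response.get("data") or {}).get("markdown") or ""
        if response.get("success") and len(markdown) >= min_chars:
            return response
    return None  # every attempt came back empty or failed
```

Returning `None` instead of the last response makes the "still empty after rendering" case explicit, which is the point at which the JS rendering guide becomes relevant.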

Chunking & Filtering Behavior

chunkStrategy alone splits the markdown and returns all chunks.
chunkStrategy + query + filterMode scores and ranks chunks, returning the top topK.
topK without query/filterMode still truncates the chunk array to topK items (no scoring).
query or filterMode without chunkStrategy is silently ignored — chunking must be enabled first.

In practice:

use sentence when you want stable natural-language chunks,
use regex when you already know the structural separator,
and treat topic chunking as an advanced option that should be tested on real data before wide rollout.
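
The four combination rules listed above are easy to misread, so the sketch below encodes them as a small classifier that can be run against a payload before it is sent. The function name and the returned labels are illustrative only. Combinations this guide does not describe (for example `query` without `filterMode`) are reported as unspecified rather than guessed, and the query string is a made-up example.

```python
def chunking_outcome(payload: dict) -> str:
    """Classify a request according to the chunking rules in this section."""
    has_chunking = "chunkStrategy" in payload
    has_query = "query" in payload
    has_filter = "filterMode" in payload
    has_top_k = "topK" in payload

    if not has_chunking:
        # query/filterMode are silently ignored without chunkStrategy.
        return "ignored" if (has_query or has_filter) else "no chunking"
    if has_query and has_filter:
        return "ranked top-k"
    if has_query or has_filter:
        return "unspecified"  # only one of the pair: not covered here
    if has_top_k:
        return "truncated, unscored"
    return "all chunks"


ranked_request = {
    "url": "https://example.com",
    "formats": ["markdown"],
    "chunkStrategy": {"type": "sentence", "maxChars": 1000},
    "query": "pricing tiers",  # hypothetical query
    "filterMode": "bm25",
    "topK": 5,
}
```

A check like this is mainly useful for catching the silent case, where `query` is set but `chunkStrategy` was forgotten and the filter never runs.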

Structured Extraction from `scrape`

You do not need a separate endpoint for extraction. scrape can also return schema-shaped JSON when formats includes json and jsonSchema is present.

That means a single API surface can support:

markdown for retrieval,
links for discovery,
and JSON for downstream application logic.

If your schema is the primary output, read the dedicated Structured extraction guide.
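
As a rough illustration of the pairing described above, the helper below builds a payload that always carries both `json` in `formats` and a `jsonSchema`. The helper name, the product URL, and the schema fields are hypothetical; only the two field names and their relationship come from this guide.

```python
def build_extraction_payload(url: str, schema: dict, formats=("markdown",)) -> dict:
    """Return a scrape payload that satisfies the json + jsonSchema pairing."""
    merged = list(formats)
    if "json" not in merged:
        merged.append("json")  # jsonSchema requires "json" in formats
    return {"url": url, "formats": merged, "jsonSchema": schema}


# Hypothetical schema for a product page.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["name"],
}

payload = build_extraction_payload("https://example.com/product", product_schema)
```

Requesting `markdown` alongside `json` is a choice made here for easier debugging; drop it if the schema output is all you store.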

Response Semantics

The main response pattern is:

success for overall request outcome,
data for returned content,
warning for degraded but non-fatal situations,
and metadata for context such as title, status code, final URL, and elapsed time.

Do not ignore warnings. A page blocked by anti-bot protection can still produce content that looks valid at first glance.
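
A minimal sketch of handling that pattern in client code follows. It assumes the response is a JSON object with `success`, `data`, and `warning` keys as named above. The `error` key, the exception class, and the lookup of `metadata` under `data` (with a top-level fallback) are assumptions, since this guide does not spell out the exact nesting.

```python
import logging

logger = logging.getLogger("scrape_client")


class ScrapeFailed(Exception):
    """Raised when the response reports an unsuccessful request."""


def unpack_response(response: dict) -> tuple:
    """Return (data, metadata) and surface any warning instead of dropping it."""
    if not response.get("success"):
        # "error" is an assumed field name for the failure message.
        raise ScrapeFailed(str(response.get("error", "scrape request failed")))
    warning = response.get("warning")
    if warning:
        logger.warning("scrape returned a warning: %s", warning)
    data = response.get("data") or {}
    # Assumed nesting: metadata under data, falling back to the top level.
    metadata = data.get("metadata") or response.get("metadata") or {}
    return data, metadata
```

Logging the warning at the point of unpacking keeps degraded responses visible even when the returned content looks plausible.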

Why This Endpoint Matters

The scrape flow is the foundation for:

RAG ingestion,
product page extraction,
AI-agent browsing loops,
and first-pass evaluation in the playground.

Use the playground if you want to validate output before wiring the endpoint into production, then move to curl, scripts, or your application code once the payload shape looks right.
