
Scrape Endpoint Guide

How to use the fastCRW scrape flow to turn a single URL into markdown, HTML, or structured data.

Published: March 11, 2026
Updated: March 11, 2026
Category: docs
Highlights: Firecrawl-compatible scrape shape · Markdown-first workflow · Structured extraction support

Overview

Use scrape when you want one page turned into usable content without starting a wider crawl job. It is the right default for:

  • first-pass evaluation,
  • RAG ingestion from known URLs,
  • extraction pipelines,
  • and agent workflows that already know which page to fetch.

A minimal request:

curl -X POST https://fastcrw.com/api/v1/scrape \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"url":"https://example.com","formats":["markdown"]}'

A Good Default Request

If you are not sure where to start, use this shape first:

{
  "url": "https://example.com",
  "formats": ["markdown"],
  "onlyMainContent": true,
  "renderJs": null
}

That gives you clean markdown output, keeps extraction focused on the main body, and leaves the decision about JavaScript rendering to the engine (renderJs: null means auto-detect).
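For readers calling the endpoint from application code rather than curl, the same default request can be sketched in Python with only the standard library. The endpoint URL and headers come from the curl example above. The function names and the 30-second timeout are illustrative assumptions, not part of the fastCRW API.

```python
import json
import urllib.request

API_URL = "https://fastcrw.com/api/v1/scrape"  # endpoint from the curl example


def build_default_payload(url: str) -> dict:
    """Return the recommended starting payload for a single URL."""
    return {
        "url": url,
        "formats": ["markdown"],
        "onlyMainContent": True,
        "renderJs": None,  # serialized as JSON null: the engine auto-detects
    }


def scrape(payload: dict, api_key: str, timeout: float = 30.0) -> dict:
    """POST the payload and return the decoded JSON body.

    This performs a real network call. The timeout value is an assumption.
    """
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call would look like `scrape(build_default_payload("https://example.com"), "YOUR_API_KEY")`. Keeping payload construction separate from the network call makes the request shape easy to inspect and test offline.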

Parameters

| Field | Type | Default | Description |
|---|---|---|---|
| url | string | required | The target page URL |
| formats | string[] | ["markdown"] | Output formats: markdown, html, rawHtml, plainText, links, json, extract |
| onlyMainContent | boolean | true | Extract primary content area only (removes nav, footer, sidebar) |
| renderJs | boolean \| null | null | true = force JS rendering, false = skip, null = auto-detect |
| waitFor | number | | Milliseconds to wait after JS rendering |
| cssSelector | string | | CSS selector to narrow content |
| xpath | string | | XPath expression to narrow content |
| includeTags | string[] | [] | Only include these HTML tags |
| excludeTags | string[] | [] | Remove these HTML tags |
| jsonSchema | object | | JSON Schema for structured extraction (requires formats to include json) |
| headers | object | {} | Custom HTTP headers to send with the request |
| stealth | boolean | | Override stealth mode for this request. When true, rotates user-agent from a realistic browser pool and injects standard browser headers |
| proxy | string | | Per-request HTTP proxy URL |
| chunkStrategy | object | | Chunking config: { "type": "sentence" \| "regex" \| "topic", "maxChars": 1000 } |
| query | string | | Query for BM25/cosine chunk filtering |
| filterMode | string | | "bm25" (keyword density with saturation) or "cosine" (TF-IDF vector similarity). BM25 recommended for most use cases |
| topK | number | 5 | Number of top chunks to return when filtering |
| llmApiKey | string | | Per-request LLM API key for structured extraction (BYOK). Overrides server config |
| llmProvider | string | "anthropic" | LLM provider: "anthropic" or "openai" |
| llmModel | string | "claude-sonnet-4-20250514" | Model to use for structured extraction |

Choosing the Right Formats

Most integrations only need one of these patterns:

  • ["markdown"] for retrieval, search, summarization, and LLM inputs.
  • ["markdown", "links"] when you want the content plus outbound link discovery.
  • ["html"] when you need cleaned markup instead of markdown.
  • ["rawHtml"] when downstream logic expects the original HTML source.
  • ["json"] when you are doing schema-driven extraction.

Requesting more formats is convenient for debugging, but in production it is better to ask only for what you will actually store or process.
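One way to keep production requests narrow is to pin the format list to a named use case instead of passing ad hoc lists around. The sketch below encodes the five patterns above in a lookup table. The use-case labels are hypothetical names chosen for illustration, and only the format lists themselves come from this guide.

```python
# Hypothetical use-case labels mapped to the format lists described above.
FORMATS_BY_USE_CASE = {
    "retrieval": ["markdown"],
    "link_discovery": ["markdown", "links"],
    "clean_markup": ["html"],
    "original_source": ["rawHtml"],
    "schema_extraction": ["json"],
}


def formats_for(use_case: str) -> list:
    """Return a fresh copy of the format list for a known use case."""
    if use_case not in FORMATS_BY_USE_CASE:
        known = ", ".join(sorted(FORMATS_BY_USE_CASE))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {known}")
    # Copy so callers cannot mutate the shared table by accident.
    return list(FORMATS_BY_USE_CASE[use_case])
```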

Targeting the Right Part of a Page

The default extraction path works well for many pages, but it is not magic. If you know the site structure, tighten the request:

  • use cssSelector when there is a stable content container,
  • use xpath when selectors are easier to express that way,
  • use includeTags and excludeTags to keep or remove specific markup families,
  • and leave onlyMainContent on unless you explicitly want navigation, footer, or sidebar content.

The common mistake is combining too many narrowing options at once. Start broad, inspect the result, then add one targeting primitive at a time.
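The "one primitive at a time" advice can be sketched as a small helper that returns a narrowed copy of a payload, so each step can be sent and inspected before the next one is added. The helper name, the example URL, and the selector `article.post-body` are hypothetical. The four field names are the ones listed in the parameters table.

```python
TARGETING_FIELDS = ("cssSelector", "xpath", "includeTags", "excludeTags")


def add_targeting(payload: dict, field: str, value) -> dict:
    """Return a copy of payload with exactly one more targeting field set."""
    if field not in TARGETING_FIELDS:
        raise ValueError(f"not a targeting field: {field}")
    if field in payload:
        raise ValueError(f"{field} is already set; inspect the result before changing it")
    narrowed = dict(payload)  # shallow copy keeps the earlier step intact
    narrowed[field] = value
    return narrowed


# Start broad, then narrow one step at a time.
broad = {
    "url": "https://example.com/blog/post",  # hypothetical page
    "formats": ["markdown"],
    "onlyMainContent": True,
}
step_1 = add_targeting(broad, "cssSelector", "article.post-body")  # hypothetical selector
step_2 = add_targeting(step_1, "excludeTags", ["aside"])
```

Because every step is a separate dictionary, a regression in output quality can be traced to the single field that was added last.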

JS Rendering Guidance

Use renderJs: true only when the page clearly needs a browser. Browser rendering increases latency and operational cost, so treat it as a deliberate choice rather than the universal default.

When you do need it:

  • set renderJs: true,
  • start with waitFor: 1000 or 2000,
  • and raise waitFor only when the page still hydrates too slowly.

If the response metadata shows an HTTP-only fallback or the output is suspiciously empty, read the JS rendering guide.
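The escalation described above (engine default first, then forced rendering with a growing wait) can be sketched as a generator of payloads plus a loop that stops at the first usable result. Several details here are assumptions rather than documented behavior: the markdown is assumed to live at `data.markdown`, the 200-character emptiness threshold and the 4000 ms ceiling are arbitrary, and `send` stands in for whatever function actually posts the payload.

```python
from typing import Callable, Iterator, Optional


def render_attempts(url: str, max_wait_ms: int = 4000) -> Iterator[dict]:
    """Yield payloads from cheapest to most expensive."""
    base = {"url": url, "formats": ["markdown"]}
    yield {**base, "renderJs": None}  # let the engine auto-detect first
    wait_ms = 1000
    while wait_ms <= max_wait_ms:
        yield {**base, "renderJs": True, "waitFor": wait_ms}
        wait_ms *= 2  # 1000, 2000, 4000 with the default ceiling


def looks_empty(response: dict, min_chars: int = 200) -> bool:
    """Heuristic: treat very short markdown as a failed render (assumed shape)."""
    markdown = (response.get("data") or {}).get("markdown") or ""
    return len(markdown.strip()) < min_chars


def scrape_with_escalation(url: str, send: Callable[[dict], dict]) -> Optional[dict]:
    """Try each payload in order and return the first non-empty success."""
    for payload in render_attempts(url):
        response = send(payload)
        if response.get("success") and not looks_empty(response):
            return response
    return None
```

Returning `None` after the last attempt leaves the caller to decide whether to log, queue the URL for review, or consult the JS rendering guide.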

Chunking & Filtering Behavior

  • chunkStrategy alone splits the markdown and returns all chunks.
  • chunkStrategy + query + filterMode scores and ranks chunks, returning the top topK.
  • topK without query/filterMode still truncates the chunk array to topK items (no scoring).
  • query or filterMode without chunkStrategy is silently ignored — chunking must be enabled first.

In practice:

  • use sentence when you want stable natural-language chunks,
  • use regex when you already know the structural separator,
  • and treat topic chunking as an advanced option that should be tested on real data before wide rollout.
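Because query and filterMode are silently ignored when chunkStrategy is missing, it can help to build chunking payloads through one function and lint hand-written payloads before sending them. The sketch below follows the rules above. The function names are hypothetical, and any extra fields that the regex or topic strategies might need are not covered by this guide, so they are left out.

```python
CHUNK_TYPES = ("sentence", "regex", "topic")
FILTER_MODES = ("bm25", "cosine")


def build_chunked_payload(url, chunk_type="sentence", max_chars=1000,
                          query=None, filter_mode="bm25", top_k=5) -> dict:
    """Build a payload that chunks, and ranks chunks only when a query is given."""
    if chunk_type not in CHUNK_TYPES:
        raise ValueError(f"unknown chunk type: {chunk_type}")
    payload = {
        "url": url,
        "formats": ["markdown"],
        "chunkStrategy": {"type": chunk_type, "maxChars": max_chars},
    }
    if query is not None:
        if filter_mode not in FILTER_MODES:
            raise ValueError(f"unknown filter mode: {filter_mode}")
        # Scoring needs all three: query, filterMode, and topK.
        payload.update({"query": query, "filterMode": filter_mode, "topK": top_k})
    return payload


def lint_chunking(payload: dict) -> list:
    """Return human-readable problems for combinations the API would ignore."""
    problems = []
    wants_filter = "query" in payload or "filterMode" in payload
    if wants_filter and "chunkStrategy" not in payload:
        problems.append("query/filterMode are ignored without chunkStrategy")
    return problems
```

For example, `build_chunked_payload("https://example.com", query="pricing")` yields a sentence-chunked request that returns the five best BM25-ranked chunks, while `lint_chunking({"url": "https://example.com", "query": "pricing"})` reports the ignored query.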

Structured Extraction from scrape

You do not need a separate endpoint for extraction. The scrape endpoint can also return schema-shaped JSON when formats includes json and jsonSchema is present.

That means a single API surface can support:

  • markdown for retrieval,
  • links for discovery,
  • and JSON for downstream application logic.

If your schema is the primary output, read the dedicated Structured extraction guide.
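A minimal sketch of a schema-driven request, assuming a hypothetical product page: the schema below is invented for illustration, and only the formats and jsonSchema field names come from the parameters table. The helper always includes json in formats because jsonSchema requires it.

```python
# Hypothetical JSON Schema describing the fields to extract.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "inStock": {"type": "boolean"},
    },
    "required": ["name"],
}


def build_extraction_payload(url: str, schema: dict, also_markdown: bool = False) -> dict:
    """Build a scrape payload that returns schema-shaped JSON."""
    formats = ["json"]  # jsonSchema requires formats to include json
    if also_markdown:
        formats.insert(0, "markdown")  # same call can also feed retrieval
    return {"url": url, "formats": formats, "jsonSchema": schema}


payload = build_extraction_payload(
    "https://example.com/products/widget",  # hypothetical URL
    PRODUCT_SCHEMA,
    also_markdown=True,
)
```

Requesting markdown alongside json in one call is what lets a single request serve both retrieval and downstream application logic.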

Response Semantics

The main response pattern is:

  • success for overall request outcome,
  • data for returned content,
  • warning for degraded but non-fatal situations,
  • and metadata for context such as title, status code, final URL, and elapsed time.

Do not ignore warnings. A page blocked by anti-bot protection can still produce content that looks valid at first glance.
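A response handler that refuses to drop warnings might look like the sketch below. The four field names (success, data, warning, metadata) come from the list above, but this guide does not say whether warning and metadata sit at the top level or inside data, so the code checks both places. That lookup order and the `ScrapeFailed` exception are assumptions.

```python
class ScrapeFailed(Exception):
    """Raised when the response reports success=false (hypothetical helper)."""


def unpack_response(response: dict) -> dict:
    """Split a scrape response into content, warning, and metadata."""
    if not response.get("success"):
        raise ScrapeFailed("scrape reported success=false")
    data = response.get("data") or {}
    # Nesting is not specified in this guide: look at the top level, then in data.
    warning = response.get("warning") or data.get("warning")
    metadata = response.get("metadata") or data.get("metadata") or {}
    return {
        "data": data,
        "warning": warning,
        "metadata": metadata,
        "degraded": warning is not None,  # surface warnings instead of hiding them
    }
```

Callers can then branch on `degraded` before storing content, for example by routing degraded pages to a review queue instead of straight into an index.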

Why This Endpoint Matters

The scrape flow is the foundation for:

  • RAG ingestion,
  • product page extraction,
  • AI-agent browsing loops,
  • and first-pass evaluation in the playground.

Use the playground if you want to validate output before wiring the endpoint into production, then move to curl, scripts, or your application code once the payload shape looks right.