AI-Powered Structured Extraction from the Web

Pull typed JSON out of any web page with fastCRW — define a JSON Schema, call /v1/extract on managed cloud (or /v1/scrape + jsonSchema self-hosted), and skip the brittle selector layer entirely.

Published: May 27, 2026

Updated: May 27, 2026

Who this is for

Engineers tired of writing brittle CSS selectors for every product page, job listing, or directory entry they need to ingest. The site redesigns twice a year, your selector breaks every time, and the alert that fires at 3am is always the same one.

fastCRW's structured extraction replaces the selector layer with a JSON Schema. You describe the fields, the LLM does the locating, and the response is already shaped for the database.

Why fastCRW for extraction

Three things matter for production extraction: the schema is the contract, the call is cheap to retry, and the inference is handled for you.

On the managed cloud, POST /v1/extract (docs.fastcrw.com/api-reference/extract/) is a convenience wrapper over /v1/scrape with formats: ["json"]. Self-hosters call POST /v1/scrape with jsonSchema in the body and get the same typed object back — there is no feature gap, just a different endpoint name.
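
To make the endpoint difference concrete, here is a minimal sketch of how one helper might build the request for either deployment. The function name, the hypothetical self-hosted base URL in the usage note, and the use of a jsonSchema key in the /v1/extract body are assumptions (the text above only confirms jsonSchema for /v1/scrape and a urls array for /v1/extract), so check the extract reference before relying on them.

```python
from typing import Any


def build_extract_request(
    base_url: str, url: str, schema: dict[str, Any], managed: bool
) -> tuple[str, dict[str, Any]]:
    """Return (endpoint, json_body) for a single-URL structured extraction."""
    base = base_url.rstrip("/")
    if managed:
        # ASSUMPTION: /v1/extract takes the schema under "jsonSchema",
        # mirroring /v1/scrape. Only the "urls" array is stated in the article.
        return f"{base}/v1/extract", {"urls": [url], "jsonSchema": schema}
    # Self-hosted path as described in the article: /v1/scrape with
    # formats ["json"] and jsonSchema in the body.
    return f"{base}/v1/scrape", {
        "url": url,
        "formats": ["json"],
        "jsonSchema": schema,
    }
```

For example, build_extract_request("http://localhost:3000", page_url, schema, managed=False) would target a self-hosted instance at that (hypothetical) address, while passing the managed-cloud base URL with managed=True switches only the path and body shape.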

Extraction runs on fastCRW's managed LLM, so there is nothing to wire up: no provider credentials, no model selection, no separate inference account. The managed LLM is available on paid plans (the Free plan has no LLM features), and the cost is folded into fastCRW's per-credit pricing — one predictable line item instead of a separate model invoice to reconcile.

For sites that need a real browser (most JS-heavy product pages do), the renderer field picks between http, lightpanda, and chrome with an automatic fallback chain, so the extraction call works without you having to know which engine the page needs.
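
As a rough illustration, a scrape body that optionally pins the engine could be built like this. The renderer field name and its three values come from the paragraph above; placing it at the top level of the /v1/scrape body, and treating an omitted field as "let the fallback chain decide", are assumptions to verify against the scrape reference.

```python
from typing import Any, Optional

# Renderer names listed in the article.
VALID_RENDERERS = {"http", "lightpanda", "chrome"}


def scrape_payload(
    url: str, schema: dict[str, Any], renderer: Optional[str] = None
) -> dict[str, Any]:
    """Build a /v1/scrape JSON body for schema extraction."""
    payload: dict[str, Any] = {
        "url": url,
        "formats": ["json"],
        "jsonSchema": schema,
    }
    if renderer is not None:
        if renderer not in VALID_RENDERERS:
            raise ValueError(f"unknown renderer: {renderer!r}")
        # ASSUMPTION: "renderer" is a top-level key of the request body.
        payload["renderer"] = renderer
    # ASSUMPTION: when the key is absent, the automatic fallback chain
    # chooses the engine.
    return payload
```

Pinning chrome only for pages known to need a full browser, and leaving the field out otherwise, keeps the default path on whatever engine the service selects.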

The 5-step recipe

1. Describe the fields you want as a JSON Schema. Write a JSON Schema that mirrors the record you need: strings, numbers, enums, arrays. Be specific in field descriptions; the LLM uses them as inline prompts.
2. Pick managed /v1/extract or self-hosted /v1/scrape + jsonSchema. On the managed cloud, POST /v1/extract is a convenience wrapper. Self-hosters call POST /v1/scrape with formats ["json"] and jsonSchema in the body and get the same result.
3. Skip key management. Structured extraction runs on fastCRW's managed LLM, available on paid plans (the Free plan has no LLM features). You do not pass any model provider credentials; inference is handled for you and billed in credits.
4. Validate the returned JSON against the schema. The response's data.extract field carries the typed object. Re-validate client-side with ajv or pydantic so downstream code fails loudly on the rare extraction miss.
5. Extract from many URLs in one call. /v1/extract accepts a urls array (up to 50 URLs per request) and returns a results array; for larger jobs, iterate /v1/scrape concurrently from your worker, or run a /v1/crawl and apply the schema inside the result loop.

# extract_product.py — run with: python3 extract_product.py
import os
import requests
from pydantic import BaseModel, HttpUrl, Field, ValidationError

CRW = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRW_API_KEY']}"}

class Product(BaseModel):
    name: str = Field(description="Product display name")
    price_usd: float = Field(description="Current price in US dollars")
    in_stock: bool = Field(description="True if the buy button is enabled")
    image_url: HttpUrl | None = None

# Hand-written JSON Schema mirroring Product; sent to the API as jsonSchema.
JSON_SCHEMA = {
    "type": "object",
    "required": ["name", "price_usd", "in_stock"],
    "properties": {
        "name": {"type": "string", "description": "Product display name"},
        "price_usd": {"type": "number", "description": "Current price in US dollars"},
        "in_stock": {"type": "boolean", "description": "True if the buy button is enabled"},
        "image_url": {"type": "string", "format": "uri"},
    },
}

def extract(url: str) -> Product:
    r = requests.post(
        f"{CRW}/scrape",
        json={
            "url": url,
            "formats": ["json"],
            "jsonSchema": JSON_SCHEMA,
        },
        headers=HEADERS,
        timeout=90,
    )
    r.raise_for_status()
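    # The typed object is returned under data.extract.
    raw = r.json()["data"]["extract"]
    try:
        # model_validate also turns a null or non-object payload into a
        # ValidationError, where Product(**raw) would raise TypeError.
        return Product.model_validate(raw)
    except ValidationError as e:
        raise RuntimeError(f"Schema drift on {url}: {e}") from e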

if __name__ == "__main__":
    print(extract("https://example.com/products/widget-42").model_dump_json(indent=2))
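
Step 5 can be sketched in the same spirit. The helper below splits a URL list into request bodies that respect the 50-URL limit mentioned above. The function name is illustrative, and the jsonSchema key in the /v1/extract body is an assumption carried over from /v1/scrape rather than something this article confirms.

```python
from typing import Any, Iterable, Iterator

MAX_URLS_PER_REQUEST = 50  # per-request limit stated in the article


def batched_extract_bodies(
    urls: Iterable[str],
    schema: dict[str, Any],
    batch_size: int = MAX_URLS_PER_REQUEST,
) -> Iterator[dict[str, Any]]:
    """Yield /v1/extract bodies, each carrying at most batch_size URLs."""
    if not 1 <= batch_size <= MAX_URLS_PER_REQUEST:
        raise ValueError("batch_size must be between 1 and 50")
    # Drop duplicate URLs while keeping first-seen order.
    unique = list(dict.fromkeys(urls))
    for start in range(0, len(unique), batch_size):
        # ASSUMPTION: the schema travels under "jsonSchema" on /v1/extract.
        yield {
            "urls": unique[start:start + batch_size],
            "jsonSchema": schema,
        }
```

Each yielded body would be sent as its own POST to /v1/extract; deduplicating first avoids spending credits on repeated URLs.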

Next steps

Full schema-mode docs live at docs.fastcrw.com/api-reference/scrape/, and the managed extraction reference is at docs.fastcrw.com/api-reference/extract/; managed-cloud credit pricing for /v1/extract is on fastcrw.com/pricing. Managed LLM extraction is available on paid plans; the Free plan has no LLM features.

Sources

fastCRW /v1/extract reference (managed cloud)

https://docs.fastcrw.com/api-reference/extract/

fastCRW /v1/scrape reference (self-host path)

https://docs.fastcrw.com/api-reference/scrape/

JSON Schema specification

https://json-schema.org/

FAQ

Is /v1/extract available when I self-host?

No. /v1/extract is a managed-cloud-only convenience wrapper (1 scrape credit + metered managed-LLM cost) over /v1/scrape with formats ["json"]. Self-hosters get the same capability by calling /v1/scrape directly with jsonSchema; there is no feature gap, only a different endpoint name.

Do I need to bring my own LLM key for extraction?

No. Structured extraction runs on fastCRW's managed LLM — there is no key to pass and nothing to configure. The managed LLM is available on paid plans; the Free plan has no LLM features. Inference is handled for you and billed in credits.

Can I extract from many URLs at once?

Yes — /v1/extract accepts a urls array (up to 50 URLs per request) and returns a results array. For larger jobs, iterate /v1/scrape concurrently from your worker pool, or run /v1/crawl and apply your schema to each result. The async crawl job model is the recommended path above a few hundred URLs.
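
For the "iterate /v1/scrape concurrently" option, one possible shape is a small thread-pool fan-out that collects successes and failures separately. This is a sketch, not a recommended worker design: extract_one stands in for any single-URL function you already have (for instance the extract helper in extract_product.py), and the default of 8 workers is an arbitrary placeholder, not a documented rate limit.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Iterable


def extract_many(
    urls: Iterable[str],
    extract_one: Callable[[str], Any],
    max_workers: int = 8,  # placeholder; tune to your plan's limits
) -> tuple[dict[str, Any], dict[str, Exception]]:
    """Run extract_one over urls concurrently.

    Returns (results, errors), both keyed by URL, so one failing page
    does not abort the whole batch.
    """
    results: dict[str, Any] = {}
    errors: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so outcomes can be attributed.
        futures = {pool.submit(extract_one, u): u for u in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors
```

Keeping errors in their own dict makes it straightforward to retry only the failed URLs on a later pass.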
