Skip to main content
Use Cases/Use Case / AI Extraction

AI-Powered Structured Extraction from the Web

Pull typed JSON out of any web page with fastCRW — define a JSON Schema, call /v1/extract on managed cloud (or /v1/scrape + jsonSchema self-hosted), and skip the brittle selector layer entirely.

Published
May 27, 2026
Updated
May 27, 2026
Category
use cases
Schema-driven extraction — no CSS selectors, no XPathBYOK on managed cloud (OpenAI, Anthropic, DeepSeek, Azure)Single-URL extract today; iterate /v1/scrape for batched workloads

Who this is for

Engineers tired of writing brittle CSS selectors for every product page, job listing, or directory entry they need to ingest. The site redesigns twice a year, your selector breaks every time, and the alert that fires at 3am is always the same one.

fastCRW's structured extraction replaces the selector layer with a JSON Schema. You describe the fields, the LLM does the locating, and the response is already shaped for the database.

Why fastCRW for extraction

Three things matter for production extraction: the schema is the contract, the call is cheap to retry, and the LLM bill stays predictable.

On the managed cloud, POST /v1/extract (docs.fastcrw.com/api-reference/extract/) is a 5-credit convenience wrapper over /v1/scrape with formats: ["json"]. Self-hosters call POST /v1/scrape with jsonSchema in the body and get the same typed object back — there is no feature gap, just a different endpoint name (per marketing/CANONICAL-FACTS.md §4).

The BYOK model means you pass llmApiKey, llmProvider, and llmModel yourself. fastCRW never holds the key, never marks up the model bill, and never bundles inference into its own per-credit pricing. That keeps the extraction cost auditable: it is your OpenAI invoice plus fastCRW's infrastructure charge, line-itemed.

For sites that need a real browser (most JS-heavy product pages do), the renderer field picks between http, lightpanda, and chrome with an automatic fallback chain, so the extraction call works without you having to know which engine the page needs.

The 5-step recipe

  1. Describe the fields you want as a JSON Schema. Write a JSON Schema that mirrors the record you need — strings, numbers, enums, arrays. Be specific in field descriptions; the LLM uses them as inline prompts.
  2. Pick managed /v1/extract or self-hosted /v1/scrape + jsonSchema. On the managed cloud, POST /v1/extract is a 5-credit convenience wrapper. Self-hosters call POST /v1/scrape with formats ["json"] and jsonSchema in the body — same result, no convenience tax.
  3. Supply your own LLM provider credentials. fastCRW is BYOK — pass llmApiKey, llmProvider (openai, anthropic, deepseek, azure), and llmModel in the request. You pay your model bill directly; fastCRW never proxies the key.
  4. Validate the returned JSON against the schema. The response data.extract field carries the typed object. Re-validate client-side with ajv or pydantic so downstream code can fail loudly on the rare extraction miss.
  5. Iterate over many URLs without a batch endpoint. There is no /v1/batch/extract — iterate /v1/scrape concurrently from your worker, or run a /v1/crawl and apply the schema inside the result loop.
# extract_product.py — run with: python3 extract_product.py
import os
import requests
from pydantic import BaseModel, HttpUrl, Field, ValidationError

CRW = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRW_API_KEY']}"}

class Product(BaseModel):
    name: str = Field(description="Product display name")
    price_usd: float = Field(description="Current price in US dollars")
    in_stock: bool = Field(description="True if the buy button is enabled")
    image_url: HttpUrl | None = None

JSON_SCHEMA = {
    "type": "object",
    "required": ["name", "price_usd", "in_stock"],
    "properties": {
        "name": {"type": "string", "description": "Product display name"},
        "price_usd": {"type": "number", "description": "Current price in US dollars"},
        "in_stock": {"type": "boolean", "description": "True if the buy button is enabled"},
        "image_url": {"type": "string", "format": "uri"},
    },
}

def extract(url: str) -> Product:
    r = requests.post(
        f"{CRW}/scrape",
        json={
            "url": url,
            "formats": ["json"],
            "jsonSchema": JSON_SCHEMA,
            "llmProvider": "openai",
            "llmModel": "gpt-4o-mini",
            "llmApiKey": os.environ["OPENAI_API_KEY"],
        },
        headers=HEADERS,
        timeout=90,
    )
    r.raise_for_status()
    raw = r.json()["data"]["extract"]
    try:
        return Product(**raw)
    except ValidationError as e:
        raise RuntimeError(f"Schema drift on {url}: {e}") from e

if __name__ == "__main__":
    print(extract("https://example.com/products/widget-42").model_dump_json(indent=2))

Next steps

Full schema-mode docs and the BYOK provider matrix live at docs.fastcrw.com/api-reference/scrape/; managed-cloud credit pricing for /v1/extract is on fastcrw.com/pricing. Self-host the binary and bring your own model key to keep the entire extraction path inside your own cost stack.

Continue exploring

More from Use Cases

View all use cases

Related hubs

Keep the crawl path moving