AI-Powered Structured Extraction from the Web
Pull typed JSON out of any web page with fastCRW — define a JSON Schema, call /v1/extract on managed cloud (or /v1/scrape + jsonSchema self-hosted), and skip the brittle selector layer entirely.
Who this is for
Engineers tired of writing brittle CSS selectors for every product page, job listing, or directory entry they need to ingest. The site redesigns twice a year, your selector breaks every time, and the alert that fires at 3am is always the same one.
fastCRW's structured extraction replaces the selector layer with a JSON Schema. You describe the fields, the LLM does the locating, and the response is already shaped for the database.
Why fastCRW for extraction
Three things matter for production extraction: the schema is the contract, the call is cheap to retry, and the LLM bill stays predictable.
On the managed cloud, POST /v1/extract (docs.fastcrw.com/api-reference/extract/)
is a 5-credit convenience wrapper over /v1/scrape with formats: ["json"].
Self-hosters call POST /v1/scrape with jsonSchema in the body and get the
same typed object back. There is no feature gap, just a different endpoint
name.
The BYOK model means you pass llmApiKey, llmProvider, and llmModel
yourself. fastCRW never holds the key, never marks up the model bill, and
never bundles inference into its own per-credit pricing. That keeps the
extraction cost auditable: it is your model provider's invoice plus
fastCRW's infrastructure charge, each itemized separately.
For sites that need a real browser (most JS-heavy product pages do), the
renderer field picks between http, lightpanda, and chrome with an
automatic fallback chain, so the extraction call works without you having
to know which engine the page needs.
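As a rough illustration of the renderer field, the sketch below builds a /v1/scrape request body and optionally pins one engine. The helper name build_scrape_payload is hypothetical, and the sketch assumes that omitting renderer leaves the automatic fallback chain in charge; check the API reference before relying on that behavior.

```python
from typing import Any, Dict, Optional

# Renderer names are taken from the text above; the helper itself is illustrative.
RENDERERS = ("http", "lightpanda", "chrome")


def build_scrape_payload(
    url: str,
    json_schema: Dict[str, Any],
    renderer: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a /v1/scrape body for schema extraction.

    Assumption: leaving `renderer` out lets the service pick an engine
    through its fallback chain, so the field is only set when pinned.
    """
    payload: Dict[str, Any] = {
        "url": url,
        "formats": ["json"],
        "jsonSchema": json_schema,
    }
    if renderer is not None:
        if renderer not in RENDERERS:
            raise ValueError(f"renderer must be one of {RENDERERS}, got {renderer!r}")
        payload["renderer"] = renderer
    return payload
```

Pinning an engine is mainly useful when you already know a page needs a full browser and want to skip the cheaper attempts; otherwise leaving the field unset keeps the call engine-agnostic.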
The 5-step recipe
- Describe the fields you want as a JSON Schema. Write a JSON Schema that mirrors the record you need — strings, numbers, enums, arrays. Be specific in field descriptions; the LLM uses them as inline prompts.
- Pick managed /v1/extract or self-hosted /v1/scrape + jsonSchema. On the managed cloud, POST /v1/extract is a 5-credit convenience wrapper. Self-hosters call POST /v1/scrape with formats ["json"] and jsonSchema in the body — same result, no convenience tax.
- Supply your own LLM provider credentials. fastCRW is BYOK — pass llmApiKey, llmProvider (openai, anthropic, deepseek, azure), and llmModel in the request. You pay your model bill directly; fastCRW never proxies the key.
- Validate the returned JSON against the schema. The response data.extract field carries the typed object. Re-validate client-side with ajv or pydantic so downstream code can fail loudly on the rare extraction miss.
- Iterate over many URLs without a batch endpoint. There is no /v1/batch/extract — iterate /v1/scrape concurrently from your worker, or run a /v1/crawl and apply the schema inside the result loop.
# extract_product.py (run with: python3 extract_product.py)
import os

import requests
from pydantic import BaseModel, Field, HttpUrl, ValidationError

CRW = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRW_API_KEY']}"}


class Product(BaseModel):
    """Client-side model used to re-validate the extracted record."""

    name: str = Field(description="Product display name")
    price_usd: float = Field(description="Current price in US dollars")
    in_stock: bool = Field(description="True if the buy button is enabled")
    image_url: HttpUrl | None = None


# JSON Schema sent to fastCRW; the field descriptions double as inline prompts.
JSON_SCHEMA = {
    "type": "object",
    "required": ["name", "price_usd", "in_stock"],
    "properties": {
        "name": {"type": "string", "description": "Product display name"},
        "price_usd": {"type": "number", "description": "Current price in US dollars"},
        "in_stock": {"type": "boolean", "description": "True if the buy button is enabled"},
        "image_url": {"type": "string", "format": "uri"},
    },
}


def extract(url: str) -> Product:
    """Scrape one URL in schema mode and return a validated Product."""
    r = requests.post(
        f"{CRW}/scrape",
        json={
            "url": url,
            "formats": ["json"],
            "jsonSchema": JSON_SCHEMA,
            "llmProvider": "openai",
            "llmModel": "gpt-4o-mini",
            "llmApiKey": os.environ["OPENAI_API_KEY"],
        },
        headers=HEADERS,
        timeout=90,
    )
    r.raise_for_status()
    raw = r.json()["data"]["extract"]
    try:
        return Product(**raw)
    except ValidationError as e:
        # Fail loudly so downstream code never sees a half-typed record.
        raise RuntimeError(f"Schema drift on {url}: {e}") from e


if __name__ == "__main__":
    print(extract("https://example.com/products/widget-42").model_dump_json(indent=2))
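Step 5 of the recipe, iterating over many URLs without a batch endpoint, can be sketched with a thread pool from the standard library. The helper name extract_many and the max_workers default are illustrative choices, not part of fastCRW; the function takes any single-URL callable, such as the extract function above, and collects failures per URL instead of aborting the whole run.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple, TypeVar

T = TypeVar("T")


def extract_many(
    urls: Iterable[str],
    extract_one: Callable[[str], T],
    max_workers: int = 8,  # illustrative default; tune to your rate limits
) -> Tuple[Dict[str, T], Dict[str, Exception]]:
    """Run extract_one over urls concurrently.

    Returns (results, failures), both keyed by URL, so one bad page
    does not discard the records that extracted cleanly.
    """
    results: Dict[str, T] = {}
    failures: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so errors can be attributed.
        futures = {pool.submit(extract_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going; report per-URL failures
                failures[url] = exc
    return results, failures
```

A call such as results, failures = extract_many(product_urls, extract) then leaves you with validated records in results and the URLs worth retrying in failures. Threads suit this workload because each call spends most of its time waiting on the network.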
Next steps
Full schema-mode docs and the BYOK provider matrix live at
docs.fastcrw.com/api-reference/scrape/; managed-cloud credit pricing for
/v1/extract is on fastcrw.com/pricing. Self-host the binary and bring your
own model key to keep the entire extraction path inside your own cost stack.