What is the difference between Diffbot and fastCRW?

Diffbot uses pre-trained computer-vision and ML models to auto-classify a page (article, product, discussion) and extract a fixed schema for that type, feeding a Knowledge Graph. fastCRW lets you define your own JSON schema, then fills it from the scraped page using a managed LLM (paid plans), on a Firecrawl-compatible engine you can self-host under AGPL-3.0. Diffbot is turnkey and cloud-only; fastCRW is schema-controlled and self-hostable.

Does fastCRW use computer vision like Diffbot for extraction?

No. fastCRW does not run computer-vision page classifiers. It extracts structured data by passing the scraped page content and your JSON schema to an LLM, which fills the fields you defined. There are no pre-trained per-page-type extractors and no Knowledge Graph — that turnkey CV approach is Diffbot's distinct strength.

Can I define a custom JSON schema with fastCRW?

Yes. Send a request to /v1/scrape with formats: ["json"] and a jsonSchema describing exactly the fields you want, and the LLM populates that schema from the page. You are not constrained to a vendor's fixed shape, so you can extract bespoke fields a generic extractor never models. The extract path also accepts multiple URLs in one request, up to 50 at a time.

Which LLM does fastCRW extraction use?

On the managed cloud, fastCRW's JSON-schema extraction runs on fastCRW's managed LLM and requires a paid plan — you do not pick or manage the model. If you need to choose a specific extraction model, self-host the AGPL-3.0 engine against your own model endpoint. (Note the /v1/search answer mode uses the same managed LLM, but that is search-answer synthesis, not page extraction.)

Diffbot vs fastCRW: CV Extraction or LLM JSON

By the fastCRW team · Comparison/pricing verified 2026-05-18 · Verify independently before buying.

Disclosure: we build fastCRW. This is a vendor-authored comparison, so weigh it accordingly — but we have kept the section on where Diffbot genuinely wins explicit, because a comparison that pretends the other tool has no advantages is useless to you.

Diffbot vs fastCRW at a glance

The core of Diffbot vs fastCRW is two different philosophies of turning a web page into structured data. Diffbot runs pre-trained computer-vision and ML models that look at a rendered page, classify it as an article, product, discussion, or other known type, and emit a fixed schema for that type — then feeds the result into its Knowledge Graph. fastCRW takes the opposite route: you hand it the exact JSON schema you want, and an LLM fills that schema from the scraped page. One gives you turnkey extractors and a graph; the other gives you a blank schema you define yourself.

Dimension	Diffbot	fastCRW
Extraction model	Pre-trained CV/ML page classifiers	LLM fills your JSON schema
Schema control	Fixed per page type (article, product, etc.)	You define the exact JSON shape
Knowledge Graph	Yes — a core product	No graph product
Scrape + crawl + map	Crawlbot + extraction APIs	Firecrawl-compatible `/v1/scrape`, `/v1/crawl`, `/v1/map`, `/v1/search`
Self-host	Cloud-only	AGPL-3.0 single Rust binary, self-hostable
LLM for extraction	n/a (own models)	Managed LLM (paid plans)
Batch extraction	Bulk/crawl support	Multi-URL extract, up to 50 URLs per request

Two extraction philosophies

Diffbot's bet is that most of the web falls into a small number of recognizable page types, so a model trained on millions of those pages can extract them without you writing any rules or schema. You call the Article API or the Product API, point it at a URL, and get back a consistent object — title, author, date, price, images — that Diffbot's vision models inferred from the rendered layout. The output shape is decided by Diffbot, not by you.

fastCRW inverts that. There is no library of pre-trained extractors; instead you send a request to /v1/scrape with formats: ["json"] and a jsonSchema describing the fields you want, and an LLM populates them from the scraped content. If you want a product's name, price, SKU, and three custom attributes that no generic extractor knows about, you simply put them in the schema. The trade is real: you do the schema design, but you are never constrained to a vendor's idea of what an "article" or "product" contains.
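As a rough illustration of the request described above, the following sketch only assembles the JSON body and does not send it. The `formats` and `jsonSchema` fields and the /v1/scrape path come from the text; the base URL, the `url` field name, the example product page, and the `warranty_months` attribute are assumptions made for illustration, so check them against the API reference before relying on them.

```python
import json

# Hypothetical base URL: substitute your fastCRW cloud or self-hosted address.
BASE_URL = "https://crw.example.com"


def build_scrape_request(url: str, schema: dict) -> dict:
    """Assemble the endpoint and JSON body for a schema-driven scrape.

    The "url" key is an assumed field name; "formats" and "jsonSchema"
    are the fields named in the article.
    """
    return {
        "endpoint": f"{BASE_URL}/v1/scrape",
        "body": {"url": url, "formats": ["json"], "jsonSchema": schema},
    }


# A bespoke product schema: standard fields plus one custom attribute.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "sku": {"type": "string"},
        "warranty_months": {"type": "integer"},  # illustrative custom field
    },
    "required": ["name", "price"],
}

request = build_scrape_request("https://shop.example.com/item/42", product_schema)
print(request["endpoint"])
print(json.dumps(request["body"], indent=2))
```

Posting `request["body"]` to `request["endpoint"]` with your HTTP client of choice would be the next step; authentication is omitted here because the article does not describe it.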

This maps cleanly onto when each tool is the right call. When your targets are mainstream page types and you want zero schema work, Diffbot's classifiers do a lot for free. When your fields are bespoke, span page types Diffbot does not model, or change between sources, a schema you control is the more flexible primitive.

Where Diffbot genuinely wins

Two advantages are real and we will not hand-wave them:

Pre-built extractors at scale. Diffbot's page-type models are the product of years of training data. For high-volume extraction of standard article and product pages, you get structured output with no schema authoring and no per-page LLM cost — that is a genuine operational and cost advantage at scale.
The Knowledge Graph. Diffbot does not just extract pages; it links entities (companies, people, products) into a queryable graph spanning much of the public web. If your use case is specifically "give me the graph of companies and their relationships," that is Diffbot's product.

Where fastCRW wins

You define exactly the JSON shape you want. A jsonSchema on /v1/scrape extracts precisely the fields you ask for — no remapping from a fixed vendor schema, no fields you do not need, no missing fields a generic extractor never modeled.
Highest truth-recall in a 3-way scrape benchmark. On Firecrawl's own public dataset, fastCRW recovered correct content from 63.74% of 819 labeled URLs, the highest of the three engines tested (diagnose_3way.py, single run, 2026-05-08), ahead of Crawl4AI (59.95%) and Firecrawl (56.04%). It posted the fastest median latency (1914 ms) and a 91.8% scrape success rate on reachable URLs, with 0 errors. fastCRW alone recovered 34 URLs that neither other engine did, which the benchmark reports as 70% more unique coverage than the other two combined. In fast mode, fastCRW's p90 latency is 4348 ms, the lowest of the three. See the full p50/p90/p99 split on /benchmarks.
Self-host the engine free under AGPL-3.0. fastCRW is a single static Rust binary you can run yourself; self-hosting costs $0 per 1,000 scrapes (you pay only your own server). Diffbot is cloud-only, so your URLs and extracted content always leave your network.
One Firecrawl-compatible engine. The same engine does scrape, crawl, map, and search behind a Firecrawl-compatible REST surface — drop-in after a base-URL swap — so extraction is not a separate product bolted on.
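To put the benchmark percentages quoted in the list above into absolute terms, this small sketch back-calculates approximate URL counts from the published recall figures and the 819-URL dataset size. The counts are derived by rounding, not taken from the benchmark output, so treat them as estimates.

```python
TOTAL_URLS = 819  # labeled URLs in the benchmark dataset

# Recall percentages as quoted in the article (single run, 2026-05-08).
recall_pct = {"fastCRW": 63.74, "Crawl4AI": 59.95, "Firecrawl": 56.04}


def approx_count(pct: float, total: int = TOTAL_URLS) -> int:
    """Back-calculate an approximate URL count from a percentage."""
    return round(pct / 100 * total)


counts = {name: approx_count(pct) for name, pct in recall_pct.items()}
for name, count in counts.items():
    print(f"{name}: ~{count} of {TOTAL_URLS} URLs")

# Approximate gap between the highest and lowest engine, in URLs.
gap = counts["fastCRW"] - counts["Firecrawl"]
print(f"fastCRW vs Firecrawl: ~{gap} more URLs recovered")
```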

Cost and credits

fastCRW prices a plain scrape at 1 credit. Any request that returns structured JSON (formats: ["json"], the extraction path) costs that 1-credit scrape plus the LLM token cost, metered as usage-based LLM credits. That is the figure to budget against for Diffbot-style extraction: the scrape credit plus token-based LLM credits per extracted page, where the premium over a plain scrape scales with page size and token usage. Extraction runs on fastCRW's managed LLM on paid plans and draws from the same credit balance, with no separate model subscription on top. The free tier grants 500 one-time lifetime credits to prototype the core scrape/crawl/search surface; LLM extraction is available only on paid plans. Rather than reprint a table that can drift, we point you to the live numbers on /pricing.
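The budgeting rule above (1 scrape credit plus usage-based LLM credits per extracted page) can be sketched as a small estimator. The average LLM credits per page used here is a made-up placeholder, not a published rate; the real figure depends on page size and token usage, so take it from /pricing or from your own measured usage.

```python
SCRAPE_CREDITS_PER_PAGE = 1  # plain scrape cost stated in the article


def extraction_credits(pages: int, avg_llm_credits_per_page: float) -> float:
    """Estimate total credits for schema extraction over `pages` pages.

    `avg_llm_credits_per_page` is an assumption you supply; it varies
    with page size and token usage.
    """
    if pages < 0 or avg_llm_credits_per_page < 0:
        raise ValueError("pages and avg_llm_credits_per_page must be non-negative")
    return pages * (SCRAPE_CREDITS_PER_PAGE + avg_llm_credits_per_page)


pages = 10_000
hypothetical_llm_rate = 2.5  # placeholder value, not a real price

plain = pages * SCRAPE_CREDITS_PER_PAGE
with_extraction = extraction_credits(pages, hypothetical_llm_rate)
print(f"plain scrape: {plain:,} credits")
print(f"with extraction (assumed rate): {with_extraction:,.0f} credits")
```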

The economics that genuinely differ from a cloud-only tool show up in the worst case, not in the per-call price. Because fastCRW is AGPL-3.0 and self-hostable, the cost of high-volume scraping has a hard ceiling: your server bill, with $0 per 1,000 scrapes for the engine itself. Diffbot's cloud-only model has no such ceiling; you pay its metered rate for every page, for as long as you use it.

Extraction details

One fastCRW detail worth knowing before you build: JSON extraction runs on fastCRW's managed LLM and requires a paid plan. You do not pick the model, and the feature is not on the free tier. If you need to choose a specific extraction model, self-hosting the AGPL-3.0 engine against your own model endpoint is the path. (The /v1/search answer mode uses the same managed LLM, but that is search synthesis, not page extraction.) The extract path also accepts up to 50 URLs per request, so bulk extraction within that limit is a single call, not a manual loop.
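For URL lists longer than the 50-per-request limit mentioned above, one straightforward approach is to split the list into batches client-side and issue one extract call per batch. The helper below is a minimal sketch of that batching step only; the function name and the example URLs are illustrative, and the actual request is left out.

```python
from typing import Iterator, List

MAX_URLS_PER_REQUEST = 50  # per-request limit stated in the article


def batch_urls(urls: List[str], size: int = MAX_URLS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Illustrative input: 120 placeholder URLs.
urls = [f"https://example.com/p/{i}" for i in range(120)]
batches = list(batch_urls(urls))
print([len(batch) for batch in batches])  # → [50, 50, 20]
```

Each batch would then become the URL list of a single extract request.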

Which to choose

Pick Diffbot when you want turnkey computer-vision extraction of standard page types with no schema work, or when you specifically need the Knowledge Graph and its entity linking.

Pick fastCRW when you need to define a custom JSON schema for fields a generic extractor will not model; when you want the highest-recall scrape engine in our 3-way benchmark feeding that extraction; when a single credit meter (a flat 1-credit scrape plus usage-based LLM credits for extraction) is easier for you to budget against; or when self-hosting under AGPL-3.0, so that URLs and content never leave your infrastructure, is a requirement rather than a nice-to-have. If you are already on Diffbot for extraction but want to bring crawling and scraping in-house, fastCRW's Firecrawl-compatible surface is the easiest landing point. For a deeper look at defining schemas, see our guide to structured JSON-schema extraction with fastCRW and the Firecrawl extract endpoint deep dive.

Sources

fastCRW canonical facts: scrape benchmark (diagnose_3way.py, 819 labeled URLs, 2026-05-08), credit costs — github.com/us/crw
fastCRW pricing and benchmarks: /pricing · /benchmarks
Diffbot product and Knowledge Graph: diffbot.com (verified 2026-05-18)
