How to Build a Knowledge Graph From Web Data

Build a knowledge graph from web pages: crawl sources, extract entities and relations with LLMs, and load them into a graph. Full pipeline walkthrough.

By Recep · July 2, 2026 · 12 min read · Last updated: June 1, 2026

By the fastCRW team · Facts/pricing verified 2026-05-29 · fastCRW launch pricing reverts 2026-06-01 · Verify independently.

How to build a knowledge graph from web data

To build a knowledge graph from web data you need three things wired into one pipeline: a way to crawl and collect the source pages, a way to extract entities and relations as structured data instead of prose, and a graph store to hold the nodes and edges. This tutorial walks the full path end to end using fastCRW — a Firecrawl-compatible, open-core (AGPL-3.0) crawl/scrape/map/search engine — for the ingestion and extraction stages, then loads the result into a graph database. fastCRW handles the messy web-facing half (crawl, clean, JSON-schema extraction); you keep full control of the LLM and the graph.

Because fastCRW speaks the Firecrawl API shape, every code sample below also runs against Firecrawl by swapping one base URL. We disclose the honest limits as we go — fastCRW's LLM-based JSON extraction is a managed feature available on paid plans (the FREE tier has no LLM features), and the engine is stateless per request, both of which shape how you design the pipeline.

What a knowledge graph is and when to build one

A knowledge graph is a network of entities (people, companies, products, places — the nodes) connected by typed relations (the edges), e.g. (Acme Corp) —[acquired]→ (Widget Inc) or (Jane Doe) —[is CTO of]→ (Acme Corp). Unlike a pile of scraped Markdown or a vector index, a graph makes the connections first-class, so you can answer multi-hop questions ("which companies did the people who left Acme go on to found?") that flat retrieval struggles with.
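To make the multi-hop idea concrete, here is a minimal sketch in plain Python. The triples and the relation names ("left", "founded") are hypothetical sample data, not output from any tool; the point is only that a question spanning two edges becomes a short traversal once relations are stored explicitly.

```python
from collections import defaultdict

# Hypothetical triples in (source, relation, target) form
triples = [
    ("Jane Doe", "is CTO of", "Acme Corp"),
    ("Acme Corp", "acquired", "Widget Inc"),
    ("John Roe", "left", "Acme Corp"),
    ("John Roe", "founded", "Gadget Labs"),
]

# Index outgoing edges by source node
out_edges = defaultdict(list)
for source, relation, target in triples:
    out_edges[source].append((relation, target))


def founded_by_alumni(company: str) -> list[str]:
    """Companies founded by people who left `company` (a two-hop question)."""
    alumni = [s for s, r, t in triples if r == "left" and t == company]
    return sorted(
        target
        for person in alumni
        for relation, target in out_edges[person]
        if relation == "founded"
    )


print(founded_by_alumni("Acme Corp"))  # prints ['Gadget Labs']
```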

Build one when your questions are about relationships rather than passages. Good fits:

  • Competitive and market intelligence — companies, funding rounds, acquisitions, key people, products, all cross-linked.
  • Research and literature mapping — papers, authors, institutions, citations, topics.
  • Supply-chain or org mapping — who supplies whom, who reports to whom.
  • Graph-augmented RAG — using the graph to retrieve a connected neighborhood instead of disconnected chunks.

If your queries are really just "find the passage that answers X," a plain retrieval pipeline is simpler and cheaper — see our RAG pipeline with fastCRW guide before committing to a graph.

Pipeline overview

The four stages, and which tool owns each:

| Stage | What happens | Owner |
|---|---|---|
| 1. Crawl | Discover and fetch source pages as clean Markdown | fastCRW /v1/crawl + /v1/map |
| 2. Extract | Pull entities + relations as JSON to a fixed schema | fastCRW /v1/scrape with formats: ["json"] (managed, paid plans) |
| 3. Resolve | Deduplicate and merge entities into canonical IDs | Your code |
| 4. Load | Write nodes and edges into a graph store | Neo4j / SQLite / JSON |

Stages 1 and 2 are web-facing and benefit from a purpose-built engine; stages 3 and 4 are deterministic data plumbing you own. Keeping that line clear is the difference between a pipeline you can debug and a black box.

Step 1: Crawl and collect source pages

Start by discovering the URLs worth ingesting. Use /v1/map to enumerate a site's URLs cheaply (1 credit) when you want to filter before crawling, or go straight to /v1/crawl to fetch a whole subtree. The crawl returns clean Markdown per page, which is exactly what an LLM extractor wants — no DOM noise.

The fastCRW Python SDK runs a self-contained local engine, so this can execute with no network egress to a vendor at all:

from crw import CrwClient

client = CrwClient()  # self-contained local engine, no API key needed

# Discover URLs first (1 credit), so you can filter before crawling
site_map = client.map(url="https://example.com/companies")

# Crawl a bounded subtree into clean Markdown
job = client.crawl(
    url="https://example.com/companies",
    max_depth=2,
    max_pages=200,  # maxPages cap is 1000; maxDepth cap is 10
)
pages = job["data"]  # each item carries markdown + metadata
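Filtering the mapped URLs before crawling can be sketched with the standard library alone. The helper below is hypothetical and takes a plain list of URL strings; how you pull that list out of the map response depends on the SDK's response shape, which is not assumed here.

```python
from urllib.parse import urlparse


def filter_urls(urls, path_prefix, exclude_suffixes=(".pdf", ".jpg", ".png")):
    """Keep unique URLs under path_prefix, dropping obvious non-HTML assets."""
    seen, kept = set(), []
    for url in urls:
        parsed = urlparse(url)
        path = parsed.path.rstrip("/")
        key = (parsed.netloc.lower(), path)  # treat /a and /a/ as the same page
        if key in seen:
            continue
        if not path.startswith(path_prefix):
            continue
        if path.lower().endswith(exclude_suffixes):
            continue
        seen.add(key)
        kept.append(url)
    return kept


candidates = [
    "https://example.com/companies/acme",
    "https://example.com/companies/acme/",
    "https://example.com/blog/launch",
    "https://example.com/companies/report.pdf",
]
print(filter_urls(candidates, "/companies"))
```

Only the first URL survives here: the second is a trailing-slash duplicate, the third is outside the prefix, and the fourth is a PDF.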

If you already run on Firecrawl, the same flow works against fastCRW's managed cloud or a self-hosted binary by pointing the official SDK at a new base URL — that base-URL swap is the whole migration. See crawl an entire website from its sitemap for depth/limit tuning and sitemap-driven seeding.
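As a rough illustration of why the base URL is the only moving part, the sketch below builds the same Firecrawl-shaped scrape request for either backend without sending it. It uses plain urllib rather than either SDK. FASTCRW_BASE_URL and the localhost fallback are assumptions for illustration; substitute the address of your own deployment.

```python
import json
import os
import urllib.request

# Assumption: both services accept POST {base}/v1/scrape with a JSON body.
# FASTCRW_BASE_URL is a hypothetical variable name; the fallback port is a placeholder.
BASE_URLS = {
    "firecrawl": "https://api.firecrawl.dev",
    "fastcrw": os.environ.get("FASTCRW_BASE_URL", "http://localhost:3000").rstrip("/"),
}


def build_scrape_request(backend: str, url: str, api_key: str = "") -> urllib.request.Request:
    """Build (but do not send) a scrape request against the chosen backend."""
    body = json.dumps({"url": url, "formats": ["markdown"]}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{BASE_URLS[backend]}/v1/scrape", data=body, headers=headers, method="POST"
    )


req = build_scrape_request("fastcrw", "https://example.com/companies")
print(req.get_method(), req.full_url)
# Sending is left out on purpose: urllib.request.urlopen(req) would perform the call.
```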

Honest limits to design around. fastCRW is stateless per request, so there is no persistent crawl session you can resume — checkpoint the page list yourself if a run dies. There is no built-in screenshot format (a formats: ["screenshot"] request returns HTTP 422) and no Fire-engine-class anti-bot, so heavily defended sites may need their own handling. Respect robots.txt — fastCRW honors it by default, and you should only override it for sources you have the legal right to fetch.
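Because there is no resumable crawl session, a checkpoint file is one simple way to avoid refetching after a failed run. This is a minimal sketch with hypothetical helper names; it assumes each page dict carries its URL at metadata.sourceURL, as in the extraction loop in this tutorial.

```python
import json
from pathlib import Path


def save_checkpoint(path, pages):
    """Write one JSON object per line, then swap the file into place."""
    tmp = Path(str(path) + ".tmp")
    with tmp.open("w", encoding="utf-8") as f:
        for page in pages:
            f.write(json.dumps(page, ensure_ascii=False) + "\n")
    tmp.replace(path)  # rename, so readers never see a half-written file


def load_checkpoint(path):
    """Return the saved pages, or an empty list if no checkpoint exists yet."""
    p = Path(path)
    if not p.exists():
        return []
    with p.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def pending_urls(all_urls, done_pages):
    """URLs that still need fetching, given the pages already saved."""
    done = {page["metadata"]["sourceURL"] for page in done_pages}
    return [u for u in all_urls if u not in done]
```

On restart, load the checkpoint, compute pending_urls against your filtered URL list, and fetch only what is missing.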

Step 2: Extract entities and relations as JSON

This is the heart of an entity-extraction knowledge graph. Instead of asking an LLM for free-text "summarize the relationships," you give fastCRW a JSON schema and let its formats: ["json"] extraction return validated structured data. The model fills the schema; the engine enforces the shape. Define entities and relations explicitly so the output drops straight into a graph:

schema = {
    "type": "object",
    "properties": {
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "type": {
                        "type": "string",
                        "enum": ["company", "person", "product", "location"],
                    },
                },
                "required": ["name", "type"],
            },
        },
        "relations": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "source": {"type": "string"},
                    "relation": {"type": "string"},
                    "target": {"type": "string"},
                },
                "required": ["source", "relation", "target"],
            },
        },
    },
    "required": ["entities", "relations"],
}

graph_fragments = []
for page in pages:
    result = client.scrape(
        url=page["metadata"]["sourceURL"],
        formats=["json"],
        json_options={"schema": schema},
    )
    graph_fragments.append(result["json"])

A few design notes that matter for graph quality:

  • Constrain entity type with an enum. Free-form types fragment your node labels and make the graph unqueryable. A short closed vocabulary keeps edges joinable.
  • Reference relations by entity name, not array index. Names survive deduplication in Step 3; positional indices do not.
  • Extraction billing is folded into one credit model. Any request with formats: ["json"] costs 5 credits on fastCRW — there is no separate token subscription stacked on top. For a graph build that extracts on every page, that single-line-item model is the cost story; check current rates on /pricing.
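For budgeting, the credit arithmetic can be written down directly. The sketch below uses only the two figures quoted in this tutorial (5 credits per formats: ["json"] request, 1 credit per map call) and deliberately leaves out crawl cost, which is not stated here; the function name is hypothetical and current rates should be checked on /pricing.

```python
def extraction_credits(
    num_pages: int,
    credits_per_json_request: int = 5,  # figure quoted in this tutorial
    map_calls: int = 1,
    credits_per_map: int = 1,  # figure quoted in this tutorial
) -> int:
    """Credits for map discovery plus one JSON extraction per page (crawl cost excluded)."""
    return num_pages * credits_per_json_request + map_calls * credits_per_map


print(extraction_credits(200))  # prints 1001
```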

Plan reality (do not skip this). fastCRW's LLM-based JSON extraction is a managed feature available on paid plans. The FREE tier has no LLM features, so prototype the crawl/map stages there and move to a paid plan for extraction. Extraction is single-URL: there is no multi-URL batch endpoint, and the /v1/extract convenience wrapper also takes one URL per call. For a corpus you therefore iterate /v1/scrape over your pages. The loop above does this sequentially, which is fine for a prototype; for larger corpora, run the calls concurrently. For the full schema-design playbook, see structured extraction with JSON Schema.
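One way to run single-URL extraction concurrently is a thread pool around whatever callable performs one scrape. The sketch below is generic: extract_all and scrape_one are hypothetical names, and whether the SDK client is safe to share across threads is an assumption you should verify before using it this way.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_all(urls, scrape_one, max_workers=8):
    """Run scrape_one(url) for every URL in a thread pool.

    Returns (results, errors), both keyed by URL, so one bad page
    does not abort the whole run.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[url] = repr(exc)
    return results, errors
```

With the client from Step 1, scrape_one could be a small function that calls client.scrape(url=url, formats=["json"], json_options={"schema": schema}) and returns its "json" field. Keep max_workers modest and retry the URLs left in errors on a later pass.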

Step 3: Resolve and deduplicate entities

Raw extraction will produce the same real-world entity under many surface forms — "Acme Corp," "Acme Corporation," "ACME," "acme corp." Loaded naively, each becomes a separate node and your graph fractures. Entity resolution collapses these into one canonical ID. Start simple and only add machinery when the data demands it:

  1. Normalize: lowercase, trim, strip common suffixes (Inc, Corp, Ltd), collapse whitespace. This alone resolves a surprising share of duplicates.
  2. Block, then compare: group candidates by normalized prefix or type so you only fuzzy-match within a bucket, not across the whole set.
  3. Fuzzy-match within blocks: use a token-set ratio or an embedding-similarity threshold to merge near-duplicates.
  4. Assign a canonical ID: pick a stable key (slugified canonical name) and rewrite every relation's source/target to point at it.

A minimal pass covering steps 1 and 4:

import re

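def canonical_id(name: str) -> str:
    """Normalize a surface form into a stable slug, e.g. 'Acme Corp.' -> 'acme'."""
    n = name.strip().lower()
    # Drop common legal suffixes, with or without a trailing period
    n = re.sub(r"\b(inc|corp|corporation|ltd|llc)\b\.?", "", n)
    # Collapse every run of non-alphanumerics into a single hyphen
    n = re.sub(r"[^a-z0-9]+", "-", n).strip("-")
    return n

nodes, edges = {}, []
for frag in graph_fragments:
    for e in frag.get("entities", []):
        node_id = canonical_id(e["name"])
        if not node_id:
            continue  # name was empty or only a suffix such as "Inc."
        nodes[node_id] = {"label": e["name"], "type": e["type"]}
    for r in frag.get("relations", []):
        src, tgt = canonical_id(r["source"]), canonical_id(r["target"])
        if not src or not tgt:
            continue  # skip relations whose endpoints normalize to nothing
        edges.append({"source": src, "relation": r["relation"], "target": tgt})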

Keep the original surface form as a label on the node — you lose nothing and gain debuggability. For ambiguous cases, an LLM disambiguation pass ("are these the same entity? yes/no") is cheap and effective, but reach for it only after normalization and blocking have done the bulk of the work.
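Blocking and fuzzy matching (steps 2 and 3) can be sketched with the standard library. Note the substitution: this uses difflib.SequenceMatcher as a stand-in for a token-set ratio, blocks on entity type plus first character, and picks the alphabetically first ID in each cluster as the representative. The function names and the 0.85 threshold are illustrative assumptions to tune on your own data.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_alias_map(nodes, threshold=0.85):
    """Map every canonical ID to a representative ID within its block.

    nodes: dict of canonical ID -> {"label": ..., "type": ...}, as built in Step 3.
    """
    blocks = defaultdict(list)
    for cid in sorted(nodes):
        # Block on (type, first character) so comparisons stay within a bucket
        blocks[(nodes[cid]["type"], cid[:1])].append(cid)

    alias = {}
    for block in blocks.values():
        representatives = []
        for cid in block:
            match = next(
                (r for r in representatives
                 if SequenceMatcher(None, cid, r).ratio() >= threshold),
                None,
            )
            if match is None:
                representatives.append(cid)
                alias[cid] = cid
            else:
                alias[cid] = match
    return alias


def rewrite_edges(edges, alias):
    """Point every edge endpoint at its representative ID."""
    return [
        {**e,
         "source": alias.get(e["source"], e["source"]),
         "target": alias.get(e["target"], e["target"])}
        for e in edges
    ]
```

First-character blocking is crude: it will never merge "the-acme" with "acme". Treat it as a starting point and widen the blocking key if you see missed merges.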

Step 4: Load into a graph database

You do not need a graph database to start. The cleaned nodes and edges from Step 3 are already a valid in-memory graph — dump them to JSON and you can traverse, count degrees, and answer two-hop questions with plain Python. Promote to a real graph store when you need persistent queries, concurrent writers, or path-finding at scale.
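As a rough sketch of the JSON-first approach, the snippet below round-trips a small graph through JSON, counts node degrees, and answers a two-hop reachability question with a breadth-first walk. The sample nodes and edges are hypothetical but use the same shape Step 3 produces.

```python
import json
from collections import Counter, defaultdict

# Hypothetical sample in the same shape Step 3 produces
sample_nodes = {
    "jane-doe": {"label": "Jane Doe", "type": "person"},
    "acme": {"label": "Acme Corp", "type": "company"},
    "widget": {"label": "Widget Inc", "type": "company"},
    "gizmo": {"label": "Gizmo", "type": "product"},
}
sample_edges = [
    {"source": "jane-doe", "relation": "is CTO of", "target": "acme"},
    {"source": "acme", "relation": "acquired", "target": "widget"},
    {"source": "widget", "relation": "makes", "target": "gizmo"},
]

# Round-trip through JSON: this string is all the persistence needed to start
snapshot = json.dumps({"nodes": sample_nodes, "edges": sample_edges})
graph = json.loads(snapshot)


def degree_counts(edges):
    """Number of edges touching each node."""
    counts = Counter()
    for e in edges:
        counts[e["source"]] += 1
        counts[e["target"]] += 1
    return counts


def build_adjacency(edges):
    """Undirected adjacency sets, which is enough for reachability questions."""
    adj = defaultdict(set)
    for e in edges:
        adj[e["source"]].add(e["target"])
        adj[e["target"]].add(e["source"])
    return adj


def within_hops(adj, start, max_hops=2):
    """All nodes reachable from start in at most max_hops edges."""
    seen, frontier = {start}, {start}
    for _ in range(max_hops):
        frontier = {n for node in frontier for n in adj.get(node, ())} - seen
        seen |= frontier
    return seen - {start}


adj = build_adjacency(graph["edges"])
print(sorted(within_hops(adj, "jane-doe")))  # prints ['acme', 'widget']
```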

For a persistent store, Neo4j is the common choice and a clean MERGE upsert keeps loads idempotent (re-running the pipeline never double-creates):

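from neo4j import GraphDatabase

# Placeholder URI and credentials; replace with your own instance's values
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Upsert nodes first so the edge MATCH below can find both endpoints
    for node_id, n in nodes.items():
        session.run(
            "MERGE (e:Entity {id: $id}) "
            "SET e.label = $label, e.type = $type",
            id=node_id, label=n["label"], type=n["type"],
        )
    # An edge whose endpoint has no node matches nothing and is silently skipped
    for edge in edges:
        session.run(
            "MATCH (a:Entity {id: $src}), (b:Entity {id: $tgt}) "
            "MERGE (a)-[:REL {type: $rel}]->(b)",
            src=edge["source"], tgt=edge["target"], rel=edge["relation"],
        )

driver.close()  # release the connection pool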

Idempotent MERGE is what makes the graph maintainable: a scheduled re-crawl simply upserts, so the same entity never forks into duplicates across runs.
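For larger graphs, one query per row gets slow. A common Cypher pattern is to send rows in batches with UNWIND while keeping the same MERGE semantics. The sketch below assumes only that the session object has a run(query, **params) method, as the Neo4j session used in this section does; load_graph, dangling_edges, and the batch size of 500 are illustrative choices.

```python
NODE_QUERY = (
    "UNWIND $rows AS row "
    "MERGE (e:Entity {id: row.id}) "
    "SET e.label = row.label, e.type = row.type"
)
EDGE_QUERY = (
    "UNWIND $rows AS row "
    "MATCH (a:Entity {id: row.source}), (b:Entity {id: row.target}) "
    "MERGE (a)-[:REL {type: row.relation}]->(b)"
)


def chunked(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]


def dangling_edges(nodes, edges):
    """Edges whose source or target has no node; MATCH would skip these silently."""
    return [e for e in edges if e["source"] not in nodes or e["target"] not in nodes]


def load_graph(session, nodes, edges, batch_size=500):
    """Upsert all nodes, then all edges, in batches."""
    node_rows = [{"id": node_id, **attrs} for node_id, attrs in nodes.items()]
    for batch in chunked(node_rows, batch_size):
        session.run(NODE_QUERY, rows=batch)
    for batch in chunked(edges, batch_size):
        session.run(EDGE_QUERY, rows=batch)
```

Logging the result of dangling_edges before each load is a cheap way to notice relations that mention an entity the extractor never listed.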

Querying and maintaining the graph

Once loaded, the multi-hop questions a flat index can't answer become one Cypher query — e.g. find every company connected to a person within two hops:

MATCH (p:Entity {type: "person"})-[*1..2]-(c:Entity {type: "company"})
RETURN DISTINCT p.label, c.label

Maintenance is where a graph either stays useful or rots. The web moves, so the graph must be re-ingested on a cadence:

  • Schedule re-crawls of your source set and re-run Steps 2–4. Because the load is idempotent, freshness is just "run the pipeline again." A cron-driven crawl keeps the graph current without manual work.
  • Track provenance — store the sourceURL and crawl timestamp on each edge so you can audit where a relation came from and expire stale ones.
  • Validate against the schema — reject extraction fragments that don't match before they reach Step 3, so a bad page never corrupts the graph.
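The validation and provenance points above can be sketched without any third-party library. The checker below is hand-rolled for the specific entities/relations schema used in Step 2 (a general JSON Schema validator would also work), and the source_url and crawled_at field names are illustrative assumptions.

```python
from datetime import datetime, timezone

ALLOWED_TYPES = {"company", "person", "product", "location"}  # mirrors the Step 2 enum


def is_valid_fragment(frag) -> bool:
    """True if frag has the entities/relations shape defined in Step 2."""
    if not isinstance(frag, dict):
        return False
    entities, relations = frag.get("entities"), frag.get("relations")
    if not isinstance(entities, list) or not isinstance(relations, list):
        return False
    for e in entities:
        if not (isinstance(e, dict)
                and isinstance(e.get("name"), str)
                and e.get("type") in ALLOWED_TYPES):
            return False
    for r in relations:
        if not (isinstance(r, dict)
                and all(isinstance(r.get(k), str) for k in ("source", "relation", "target"))):
            return False
    return True


def with_provenance(frag, source_url, crawled_at=None):
    """Return copies of the fragment's relations stamped with where and when they were seen."""
    stamp = crawled_at or datetime.now(timezone.utc).isoformat()
    return [{**r, "source_url": source_url, "crawled_at": stamp} for r in frag["relations"]]
```

Run is_valid_fragment on each extraction result before Step 3 and drop (or log) the failures; carry the stamped fields through to the edge properties at load time.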

If your end goal is a research assistant that reasons over this graph and the live web together, the graph becomes the structured backbone and live retrieval fills the gaps — see build a deep-research agent with fastCRW for that pattern.

Why fastCRW for the ingestion half

The graph logic is yours; the brittle part is the web-facing ingestion, and that is what fastCRW is built for. On Firecrawl's own public 1,000-URL scrape-content-dataset-v1, fastCRW recorded the highest truth-recall of the three tools tested (fastCRW, Crawl4AI, and Firecrawl): 63.74% of 819 labeled URLs (diagnose_3way.py, single run, 2026-05-08). The same run showed ~92% scrape success on reachable URLs, 0 thrown errors, and a p50 of 1914 ms, the fastest median of the three. Of the URLs where results diverged, fastCRW recovers 34 that neither Crawl4AI nor Firecrawl reach, 70% more than the other two combined. Higher-fidelity extraction at the source means fewer garbage entities to clean in Step 3.

For concurrency-heavy batch graph builds, fastCRW's fast mode keeps the p90 at 4348 ms — the lowest of the three tools tested — so the tail does not become a bottleneck on large corpora. For an inline interactive path, measure on your own URL mix first.
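Measuring on your own URL mix needs little more than a timer and a percentile calculation. The sketch below is generic: time_calls wraps any callable that fetches one URL, and percentile_summary reports the median and 90th percentile using the standard library. Both names are hypothetical.

```python
import statistics
import time


def time_calls(fn, inputs):
    """Call fn(item) for each input and return the per-call latency in milliseconds."""
    latencies_ms = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def percentile_summary(latencies_ms):
    """Median and 90th percentile; needs at least two samples."""
    deciles = statistics.quantiles(latencies_ms, n=10, method="inclusive")
    return {"p50": statistics.median(latencies_ms), "p90": deciles[8]}
```

Run it over a representative sample of your real source URLs, not a single site, since tail latency depends heavily on the slowest hosts in the mix.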

The other reasons that matter for graph builds: it is open-core AGPL-3.0, so you can self-host the same engine as a single ~8 MB binary in one container and keep both the scraped content and your target URLs on your own infrastructure — relevant when the corpus is sensitive. And the Firecrawl-compatible API keeps the choice reversible: write the client once, point it at fastCRW or Firecrawl by base URL.

Sources

  • fastCRW scrape benchmark (truth-recall 63.74% of 819 labeled URLs, p50 1914 ms, fast-mode p90 4348 ms; diagnose_3way.py, 2026-05-08): /benchmarks
  • fastCRW pricing and credit model (launch pricing reverts 2026-06-01): /pricing
  • fastCRW open-core engine and API surface (AGPL-3.0): github.com/us/crw

Related: Structured extraction with JSON Schema · Crawl an entire website from its sitemap · Build a RAG pipeline with fastCRW · Build a deep-research agent

Frequently asked questions

What is a knowledge graph?
A knowledge graph is a network of entities (people, companies, products, places; these are the nodes) connected by typed relations (the edges), such as (Acme Corp) —[acquired]→ (Widget Inc). Unlike a flat document index, it makes connections first-class, so you can answer multi-hop relationship questions that passage retrieval struggles with.

How do I extract entities and relations from web pages?
Crawl the pages into clean Markdown with fastCRW's /v1/crawl, then call /v1/scrape with formats: ["json"] and a JSON schema that defines an entities array and a relations array. The LLM fills the schema and the engine validates the shape, so you get structured nodes and edges back instead of free text. Reference relations by entity name (not array index) so they survive deduplication.

What does fastCRW extraction require?
fastCRW's LLM-based JSON extraction is a managed feature available on paid plans; the FREE tier has no LLM features. Extraction is single-URL (there is no multi-URL batched /v1/extract), so for a corpus you iterate /v1/scrape concurrently across your crawled pages.

Do I need a graph database or can I start with JSON?
You can start with JSON. After deduplication, your nodes and edges are already a valid in-memory graph you can dump to JSON and traverse with plain code. Move to a graph database like Neo4j when you need persistent queries, concurrent writers, or path-finding at scale, and use an idempotent MERGE upsert so re-runs never create duplicates.

How do I keep a knowledge graph up to date?
Schedule re-crawls of your source set and re-run the extract, resolve, and load steps on a cadence. Because the load uses idempotent MERGE upserts, refreshing the graph is just running the pipeline again. Store each edge's source URL and crawl timestamp so you can audit provenance and expire stale relations.
