
Web Scraping for Lead Enrichment

Use fastCRW to scrape company pages, directories, and public profiles for firmographic and contact data, then push structured fields into your CRM — fresher than vendor databases, cheaper per record, and automatable for AI SDR workflows.

Published: April 4, 2026
Updated: June 24, 2026

Why Lead Enrichment Needs Web Scraping

CRM records decay faster than most sales teams realize. People change roles every 18–24 months on average. Companies rebrand, pivot products, and update pricing. The contact page you scraped last quarter may already show a different head of sales.

Third-party enrichment databases (Apollo, ZoomInfo, Clearbit) help, but they solve a different problem: they aggregate data across many sources and resell it on a per-record basis. That model introduces two friction points that matter at scale:

Freshness lag. A provider's database reflects when they last crawled a company's site — often weeks or months ago. If your ICP (ideal customer profile) is in fast-moving sectors like AI or fintech, stale data costs pipeline.
Cost per record. At volume — enriching 50,000 inbound leads per month — per-record fees compound quickly. Scraping public company sites directly costs fractions of a cent per domain in server fees.

Direct scraping gives your pipeline:

Current firmographic data from the company's own About page — headcount, location, product lines, founding year
Fresh team structure from leadership and team pages — who's the new VP of Engineering, when did they hire a Head of Partnerships
Product and pricing signals from pricing pages — did they add an enterprise tier, drop a plan, change the headline pitch
Technology signals from page source and meta tags — what stack they're building on, which integrations they advertise

The tradeoff is clear: you own the pipeline, but you also own the freshness. For verified phone and email, a specialist provider is still better. For everything publicly visible on a company's website, scraping wins on cost and recency.

Where fastCRW Fits in the Enrichment Stack

Enrichment need	fastCRW endpoint	Notes
Discover relevant pages on a domain	`/v1/map`	Returns all URLs — about, team, pricing, careers, contact
Pull structured firmographics	`/v1/scrape` + `jsonSchema`	5 credits per extract; 1 credit for raw markdown
Crawl an entire company site	`/v1/crawl`	Respects `maxDepth` and `maxPages` caps; 1 credit per page
Search for a company by name when you lack the domain	`/v1/search`	Returns top results with URLs; 1 credit per query
Render JS-heavy SPAs and dynamic team pages	Auto renderer	All renderers (http, lightpanda, Chrome) cost 1 credit

Typical Enrichment Pipeline

A production enrichment pipeline for a B2B sales team has five stages:

1. CRM export: Pull domains of unenriched or stale records from your CRM. A Salesforce SOQL query like `SELECT Website FROM Account WHERE LastEnrichmentDate__c < LAST_N_DAYS:30` gives you the input list (here `LastEnrichmentDate__c` stands for whatever custom date field you use to track enrichment).

2. URL discovery: Call `/v1/map` on each domain to get all page URLs. Filter for pages matching patterns like `/about`, `/team`, `/leadership`, `/company`, `/pricing`, `/contact`. Most company sites have predictable URL structures; map once per domain per month.

3. Structured extraction: For each relevant page URL, call `/v1/scrape` with `formats: ["json"]` and a `jsonSchema` defining the CRM fields you want. The extraction LLM fills the schema from the page content. No HTML parsing and no custom selectors: the same schema works across every company's site.

4. Merge and deduplicate: A company's About page and Team page often overlap (both mention the CEO, both show the HQ location). Merge extracted records from multiple pages per domain, preferring the more specific value when fields conflict.

5. CRM write-back: Push enriched fields to your CRM via its API with a freshness timestamp. Patch only changed fields to avoid triggering downstream automation on unchanged records.
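Stages 4 and 5 can be sketched as two small pure functions. This is an illustration, not fastCRW functionality: `merge_records` and `changed_fields` are hypothetical helper names, and "more specific" is approximated as "longer string or list", an assumption you would likely replace with per-field rules.

```python
from typing import Any


def _specificity(value: Any) -> int:
    """Rough proxy for specificity (assumption: a longer value is more specific)."""
    if value is None:
        return 0
    if isinstance(value, (str, list, dict)):
        return len(value)
    return 1


def merge_records(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge per-page extractions for one domain, keeping the more specific value."""
    merged: dict[str, Any] = {}
    for record in records:
        for field, value in record.items():
            if _specificity(value) > _specificity(merged.get(field)):
                merged[field] = value
    return merged


def changed_fields(current: dict[str, Any], enriched: dict[str, Any]) -> dict[str, Any]:
    """Return only the enriched fields that differ from the current CRM record."""
    return {k: v for k, v in enriched.items() if current.get(k) != v}


# Example: the team page carries a fuller location and a headcount.
about = {"company_name": "Acme", "hq_location": "Berlin", "employee_count": ""}
team = {"company_name": "Acme", "hq_location": "Berlin, Germany", "employee_count": "50-200"}
merged = merge_records([about, team])
patch = changed_fields({"company_name": "Acme", "hq_location": "Berlin"}, merged)
```

Sending only `patch` to the CRM keeps unchanged fields (here `company_name`) out of the update, so automations keyed on field changes do not fire.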

Implementation: Lead Enrichment Pipeline

curl — scrape a company about page with schema extraction

curl -X POST https://api.fastcrw.com/v1/scrape \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-company.com/about",
    "formats": ["json"],
    "jsonSchema": {
      "type": "object",
      "properties": {
        "company_name":    { "type": "string" },
        "industry":        { "type": "string" },
        "employee_count":  { "type": "string" },
        "hq_location":     { "type": "string" },
        "founded_year":    { "type": "string" },
        "description":     { "type": "string", "description": "1-2 sentence company description" },
        "key_products":    { "type": "array", "items": { "type": "string" } }
      },
      "required": ["company_name", "description"]
    }
  }'

Python — full enrichment loop across a domain list

import os
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlparse

import requests

# Read the API key from the environment instead of hard-coding it.
CRW_API_KEY = os.environ.get("CRW_API_KEY", "your-api-key")
CRW_BASE_URL = "https://api.fastcrw.com/v1"

FIRMOGRAPHIC_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name":   { "type": "string" },
        "industry":       { "type": "string" },
        "employee_count": { "type": "string", "description": "Headcount or range, e.g. '50-200'" },
        "hq_location":    { "type": "string" },
        "founded_year":   { "type": "string" },
        "description":    { "type": "string", "description": "1-2 sentence company description" },
        "key_products":   { "type": "array", "items": { "type": "string" } },
        "tech_stack":     { "type": "array", "items": { "type": "string" } },
    },
    "required": ["company_name"]
}

TEAM_SCHEMA = {
    "type": "object",
    "properties": {
        "executives": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name":       { "type": "string" },
                    "title":      { "type": "string" },
                    "linkedin":   { "type": "string" }
                }
            }
        }
    }
}

def map_domain(domain: str) -> list[str]:
    """Discover all pages on a company domain."""
    resp = requests.post(
        f"{CRW_BASE_URL}/map",
        json={"url": f"https://{domain}"},
        headers={"Authorization": f"Bearer {CRW_API_KEY}"},
        timeout=30
    )
    resp.raise_for_status()
    return resp.json().get("urls", [])

def scrape_with_schema(url: str, schema: dict) -> Optional[dict]:
    """Scrape a URL and extract structured fields via JSON schema.

    Returns None on any non-200 response so the caller can skip the page.
    """
    resp = requests.post(
        f"{CRW_BASE_URL}/scrape",
        json={
            "url": url,
            "formats": ["json"],
            "jsonSchema": schema
        },
        headers={"Authorization": f"Bearer {CRW_API_KEY}"},
        timeout=30
    )
    if resp.status_code == 200:
        return resp.json().get("data", {}).get("json")
    return None

def filter_relevant_pages(urls: list[str]) -> dict[str, list[str]]:
    """Bucket discovered URLs by page type."""
    buckets: dict[str, list[str]] = {"about": [], "team": [], "pricing": []}
    patterns = {
        "about":   ["/about", "/company", "/our-story", "/who-we-are"],
        "team":    ["/team", "/leadership", "/people", "/founders"],
        "pricing": ["/pricing", "/plans", "/packages"],
    }
    for url in urls:
        # Match on the path only; matching the full URL would let a hostname
        # such as company.io satisfy the "/company" pattern.
        path = urlparse(url).path.lower()
        for bucket, keywords in patterns.items():
            if any(kw in path for kw in keywords):
                buckets[bucket].append(url)
    return buckets

def enrich_domain(domain: str) -> dict:
    """Run the full enrichment pipeline for one company domain."""
    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated in Python 3.12+)
    result: dict = {"domain": domain, "enriched_at": datetime.now(timezone.utc).isoformat()}

    # Step 1: Discover pages
    all_urls = map_domain(domain)
    buckets = filter_relevant_pages(all_urls)

    # Step 2: Extract firmographics from about pages
    for url in buckets["about"][:2]:  # cap at 2 about pages
        data = scrape_with_schema(url, FIRMOGRAPHIC_SCHEMA)
        if data:
            # First non-empty value wins; later pages only fill gaps
            result.update({k: v for k, v in data.items() if v and k not in result})

    # Step 3: Extract team data from team pages
    for url in buckets["team"][:1]:
        data = scrape_with_schema(url, TEAM_SCHEMA)
        if data and "executives" in data:
            result["executives"] = data["executives"]

    return result

def enrich_domain_list(domains: list[str]) -> list[dict]:
    """Enrich a list of company domains (serial for demo; parallelize in prod)."""
    enriched = []
    for i, domain in enumerate(domains):
        print(f"[{i+1}/{len(domains)}] Enriching {domain}...")
        try:
            record = enrich_domain(domain)
            enriched.append(record)
        except Exception as e:
            print(f"  Error enriching {domain}: {e}")
            enriched.append({"domain": domain, "error": str(e)})
    return enriched

if __name__ == "__main__":
    domains = [
        "stripe.com",
        "notion.so",
        "linear.app",
        "vercel.com",
        "supabase.com",
    ]

    results = enrich_domain_list(domains)

    print("\n=== ENRICHMENT RESULTS ===")
    for r in results:
        print(f"\n{r.get('domain')}:")
        print(f"  Company:   {r.get('company_name', 'N/A')}")
        print(f"  Industry:  {r.get('industry', 'N/A')}")
        print(f"  Headcount: {r.get('employee_count', 'N/A')}")
        print(f"  Location:  {r.get('hq_location', 'N/A')}")
        execs = r.get("executives", [])
        if execs:
            print(f"  Executives: {len(execs)} found")

JavaScript/TypeScript — enrichment worker for a queue-based pipeline

const CRW_API_KEY = process.env.CRW_API_KEY!;
const CRW_BASE_URL = "https://api.fastcrw.com/v1";

const firmographicSchema = {
  type: "object",
  properties: {
    company_name:   { type: "string" },
    industry:       { type: "string" },
    employee_count: { type: "string" },
    hq_location:    { type: "string" },
    description:    { type: "string" },
    key_products:   { type: "array", items: { type: "string" } },
  },
  required: ["company_name"],
};

async function mapDomain(domain: string): Promise<string[]> {
  const res = await fetch(`${CRW_BASE_URL}/map`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${CRW_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url: `https://${domain}` }),
  });
  // Surface HTTP failures instead of treating them as "no URLs found"
  if (!res.ok) throw new Error(`map failed for ${domain}: HTTP ${res.status}`);
  const data = await res.json();
  return data.urls ?? [];
}

async function scrapeWithSchema(
  url: string,
  schema: object
): Promise<Record<string, unknown> | null> {
  const res = await fetch(`${CRW_BASE_URL}/scrape`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${CRW_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, formats: ["json"], jsonSchema: schema }),
  });
  if (!res.ok) return null;
  const data = await res.json();
  return data?.data?.json ?? null;
}

async function enrichDomain(domain: string) {
  const urls = await mapDomain(domain);
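  const aboutUrl = urls.find((u) => {
    // Match on the path only, so a hostname like company.io does not match "/company"
    const path = new URL(u).pathname.toLowerCase();
    return ["/about", "/company", "/our-story"].some((kw) => path.includes(kw));
  });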

  if (!aboutUrl) return { domain, error: "no about page found" };

  const firmographics = await scrapeWithSchema(aboutUrl, firmographicSchema);
  return {
    domain,
    enriched_at: new Date().toISOString(),
    ...firmographics,
  };
}

// Parallel enrichment with concurrency cap
async function enrichBatch(domains: string[], concurrency = 5) {
  const results: unknown[] = [];
  for (let i = 0; i < domains.length; i += concurrency) {
    const batch = domains.slice(i, i + concurrency);
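    const batchResults = await Promise.allSettled(batch.map(enrichDomain));
    // Keep the domain on failures and stringify the reason so it survives JSON.stringify
    results.push(
      ...batchResults.map((r, j) =>
        r.status === "fulfilled" ? r.value : { domain: batch[j], error: String(r.reason) }
      )
    );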
  }
  return results;
}

// Example
const domains = ["stripe.com", "notion.so", "linear.app"];
enrichBatch(domains, 5).then((results) => console.log(JSON.stringify(results, null, 2)));

AI SDR Workflows: Enrichment as a Real-Time Signal

AI sales development representatives (AI SDRs) have made lead enrichment a real-time requirement rather than a nightly batch job. When a prospect submits a demo request, the AI SDR needs firmographic context within seconds to personalize the first email.

fastCRW fits this pattern well because:

Low latency for single-domain lookups. A single /v1/scrape call completes at p50 in 1914 ms (benchmark against Firecrawl's public 1,000-URL dataset, diagnose_3way.py, 2026-05-08 — CANONICAL-FACTS §5). Map + scrape two pages takes ~5 seconds end-to-end — well within the window before a welcome email sends.
Self-hostable for zero data egress. For regulated industries, the enrichment data (company descriptions, executive names) never needs to leave your VPC. Spin up fastCRW on an internal server and call it from your AI SDR service directly.
Firecrawl-compatible API. If your AI SDR already integrates Firecrawl, swapping to fastCRW is a base-URL change and an API key swap — no code changes needed.

A typical AI SDR enrichment flow on inbound:

Inbound form submit
  → fastCRW /v1/map(domain) → filter about/team URLs
  → fastCRW /v1/scrape(about_url, jsonSchema) → firmographics JSON
  → AI SDR prompt: "Personalize this email for {company_name}, a {employee_count}-person {industry} company based in {hq_location} that builds {description}."
  → Send personalized email
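The prompt step in the flow above assumes every extracted field is present, which is not guaranteed when only `company_name` is required by the schema. Here is a minimal sketch of a more forgiving version. `build_prompt` is a hypothetical helper, and its wording differs slightly from the template so that clauses for missing fields can be dropped cleanly.

```python
def build_prompt(firmographics: dict) -> str:
    """Build the personalization prompt, dropping clauses for missing fields."""
    name = firmographics.get("company_name") or "the prospect's company"

    # Collect optional descriptors in the order they appear in the sentence.
    descriptors = []
    if firmographics.get("employee_count"):
        descriptors.append(f"{firmographics['employee_count']}-person")
    if firmographics.get("industry"):
        descriptors.append(firmographics["industry"])

    prompt = f"Personalize this email for {name}"
    if descriptors:
        prompt += f", a {' '.join(descriptors)} company"
    if firmographics.get("hq_location"):
        prompt += f" based in {firmographics['hq_location']}"
    prompt += "."
    if firmographics.get("description"):
        prompt += f" Company summary: {firmographics['description']}"
    return prompt


full = build_prompt({
    "company_name": "Acme",
    "employee_count": "50-200",
    "industry": "fintech",
    "hq_location": "Berlin",
    "description": "Builds payment APIs.",
})
sparse = build_prompt({"company_name": "Acme"})
```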

Production Considerations

Parallelism and rate limits

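Serial enrichment is fine for nightly batches of a few hundred domains. For larger volumes, parallelize `/v1/scrape` with a concurrency cap that stays within your plan's rate limits. At 10 concurrent workers and ~2 s per scrape, the theoretical ceiling is about 5 scrapes per second (roughly 18,000 per hour). In practice your plan's rate limit and slow target sites set the real throughput, but even a fraction of that ceiling is enough for a large enterprise SDR team's monthly inbound.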
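One way to cap concurrency in Python is a bounded thread pool. The sketch below is an assumption about how you might structure it, not part of fastCRW: `enrich_concurrently` is a hypothetical helper, and `fake_enrich` stands in for a real per-domain enrichment call so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def enrich_concurrently(
    domains: list[str],
    enrich_fn: Callable[[str], dict],
    max_workers: int = 10,
) -> list[dict]:
    """Run enrich_fn over domains with at most max_workers calls in flight."""
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(enrich_fn, d): d for d in domains}
        for future in as_completed(futures):
            domain = futures[future]
            try:
                results[domain] = future.result()
            except Exception as exc:  # one failure should not abort the batch
                results[domain] = {"domain": domain, "error": str(exc)}
    # Return results in input order so they line up with the CRM export.
    return [results[d] for d in domains]


def fake_enrich(domain: str) -> dict:
    """Stand-in for a real enrichment call; fails on one domain to show error handling."""
    if domain == "broken.example":
        raise RuntimeError("scrape failed")
    return {"domain": domain, "company_name": domain.split(".")[0].title()}


batch = enrich_concurrently(
    ["acme.example", "broken.example", "globex.example"], fake_enrich, max_workers=2
)
```

Set `max_workers` from your plan's rate limit rather than from CPU count, since the work is I/O-bound.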

Handling failed scrapes

Not every company website has a clean About page. Implement retry logic with exponential backoff for 5xx responses. If a domain returns consistent 403s or has heavy bot protection, fall back to a web search: call /v1/search with the company name to find directory listings or press coverage that surface the same firmographic fields.
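A retry wrapper with exponential backoff might look like the sketch below. The names `RetryableError` and `with_backoff` are hypothetical; the assumption is that your own fetch function raises `RetryableError` for 5xx responses and lets other errors (such as a 403) propagate so you can fall back to search.

```python
import random
import time
from typing import Callable, Optional


class RetryableError(Exception):
    """Raised by the caller's fetch function for responses worth retrying (e.g. 5xx)."""


def with_backoff(
    fetch: Callable[[], dict],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch, retrying RetryableError with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RetryableError:
            if attempt == max_attempts - 1:
                return None  # give up; the caller can fall back to /v1/search
            delay = base_delay * (2 ** attempt)  # 1 s, 2 s, 4 s, ...
            sleep(delay + random.uniform(0, delay * 0.1))  # up to 10% jitter
    return None


# Example: a fetch that fails twice with a 5xx, then succeeds.
calls = {"n": 0}
delays: list[float] = []


def flaky() -> dict:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RetryableError("503")
    return {"company_name": "Acme"}


# Record the delays instead of actually sleeping.
outcome = with_backoff(flaky, sleep=delays.append)
```

Injecting `sleep` keeps the helper testable; in production the default `time.sleep` applies.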

Caching map results

/v1/map results for a given domain are stable for weeks. Cache the URL list in Redis or your database with a 14-day TTL. Only re-map when the enrichment timestamp crosses your freshness threshold. This cuts map credit usage significantly for monthly re-enrichment cycles.
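The caching rule can be sketched with an in-memory stand-in for Redis. `MapCache` and `cached_map` are hypothetical names, and the injected `clock` exists only so expiry can be demonstrated without waiting 14 days; with Redis you would set the TTL on the key instead.

```python
import time
from typing import Callable, Optional

FOURTEEN_DAYS = 14 * 24 * 3600


class MapCache:
    """In-memory stand-in for a Redis key with a TTL: domain -> (expires_at, urls)."""

    def __init__(self, ttl_seconds: float = FOURTEEN_DAYS,
                 clock: Callable[[], float] = time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, list[str]]] = {}

    def get(self, domain: str) -> Optional[list[str]]:
        entry = self._entries.get(domain)
        if entry is None or entry[0] <= self._clock():
            self._entries.pop(domain, None)  # drop expired entries lazily
            return None
        return entry[1]

    def put(self, domain: str, urls: list[str]) -> None:
        self._entries[domain] = (self._clock() + self._ttl, urls)


def cached_map(domain: str, cache: MapCache,
               map_fn: Callable[[str], list[str]]) -> list[str]:
    """Return cached URLs when fresh; otherwise call map_fn and store the result."""
    urls = cache.get(domain)
    if urls is None:
        urls = map_fn(domain)
        cache.put(domain, urls)
    return urls


# Example with a fake clock and a fake map function that counts its calls.
now = [0.0]
map_calls: list[str] = []


def fake_map(domain: str) -> list[str]:
    map_calls.append(domain)
    return [f"https://{domain}/about"]


cache = MapCache(clock=lambda: now[0])
cached_map("acme.example", cache, fake_map)  # miss: calls fake_map
cached_map("acme.example", cache, fake_map)  # hit: served from cache
now[0] = 15 * 24 * 3600                      # 15 days later, past the TTL
cached_map("acme.example", cache, fake_map)  # miss again
```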

Schema versioning

As your CRM schema evolves, version your extraction schemas. Store schema_version alongside enriched records so you know which fields were extracted under which schema and can backfill when you add new fields.
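As a rough sketch of the idea, each record can carry the version it was extracted under, and a set difference shows which fields need a backfill. The version registry and the field sets below are illustrative assumptions, not a fastCRW feature.

```python
# Hypothetical registry: schema version -> fields that version extracts.
SCHEMA_VERSIONS: dict[int, set[str]] = {
    1: {"company_name", "industry", "hq_location"},
    2: {"company_name", "industry", "hq_location", "tech_stack"},
}
CURRENT_SCHEMA_VERSION = max(SCHEMA_VERSIONS)


def stamp(record: dict, version: int = CURRENT_SCHEMA_VERSION) -> dict:
    """Return a copy of the record tagged with the schema version that produced it."""
    return {**record, "schema_version": version}


def fields_to_backfill(record: dict) -> set[str]:
    """Fields in the current schema that the record's own schema version did not extract."""
    extracted_under = SCHEMA_VERSIONS.get(record.get("schema_version", 0), set())
    return SCHEMA_VERSIONS[CURRENT_SCHEMA_VERSION] - extracted_under


old = stamp({"company_name": "Acme"}, version=1)
new = stamp({"company_name": "Globex"})
```

Records with no `schema_version` at all are treated as needing every field, which is the conservative choice for legacy data.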

Self-hosting for data control

If your enrichment pipeline handles leads from regulated industries (healthcare, fintech, legal), self-host fastCRW inside your own infrastructure. The single ~8 MB binary image (CANONICAL-FACTS §7) runs on a $5–10/month VPS. Target company websites are public, but the enriched records — your CRM data — never need to transit a third-party API.

Credit Cost Estimates

All credit costs from CANONICAL-FACTS §3 (marketing/CANONICAL-FACTS.md, verified 2026-05-29):

Operation	Credits	Notes
`/v1/map` per domain	1	Discover all URLs on a company site
`/v1/scrape` (markdown only)	1	Raw page content, any renderer (http, lightpanda, Chrome)
`/v1/scrape` with `formats: ["json"]`	5	Structured extraction via LLM

Example: enrich 500 CRM records/month

500 map calls = 500 credits
500 × 2 page scrapes per domain (about + team) with extraction = 500 × 2 × 5 = 5,000 credits
Total: 5,500 credits/month, which fits the Standard plan ($69/mo, 100,000 credits). If you extract only 1 page per domain, the total drops to 500 + 500 × 5 = 3,000 credits, which exactly fills the Hobby plan ($13/mo launch price, 3,000 credits) with no headroom for retries.

For re-enriching a 5,000-account CRM once a month (for example, in nightly slices) with 2 pages each:

5,000 map calls + 10,000 extract scrapes = 5,000 + 50,000 = 55,000 credits/month
Fits the Standard plan ($69/mo — launch price, was $99, 100,000 credits)
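The arithmetic behind both examples can be captured in a small estimator. This is a sketch using the per-operation credit costs from the table above; `monthly_credits` is a hypothetical helper, and it assumes one map call per domain per cycle with every scraped page using JSON extraction.

```python
MAP_CREDITS = 1      # per /v1/map call
EXTRACT_CREDITS = 5  # per /v1/scrape call with formats: ["json"]


def monthly_credits(domains: int, pages_per_domain: int, cycles_per_month: int = 1) -> int:
    """Credits for mapping each domain once and extracting pages_per_domain pages, per cycle."""
    per_cycle = domains * MAP_CREDITS + domains * pages_per_domain * EXTRACT_CREDITS
    return per_cycle * cycles_per_month


small = monthly_credits(500, 2)    # 500 + 5,000
single = monthly_credits(500, 1)   # 500 + 2,500
large = monthly_credits(5_000, 2)  # 5,000 + 50,000
```

Raising `cycles_per_month` shows how quickly costs scale if you re-enrich the whole list more often than monthly; caching map results reduces the first term.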

Pricing derives from PLAN_DISPLAY in src/lib/plans-client.ts. Launch pricing was in effect through 2026-06-01; check /pricing for current rates.

Self-hosting is free — you pay only your server. A $10/month VPS handles hundreds of concurrent enrichment requests.

Good Fits for Lead Enrichment

B2B sales teams enriching inbound demo requests before the first SDR touchpoint
AI SDR workflows that personalize outreach in real time using firmographic context
Marketing teams building firmographic audience segments for ABM campaigns
Recruiting teams mapping org structures at target companies before outreach
Competitive intelligence teams tracking headcount changes, new hires, and role shifts at key accounts
RevOps teams maintaining CRM hygiene by detecting stale records and triggering re-enrichment
Platform teams building an internal enrichment microservice that other tools consume

When to Pick Something Else

fastCRW is the right tool when you need publicly visible data from company websites. There are cases where other approaches win:

Verified contact data (email, phone): Use a dedicated provider (Apollo, Hunter, Clearbit) that maintains opt-in databases. Public company pages rarely list individual emails, and scraping email addresses raises compliance concerns under GDPR and CAN-SPAM.
Social graph data (LinkedIn connections, follower counts): LinkedIn Terms of Service prohibit scraping. Use their official partner APIs or a compliant data provider.
Behind-login content: If the data lives behind an authenticated portal (a client dashboard, a private directory), fastCRW cannot reach it without credentials, and doing so may violate the site's ToS.
Firmographic at extreme volume with no ops budget: At millions of records per month, a dedicated enrichment API with bulk pricing may be more cost-effective than operating your own scraping infrastructure.

Related Resources

Firecrawl alternative — how fastCRW compares on accuracy and cost for enrichment workloads
Apify alternative — when a full actor platform vs. a simple scraping API is the right call
Competitor monitoring — track product and pricing changes at key accounts over time
MCP integration — use fastCRW tools directly from Claude or any MCP-compatible AI agent
LangChain integration — chain enrichment scrapes with LLM summarization in a LangChain pipeline
Pricing — current plan credits and rates

Sources

Firecrawl-compatible API reference — fastCRW endpoint table

https://github.com/us/crw

B2B data enrichment use cases — overview

https://www.salesforce.com/resources/articles/data-enrichment/

HubSpot CRM API documentation

https://developers.hubspot.com/docs/api/crm/contacts

FAQ

How is scraping for enrichment different from buying enrichment data?

Third-party enrichment providers (Apollo, ZoomInfo, Clearbit) charge per record and maintain their own databases, which lag the live web by weeks or months. Scraping public company pages gives you data direct from the source — always current, no per-record fees, and you control what fields you pull. The tradeoff: you own the pipeline. For verified email addresses and phone numbers, dedicated providers remain better. fastCRW is strongest for firmographic data (description, headcount, products, locations) that changes on the public website first.

Can fastCRW handle JavaScript-heavy company sites and SPAs?

Yes. fastCRW auto-selects its renderer — falling back from http → lightpanda → chrome — so dynamic company pages, React/Next.js team directories, and SPA-based sites return complete content. Every renderer (http, lightpanda, or Chrome) costs 1 credit per scrape. For most company about/team pages, the lightpanda renderer suffices.

What CRM fields can structured extraction reliably pull?

From a company about page: company name, industry, founded year, employee count, headquarters location, product or service description, and key customers. From a team page: names, titles, and LinkedIn handles. From a pricing page: plan names and price points. Pass a `jsonSchema` to `/v1/scrape` defining exactly these fields and fastCRW uses an LLM extraction pass to populate them (5 credits per extract — CANONICAL-FACTS §3).

How do I enrich at scale — thousands of companies per day?

Iterate `/v1/scrape` concurrently across your domain list. There is no `/v1/batch/scrape` endpoint (CANONICAL-FACTS §4), so parallelize via your task queue (Celery, Bull, or a simple thread pool). At 2 seconds per scrape with 10 concurrent workers, the theoretical ceiling is about 5 scrapes per second, or roughly 18,000 per hour. Your plan's rate limit and slow target sites will lower the real figure, but it remains well above the needs of most B2B sales teams. For crawling entire company sites, use `/v1/crawl` with `maxDepth` and `maxPages` to cap scope.

Can I run the enrichment pipeline inside my own network?

Yes — fastCRW is AGPL-3.0 and ships as a single ~8 MB Docker image (CANONICAL-FACTS §7). Self-host on any VPS or internal server, point your pipeline at your own instance, and no enriched data ever leaves your VPC. This is the primary reason privacy-sensitive sales teams (fintech, healthcare, legal) prefer fastCRW over managed enrichment APIs.

How do I know when a company record needs re-enrichment?

Store a `last_enriched_at` timestamp per CRM record. On each pipeline run, filter for records where that timestamp is older than your freshness threshold (typically 30–90 days for company data). Re-scrape those domains and patch only the fields that changed. A company homepage rarely changes more than once a quarter; team pages turn over faster in high-growth companies.
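The freshness check can be sketched as a single predicate over ISO 8601 timestamps. `needs_re_enrichment` is a hypothetical helper, and treating timestamps without a timezone as UTC is an assumption you should align with how your CRM stores dates.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def needs_re_enrichment(
    last_enriched_at: Optional[str],
    now: datetime,
    max_age_days: int = 30,
) -> bool:
    """True when a record was never enriched or its timestamp is older than max_age_days."""
    if not last_enriched_at:
        return True
    enriched = datetime.fromisoformat(last_enriched_at)
    if enriched.tzinfo is None:
        # Assumption: timestamps stored without a timezone are UTC.
        enriched = enriched.replace(tzinfo=timezone.utc)
    return now - enriched > timedelta(days=max_age_days)


now = datetime(2026, 6, 24, tzinfo=timezone.utc)
records = [
    {"domain": "fresh.example", "last_enriched_at": "2026-06-10T00:00:00+00:00"},
    {"domain": "stale.example", "last_enriched_at": "2026-04-01T00:00:00"},
    {"domain": "new.example", "last_enriched_at": None},
]
stale = [r["domain"] for r in records if needs_re_enrichment(r["last_enriched_at"], now)]
```

Passing `now` in explicitly keeps the check deterministic in tests; in a pipeline run you would pass `datetime.now(timezone.utc)`.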


